This section describes how to use xpressive to accomplish text manipulation and parsing tasks. If you are looking for detailed information regarding specific components in xpressive, check the Reference section.

What is xpressive?

xpressive is a regular expression template library. Regular expressions (regexes) can be written as strings that are parsed dynamically at runtime (dynamic regexes), or as expression templates [2] that are parsed at compile-time (static regexes). Dynamic regexes have the advantage that they can be accepted from the user as input at runtime or read from an initialization file. Static regexes have several advantages. Since they are C++ expressions instead of strings, they can be syntax-checked at compile-time. Also, they can naturally refer to code and data elsewhere in your program, giving you the ability to call back into your code from within a regex match. Finally, since they are statically bound, the compiler can generate faster code for static regexes.

xpressive's dual nature is unique and powerful. Static xpressive is a bit like the Spirit Parser Framework. Like Spirit, you can build grammars with static regexes using expression templates. (Unlike Spirit, xpressive does exhaustive backtracking, trying every possibility to find a match for your pattern.) Dynamic xpressive is a bit like Boost.Regex. In fact, xpressive's interface should be familiar to anyone who has used Boost.Regex. xpressive's innovation comes from allowing you to mix and match static and dynamic regexes in the same program, and even in the same expression! You can embed a dynamic regex in a static regex, or vice versa, and the embedded regex will participate fully in the search, back-tracking as needed to make the match succeed.

Hello, world!

Enough theory. Let's have a look at Hello World, xpressive style:

    #include <iostream>
    #include <string>
    #include <boost/xpressive/xpressive.hpp>

    using namespace boost::xpressive;

    int main()
    {
        std::string hello( "hello world!" );

        sregex rex = sregex::compile( "(\\w+) (\\w+)!" );
        smatch what;

        if( regex_match( hello, what, rex ) )
        {
            std::cout << what[0] << '\n'; // whole match
            std::cout << what[1] << '\n'; // first capture
            std::cout << what[2] << '\n'; // second capture
        }

        return 0;
    }

This program outputs the following:

    hello world!
    hello
    world
The first thing you'll notice about the code is that all the types in xpressive live in the boost::xpressive namespace.

Next, you'll notice the type of the regular expression object is sregex. Notice how the regex object is initialized:

    sregex rex = sregex::compile( "(\\w+) (\\w+)!" );

To create a regular expression object from a string, you must call a factory method such as sregex::compile(). The same pattern can also be written as a static regex:

    sregex rex = (s1= +_w) >> ' ' >> (s2= +_w) >> '!';

This describes the same regular expression, except it uses the domain-specific embedded language defined by static xpressive.
As you can see, static regexes have a syntax that is noticeably different from standard Perl syntax. That is because we are constrained by C++'s syntax. The biggest difference is the use of >> to mean "followed by". In Perl, you can simply put sub-expressions next to each other:

    abc

But in C++, there must be an operator separating sub-expressions:

    a >> b >> c

In Perl, parentheses both group and capture. In static xpressive, you create a capture by assigning a sub-expression to s1, s2, and so on.

You'll also notice that the one-or-more repetition operator + has moved from postfix to prefix position, so that:

    "\\w+"

is the same as:

    +_w

We'll cover all the other differences later.

Getting xpressive

There are three ways to get xpressive. The first and simplest is to download the latest version of Boost. Just go to http://sf.net/projects/boost and follow the "Download" link.

The second way is by downloading xpressive.zip at the Boost File Vault in the "Strings - Text Processing" directory. In addition to the source code and the Boost license, this archive contains a copy of this documentation in PDF format. This version will always be stable and at least as current as the version in the latest Boost release. It may be more recent. The version in the File Vault is always guaranteed to work with the latest official Boost release.

The third way is by directly accessing the Boost Subversion repository. Just go to http://svn.boost.org/trac/boost/ and follow the instructions there for anonymous Subversion access. The version in Boost Subversion is unstable.

Building with xpressive
Xpressive is a header-only template library, which means you don't need to alter your build scripts or link to any separate lib file to use it. All you need to do is #include <boost/xpressive/xpressive.hpp>.

If you would also like to use semantic actions or custom assertions with your static regexes, you will need to additionally include <boost/xpressive/regex_actions.hpp>.

Requirements

Xpressive requires Boost version 1.34.1 or higher.

Supported Compilers

Currently, Boost.Xpressive is known to work on the following compilers:
Check the latest test results at Boost's Regression Results Page.
You don't need to know much to start being productive with xpressive. Let's begin with the nickel tour of the types and algorithms xpressive provides.

Table 25.1. xpressive's Tool-Box
Now that you know a bit about the tools xpressive provides, you can pick the right tool for you by answering the following two questions:
Know Your Iterator Type

Most of the classes in xpressive are templates that are parameterized on the iterator type. xpressive defines some common typedefs to make the job of choosing the right types easier. You can use the table below to find the right types based on the type of your iterator.

Table 25.2. xpressive Typedefs vs. Iterator Types
You should notice the systematic naming convention. Many of these types are used together, so the naming convention helps you to use them consistently. For instance, if you have an sregex, you should pair it with an smatch.

If you are not using one of those four iterator types, then you can use the templates directly and specify your iterator type.

Know Your Task

Do you want to find a pattern once? Many times? Search and replace? xpressive has tools for all that and more. Below is a quick reference:

Table 25.3. Tasks and Tools
These algorithms and classes are described in excruciating detail in the Reference section.
When using xpressive, the first thing you'll do is create a basic_regex<> object, such as an sregex.

Overview

The feature that really sets xpressive apart from other C/C++ regular expression libraries is the ability to author a regular expression using C++ expressions. xpressive achieves this through operator overloading, using a technique called expression templates to embed a mini-language dedicated to pattern matching within C++. These "static regexes" have many advantages over their string-based brethren. In particular, static regexes:
Since we compose static regexes using C++ expressions, we are constrained by the rules for legal C++ expressions. Unfortunately, that means that "classic" regular expression syntax cannot always be mapped cleanly into C++. Rather, we map the regex constructs, picking new syntax that is legal C++.

Construction and Assignment

You create a static regex by assigning one to an object of type basic_regex<>. For instance, the following defines a regex for use with std::string input:

    sregex re = '$' >> +_d >> '.' >> _d >> _d;

Assignment works similarly.

Character and String Literals
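In static regexes, character and string literals match themselves. For instance, in the regex above, '$' and '.' match the characters '$' and '.' respectively, even though both are meta-characters in Perl.

When using literals in static regexes, you must take care that at least one operand is not a literal. For instance, the following are not valid regexes:

    sregex re1 = 'a' >> 'b'; // ERROR!
    sregex re2 = +'a';       // ERROR!

The two operands to the binary >> operator are both literals, and the operand of the unary + operator is also a literal, so these statements call the built-in C++ operators rather than xpressive's overloads. Wrapping one operand in as_xpr() makes the overloads apply:

    sregex re1 = as_xpr('a') >> 'b'; // OK
    sregex re2 = +as_xpr('a');       // OK

Sequencing and Alternation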
As you've probably already noticed, sub-expressions in static regexes must be separated by the sequencing operator, >>:

    // Match an 'a' followed by a digit
    sregex re = 'a' >> _d;

Alternation works just as it does in Perl with the | operator:

    // match a digit character or a word character one or more times
    sregex re = +( _d | _w );

Grouping and Captures
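In Perl, parentheses both group and create back-references. For instance, the following pattern uses a back-reference to match an opening tag and its corresponding closing tag:

    "<(\\w+)>.*?</\\1>"

In static xpressive, this would be:

    '<' >> (s1= +_w) >> '>' >> -*_ >> "</" >> s1 >> '>'

Notice how you capture a back-reference by assigning to s1, and then use s1 later in the pattern to find the matching end tag.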
Case-Insensitivity and Internationalization
Perl lets you make part of your regular expression case-insensitive by using the (?i:) pattern modifier. xpressive has an equivalent modifier called icase:

    sregex re = "this" >> icase( "that" );

In this regular expression, "this" is matched exactly, but "that" is matched irrespective of case.

Case-insensitive regular expressions raise the issue of internationalization: how should case-insensitive character comparisons be evaluated? Also, many character classes are locale-specific. Which characters are matched by digit, and which by alpha? To answer these questions, you can attach a std::locale to a static regex with the imbue() pattern modifier:

    std::locale my_locale = /* initialize a std::locale object */;
    sregex re = imbue( my_locale )( +alpha >> +digit );

This regular expression will evaluate alpha and digit according to my_locale.

Static xpressive Syntax Cheat Sheet

The table below lists the familiar regex constructs and their equivalents in static xpressive.

Table 25.4. Perl syntax vs. Static xpressive syntax
Overview

Static regexes are dandy, but sometimes you need something a bit more ... dynamic. Imagine you are developing a text editor with a regex search/replace feature. You need to accept a regular expression from the end user as input at run-time. There should be a way to parse a string into a regular expression. That's what xpressive's dynamic regexes are for. They are built from the same core components as their static counterparts, but they are late-bound so you can specify them at run-time.

Construction and Assignment

There are two ways to create a dynamic regex: with the basic_regex<>::compile() function or with the regex_compiler<> class template.

Here is an example of using basic_regex<>::compile():

    sregex re = sregex::compile( "this|that", regex_constants::icase );

Here is the same example using regex_compiler<>:

    sregex_compiler compiler;
    sregex re = compiler.compile( "this|that", regex_constants::icase );
Dynamic xpressive Syntax

Since the dynamic syntax is not constrained by the rules for valid C++ expressions, we are free to use familiar syntax for dynamic regexes. For this reason, the syntax used by xpressive for dynamic regexes follows the lead set by John Maddock's proposal to add regular expressions to the Standard Library. It is essentially the syntax standardized by ECMAScript, with minor changes in support of internationalization.

Since the syntax is documented exhaustively elsewhere, I will simply refer you to the existing standards, rather than duplicate the specification here.

Internationalization

As with static regexes, dynamic regexes support internationalization by allowing you to specify a different std::locale. To do this, you must use a regex_compiler<>, which has an imbue() function:

    std::locale my_locale = /* initialize your locale object here */;
    sregex_compiler compiler;
    compiler.imbue( my_locale );
    sregex re = compiler.compile( "\\w+|\\d+" );

This regex will use my_locale when evaluating "\\w" and "\\d".

Overview
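Once you have created a regex object, you can use the regex_match() and regex_search() algorithms to find patterns in strings.

Seeing if a String Matches a Regex

The regex_match() algorithm checks whether a regex matches a given input in its entirety.

The input can be a bidirectional range such as a std::string, a C-style null-terminated string, or a pair of iterators. In every case, the iterator type of the input must agree with the iterator type of the regex object:

    cregex cre = +_w;  // this regex can match C-style strings
    sregex sre = +_w;  // this regex can match std::strings

    if( regex_match( "hello", cre ) )              // OK
        { /*...*/ }

    if( regex_match( std::string("hello"), sre ) ) // OK
        { /*...*/ }

    if( regex_match( "hello", sre ) )              // ERROR! iterator mis-match!
        { /*...*/ }

The regex_match() algorithm optionally accepts a match_results<> object as an out parameter, which it fills in with the details of the match:

    cmatch what;
    cregex cre = +(s1= _w);

    // store the results of the regex_match in "what"
    if( regex_match( "hello", what, cre ) )
    {
        std::cout << what[1] << '\n'; // prints "o"
    }

The regex_match() algorithm also optionally accepts match flags, such as regex_constants::match_not_bol, that control how the match is evaluated:

    std::string str("hello");
    sregex sre = bol >> +_w;

    // match_not_bol means that "bol" should not match at [begin,begin)
    if( regex_match( str.begin(), str.end(), sre, regex_constants::match_not_bol ) )
    {
        // should never get here!!!
    }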
Searching for Matching Sub-Strings
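Use regex_search() when you want to know whether an input sequence contains a sub-sequence that a regex matches.

In all other regards, regex_search() behaves like regex_match().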
Overview
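Sometimes, it is not enough to know simply whether a regex_match() or regex_search() succeeded.

match_results

So, you've passed a match_results<> object to a regex algorithm, and the algorithm has succeeded.

The table below shows how to access the information stored in a match_results<> object.

Table 25.5. match_results<> Accessors

There is more you can do with the match_results<> object; see Nested Results.

sub_match

When you index into a match_results<> object, you get back a sub_match<> object, which is basically a pair of iterators:

    template< class BidirectionalIterator >
    struct sub_match
        : std::pair< BidirectionalIterator, BidirectionalIterator >
    {
        bool matched;
        // ...
    };

Since it inherits publicly from std::pair<>, sub_match<> has first and second data members of type BidirectionalIterator.

The following table shows how you might access the information stored in a sub_match<> object.

Table 25.6. sub_match<> Accessors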
Results Invalidation
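Results are stored as iterators into the input sequence. Anything which invalidates the input sequence will invalidate the match results. For instance, if you match a std::string and then modify or destroy it, the iterators held by the match results dangle and the results must not be used.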
Regular expressions are not only good for searching text; they're good at manipulating it. And one of the most common text manipulation tasks is search-and-replace. xpressive provides the regex_replace() algorithm for this purpose.

regex_replace()

Performing search-and-replace using regex_replace() requires an input sequence, a regex object, and a format string:

    std::string input("This is his face");
    sregex re = as_xpr("his");                // find all occurrences of "his" ...
    std::string format("her");                // ... and replace them with "her"

    // use the version of regex_replace() that operates on strings
    std::string output = regex_replace( input, re, format );
    std::cout << output << '\n';

    // use the version of regex_replace() that operates on iterators
    std::ostream_iterator< char > out_iter( std::cout );
    regex_replace( out_iter, input.begin(), input.end(), re, format );

The above program prints out the following:

    Ther is her face
    Ther is her face
Notice that all the occurrences of "his" have been replaced with "her".

Replace Options
The regex_replace() algorithm takes an optional bitmask parameter to control the formatting.

Table 25.7. Format Flags

These flags live in the boost::xpressive::regex_constants namespace.

The ECMA-262 Format Sequences

When you haven't specified a substitution string dialect with one of the format flags above, you get the dialect defined by ECMA-262, the standard for ECMAScript. The table below shows the escape sequences recognized in ECMA-262 mode.

Table 25.8. Format Escape Sequences
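Any other sequence beginning with '$' simply represents itself.

The Sed Format Sequences

When specifying the format_sed flag, the following escape sequences are recognized:

Table 25.9. Sed Format Escape Sequences

The Perl Format Sequences

When specifying the format_perl flag, the following escape sequences are recognized:

Table 25.10. Perl Format Escape Sequences

The Boost-Specific Format Sequences

When specifying the format_all flag, the Perl escape sequences are recognized, along with conditional expressions of the following form:

    ?Ntrue-expression:false-expression

where N is a decimal digit representing a sub-match. If the corresponding sub-match participated in the full match, then the substitution is true-expression. Otherwise, it is false-expression. In this mode, you can use parens () for grouping. If you want a literal paren, you must escape it.

Formatter Objects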
Format strings are not always expressive enough for all your text substitution
needs. Consider the simple example of wanting to map input strings to output
strings, as you may want to do with environment variables. Rather than a
format string, for this you would use a formatter object.
Consider the following code, which finds embedded environment variables of the form $(XYZ) and computes the substitution string by looking up the environment variable in a map:

    #include <map>
    #include <string>
    #include <iostream>
    #include <boost/xpressive/xpressive.hpp>
    using namespace boost;
    using namespace xpressive;

    std::map<std::string, std::string> env;

    std::string const &format_fun(smatch const &what)
    {
        return env[what[1].str()];
    }

    int main()
    {
        env["X"] = "this";
        env["Y"] = "that";

        std::string input("\"$(X)\" has the value \"$(Y)\"");

        // replace strings like "$(XYZ)" with the result of env["XYZ"]
        sregex envar = "$(" >> (s1 = +_w) >> ')';
        std::string output = regex_replace(input, envar, format_fun);
        std::cout << output << std::endl;
    }

In this case, we use a function, format_fun(), to compute the substitution string on the fly. The above code displays:

    "this" has the value "that"

The formatter need not be an ordinary function. It may be an object of class type. And rather than return a string, it may accept an output iterator into which it writes the substitution. Consider the following, which is functionally equivalent to the above.

    #include <algorithm>
    #include <map>
    #include <string>
    #include <iostream>
    #include <boost/xpressive/xpressive.hpp>
    using namespace boost;
    using namespace xpressive;

    struct formatter
    {
        typedef std::map<std::string, std::string> env_map;
        env_map env;

        template<typename Out>
        Out operator()(smatch const &what, Out out) const
        {
            env_map::const_iterator where = env.find(what[1]);
            if(where != env.end())
            {
                std::string const &sub = where->second;
                out = std::copy(sub.begin(), sub.end(), out);
            }
            return out;
        }
    };

    int main()
    {
        formatter fmt;
        fmt.env["X"] = "this";
        fmt.env["Y"] = "that";

        std::string input("\"$(X)\" has the value \"$(Y)\"");

        sregex envar = "$(" >> (s1 = +_w) >> ')';
        std::string output = regex_replace(input, envar, fmt);
        std::cout << output << std::endl;
    }
The formatter must be a callable object -- a function or a function object -- that has one of three possible signatures, detailed in the table below.

Table 25.11. Formatter Signatures
Formatter Expressions
In addition to format strings and formatter objects, regex_replace() also accepts formatter expressions, which use the same syntax as semantic actions. The above example can be rewritten with a formatter expression as follows:

    #include <map>
    #include <string>
    #include <iostream>
    #include <boost/xpressive/xpressive.hpp>
    #include <boost/xpressive/regex_actions.hpp>
    using namespace boost::xpressive;

    int main()
    {
        std::map<std::string, std::string> env;
        env["X"] = "this";
        env["Y"] = "that";

        std::string input("\"$(X)\" has the value \"$(Y)\"");

        sregex envar = "$(" >> (s1 = +_w) >> ')';
        std::string output = regex_replace(input, envar, ref(env)[s1]);
        std::cout << output << std::endl;
    }

In the above, the formatter expression is ref(env)[s1].
Overview

You initialize a regex_token_iterator<> with an input sequence, a regex, and optionally the sub-matches to iterate over. The examples below show what it can do.

Example 1: Simple Tokenization

This example uses regex_token_iterator<> to chop a sequence into a series of tokens consisting of words.

    std::string input("This is his face");
    sregex re = +_w;                      // find a word

    // iterate over all the words in the input
    sregex_token_iterator begin( input.begin(), input.end(), re ), end;

    // write all the words to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

    This
    is
    his
    face

Example 2: Simple Tokenization, Reloaded

This example also uses regex_token_iterator<> to chop a sequence into words, but it uses the regex as a delimiter. Passing -1 as the last parameter selects the parts of the input that did not match the regex.

    std::string input("This is his face");
    sregex re = +_s;                      // find white space

    // iterate over all non-white space in the input. Note the -1 below:
    sregex_token_iterator begin( input.begin(), input.end(), re, -1 ), end;

    // write all the words to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

    This
    is
    his
    face

Example 3: Simple Tokenization, Revolutions

This example also uses regex_token_iterator<>. It chops a sequence containing dates into a series of tokens consisting of just the years. Passing a positive integer N as the last parameter selects the N-th sub-match of each match.

    std::string input("01/02/2003 blahblah 04/23/1999 blahblah 11/13/1981");
    sregex re = sregex::compile("(\\d{2})/(\\d{2})/(\\d{4})"); // find a date

    // iterate over all the years in the input. Note the 3 below, corresponding to the 3rd sub-expression:
    sregex_token_iterator begin( input.begin(), input.end(), re, 3 ), end;

    // write all the years to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

    2003
    1999
    1981

Example 4: Not-So-Simple Tokenization

This example is like the previous one, except that instead of tokenizing just the years, this program turns the days, months and years into tokens. When we pass an array of integers instead of a single integer, the iterator visits the listed sub-matches of each successive match, in the order given.

    std::string input("01/02/2003 blahblah 04/23/1999 blahblah 11/13/1981");
    sregex re = sregex::compile("(\\d{2})/(\\d{2})/(\\d{4})"); // find a date

    // iterate over the days, months and years in the input
    int const sub_matches[] = { 2, 1, 3 }; // day, month, year
    sregex_token_iterator begin( input.begin(), input.end(), re, sub_matches ), end;

    // write all the date fields to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

    02
    01
    2003
    23
    04
    1999
    13
    11
    1981
Overview

One of the key benefits of representing regexes as C++ expressions is the ability to easily refer to other C++ code and data from within the regex. This enables programming idioms that are not possible with other regular expression libraries. Of particular note is the ability for one regex to refer to another regex, allowing you to build grammars out of regular expressions. This section describes how to embed one regex in another by value and by reference, how regex objects behave when they refer to other regexes, and how to access the tree of results after a successful parse.

Embedding a Regex by Value

The basic_regex<> object has value semantics. When a regex object appears on the right-hand side in the definition of another regex, a copy of it is embedded in the enclosing regex.

Consider a text editor that has a regex-find feature with a whole-word option. You can implement this with xpressive as follows:

    find_dialog dlg;
    if( dialog_ok == dlg.do_modal() )
    {
        std::string pattern = dlg.get_text();          // the pattern the user entered
        bool whole_word = dlg.whole_word.is_checked(); // did the user select the whole-word option?

        sregex re = sregex::compile( pattern );        // try to compile the pattern

        if( whole_word )
        {
            // wrap the regex in begin-word / end-word assertions
            re = bow >> re >> eow;
        }

        // ... use re ...
    }

Look closely at this line:

    // wrap the regex in begin-word / end-word assertions
    re = bow >> re >> eow;

This line creates a new regex that embeds the old regex by value. Then, the new regex is assigned back to the original regex. Since a copy of the old regex was made on the right-hand side, this works as you might expect: the new regex has the behavior of the old regex wrapped in begin- and end-word assertions.
Embedding a Regex by Reference

If you want to be able to build recursive regular expressions and context-free grammars, embedding a regex by value is not enough. You need to be able to make your regular expressions self-referential. Most regular expression engines don't give you that power, but xpressive does.

Consider the following code, which uses the by_ref() helper to define a recursive regular expression that matches balanced, nested parentheses:

    sregex parentheses;
    parentheses                          // A balanced set of parentheses ...
        = '('                            // is an opening parenthesis ...
            >>                           // followed by ...
             *(                          // zero or more ...
                keep( +~(set='(',')') )  // of a bunch of things that are not parentheses ...
              |                          // or ...
                by_ref(parentheses)      // a balanced set of parentheses
              )                          //   (ooh, recursion!) ...
            >>                           // followed by ...
          ')'                            // a closing parenthesis
        ;

Matching balanced, nested tags is an important text processing task, and it is one that "classic" regular expressions cannot do. The by_ref() helper makes it possible by letting one regex object be embedded in another by reference.

Building a Grammar

Once we allow self-reference in our regular expressions, the genie is out of the bottle and all manner of fun things are possible. In particular, we can now build grammars out of regular expressions. Let's have a look at the text-book grammar example: the humble calculator.

    sregex group, factor, term, expression;

    group       = '(' >> by_ref(expression) >> ')';
    factor      = +_d | group;
    term        = factor >> *(('*' >> factor) | ('/' >> factor));
    expression  = term >> *(('+' >> term) | ('-' >> term));
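The regex expression defined above can match arithmetic expressions such as "9*(10+3)".

Let's take a closer look at this regular expression grammar. Notice that it is cyclic: expression is implemented in terms of term, which is implemented in terms of factor, which is implemented in terms of group, which is implemented in terms of expression, closing the loop. Only group needs by_ref(), because expression has not yet been initialized at the point where group is defined.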
Dynamic Regex Grammars

Using regex_compiler<>, you can also build grammars out of dynamic regular expressions. You do that by creating named regexes and referring to other regexes by name.

You can create a named dynamic regex by prefacing your regex with (?$name=), where name is the name of the regex. You can refer to a named regex from another regex with (?$name).

Below is a code fragment that uses dynamic regex grammars to implement the calculator example from above.

    using namespace boost::xpressive;
    using namespace regex_constants;

    sregex expr;

    {
        sregex_compiler compiler;
        syntax_option_type x = ignore_white_space;

        compiler.compile("(? $group = ) \\( (? $expr ) \\) ", x);
        compiler.compile("(? $factor = ) \\d+ | (? $group ) ", x);
        compiler.compile("(? $term = ) (? $factor )"
                         " ( \\* (? $factor ) | / (? $factor ) )* ", x);
        expr = compiler.compile("(? $expr = ) (? $term )"
                         " ( \\+ (? $term ) | - (? $term ) )* ", x);
    }

    std::string str("foo 9*(10+3) bar");
    smatch what;

    if(regex_search(str, what, expr))
    {
        // This prints "9*(10+3)":
        std::cout << what[0] << std::endl;
    }

As with static regex grammars, nested regex invocations create nested match results (see Nested Results below). The result is a complete parse tree for the string that matched. Unlike static regexes, dynamic regexes are always embedded by reference, not by value.

Cyclic Patterns, Copying and Memory Management, Oh My!

The calculator examples above raise a number of very complicated memory-management issues. Each of the four regex objects refers to the others, some directly and some indirectly, some by value and some by reference. What if we were to return one of them from a function and let the others go out of scope? What becomes of the references? The answer is that the regex objects are internally reference counted, such that they keep their referenced regex objects alive as long as they need them. So passing a regex object by value is never a problem, even if it refers to other regex objects that have gone out of scope.
Those of you who have dealt with reference counting are probably familiar with its Achilles Heel: cyclic references. If regex objects are reference counted, what happens to cycles like the one created in the calculator examples? Are they leaked? The answer is no, they are not leaked. The basic_regex<> object tracks its references so that even cyclic regex grammars are cleaned up when the last external reference goes away.

Nested Regexes and Sub-Match Scoping

Nested regular expressions raise the issue of sub-match scoping. If both the inner and outer regex write to and read from the same sub-match vector, chaos would ensue. The inner regex would stomp on the sub-matches written by the outer regex. For example, what does this do?

    sregex inner = sregex::compile( "(.)\\1" );
    sregex outer = (s1= _) >> inner >> s1;

The author probably didn't intend for the inner regex to overwrite the sub-match written by the outer regex. The problem is particularly acute when the inner regex is accepted from the user as input. The author has no way of knowing whether the inner regex will stomp the sub-match vector or not. This is clearly not acceptable.
Instead, what actually happens is that each invocation of a nested regex gets its own scope. Sub-matches belong to that scope. That is, each nested regex invocation gets its own copy of the sub-match vector to play with, so there is no way for an inner regex to stomp on the sub-matches of an outer regex. So, for example, the regex outer defined above would match "ABBA", as it should.

Nested Results

If nested regexes have their own sub-matches, there should be a way to access them after a successful match. In fact, there is. After a regex_match() or regex_search(), the match_results<> object behaves like the head of a tree of nested results, which you can traverse with its nested_results() member function.

Take as an example the regex for balanced, nested parentheses we saw earlier:

    sregex parentheses;
    parentheses = '(' >> *( keep( +~(set='(',')') ) | by_ref(parentheses) ) >> ')';

    smatch what;
    std::string str( "blah blah( a(b)c (c(e)f (g)h )i (j)6 )blah" );

    if( regex_search( str, what, parentheses ) )
    {
        // display the whole match
        std::cout << what[0] << '\n';

        // display the nested results
        std::for_each(
            what.nested_results().begin(),
            what.nested_results().end(),
            output_nested_results() );
    }

This program displays the following:

    ( a(b)c (c(e)f (g)h )i (j)6 )
        (b)
        (c(e)f (g)h )
            (e)
            (g)
        (j)

Here you can see how the results are nested and that they are stored in the order in which they are found.
Filtering Nested Results

Sometimes a regex will have several nested regex objects, and you want to know which result corresponds to which regex object. That's where basic_regex<>::regex_id() and match_results<>::regex_id() come in handy: you can compare the id stored in a result with the id of the regex object you are interested in.

To make this a bit easier, xpressive provides a predicate to make it simple to iterate over just the results that correspond to a certain nested regex. It is called regex_id_filter_predicate, and it is intended to be used with Boost.Iterator's filter iterators:

    sregex name = +alpha;
    sregex integer = +_d;
    sregex re = *( *_s >> ( name | integer ) );

    smatch what;
    std::string str( "marsha 123 jan 456 cindy 789" );

    if( regex_match( str, what, re ) )
    {
        smatch::nested_results_type::const_iterator begin = what.nested_results().begin();
        smatch::nested_results_type::const_iterator end   = what.nested_results().end();

        // declare filter predicates to select just the names or the integers
        sregex_id_filter_predicate name_id( name.regex_id() );
        sregex_id_filter_predicate integer_id( integer.regex_id() );

        // iterate over only the results from the name regex
        std::for_each(
            boost::make_filter_iterator( name_id, begin, end ),
            boost::make_filter_iterator( name_id, end, end ),
            output_result );

        std::cout << '\n';

        // iterate over only the results from the integer regex
        std::for_each(
            boost::make_filter_iterator( integer_id, begin, end ),
            boost::make_filter_iterator( integer_id, end, end ),
            output_result );
    }

where output_result() is a simple function that takes a smatch and displays the full match. The program displays the following:

    marsha
    jan
    cindy

    123
    456
    789

Overview
Imagine you want to parse an input string and build a std::map<> from it. Semantic actions let you do that while the regex is matching.

Semantic Actions

Consider the following code, which uses xpressive's semantic actions to parse a string of word/integer pairs and stuffs them into a std::map<>.

    #include <map>
    #include <string>
    #include <iostream>
    #include <boost/xpressive/xpressive.hpp>
    #include <boost/xpressive/regex_actions.hpp>
    using namespace boost::xpressive;

    int main()
    {
        std::map<std::string, int> result;
        std::string str("aaa=>1 bbb=>23 ccc=>456");

        // Match a word and an integer, separated by =>,
        // and then stuff the result into a std::map<>
        sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) )
            [ ref(result)[s1] = as<int>(s2) ];

        // Match one or more word/integer pairs, separated
        // by whitespace.
        sregex rx = pair >> *(+_s >> pair);

        if(regex_match(str, rx))
        {
            std::cout << result["aaa"] << '\n';
            std::cout << result["bbb"] << '\n';
            std::cout << result["ccc"] << '\n';
        }

        return 0;
    }

This program prints the following:

    1
    23
    456
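The regular expression pair has two parts: the pattern and the action. The action is the part between the square brackets.

How does this work? Just as the rest of the static regular expression, the part between brackets is an expression template. It encodes the action and executes it later. The expression ref(result) creates a lazy reference to the result object, and ref(result)[s1] is a lazy map index operation. When the action executes, s1 and s2 are replaced with the first and second sub-matches, and as<int>(s2) converts the second sub-match to an int.

In addition to the sub-match placeholders s1, s2, etc., you can also use the placeholder _ within an action to refer back to the string matched by the sub-expression to which the action is attached:

    int i = 0;
    // Here, _ refers back to all the
    // characters matched by (+_d)
    sregex rex = (+_d)[ ref(i) = as<int>(_) ];

Lazy Action Execution

What does it mean, exactly, to attach an action to part of a regular expression and perform a match? When does the action execute? If the action is part of a repeated sub-expression, does the action execute once or many times? And if the sub-expression initially matches, but ultimately fails because the rest of the regular expression fails to match, is the action executed at all?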
The answer is that by default, actions are executed lazily. When a sub-expression matches a string, its action is placed on a queue, along with the current values of any sub-matches to which the action refers. If the match algorithm must backtrack, actions are popped off the queue as necessary. Only after the entire regex has matched successfully are the actions actually executed. They are executed all at once, in the order in which they were added to the queue, as the last step before the matching algorithm returns.

For example, consider the following regex that increments a counter whenever it finds a digit.

    int i = 0;
    std::string str("1!2!3?");
    // count the exciting digits, but not the
    // questionable ones.
    sregex rex = +( _d [ ++ref(i) ] >> '!' );
    regex_search(str, rex);
    assert( i == 2 );
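The action ++ref(i) is queued three times, once for each digit found, but it is executed only twice, once for each digit that precedes a '!' character. When the '?' character is encountered, the match algorithm backtracks and removes the final action from the queue.

Immediate Action Execution

When you want semantic actions to execute immediately, you can wrap the sub-expression containing the action in a keep():

    int i = 0;
    std::string str("1!2!3?");
    // count all the digits.
    sregex rex = +( keep( _d [ ++ref(i) ] ) >> '!' );
    regex_search(str, rex);
    assert( i == 3 );

We have wrapped the sub-expression _d [ ++ref(i) ] in keep(). Now, whenever this regex matches a digit, the action is executed before we try to match a '!' character, so it executes three times.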
Lazy Functions

So far, we've seen how to write semantic actions consisting of variables and operators. But what if you want to be able to call a function from a semantic action? Xpressive provides a mechanism to do this.

The first step is to define a function object type. Here, for instance, is a function object type that calls push() on its argument:

    struct push_impl
    {
        // Result type, needed for tr1::result_of
        typedef void result_type;

        template<typename Sequence, typename Value>
        void operator()(Sequence &seq, Value const &val) const
        {
            seq.push(val);
        }
    };

The next step is to use xpressive's function<> template to define a function object named push:

    // Global "push" function object.
    function<push_impl>::type const push = {{}};

The initialization looks a bit odd, but this is because push is being statically initialized. We can use push in semantic actions as follows:

    std::stack<int> ints;
    // Match digits, cast them to an int
    // and push it on the stack.
    sregex rex = (+_d)[push(ref(ints), as<int>(_))];

You'll notice that doing it this way causes member function invocations to look like ordinary function invocations. You can choose to write your semantic action in a different way that makes it look a bit more like a member function call:

    sregex rex = (+_d)[ref(ints)->*push(as<int>(_))];

Xpressive recognizes the use of the ->* and treats this expression exactly the same as the one above.
When your function object must return a type that depends on its arguments, you can use a result<> member template instead of the result_type typedef:

    // Function object that returns the
    // first element of a pair.
    struct first_impl
    {
        template<typename Sig> struct result {};

        template<typename This, typename Pair>
        struct result<This(Pair)>
        {
            typedef typename remove_reference<Pair>
                ::type::first_type type;
        };

        template<typename Pair>
        typename Pair::first_type
        operator()(Pair const &p) const
        {
            return p.first;
        }
    };

    // OK, use as first(s1) to get the begin iterator
    // of the sub-match referred to by s1.
    function<first_impl>::type const first = {{}};

Referring to Local Variables
As we've seen in the examples above, we can refer to local variables within an action using ref(). Such variables are held by reference by the regular expression, so take care not to let those references dangle. In the following code, the reference to i dangles once bad_voodoo() returns:

    sregex bad_voodoo()
    {
        int i = 0;
        sregex rex = +( _d [ ++ref(i) ] >> '!' );
        // ERROR! rex refers by reference to a local
        // variable, which will dangle after bad_voodoo()
        // returns.
        return rex;
    }

When writing semantic actions, it is your responsibility to make sure that none of the references dangle. One way to do that is to make the variables shared pointers that are held by the regex by value.

    sregex good_voodoo(boost::shared_ptr<int> pi)
    {
        // Use val() to hold the shared_ptr by value:
        sregex rex = +( _d [ ++*val(pi) ] >> '!' );
        // OK, rex holds a reference count to the integer.
        return rex;
    }

In the above code, we use val() to hold the shared pointer by value, so the regex keeps the integer alive for as long as it needs it.
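It can be tedious to wrap all your variables in ref() and val() in your semantic actions. Xpressive provides the reference<> and value<> templates to make things easier.

Table 25.12. reference<> and value<>

As you can see, when using reference<>, you need to first declare a local variable and then declare a reference<> to it. These two steps can be combined into one using local<>.

Table 25.13. local<> vs. reference<>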
We can use local<> to rewrite the above example as follows:

    local<int> i(0);
    std::string str("1!2!3?");
    // count the exciting digits, but not the
    // questionable ones.
    sregex rex = +( _d [ ++i ] >> '!' );
    regex_search(str, rex);
    assert( i.get() == 2 );

Notice that we use local<>::get() to access the value of the local variable.

Referring to Non-Local Variables
In the beginning of this section, we used a regex with a semantic action to parse a string of word/integer pairs and stuff them into a std::map<>. To avoid embedding a reference to one particular map in the regex, we can define a placeholder, use it in the semantic action, and bind it to an actual map when we call a regex algorithm:

    // Define a placeholder for a map object:
    placeholder<std::map<std::string, int> > _map;

    // Match a word and an integer, separated by =>,
    // and then stuff the result into a std::map<>
    sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) )
        [ _map[s1] = as<int>(s2) ];

    // Match one or more word/integer pairs, separated
    // by whitespace.
    sregex rx = pair >> *(+_s >> pair);

    // The string to parse
    std::string str("aaa=>1 bbb=>23 ccc=>456");

    // Here is the actual map to fill in:
    std::map<std::string, int> result;

    // Bind the _map placeholder to the actual map
    smatch what;
    what.let( _map = result );

    // Execute the match and fill in result map
    if(regex_match(str, what, rx))
    {
        std::cout << result["aaa"] << '\n';
        std::cout << result["bbb"] << '\n';
        std::cout << result["ccc"] << '\n';
    }

This program displays:

    1
    23
    456

We use placeholder<> here to define _map, which stands in for a std::map<> variable. The call what.let( _map = result ) binds the actual map to the placeholder before regex_match() runs.
The syntax for late-bound action arguments is a little different if you are using regex_iterator<> or regex_token_iterator<>. The iterator constructors accept the argument bindings as an extra parameter, which you build with let():

    // Define a placeholder for a map object:
    placeholder<std::map<std::string, int> > _map;

    // Match a word and an integer, separated by =>,
    // and then stuff the result into a std::map<>
    sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) )
        [ _map[s1] = as<int>(s2) ];

    // The string to parse
    std::string str("aaa=>1 bbb=>23 ccc=>456");

    // Here is the actual map to fill in:
    std::map<std::string, int> result;

    // Create a regex_iterator to find all the matches
    sregex_iterator it(str.begin(), str.end(), pair, let(_map=result));
    sregex_iterator end;

    // step through all the matches, and fill in
    // the result map
    while(it != end)
        ++it;

    std::cout << result["aaa"] << '\n';
    std::cout << result["bbb"] << '\n';
    std::cout << result["ccc"] << '\n';

This program displays:

    1
    23
    456

User-Defined Assertions
You are probably already familiar with regular expression assertions. In Perl, some examples are the ^ and $ assertions, which match the beginning and end of a string, respectively. Xpressive lets you define your own assertions and apply them with check().

There are a couple of ways to define a custom assertion. The simplest is to use a function object. Let's say that you want to ensure that a sub-expression matches a sub-string that is either 3 or 6 characters long. The following struct defines such a predicate:

    // A predicate that is true IFF a sub-match is
    // either 3 or 6 characters long.
    struct three_or_six
    {
        bool operator()(ssub_match const &sub) const
        {
            return sub.length() == 3 || sub.length() == 6;
        }
    };

You can use this predicate within a regular expression as follows:

    // match words of 3 characters or 6 characters.
    sregex rx = (bow >> +_w >> eow)[ check(three_or_six()) ] ;
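The above regular expression will find whole words that are either 3 or 6 characters long. The three_or_six predicate receives a sub_match<> that refers to the part of the string matched by the sub-expression to which the custom assertion is attached.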
Custom assertions can also be defined inline using the same syntax as for semantic actions. Below is the same custom assertion written inline:

    // match words of 3 characters or 6 characters.
    sregex rx = (bow >> +_w >> eow)[ check(length(_)==3 || length(_)==6) ] ;

In the above, length() is a lazy function that calls the length() member function of its argument, and _ is a placeholder that receives the sub_match.

Once you get the hang of writing custom assertions inline, they can be very powerful. For example, you can write a regular expression that only matches valid dates (for some suitably liberal definition of the term “valid”).

    // Days in each month; February is given 29 days, and November has 30.
    int const days_per_month[] =
        {31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};

    mark_tag month(1), day(2);
    // find a valid date of the form month/day/year.
    sregex date =
        (
            // Month must be between 1 and 12 inclusive
            (month= _d >> !_d)
                [ check(as<int>(_) >= 1 && as<int>(_) <= 12) ]
        >>  '/'
            // Day must be between 1 and 31 inclusive
        >>  (day= _d >> !_d)
                [ check(as<int>(_) >= 1 && as<int>(_) <= 31) ]
        >>  '/'
            // Only consider years between 1970 and 2038
        >>  (_d >> _d >> _d >> _d)
                [ check(as<int>(_) >= 1970 && as<int>(_) <= 2038) ]
        )
        // Ensure the month actually has that many days!
        [ check( ref(days_per_month)[as<int>(month)-1] >= as<int>(day) ) ]
    ;

    smatch what;
    std::string str("99/99/9999 2/30/2006 2/28/2006");

    if(regex_search(str, what, date))
    {
        std::cout << what[0] << std::endl;
    }

The above program prints out the following:

    2/28/2006
Notice how the inline custom assertions are used to range-check the values for the month, day and year. The regular expression doesn't match "99/99/9999" or "2/30/2006" because they are not valid dates.

Overview
Symbol tables can be built into xpressive regular expressions with just a std::map<>.

Symbol Tables

An xpressive symbol table is just a std::map<>, where the key is a string type and the value can be anything. A symbol table is stored in a regular expression by assigning the map to an attribute, as in (a1 = map1):

    int result;
    std::map<std::string, int> map1;
    // ... (fill the map)
    sregex rx = ( a1 = map1 ) [ ref(result) = a1 ];

Consider the following example code, which translates number names into integers. It is described below.

    #include <map>
    #include <string>
    #include <iostream>
    #include <boost/xpressive/xpressive.hpp>
    #include <boost/xpressive/regex_actions.hpp>
    using namespace boost::xpressive;

    int main()
    {
        std::map<std::string, int> number_map;
        number_map["one"] = 1;
        number_map["two"] = 2;
        number_map["three"] = 3;
        // Match a string from number_map
        // and store the integer value in 'result'
        // if not found, store -1 in 'result'
        int result = 0;
        cregex rx = ((a1 = number_map ) | *_)
            [ ref(result) = (a1 | -1)];

        regex_match("three", rx);
        std::cout << result << '\n';
        regex_match("two", rx);
        std::cout << result << '\n';
        regex_match("stuff", rx);
        std::cout << result << '\n';
        return 0;
    }

This program prints the following:

    3
    2
    -1
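First the program builds a number map, with number names as string keys and the corresponding integers as values. Then it constructs a static regular expression using an attribute a1 to represent the result of the symbol table lookup. In the semantic action, the attribute is assigned to the integer variable result, with -1 as the default value if the symbol was not found. The wildcard *_ makes sure the regex matches even when the symbol is not found.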
A more complete version of this example can be found in
Symbol table matches are case sensitive by default, but they can be made case-insensitive by enclosing the expression in icase().

Attributes

Up to nine attributes can be used in a regular expression. They are named a1, a2, ..., a9.
Attributes are properly scoped, so you can do crazy things like:
Overview
Matching a regular expression against a string often requires locale-dependent
information. For example, how are case-insensitive comparisons performed?
The locale-sensitive behavior is captured in a traits class. xpressive provides
three traits class templates: cpp_regex_traits<>, c_regex_traits<> and null_regex_traits<>.

Setting the Default Regex Trait

By default, xpressive uses cpp_regex_traits<> for all patterns.

Using Custom Traits with Dynamic Regexes

To create a dynamic regex that uses a custom traits object, you must use regex_compiler<>:

    // Declare a regex_compiler that uses the global C locale
    regex_compiler<char const *, c_regex_traits<char> > crxcomp;
    cregex crx = crxcomp.compile( "\\w+" );

    // Declare a regex_compiler that uses a custom std::locale
    std::locale loc = /* ... create a locale here ... */;
    regex_compiler<char const *, cpp_regex_traits<char> > cpprxcomp(loc);
    cregex cpprx = cpprxcomp.compile( "\\w+" );

The regex_compiler objects act as regex factories: every regex they create uses the traits they were constructed with.

Using Custom Traits with Static Regexes
If you want a particular static regex to use a different set of traits, you can use the special imbue() pattern modifier:

    // Define a regex that uses the global C locale
    c_regex_traits<char> ctraits;
    sregex crx = imbue(ctraits)( +_w );

    // Define a regex that uses a customized std::locale
    std::locale loc = /* ... create a locale here ... */;
    cpp_regex_traits<char> cpptraits(loc);
    sregex cpprx1 = imbue(cpptraits)( +_w );

    // A shorthand for above
    sregex cpprx2 = imbue(loc)( +_w );

The imbue() pattern modifier must wrap the entire pattern. It is an error to imbue only part of a static regex:

    // ERROR! Cannot imbue() only part of a regex
    sregex error = _w >> imbue(loc)( _w );
Searching Non-Character Data With null_regex_traits
| Expression | Return type | Assertion / Note / Pre- / Post-condition |
|---|---|---|
| | | Default constructor (must be trivial). |
| | | Copy constructor (must be trivial). |
| | | Assignment operator (must be trivial). |
In the following table X denotes a traits class defining types and functions for the character container type CharT; u is an object of type X; v is an object of type const X; p is a value of type const CharT*; I1 and I2 are Input Iterators; c is a value of type const CharT; s is an object of type X::string_type; cs is an object of type const X::string_type; b is a value of type bool; i is a value of type int; F1 and F2 are values of type const CharT*; loc is an object of type X::locale_type; and ch is an object of const char.

Table 25.15. Traits Requirements
| Expression | Return type | Assertion / Note |
|---|---|---|
| | | The character container type used in the implementation of class template |
| | Implementation defined | A copy constructible type that represents the locale used by the traits class. |
| | Implementation defined | A bitmask type representing a particular character classification. Multiple values of this type can be bitwise-or'ed together to obtain a new valid value. |
| | | Yields a value between |
| | | Widens the specified |
| | | For any characters |
| | | For characters |
| | | Returns a character such that for any character |
| | | For all characters |
| | | Returns a sort key for the character sequence designated by the iterator range |
| | | Returns a sort key for the character sequence designated by the iterator range |
| | | Converts the character sequence designated by the iterator range |
| | | Returns a sequence of characters that represents the collating element consisting of the character sequence designated by the iterator range |
| | | Returns |
| | | Returns the value represented by the digit |
| | | Imbues |
| | | Returns the current locale used by |
This section is adapted from the equivalent page in the Boost.Regex documentation and from the proposal to add regular expressions to the Standard Library.
Below you can find six complete sample programs.
This is the example from the Introduction. It is reproduced here for your convenience.
    #include <iostream>
    #include <boost/xpressive/xpressive.hpp>

    using namespace boost::xpressive;

    int main()
    {
        std::string hello( "hello world!" );

        sregex rex = sregex::compile( "(\\w+) (\\w+)!" );
        smatch what;

        if( regex_match( hello, what, rex ) )
        {
            std::cout << what[0] << '\n'; // whole match
            std::cout << what[1] << '\n'; // first capture
            std::cout << what[2] << '\n'; // second capture
        }

        return 0;
    }

This program outputs the following:

    hello world!
    hello
    world
Notice in this example how we use custom mark_tags to make the pattern more readable. We can use the mark_tags later to index into the match_results<>.

    #include <iostream>
    #include <boost/xpressive/xpressive.hpp>

    using namespace boost::xpressive;

    int main()
    {
        char const *str = "I was born on 5/30/1973 at 7am.";

        // define some custom mark_tags with names more meaningful than s1, s2, etc.
        mark_tag day(1), month(2), year(3), delim(4);

        // this regex finds a date
        cregex date = (month= repeat<1,2>(_d))           // find the month ...
                   >> (delim= (set= '/','-'))            // followed by a delimiter ...
                   >> (day=   repeat<1,2>(_d)) >> delim  // and a day followed by the same delimiter ...
                   >> (year=  repeat<1,2>(_d >> _d));    // and the year.

        cmatch what;

        if( regex_search( str, what, date ) )
        {
            std::cout << what[0]     << '\n'; // whole match
            std::cout << what[day]   << '\n'; // the day
            std::cout << what[month] << '\n'; // the month
            std::cout << what[year]  << '\n'; // the year
            std::cout << what[delim] << '\n'; // the delimiter
        }

        return 0;
    }

This program outputs the following:

    5/30/1973
    30
    5
    1973
    /
The following program finds dates in a string and marks them up with pseudo-HTML.
    #include <iostream>
    #include <boost/xpressive/xpressive.hpp>

    using namespace boost::xpressive;

    int main()
    {
        std::string str( "I was born on 5/30/1973 at 7am." );

        // essentially the same regex as in the previous example, but using a dynamic regex
        sregex date = sregex::compile( "(\\d{1,2})([/-])(\\d{1,2})\\2((?:\\d{2}){1,2})" );

        // As in Perl, $& is a reference to the sub-string that matched the regex
        std::string format( "<date>$&</date>" );

        str = regex_replace( str, date, format );
        std::cout << str << '\n';

        return 0;
    }
This program outputs the following:
I was born on <date>5/30/1973</date> at 7am.
The following program finds the words in a wide-character string. It uses wsregex_iterator. Notice that dereferencing a wsregex_iterator yields a wsmatch object.

    #include <iostream>
    #include <boost/xpressive/xpressive.hpp>

    using namespace boost::xpressive;

    int main()
    {
        std::wstring str( L"This is his face." );

        // find a whole word
        wsregex token = +alnum;

        wsregex_iterator cur( str.begin(), str.end(), token );
        wsregex_iterator end;

        for( ; cur != end; ++cur )
        {
            wsmatch const &what = *cur;
            std::wcout << what[0] << L'\n';
        }

        return 0;
    }

This program outputs the following:

    This
    is
    his
    face
The following program finds race times in a string and displays first the minutes and then the seconds. It uses regex_token_iterator<>.

    #include <iostream>
    #include <boost/xpressive/xpressive.hpp>

    using namespace boost::xpressive;

    int main()
    {
        std::string str( "Eric: 4:40, Karl: 3:35, Francesca: 2:32" );

        // find a race time
        sregex time = sregex::compile( "(\\d):(\\d\\d)" );

        // for each match, the token iterator should first take the value of
        // the first marked sub-expression followed by the value of the second
        // marked sub-expression
        int const subs[] = { 1, 2 };

        sregex_token_iterator cur( str.begin(), str.end(), time, subs );
        sregex_token_iterator end;

        for( ; cur != end; ++cur )
        {
            std::cout << *cur << '\n';
        }

        return 0;
    }

This program outputs the following:

    4
    40
    3
    35
    2
    32
The following program takes some text that has been marked up with HTML and strips out the mark-up. It uses a regex that matches an HTML tag and a regex_token_iterator<> that returns the parts of the string that do not match the regex.

    #include <iostream>
    #include <boost/xpressive/xpressive.hpp>

    using namespace boost::xpressive;

    int main()
    {
        std::string str( "Now <bold>is the time <i>for all good men</i> to come to the aid of their</bold> country." );

        // find a HTML tag
        sregex html = '<' >> optional('/') >> +_w >> '>';

        // the -1 below directs the token iterator to display the parts of
        // the string that did NOT match the regular expression.
        sregex_token_iterator cur( str.begin(), str.end(), html, -1 );
        sregex_token_iterator end;

        for( ; cur != end; ++cur )
        {
            std::cout << '{' << *cur << '}';
        }
        std::cout << '\n';

        return 0;
    }
This program outputs the following:
{Now }{is the time }{for all good men}{ to come to the aid of their}{ country.}
Here is a helper class to demonstrate how you might display a tree of nested results:
    // Displays nested results to std::cout with indenting
    struct output_nested_results
    {
        int tabs_;

        output_nested_results( int tabs = 0 )
            : tabs_( tabs )
        {
        }

        template< typename BidiIterT >
        void operator ()( match_results< BidiIterT > const &what ) const
        {
            // first, do some indenting
            typedef typename std::iterator_traits< BidiIterT >::value_type char_type;
            char_type space_ch = char_type(' ');
            std::fill_n( std::ostream_iterator<char_type>( std::cout ), tabs_ * 4, space_ch );

            // output the match
            std::cout << what[0] << '\n';

            // output any nested matches
            std::for_each(
                what.nested_results().begin(),
                what.nested_results().end(),
                output_nested_results( tabs_ + 1 ) );
        }
    };