- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
- "//www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
- <html>
- <head>
- <title>LPeg.re - Regex syntax for LPEG</title>
- <link rel="stylesheet"
- href="//www.inf.puc-rio.br/~roberto/lpeg/doc.css"
- type="text/css"/>
- <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
- </head>
- <body>
- <div id="container">
-
- <div id="product">
- <div id="product_logo">
- <a href="//www.inf.puc-rio.br/~roberto/lpeg/">
- <img alt="LPeg logo" src="lpeg-128.gif"/>
- </a>
- </div>
- <div id="product_name"><big><strong>LPeg.re</strong></big></div>
- <div id="product_description">
- Regex syntax for LPEG
- </div>
- </div> <!-- id="product" -->
- <div id="main">
-
- <div id="navigation">
- <h1>re</h1>
- <ul>
- <li><a href="#basic">Basic Constructions</a></li>
- <li><a href="#func">Functions</a></li>
- <li><a href="#ex">Some Examples</a></li>
- <li><a href="#license">License</a></li>
- </ul>
- </div> <!-- id="navigation" -->
- <div id="content">
- <h2><a name="basic"></a>The <code>re</code> Module</h2>
- <p>
- The <code>re</code> module
- (provided by file <code>re.lua</code> in the distribution)
- supports a somewhat conventional regex syntax
- for pattern usage within <a href="lpeg.html">LPeg</a>.
- </p>
- <p>
- The next table summarizes <code>re</code>'s syntax.
- A <code>p</code> represents an arbitrary pattern;
- <code>num</code> represents a number (<code>[0-9]+</code>);
- <code>name</code> represents an identifier
- (<code>[a-zA-Z_][a-zA-Z0-9_]*</code>).
- Constructions are listed in order of decreasing precedence.
- </p>
- <table border="1">
- <tbody><tr><td><b>Syntax</b></td><td><b>Description</b></td></tr>
- <tr><td><code>( p )</code></td> <td>grouping</td></tr>
- <tr><td><code>'string'</code></td> <td>literal string</td></tr>
- <tr><td><code>"string"</code></td> <td>literal string</td></tr>
- <tr><td><code>[class]</code></td> <td>character class</td></tr>
- <tr><td><code>.</code></td> <td>any character</td></tr>
- <tr><td><code>%name</code></td>
- <td>pattern <code>defs[name]</code> or a pre-defined pattern</td></tr>
- <tr><td><code>name</code></td><td>non terminal</td></tr>
- <tr><td><code>&lt;name&gt;</code></td><td>non terminal</td></tr>
- <tr><td><code>{}</code></td> <td>position capture</td></tr>
- <tr><td><code>{ p }</code></td> <td>simple capture</td></tr>
- <tr><td><code>{: p :}</code></td> <td>anonymous group capture</td></tr>
- <tr><td><code>{:name: p :}</code></td> <td>named group capture</td></tr>
- <tr><td><code>{~ p ~}</code></td> <td>substitution capture</td></tr>
- <tr><td><code>{| p |}</code></td> <td>table capture</td></tr>
- <tr><td><code>=name</code></td> <td>back reference</td></tr>
- <tr><td><code>p ?</code></td> <td>optional match</td></tr>
- <tr><td><code>p *</code></td> <td>zero or more repetitions</td></tr>
- <tr><td><code>p +</code></td> <td>one or more repetitions</td></tr>
- <tr><td><code>p^num</code></td>
- <td>exactly <code>num</code> repetitions</td></tr>
- <tr><td><code>p^+num</code></td>
- <td>at least <code>num</code> repetitions</td></tr>
- <tr><td><code>p^-num</code></td>
- <td>at most <code>num</code> repetitions</td></tr>
- <tr><td><code>p -> 'string'</code></td> <td>string capture</td></tr>
- <tr><td><code>p -> "string"</code></td> <td>string capture</td></tr>
- <tr><td><code>p -> num</code></td> <td>numbered capture</td></tr>
- <tr><td><code>p -> name</code></td> <td>function/query/string capture
- equivalent to <code>p / defs[name]</code></td></tr>
- <tr><td><code>p => name</code></td> <td>match-time capture
- equivalent to <code>lpeg.Cmt(p, defs[name])</code></td></tr>
- <tr><td><code>p ~> name</code></td> <td>fold capture
- (deprecated)</td></tr>
- <tr><td><code>p >> name</code></td> <td>accumulator capture
- equivalent to <code>(p % defs[name])</code></td></tr>
- <tr><td><code>& p</code></td> <td>and predicate</td></tr>
- <tr><td><code>! p</code></td> <td>not predicate</td></tr>
- <tr><td><code>p1 p2</code></td> <td>concatenation</td></tr>
- <tr><td><code>p1 / p2</code></td> <td>ordered choice</td></tr>
- <tr><td>(<code>name <- p</code>)<sup>+</sup></td> <td>grammar</td></tr>
- </tbody></table>
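- <p>
- As a rough illustration of the repetition suffixes,
- and of how a suffix binds more tightly than a predicate or a concatenation,
- consider the following sketch.
- The subject strings are made up for the example;
- when a pattern has no captures,
- <code>re.match</code> returns the position just after the match.
- </p>
- <pre class="example">
- local re = require"re"
- -- repetition suffixes applied to the literal 'a'
- print(re.match("aaa", "'a'^2"))        --> 3   (exactly two)
- print(re.match("aaa", "'a'^+2"))       --> 4   (at least two)
- print(re.match("aaa", "'a'^-2"))       --> 3   (at most two)
- -- a suffix applies to the nearest primary, not to the whole sequence
- print(re.match("abbb", "'a' 'b'*"))    --> 5
- print(re.match("abbb", "('a' 'b')*"))  --> 3
- -- predicates test the input without consuming it
- print(re.match("ab", "&'a' {.}"))      --> a
- print(re.match("ab", "!'b' {.}"))      --> a
- </pre>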
- <p>
- Any space appearing in a syntax description can be
- replaced by zero or more space characters and Lua-style short comments
- (<code>--</code> until end of line).
- </p>
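- <p>
- For instance, a pattern written inside a long string can be spread
- over several lines and annotated with comments.
- This is only a sketch;
- the pattern and the subject are invented for illustration.
- </p>
- <pre class="example">
- local re = require"re"
- local p = re.compile[[
-   {%d+}   -- integer part
-   '.'     -- decimal point
-   {%d+}   -- fractional part
- ]]
- print(p:match("3.14"))   --> 3   14
- </pre>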
- <p>
- Character classes define sets of characters.
- An initial <code>^</code> complements the resulting set.
- A range <em>x</em><code>-</code><em>y</em> includes in the set
- all characters with codes between the codes of <em>x</em> and <em>y</em>.
- A pre-defined class <code>%</code><em>name</em> includes all
- characters of that class.
- A simple character includes itself in the set.
- The only special characters inside a class are <code>^</code>
- (special only if it is the first character);
- <code>]</code>
- (can be included in the set as the first character,
- after the optional <code>^</code>);
- <code>%</code> (special only if followed by a letter);
- and <code>-</code>
- (can be included in the set as the first or the last character).
- </p>
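- <p>
- These rules can be checked with a few one-character matches.
- The following is a small sketch with made-up subjects;
- each call should return 2,
- the position after the single matched character.
- </p>
- <pre class="example">
- local re = require"re"
- print(re.match("q", "[a-z]"))    --> 2   (range)
- print(re.match("Q", "[^a-z]"))   --> 2   (complemented set)
- print(re.match("]", "[]a]"))     --> 2   (']' as the first character)
- print(re.match("-", "[a-]"))     --> 2   ('-' as the last character)
- print(re.match("7", "[%d_]"))    --> 2   (pre-defined class inside a set)
- </pre>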
- <p>
- Currently the pre-defined classes are similar to those in
- Lua's string library
- (<code>%a</code> for letters,
- <code>%A</code> for non letters, etc.).
- There is also a class <code>%nl</code>
- containing only the newline character,
- which is particularly handy for grammars written inside long strings,
- as long strings do not interpret escape sequences like <code>\n</code>.
- </p>
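- <p>
- A short sketch of these classes in use follows;
- the subjects are invented for the example.
- The first pattern is written inside a long string,
- where <code>%nl</code> stands in for the newline.
- </p>
- <pre class="example">
- local re = require"re"
- -- split a two-line subject at the newline
- local p = re.compile[[ {[^%nl]*} %nl {[^%nl]*} ]]
- print(p:match("first\nsecond"))             --> first   second
- -- %a matches letters; %A matches everything else
- print(re.match("abc123", "{%a+} {%A+}"))    --> abc   123
- </pre>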
- <h2><a name="func">Functions</a></h2>
- <h3><code>re.compile (string [, defs])</code></h3>
- <p>
- Compiles the given string and
- returns an equivalent LPeg pattern.
- The given string may define either an expression or a grammar.
- The optional <code>defs</code> table provides extra Lua values
- to be used by the pattern.
- </p>
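- <p>
- As a minimal sketch of how a <code>defs</code> table might be used,
- the next example supplies one pattern and one function.
- The names <code>hex</code> and <code>tonum</code> are hypothetical,
- chosen only for this illustration:
- <code>%hex</code> refers to the pattern in <code>defs.hex</code>,
- and <code>-> tonum</code> applies <code>defs.tonum</code> to the capture.
- </p>
- <pre class="example">
- local lpeg = require"lpeg"
- local re = require"re"
- -- hypothetical definitions made available to the pattern
- local defs = {
-   hex = lpeg.R("09", "af", "AF"),                     -- used as %hex
-   tonum = function (s) return tonumber(s, 16) end,    -- used as -> tonum
- }
- local p = re.compile("'0x' {%hex+} -> tonum", defs)
- print(p:match("0xff"))   --> 255
- print(p:match("ff"))     --> nil
- </pre>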
- <h3><code>re.find (subject, pattern [, init])</code></h3>
- <p>
- Searches for the given pattern in the given subject.
- If it finds a match,
- returns the index where the match starts and
- the index where it ends.
- Otherwise, returns nil.
- </p>
- <p>
- An optional numeric argument <code>init</code> makes the search
- start at that position in the subject string.
- As usual in Lua libraries,
- a negative value counts from the end.
- </p>
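- <p>
- The following sketch shows the effect of <code>init</code>
- on a made-up subject string,
- including a negative value and a search that fails.
- </p>
- <pre class="example">
- local re = require"re"
- local s = "a1 b22 c333"
- print(re.find(s, "[0-9]+"))       --> 2   2
- print(re.find(s, "[0-9]+", 3))    --> 5   6
- print(re.find(s, "[0-9]+", -3))   --> 9   11
- print(re.find(s, "[A-Z]"))        --> nil
- </pre>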
- <h3><code>re.gsub (subject, pattern, replacement)</code></h3>
- <p>
- Does a <em>global substitution</em>,
- replacing all occurrences of <code>pattern</code>
- in the given <code>subject</code> by <code>replacement</code>.
- </p>
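- <p>
- The sketch below assumes that a string replacement follows the
- conventions of LPeg string captures,
- where <code>%0</code> stands for the whole match and
- <code>%1</code>, <code>%2</code>, etc. stand for the captures
- made by the pattern.
- The subjects are invented for the example.
- </p>
- <pre class="example">
- local re = require"re"
- -- %0 is the whole match (assumed LPeg string-capture convention)
- print(re.gsub("hello world", "%a+", "[%0]"))   --> [hello] [world]
- -- %1, %2, %3 are the captures made by the pattern
- print(re.gsub("2024-01-15", "{%d+} '-' {%d+} '-' {%d+}", "%3/%2/%1"))
- --> 15/01/2024
- </pre>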
- <h3><code>re.match (subject, pattern)</code></h3>
- <p>
- Matches the given pattern against the given subject,
- returning all captures.
- </p>
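- <p>
- A few calls on made-up subjects illustrate the possible results.
- As with <code>lpeg.match</code>,
- the match is anchored at the start of the subject,
- and a pattern without captures yields the position after the match.
- </p>
- <pre class="example">
- local re = require"re"
- print(re.match("hello world", "%a+"))             --> 6
- print(re.match("key=value", "{%a+} '=' {%a+}"))   --> key   value
- print(re.match("123", "%a+"))                     --> nil
- -- no search is done: leading spaces make the match fail
- print(re.match("  hello", "%a+"))                 --> nil
- </pre>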
- <h3><code>re.updatelocale ()</code></h3>
- <p>
- Updates the pre-defined character classes to the current locale.
- </p>
- <h2><a name="ex">Some Examples</a></h2>
- <h3>A complete simple program</h3>
- <p>
- The next code shows a simple complete Lua program using
- the <code>re</code> module:
- </p>
- <pre class="example">
- local re = require"re"
- -- find the position of the first numeral in a string
- print(re.find("the number 423 is odd", "[0-9]+")) --> 12 14
- -- returns all words in a string
- print(re.match("the number 423 is odd", "({%a+} / .)*"))
- --> the number is odd
- -- returns the first numeral in a string
- print(re.match("the number 423 is odd", "s <- {%d+} / . s"))
- --> 423
- -- substitutes a dot for each vowel in a string
- print(re.gsub("hello World", "[aeiou]", "."))
- --> h.ll. W.rld
- </pre>
- <h3>Balanced parentheses</h3>
- <p>
- The following call will produce the same pattern produced by the
- Lua expression in the
- <a href="lpeg.html#balanced">balanced parentheses</a> example:
- </p>
- <pre class="example">
- b = re.compile[[ balanced <- "(" ([^()] / balanced)* ")" ]]
- </pre>
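- <p>
- As a quick check of this pattern,
- the following sketch matches it against two made-up subjects,
- one balanced and one not.
- </p>
- <pre class="example">
- local re = require"re"
- local b = re.compile[[ balanced <- "(" ([^()] / balanced)* ")" ]]
- print(b:match("(a(b)c) rest"))   --> 8
- print(b:match("(a(b c"))         --> nil
- </pre>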
- <h3>String reversal</h3>
- <p>
- The next example reverses a string:
- </p>
- <pre class="example">
- rev = re.compile[[ R <- (!.) -> '' / ({.} R) -> '%2%1']]
- print(rev:match"0123456789") --> 9876543210
- </pre>
- <h3>CSV decoder</h3>
- <p>
- The next example replicates the <a href="lpeg.html#CSV">CSV decoder</a>:
- </p>
- <pre class="example">
- record = re.compile[[
- record <- {| field (',' field)* |} (%nl / !.)
- field <- escaped / nonescaped
- nonescaped <- { [^,"%nl]* }
- escaped <- '"' {~ ([^"] / '""' -> '"')* ~} '"'
- ]]
- </pre>
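- <p>
- A possible use of this decoder is sketched below.
- The input line is invented for the example;
- it contains a quoted field with an embedded comma and doubled quotes.
- </p>
- <pre class="example">
- local re = require"re"
- local record = re.compile[[
-   record <- {| field (',' field)* |} (%nl / !.)
-   field <- escaped / nonescaped
-   nonescaped <- { [^,"%nl]* }
-   escaped <- '"' {~ ([^"] / '""' -> '"')* ~} '"'
- ]]
- local t = record:match('one,"two, with ""quotes""",three')
- print(#t)      --> 3
- print(t[1])    --> one
- print(t[2])    --> two, with "quotes"
- print(t[3])    --> three
- </pre>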
- <h3>Lua's long strings</h3>
- <p>
- The next example matches Lua long strings:
- </p>
- <pre class="example">
- c = re.compile([[
- longstring <- ('[' {:eq: '='* :} '[' close)
- close <- ']' =eq ']' / . close
- ]])
- print(c:match'[==[]]===]]]]==]===[]') --> 17
- </pre>
- <h3>Abstract Syntax Trees</h3>
- <p>
- This example shows a simple way to build an
- abstract syntax tree (AST) for a given grammar.
- To keep our example simple,
- let us consider the following grammar
- for lists of names:
- </p>
- <pre class="example">
- p = re.compile[[
- listname <- (name s)*
- name <- [a-z][a-z]*
- s <- %s*
- ]]
- </pre>
- <p>
- Now, we will add captures to build a corresponding AST.
- As a first step, the pattern will build a table to
- represent each non terminal;
- terminals will be represented by their corresponding strings:
- </p>
- <pre class="example">
- c = re.compile[[
- listname <- {| (name s)* |}
- name <- {| {[a-z][a-z]*} |}
- s <- %s*
- ]]
- </pre>
- <p>
- Now, a match against <code>"hi hello bye"</code>
- results in the table
- <code>{{"hi"}, {"hello"}, {"bye"}}</code>.
- </p>
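- <p>
- The next sketch runs that match and walks the resulting table,
- printing the single string stored in each inner table.
- </p>
- <pre class="example">
- local re = require"re"
- local c = re.compile[[
-   listname <- {| (name s)* |}
-   name <- {| {[a-z][a-z]*} |}
-   s <- %s*
- ]]
- local t = c:match("hi hello bye")
- -- each entry of t is a table holding one name
- for i, node in ipairs(t) do
-   print(i, node[1])
- end
- --> 1   hi
- --> 2   hello
- --> 3   bye
- </pre>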
- <p>
- For such a simple grammar,
- this AST is more than enough;
- actually, the tables around each single name
- are already overkill.
- More complex grammars,
- however, may need some more structure.
- Specifically,
- it would be useful if each table had
- a <code>tag</code> field telling what non terminal
- that table represents.
- We can add such a tag using
- <a href="lpeg.html#cap-g">named group captures</a>:
- </p>
- <pre class="example">
- x = re.compile[[
- listname <- {| {:tag: '' -> 'list':} (name s)* |}
- name <- {| {:tag: '' -> 'id':} {[a-z][a-z]*} |}
- s <- ' '*
- ]]
- </pre>
- <p>
- With these group captures,
- a match against <code>"hi hello bye"</code>
- results in the following table:
- </p>
- <pre class="example">
- {tag="list",
- {tag="id", "hi"},
- {tag="id", "hello"},
- {tag="id", "bye"}
- }
- </pre>
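- <p>
- With the tags in place,
- code that consumes the tree can dispatch on <code>tag</code>.
- The following is a minimal sketch of such a traversal;
- it only prints each tag together with the name it holds.
- </p>
- <pre class="example">
- local re = require"re"
- local x = re.compile[[
-   listname <- {| {:tag: '' -> 'list':} (name s)* |}
-   name <- {| {:tag: '' -> 'id':} {[a-z][a-z]*} |}
-   s <- ' '*
- ]]
- local t = x:match("hi hello bye")
- print(t.tag, #t)   --> list   3
- -- named fields do not count as array entries, so ipairs sees only the names
- for _, node in ipairs(t) do
-   print(node.tag, node[1])
- end
- --> id   hi
- --> id   hello
- --> id   bye
- </pre>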
- <h3>Indented blocks</h3>
- <p>
- This example breaks indented blocks into tables,
- respecting the indentation:
- </p>
- <pre class="example">
- p = re.compile[[
- block <- {| {:ident:' '*:} line
- ((=ident !' ' line) / &(=ident ' ') block)* |}
- line <- {[^%nl]*} %nl
- ]]
- </pre>
- <p>
- As an example,
- consider the following text:
- </p>
- <pre class="example">
- t = p:match[[
- first line
-   subline 1
-   subline 2
- second line
- third line
-   subline 3.1
-     subline 3.1.1
-   subline 3.2
- ]]
- </pre>
- <p>
- The resulting table <code>t</code> will be like this:
- </p>
- <pre class="example">
- {'first line'; {'subline 1'; 'subline 2'; ident = '  '};
-  'second line';
-  'third line'; { 'subline 3.1'; {'subline 3.1.1'; ident = '    '};
-                  'subline 3.2'; ident = '  '};
-  ident = ''}
- </pre>
- <h3>Macro expander</h3>
- <p>
- This example implements a simple macro expander.
- Macros must be defined as part of the pattern,
- following some simple rules:
- </p>
- <pre class="example">
- p = re.compile[[
- text <- {~ item* ~}
- item <- macro / [^()] / '(' item* ')'
- arg <- ' '* {~ (!',' item)* ~}
- args <- '(' arg (',' arg)* ')'
- -- now we define some macros
- macro <- ('apply' args) -> '%1(%2)'
- / ('add' args) -> '%1 + %2'
- / ('mul' args) -> '%1 * %2'
- ]]
- print(p:match"add(mul(a,b), apply(f,x))") --> a * b + f(x)
- </pre>
- <p>
- A <code>text</code> is a sequence of items,
- wherein we apply a substitution capture to expand any macros.
- An <code>item</code> is either a macro,
- any character different from parentheses,
- or a parenthesized expression.
- A macro argument (<code>arg</code>) is a sequence
- of items different from a comma.
- (Note that a comma may appear inside an item,
- e.g., inside a parenthesized expression.)
- Again we do a substitution capture to expand any macro
- in the argument before expanding the outer macro.
- <code>args</code> is a list of arguments separated by commas.
- Finally we define the macros.
- Each macro is a string substitution;
- it replaces the macro name and its arguments by its corresponding string,
- with each <code>%</code><em>n</em> replaced by the <em>n</em>-th argument.
- </p>
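- <p>
- A few more inputs, invented for this sketch,
- show text outside macros being left untouched,
- nested macros being expanded first,
- and a comma inside a parenthesized expression
- staying within a single argument.
- </p>
- <pre class="example">
- local re = require"re"
- local p = re.compile[[
-   text <- {~ item* ~}
-   item <- macro / [^()] / '(' item* ')'
-   arg <- ' '* {~ (!',' item)* ~}
-   args <- '(' arg (',' arg)* ')'
-   macro <- ('apply' args) -> '%1(%2)'
-          / ('add' args) -> '%1 + %2'
-          / ('mul' args) -> '%1 * %2'
- ]]
- print(p:match("x = mul(2,3);"))       --> x = 2 * 3;
- print(p:match("add(mul(a,b), c)"))    --> a * b + c
- print(p:match("apply(f, (x,y))"))     --> f((x,y))
- </pre>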
- <h3>Patterns</h3>
- <p>
- This example shows the complete syntax
- of patterns accepted by <code>re</code>.
- </p>
- <pre class="example">
- p = [=[
- pattern <- exp !.
- exp <- S (grammar / alternative)
- alternative <- seq ('/' S seq)*
- seq <- prefix*
- prefix <- '&' S prefix / '!' S prefix / suffix
- suffix <- primary S (([+*?]
- / '^' [+-]? num
- / '->' S (string / '{}' / name)
- / '>>' S name
- / '=>' S name) S)*
- primary <- '(' exp ')' / string / class / defined
- / '{:' (name ':')? exp ':}'
- / '=' name
- / '{}'
- / '{~' exp '~}'
- / '{|' exp '|}'
- / '{' exp '}'
- / '.'
- / name S !arrow
- / '<' name '>' -- old-style non terminals
- grammar <- definition+
- definition <- name S arrow exp
- class <- '[' '^'? item (!']' item)* ']'
- item <- defined / range / .
- range <- . '-' [^]]
- S <- (%s / '--' [^%nl]*)* -- spaces and comments
- name <- [A-Za-z_][A-Za-z0-9_]*
- arrow <- '<-'
- num <- [0-9]+
- string <- '"' [^"]* '"' / "'" [^']* "'"
- defined <- '%' name
- ]=]
- print(re.match(p, p)) -- a self description must match itself
- </pre>
- <h2><a name="license">License</a></h2>
- <p>
- This module is part of the <a href="lpeg.html">LPeg</a> package and shares
- its <a href="lpeg.html#license">license</a>.
- </p>
- </div> <!-- id="content" -->
- </div> <!-- id="main" -->
- </div> <!-- id="container" -->
- </body>
- </html>