RXEngine, a fast text-directed Regular eXpression Engine with backreference and recursion

(This website is under construction)

Since I use C++ programming language at work, the search engine is also written in that language. As far as I know RXEngine is the only regular expression engine which is a text-directed engine but has backreference capability.

This may appear astonishing because of the common belief that backreferences can only be implemented in regex-directed engines. I think this is a misconception. If you can’t believe it, give it a try.

The following quote is from Russ Cox (https://swtch.com/~rsc/regexp/regexp1.html):

“Backreferences. As mentioned earlier, no one knows how to implement regular expressions with backreferences efficiently, though no one can prove that it’s impossible either. (Specifically, the problem is NP-complete, meaning that if someone did find an efficient implementation, that would be major news to computer scientists and would win a million dollar prize.)”

I’m not saying I could win the prize of Clay Mathematics Institute with my solution, but nonetheless my search engine is a remarkable novelty.

The search engine can be tested, although there is no interface for this on this website yet. Currently there is only a windows application. I created a dialog-based interface to experiment with RXEngine. You can download it from this link as a ZIP-file.

How it is working

First of all, let me note that my search engine is not a finite automaton, neither NFA nor DFA. Finite automata are essentially memoryless, mine engine is not. Because RXEngine is a text-directed engine, it does not use backtracking either, as regex-directed engines do. RXEngine does not know atomic groups or greedy or lazy quantifiers. In general, the engine travels through the input only once from left to right. The only exceptions are lookbehinds. Lookbehind assertions are reversed regular expressions and the inputs are scanned from right to left.

All type of lookaround assertions are full-featured regular expressions within the main regular expression or within another lookaround assertion. There are no limitations in lookarounds like the fixed-length criterion of lookbehind assertions of some engines. They can even contain recursive calls.

As a first step, the search engine compiles the regular expression. This means, that it creates a directed graph from the elements of the expression. However, the compiled regular expression graph is not static. Under certain circumstances, the subroutine calls, lookarounds and matched backreferences add new elements (vertices and edges) to the graph or modify existing ones during execution. The engine finds all the matching paths of this graph using a breadth-first search.

Using some of the matched key elements of the regular expression the engine creates another graph, I call it a result graph. The key elements are among others the parentheses of capture groups. The result graph is a rooted graph with at least one exit vertex. The root node is the beginning of the regular expression, an exit node is the end of the regular expression. There are several routes from the root element to an exit point. Each route can contain capturing groups. I call the set of capture groups belonging to a route a result set.

The result graph makes it possible to use backreferences. Because it contains all the starting and ending parentheses of the capture groups, it knows about every captured text of the input. A backreference is using only the last submatch but the engine contains all the history of submatches too.

Each result set differs from the others at least in one captured text. Typically, other search engines return only one set of results (without history), but since RXEngine registers all possible sets in parallel, all of them can be read from it (including history).

The number of the result sets may be large, but the result graph is a compact way to store all the captured text. The graph stores only indexes to the input text, not the text actually, and because it is a graph not an ever-expanding tree, the memory requirement for a capture group is in worst case “2*n”, where ‘n’ is the length of matched input.

In my opinion, the problem is not with the storage of the capture groups. The problem will be the reading out of the captured texts if earlier a lot of sections were captured. Maybe reading all the result sets will take more time than finding the match before.

If the engine finds a match, it will be the leftmost longest match by default, but there is a switch to return the leftmost shortest match. There can be matches of other lengths between the longest and shortest matches because each exit node in the result graph corresponds to one match However, I do not think the matches between the two extremes have any practical significance.

If the regular expression is containing one or more capture groups, the test-program displays the valid result sets in its „Match Information” window. Because the number of result sets may be vast, only the first 500 result sets are displayed if more than that exists.

The memory usage can be high if the search term contains capture groups. It is very difficult for me to make an estimate because it depends on what the result will be. During operation, some parts of the result-graph become disposable, but currently the memory management does not reuse these memory areas. Memory usage could be further reduced at the expense of processing speed. But till now for me the running time was more important than memory usage, so I have not optimized it.

The stack space during execution will not be a problem, the engine does not use recursion despite the fact, that it can process recursive regular expression patterns and nested lookarounds to any depth. The size of the heap memory is the limiting factor, not the stack.

Syntax

Use only the items listed below, compatibility with other search engines has not been a consideration so far.

(?#comment)             Comment

Escaped meta-characters which represents literal characters

\[      matches            [

\\      matches            \

\^     matches            ^

\$     matches            $

\.      matches            .

\|      matches            |

\?     matches            ?

\*     matches            *

\+     matches            +

\(      matches            (

\)      matches            )

\{     matches            {

\/     matches            / (optional, only for compatibility reason)

Characters

.       Any character

\d     Digit

\D     Not a digit

\s     Whitespace

\S     Not a whitespace

\w     Word character

\W    Not a word character

\xFF two hexadecimal digits

\xFFFF       four hexadecimal digits

Alternation

|       Alternation

Character class

[]      Character class

Anchors

^      Start of line

$      End of line

\b     Word boundary

\B     Not a word boundary

Repetitions

?      0 or 1 times

*      0 or more times

+      1 or more times

{n}   Repeate n-times

{n,}  Repeate at least n-times

{n,m}       Repeate at least n-times, maximum m times

Grouping

(regular expression)                     Capturing group

(?:regular expression)                   Non-capturing group

(?P<name>regular expression)      Named capturing group

Backreference

\1 through \9             Backreference

(?P=name)                Named backreference

Lookarounds

(?=regular expression)                  Positive lookahead

(?!regular expression)                   Negative lookahead

(?<=regular expression)                Positive lookbehind

(?<!regular expression)                 Negative lookbehind

Subroutines

(?1) through (?9)         Subroutine call

(?P>name)                Named subroutine call

Case sensitivity modifiers

(?i)                   Case-insensitive mode on

(?-i)                  Case-insensitive mode off

(?i:regular expression)          Case-insensitive mode on for parenthetical regexp

(?-i:regular expression)         Case-insensitive mode off for parenthetical regexp

Contact: send e-mail