QRegularExpression

The Qt 5.0 alpha was released just a few days ago (what you’re waiting for? Grab the tarball and test it!).

Amongst the thousands of changes, let me introduce you the class I’ve worked on in my spare time during the last months:

QRegularExpression

QRegularExpression is the official Perl-compatible regular expression class in Qt 5, and it’s going to replace QRegExp usages throughout Qt (and, hopefully, in your code).

QRegExp has always been known for not being exactly “optimal”: it has an awkward API, it’s slow, buggy, unmaintained, it supports a very limited subset of Perl regexp features, and so on.

You can find a partial list of QRegExp issues when doing research for its replacement on this page.

So, ultimately, libpcre was chosen as the Qt 5 regular expression engine. It’s the most widely used implementation of Perl-compatible regular expressions (second only to perl itself :-)). Since the 8.30 release it natively supports UTF-16 strings, which is the encoding used by QString in your program. No overhead is necessary for converting strings to be used with PCRE.

Given that PCRE isn’t installed by default on Windows, and (at the time of this writing) the vast majority of the Linux distributions (as well as OS X) ship with previous PCRE releases, Qt contains its own copy of PCRE 8.30. So don’t worry, you don’t have to install it as a dependency.

Enter QRegularExpression

QRegularExpression is better from almost any point of view than QRegExp. The only downside I’ve found when benchmarking is a higher memory usage, likely due to the more complex engine needed to support all the features you can use with QRegularExpression.

Let’s start by taking a look at its API.

Voces iustae sunt omnes divisae in partes tres

The actual API is divided in three classes:

Unlike QRegExp, a QRegularExpression object is not modified when a match is performed. (Actually it is, but that’s an implementation detail). Instead, a QRegularExpressionMatch object is returned, which contains the results of the match:

QRegularExpression re("a.*pattern");
QRegularExpressionMatch match = re.match("this is a string containing a - are you sure? - pattern");
if (match.hasMatch()) {
   // ...
}

You can also pass a number of pattern options, to enable Perl’s /ismxu switches, as well as other features currently unsupported by Perl.

The QRegularExpressionMatch object is used both for inspecting the results and for extracting captured substrings:

QRegularExpression re("(\\d+) (\\w+)");
QRegularExpressionMatch match = re.match("1234 word");
if (match.hasMatch()) {
    QString digits = match.captured(1); // "1234"
    QString letters = match.captured(2); // "word
}

Note that exactly like QRegExp, the capturing groups are numbered starting from 1. The capturing group #0 is the whole match.

Through QRegularExpressionMatch you can access:

  • the results of the match (did it match? did it partially match?);
  • the contents of the whole match (Perl equivalent: $&);
  • the contents of each numbered capturing group (Perl equivalent: $1, $2, …);
  • the contents of each named capturing group (Perl equivalent: %+);
  • the starting and ending offset of each capturing group (Perl equivalent: @+ and @-).

Global matching

Global matching (Perl’s /g modifier) is the act of searching for all occurrences of a regular expression inside a subject string.

While this may seem trivial (do a match; do another match starting from where the last one ended) it becomes surprisingly tricky when it comes to patterns that may match 0 characters (f.i. because of * quantifiers).

For instance, the pattern a*b*|c must give the following sequence of matches when matching against “ccaabb”:

""
"c"
""  
"c"
"aabb"
""

Yes, lines 1, 3 and 6 are empty matches. Let’s redo it from perl, this time printing the offsets:

$ perl -E 'while ("ccaabb" =~ /a*b*|c/g) { say "|$-[0]|$+[0]|$&|" }'
|0|0||
|0|1|c|
|1|1||
|1|2|c|
|2|6|aabb|
|6|6||

The key of the /g algorithm is that if the last match was a 0-length match, then a new, nonempty, anchored match is reattempted at the same position. If that succeedes, that’s the new match. Otherwise, we advance by one character and redo a normal match from that offset.

(I’m not exactly sure if this algorithm is documented somewhere, but that’s the behaviour QRegularExpression implements.)

When asked for a global match, QRegularExpression returns a QRegularExpressionMatchIterator object, which is a forward-only Java-like iterator over the results:

QRegularExpression re("(\\w+)");
QRegularExpressionMatchIterator i = re.globalMatch("snow rain ice");
while (i.hasNext()) {
    QRegularExpressionMatch match = i.next();
    // use match
}

This totally looks like what you do in Perl:

my $str = "snow rain ice";
while ($str =~ /(\w+)/g) {
   # use $1, etc.
}

Pretty handy, huh?

(Bonus question: how do you implement /g with QRegExp? Not in the way that it’s in the docs!)

Added features

The number of added features, when compared with QRegExp, is huge. A partial list is:

  • it supports lazy (non greedy) and possessive quantifiers;
  • it supports lookbehind assertions;
  • it supports Perl’s /s, /m, /x, /u regexp modifiers;
  • it properly supports Perl’s global matching
  • it supports named capturing groups;
  • it can properly handle non-BMP character ranges;
  • it supports soft and hard partial matching;
  • it supports subroutine calls/references
  • it supports comments;
  • it supports conditional patterns.

Most of these have been long-time nuisances with QRegExp. Now they’re gone.

Speed

QRegularExpression is also much faster than QRegExp. Let’s take a look at a couple of benchmarks. The testing machine configuration used is:

  • CPU: Intel(R) Core(TM)2 Duo CPU P9600 @ 2.66GHz
  • RAM: 4GB 1066 MHz
  • OS: Ubuntu 10.04 32 bit
  • Kernel: Linux 2.6.32-40-generic-pae #87-Ubuntu SMP Mon Mar 5 21:44:34 UTC 2012 i686 GNU/Linux
  • GCC: gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1)

with a release build of Qt 5.0 master (plus a couple of QRegularExpression fixes and improvements currently waiting to be reviewed).

List filtering

List filtering has many applications (think for instance at QSortFilterProxyModel).

For this benchmark I used the complete works by William Shakespeare, downloaded from Project Gutenberg. The text was split on newlines (for a total of 124796 lines), and the list filtered using different regular expressions:

  1. ([Tt]hou|shalt|[Nn]ot|pass|question)
  2. \b(\S{4,})\b.*\b\1\b”

The first one simply extracts the lines containing a certain word.

The second one matches the lines containing a word (…at least 4 non-spaces…) which appears in the same line at least twice; it is also repeated with a case insensitive match — that is, the repetition is allowed to have a different casing.

The results are the following:

********* Start testing of ListfilterREBEnchmark *********
Config: Using QTest library 5.0.0, Qt 5.0.0
PASS   : ListfilterREBEnchmark::initTestCase()
PASS   : ListfilterREBEnchmark::QRegExpBenchmark(Words)
RESULT : ListfilterREBEnchmark::QRegExpBenchmark():"Words":
     141 msecs per iteration (total: 141, iterations: 1)
PASS   : ListfilterREBEnchmark::QRegExpBenchmark(Duplicated words)
RESULT : ListfilterREBEnchmark::QRegExpBenchmark():"Duplicated words":
     2,846 msecs per iteration (total: 2,846, iterations: 1)
PASS   : ListfilterREBEnchmark::QRegExpBenchmark(Duplicated words case insensitive)
RESULT : ListfilterREBEnchmark::QRegExpBenchmark():"Duplicated words case insensitive":
     3,343 msecs per iteration (total: 3,343, iterations: 1)
PASS   : ListfilterREBEnchmark::QRegularExpressionBenchmark(Words)
RESULT : ListfilterREBEnchmark::QRegularExpressionBenchmark():"Words":
     108 msecs per iteration (total: 108, iterations: 1)
PASS   : ListfilterREBEnchmark::QRegularExpressionBenchmark(Duplicated words)
RESULT : ListfilterREBEnchmark::QRegularExpressionBenchmark():"Duplicated words":
     399 msecs per iteration (total: 399, iterations: 1)
PASS   : ListfilterREBEnchmark::QRegularExpressionBenchmark(Duplicated words case insensitive)
RESULT : ListfilterREBEnchmark::QRegularExpressionBenchmark():"Duplicated words case insensitive":
     438 msecs per iteration (total: 438, iterations: 1)
PASS   : ListfilterREBEnchmark::cleanupTestCase()
Totals: 8 passed, 0 failed, 0 skipped
********* Finished testing of ListfilterREBEnchmark *********

So, in this test, QRegularExpression is from 33% to 663% faster than QRegExp.

To say it all, this test also exposes a QRegExp bug — when dumping the lines containing two words, QRegularExpression matched some lines that QRegExp didn’t match. For instance

Thy father’s father wore it

is not matched by QRegExp:

    QRegExp re("\\b(\\S{4,})\\b.*\\b\\1\\b");
    QString s("Thy father's father wore it");
    qDebug() << re.indexIn(s);

This prints -1 (no match) instead of the expected 4. Sigh.

Large file

Another benchmark is a substring extraction from a large file. The benchmark performs a global match on a long string (10 megabytes), looking for substrings matching the following (simple) patterns:

  1. URI: ([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?
  2. Email: ([^ @]+)@([^ @]+)
  3. Date: ([0-9][0-9]?)/([0-9][0-9]?)/([0-9][0-9]([0-9][0-9])?)
  4. URI|Email: ([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)

The test file is this one, which appears to be a corpus of technical documentation. (Since it’s bigger than 10 MB, the test truncates it).

The results are:

********* Start testing of LargefileREBenchmark *********
Config: Using QTest library 5.0.0, Qt 5.0.0
PASS   : LargefileREBenchmark::initTestCase()
PASS   : LargefileREBenchmark::QRegExpBenchmark(URI)
RESULT : LargefileREBenchmark::QRegExpBenchmark():"URI":
     2,112 msecs per iteration (total: 2,112, iterations: 1)
PASS   : LargefileREBenchmark::QRegExpBenchmark(Email)
RESULT : LargefileREBenchmark::QRegExpBenchmark():"Email":
     5,503 msecs per iteration (total: 5,503, iterations: 1)
PASS   : LargefileREBenchmark::QRegExpBenchmark(Date)
RESULT : LargefileREBenchmark::QRegExpBenchmark():"Date":
     169 msecs per iteration (total: 169, iterations: 1)
PASS   : LargefileREBenchmark::QRegExpBenchmark(URI|Email)
RESULT : LargefileREBenchmark::QRegExpBenchmark():"URI|Email":
     6,854 msecs per iteration (total: 6,854, iterations: 1)
PASS   : LargefileREBenchmark::QRegularExpressionBenchmark(URI)
RESULT : LargefileREBenchmark::QRegularExpressionBenchmark():"URI":
     375 msecs per iteration (total: 375, iterations: 1)
PASS   : LargefileREBenchmark::QRegularExpressionBenchmark(Email)
RESULT : LargefileREBenchmark::QRegularExpressionBenchmark():"Email":
     569 msecs per iteration (total: 569, iterations: 1)
PASS   : LargefileREBenchmark::QRegularExpressionBenchmark(Date)
RESULT : LargefileREBenchmark::QRegularExpressionBenchmark():"Date":
     107 msecs per iteration (total: 107, iterations: 1)
PASS   : LargefileREBenchmark::QRegularExpressionBenchmark(URI|Email)
RESULT : LargefileREBenchmark::QRegularExpressionBenchmark():"URI|Email":
     778 msecs per iteration (total: 778, iterations: 1)
PASS   : LargefileREBenchmark::cleanupTestCase()
Totals: 10 passed, 0 failed, 0 skipped
********* Finished testing of LargefileREBenchmark *********

So, in this test, QRegularExpression is from 58% to 867% faster than QRegExp.

Smashing ur heap wiht meh code

The major speed improvement comes from the fact that PCRE uses a Just In Time compiler under all major platforms to speed up the execution of the matching algorithm. The JIT is enabled by default in release builds, but disabled in debug ones, so don’t worry if you notice dramatic slowdowns.

The reason for this is that Valgrind and other debugging tools crash in presence of self-modifying code on the heap. You must enable all SMC checks in Valgrind (–smc-check=all) in order to debug programs using PCRE’s JIT (even if you don’t use it in your program, it may be used internally by Qt or other libraries).

The downside of enabling such checks is that Valgrind gets dramatically slowed down. And that’s why JIT is disabled by default in debug builds. You can still override this setting and enable/disable the JIT usage both in debug and release builds.

Upgrade path from QRegExp

Most of the time, the upgrade path should be painless: just start using QRegularExpression with the same patterns. They should be mostly compatible, although the other various QRegExp syntaxes (wildcard, XSD, etc.) are not supported — yet.

The biggest difference is using the QRegularExpressionMatch object instead of calling QRegExp::indexIn() and then using QRegExp::cap(). But I believe that will lead to better code.

Moreover, you’re going to get the usual overloads for QString methods, QStringList, QObject::findChild(ren), etc. (they’re already in api_changes, waiting to be merged back into 5.0 master). Sadly, some class will have to wait until 5.1 to see such overloads.

Beware that you can have really bad surprises if you inadvertently were using illegal QRegExps (for instance, containing a lazy quantifier): they’ll now become legal, and start to match! If that triggers untested code paths, you can have an hard time figuring out what’s going on…

Other differences with QRegExp are in the official API docs.

That’s all, folks!

Hope you enjoyed this brief introduction to the new regular expression classes. :-)

Happy hacking!

Advertisements