Maintainable Regular Expressions

2011-08-22


I've seen a lot of regular expressions in my career: some genius, others monstrous, some very subtle and some really blunt. The person who wrote the expression usually did their best to make it work, and finally succeeded in making it pass all the test cases they thought up.

The end result is usually a one-liner that has too many ()'s, {}'s and []'s to ever be human-readable again. And that is, let's face it, a huge problem. Because no matter how well it was tested, a time will come when a bug is found or the requirements change.

Then what?!

Usually the expression is thrown away and written from scratch. Or, the expression is carefully dissected, placed over multiple lines, indented and prettied up, all to see what it was supposed to do.

I've also come across documentation where this nicely prettied up version is placed in a Word document and each and every line is documented.

It would be so nice if you could just put that documentation right in the code. Wouldn't it?!


Well, this has been possible for quite some time. In C# you can prefix a string literal with the @ sign to spread a regular expression over multiple lines, and you can add comments to a regular expression (#) just like in ordinary code. This type of string is called a verbatim string. It not only adds multi-line support, but also turns off backslash escape processing by the compiler, so \d can be written as-is instead of \\d (only a double quote has to be doubled). With regular expressions, this changed behavior is usually for the good. The only problem you'll encounter is that all that nicely prettified white-space actually becomes part of your expression, and things break.

That is why Microsoft decided to add the IgnorePatternWhitespace option to the RegexOptions enumeration:

IgnorePatternWhitespace

Eliminates unescaped white space from the pattern and enables comments marked with #. However, the IgnorePatternWhitespace value does not affect or eliminate white space in character classes.
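To make the difference concrete, here is a minimal sketch that runs the same multi-line pattern with and without the option. The class name WhitespaceDemo, its members and the small sample pattern are made up for illustration and are not part of the original example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Hypothetical sample pattern: two digits, a colon, two digits.
    public const string Pattern = @"
        \d{2}   # hours
        :       # separator
        \d{2}   # minutes
    ";

    // Without the option, the line breaks, the indentation and the
    // '#' comments are all treated as literal text to be matched.
    public static bool MatchesLiterally(string input) =>
        Regex.IsMatch(input, Pattern);

    // With the option, unescaped white space and '#' comments are dropped,
    // so the effective pattern is \d{2}:\d{2}.
    public static bool MatchesIgnoringWhitespace(string input) =>
        Regex.IsMatch(input, Pattern, RegexOptions.IgnorePatternWhitespace);
}
```

With the input "12:15", MatchesLiterally returns false and MatchesIgnoringWhitespace returns true.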


Let's take this example from RegexLib.net:

Expression:  ^(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?$
Description: Time in 24 hours format with optional seconds
Matches:     12:15 | 10:26:59 | 22:01:15
Non-Matches: 24:10:25 | 13:2:60


The expression is relatively simple, but already hard to read because of all the ()'s and alternations. Let's see how we can improve this:
  • Let us split up the expression over multiple lines, and put a comment before each identifiable part.
  • Let us add comments to show the actual range each value can have.
    string timeExpression = @"
        ^ # Start of input
 
        #hours
        (
         ([0-1]?[0-9]) # 00-19
         |([2][0-3])   # 20-23
        )
 
        #minutes
        :([0-5]?[0-9])         # 0-59
 
        #seconds are optional
        (
         :([0-5]?[0-9]) # 0-59
        )?
 
        $ # end of input
    ";
 
    RegexOptions options = RegexOptions.IgnorePatternWhitespace 
          | RegexOptions.ExplicitCapture;
            
    Regex timeRegex = new Regex(timeExpression, options);

That's a lot better! And if you look carefully, you might spot a potential bug. In the minute and second parts the leading 0 is optional, while usually this is only allowed for the hour part. The samples provided with the expression don't make it obvious that the leading 0 is optional: the non-matching sample 13:2:60 is rejected because of the 60, not because of the single-digit minute. I didn't spot it initially, guess why ;).
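The leniency is easy to confirm with a small check. The sketch below wraps the commented pattern in a hypothetical TimeRegexCheck class (the class and method names are assumptions, not part of the original) and tries the RegexLib samples plus one extra input with a single-digit minute.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TimeRegexCheck
{
    // Same pattern as the commented version, with the leading zero still
    // optional in the minute and second parts.
    private static readonly Regex TimeRegex = new Regex(@"
        ^                   # start of input
        (
         ([0-1]?[0-9])      # hours 0-19, leading zero optional
         |([2][0-3])        # hours 20-23
        )
        :([0-5]?[0-9])      # minutes 0-59, leading zero optional
        (
         :([0-5]?[0-9])     # seconds 0-59, leading zero optional
        )?
        $                   # end of input
    ", RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    public static bool IsTime(string input) => TimeRegex.IsMatch(input);
}
```

IsTime accepts 12:15, 10:26:59 and 22:01:15 and rejects 24:10:25 and 13:2:60, just as the samples promise. It also accepts 13:2:59, which is the case the samples never exercise.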


Arguably, the expression could be optimized by using repetition for the minute and second part. For something like time, where we're all very familiar with the different components and name them automatically, this might be a bad idea, but in many other cases this optimization might actually improve readability:
    string timeExpression = @"
        ^ # Start of input
 
        #hours
        (
         ([0-1]?[0-9]) # 00-19
         |([2][0-3])   # 20-23
        )
 
        #minutes and optional seconds
        (:([0-5][0-9])){1,2}   # 00-59, once or twice
 
        $ # end of input";
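The repetition variant can be checked in the same way. This sketch writes the repeated group as (:[0-5][0-9]){1,2}, with balanced parentheses and a mandatory leading zero; the class name RepeatedTimeRegex and its method are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatedTimeRegex
{
    private static readonly Regex TimeRegex = new Regex(@"
        ^                       # start of input
        (
         ([0-1]?[0-9])          # hours 0-19, leading zero optional
         |([2][0-3])            # hours 20-23
        )
        (:[0-5][0-9]){1,2}      # minutes, then optional seconds, both 00-59
        $                       # end of input
    ", RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    public static bool IsTime(string input) => TimeRegex.IsMatch(input);
}
```

Here 12:15 and 10:26:59 still match, while 13:2:59 (single-digit minute), 12 (no minutes) and 12:15:30:45 (three repetitions) are rejected.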

Problem: But I have white-space that is actually important!

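If you want to have white-space as part of your expression, you can still use \s (which matches any white-space character, not just a space), or put the white space in a character class: [ ]. Only unescaped white space is stripped, so a space preceded by a backslash is kept as well.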
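A short sketch of these options side by side, using a made-up "hello world" pattern; the class and method names are purely illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SignificantWhitespaceDemo
{
    private const RegexOptions Options = RegexOptions.IgnorePatternWhitespace;

    // The unescaped spaces are stripped: the effective pattern is ^helloworld$.
    public static bool Unescaped(string input) =>
        Regex.IsMatch(input, @"^ hello world $", Options);

    // A space inside a character class survives and matches exactly one space.
    public static bool CharacterClass(string input) =>
        Regex.IsMatch(input, @"^ hello [ ] world $", Options);

    // A backslash-escaped space survives as well.
    public static bool Escaped(string input) =>
        Regex.IsMatch(input, @"^ hello \  world $", Options);

    // \s matches any single white-space character, including a tab.
    public static bool AnyWhitespace(string input) =>
        Regex.IsMatch(input, @"^ hello \s world $", Options);
}
```

Unescaped matches "helloworld" but not "hello world". CharacterClass and Escaped match "hello world" only, while AnyWhitespace also accepts a tab between the two words.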

Problem: My language doesn't support multi-line strings
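VB.NET doesn't really support multi-line strings (like the verbatim strings in C#), but has the ugly continuation construct. Because a # comment runs until the next line break, every piece has to end with an explicit line break (vbLf here); without it, the first # would turn the rest of the expression into one long comment:
Dim timeExpression = "^ # Start of input" & vbLf _
        & "#hours" & vbLf _
        & "(" & vbLf _
        & "     ([0-1]?[0-9]) # 00-19" & vbLf _
        & "     |([2][0-3])   # 20-23" & vbLf _
        & ")" & vbLf _
        & "#minutes and optional seconds" & vbLf _
        & "(:([0-5][0-9])){1,2}   # 00-59" & vbLf _
        & "$ # end of input"

It works, but isn't as nice.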
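One caveat applies to any pattern that is glued together from single-line pieces: a # comment only ends at a line break. The sketch below shows the effect in C# with a made-up two-digit pattern; the class and member names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentTerminationDemo
{
    private const RegexOptions Options = RegexOptions.IgnorePatternWhitespace;

    // No line breaks: everything after the first '#' is a single comment,
    // so the effective pattern is just ^ and matches any input.
    public const string WithoutLineBreaks =
        "^ # start of input " +
        "[0-9]{2} # two digits " +
        "$ # end of input ";

    // A line break after each piece ends the comment on that piece,
    // so the effective pattern is ^[0-9]{2}$.
    public const string WithLineBreaks =
        "^ # start of input \n" +
        "[0-9]{2} # two digits \n" +
        "$ # end of input ";

    public static bool MatchWithout(string input) =>
        Regex.IsMatch(input, WithoutLineBreaks, Options);

    public static bool MatchWith(string input) =>
        Regex.IsMatch(input, WithLineBreaks, Options);
}
```

MatchWithout returns true for "not a number", because only the ^ is left of the pattern. MatchWith rejects that input and accepts "42".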

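Of course you can also use inline comments, written as (?# ... ). They end at the closing parenthesis rather than at a line break, so they work in a single-line string, and you can still ignore pattern white-space, but readability does suffer: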
string timeExpression = @"(?# Start of input)^(?#hours)((?# 00-19)([0-1]?[0-9])|(?# 20-23)([2][0-3]))(?#minutes):(?# 0-59)([0-5]?[0-9])(?#seconds are optional)(:(?# 0-59)([0-5]?[0-9]))?(?# end of input)$";
 
