Wednesday, September 18, 2024

Five Habits for Successful Regular Expressions

Regular expressions are hard to write, hard to read, and hard to maintain. Plus, they are often wrong, matching unexpected text and missing valid text. The problem stems from the power and expressiveness of regular expressions. Each metacharacter packs power and nuance, making code impossible to decipher without resorting to mental gymnastics.

Most implementations include features that make reading and writing regular expressions easier. Unfortunately, they are hardly ever used. For many programmers, writing regular expressions is a black art. They stick to the features they know and hope for the best. If you adopt the five habits discussed in this article, you will take most of the trial and error out of your regular expression development.

This article uses Perl, PHP, and Python in the code examples, but the advice here is applicable to nearly any regex implementation.

1. Use Whitespace and Comments

Most programmers have no problem adding whitespace and indentation to the code surrounding a regular expression. They would be laughed at or yelled at if they didn’t (hopefully, yelled at). Nearly everyone knows that code is harder to read, write, and maintain if it is crammed into one line. Why would that be any different with regular expressions?

The only trick to remember with extended whitespace is that the regex engine ignores whitespace. So if you are hoping to match whitespace, you have to say so explicitly, often with \s.

In Perl, add an x to the end of the regex, so m/foo|bar/ becomes:

m/
foo
|
bar
/x

In PHP, add an x to the end of the regex, so “/foo|bar/” becomes:

"/
foo
|
bar
/x"

In Python, pass the mode modifier, re.VERBOSE, to the compile function:

import re

pattern = r'''
foo
|
bar
'''

regex = re.compile(pattern, re.VERBOSE)
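To see the whitespace rule in practice, here is a small Python sketch. The sample strings and variable names are made up for illustration. With re.VERBOSE, a literal space in the pattern is ignored, so matching a space requires something explicit such as \s.

```python
import re

# In verbose mode the space between "foo" and "bar" is ignored,
# so this pattern is equivalent to "foobar".
ignored_space = re.compile(r'foo bar', re.VERBOSE)

# To match whitespace, say so explicitly with \s.
explicit_space = re.compile(r'foo \s bar', re.VERBOSE)

print(bool(ignored_space.search("foobar")))    # True
print(bool(ignored_space.search("foo bar")))   # False
print(bool(explicit_space.search("foo bar")))  # True
```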

The value of whitespace and comments becomes more important when working with more complex regular expressions. Consider the following regular expression to match a U.S. phone number:

\(?\d{3}\)? ?\d{3}[-.]\d{4}

This regex matches phone numbers like “(314)555-4000”. Ask yourself if the regex would match “314-555-4000” or “555-4000”. The answer is no in both cases. Writing this pattern on one line conceals both flaws and design decisions. The area code is required and the regex fails to account for a separator between the area code and prefix.
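You do not have to take those answers on faith. A quick check in Python, offered as a sketch with the one-line pattern written as a raw string, shows which of the three numbers match:

```python
import re

# The one-line phone pattern, written as a Python raw string.
phone = re.compile(r'\(?\d{3}\)? ?\d{3}[-.]\d{4}')

for text in ["(314)555-4000", "314-555-4000", "555-4000"]:
    result = "matched" if phone.search(text) else "no match"
    print(f"{text}: {result}")
```

Only the first number matches; the other two report no match.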

Spreading the pattern out over several lines makes the flaws more visible and the necessary modifications easier.

In Perl this would look like:

/
    \(?     # optional parentheses
      \d{3} # area code required
    \)?     # optional parentheses
    [-\s.]? # separator is either a dash, a space, or a period.
      \d{3} # 3-digit prefix
    [-.]    # another separator
      \d{4} # 4-digit line number
/x

The rewritten regex now has an optional separator after the area code so that it matches “314-555-4000.” The area code is still required. However, a new programmer who wants to make the area code optional can quickly see that it is not optional now, and that a small change will fix that.
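As an illustration of how small that change can be, here is one possible version in Python. This is a sketch, not the only way to do it: it assumes the area code and the separator that follows it should be optional together, so both are wrapped in a non-capturing group.

```python
import re

# Hypothetical variant: the area code and its separator are wrapped in an
# optional non-capturing group, so a bare 7-digit number also matches.
phone = re.compile(r'''
    (?:            # start of optional area-code group
      \(?          # optional parentheses
        \d{3}      # area code
      \)?          # optional parentheses
      [-\s.]?      # separator is either a dash, a space, or a period
    )?             # the whole area-code group is optional
      \d{3}        # 3-digit prefix
    [-.]           # another separator
      \d{4}        # 4-digit line number
''', re.VERBOSE)

for text in ["314-555-4000", "(314)555-4000", "555-4000"]:
    match = phone.search(text)
    print(text, "->", match.group(0) if match else None)
```

Because the pattern is spread out, the change is confined to the two lines that open and close the new group.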

2. Write Tests

There are three levels of testing, each adding a higher level of reliability to your code. First, you need to think hard about what you want to match and whether you can deal with false matches. Second, you need to test the regex on example data. Third, you need to formalize the tests into a test suite.

Deciding what to match is a trade-off between making false matches and missing valid matches. If your regex is too strict, it will miss valid matches. If it is too loose, it will generate false matches. Once the regex is released into live code, you probably will not notice either way. Consider the phone regex example above; it would match the text “800-555-4000 = -5355”. False matches are hard to catch, so it’s important to plan ahead and test.

Sticking with the phone number example, if you are validating a phone number on a web form, you may settle for ten digits in any format. However, if you are trying to extract phone numbers from a large amount of text, you might want to be more exact to avoid an unacceptable number of false matches.
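The two ends of that trade-off might look like this in Python. The function name looks_like_phone is hypothetical, and the "exactly ten digits" rule is just one reading of "ten digits in any format".

```python
import re

def looks_like_phone(text):
    """Loose form validation: accept any input containing exactly ten digits."""
    digits = re.sub(r'\D', '', text)  # drop everything that is not a digit
    return len(digits) == 10

# The stricter extraction pattern from this article, on one line.
phone = re.compile(r'\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4}')

print(looks_like_phone("314 555 4000"))  # True
print(looks_like_phone("555-4000"))      # False

# Even the stricter pattern produces a false match on arithmetic.
print(phone.findall("800-555-4000 = -5355"))  # ['800-555-4000']
```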

When thinking about what you want to match, write down example cases. Then write some code that tests your regular expression against the example cases. Any complicated regular expression is best written in a small test program, as the examples below demonstrate:

In Perl:

#!/usr/bin/perl
use strict;
use warnings;

# Sample inputs, including some that should not match.
my @tests = ( "314-555-4000",
              "800-555-4400",
              "(314)555-4000",
              "314.555.4000",
              "555-4000",
              "aasdklfjklas",
              "1234-123-12345"
            );

foreach my $test (@tests) {
    if ( $test =~ m/
                   \(?     # optional parentheses
                     \d{3} # area code required
                   \)?     # optional parentheses
                   [-\s.]? # separator is either a dash, a space, or a period.
                     \d{3} # 3-digit prefix
                   [-\s.]  # another separator
                     \d{4} # 4-digit line number
                   /x ) {
        print "Matched on $test\n";
    }
    else {
        print "Failed match on $test\n";
    }
}

In PHP:

<?php
$tests = array( "314-555-4000",
                "800-555-4400",
                "(314)555-4000",
                "314.555.4000",
                "555-4000",
                "aasdklfjklas",
                "1234-123-12345"
              );

$regex = "/
    \(?     # optional parentheses
      \d{3} # area code
    \)?     # optional parentheses
    [-\s.]? # separator is either a dash, a space, or a period.
      \d{3} # 3-digit prefix
    [-\s.]  # another separator
      \d{4} # 4-digit line number
/x";

foreach ($tests as $test) {
    if (preg_match($regex, $test)) {
        echo "Matched on $test\n";
    }
    else {
        echo "Failed match on $test\n";
    }
}
?>

In Python:

import re

tests = ["314-555-4000",
         "800-555-4400",
         "(314)555-4000",
         "314.555.4000",
         "555-4000",
         "aasdklfjklas",
         "1234-123-12345"
        ]

pattern = r'''
            \(?     # optional parentheses
              \d{3} # area code
            \)?     # optional parentheses
            [-\s.]? # separator is either a dash, a space, or a period.
              \d{3} # 3-digit prefix
            [-\s.]  # another separator
              \d{4} # 4-digit line number
           '''

regex = re.compile(pattern, re.VERBOSE)

for test in tests:
    # search() scans the whole string, like the Perl and PHP versions
    if regex.search(test):
        print("Matched on", test)
    else:
        print("Failed match on", test)

Running the test script exposes yet another problem in the phone number regex: it matched “1234-123-12345”. Include tests that you expect to fail as well as those you expect to match.

Ideally, you would incorporate these tests into the test suite for your entire program. Even if you do not have a test suite already, your regular expression tests are a good foundation for a suite, and now is the perfect opportunity to start on one. Even if now is not the right time (really, it is!), you should make a habit to run your regex tests after every modification. A little extra time here could save you many headaches.
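As a starting point, the example cases can be moved into Python's standard unittest module. This is a minimal sketch: the class and method names are made up, and the known false match is marked with unittest.expectedFailure so that it stays visible until the pattern is tightened.

```python
import re
import unittest

# The phone-number pattern from this article, unchanged.
PHONE = re.compile(r'''
    \(?      # optional parentheses
      \d{3}  # area code required
    \)?      # optional parentheses
    [-\s.]?  # separator is either a dash, a space, or a period
      \d{3}  # 3-digit prefix
    [-\s.]   # another separator
      \d{4}  # 4-digit line number
''', re.VERBOSE)


class PhoneRegexTest(unittest.TestCase):
    def test_valid_numbers_match(self):
        valid = ["314-555-4000", "800-555-4400",
                 "(314)555-4000", "314.555.4000"]
        for text in valid:
            with self.subTest(text=text):
                self.assertIsNotNone(PHONE.search(text))

    def test_invalid_text_does_not_match(self):
        for text in ["555-4000", "aasdklfjklas"]:
            with self.subTest(text=text):
                self.assertIsNone(PHONE.search(text))

    @unittest.expectedFailure
    def test_extra_digits_do_not_match(self):
        # Known false match: the pattern finds "234-123-1234" in this text.
        self.assertIsNone(PHONE.search("1234-123-12345"))
```

Saved in a file such as test_phone_regex.py (a hypothetical name), the tests can be run with python -m unittest after every change to the pattern.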

3. Group the Alternation Operator

The alternation operator (|) has a low precedence. This means that it often alternates over more than the programmer intended. For example, a regex to extract email addresses out of a mail file might look like:

^CC:|To:(.*)

The above attempt is incorrect, but the bugs often go unnoticed. The intent of the above regex is to find lines starting with “CC:” or “To:” and then capture any email addresses on the rest of the line.

Unfortunately, the regex doesn’t actually capture anything from lines starting with “CC:” and may capture random text if “To:” appears in the middle of a line. In plain English, the regular expression matches lines beginning with “CC:” and captures nothing, or matches any line containing the text “To:” and then captures the rest of the line. Usually, it will capture plenty of addresses and nobody will notice the failings.

If that were the real intent, you should add parentheses to say it explicitly, like this:

(^CC:)|(To:(.*))

However, the real intent of the regex is to match lines starting with “CC:” or “To:” and then capture the rest of the line. The following regex does that:

^(CC:|To:)(.*)

This is a common and hard-to-catch bug. If you develop the habit of wrapping your alternations in parentheses (or non-capturing parentheses, (?:…)) you can avoid this error.
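The difference between the ungrouped and grouped versions is easy to check. The following Python sketch runs both against two made-up lines, one starting with “CC:” and one with “To:” in the middle:

```python
import re

buggy = re.compile(r'^CC:|To:(.*)')
fixed = re.compile(r'^(CC:|To:)(.*)')

cc_line = "CC: alice@example.com"   # hypothetical header line
stray_line = "Note To: self"        # "To:" appears mid-line

# Ungrouped alternation: the CC: branch has no capture group at all,
# and the To: branch is not anchored to the start of the line.
buggy_cc = buggy.search(cc_line).group(1)        # None, nothing captured
buggy_stray = buggy.search(stray_line).group(1)  # captures the text " self"

# Grouped alternation: both prefixes are anchored, and the rest is captured.
fixed_cc = fixed.search(cc_line).group(2)        # " alice@example.com"
fixed_stray = fixed.search(stray_line)           # None, no match

print(repr(buggy_cc), repr(buggy_stray), repr(fixed_cc), repr(fixed_stray))
```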

4. Use Lazy Quantifiers

Most people avoid using the lazy quantifiers *?, +?, and ??, even though they are easy to understand and make many regular expressions easier to write.

Lazy quantifiers match as little text as possible while still aiding the success of the overall match. If you write foo(.*?)bar, the quantifier will stop matching the first time it sees “bar”, not the last time. This may be important if you are trying to capture “###” in the text “foo###bar+++bar”. A regular quantifier would have captured “###bar+++”.
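Here is that comparison as a short Python sketch, using the same text:

```python
import re

text = "foo###bar+++bar"

# Lazy: stop at the first "bar" that lets the match succeed.
lazy = re.search(r'foo(.*?)bar', text).group(1)

# Greedy: run to the last "bar" in the text.
greedy = re.search(r'foo(.*)bar', text).group(1)

print(lazy)    # ###
print(greedy)  # ###bar+++
```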

Let’s say you want to capture all of the phone numbers from an HTML file. You could use the phone number regular expression example we discussed earlier in this article. However, if you know that the file contains all of the phone numbers in the first column of a table, you can write a much simpler regex using lazy quantifiers:

<tr><td>(.+?)</td>

Many beginning regular expression programmers avoid lazy quantifiers by using negated character classes instead. They write the above code as:

<tr><td>([^>]+)</td>

That works in this case, but leads to trouble if the text you are trying to capture contains common characters from your delimiter (in this case, </td>). If you use lazy quantifiers, you will spend less time kludging character classes and produce clearer regular expressions.

Lazy quantifiers are most valuable when you know the structure surrounding the text you want to capture.
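The following Python sketch compares the lazy pattern with the negated character class on a made-up table. The second row wraps the number in a <b> tag, so the captured text contains a character from the delimiter:

```python
import re

# Hypothetical table markup; the names and layout are made up for illustration.
html = (
    "<tr><td>314-555-4000</td><td>Alice</td></tr>"
    "<tr><td><b>800-555-4400</b></td><td>Bob</td></tr>"
)

lazy = re.findall(r'<tr><td>(.+?)</td>', html)
negated = re.findall(r'<tr><td>([^>]+)</td>', html)

print(lazy)     # ['314-555-4000', '<b>800-555-4400</b>']
print(negated)  # ['314-555-4000']
```

The negated class silently skips the second row, while the lazy quantifier captures the whole cell.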

5. Use Available Delimiters

Perl and PHP often use the forward slash to mark the start and end of a regular expression. Python uses a variety of quotes to mark the start and end of a string, which may then be used as a regular expression. If you stick with the slash delimiter in Perl and PHP, you will have to escape any slashes in your regex. If you use regular quotes in Python, you will have to escape all of your backslashes. Choosing different delimiters or quotes allows you to avoid escaping half of your regex. This makes the regex easier to read and reduces the potential for bugs when you forget to escape something.

Perl and PHP allow you to use any character that is neither alphanumeric nor whitespace as a delimiter. If you switch to a new delimiter, you can avoid having to escape the forward slashes when you are trying to match URLs or HTML tags such as “http://” or “<br />”.

For example:

/http:\/\/(\S)*/

could be rewritten as:

#http://(\S)*#

Common delimiters are #, !, and |. If you use square brackets, angle brackets, or curly braces, the opening and closing brackets must match. Here are some common uses of delimiters:

#…#
!…!
{…}
s|…|…| (Perl only)
s[…][…] (Perl only)
s<…>/…/ (Perl only)

In Python, regular expressions are treated as strings first. If you use quotes (the regular string delimiter) you will have to escape all of your backslashes. However, you can use raw strings, r'', to avoid this. Raw triple-quoted strings combined with the re.VERBOSE option also let you spread the pattern over several lines.

For example:

regex = "(\\w+)(\\d+)"

could be rewritten as:

regex = r'''
           (\w+)
           (\d+)
         '''
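A short sketch confirms that the escaped and raw spellings produce the same pattern, and that the triple-quoted version needs re.VERBOSE when it is compiled. The sample text "abc123" is made up for illustration.

```python
import re

escaped = "(\\w+)(\\d+)"   # regular string: every backslash is doubled
raw = r"(\w+)(\d+)"        # raw string: backslashes are written once

print(escaped == raw)  # True

# The multi-line form only works when compiled with re.VERBOSE.
verbose = re.compile(r'''
    (\w+)   # one or more word characters
    (\d+)   # one or more digits
''', re.VERBOSE)

print(verbose.search("abc123").groups())  # ('abc12', '3')
```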

Conclusion

The advice in this article focuses on making regular expressions readable. In developing habits to achieve this, you will be forced to think more clearly about the design and structure of your regular expressions. This will reduce bugs and ease the life of the code maintainer. You will be especially happy if that code maintainer is you.

I would like to thank Sarah Burcham for advice on this article. Also, thanks to Jeffrey E.F. Friedl for Mastering Regular Expressions. His book serves as the foundation for everything I do with regular expressions.

Tony Stubblebine writes Perl and regular expressions for O’Reilly.

*Reprinted with the permission of the O’Reilly Network.

Tony Stubblebine writes Perl and regular expressions for O’Reilly. Previously he held the titles of Web Peasant and Rogue Developer while hacking Perl for MasterCard International. He is also the Social Director and Senior Nightlife Correspondent for the O’Reilly Network.

