The Developer’s RegEx Survival Guide: 15 Rules for Making Sense of Regular Expressions

Regular expressions are a powerful, expressive and compact way to solve many programming problems involving text. Yet when I help people on StackOverflow, I see a lot of pain and anguish from people misusing regexes, or not knowing the common pitfalls. In this article, I give pointers to help newcomers learn how best to use these tools. These rules should help save you from the heartache of debugging.

Most of my examples are in Perl because Perl is the granddaddy of all the programming languages as far as regex support goes. While the syntax for working with regexes is different in languages like PHP, Ruby, or Java, the principles are the same.

1. Never assume that a regex will match.

The #1 problem I see with rookies and regexes is failing to check whether the regex matched.

Consider this case where you have a file with lines of the form:

   red 47
   green 17
   yellow 42

and you’re parsing the file with a loop like this:

   while ( my $row = <$filehandle> ) {
      $row =~ /^(\w+) (\d+)$/;
      print "Color=$1, count=$2\n";
   }

The regex matches a word and a number and prints them out. But what if, somehow, a line in a format you weren’t expecting appears, such as:

   light blue 28

The regex match fails and the values of the group capture variables $1 and $2 are unchanged. They won’t be set to undef. They just won’t change at all, and you’ll have the values from the previous line in the file.

What you must do is check if the regex matches:

   while ( my $row = <$filehandle> ) {
      if ( $row =~ /^(\w+) (\d+)$/ ) {
         print "Color=$1, count=$2\n";
      }
   }

If you have a case where you are absolutely certain “that can never fail,” then handle the case where it does fail and have your program stop executing.

   while ( my $row = <$filehandle> ) {
      $row =~ /^(\w+) (\d+)$/ or die "Regex failed to match!";
      print "Color=$1, count=$2\n";
   }

Always assert your assumptions. Your users will forgive you for a program crashing. They will not forgive you for giving incorrect results.
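
As a minimal sketch of the same idea, the loop can also collect the lines that fail to match so they can be reported later. The parse_colors helper and the sample lines here are made up for illustration; they are not part of any library.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "color count" lines and keep track of the
# lines that do not match, instead of reusing stale $1 and $2.
sub parse_colors {
    my @lines = @_;
    my ( %count_for, @rejected );
    for my $row (@lines) {
        if ( $row =~ /^(\w+) (\d+)$/ ) {
            $count_for{$1} = $2;
        }
        else {
            push @rejected, $row;
        }
    }
    return ( \%count_for, \@rejected );
}

my ( $counts, $rejected ) = parse_colors( 'red 47', 'light blue 28', 'yellow 42' );
print "Parsed: ", join( ', ', map { "$_=$counts->{$_}" } sort keys %$counts ), "\n";
print "Rejected: @$rejected\n";
```

Whether to skip, log, or die on a rejected line depends on the program. The point is that the decision is made explicitly.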

2. Only build character classes if you need to.

Character classes define a set of characters to match, such as [a-z] for lowercase letters, or [0-9] for the ten digits. Metacharacters can stand in for common character classes, such as \d for a digit, or \s for whitespace. The capital versions of these metacharacters stand for the negation, so where \d is a digit, \D is a non-digit.

Metacharacters make your regexes shorter, with less punctuation, and therefore easier to read.
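
Here is a small sketch comparing the two styles on a made-up "hours:minutes" format. It assumes plain ASCII input; on Unicode strings Perl’s \d can also match digits outside 0-9, so the two patterns are not always interchangeable.

```perl
use strict;
use warnings;

# For plain ASCII input these two patterns accept the same strings.
my $with_classes = qr/^[0-9][0-9]:[0-9][0-9]$/;
my $with_meta    = qr/^\d\d:\d\d$/;

for my $candidate ( '09:30', '9:30', 'ab:cd' ) {
    my $long  = $candidate =~ $with_classes ? 'match' : 'no match';
    my $short = $candidate =~ $with_meta    ? 'match' : 'no match';
    print "$candidate: classes=$long, metacharacters=$short\n";
}
```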

3. Use /i instead of character classes.

There are many trailing regex modifiers, but perhaps the most commonly used one is /i for case-insensitive.

Say you’re matching a string that can have a 2-digit number and either am or pm, without regard to case. Instead of writing it as:

   /^\d\d[aApP][mM]$/

write it as:

   /^\d\d[ap]m$/i
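
A quick way to convince yourself that the shorter form behaves the same is to run it against a few sample strings. The samples below are invented for this sketch.

```perl
use strict;
use warnings;

# Case-insensitive match for two digits followed by am or pm.
my $time_re = qr/^\d\d[ap]m$/i;

for my $candidate ( '10am', '11PM', '09Pm', '12xm' ) {
    my $verdict = $candidate =~ $time_re ? 'matches' : 'does not match';
    print "$candidate $verdict\n";
}
```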

4. Know when to quote.

If you’re searching for Mr. Smith, then this won’t do:

   if ( $name =~ /Mr. Smith/ ) { ...

because the period (.) is a metacharacter meaning “any character.” The strings Mrs Smith and Mr! Smith would both match. You need to use:

   if ( $name =~ /Mr\. Smith/ ) { ...

Where this trips up most people is when the pattern is coming out of a variable.

   $wanted = 'Mr. Smith';
   if ( $name =~ /$wanted/ ) { ...

There are two solutions. Either escape the string with the quotemeta function in Perl, preg_quote in PHP, or Regexp.escape in Ruby:

   $wanted = quotemeta( 'Mr. Smith' );
   # $wanted is now 'Mr\.\ Smith' (quotemeta escapes every non-word character)
   if ( $name =~ /$wanted/ ) { ...

or use the \Q modifier in the regex:

   if ( $name =~ /\Q$wanted/ ) { ...
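
The difference is easy to see with a short sketch that runs both forms over a few made-up names. The \E used here simply marks where the quoting started by \Q ends.

```perl
use strict;
use warnings;

my $wanted = 'Mr. Smith';
my @names  = ( 'Mr. Smith', 'Mrs Smith', 'Mr! Smith' );

# Unquoted: the "." in $wanted matches any character.
my @unquoted = grep { /$wanted/ } @names;

# Quoted: the "." in $wanted matches only a literal period.
my @quoted = grep { /\Q$wanted\E/ } @names;

print "Unquoted pattern matches: ", join( ' | ', @unquoted ), "\n";
print "Quoted pattern matches:   ", join( ' | ', @quoted ), "\n";
```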

5. Use alternate delimiters to avoid Leaning Toothpick Syndrome.

Say you’re trying to match something that looks roughly like a URL. You can do it like this in Perl:

   if ( $potential_url =~ /^https?:\/\// ) { ...

but those slashes and backslashes get confusing. Use an alternate delimiter, either with the same character:

   if ( $potential_url =~ m#^https?://# ) { ...

or a pair of matching characters, such as curly braces, which is my preference:

   if ( $potential_url =~ m{^https?://} ) { ...

Note that since the delimiter is not /, I must use the m operator to signify a match operation, or an s operator to signify a substitution.

6. Don’t quote everything because you don’t remember the metacharacters.

Many punctuation characters have special meanings in regexes, but not all of them do. Learn which ones mean what, and act accordingly, but don’t quote everything.

Say you’re trying to find computer author Stan Kelly-Bootle. The regex might look like this:

   /Mr\. Stan Kelly-Bootle/

but I’ve often seen people write

   /Mr\. Stan Kelly\-Bootle/

where the hyphen is escaped out of superstition, on the theory that any punctuation character must be escaped. The period (.) in `Mr.` does need to be escaped because otherwise the period means “any character.” Outside of a character class, hyphens have no special meaning, so they do not need to be escaped to match a literal hyphen. Escaping them only makes your regex harder to read.
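
A short sketch of where the hyphen does and does not matter. The test strings are made up; the only place the hyphen needs care is inside a bracketed character class, where it forms a range when it sits between two characters.

```perl
use strict;
use warnings;

my $name = 'Stan Kelly-Bootle';

# Outside a character class, a hyphen is an ordinary character,
# so the escaped and unescaped forms behave identically.
my $plain   = $name =~ /Kelly-Bootle/  ? 1 : 0;
my $escaped = $name =~ /Kelly\-Bootle/ ? 1 : 0;
print "plain=$plain escaped=$escaped\n";

# Inside a character class, a hyphen placed last is taken literally.
my @hits = grep { /^[a-z.-]+$/ } ( 'kelly-bootle', 'kelly_bootle' );
print "Accepted by the class: @hits\n";
```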

7. Use string functions when not extracting patterns.

If you want to see if a string is between 10 and 12 characters long, you can do this:

   if ( $str =~ /^...........?.?$/ ) { ...

or you can do this:

   my $len = length($str);
   if ( $len >= 10 && $len <= 12 ) { ...

The second one is a little more typing, but far clearer.

If you want to extract the first three characters of a string, you could do:

   $str =~ /^(...)/;
   my $first_three = $1;

or you can use the existing tools of your language:

   my $first_three = substr( $str, 0, 3 );

Write whichever form best expresses the intent of the code. For my money, it’s the latter.

8. Use functions, not regexes, to find existence of constant strings.

If you’re trying to find the existence of a string, and not a pattern, then a built-in function is clearer, immune to problems with metacharacters, and very slightly faster if you execute it many times.

   # PHP
   if ( strpos( $big_string, 'Mr. Smith' ) !== FALSE ) { ...
   # Perl
   if ( index( $big_string, 'Mr. Smith' ) > -1 ) { ...

Note that you also have the bonus of not having to quote metacharacters.
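
To make the metacharacter problem concrete, here is a minimal sketch with an invented sentence. The unquoted regex reports a match that is not really there, while index does a plain substring search.

```perl
use strict;
use warnings;

my $big_string = 'Please forward this to Mrs Smith.';
my $needle     = 'Mr. Smith';

# As a regex, the "." matches the "s" in "Mrs", giving a false positive.
my $by_regex = $big_string =~ /$needle/ ? 1 : 0;

# index() looks for the exact characters and finds nothing.
my $by_index = index( $big_string, $needle ) > -1 ? 1 : 0;

print "regex found it: $by_regex, index found it: $by_index\n";
```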

9. Use repeat counts to make repeated characters clearer.

Here are three regexes:

   /\d\d\d-\d\d\d-\d\d\d\d/
   /\d\d\d-\d\d-\d\d\d\d/
   /\d\d\d\d-\d\d-\d\d/

What do they match? That’s a lot of \d characters to count. But what if they were written as:

   /\d{3}-\d{3}-\d{4}/
   /\d{3}-\d{2}-\d{4}/
   /\d{4}-\d{2}-\d{2}/

Now we can more easily see that the first one matches a phone number, the second a Social Security number, and the third a date.

10. Use the `/x` modifier to allow whitespace and comments to improve readability.

Another way to make your regexes more readable is by adding whitespace. The /x modifier allows this. Our date example above:

   /\d{4}-\d{2}-\d{2}/

can also be written as:

   /\d{4} - \d{2} - \d{2}/x

providing some spacing for readability.

The /x modifier causes whitespace in the pattern to be ignored (except inside a bracketed character class or when escaped with a backslash), allowing us to make things more readable. It also lets us add comments, like so:

   /\d{4}      # Year
   -
   \d{2}       # Month
   -
   \d{2}       # Day
   /x

To the regex engine, all three regexes above are identical. The more complex your regexes get, the more whitespace and comments can help.
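
One thing to watch for under /x is a space you actually want to match. A minimal sketch, assuming a "date time" prefix like the log line format used in rule 11: the literal space is written as [ ] so that /x does not discard it.

```perl
use strict;
use warnings;

# Under /x a bare space is ignored, so a literal space is written as [ ].
my $timestamp_re = qr/
    ^ (\d{4}) - (\d{2}) - (\d{2})   # Year, month, day
    [ ]                             # A single literal space
    (\d{2}) : (\d{2})               # Hour and minute
/x;

if ( '2013-09-27 18:26 192.168.0.17 Login attempt failed' =~ $timestamp_re ) {
    print "year=$1 month=$2 day=$3 hour=$4 minute=$5\n";
}
```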

11. Build large regexes from smaller ones.

Another way to make complex regexes easier to understand is to build large regexes out of smaller pieces.

Say you’re working on analyzing a log file with date, time, and IP address on each line, and it looks like this:

   2013-09-27 18:26 192.168.0.17 Login attempt failed

Create regexes for each of the elements like so:

   my $date_re = qr/\d{4}-\d{2}-\d{2}/;
   my $time_re = qr/\d{2}:\d{2}/;
   my $byte_re = qr/\d{1,3}/;
   my $ip_re   = qr/$byte_re\.$byte_re\.$byte_re\.$byte_re/;
   if ( $line =~ /^($date_re) ($time_re) ($ip_re) (.*)/ ) {
      my ($date, $time, $ip, $message) = ($1, $2, $3, $4);
      # ... do stuff with your extracted fields.
   }

That’s far easier to read and less error-prone than the equivalent expansion of:

   /^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}) (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (.*)/

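In PHP, regexes are plain strings rather than objects, so there is no qr// constructor. Build the pieces as strings and interpolate them into the pattern:

   $date_re = '\d{4}-\d{2}-\d{2}';
   $time_re = '\d{2}:\d{2}';
   ...
   if ( preg_match( "/^($date_re) ($time_re) /", $line, $matches ) ) { ...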

12. Understand regex operator precedence.

In arithmetic operations, operator precedence matters. 4+5*3+2 does not mean 9*5. So too it is with regex operators.

Say you’re searching a block of text for the names Steve or Bob. Your regex looks like this:

   /Steve|Bob/

When you try that, you find that you get a lot of matches on Stevenson, so you use the \b word boundary metacharacter:

   /\bSteve|Bob\b/

However, you still get matches for Stevenson. Why?

In regular expressions, the `|` alternation operator has lower precedence than everything else, including the plain sequence of characters and metacharacters on either side of it, so what you’re really searching for is /\bSteve/ or /Bob\b/.

Just as in arithmetic operations, parentheses force the order of evaluation.

   /\b(Steve|Bob)\b/

This evaluates to “word boundary, followed by ‘Steve’ or ‘Bob’, followed by a word boundary.”
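
The effect of the parentheses shows up clearly when both patterns are run over a handful of sample names. This sketch uses the non-capturing (?:...) form of grouping, which is my choice here and not something the rule requires.

```perl
use strict;
use warnings;

my @names = ( 'Steve', 'Bob', 'Stevenson', 'Bobby' );

# Without grouping this means: \bSteve OR Bob\b
my @loose = grep { /\bSteve|Bob\b/ } @names;

# With grouping, both word boundaries apply to both names.
my @tight = grep { /\b(?:Steve|Bob)\b/ } @names;

print "Ungrouped: @loose\n";
print "Grouped:   @tight\n";
```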

13. Not everything has to be done in a single regex.

Regular expressions are great for matching patterns, but they’re not made for validating data, or for doing anything with the data that you’ve found.

Say that you’re scanning a big block of text looking for phone numbers. Your code might look like this:

   while ( $text =~ /(\d\d\d-\d\d\d-\d\d\d\d)/smg ) {
      my $phone_number = $1;
      # .. do something with $phone_number
   }

Everything works great until it’s decided that you shouldn’t capture toll-free phone numbers (with 800 as the area code). So you change that initial capture for three digits \d\d\d to the convoluted mess of ([0-79]\d\d|8[1-9]\d|8\d[1-9]), which works out to be:

  • A digit other than “8” followed by two more digits, OR
  • “8” followed by a non-zero digit and any digit, OR
  • “8” followed by any digit and a non-zero digit

   while ( $text =~ /(([0-79]\d\d|8[1-9]\d|8\d[1-9])-\d\d\d-\d\d\d\d)/smg ) {
      my $phone_number = $1;
      # .. do something with $phone_number
   }

That’s a lot less readable. And now what do you do if you later have to skip local 312 and 773 area codes? It’s a maintenance nightmare.

Don’t make the regex do the work of both positively finding a phone number and negatively validating it. It’s much clearer to break up the detection into two distinct steps.

   while ( $text =~ /(\d\d\d-\d\d\d-\d\d\d\d)/smg ) {
      my $phone_number = $1;
      if ( $phone_number !~ /^800/ ) { # If it doesn't start with 800
         # .. do something with $phone_number
      }
   }

Here, the first regex finds a candidate for a phone number, and the second regex looks to see if it should be excluded.

Many regex problems fall into this pattern of, “Find a match, and validate it as a separate step.” Here are some examples of problems that should probably not be done with one regex.

  • Find all the URLs in the document, except for ones pointing to example.com
  • Find all email addresses but ignore example.com addresses.
  • Find all 4-digit years, but only greater than 1970.
  • Find a date, and check that it’s a valid date.
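
Taking the year example from the list above, a two-step version might look like this sketch. The sample text is invented, and the regex only finds four-digit candidates; the comparison against 1970 is ordinary arithmetic.

```perl
use strict;
use warnings;

my $text = 'Founded in 1969, incorporated in 1984, relaunched in 2013.';

my @recent_years;
while ( $text =~ /\b(\d{4})\b/g ) {
    my $year = $1;

    # Step two: validate the candidate outside the regex.
    push @recent_years, $year if $year > 1970;
}

print "@recent_years\n";
```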

14. Don’t try to parse XML/HTML.

A common misuse of regexes among beginners is trying to parse HTML or XML with regexes. It’s just pattern matching, right? It’s not. The syntax of HTML is better left to a dedicated library or module.

Say you’ve got a file of HTML where you’re trying to extract URLs from <img> tags.

   <img src="http://example.com/whatever.png">

So you write a regex like this (in Perl):

   while ( $html =~ /<img\s+src="(.+?)"/g ) {
      my $url = $1;
      ...  # Do something with $url
   }

In this case, $url does indeed contain http://example.com/whatever.png. But what happens when you start getting HTML like this?

   <img src='http://example.com/whatever.png'>

Oops, those single quotes are valid HTML. Or maybe you get:

   <img src=http://example.com/whatever.png>

without the quotes around the URL. Or maybe it will be:

   <img border=0 src="http://example.com/whatever.png">

with an intervening attribute between img and src. Or maybe the tag will be split across lines like so:

   <img
      src="http://example.com/whatever.png">

which is perfectly valid. Or maybe you’ll just start getting false positives when you get HTML, like this:

   <!-- This URL is old, so we commented it out.
   <img src="http://example.com/outdated.png">
   -->

because your regex is unable to handle commented-out code.

Instead, use a tool that’s already been written, tested, and debugged. In Perl, you can use HTML::Parser or even WWW::Mechanize to handle many common parsing problems. In Python, you can use the built-in html.parser module for HTML or ElementTree for XML. In PHP, you can use the built-in DOM extension. See my website, htmlparsing.com, where I collect pointers on these and other ways to properly do HTML parsing in various languages.
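
As a rough sketch of the parser approach, here is one way to pull src attributes out of <img> tags with HTML::Parser. It assumes the module is installed from CPAN (it is not part of core Perl), and the sample HTML is made up from the problem cases above.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, not part of core Perl

my $html = <<'HTML';
<img src='http://example.com/single.png'>
<img border=0 src=http://example.com/bare.png>
<!-- <img src="http://example.com/outdated.png"> -->
<img
   src="http://example.com/split.png">
HTML

my @urls;
my $parser = HTML::Parser->new(
    api_version => 3,

    # Called for each start tag with its name and a hash of its attributes.
    start_h => [
        sub {
            my ( $tagname, $attr ) = @_;
            push @urls, $attr->{src}
                if $tagname eq 'img' && defined $attr->{src};
        },
        'tagname, attr',
    ],
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @urls;
```

The parser deals with quoting styles, attribute order, line breaks, and comments, so the handler only has to say which tag and attribute it cares about.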

15. Regexes are not a magic wand to wave at every problem that involves strings.

Jamie Zawinski has a famous quote:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

While this is overly cynical in its view of regexes and their utility, it does point at a tendency among programmers. As soon as they see a problem that requires manipulating strings, developers reach for the regexes. Unfortunately, sometimes having the hammer of a regex can blind you to the problems better solved with a screwdriver. Or a toothpick.

Regexes are a powerful tool to keep in your programmer’s toolbox. Use them responsibly, and at the right time, and they’ll make your programming life a joy.

Comments

  1. Overall some good pointers but, and this may be a Perl thing (though I doubt it), I was definitely confused by your incorrect use of the word “quote”.

    Quoting something means putting single or double quotes around it. i.e.,
    Not Quoted
    “Quoted”
    ‘Quoted’

    Putting a back-slash in front of something to remove its special meaning is called “escaping”.

    No Escaping Here.
    Escaping Done Here.

    Thus the header for # 4 should read:

    Know When to Escape

    Or something similar

    • That said the reference to quote in # 4 refers to “quotemeta” so maybe that would play OK but it is definitely used incorrectly in # 6.

      It’s unfortunate the name implies quoting when in fact the function should be named “escapemeta”, since all meta-characters in the input string are escaped to their literal meaning.

  2. Felix Ostmann says:

    I don’t like your example #6 …

    The - is really special and should always be escaped. It is hard to train newbies and explain that the - should be escaped in character classes if it is not the last element. Why not simply escape the - every time you don’t want it to be special, regardless of the context?

    [a-z,.!?-] works
    [a-z.,!?-_] doesn’t work as expected …

    • I disagree with your disagreement. The set of characters that are “special” inside character classes is *far* smaller than the set that are special in regex in general. For example, to match an asterisk or question mark, you can simply write m{[*?]} since neither of those is special in a character class. In fact, the *only* characters special in a char class are backslash, minus, a caret at the very start, and the closing bracket. Best to have a rule, rather than overly quote in the rest of the regex.

Trackbacks

  1. […] — Andy Lester, The Developer’s RegEx Survival Guide: 15 Rules for Making Sense of Regular Expressions […]
