
When Regular Expressions Get Greedy

I’ve been using a regular expression (via PHP’s preg_match function) to parse email addresses. The addresses follow a consistent pattern that looks like:

anamehere+XXXXXX@mail.example.com

The XXXXXX section is a random set of characters: a-z, A-Z, 0-9 and -. The regular expression’s job was to extract that random set of characters, which is easy to do:

[sourcecode lang="php"]
$email_addr = 'anamehere+XXXXXX@mail.example.com';
// The + is escaped so it matches a literal plus sign rather than acting as a quantifier.
preg_match( '|anamehere\+(.*)@.*$|', $email_addr, $match );
$email_code = $match[1];
[/sourcecode]

Since the format of the email address is consistent it was easy to pull out the section I was interested in. Well, it was easy until it broke when the email was sent in from a specific email client. Turns out this client set the value of the email address to something different:

"anamehere+XXXXXX@mail.example.com" <anamehere+XXXXXX@mail.example.com>

After getting over my frustration at this email client I looked at what the regular expression matched:

XXXXXX@mail.example.com" <anamehere+XXXXXX
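The over-long match is easy to reproduce. The snippet below is a minimal sketch that assumes the pattern escapes the plus sign (`\+`) so that it matches a literal `+`. It feeds the odd-ball address to the same expression and prints what the capture group picks up.

```php
<?php
// The address as the problem email client formatted it.
$email_addr = '"anamehere+XXXXXX@mail.example.com" <anamehere+XXXXXX@mail.example.com>';

// Default (greedy) quantifiers; \+ matches a literal plus sign.
preg_match('|anamehere\+(.*)@.*$|', $email_addr, $match);

// (.*) first runs to the end of the string, then backs up only as far
// as the LAST "@", so the capture spans both copies of the address.
echo $match[1], PHP_EOL;
// XXXXXX@mail.example.com" <anamehere+XXXXXX
```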

Well, that was no good. Instead of matching just the XXXXXX, it was slurping up other portions of the email address as well. Enter the greed factor of Perl Compatible Regular Expressions (PCRE). If you aren’t familiar with greedy regular expressions, it simply means that quantifiers such as * try to match as much as they can: here the (.*) ran all the way to the last @ in the string rather than stopping at the first one. Fortunately there is a way to turn off the greediness using the U pattern modifier. By adding a U to the end of the regular expression things got much better:

[sourcecode lang="php"]
$email_addr = 'anamehere+XXXXXX@mail.example.com';
// The trailing U makes every quantifier in the pattern non-greedy by default.
preg_match( '|anamehere\+(.*)@.*$|U', $email_addr, $match );
$email_code = $match[1];
[/sourcecode]

Now even the oddball email address yields just the XXXXXX from the regular expression.

Using the U at the end flips the default for the entire regular expression: every quantifier becomes non-greedy unless it is followed by a ?. Without the U modifier, you can turn off the greediness of a single quantifier (the * in this case) by following it with a ?. Using that technique the regular expression is anamehere\+(.*?)@.*$ which works as well.
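To compare the two approaches side by side, here is a small sketch that runs both variants against both address formats. The array names and labels are only for illustration, and the plus sign is again assumed to be escaped. The third pattern combines U with ? to show that the modifier inverts greediness rather than simply disabling it.

```php
<?php
$addresses = [
    'anamehere+XXXXXX@mail.example.com',
    '"anamehere+XXXXXX@mail.example.com" <anamehere+XXXXXX@mail.example.com>',
];

$patterns = [
    'U modifier'      => '|anamehere\+(.*)@.*$|U',   // whole pattern non-greedy
    'lazy quantifier' => '|anamehere\+(.*?)@.*$|',   // only the capture is non-greedy
    'U plus ?'        => '|anamehere\+(.*?)@.*$|U',  // under U, the ? makes (.*?) greedy again
];

$codes = [];
foreach ($patterns as $label => $pattern) {
    foreach ($addresses as $addr) {
        // preg_match returns 1 on a match, 0 on no match, false on error.
        if (preg_match($pattern, $addr, $m) === 1) {
            $codes[$label][] = $m[1];
        }
    }
}

print_r($codes);
```

The first two patterns capture XXXXXX for both addresses. The combined one captures XXXXXX for the plain address but brings back the over-long match for the quoted form, so it is best to pick one technique per pattern rather than mixing them.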

If you find that your regular expressions are matching more than you want, remember that they are greedy by default.

2 replies on “When Regular Expressions Get Greedy”

You could also have just used a more restrictive character range/class in your regex to avoid this.

Something like:

'|anamehere\+([^@]+)@.*$|'

or

'|anamehere\+([a-zA-Z0-9-]*)@.*$|'

🙂
