Softpanorama


Perl Regular Expressions Best Practices and Tips


A regular expression is a sharp tool. You can cut yourself badly if you are not careful ;-)

One needs to be very careful with regular expressions and avoid overcomplexity like the plague. In complex regular expressions, surprises are a dime a dozen for the uninitiated. Even careful testing does not guarantee that you fully understand the behavior of a regex. Complex regular expressions provide an enormous number of ways to shoot yourself in the foot!

Softpanorama Recommendations

Please pay special attention to non-greedy (lazy) quantifiers as they are simpler to use and less prone to errors.

It makes a lot of sense to debug a complex regular expression first in a special test script, feeding it sample strings and observing the output.

Be careful when using the * quantifier because it can match an empty string, which might not be your intention. The regex /b*/ will match any string - even one without any b characters.
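A minimal sketch of this pitfall, using made-up sample strings: /b*/ succeeds on every input, including the empty string, while /b+/ succeeds only when at least one b is present.

```perl
use strict;
use warnings;

# /b*/ succeeds on every string because zero "b" characters is a valid match.
my @samples = ('abba', 'xyz', '');          # hypothetical sample data
my @star = grep { /b*/ } @samples;          # all three strings
my @plus = grep { /b+/ } @samples;          # only the string that contains a "b"

print "b* matched ", scalar(@star), " of ", scalar(@samples), " strings\n";
print "b+ matched ", scalar(@plus), " of ", scalar(@samples), " strings\n";
```

If "at least one" is what you mean, + is usually the quantifier you want.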

The meta-character . and the quantifiers ?, *, and + lose their special meaning inside the square brackets that define a character class. There they are used in their literal sense.

One important feature of capture variables is that only a successful match affects them. If a match is unsuccessful, the previous values are preserved, whatever they may be. That leads to errors that are difficult to find if you are not careful. You should never use capture variables without checking whether the match was successful. What is worse, you forget about this "feature" from time to time and make the same mistake again and again, and then spend a day debugging your now semi-forgotten script when you accidentally discover that it misbehaves in certain cases. I think that this is a design blunder of Perl: it should set all capture variables to undefined in case of an unsuccessful match. Please check your scripts for usage of capture variables and verify in each case that the match is guarded by an if statement.
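The stale-value problem can be reproduced in a few lines. This is only a sketch: the strings and the helper name extract_user are hypothetical, chosen for illustration.

```perl
use strict;
use warnings;

my $first  = 'user=alice';      # hypothetical input that matches
my $second = 'no key here';     # hypothetical input that does not match

$first =~ /user=(\w+)/;         # succeeds and sets $1 to "alice"
$second =~ /user=(\w+)/;        # fails, so $1 is left untouched
my $stale = $1;                 # still "alice", a value from the wrong string

# Safer: read $1 only when the match succeeded.
sub extract_user {
    my ($line) = @_;
    return $line =~ /user=(\w+)/ ? $1 : undef;
}

# Alternative: a list-context match yields an empty list on failure,
# so the variable is simply left undefined.
my ($user) = $second =~ /user=(\w+)/;

print "stale value: $stale\n";
print "guarded value: ", (defined extract_user($second) ? 'defined' : 'undef'), "\n";
```

The list-assignment form avoids the numbered variables altogether, which is often the least error-prone choice.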

In scalar context the substitution operator returns the number of substitutions made, or a false value if none were made.
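For illustration, a short sketch with a made-up string shows the return value being used as a counter and as a success test:

```perl
use strict;
use warnings;

my $text = 'a-b-c-d';                   # hypothetical sample string

my $count = ($text =~ s/-/+/g);         # every dash replaced: returns 3
my $none  = ($text =~ s/x/y/g);         # nothing to replace: returns a false value

print "replaced $count dashes, result: $text\n";
print "second substitution changed nothing\n" unless $none;
```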

There are many ways of matching any value.

If the first method you try doesn't work, try breaking the value into smaller components and match each boundary.

If all else fails, you can always ask for help on the comp.lang.perl.misc newsgroup.

The $. special variable keeps track of the record number. Every time the diamond operators read a line, this variable is incremented.
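A small sketch of $. combined with a match. It reads from an in-memory filehandle so that it is self-contained; the data and the pattern are invented for the example.

```perl
use strict;
use warnings;

my $data = "alpha\nbeta\ngamma\n";      # hypothetical input, one record per line
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @hits;
while (my $line = <$fh>) {
    chomp $line;
    # $. holds the number of the record that was just read from $fh.
    push @hits, sprintf('%d: %s', $., $line) if $line =~ /^[bg]/;
}
close $fh;

print "$_\n" for @hits;
```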

If you need to save the value of the matched strings stored in the regex memory, make sure to assign them to other variables. Regex memory is local to the enclosing block and lasts only until the next successful match.

Note: The replacement part of a substitution is treated as a double-quoted string, not as a pattern. Variables are interpolated in it, but regex meta-characters have no special meaning there.

Capture variables are not reset when a match fails; they keep the values from the last successful match.

Your matching code should guard against reading capture variables that were never assigned because the match failed.

The regex Component Order of Precedence
Precedence Level Component
1 Parentheses
2 Quantifiers
3 Sequences and Anchors
4 Alternation
 
You can use parentheses to affect the order in which components are evaluated because they have the highest precedence. If you need parentheses only for grouping, use the extended non-capturing syntax (?:...); otherwise you will also be affecting the regex memory.

Two general observations:

To be successful you need to adopt some elements of style that make mistakes less probable:
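The effect of precedence and of grouping on the regex memory can be seen in a short sketch. The log line and the patterns are hypothetical examples, not taken from the tables.

```perl
use strict;
use warnings;

# Quantifiers bind tighter than sequences: in /ab+/ the + applies to "b" only.
my $seq_only = 'abab' =~ /^ab+$/     ? 1 : 0;   # 0: means "a" followed by one or more "b"
my $grouped  = 'abab' =~ /^(?:ab)+$/ ? 1 : 0;   # 1: the group "ab" is repeated

# Capturing parentheses used only for grouping shift the capture numbers.
my $log = 'ERROR: disk full';                            # hypothetical log line
my @capturing     = $log =~ /^(ERROR|WARN): (.*)$/;      # ('ERROR', 'disk full')
my @non_capturing = $log =~ /^(?:ERROR|WARN): (.*)$/;    # ('disk full')

print "capturing groups returned ", scalar(@capturing), " values\n";
print "non-capturing grouping returned ", scalar(@non_capturing), " value\n";
```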

Below are some reference tables from Medinets' book:

 Regular Expression Meta-Characters, Meta-Brackets, and Meta-Sequences

Meta-Character  Description

^ This meta-character - the caret - will match the beginning of a string or if the /m option is used, matches the beginning of a line. It is one of two pattern anchors - the other anchor is the $.
. This meta-character will match any character except for the newline unless the /s option is specified. If the /s option is specified, then the newline will also be matched.
$ This meta-character will match the end of a string or if the /m option is used, matches the end of a line. It is one of two pattern anchors - the other anchor is the ^.
| This meta-character - called alternation - lets you specify two values that can cause the match to succeed. For instance, m/a|b/ means that the $_ variable must contain the "a" or "b" character for the match to succeed.
* This meta-character indicates that the "thing" immediately to the left should be matched 0 or more times in order to be evaluated as true.
+ This meta-character indicates that the "thing" immediately to the left should be matched 1 or more times in order to be evaluated as true.
? This meta-character indicates that the "thing" immediately to the left should be matched 0 or 1 times in order to be evaluated as true. When used immediately after the *, +, ?, or {n, m} quantifiers, it means that the quantifier should be non-greedy and match the smallest possible string.

 
Meta-Brackets  Description
() The parentheses let you affect the order of pattern evaluation and act as a form of pattern memory.
(?...) If a question mark immediately follows the left parentheses, it indicates that an extended mode component is being specified.
{n, m} The curly braces let you specify how many times the "thing" immediately to the left should be matched. {n} means that it should be matched exactly n times. {n,} means it must be matched at least n times. {n, m} means that it must be matched at least n times and not more than m times.
[] The square brackets let you create a character class. For instance, m/[abc]/ will evaluate to true if any of "a", "b", or "c" is contained in $_. For single characters, the square brackets are a more readable alternative to the alternation meta-character.

 
Meta-Sequences  Description
\ This meta-character "escapes" the following character. This means that any special meaning normally attached to that character is ignored. For instance, if you need to include a dollar sign in a pattern, you must use \$ to avoid Perl's variable interpolation. Use \\ to specify the backslash character in your pattern.
\0nnn Any Octal byte.
\A This meta-sequence represents the beginning of the string. Its meaning is not affected by the /m option.
\b This meta-sequence represents the backspace character inside a character class; otherwise, it represents a word boundary. A word boundary is the spot between word (\w) and non-word(\W) characters. Perl thinks that the \W meta-sequence matches the imaginary characters off the ends of the string.
\B Match a non-word boundary.
\cn Any control character.
\d Match a single digit character.
\D Match a single non-digit character.
\e Escape.
\E Terminate the \L or \U sequence.
\f Form Feed.
\G Match only where the previous m//g left off.
\l Change the next character to lowercase.
\L Change the following characters to lowercase until a \E sequence is encountered.
\n Newline.
\Q Quote Regular Expression meta-characters literally until the \E sequence is encountered.
\r Carriage Return.
\s Match a single whitespace character.
\S Match a single non-whitespace character.
\t Tab.
\u Change the next character to uppercase.
\U Change the following characters to uppercase until a \E sequence is encountered.
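\v Match a single vertical whitespace character (Perl 5.10 and later). Perl does not use \v as a vertical tab escape.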
\w Match a single word character. Word characters are the alphanumeric and underscore characters.
\W Match a single non-word character.
\xnn Any Hexadecimal byte.
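\Z This meta-sequence represents the end of the string, or the position just before a newline at the very end of the string. Its meaning is not affected by the /m option. Use \z to match only at the very end of the string.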
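A few of the meta-sequences above are easier to grasp by running them. The sketch below uses invented strings and also shows \z, a close relative of \Z that matches only at the very end of the string.

```perl
use strict;
use warnings;

# \Q...\E quotes meta-characters in an interpolated variable.
my $literal = 'a.b';                                      # hypothetical search text
my $unquoted_hit = 'axb' =~ /^$literal$/     ? 1 : 0;     # 1: "." matches any character
my $quoted_hit   = 'axb' =~ /^\Q$literal\E$/ ? 1 : 0;     # 0: "." must be a literal dot

# \b matches only at the edge of a word.
my $whole_word = 'the cat sat' =~ /\bcat\b/ ? 1 : 0;      # 1
my $inside     = 'concatenate' =~ /\bcat\b/ ? 1 : 0;      # 0

# \Z tolerates one trailing newline; \z does not.
my $big_z   = "end\n" =~ /end\Z/ ? 1 : 0;                 # 1
my $small_z = "end\n" =~ /end\z/ ? 1 : 0;                 # 0

print "unquoted=$unquoted_hit quoted=$quoted_hit\n";
print "whole_word=$whole_word inside=$inside\n";
print "Z=$big_z z=$small_z\n";
```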
 

Regular expressions form almost a 'language within a language' in Perl. As you can see above, they can be fairly involved, and (let's face it) if you are not familiar with them now, you are not going to learn them without practice. Therefore, we suggest the following path for learning regular expressions.

More about Modifiers

The matching operator has several options. The most useful options are probably the capability to ignore case (option i) and to iterate through all matches in a string (option g).

Options for the Matching Operator

Option Description
g This option finds all occurrences of the pattern in the string. You can iterate over the matches using a loop statement or put the results into an array.
i This option ignores the case of characters in the string.
m This option treats the string as multiple lines: ^ and $ match at the beginning and end of every line within the string, not only at the beginning and end of the whole string.
o This option compiles the pattern only once. You can achieve some small performance gains with this option. It should be used with variable interpolation only when the value of the variable will not change during the lifetime of the program.
s This option treats the string as a single line: the . meta-character also matches the newline character.
x This option lets you use extended regular expressions. Basically, this means that Perl will ignore whitespace that's not escaped with a backslash or within a character class, and will treat # as the start of a comment. I highly recommend this option so you can use spaces and comments to make your regular expressions more readable.
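The options can be combined. The following sketch, with invented sample text, shows g, i, m and s at work:

```perl
use strict;
use warnings;

my $text = "Name: Alice\nname: Bob\n";                  # hypothetical two-line input

# g + i + m: all matches, ignoring case, with ^ and $ applied per line.
my @names = $text =~ /^name:\s*(\w+)$/gim;              # ('Alice', 'Bob')

# Without m, ^ and $ refer to the whole string, so nothing matches here.
my @without_m = $text =~ /^name:\s*(\w+)$/gi;           # ()

# s: lets the dot match a newline.
my $dot_plain = "a\nb" =~ /a.b/  ? 1 : 0;               # 0
my $dot_s     = "a\nb" =~ /a.b/s ? 1 : 0;               # 1

print "names found: @names\n";
print "matches without m: ", scalar(@without_m), "\n";
print "dot without s: $dot_plain, dot with s: $dot_s\n";
```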


The material below is adapted from the O'Reilly Network article "Five Habits for Successful Regular Expressions".

Use extended syntax for complex patterns

Consider the following regular expression to match a U.S. phone number:

 \(?\d{3}\)?\s?\d{3}[-.]\d{4} 

This regex matches phone numbers like "(973)555-4000". Ask yourself if the regex would match "973-555-4000" or "555-4000". The answer is no in both cases. Writing this pattern on one line conceals both flaws and design decisions: the area code cannot be omitted, and the form 973-555-4000 is not accepted.

Spreading the pattern out over several lines makes the flaws more visible and the necessary modifications easier. In Perl, using extended syntax (option x), we can rewrite this expression to accept the form 973-555-4000 as follows:

 /  
    \(?     # optional parentheses
      \d{3} # area code required
    \)?     # optional parentheses
    [-\s.]? # separator is either a dash, a space, or a period.
      \d{3} # 3-digit prefix
    [-.]    # another separator
      \d{4} # 4-digit line number
/x 

The rewritten regex now has an optional separator after the area code so that it matches phone 973-555-4000 as well as (973)555-4000. The area code is still required.

However, a new programmer who wants to make the area code optional can quickly see that it is not optional now, and might change the code to use a separate regex for each of the three major cases instead of merging them like we did. So much for our optimization ;-). Readable code helps immensely, but it is in no way a substitute for good design.

Write Tests for each and every complex regex

There are three levels of testing, each adding a higher level of reliability to your code. First, you need to think hard about what you want to match and whether you can deal with false matches. Second, you need to test the regex on example data. Third, you need to formalize the tests into a test suite.

Deciding what to match is a trade-off between making false matches and missing valid matches. If your regex is too strict, it will miss valid matches. If it is too loose, it will generate false matches. Once the regex is released into live code, you probably will not notice either way. Consider the phone regex example above; it would match part of the arithmetic expression "800-555-4000 = -3755". False matches are hard to catch, so it's important to plan ahead and test.

Sticking with the phone number example, if you are validating a phone number on a web form, you may settle for ten digits in any format. However, if you are trying to extract phone numbers from a large amount of text, you might want to be more strict to avoid collecting bogus numbers due to false matches.

When thinking about what you want to match, write down example cases. Then write some code that tests your regular expression against the example cases. Any complicated regular expression is best written in a small test program, as the Perl example below demonstrates:

#!/usr/bin/perl
use strict;
use warnings;

# Sample inputs: some are expected to match, some are not.
my @tests = ( "314-555-4000",
              "800-555-4400",
              "(314)555-4000",
              "314.555.4000",
              "555-4000",
              "aasdklfjklas",
              "1234-123-12345"
            );
foreach my $test (@tests) {
    if ( $test =~ m/
       \(?     # optional parentheses
       \d{3}   # area code required
       \)?     # optional parentheses
       [-\s.]? # separator is either a dash, a space, or a period.
       \d{3}   # 3-digit prefix
       [-\s.]  # another separator
       \d{4}   # 4-digit line number
       /x ) {
        print "Matched on $test\n";
    } else {
        print "Failed match on $test\n";
    } # if
} # foreach

Running the test script exposes yet another problem in the phone number regex: it matched "1234-123-12345". That demonstrates the key principle of test selection: include both tests that you expect to fail (they may succeed) and tests that you expect to match (they may not).

Ideally, you preserve this test as part of the test suite for your entire program. Even if you do not have a test suite already, your regular expression tests are a good foundation for a suite, and now is the perfect opportunity to start on one. Even if now is not the right time (really, it is!), you should make a habit of running your regex tests after every modification. A little extra time here could save you many headaches.
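One possible way to formalize such tests is the core Test::More module. The sketch below is an assumption about how a suite might look, not part of the original article. It also anchors the pattern with ^ and $, on the assumption that a whole string is being validated, which is what rejects "1234-123-12345". The variable names are hypothetical.

```perl
use strict;
use warnings;
use Test::More;

# Anchored variant of the phone regex (the anchors are an added assumption).
my $phone_re = qr/
    ^
    \(? \d{3} \)?   # area code, optionally in parentheses
    [-\s.]?         # optional separator
    \d{3}           # 3-digit prefix
    [-\s.]          # another separator
    \d{4}           # 4-digit line number
    $
/x;

# Expected outcome for every sample: 1 = should match, 0 = should not.
my %expected = (
    '314-555-4000'   => 1,
    '(314)555-4000'  => 1,
    '314.555.4000'   => 1,
    '555-4000'       => 0,
    'aasdklfjklas'   => 0,
    '1234-123-12345' => 0,
);

for my $input (sort keys %expected) {
    my $got = $input =~ $phone_re ? 1 : 0;
    is($got, $expected{$input}, "phone regex on '$input'");
}

done_testing();
```

Each new bug report then becomes one more line in %expected, so the regex is rechecked against every old case after each modification.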

Group the Alternation Operator

The alternation operator (|) has a low precedence. This means that it often alternates over more than the programmer intended. For example, a regex to extract email addresses out of a mail file might look like:

^CC:|To:(.*)

The above attempt is incorrect, but the bugs often go unnoticed. The intent of the above regex is to find lines starting with "CC:" or "To:" and then capture any email addresses on the rest of the line.

Unfortunately, the regex doesn't actually capture anything from lines starting with "CC:" and may capture random text if "To:" appears in the middle of a line. In plain English, the regular expression matches lines beginning with "CC:" and captures nothing, or matches any line containing the text "To:" and then captures the rest of the line. Usually, it will capture plenty of addresses and nobody will notice the failings.

If that were the real intent, you should add parentheses to say it explicitly, like this:

(^CC:)|(To:(.*))

However, the real intent of the regex is to match lines starting with "CC:" or "To:" and then capture the rest of the line. The following regex does that:

^(CC:|To:)(.*)

This is a common and hard-to-catch bug. If you develop the habit of wrapping your alternations in parentheses (or non-capturing parentheses -- (?:)) you can avoid this error.
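The difference between the ungrouped and the grouped alternation is easy to demonstrate. In this sketch the header lines and the helper name describe are made up, and the fixed pattern uses the non-capturing form (?:...) so that the rest of the line ends up in $1.

```perl
use strict;
use warnings;

my $buggy = qr/^CC:|To:(.*)/;           # alternation splits the whole pattern in two
my $fixed = qr/^(?:CC|To):(.*)/;        # alternation limited to the header name

# Report what a pattern does with a line (hypothetical helper).
sub describe {
    my ($re, $line) = @_;
    return 'no match' unless $line =~ $re;
    return defined $1 ? "captured '$1'" : 'matched, captured nothing';
}

for my $line ('CC: bob@example.com', 'To: amy@example.com', 'Reply To: nobody') {
    printf "%-22s buggy: %-28s fixed: %s\n",
        $line, describe($buggy, $line), describe($fixed, $line);
}
```

The buggy pattern captures nothing from the "CC:" line and happily captures from "Reply To:", exactly the two failings described above.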

Use Lazy Quantifiers

Most people avoid using the lazy quantifiers *?, +?, and ??, even though they are easy to understand and make many regular expressions easier to write.

Lazy quantifiers match as little text as possible while still allowing the overall match to succeed. If you write foo(.*?)bar, the quantifier will stop matching the first time it sees "bar", not the last time. This may be important if you are trying to capture ### in the text foo###bar+++bar. A regular (greedy) quantifier would have captured ###bar+++.
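The same text can be fed to both forms of the quantifier to see the difference:

```perl
use strict;
use warnings;

my $text = 'foo###bar+++bar';

my ($lazy)   = $text =~ /foo(.*?)bar/;   # stops at the first "bar"
my ($greedy) = $text =~ /foo(.*)bar/;    # runs on to the last "bar"

print "lazy:   $lazy\n";
print "greedy: $greedy\n";
```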

Let's say you want to capture all of the phone numbers from an HTML file. You could use the phone number regular expression example we discussed earlier in this article. However, if you know that the file contains all of the phone numbers in the first column of a table, you can write a much simpler regex using lazy quantifiers:

<tr><td>(.+?)</td>

Many beginning regular expression programmers avoid lazy quantifiers and use negated character classes instead. They write the above code as:

<tr><td>([^<]+)</td>

That works in this case, but leads to trouble if the text you are trying to capture contains common characters from your delimiter (in this case, </td>). If you use lazy quantifiers, you will spend less time kludging character classes and produce clearer regular expressions.
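A sketch of where the two approaches diverge, using invented table rows and a negated class that excludes the < character: as soon as the cell itself contains markup, the negated class fails while the lazy quantifier still finds the closing delimiter.

```perl
use strict;
use warnings;

my $plain  = '<tr><td>555-4000</td><td>Bob</td></tr>';            # hypothetical row
my $markup = '<tr><td>555-<b>4000</b></td><td>Alice</td></tr>';   # cell contains a tag

my ($lazy_plain)     = $plain  =~ m{<tr><td>(.+?)</td>};
my ($negated_plain)  = $plain  =~ m{<tr><td>([^<]+)</td>};
my ($lazy_markup)    = $markup =~ m{<tr><td>(.+?)</td>};
my ($negated_markup) = $markup =~ m{<tr><td>([^<]+)</td>};        # no match: undef

print "plain row:  lazy='$lazy_plain' negated='$negated_plain'\n";
print "markup row: lazy='$lazy_markup' negated=",
      (defined $negated_markup ? "'$negated_markup'" : 'no match'), "\n";
```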

Lazy quantifiers are most valuable when you know the structure surrounding the text you want to capture.

Use Alternative Delimiters

Perl allows you to use any character that is neither alphanumeric nor whitespace as a delimiter. If you switch to a new delimiter, you can avoid having to escape the forward slashes when you are trying to match URLs or HTML tags such as "http://" or "<br />".

For example:

/http:\/\/(\S*)/

could be rewritten as:

m#http://(\S*)#

Common delimiters are #, !, |. With any delimiter other than the forward slash, the leading m of the match operator is required. If you use square brackets, angle brackets, or curly braces, the opening and closing brackets must match. Here are some common uses of delimiters:

m#…# m!…! m{…}
s|…|…| s[…][…] s<…>/…/
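The following sketch runs the same URL match with three delimiter styles and adds one substitution with braces. The sample text is made up.

```perl
use strict;
use warnings;

my $text = 'see http://example.com/docs for details';    # hypothetical input

my ($with_slashes) = $text =~ /http:\/\/(\S*)/;          # slashes must be escaped
my ($with_hash)    = $text =~ m#http://(\S*)#;           # no escaping needed
my ($with_braces)  = $text =~ m{http://(\S*)};           # paired delimiters

# Substitution with paired delimiters, applied to a copy of the text.
(my $secure = $text) =~ s{http://}{https://};

print "captured: $with_slashes | $with_hash | $with_braces\n";
print "rewritten: $secure\n";
```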



Copyright © 1996-2018 by Dr. Nikolai Bezroukov. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) in the author free time and without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


Last modified: March 12, 2019