Text processing using regex


Along with bash we learned at least two additional languages. One is Perl and the second one is regular expressions.

Regular expressions are a non-procedural (declarative) language: you do not specify the sequence of actions, but define the pattern that you want to search for and let the regular expression interpreter construct the program that searches for this pattern and execute it. The language is limited to searching text strings. Regular expressions originated in 1956, when the mathematician Stephen Cole Kleene described them while studying a type of abstract automata.

The regular expression interpreter (engine) matches the regular expression pattern against a given text string. In the process of matching it can store in memory some of the elements that were matched, and you can retrieve them for further use. In other words, a regular expression can be used to parse a text string.

Regular expressions were first used in practice by Ken Thompson, one of the creators of Unix, starting in 1968. He implemented regular expression matching in his editor QED by just-in-time (JIT) compilation to IBM 7094 code on the Compatible Time-Sharing System (CTSS), which was an early example of JIT compilation.

He later added this capability to the Unix editor ed, which eventually led to the classic Unix tool grep. The name "grep" is derived from the ed command for regular expression searching, g/re/p, meaning "globally search for a regular expression and print matching lines" (Regular expression - Wikipedia). Another group of researchers, which included Mike Lesk and Eric Schmidt, implemented Lex, a generator of lexical analyzers based on regular expressions, in 1975. After that regular expressions entered the mainstream.

The biggest problem with using regular expressions in Unix is that there isn't just one set of them. There are three different types of regular expressions and three major regular expression engines used in Unix:

The POSIX Basic Regular Expression (BRE) engine. Used by default in grep, sed and ed. This is the default regular expression flavor of the classic Unix/Linux command line tools.
The POSIX Extended Regular Expression (ERE) engine. This engine stems from the one implemented in AWK and is a slight generalization of it. GNU awk can be viewed as the reference implementation. The =~ operator in bash also uses ERE.
Perl compatible regular expressions (PCRE). This flavor originated in Perl in 1987. Starting in 1997, Philip Hazel developed PCRE (Perl Compatible Regular Expressions), a library that closely mimics Perl's regular expression functionality. Perl-style regular expressions are now the de facto standard, used by many modern tools and languages (and some not so modern, such as GNU grep). The list includes Python, PHP, Java, Apache HTTP Server and several databases (MySQL and PostgreSQL). See Perl Compatible Regular Expressions - Wikipedia

This fragmentation is a major shortcoming connected with the fact that UNIX is around 45 years old (released in 1973). In comparison, Perl, which implements the most advanced regular expression engine, will be 30 years old in December 2017. Some tools, like GNU grep, can use different types of regular expression engines. For example, with option -E grep will use ERE and with option -P (recommended) PCRE.

We will study mostly Perl regular expressions because they are now the de facto standard in the world of regular expressions. Generally you should prefer them to the POSIX flavors if you can. You can't avoid the POSIX flavors completely, because they are what bash and the classic command line tools implement.

Recent developments in Bash


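Bash 3.0 introduced the =~ operator for the [[ conditional command, which matches a string against a POSIX extended regular expression (bash 3.2 changed the quoting rules: a quoted pattern is matched as a literal string). It can replace calls to external tools in many cases and is definitely preferable in new scripts that you might write. Note that this is not the full Perl-style syntax: shortcuts such as \d and non-greedy quantifiers are not available. Perl compatible regular expressions are now the de facto standard and are used in Perl, Python, Java, JavaScript and other modern languages, so good knowledge of them is necessary for any system administrator.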

Let's say we need to establish whether the variable $ip appears to be a valid IP address:

for ip in "255.255.255.255" "10.10.10.10" "400.0.0.0" ; do
   echo "=== testing $ip ==="
   # rough check only: four groups of 1-3 digits, each starting with 0, 1 or 2
   if [[ $ip =~ ^[0-2][0-9]{0,2}\.[0-2][0-9]{0,2}\.[0-2][0-9]{0,2}\.[0-2][0-9]{0,2}$ ]] ; then
       echo "$ip looks like a valid IP"
   else
       echo "$ip is not a valid IP"
   fi
done

Running this fragment will produce:

=== testing 255.255.255.255 ===
255.255.255.255 looks like a valid IP
=== testing 10.10.10.10 ===
10.10.10.10 looks like a valid IP
=== testing 400.0.0.0 ===
400.0.0.0 is not a valid IP

In bash-3.1, a string append operator (+=) was added:

PATH+=":$HOME/bin"
echo "$PATH"

Starting with bash 4.2, negative length specifications in the ${var:offset:length} expansion, previously errors, are treated as offsets from the end of the variable.

Bash 4.2 also extended the printf builtin, which now has a new %(fmt)T specifier that allows time values to use strftime-like formatting.

Since bash 4.1, printf -v can assign values to array indices.

The read builtin has a new -N nchars option (bash 4.1), which reads exactly nchars characters, ignoring delimiters like newline.

The < and > operators of the [[ conditional command now do string comparison according to the current locale if the compatibility level is greater than 40.

Still, if you need to do a lot of pattern matching, it is preferable to use Perl: it is more suitable for pattern matching, has a good debugger and is easy to learn for any sysadmin who knows bash.

Perl compatible regular expression

As was mentioned before, regular expressions are a language inside the language. Perl compatible regular expressions should be viewed as a separate language that has no direct connection to Perl. It is used with many other languages (Python, PHP, Java) in almost the same form as in Perl, just with different syntactic sugar.

Still, Perl was the first language to introduce "close binding" of regex and the language per se, a feature that was later more or less successfully copied by Python, Tcl and other languages. The level of integration of the regular expression language into the main language is also higher in Perl than in any alternative scripting language.

Perl is also close to bash (and Unix shells in general) and as such is a good alternative to AWK, now that the additional overhead of the Perl interpreter is negligible on modern servers. Generally, sysadmins who know bash scripting well can be productive in Perl after a couple of days of training. Here are the major similarities:

Lexical structure is close. In both languages comments start with # and there are two types of string literals: single quoted and double quoted.
Variable names are prefixed ($ for a scalar variable; in shell, $ means dereferencing).
Double quoted strings are "interpolated": the variables contained in them are expanded.
Similar "duality" in comparison: different sets of operators are used to compare strings and numbers. Using a particular set converts the operands to the target type (string or number).
Unsuccessful conversion to a number results in the value zero, not in an error.
You can execute arbitrary Unix commands or pipes using backquotes.
Similar syntax for defining functions (sub name in Perl). As in shell, subroutines and functions can be passed an arbitrary number of arguments, which are untyped.

There are also some differences. Among them:

In Perl you use $ also on the left side of the assignment. Perl has two additional prefixes (@ and %) for arrays and hashes respectively.
Perl uses C-style syntax for conditionals and loops, with curly braces denoting a block.
Unlike shell, arguments to functions are passed by reference, not by value.
Functions can be declared anywhere in the program, not only before the first invocation as in shell.
Unlike shell, Perl behaves more like a regular language, without shell-style quirks such as textual macro substitution of variables before syntax analysis.
Perl has an excellent built-in debugger
Perl has an excellent ecosystem of reusable modules (CPAN)

Here is a script that replaces each line containing the string passed as the first argument with the replacement string supplied as the second argument, in the file supplied as the third argument. It is coded in both languages (such a script might be useful for making small changes in configuration files):

#!/bin/bash
# find lines matching the search string and replace them
# with the replace string; the result is written to standard output
#

search_string=$1
replace_string=$2
file=$3
# IFS= and -r keep leading blanks and backslashes of each line intact
while IFS= read -r line ; do
     if [[ $line =~ $search_string ]] ; then
        printf '%s\n' "$replace_string"
     else
        printf '%s\n' "$line"
     fi
done < "$file"

#!/usr/bin/perl
# find lines matching the search string and replace them
# with the replace string; the result is written to standard output
#
use strict;
use warnings;

   my ($search_string, $replace_string, $file) = @ARGV;
   # three-argument open with an explicit error check
   open(my $sysin, '<', $file) or die "Cannot open $file: $!\n";
   while ( my $line = <$sysin> ) {
      if ( $line =~ /$search_string/ ) {
         print "$replace_string\n";
      } else {
         print $line;
      }
   }
   close($sysin);

So out of respect for this innovation (and due to the fact that Perl will be 30 years old in December 2017) we will study Perl compatible regular expressions in the context of Perl. In bash, Perl is often used in so-called "one-liners": small scripts that fit into one line but perform useful functions that are difficult to implement in bash. This is an example of "dual language programming", where the high level implementation is done in bash and lower level functions are programmed in another language (in this case Perl). Here are some examples of one-liners (you can construct your own using them as templates; for more see Perl one-liners or search for this phrase in Google):

Replace a pattern pattern1 with another (pattern2) globally inside the file and create a backup
```
perl -i.bak -pe 's/pattern1/pattern2/g' filename
```
Delete a particular line or set of lines (like grep -v) in the file (poor man's editor):
```
perl -i.bak -ne 'next if ($_ =~/pattern_for_deletion/); print;' filename
```

Select only the lines between pattern1 and pattern2 (the file is rewritten in place and all other lines are dropped; the original is kept in filename.bak)

perl -i.bak -ne 'print if /pattern1/ .. /pattern2/' filename

for example

perl -i.bak -ne 'print if /^Oct15,2017$/ .. /^Oct30,2017$/' /home/joeuser/2do.lst

The Perl regular expression engine evolves gradually. The latest significant changes, introduced in version 5.10, made it more powerful and less prone to errors. This version of Perl is the minimal version recommended for any serious text parsing work.

As regular expressions (regex for short) are a new language, using the famous "Hello world" program as the first program seems appropriate. As a remnant of the shell/AWK legacy, a regular expression is lexically a special type of literal (similar to a double quoted literal).

It is usually (but not necessarily) enclosed in slashes. In the matching operation the source string (where matching occurs) is specified on the left side of the special =~ operator (the matching operator), while the regex is on the right side.

The simplest case is to search for a substring in a string, as the built-in function index does. The following expression is true if the string Hello appears anywhere in the variable $sentence.

$sentence = "Hello world";
if ($sentence =~ /Hello/) { 
   print "Matched\n"
} else {
   print "Not matched\n"
}

The regular expressions are case sensitive, so if we assign to $sentence the same string but in lower case

$sentence = "hello world";

then the above match will fail.

The operator !~ can be used for a non-match. For example, the expression

$sentence !~ /Hello/

is true if the string Hello does not appear in $sentence.

Alternatively, you can use m or qr with other delimiters instead of slashes. That's very important if your regex contains a lot of slashes:

$url !~ qr(/cygdrive/f/public_html)

The variable $_ is the default operand for regular expressions. But in most cases the string against which to perform the match or substitution should be specified explicitly with the operator =~ or its negation !~. For example:

$my_string = "The graph has many leaves";
$result = $my_string;   # work on a copy so the original stays intact
if ( $my_string =~ m/graph/ ) {
   print("The source string contains the word 'graph'.\n");
   $result =~ s/graph/tree/;
   print "Replaced with 'tree'\n";
}
print("Initial string: '$my_string'.\nThe result is '$result'\n");

In this example the regular expression operators are applied to the $my_string and $result variables instead of $_.

Two types of regex

There are two main uses for regular expressions in Perl:

matching: We already saw this form in the examples above. The expression /regex/ or m{regex} (with m you can use so-called alternative delimiters such as {}, () or something else) indicates that the regular expression inside the delimiters (whatever they are) will be matched against the scalar on the left hand side of =~ or !~.
If there is no string on the left side, matching is performed against the content of the default scalar variable $_. For example
```
/Hello/
```
will search for Hello in $_.
substitution: the form s/regex/substitute_text/ indicates that the text matched by the regular expression is going to be replaced by the string substitute_text. As syntactic sugar you can leave out m in a match that uses slashes, but the s of a substitution is mandatory. You can also use alternative delimiters with s, as with m, for example s{regex}{substitute_text}. As in the case of simple matching, by default the substitution applies to the special variable $_.
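
The two forms can be tried side by side. The following is a minimal sketch (the sample path and the variable names are made up for illustration) of alternative delimiters and of the values that m and s return:

```perl
use strict;
use warnings;

# Sample data invented for this illustration.
my $path = '/cygdrive/f/public_html/index.html';

# m{...}: no need to escape the slashes inside the pattern.
my $is_public = ($path =~ m{^/cygdrive/f/public_html/}) ? 1 : 0;

# s{...}{...}g changes the copy in place and returns the number
# of replacements that were made.
my $dotted   = $path;
my $replaced = ($dotted =~ s{/}{.}g);

print "is_public=$is_public replaced=$replaced result=$dotted\n";
```

The substitution is done on a copy, so $path itself stays unchanged.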

Regular expressions in Perl operate only against strings. No arrays on left hand side of matching statement please.


Success and Failure of Matching

We can capture the success or failure of the match (but not the number of matches) in a scalar variable. This way we have a way to determine the success or failure of the matching and substitution, respectively:

@test_array=("The graph has many leaves",
             "Fallen leaves, so many leaves on the ground.");
foreach $test (@test_array) {
   $match = ($test =~ m/leaves/);
   print("Result of match of word 'leaves' in string '$test' is $match\n");
}

This program displays the following:

Result of match of word 'leaves' in string 'The graph has many leaves' is 1
Result of match of word 'leaves' in string 'Fallen leaves, so many leaves on the ground.' is 1

The other useful feature of this example is that it shows you how to obtain the return values of the regular expression operators. If a subsequent action depends on the value of changed variables, you should always check whether the expression succeeded or failed, because way too often regular expressions behave differently than their creators expect.

In scalar context the match operation returns a true value (1) if the match succeeded and a false value (the empty string) if it failed; it does not return the number of matches.
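
If the number of occurrences is what you need, a common idiom is to force list context together with the g modifier. A minimal sketch (count_matches is a hypothetical helper name, not a built-in):

```perl
use strict;
use warnings;

# Hypothetical helper: count non-overlapping occurrences of a pattern.
sub count_matches {
    my ($text, $re) = @_;
    # In list context with /g the match returns all matches; assigning
    # them to an empty list and then to a scalar yields their number.
    my $count = () = $text =~ /$re/g;
    return $count;
}

my $test = "Fallen leaves, so many leaves on the ground.";
my $ok   = ($test =~ /leaves/) ? 1 : 0;   # success flag only
print "matched: $ok, occurrences: ", count_matches($test, qr/leaves/), "\n";
```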

We can use a conditional to check whether the match was successful or not:

$sentence = "Disneyworld in Orlando";
if ($sentence =~ /world/){
   print "there is a substring 'world' somewhere in the sentence: $sentence\n";
}

Sometimes it's easier to test the special variable $_, especially if you need to test each input string in the input loop. In this case you can write something like:

while (<>) { # get "Hello world" from the input stream
   if (/world/) {
      print "There is a word 'world' in the sentence '$_'\n";
   }
}

As we already have seen the $_ variable is the default for many Perl built-in functions (tr, split, etc).

Regular Expressions Metacharacters

The problem with regex metacharacters is that there are plenty of them. They provide a lot of power for the sophisticated user, and at the same time make regular expressions appear very complicated, at least at the very beginning.

It's best to build up your skills slowly: creation of a complex regex can be considered a kind of art form (like solving a puzzle or a chess problem). Please pay special attention to non-greedy (lazy) quantifiers as they are simpler to use and less prone to errors.

It makes a lot of sense to first debug a complex regular expression in a special test script, feeding it with sample strings and observing the output.


There are three types of metacharacters:

Regular metacharacters. Each of them represents a class of symbols. When they are matched, they consume some characters from the string.
Anchors. Those signify a special position in the string, but matching them does not consume any characters.
Quantifiers. Those specify how many times the preceding character or group may be repeated.

As they are used as metacharacters, the characters $, |, [, ], {, }, (, ), \, ^, / and several others (such as ., *, + and ?) should be preceded by a backslash when you want to match them literally in a regular expression.

For example:

$ip_addr=~/\d+\.\d+\.\d+\.\d+/; # dot character should be escaped
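
When the text to be matched literally comes from a variable, escaping by hand is error prone. A minimal sketch, assuming made-up sample data, of how \Q...\E (the inline form of the built-in quotemeta) escapes the metacharacters for you:

```perl
use strict;
use warnings;

# Sample data invented for this illustration.
my $literal = '10.0.0.1';
my $line    = 'host 10x0y0z1 is not 10.0.0.1';

# Unescaped: every dot matches any character, so "10x0y0z1" matches first.
my ($loose) = $line =~ /($literal)/;

# \Q...\E escapes the metacharacters in the interpolated value.
my ($exact) = $line =~ /(\Q$literal\E)/;

print "loose: $loose\n";
print "exact: $exact\n";
```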

Regular metacharacters

Regular metacharacters are special characters that represent some class of symbols. They consume one character from the string when they are matched (with quantifiers it can be fewer or more). In other words, they "eat" characters of the class they represent. A good example of a metacharacter that consumes characters is . (dot), which matches any character. Among the most common regular metacharacters are:

. Any single character except a newline (length one). There is a special modifier (s) to force . to match a newline too
\d -- matches a digit. Equivalent to [0-9]
\w -- matches a word character (underscore is counted as a word character here). Equivalent to [a-zA-Z_0-9]
\s -- matches a 'space' character (tab, newline, space). Equivalent to [ \t\n\r\f]
Classes. Classes can be called "definable metacharacters". They are groups of characters, given as sets or ranges, enclosed in square brackets. A - (minus) indicates "between" and a ^ right after [ means "not". For example:
- [AP] -- matches either letter A or letter P
- [0-9] -- matches a digit from 0 to 9
- [0123456789ABCDEF] -- matches any (upper case) hexadecimal digit
- [A-Z] -- matches capital letters
- [a-z] -- matches lower case letters
- [A-Za-z0-9_] -- equivalent to \w (note that the symbol "_" is included)
- [abcde] # Either a or b or c or d or e
- [a-e] # same thing ("-" denotes a range here)
- [a-fx-z] # Anything from a to f inclusive and from x to z inclusive
- [^a-z] # Any character that is not a lower case letter
- /[a-zA-Z]/ # Any letter
- /[a-z]+/ # Any non-empty sequence of lower case letters
- /[01]/ # Either "0" or "1"

If you use a capital letter instead of a lower case letter, the meaning of the particular metacharacter is reversed:

\D -- matches a non-digit (character class [^0-9])
\W -- matches a non-word character (character class [^a-zA-Z0-9_])
\S -- matches a 'non-space' character (character class [^ \t\n\r\f])
\B -- anchor that matches anywhere except at a word boundary (the opposite of \b).
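
A few of these classes in action. This is only a sketch; the sample strings and variable names are invented for the illustration:

```perl
use strict;
use warnings;

# Sample strings invented for this illustration.
my $phone = '(555) 123-4567';

# \D is the reverse of \d: delete everything that is not a digit.
(my $digits_only = $phone) =~ s/\D//g;

# A user-defined class: the whole string must consist of hex digits.
my $is_hex = ('DEADbeef' =~ /^[0-9A-Fa-f]+$/) ? 1 : 0;

# A negated class: split on runs of characters that are neither
# word characters nor "=".
my @fields = split /[^\w=]+/, 'id=42; name=joe_user';

print "$digits_only $is_hex @fields\n";
```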

Anchors

Anchors are metacharacters that serve as markers and never consume characters from the string. They do not require any character to be present; only some logical condition at this place in the string needs to be true. In other words, anchors don't match a character, they match a position. The most common anchors are ^ and $:

^ -- anchor which matches the beginning of the line if placed at the beginning of a regular expression. So the regex /^Hello/ will match only if the word Hello is the first in the string and there are no blanks before it. Create a simple test and see this behaviour yourself.
$ -- same as ^, but signifies the end of the line. It is somewhat strange, as in the US the $ sign is usually used as a prefix for dollar amounts as in $15, but this probably originated in Canada :-)
\b -- matches a word boundary (rarely used). \B reverses the meaning of this anchor and means "anything but a word boundary".
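
A simple test of the kind suggested above might look like the following sketch (anchors_matched is a hypothetical helper name and the sample strings are made up):

```perl
use strict;
use warnings;

# Hypothetical helper: report which of three anchored patterns match.
sub anchors_matched {
    my ($s) = @_;
    my @hits;
    push @hits, 'starts_with_Hello' if $s =~ /^Hello/;   # beginning of string
    push @hits, 'ends_with_world'   if $s =~ /world$/;   # end of string
    push @hits, 'word_starts_wiz'   if $s =~ /\bwiz/;    # word boundary
    return join ',', @hits;
}

for my $s ('Hello world', ' Hello world', 'wizard', 'geewiz') {
    printf "%-15s => %s\n", "'$s'", anchors_matched($s) || '(none)';
}
```

Note how a single leading blank is enough to defeat /^Hello/, and how \bwiz accepts "wizard" but rejects "geewiz".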

Quantifiers

Perl has three groups of basic quantifiers (which are also metacharacters, but they affect the interpretation of the preceding character or group). Each group has two members, one greedy and the other non-greedy (lazy):

One or more of the preceding character or group (length one or more)
- + -- greedy. Matches one or more of the preceding characters, but tries to grab as many characters as possible
- +? -- non-greedy. Matches one or more of the preceding characters, but tries to grab the minimum possible number of characters. Usually used with . (dot) as .+? to search for the next occurrence of a string, for example:
  /(.+?)the/
Zero or more of the preceding character or group (length zero or more)
- * -- greedy. Matches zero or more of the preceding characters, but tries to grab as many characters as possible
- *? -- non-greedy. Matches zero or more of the preceding characters, but tries to grab the minimum possible number of characters
Zero or one of the preceding character or group (length zero or one)
- ? -- greedy. Matches zero or one character, preferring one
- ?? -- non-greedy. Matches zero or one character, preferring zero. Rarely useful

Non-greedy quantifiers are newer but easier to understand, as they correspond to a search for the first occurrence of the substring that follows them, while greedy quantifiers correspond to a search for the last occurrence of that substring. That's the key difference. We will discuss non-greedy quantifiers in the next section: More Complex Perl Regular Expressions
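
The difference is easiest to see when the same string is matched both ways. A minimal sketch with a made-up sample sentence:

```perl
use strict;
use warnings;

# Sample sentence invented for this illustration; it contains "the" twice.
my $s = 'over the hills and over the dales';

# Greedy: .+ grabs as much as it can, so the match ends at the LAST "the".
my ($greedy) = $s =~ /(.+)the/;

# Lazy: .+? grabs as little as it can, so the match ends at the FIRST "the".
my ($lazy) = $s =~ /(.+?)the/;

print "greedy: '$greedy'\n";
print "lazy:   '$lazy'\n";
```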

For example:

$sentence="Hello world"; 
if ($sentence =~ /^\w+/) { # true if the sentence starts with a word like "Hello"  
   print "The string $sentence starts with a word\n";
} else {
    print "The string $sentence does not start with a word\n";
}

Full list includes 12 quantifiers:

Maximal (greedy)	Minimal (lazy)	Allowed Range
`{n,m}`	`{n,m}?`	Must occur at least n times but no more than m times
`{n,}`	`{n,}?`	Must occur at least n times
`{n}`	`{n}?`	Must match exactly n times
`*`	`*?`	0 or more times (same as `{0,}`)
`+`	`+?`	1 or more times (same as `{1,}`)
`?`	`??`	0 or 1 time (same as `{0,1}`)
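
The counted forms can be checked with a short sketch (the sample values are made up for the illustration):

```perl
use strict;
use warnings;

# Sample values invented for this illustration.
my @candidates = ('7', '42', '255', '1000');

# {1,3}: keep only the strings that consist of one to three digits.
my @short = grep { /^\d{1,3}$/ } @candidates;

# Greedy {2,4} takes as many as allowed, lazy {2,4}? as few as allowed.
my ($greedy) = 'aaaaa' =~ /(a{2,4})/;
my ($lazy)   = 'aaaaa' =~ /(a{2,4}?)/;

print "short: @short\n";
print "greedy: $greedy, lazy: $lazy\n";
```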

We will discuss additional quantifiers later

Examples

It's probably best to build up your use of regular expressions slowly, from the simplest cases to more complex ones. You are always better off starting with simple expressions, making sure that they work, and then adding more complex elements one by one. Unless you have a couple of years of experience with regex, do not even try to construct a complex regex in one giant step.

Here are a few examples:

$a = '404 - - ';
$a =~ /40\d/; # matches 400, 401, 403, 404 etc.

Here we took a fragment of a record from an HTTP log and tried to match the return code. Note that you can match any part of an integer, not only the whole integer. A similar idea works for real numbers, but generally real numbers have a much more complex syntax:

$target='simple real number: 22.33';
$target=~/\d+\.\d*/;

Note: the regex /\d+\.\d*/ isn't general enough to match all the real numbers permissible in Perl or any other programming language. This is actually a pretty difficult problem, given all of the formats that programming languages usually support, and here regular expressions are of limited use: a lexical analyzer is a better tool.

Now let's try to match words. The simplest regular expression that matches a single word is \w+. Here are a couple of examples:

$target='hello world'; 
$target =~ m{(\w+)\s+(\w+)}; # detecting two words separated by white space

$target='A = b';
$target =~ /(\w+)\s*=\s*(\w+)/; # another way to ignore white space in matching

Here are more examples of simple regular expressions that might be reused in other contexts:

t.t              # t followed by any character followed by t
^131             # 131 at the beginning of a line
0$               # 0 at the end of a line
\.txt$           # .txt at the end of a line
/^newfile\.\w*$/ # newfile. followed by zero or more word characters
                 # This will match newfile.txt, newfile.pl, newfile.bak, etc.
/^.*marker/      # head of the string up to and including the word "marker"
/marker.*$/      # tail of the string starting from the word "marker" till the end (up to newline)
/^$/             # An empty line

Several additional examples:

0		     # zero: "0"
0*                   # zero or more zeros
0+                   # one or more zeros
0*0                  # same as above
\d                   # any digit but only one
\d+                  # any integer
\d+\.\d*             # a subset of real numbers. Please note that 0. is a real number
\d+\.\d+\.\d+\.\d+   # IP addresses (no control of the number of digits, so 1000.1000.1000.1000 would match this regex)
10\.\d+\.\d+\.\d+    # IP addresses starting with 10. You can use {1,3} instead of + for a more correct regex.

Tips:

If you need to match a word whose length is unknown, you probably should use + or +? rather than * or *?, because a zero length word makes no sense.
^$ matches the empty line.
^\s*$ matches string that contains only blanks/tabs or is empty.

How to Create Complex Regex

Complex regex are constructed from simple regular expressions using the following metacharacters:

Character Sequences: A sequence of characters (substring) will match the identical substring in the searched string. For example, m/abc/; will match "abc" but not "cab" or "bca". If any character in the sequence is a meta-character, you need to use the backslash to match its literal value.
- Self-Matching Characters: Any character will match itself unless it is a meta-character or one of $, @, %, &. The meta-characters are the ones described earlier in this section, and the other characters are used to begin variable names and function calls. You can use the backslash character to force Perl to match the literal meaning of any character. For example, m/a/; will return true if the letter a is in the $_ variable. And m/\$/; will return true if the character $ is in the $_ variable.
Alternation: The alternation meta-character (|) lets you match more than one possible string. For example, m/a|b/; will match if either the "a" character or the "b" character is in the searched string. You can use sequences of more than one character with alternation. For example, m/dog|cat/; will match if either of the strings "dog" or "cat" is in the searched string. You can also enclose the alternatives in parentheses, as in m/(dog|cat)/; however, this will affect pattern memory (see below).
Anchors: there are two types of anchors: beginning and end of the string, and word boundaries.
- The caret (^) and the dollar sign ($) meta-characters are used to anchor a pattern to the beginning and the end of the searched string. The caret is always the first character in the pattern when used as an anchor. For example, m/^one/; will only match if the searched string starts with the sequence of characters one. The dollar sign is always the last character in the pattern when used as an anchor. For example, m/(last|end)$/; will match only if the searched string ends with either the character sequence last or the character sequence end.
- Word Boundaries: The \b meta-sequence will match the spot between a space and the first character of a word, or between the last character of a word and the space. The \b will also match at the beginning or end of a string if there are no leading or trailing spaces. For example, m/\bfoo/; will match foo even without spaces surrounding the word. It will also match $foo because the dollar sign is not considered a word character. The statement m/foo\b/; will match foo but not foobar, and the statement m/\bwiz/; will match wizard but not geewiz. The \B meta-sequence will match everywhere except at a word boundary.
Quantifiers: There are several meta-characters that are devoted to controlling how many characters are matched. For example, m/a{5}/; means that five a characters must be found before a true result can be returned. The *, +, and ? meta-characters and the curly braces are all used as quantifiers. Ranges are also possible:
- {n} - matches exactly n copies of the preceding character
- {n,m} - matches at least n but not more than m copies of the preceding character
- {n,} - matches at least n copies of the preceding character
Pattern Memory: Parentheses are used to store matched values into buffers for later recall; references to these buffers are sometimes called back-references. After you use m/(fish|fowl)/; to match a string and a match is found, the variable $1 will hold either fish or fowl, depending on which sequence was matched.
Variable Interpolation: Any variable in a pattern is interpolated, and the resulting pattern is then evaluated as a regular expression. Only one level of interpolation is done. This means that if the value of the variable includes, for example, $scalar as a string value, then $scalar will not be interpolated. In addition, back-quotes do not interpolate within double-quotes, and single-quotes do not stop interpolation of variables when used within double-quotes. Variables can also be interpolated within character classes.
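
Several of these building blocks (alternation, word boundaries, an optional character, pattern memory and variable interpolation) can be combined in a few lines. A minimal sketch, in which first_animal is a hypothetical helper and the sample sentences are made up:

```perl
use strict;
use warnings;

# Hypothetical helper: return the first animal named in $text, or undef.
sub first_animal {
    my ($text, $alternatives) = @_;
    # $alternatives (for example 'dog|cat') is interpolated into the
    # pattern; the parentheses store the matched alternative in $1,
    # s? allows an optional plural and \b anchors both ends of the word.
    return $text =~ /\b($alternatives)s?\b/ ? $1 : undef;
}

my $found = first_animal('I have two cats', 'dog|cat');
print "found: $found\n";
```

Because the alternatives are interpolated as a pattern, any metacharacter in them keeps its special meaning.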

Note: When slashes are used as delimiters, any slash inside the regex needs to be escaped. In a complex regex you can use alternative delimiters, such as m{...} or qr{...}, instead of slashes.

Counting the number of matches

Perl provides several ways to specify how many times a given component must be present before the match is true. You can specify both the minimum and the maximum number of repetitions.

{n} The component must be present exactly n times.
{n,} The component must be present at least n times.
{n,m} The component must be present at least n times and no more than m times.

One can see that old quantifiers that we already know (*, + and ?) can be expressed via new ones:

* and {0,}
- *? Match zero, one or more times ( the fewest possible)
- {0,}? Match zero, one or more times ( the fewest possible)
+ and {1,}
- +? Match one or many times (the fewest possible number)
- {1,}?
? and {0,1}
- ?? Match zero, or one time, but match the fewest possible number of times
- {0,1}?

For example, the regex m/^\w+/ will match "You" and "The" but not "" or " The". In order to account for leading whitespace, which may or may not be present at the beginning of a line, you need to use the asterisk (*) quantifier in conjunction with the \s symbolic character class in the following way:

m/^\s*\w+/;

Be careful when using the * quantifier because it can match an empty string, which might not be your intention. The regex /b*/ will match any string - even one without any b characters.

At times, you may need to match an exact number of components. The following match statement will be true only if the $_ variable starts with at least three fields separated by whitespace:

$_ = '131.1.1.1 - joejerk [21/Jan/2000:09:50:50 -0500] "GET http://216.1.1.1/xxxgirls/bigbreast.gif HTTP/1.0" 200 51500';
m/^(\S+)\s+(\S+)\s+(\S+)/; # get the user name of the offender

In this example, we are interested in getting exactly the third field, which corresponds to the user id in HTTP logs. After the match $3 should contain this id. Note that a quantified group such as (\S+\s+){3} would not do the job: a repeated group keeps only its last repetition, and only in $1.
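
The same approach extends to the other fields of the record. The following is only a sketch: parse_log_line is a hypothetical helper, the sample line is made up, and the pattern assumes the common log layout shown above rather than every variant found in real logs:

```perl
use strict;
use warnings;

# Hypothetical helper: split one access-log line into named fields.
# Returns a hash reference, or nothing if the line does not match.
sub parse_log_line {
    my ($line) = @_;
    # In list context a successful match returns the captured groups.
    my ($host, $ident, $user, $time, $request, $status, $bytes) =
        $line =~ /^(\S+)\s+(\S+)\s+(\S+)\s+\[([^\]]+)\]\s+"([^"]*)"\s+(\d{3})\s+(\d+|-)/
        or return;
    my %record = (host => $host, user => $user, time => $time,
                  request => $request, status => $status, bytes => $bytes);
    return \%record;
}

# Sample line invented for this illustration.
my $sample = '10.0.0.7 - joeuser [21/Jan/2000:09:50:50 -0500] "GET /index.html HTTP/1.0" 200 51500';
my $rec = parse_log_line($sample);
print "user=$rec->{user} status=$rec->{status} time=$rec->{time}\n" if $rec;
```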

The same ideas can be used for processing date and time in the HTTP logs.

Metacharacters in Character Classes

The character class [0123456789] or, shorter, [0-9] defines the class of decimal digits, and [0-9a-fA-F] defines the class of hexadecimal digits. You should use a dash to define a range of consecutive characters. You can use metacharacters inside character classes (but not as endpoints of a range). For example:

$test = "A\t12";
if ( $test =~ m/[XYZ\s]/ ) {
   print "Variable test matched the regex\n";
}

which will display

Variable test matched the regex

because the value of $test includes the tab character which matched metacharacter \s in the character class [XYZ\s].

The meta-character . and the quantifiers ?, *, + that appear inside the square brackets that define a character class are used in their literal sense. They lose their meta-meaning.
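
This can be verified with a short sketch (has_regex_punct is a hypothetical helper name and the sample strings are made up):

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains a literal
# dot, plus, star or question mark.
sub has_regex_punct {
    my ($s) = @_;
    # Inside [...] these four characters stand for themselves.
    return $s =~ /[.+*?]/ ? 1 : 0;
}

for my $s ('3.14', 'a+b', 'plain') {
    print "$s: ", has_regex_punct($s), "\n";
}
```

Outside the brackets the same dot would match any character, so /./ succeeds on 'plain' while /[.]/ does not.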

Alternation

Alternation allows you to provide several alternative regexes; the match succeeds if any one of them matches. In other words, the regular expression:

/^foreach|^for|^while/

means "look for a line beginning with the string 'foreach' OR the string 'for' OR the string 'while'."

The ( | ) syntax splits the regular expression into sections, and each section is tried sequentially. Alternation always tries to match the first item in the list of alternatives. If it doesn't match, the second pattern is tried, and so on.

This is called leftmost matching, and it is very similar to the short-circuit operator ||. The evaluation of alternatives stops at the first success, which can lead to subtle bugs if one string in the alternation is a prefix of another and the shorter string is positioned before the longer one:

$line = 'foreach $i (@n) { $sum+=$i;}'; 
if ($line =~ /^for|^while/ ) {
   print "Regular loop\n";
} elsif ( $line =~ /^foreach/ ) {
   print "Foreach loop\n";
}

In this case the foreach branch will never be reached, as the string for will match before it. This is a common mistake, and it is prudent to always put the longest string first in such cases. This tip is also helpful when you don't know whether or not a word will be followed by a delimiter or an end-of-line character, or whether or not a word is plural:

for $line ('words', 'word') {
   if ($line =~ /\bword\b/ ) {
      print "singular\n";
   } elsif ($line =~ /\bwords\b/ ) {
      print "Plural\n";
   }
}
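The effect of the order of alternatives can be checked directly. This is a minimal sketch; the variable names are chosen only for illustration. It captures what each ordering actually matched, and shows that a trailing \b forces the engine to backtrack to the longer alternative:

```perl
use strict;
use warnings;

my $line = 'foreach $i (@n) { $sum += $i; }';

# Shorter alternative first: 'for' wins, 'foreach' is never tried
my ($short_first) = $line =~ /^(for|foreach)/;

# Longer alternative first: 'foreach' is matched as intended
my ($long_first) = $line =~ /^(foreach|for)/;

# A word boundary after the group rejects 'for' (it is followed by 'e'),
# so the engine backtracks and tries 'foreach'
my ($bounded) = $line =~ /^(for|foreach)\b/;

print "short first: $short_first\n";
print "long first:  $long_first\n";
print "with \\b:     $bounded\n";
```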

Ignoring case

A useful modifier for matching is i (ignore case). For example

$line =~ /word(s?)/i

will match "word" or "words" independent of case.

Parenthesized Groups and Capture Variables

If some part of a regex is enclosed in parentheses, it is considered a group, and the substring matching this group is assigned to the special variables $1, $2, ... For example:

$ip='10.192.10.1';
if ($ip=~/(\d+)\.(\d+)\.(\d+)\.(\d+)/) {
   print "Ip address components are $1, $2, $3 and $4\n";
}

You can use alternation within the group, for example

(red|blue)

If the regular expression matches the string, the substrings that matched each group are assigned to the so-called "capture variables": $1, $2, $3, ... In other words, each group captures what its content matches and assigns it to the corresponding capture variable.

One important feature of capture variables is that only a successful match affects them. If a match is unsuccessful, the previous values are preserved, whatever they may be. That leads to difficult-to-find errors if you are not careful. You should never use capture variables without checking whether the match was successful.

This feature of Perl is rarely discussed in textbooks and is very error prone. Errors are difficult to pinpoint as they depend on whether the match was successful or not. What is worse, you forget about this "feature" from time to time and make the same mistake again and again, and then spend a day debugging your now semi-forgotten script when you accidentally discover that it misbehaves in certain cases. I think that this is a design blunder of Perl. It should set all capture variables to undefined in case of an unsuccessful match. Please check your scripts for usage of capture variables and verify in each case that an if statement guards the match.

For example the regex /\w+/ will let you determine if $_ contains a word, but does not let you know what the word is. In order to accomplish that, you need to enclose the matching components with parentheses. For example:

if ( m/(\w+)/ ) { 
   $word=$1; 
}

By doing this, you force Perl to store the matched string into the $1 variable. The $1 variable can be considered as pattern memory or backreference.

We will discuss backreferences in more detail later.

More on substitutions

As well as identifying substrings that match regular expressions, Perl can make substitutions based on those matches. The way to do this is to use the s operator, which mimics the way substitution is done in the vi text editor. If the target string is omitted, the $_ variable is used as the target string for the substitution.

To replace an occurrence of the regular expression h.*?o by the string Privyet in the string $sentence we use the expression

$sentence =~ s/h.*?o/Privyet/;

and to do the same thing with the $_ variable just write the right side of the previous operator:

s/h.*?o/Privyet/;

The first part of this expression is called the matching pattern and the second part is called the substitution string. The result of a substitution operator in scalar context is the number of substitutions made, not the string that matched the pattern, so it is either false (no substitution) or 1 (true) in this case.

This example only replaces the first occurrence of the string, and it may be that there will be more than one such string we want to replace. To make a global substitution the last slash is followed by a g modifier as follows:

s/h.*?o/Privyet/g

Here the target is $_ variable. The expression returns the number of substitutions made ( 0 is none).

We can also make the replacement case insensitive with the modifier i (for "ignore case"). The expression

s/h.*?o/Privyet/gi

will force the regex engine to ignore case. Note that case is ignored only in matching -- the substitution string will be inserted exactly as you specified.

Modifier i (ignore case) is very useful for both matching and substitution
Modifier g (global) is very useful for substitution as it replaces all occurrences of the regex in the string
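A short sketch of what the substitution operator returns when the g and i modifiers are combined. The sample string is made up for illustration:

```perl
use strict;
use warnings;

my $text = 'hello Hello HELLO';    # hypothetical sample text

# With g and i every case variant is replaced; the operator
# returns the number of substitutions made
my $count = ($text =~ s/hello/Privyet/gi);

# When nothing matches, the result is false and $text is unchanged
my $none = ($text =~ s/absent/x/g);

print "$count substitutions: $text\n";
print "no match gives a false value\n" unless $none;
```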

The substitution operator can also be used to delete any substring. In this case the replacement string should be omitted. For example to remove the substring "Nick" from the $_ variable, you could write: s/Nick//;

There is an additional modifier that applies only to the replacement string: the /e modifier, which makes Perl evaluate the replacement as an expression rather than as a string. It takes effect even if single quotes are used as delimiters, which otherwise suppress variable interpolation.

Like in index function you can use variables in both matching pattern and substitution string. For instance:

# let's assume that $_ = "Nick Bezroukov";
$regex  = "Nick";
$replacement_string = "Nickolas";
$result = s/$regex/$replacement_string/;

Here is a slightly more complex example of replacement (Snort rules):

#alert udp $site_dhcp 63 -> any any (msg:"policy tftp to dchp segment"; classtype:attempted-admin; sid:235; rev:60803;) 

$new="classtype:" . shift(@ARGV) . ";"; # remove the argument so that <> does not treat it as a file name

while(<>) {
   $line=$_;
   $line=~s[classtype\:.*?\;][$new];
   print $line;
}

In the $regex example above, the substitution changes the $_ variable, and the $result variable will be equal to 1 -- the number of substitutions made.

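For a single substitution of a fixed string a similar capability is available with the built-in function substr (its four-argument form replaces the selected substring; this assumes that 'Nick' is present, since index returns -1 otherwise):

substr($_, index($_, 'Nick'), length('Nick'), 'Nickolas');

It would be nice to be able to match on an array too, but this is not the case. If you try something such as: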

@buffer =~ m/yahoo/; # Wrong way to search for a string in the array

In the example above the array @buffer will be converted to scalar (number of its elements) and if we assume that the array has 10 elements that means that you will be doing something like:

'10' =~ m/yahoo/;

The right way to solve this problem is to use the grep function, as in the example below:

grep(m/yahoo/, @buffer);

In scalar context the number of matches will be returned. In array context the list of elements that matched will be returned.
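A minimal sketch of both contexts, using a made-up @buffer (the host names are only illustrative):

```perl
use strict;
use warnings;

# Hypothetical buffer contents
my @buffer = ('www.yahoo.com', 'www.perl.org', 'mail.yahoo.com');

# List context: the matching elements themselves
my @hits = grep { /yahoo/ } @buffer;

# Scalar context: only the number of matching elements
my $count = grep { /yahoo/ } @buffer;

print "$count matches: @hits\n";
```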

Each matched group in the matching pattern can be referenced with so-called backreferences. Backreferences are numbered consecutively: \1, \2, \3, ... Inside the matching pattern use \1, \2, ...; in the replacement string use the corresponding variables $1, $2, ...

Alternative delimiters

With the "standard" notation you need to use a backslash to escape any meta-character you want to match literally, including the slash, which is pretty common in Unix pathnames. For example:

$path =~ m/\/usr\/local\/bin/;

This tries to match /usr/local/bin in $path. A regex that contains a lot of backslashes becomes unreadable. To rectify this problem Perl allows the use of alternative regex delimiters (a delimiter marks the beginning and end of a given regular expression) if you use the initial m for matching:

m{/usr/local/bin}

Actually { } is probably the most readable alternative variant that permit easy finding of opening and closing brackets in any decent editor (including Emacs, vi, vim, Slickedit, MultiEdit). Note that if a left bracket is used as the starting delimiter, then the ending delimiter must be the right bracket. Both the match and substitution operators let you use variable interpolation.

But you can use other symbols, for example:

m"/usr/local/bin" # here double quote serves as a  regex delimiter

In case your regex contains a lot of special symbols, you can first assign it to a single-quoted string and then use the variable in the matching operator. The regex inside slashes is treated like a double-quoted string, so you can interpolate a variable into it. For example:

$profile = '/root/home/.profile'; 
m/$profile/;

The same trick works for substitution too, for example:

s{$USA_spelling}{$Canada_spelling}sg;

This capability to find matching bracket can be useful when we deal with multiple line regular expressions using extension syntax described below.

If the match regex evaluates to the empty string, the last successfully matched regex is used. So, if you see a statement like

if (//) {print;}

in a Perl program, look for the previous regular expression operator to see what the regex really is. The substitution operator also uses this interpretation of the empty regex (but never for the substitution part which is a string, not a regular expression).

Commenting complex regular expressions

Extended mode, which is activated by the modifier x, provides the capability to write comments within the regex as well as to use whitespace freely for readability. For example, instead of the regular expression:

# Match an assignment like a=b. $1 will be the name of the variable
# (the first word). $2 will be the second word.
m/^\s*(\w+)\W+(\w+)\s*$/;

we can write a similar regex, which here also requires a final semicolon, as

m/^\s* (?# leading spaces)
   (\w+) (?# get first word)
    \s*=\s*  (?# match = with white space before and after ignored )
    (.*) (?# right part )
    \; (?# final semicolon)
/x

Here we move groups to separate lines, which improves readability and gives us the opportunity to put comments using the asymmetrical brackets (?# and )

But you can go too far and "kill with kindness". Here is an example of an over-commented regular expression that is more difficult to read than the one-line version:

m/
    (?# This  regex will match any Unix style assignments in configuration file delimited with semicolon)
    (?# results are put into $1 and $2 if the match is successful.)

    ^      (?# Anchor this match to the beginning of the string)           

    \s*    (?# skip over any whitespace characters)
           (?# we use the * because there may be none)

    (\w+)  (?# Match the first word, put in the first variable)
           (?# the first word because of the anchor)
           

    \W+    (?# Match at least one non-word)
           (?# character, there may be more than one)

    (\w+)  (?# Match another word, put into the second variable)
           
    \s*    (?# skip over any whitespace characters)
           (?# use the * because there may be none)

    $      (?# Anchor this match to the end of the)
           (?# string. Because both ^ and $ anchors)
           (?# are present, the entire string will)
           (?# need to match the  regex. A sub-string will not match.) 
/x;

Please note that the really important trick of using \W to match any combination of delimiters like "=", " = " or " =" remains unexplained. In a way those comments make the regex more difficult to understand, not easier. In general, avoid such excesses. In commenting the first rule is: not too much zeal ;-).

Extended matching capabilities: the (?...) syntax

Along with comments, the (?...) extension syntax also provides additional matching capabilities (they do not require the x modifier):

(?:<regexp>) grouping without creating a backreference. This extension lets you add parentheses to your regular expression without causing a regex memory position to be used.
(?!<regexp>) So called "zero-width negative assertion". Only match if not followed by <regexp>. This extension lets you specify what should not follow your regex. For instance, /blue(?!bird)/ means that "bluebox" and "bluesy" will be matched but not "bluebird".
(?=<regexp>) matches the next group of text, but doesn't 'eat' it for further matches. This extension lets you match values without including them in the $& variable. Rarely used...

The most useful of the extensions listed above is grouping without creating a backreference.
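The three extensions can be tried side by side. This is a minimal sketch built around the bluebird example; the variable names and the extra test strings are assumptions made for illustration:

```perl
use strict;
use warnings;

my @words = qw(bluebox bluesy bluebird);

# Negative look-ahead: "blue" must NOT be followed by "bird"
my @not_birds = grep { /blue(?!bird)/ } @words;

# Positive look-ahead: "bird" must follow, but is not part of the match ($&)
my $matched = ('bluebird' =~ /blue(?=bird)/) ? $& : '';

# Non-capturing group: (?:...) groups the alternation without using $1,
# so the first capture variable belongs to the group after it
my ($suffix) = 'bluebox' =~ /(?:blue|red)(\w+)/;

print "not birds: @not_birds\n";
print "matched: $matched, suffix: $suffix\n";
```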

You can also specify regex modifiers inside the regex itself

(?sxi)

This extension lets you specify an embedded modifier in the regex rather than adding it after the last delimiter. This is useful if you are storing regexs in variables and using variable interpolation to do the matching.
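A small sketch of an embedded modifier travelling with a pattern stored in a variable. The pattern and the sample strings are made up for illustration:

```perl
use strict;
use warnings;

# The (?i) inside the string makes the pattern case-insensitive
# wherever it is later interpolated
my $pattern = '(?i)unix';

my @candidates = ('UNIX rules', 'Unix', 'Linux');    # hypothetical input
my @found = grep { /$pattern/ } @candidates;

print scalar(@found), " matched: @found\n";
```

Note that 'Linux' is not matched: it contains "inux", not "unix".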

Blocking the assignment of a group in parentheses to special variables

Extensions also let you group part of a regex without assigning the value of the matched group to special variables ($1, $2,...). For example,

m/(?:Operator|Unix)+/;

matches one or more consecutive occurrences of the strings Operator and Unix in any order. No special regex variables ($1, $2, $3,...) will be assigned.

At times, you might like to include a regex component in your regex without including it in the $& variable that holds the matched string. The technical term for this is a zero-width positive look-ahead assertion. You can use this to ensure that the string following the matched component is correct without affecting the matched value. For example, if you have some whitespace-separated data that looks like this:

A B C

and you want to find all lines that have Operator in the second column and store the value of the first column, you can use a look-ahead assertion. This will do both tasks in one step. For example:

while (<>) {
    push(@array, $&) if m/^\w+(?=\s+Operator\s+)/;
}

print("@array\n");

Let's look at the regex with comments added using the extended mode. In this case, it doesn't make sense to add comments directly to the regex because the regex is part of the if statement modifier. Adding comments in that location would make the comments hard to format.

So we can use a different tactic and put the regex in variable

$regex = '^\w+       (?# Match the first word in the string)
          (?=\s+     (?# Use a look-ahead assertion to match)
                     (?# one or more whitespace characters)
          Operator   (?# text to match but not to include)
          \s+        (?# one or more whitespace characters)
          )          (?# end of the look-ahead assertion)';
while (<>) {
    push(@array, $&) if m/$regex/xo;
}

print("@array\n");

Here we used a variable to hold the regex and then used variable interpolation in the regex with the match operator. To speed things up we use o modifier, which tells Perl to evaluate regular expression only once.

Zero-width negative assertion

The last extension that we'll discuss is the zero-width negative assertion. This type of component is used to specify values that shouldn't follow the matched string. For example, using the same data as in the previous example, you can look for everyone who is not an operator. Your first inclination might be to simply replace the (?=...) with the (?!...) in the previous example.

There are many ways of matching any value.

If the first method you try doesn't work, try breaking the value into smaller components and match each boundary.
If all else fails, you can always ask for help on the comp.lang.perl.misc newsgroup.

Searching for a substrings using backreferences

One of the more common uses of regexes is to find a substring in a string, but remember that in simple cases the index function is simpler and better. Remember also that regular expression matching is greedy: starting from the leftmost position where a match is possible, each quantifier takes as much as it can:

$regex = "a*a";
$_ = "abracadabra";
if (m/$regex/) { print "Found $regex in $_\n"; }

When matching lines in a file you can print matched strings along with their line number using special variable $.

$target = "yahoo";
open(INPUT, "< visited_sites.dat") or die "Cannot open visited_sites.dat: $!";
while (<INPUT>) {
     if (/$target/o ) {
         print "Site $target was visited: $. $_";
     }
}
close(INPUT);

The $. special variable keeps track of the record number. Every time the diamond operators read a line, this variable is incremented.

Please note that this example would be better programmed using the index function.

So the question arises: what are the additional capabilities of regexes that make them superior to string functions in complex situations? The answer is that regexes have so-called matching memory or regex memory -- a set of special variables that are assigned values during the matching operation, for the regex as a whole and for each component of the regex enclosed in parentheses. Regex memory is often called backreferences. This memory persists after the execution of a particular match statement. You can think about backreferences as a special kind of assignment statement.

Each time you use parentheses '()' in a regex, Perl assumes that you want to assign the result of matching to a special variable with a name like $1, $2, $3, ... Naturally $1 refers to the first, $2 to the second, and $3 to the third matched sub-pattern. These variables can then be accessed directly, by name, or indirectly by assigning the matching expression to an array.

You saw a simple example of this earlier right after the component descriptions. That example looked for the first word in a string and stored it into the first buffer, $1. The following small program

$_ =  "a=5";
m/(\w+)=(\d+)/;
print("$1, $2\n");

will display

a, 5

This is a simplified example of how one can process Unix-style configuration files. You can use as many buffers as you need. Each time you add a set of parentheses, another buffer is used. If you want to find all the words in the string, you need to use the /g match modifier. In order to find all the words, you can use a loop statement that loops until the match operator returns false.

$_ =  "word1 word2 word3";
while (/(\w+)/g) {
    print("$1\n");
}

Naturally, this program will display

word1
word2
word3

because in each iteration exactly one new match is printed. As you can see, the regex has internal memory and, when the modifier g is used in a loop, it continues to extract parts of the initial string one by one. But a much more interesting approach to a similar problem is to use an array on the left side of the assignment statement:

$_ =  "word1 word2 word3";
@matches = /(\w+)/g;
print("@matches\n");

The program will display:

word1 word2 word3

To help you to know what matched and what did not Perl has several auxiliary built-in variables with really horrible names:

$+ - This variable is assigned the value that the last bracket match matched.
$& - This variable is assigned the value of the entire matched string. If the match is not successful, then $& retains its value from the last successful match.
$` - This variable is assigned everything in the searched string that is before the matched string.
$' - This variable is assigned everything in the search string that is after the matched string.

For example:

$text = "this matches 'THIS' not 'THAT'";
$text =~ m"('TH..')";
print "$` $'\n";

Here, Perl will save and later print the substring "this matches " for $` and " not 'THAT'" for $'. The matched characters 'THIS' (with their quotes) are saved in $1 and $&. Note that regular expressions match the first occurrence in the string: 'THIS' was matched because it came first. With the default regexp behavior the first occurrence is always the one matched (modifiers such as g, discussed below, let you move on to later occurrences).

If you need to save the value of the matched strings stored in the regex memory, make sure to assign them to other variables. Regex memory is local to the enclosing block and lasts only until another successful match is done.

Backreferences are also available in the matching regex itself. In other words, if you put parentheses around a group of characters, you can use this group of characters later in the regular expression or substitution string. But there is an important syntactic difference -- if you want to use the backreferences in the matching regex you need to use the syntax \1, \2, etc. If you want to use the backreferences in the substitution string you use the regular $1, $2, etc.

Perl is Perl and there are some irregularities in Perl regular expressions ;-) Here are some examples:

$line = 'Hello world';
$line =~s/(\w+) (\w+)/$2 $1/; # This makes string 'world Hello'.

We can also use a back reference in the matching regex itself:

if (/(A)(B)\2\1/) { print "Hello ABBA";}

The example is pretty artificial, but it well illustrates the key concept. There are 4 steps to the match of this string:

(A) matches the letter 'A' and saves it into \1 and $1.
(B) matches the letter 'B' and stores it into \2 and $2.
\2 then matches the second 'B' in the string, because it is equal to 'B'.
\1 matches the next 'A'.
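A less artificial use of the same mechanism is finding a doubled word. This is a minimal sketch with a made-up sentence; \1 is used inside the pattern and $1 in the replacement string:

```perl
use strict;
use warnings;

my $text = 'Paris in the the spring';    # hypothetical sample

# \1 inside the pattern must repeat exactly what group 1 captured
my ($dup) = $text =~ /\b(\w+)\s+\1\b/;

# In the replacement string the same group is referred to as $1
(my $fixed = $text) =~ s/\b(\w+)\s+\1\b/$1/;

print "doubled word: $dup\n";
print "fixed: $fixed\n";
```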

Note: The replacement part of a substitution is interpolated like a double-quoted string, not parsed as a regex, so regex meta-characters have no special meaning in it.

Here are some more examples:

$text = 'word1 word2 word3';
($word1, $word3) = ($text =~ m"(\w+).*(\w+)");

Notice, however, that the assignment yields values only when the text string matches. When the text string does not match, $word1 and $word3 are left undefined. Also note that .* is greedy, so the second group here captures only the final character "3", not the whole of "word3". Try the example above with the string "1999 2000 2001" to see the result. So, what happens to the special variables if your regular expression does not match at all? Nothing will be assigned to them and they will preserve their values (so the values from the previous successful match, if any, would be used).

Backreferences are not set if a regular expression fails. They retain the values from the last successful match.

This is a frequent Perl 'gotcha'. Built-in variables like $1 do not get changed if the regular expression fails. Some people think this is a bug, others consider it a feature. Nonetheless, this point becomes painfully obvious when you consider the following code.

$_ = 'Perl bugs bite';
/\w+ (\w+) \w+/; # sets $1 to be "bugs".
$_ = 'Another match another bug';
/(^a.*\s)/; # /^a.*\s/ will not match: the string starts with an uppercase "A"
print $1; # Surprise ! "bugs" will be printed !

In this case, $1 is the string 'bugs', since the second match expression failed! This Perl behavior can cause hours of searching for a bug. So, consider yourself warned. Or more to the point, always check that a match was successful before using its capture variables. You can use one of the following three checks to avoid this type of error:

if clause. This is the simplest and the safest method to use. You can do it with a regular if statement or with the && operator
- With if statement
```
if (/(^a.*\s)/) {
  $matched = $1; 
} else { 
   print "matching failed"; 
}
```
- Using shell style short circuiting method (operator &&). This is a Perl idiom similar in style to Unix shell practice but it's not very useful.
```
($scalarName =~ m"(regular expression)") && ($match = $1);
```
Direct assignment: Since you can assign the result of a match directly to a list, you can take advantage of the fact that the variables will be left undefined in case the match fails. For example:
```
($match1, $match2) = /(\w+).*(\w+)/;
if (!defined $match1 || !defined $match2) {
   print "match failed\n";
}
```

Although the first method is the cleanest, any one of them will do the job. In any case your regex matching code should protect itself from errors caused by built-in regex variables that the match did not assign.

Backreferences in array context

There are several cases:

The first case is Scalar context, no modifiers. This is not very interesting and as was discussed above 0 or 1 will be returned.
A much more interesting case is matching in array context, no modifiers. Here, the regex matches at the first position where it can match, and simply returns the backreferences in a form that is quickly accessible.

For example:

($variable, $equals, $value) = ($line =~ m"(\w+)\s*(=)\s*(\w+)");

This takes the first reference (\w+) and makes it $variable, the second reference (=) and makes it $equals, and the third reference (\w+) and makes it $value.
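List assignment combines well with the if check recommended earlier, because the assignment itself is false when the match fails. A minimal sketch, assuming some made-up configuration lines and a hypothetical %config hash:

```perl
use strict;
use warnings;

my %config;    # hypothetical place to collect the parsed settings

for my $line ('timeout = 30', 'garbage line', 'retries=5') {
    # The list assignment is true only when the match succeeded,
    # so $variable and $value are never stale here
    if (my ($variable, $value) = $line =~ /^(\w+)\s*=\s*(\w+)$/) {
        $config{$variable} = $value;
    }
}

print "$_ => $config{$_}\n" for sort keys %config;
```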

Another interesting case is Matching in array context with 'g' modifier. This takes the regular expression, applies it as many times as it can be applied, and then stuffs the results into an array that consists of all possible matches. For example:

$line = '1.2 3.4 beta 5.66';
@matches = ($line =~ m"(\d*\.\d+)"g);

will make '@matches' equal to '(1.2, 3.4, 5.66)'. The 'g' modifier does the iteration, matching 1.2 first, 3.4 second, and 5.66 third. Likewise:

use FileHandle;
undef $/;
my $FD = new FileHandle("file");
@comments = (<$FD> =~ m"/\*(.*?)\*/"sg);

will make an array of all the comments in the file read through $FD (the s modifier lets a comment span several lines).

Matching in scalar context with 'g' modifier as iterator

Finally, if you use the matching operator in scalar context, you get a behavior that is entirely different from anything else (in the regular expression world, and even the Perl world). This is that 'iterator' behavior we talked about. If you say:

$line = "BEGIN <data> BEGIN
	<data2> BEGIN <data3>";

while ($line =~ m"BEGIN(.*?)(?=BEGIN|$)"sg){ 
push(@blocks, $1); 
}

This then matches the text one BEGIN block at a time, and stuffs it into @blocks on successive iterations of while:

BEGIN <data> (%)BEGIN <data2> BEGIN <data3>
BEGIN <data> BEGIN <data2> (%)BEGIN <data3>
BEGIN <data> BEGIN <data2> BEGIN <data3>

We have indicated via a '(%)' where each of the iterations start their matching. Note the use of (?=) in this example too! It is essential to matching the correct way, since if you don't use it, the 'matcher' will get set in the wrong place.
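The iterator behavior can be observed with the pos function, which reports where the next /g match on a string will start. This is a minimal sketch on a simplified one-line string; it uses \z (end of string) in the look-ahead instead of $, which is an assumption made here to keep the pattern unambiguous:

```perl
use strict;
use warnings;

my $line = "BEGIN one BEGIN two BEGIN three";    # hypothetical data
my (@blocks, @starts);

# Each iteration resumes where the previous match ended. The look-ahead
# checks for the next BEGIN without consuming it, so the next iteration
# can still match it.
while ($line =~ /BEGIN(.*?)(?=BEGIN|\z)/sg) {
    push @blocks, $1;
    push @starts, pos($line);    # offset where the next search begins
}

print "block: [$_]\n" for @blocks;
print "next search positions: @starts\n";
```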

Nested Backreferences

As backreferences are implicit assignments they can be nested. Let's discuss parsing of date format in HTTP logs.

m{\([(\d)*\])};

Here, the outermost (( )) parentheses captures the whole thing: 'softly slowly surely subtly'. The innermost (()) parentheses captures a combination of strings beginning with an s and ending with a "ly" followed by spaces. Hence, it first captures 'softly', throws it away then captures 'slowly', throws it away then captures 'surely', then captures 'subtly'.

The first two examples are fairly straightforward. '[0-9]' matches the digit '1' in 'this has a digit (1) in it'. '[A-Z]' matches the capital 'A' in 'this has a capital letter (A) in it'. The last example is a little bit trickier. Since there is only one 'an' in the regex, the only characters that can possibly match are the last four 'an A'.

However, by asking for the regex 'an [^A]' we have distinctly told the regular expression to match 'a', then 'n', then a space, and finally a character that is NOT an 'A'. Hence, this does not match. If the regex was 'match an A not an e', then this would match, since the first 'an' would be skipped, and the second matched! Likewise:

$scalarName = "This has a tab(\t)or a newline in it so it matches";
$scalarName =~ m"[\t\n]"; # Matches either a tab or a newline.
                          # matches since the tab is present

This example illustrates some of the fun things that can be done with matching and wildcarding. The same escape sequences that you can have interpolated in a " " string also get interpreted both in a regular expression and inside a character class denoted by brackets ([\t\n]). Here, "\t" matches a tab, and "\n" matches a newline.

Precedence in Regular Expressions

Regex components have an order of precedence just as operators do. If you see the following regex:

m/a|b+/

it's hard to tell if the regex should be read as

m/(a|b)+/   # match any sequence of "a" and "b" characters
            # in any order.

or as

m/a|(b+)/   # match either the "a" character or the "b" character
            # repeated one or more times.

The order of precedence is shown in the table below. By looking at the table, you can see that quantifiers have a higher precedence than alternation. Therefore, the second interpretation is correct.

**The regex Component Order of Precedence**
Precedence Level	Component
1	Parentheses
2	Quantifiers
3	Sequences and Anchors
4	Alternation

You can use parentheses to affect the order in which components are evaluated because they have the highest precedence. However, you need to use the non-capturing (?:...) syntax or you will be affecting the regex memory.
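The difference between the two readings of a|b+ is easy to check. A minimal sketch on made-up strings; the outer parentheses are there only to capture what was matched:

```perl
use strict;
use warnings;

# Quantifiers bind tighter than alternation: a|b+ means "a" or "b+"
my ($alt_on_abab) = 'abab' =~ /^(a|b+)/;
my ($alt_on_bbb)  = 'bbb'  =~ /^(a|b+)/;

# Grouping with (?:...) applies + to the whole alternation
# without creating an additional capture variable
my ($grouped) = 'abab' =~ /^((?:a|b)+)/;

print "a|b+ on abab:     $alt_on_abab\n";
print "a|b+ on bbb:      $alt_on_bbb\n";
print "(?:a|b)+ on abab: $grouped\n";
```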

The quotemeta function

Both the matching and the substitution operators perform variable interpolation both in the regex and substitution strings, for example:

$variable =~ m"$scalar";

then $scalar will be interpolated, turned into the value for scalar. There is a caveat here. Any special characters will be acted upon by the regular expression engine, and may cause syntax errors. Hence if scalar is:

$scalar = "({";

Then saying something like:

$variable =~ m"$scalar";

is equivalent to saying: $variable =~ m"({"; which is a runtime syntax error. If you say:

$scalar = quotemeta('({');

instead, this will make $scalar become '\(\{' for you, and the match becomes:

$variable =~ m"\(\{";

Then, you will match the string '({' as you would like.
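A minimal sketch of quoting a literal string before interpolating it. The sample text is made up; the \Q...\E escape shown in the second match applies the same quoting as quotemeta directly inside the pattern:

```perl
use strict;
use warnings;

my $needle = '({';                              # literal text to find
my $text   = 'open with ({ and close with })';  # hypothetical sample

# quotemeta puts a backslash before every non-word character
my $quoted   = quotemeta($needle);
my $found_qm = ($text =~ /$quoted/) ? 1 : 0;

# \Q...\E does the same quoting inline
my $found_Q = ($text =~ /\Q$needle\E/) ? 1 : 0;

print "quoted pattern: $quoted\n";
print "found: $found_qm $found_Q\n";
```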

You can use array in regex (it will be converted to the string with elements separated by spaces like in print statement), but this is tricky and rarely used:

$variable =~ m/@arrayName/; # this equals m/elem1 elem2/;

Here, this is equal to m/elem1 elem2/. If the special variable $" was set to '|', this would be equal to m/elem1|elem2/, which matches either 'elem1' or 'elem2' in a string. This works for special characters too.
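A small sketch of the $" trick, with made-up array contents. The second variant, which builds the alternation explicitly with join and quotemeta, is an alternative added here for comparison; it does not depend on a global variable:

```perl
use strict;
use warnings;

my @names = ('elem1', 'elem2');    # hypothetical alternatives
my $hit;

{
    # Inside this block array elements are joined with '|' when
    # interpolated, so the pattern becomes (elem1|elem2)
    local $" = '|';
    $hit = ('has elem2 inside' =~ /(@names)/) ? $1 : undef;
}

# Explicit variant: build the alternation by hand
my $alternation = join '|', map { quotemeta } @names;
my ($hit2) = 'has elem1 inside' =~ /($alternation)/;

print "first: $hit, second: $hit2\n";
```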

For example:

$_ = "AAA BBB AAA";
print "Found bbb\n" if  m/bbb/i;

This program finds a match even though the regex uses lowercase and the string uses uppercase because the /i modifier was used, telling Perl to ignore the case. The result from a global regex match (modifier g) can be assigned to an array variable or used inside a loop.

As we already know, the substitution operator has all the modifiers used in the matching operator plus several more. One interesting modifier (e) provides the capability to evaluate the replacement part as an expression instead of a string. You could use this capability to find all numbers in a file and multiply them by a given percentage. Or you could repeat matched strings by using the string repetition operator.

In Perl 4, if back quotes were used as delimiters, the replacement string was executed as a DOS or UNIX command and the output of the command was used as the replacement text. Perl 5 treats back quotes as normal delimiters; the replacement text is not evaluated as a command.

Additional modifiers

In addition to modifiers x and i that we already learned about, the matching operations can have additional modifiers. The list includes the following modifiers:

x -- Use multiline "pretty-printing" of regular expressions with whitespace and comments.
i -- Perform case-insensitive pattern matching
m -- Treat string as multiple lines (^ and $ match at internal \n, if they are present).
s -- Treat string as a single line (. matches \n).
e -- (substitution only) permits interpretation of the replacement part as a Perl expression

Modifier s

Without modifier s, a dot ('.') matches anything but a newline. Sometimes this is helpful. Sometimes it is very frustrating, especially if you have data that spans multiple lines. Consider the following case:

$line = 'BLOCK1:
 Some text
END BLOCK
BLOCK2:
 Another text
END BLOCK';

Now suppose you want to match the text between keyword BLOCK and "END BLOCK":

$line =~ m{
            BLOCK(\d+)
               (.*?)
            END\ BLOCK # Note the backslash: under x an unescaped space is ignored
         }x;

This does not work. Since the wildcard ('.') matches every character EXCEPT a newline, the regular expression hits a dead end when it gets to the first newline.

Sometimes, as in this case, it is helpful to have the wildcard ('.') match EVERYTHING, including the newline. This is what the modifier 's' does. (The class \s is not affected: it matches newlines, tabs and spaces with or without this modifier.)

In other words, it tells Perl to treat the string you are working on as a single line, even if it contains newlines. The above then does work with an s on the end of the regular expression:

$line =~ m{
   BLOCK(\d+)
   (.*?)
   END\ BLOCK
}xs;

With the modifier s this now works as expected.
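The difference can be checked with a small script. This is a sketch under stated assumptions: the block labels 1 and 2 in the sample data are made up so that \d+ has something to match, and the names $without_s and %body are illustrative. The second pattern also adds a colon and \s* around the capture, which the pattern above does not have, simply to trim the captured text.

```perl
use strict;
use warnings;

# Hypothetical sample data with numbered blocks.
my $line = "BLOCK1:\n Some text\nEND BLOCK\nBLOCK2:\n Another text\nEND BLOCK";

# Without s the dot cannot cross a newline, so this never matches.
my $without_s = ($line =~ m{ BLOCK(\d+) (.*?) END\ BLOCK }x) ? 1 : 0;

# With s the dot also matches newlines; g walks through every block.
my %body;
while ($line =~ m{ BLOCK(\d+) : \s* (.*?) \s* END\ BLOCK }xsg) {
    $body{$1} = $2;
}

print "matched without s: $without_s\n";
print "block $_: $body{$_}\n" for sort keys %body;
```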

Modifier m

Modifier m complements the s modifier (the two are independent and can be combined). It treats the string as multiple lines rather than one line. With it, ^ and $ match not only at the beginning and end of the string (respectively): ^ also matches right after any newline, and $ also matches right before any newline. For example,

$line = 'a
b
c';
$line =~ m"^(.*)$"m;

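With the m modifier, the backreference $1 becomes 'a'. Without it this pattern does not match at all, because '.' cannot cross a newline and $ matches only at the end of the string (or before a final newline); with s instead of m, $1 would be "a\nb\nc".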
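A short sketch comparing the same pattern with no modifier, with m, with s, and with m plus g. The variable names are illustrative only, and slashes are used as delimiters here purely for readability.

```perl
use strict;
use warnings;

my $line = "a\nb\nc";

my ($plain)  = $line =~ m/^(.*)$/;     # no modifier
my ($multi)  = $line =~ m/^(.*)$/m;    # ^ and $ work line by line
my ($single) = $line =~ m/^(.*)$/s;    # dot also matches newlines
my @each     = $line =~ m/^(.*)$/mg;   # every line in turn

print "no modifier: ", (defined $plain ? "'$plain'" : "no match"), "\n";
print "m: '$multi'\n";
print "s has ", length($single), " characters\n";
print "mg: @each\n";
```

Combining m with g is a common way to process a multi-line string one line at a time.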

Modifier e

Modifier e provides the possibility to evaluate the second part of the s/// as a complete 'mini-Perl program' rather than as a string. This dramatically increases the power of substitution operator in Perl.

For example let's assume that you want to substitute all of the letters in the following string with their corresponding ASCII number:

$string = 'hello';
$string =~ s{(\w)} # we save the character in $1.
   {ord($1) . " "}egx;
print "$string\n";

This example converts each letter into its numeric code (via the ord function) and prints

 '104 101 108 108 111 '.

Each character was taken in turn here and run through the 'ord' function, which turned it into a number. This is pretty powerful functionality, but at the same time it is difficult to read and understand. In other words, it risks being incomprehensible even to the original programmer when, a month or a year later, he returns to make some modifications to the program.

We suggest you use such a construct only if you go to some length documenting why you are using it and why it is better in this case than a more explicit and cleaner way of programming the same functionality. For example:

$string = turnToAscii($string);

sub turnToAscii {
   my ($string) = @_;
   my @letters = split(//, $string);
   foreach my $letter (@letters) {
      # $letter is an alias, so the assignment changes the array element
      $letter = ord($letter) . " " if ($letter =~ m"\w");
   }
   return join('', @letters);
}

This latter example is longer but more easily maintainable. However, it is not only longer, it is also slower, so if this construct needs to process long strings, the initial "obscure" construct has advantages.

Modifier g in loops

In substitution, modifier g means that every single instance of a regular expression is replaced. In matching it works differently: in scalar context Perl remembers where the previous match ended and starts the next match from that place, not from the beginning of the string. When a match fails (for example, because Perl hits the end of the string), the iterator is reset:

$line = "hello stranger hello friend hello sam";
while ($line =~ m"hello (\w+)"sg){
   print "$1\n";
}

This outputs

stranger
friend
sam

and then quits, because the internal iterator comes to the end of the string.

There is one caveat here. With modifier g, any modification of the variable being matched via assignment causes this internal iterator to be reset to the beginning of the string.

$word = "hello";
$text = <>;
$i = 0;
while ($text =~ m"($word)"sg) {
   print "instance $i of the word $word was found\n";
   $text = "$text\n Word '$word' was found with offset " . length($`) . "\n";
   $i++;
}

As the variable $text is changed inside the loop, the iterator will be reset to the beginning of the string, creating an infinite loop!
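One way to avoid the infinite loop is to leave the matched variable alone inside the loop and collect the messages elsewhere. The sketch below assumes a fixed input string instead of reading from <>, and the names @report, $offset and $result are made up for the example.

```perl
use strict;
use warnings;

my $word = "hello";
my $text = "hello stranger, hello friend";   # hypothetical input instead of <>

my @report;
while ($text =~ m/(\Q$word\E)/g) {
    my $offset = $-[0];                      # start offset of the current match
    push @report, "Word '$word' was found with offset $offset";
}

# $text is only combined with the report after the loop has finished.
my $result = join("\n", $text, @report) . "\n";
print $result;
```

Because $text is never assigned to inside the loop, the iterator keeps its position and the loop ends after the last match.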

Modifier o: compile regular expression only once

This modifier is helpful when you have a complex regex that is inside a nested loop, so that the time consumed by matching greatly influences the total run time of the program.

foreach $filesystem (@fstab) {
   foreach $file (@files) {
      foreach $line (@text) {
         $line =~ m"<complex regular expression>";
      }
   }
}

A regex that contains no variable interpolation is compiled only once, when the program itself is compiled, so it needs no special treatment. A regex that interpolates a variable is different: by default, each time Perl hits it, the variable is interpolated again and the pattern is recompiled if the resulting string has changed. This takes time, and if your regex is complex and the variable never changes, this is an unnecessary operation that modifier o blocks.

Using modifier o with a regex that contains variable interpolation is often considered bad style because of the pitfall described next, but Perl allows it.

With o you make a promise that after the first evaluation the variable that represents the regex will never change. If it does, Perl will not notice your change.
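An alternative that is commonly used in Perl 5 instead of modifier o is to precompile the pattern with the qr// operator outside the loop. The sketch below assumes made-up input lines and a made-up search term; the names $keyword, $regex and @messages are hypothetical.

```perl
use strict;
use warnings;

my @text    = ("error: disk full", "ok", "error: no route");   # hypothetical lines
my $keyword = "error";                                          # hypothetical search term

# qr// interpolates and compiles the pattern once; the compiled object
# is then reused on every iteration of the loop.
my $regex = qr/^\Q$keyword\E:\s*(.*)$/;

my @messages;
foreach my $line (@text) {
    push @messages, $1 if $line =~ $regex;
}

print join("; ", @messages), "\n";
```

Unlike o, this makes the moment of compilation explicit: to search for a different keyword you simply build a new object with qr//.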

Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

Last modified: March, 12, 2019
