Softpanorama

May the source be with you, but remember the KISS principle ;-)
Home Switchboard Unix Administration Red Hat TCP/IP Networks Neoliberalism Toxic Managers
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

Introduction to Perl 5.10 for Unix System Administrators

(Perl 5.10 without excessive complexity)

by Dr Nikolai Bezroukov

Contents : Foreword : Ch01 : Ch02 : Ch03 : Ch04 : Ch05 : Ch06 : Ch07 : Ch08 :


Prev | Up | Contents | Down | Next

5.2. Overview of Perl regular expressions


The Hello World Example

As was mentioned before regular expressions are a language inside the language. Regex should be viewed as a separate language that has no direct connections to Perl. It is used with many other languages (Python, PHP, Java) in almost the same form as in Perl just with different syntactic sugar. Still Perl was the first language to introduce "close binding" of regex and the language per se, the feature that was later more or less successfully copied to Python, TCL and other languages. Also the level of integration of the regular expression language into main language is higher in Perl, then in any alternative scripting language. Still as it is a different language some problems arise. For example Perl debugger can't debug regular expressions.

Perl language regular expression parser gradually evolves. The latest significant changes were introduced in version 5.10 make it more powerful and less probe to errors. This version of Perl is the minimal version recommended for any serious text parsing work.

As regular expressions (regex for short) is a new language, using the famous "Hello world" program as the first program seems to be appropriate. As a remnant from shell/AWK legacy a regular expression lexically is a special type of literals (similar to double quoted literal).

It is usually (but not necessarily) is included in slashes. In matching operator the source string (where matching occurs) is specified on the left side of the special =~ operator (matching operator), while regex is on the right side.

The simplest case is to search substring in string like in built-in function index. The following expression is true if the string Hello appears anywhere in the variable $sentence.

$sentence = "Hello world";
if ($sentence =~ /Hello/) { 
   print "Matched\n"
} else {
   print "Not matched\n"
}   

The regular expressions  are case sensitive, so if we assign to $sentence the same string but in lower case

$sentence = "hello world";

then the above match will fail.

The operator !~ can be used for  a non-match. For example, the expression

$sentence !~ /Hello/

is true if the string Hello does not appear in $sentence.

Alternatively you can use qr instead of slashes. That's very important,  if you regex contain a lot of slashes

$url !~ qr(/cygdrive/f/public_html)
the $_ is the default operand for regular expressions. But in most cases the string against which to performs the match or substitution should be specified explicitly with operator =~ and its negation !~. For example:
$my_string = "The graph has many leaves";
if ( $my_string =~ m/graph/ ) {
   print("The source string contains the word 'graph'.\n");}
   $result =~ s/graph/tree/;
   print "Replaced with 'tree'\n";
}
print("initial string: '$my_string'\n.The result is '$result'\n");
In this example each of the regular expression operators applies to the $my_string variable instead of $_.

Two types of regex

There are two main uses for regular expressions in Perl:

Regular expressions in Perl operate only against strings. No arrays on left hand side of matching statement please.

Regular expressions in Perl operate against strings. No arrays on left hand side of matching statement please.

Success and Failure of Matching

We can capture the success or failure of the match (but not the number of matches)  in a scalar variable. This way we have a way to determine the success or failure of the matching and substitution, respectively:

@test_array=("The graph has many leaves",
             "Fallen leaves, so many leaves on the ground.");
foreach $test (@test_array) {
   $match = ($test =~ m/leaves/);
   print("Result of match of word 'leaves' in string '$test' is $match\n");
}

This program displays the following:

Result of match of word 'leaves' in string 'The graph has many leaves' is 1
Result of match of word 'leaves' in string 'Fallen leaves, so many leaves on the ground' is 1

The other useful feature of this example is that it shows you how to obtain the return values of the regular expression operators. In case subsequent action depends on the value of changed variables you should always check if the expression successive or failed because way to often regular expression behave differently then their creators expect.

In scalar context the match operation returns the number of matches. That means that if match failed it returns zero.

We could use a conditional as to check if match was successful or no:

$sentence = "Disneyworld in Orlando";
if ($sentence =~ /world/){
   print "there is a substring 'world' somewhere in the sentence: $sentence\n";
}

Sometimes it's easier to test the special variable $_, especially if you need to test each input string in the input loop. In this case you can write something like:

while (<>) { # get "Hello world" from the input stream
   if (/world/) {
      print "There is a word 'world' in the sentence '$_'\n";
   }
}

As we already have seen the $_ variable is the default for many Perl built-in functions (tr, split, etc).

Regular Expressions Metacharacters

The problem with regex metacharacters is that there are plenty of them. They provide a lot of power for sophisticated user and at the same time make them appear very complicated, at least at the very beginning.

It's best to build up your skills slowly: creation of complex regex can be considered as a kind of an art form (like solving a a puzzle or chess problems). Please pay special attention to non-greedy (lazy) quantifiers as they are simpler to use and less prone to errors.

It makes a lot of sense first to debug a complex regular expression is a special test script, feeding it with sample strings and observing the output.

Please pay special attention to non-greedy (lazy) quantifiers as they are simpler to use and less prone to errors. It makes a lot of sense first to debug a complex regular expression is a special test script, feeding it with sample strings and observing the output.

There are three types of metacharacters:

As they are used as metacharacters, characters $, |, [],{} (), \, / ^, / and several others in regular expressions should be preceded by a backslash.

For example:

$ip_addr=~/\d+\.\d+\.\d+\.\d+/; # dot character should be escaped

Regular metacharacters

Regular metacharacters are special characters that represent some class of symbols. They consume one character from the string if they are matched (with quantifiers it can be less or more). In other word, they 'eats' characters of the class they represent. A good example is metacharacter that consumes characters is . (dot) which match any character. Among the most common regular metacharacters are:

  1. . Any single character except a newline (length one). There is a special modifier to force . match newline too
  2. \d -- matches a digit (character grouping [0-9]). Equivalent to [0-9]
  3. \w -- matches a word character (underscore is counted as a word character here). Equivalent to [a-zA-Z_0-9]
  4. \s -- matches a 'space' character (tab, newline, space). Equivalent to [ \t\n\r\f]
  5. Classes. Classes can be called "definable metacharacters". They are group of characters in square brackets. They are can be sets or ranges and should be put inside square brackets a -(minus) indicates "between" and a ^ after [ means "not". For for example:

If you use capital latter instead of lower case letter the meaning of metacharacter is reversed:

Anchors

Anchors are metacharacters that serve as markers and that never consume characters from the string. Anchors always match zero number of characters of a particular class. That means that they do not require any character to be present, only some logical condition is this place of the string needs to be true. Anchors don't match a character, they match a condition. In other words they do not consume any symbols. They just tell the regex engine that the particular match occurred. Two most common anchors are ^ and $:

Quantifiers

Perl has three groups of quantifiers (which are also metacharacters, but they affect interpretation of previous character). The most important metacharacters include three groups with two members in each - one greedy and the other non-greedy (lazy):

Non greedy modifies are newer but easier to understand as they correspond to the search of substring, Greedy modifies correspond to search of the last occurrence of the substring. That's the key difference. We will discuss not greedy modifies in the next section: More Complex Perl Regular Expressions

For example:

$sentence="Hello world"; 
if ($sentence =~ /^\w+/) { # true if the sentence starts with a word like "Hello"  
   print "The string $sentence starts with a word\n";
} else {
    print "The string $sentence does not starts with a word\n";
}  
Full list includes 12 quantifiers:
 
Maximal
(greedy)
Minimal
(lazy)
Allowed Range
{n,m} {n,m}? Must occur at least n times but no more than m times
{n,} {n,}? Must occur at least n times
{n} {n}? Must match exactly n times
* *? 0 or more times (same as {0,})
+ +? 1 or more times (same as {1,})
? ?? 0 or 1 time (same as {0,1})

We will discuss additional quantifiers later

Examples

It's probably best to build up your use of regular expressions slowly from simplest cases to more complex. You are always better off starting with simple expressions, making sure that they work and them adding additional more complex elements one by one. Unless you have a couple of years of experience with regex do not even try to construct a complex regex one in one quaint step.

Here are a few examples:

$a = '404 - - ';
$a =~ /40\d/; # matches 400, 401, 403, 404 etc.

Here we took a fragment of a record of the http log and tries to match the return code. Note that you can match any part of the integer, not only the whole integer. A similar idea works for real, but generally real numbers have much more complex syntax:

$target='simple real number: 22.33';
$target=~/\d+\.\d*/;

Note: the regex /\d+\.\d*/ isn't a general enough to match all the real numbers permissible in Perl or any other programming language. This is a actually a pretty difficult problem, given all of the formats that programming languages usually support and here regular expressions are of limited use: lexical analyzer is a better tool.

Now let's try to match works. The simplest regular expression that matches a single word is \w+. Here is a couple of examples:

$target='hello world'; 
$target~ m{(\w+)\s+(\w+)}; # detecting two words separated by white space
$target='A = b';
$target =~ /(\w+)\s*=\s*(\w+)/; # another way to ignore white space in matching

Here are more examples of simple regular expressions that might be reused in other contexts:

t.t		 # t followed by any letter followed by t
	
^131		 # 131 at the beginning of a line
0$		 # 0 at the end of a line
\.txt$		 # .txt at the end of a line
/^newfile\.\w*$/ # newfile. with any  followed by zero or more arbitrary characters
                 # This will match newfile.txt, new_prg, newscript, etc.
/^.*marker/      # head of the string up and including the word "marker"
/marker.*$/	 # tail of the string starting from the 'market' and till the end (up to newline). 		
/^$/		 # An empty line 

Several additional examples:

0		     # zero: "0"
0*		     # zero of more zeros		
0+		     # one or more zeros
0*0		     # same as above
\d		     # any digit but only one
\d+                  # any integer
\d+\.\d*             # a subset of real numbers. Please note that 0. is a real number
\d+\.\d+\.\d+\.\d+   # IP addresses starting (no control of the number of digits so 1000.1000.1000.1000 would match  this regex
/\d+\.\d+\.\d+\.255/ # IP addresses ending with 255

Tips:

Prev | Up | Contents | Down | Next



Etc

Society

Groupthink : Two Party System as Polyarchy : Corruption of Regulators : Bureaucracies : Understanding Micromanagers and Control Freaks : Toxic Managers :   Harvard Mafia : Diplomatic Communication : Surviving a Bad Performance Review : Insufficient Retirement Funds as Immanent Problem of Neoliberal Regime : PseudoScience : Who Rules America : Neoliberalism  : The Iron Law of Oligarchy : Libertarian Philosophy

Quotes

War and Peace : Skeptical Finance : John Kenneth Galbraith :Talleyrand : Oscar Wilde : Otto Von Bismarck : Keynes : George Carlin : Skeptics : Propaganda  : SE quotes : Language Design and Programming Quotes : Random IT-related quotesSomerset Maugham : Marcus Aurelius : Kurt Vonnegut : Eric Hoffer : Winston Churchill : Napoleon Bonaparte : Ambrose BierceBernard Shaw : Mark Twain Quotes

Bulletin:

Vol 25, No.12 (December, 2013) Rational Fools vs. Efficient Crooks The efficient markets hypothesis : Political Skeptic Bulletin, 2013 : Unemployment Bulletin, 2010 :  Vol 23, No.10 (October, 2011) An observation about corporate security departments : Slightly Skeptical Euromaydan Chronicles, June 2014 : Greenspan legacy bulletin, 2008 : Vol 25, No.10 (October, 2013) Cryptolocker Trojan (Win32/Crilock.A) : Vol 25, No.08 (August, 2013) Cloud providers as intelligence collection hubs : Financial Humor Bulletin, 2010 : Inequality Bulletin, 2009 : Financial Humor Bulletin, 2008 : Copyleft Problems Bulletin, 2004 : Financial Humor Bulletin, 2011 : Energy Bulletin, 2010 : Malware Protection Bulletin, 2010 : Vol 26, No.1 (January, 2013) Object-Oriented Cult : Political Skeptic Bulletin, 2011 : Vol 23, No.11 (November, 2011) Softpanorama classification of sysadmin horror stories : Vol 25, No.05 (May, 2013) Corporate bullshit as a communication method  : Vol 25, No.06 (June, 2013) A Note on the Relationship of Brooks Law and Conway Law

History:

Fifty glorious years (1950-2000): the triumph of the US computer engineering : Donald Knuth : TAoCP and its Influence of Computer Science : Richard Stallman : Linus Torvalds  : Larry Wall  : John K. Ousterhout : CTSS : Multix OS Unix History : Unix shell history : VI editor : History of pipes concept : Solaris : MS DOSProgramming Languages History : PL/1 : Simula 67 : C : History of GCC developmentScripting Languages : Perl history   : OS History : Mail : DNS : SSH : CPU Instruction Sets : SPARC systems 1987-2006 : Norton Commander : Norton Utilities : Norton Ghost : Frontpage history : Malware Defense History : GNU Screen : OSS early history

Classic books:

The Peter Principle : Parkinson Law : 1984 : The Mythical Man-MonthHow to Solve It by George Polya : The Art of Computer Programming : The Elements of Programming Style : The Unix Hater’s Handbook : The Jargon file : The True Believer : Programming Pearls : The Good Soldier Svejk : The Power Elite

Most popular humor pages:

Manifest of the Softpanorama IT Slacker Society : Ten Commandments of the IT Slackers Society : Computer Humor Collection : BSD Logo Story : The Cuckoo's Egg : IT Slang : C++ Humor : ARE YOU A BBS ADDICT? : The Perl Purity Test : Object oriented programmers of all nations : Financial Humor : Financial Humor Bulletin, 2008 : Financial Humor Bulletin, 2010 : The Most Comprehensive Collection of Editor-related Humor : Programming Language Humor : Goldman Sachs related humor : Greenspan humor : C Humor : Scripting Humor : Real Programmers Humor : Web Humor : GPL-related Humor : OFM Humor : Politically Incorrect Humor : IDS Humor : "Linux Sucks" Humor : Russian Musical Humor : Best Russian Programmer Humor : Microsoft plans to buy Catholic Church : Richard Stallman Related Humor : Admin Humor : Perl-related Humor : Linus Torvalds Related humor : PseudoScience Related Humor : Networking Humor : Shell Humor : Financial Humor Bulletin, 2011 : Financial Humor Bulletin, 2012 : Financial Humor Bulletin, 2013 : Java Humor : Software Engineering Humor : Sun Solaris Related Humor : Education Humor : IBM Humor : Assembler-related Humor : VIM Humor : Computer Viruses Humor : Bright tomorrow is rescheduled to a day after tomorrow : Classic Computer Humor

The Last but not Least Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand ~Archibald Putt. Ph.D


Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...

You can use PayPal to to buy a cup of coffee for authors of this site

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. You you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: March 12, 2019