Softpanorama: May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and bastardization of classic Unix

Regular Expressions -- the most popular nonprocedural language


Regular expressions (aka regex) are a unique mini programming language for parsing text. This is a functional language, not a procedural one; probably the most popular functional language in use. Usually a particular tool adds the capability to replace the parts of the text that were matched. A good command of regular expressions is a valuable skill for any system administrator or programmer. You can use them in editors (for example vim; never use vi, vim is much better). They are also shared by many other UNIX utilities (for example egrep). See POSIX regular Expressions

Unix is the OS that introduced regular expressions as a powerful non-procedural programming language. This non-procedural (functional) notation originated in Unix editors (qed, the grandfather of ed, was probably the first editor with built-in regex). Because regular expressions were later added in various forms to the shell and to utilities like find, they feel quite natural to most Unix users. Everybody else is in a much less fortunate position. If you have never used Unix, then the closest relatives of regular expressions are the so-called masks in DOS/Windows (*.*, *.tx?, etc.) and formats in Fortran, PL/1 and C. For example, the mask *.*, which in DOS and Windows denotes all files in the current directory, is actually a primitive regular expression. All decent text editors (and the best HTML editors, like Frontpage) support searching with regular expressions too. Still, they are more widely used in Unix than in Windows.
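To make the relationship between masks and regular expressions concrete, here is a minimal Perl sketch that translates a mask into an anchored regex. The helper name mask_to_regex is hypothetical, and the translation is simplified: it ignores the special cases of real DOS wildcard matching.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the article): translate a DOS/shell-style
# mask into an equivalent anchored regular expression.
sub mask_to_regex {
    my ($mask) = @_;
    my $re = '';
    for my $ch (split //, $mask) {
        if    ($ch eq '*') { $re .= '.*' }            # any run of characters
        elsif ($ch eq '?') { $re .= '.'  }            # exactly one character
        else               { $re .= quotemeta $ch }   # everything else is literal
    }
    return qr/\A$re\z/;
}

my $re = mask_to_regex('*.tx?');
for my $name ('notes.txt', 'notes.text', 'data.tx1') {
    printf "%-10s %s\n", $name, ($name =~ $re ? 'matches' : 'does not match');
}
```

The mask has only two metacharacters, while the regex they expand to is a small subset of what a regex engine accepts; that is the sense in which a mask is a "primitive" regular expression.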

Unfortunately Unix is now more than 40 years old, and naturally several generations of regular expressions are implemented in it. Here we see the truth of Donald Knuth's humorous definition of Unix/Linux as an OS with six different types of regular expressions.

Regular expressions are one of the most useful features of Perl (but, contrary to the common advocacy line, definitely not the most useful feature). Due to important enhancements, Perl regex has become the de facto standard for regular expression engines. Starting in 1997, Philip Hazel developed PCRE (Perl Compatible Regular Expressions), which attempts to closely mimic Perl's regular expression functionality and is used by many modern tools, including PHP and Apache HTTP Server.

For a tutorial on Perl regular expressions see Dr. Nikolai Bezroukov. Perl for system administrators. Ch. 5

Regular expressions in Perl - a summary with examples

With these enhancements regex became a powerful non-procedural notation for parsing strings. They are more flexible than string functions. Generally, everything that is achievable via regular expressions can be programmed using string functions, but regular expressions are in most cases more compact and in complex cases more efficient.

It is important to understand that regular expressions in Perl are a language within the language: as soon as you are inside a regular expression, normal Perl rules do not apply. You should forget about Perl lexical and syntactical rules inside a regular expression -- it's a different animal.

Complex regular expressions are hard to write, hard to read, and hard to maintain. Even people who write regex on a daily basis often write them wrong, matching unexpected text and missing valid text. The usage of some metacharacters has subtle nuances that doom novices. See Perl.com Apocalypse 5 for an extremely illuminating discussion of regex shortcomings.

Regex are also available in virtually every other scripting language, and in most compiled languages via additional libraries or classes.
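One classic nuance of this kind is the difference between greedy and non-greedy quantifiers. The following Perl sketch (the sample string is made up for illustration) shows a pattern that "matches unexpected text" because .* is greedy:

```perl
use strict;
use warnings;

my $html = '<b>one</b> and <b>two</b>';

# Greedy: .* grabs as much as possible, so the match runs to the LAST </b>.
my ($greedy) = $html =~ m{<b>(.*)</b>};

# Non-greedy: .*? stops at the FIRST </b> that lets the match succeed.
my ($lazy) = $html =~ m{<b>(.*?)</b>};

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Both patterns look almost identical, yet the first one captures text spanning two elements.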

Advanced regex as we see it in Perl 5 is not the result of design; it is the result of a simple mechanism evolving into a more complex one, gradually solving problems as they were detected and improving the usefulness of the mechanism. The mechanism is no longer simple, and as such it presents a danger for newcomers. The flip side of regular expressions is that they can notoriously misbehave if you don't have enough experience with them. So it's very important to practice using regular expressions, starting with simple ones and gradually mastering the general principles involved in their construction.

In Perl 5 regex has an extra readability form (the /x modifier, which permits whitespace and comments inside the pattern) that facilitates both your sanity and clearer thinking when dealing with regular expressions. This is a very useful extension.
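As an illustration of the readability form, here is a small sketch that writes the same pattern twice, once compactly and once with Perl's /x modifier. The date format is just an assumed example.

```perl
use strict;
use warnings;

# Same pattern twice: compact form, then /x form with whitespace and comments.
my $compact = qr/^(\d{4})-(\d{2})-(\d{2})$/;

my $readable = qr/
    ^
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
    $
/x;

for my $s ('2014-12-03', '03.12.2014') {
    my $c = $s =~ $compact  ? 'yes' : 'no';
    my $r = $s =~ $readable ? 'yes' : 'no';
    print "$s: compact=$c readable=$r\n";
}
```

Under /x, whitespace in the pattern is ignored and # starts a comment, so a literal space or # must be escaped.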

Debugging is pretty difficult. Few languages provide specialized regex debuggers. Perl does have a primitive regex debugger.

There is a good book, 'Mastering Regular Expressions' by Jeffrey Friedl, published by O'Reilly. Paradoxically, the book is also a good example of what not to do with regular expressions, and it demonstrates that the "overcomplexity drive", clearly visible in attempts to replace lex and yacc with regular expressions, is doomed to failure.

Some of his examples (the double word problem, matching comments in C and several others) are perfect examples of what not to do with regular expressions. In the case of double words, converting the text into an array of words with a pipe and then checking the stream for two identical adjacent words is a better and cleaner solution. In the case of comments, one should try flex or procedural string functions. Actually, in those cases the regex solution can be more complex than a solution using string functions.
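The word-array approach to the double word problem can be sketched in a few lines of Perl. This is one possible reading of the suggestion above, not code from the book; the sub name doubled_words and the sample text are hypothetical.

```perl
use strict;
use warnings;

# Report words that appear twice in a row, ignoring case and punctuation.
sub doubled_words {
    my ($text) = @_;
    # Text -> array of lower-cased words; only a trivial separator pattern is used.
    my @words = grep { length $_ } split /\W+/, lc $text;
    my @dups;
    for my $i (1 .. $#words) {
        push @dups, $words[$i] if $words[$i] eq $words[$i - 1];
    }
    return @dups;
}

my @found = doubled_words("Unix is the the OS that\nthat introduced regex. Perl perl too");
print "doubled: @found\n";
```

Because the text is flattened into a word stream first, doubled words split across a line break are found without any extra effort.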

Dr. Nikolai Bezroukov


NEWS CONTENTS

20141203 : shell - Regex in KornShell ( Stack Overflow )
20121122 : Nooface TermKit Fuses UNIX Command Line Pipes With Visual Output ( Nov 22, 2012 )
20121122 : The XSB System Version 2.2 Volume 2 Libraries and Interfaces
20121122 : Regular Expression...someday Regular expression as a programming language. Is it possible
20070311 : Sys Admin v16, i03 The Replacements ( Mar 11, 2007 )
20070221 : Unix Review Tcl Scores High in RE Performance by Cameron Laird and Kathryn Soraiz ( Feb 21, 2007 )
20061124 : freshmeat.net Project details for regular expression parser ( Nov 24, 2006 )
20060919 : The Fishbowl Beware Regular Expressions ( Sep 19, 2006 )
20060919 : PATTERN MATCHING version 1.1 by Dmitry A. Kazakov
20060919 : pcregrep - grep utility that uses perl 5 compatible regexes.
20041119 : PCRE - Perl Compatible Regular Expressions ( Nov 19, 2004 )
20041119 : USENIX ;login - the dark side of regular expressions
20041119 : Regular expressions in sed and commands that use them
20041119 : Regular Expressions in grep
20041119 : Vim Regular Expressions
20041119 : REX XML Shallow Parsing with Regular Expressions by Robert D. Cameron (School of Computing Science; Simon Fraser University) this guy tries to reinvent lexical analysis for XML ;-).
20041119 : Steve Ramsay's Guide to Regular Expressions
20041119 : A Tao of Regular Expressions
20041119 : New Regular Expression Features in Tcl 8.1
20041119 : Regular expression man page
19990112 : Java Regular Expressions ( Jan. 12, 1999 )

Old News ;-)

[Dec 03, 2014] shell - Regex in KornShell

Stack Overflow
Ksh has supported limited extended patterns since ksh88, using the special '(' pattern ')' syntax.

In ksh88, the 'special' character prefixes change the number of matches expected:
'*' for zero or more matches
'+' at least one match
'@' for exactly one match
'?' for zero or one matches
'!' for negation
In ksh93, this was expanded with
'{' min ',' max '}'
to express an exact range:
for w in 1423 12 "" abc 23423 9 33 3  333
do
  [[ $w == {1,3}(\d) ]] && print $w has between 1 and three digits
  [[ $w == {2}(\d) ]] && print $w has exactly two digits
done
And finally, you can have perl-like clutter with '~', which introduces a whole new class of extensions, including full regular expressions with:

'~(E)( regex )'

More examples can be found in Finnbarr P. Murphy's blog: http://blog.fpmurphy.com/2009/01/ksh93-regular-expressions.html
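For comparison, the two ksh93 range patterns from the loop above can be approximated in Perl with anchored regular expressions. This is a sketch under the assumption that a ksh pattern in [[ ... == ... ]] has to match the whole string, which is why the Perl patterns are anchored with \A and \z.

```perl
use strict;
use warnings;

# Same word list as the ksh93 loop above.
my @words = ('1423', '12', '', 'abc', '23423', '9', '33', '3', '333');

my @one_to_three = grep { /\A\d{1,3}\z/ } @words;   # ksh93: {1,3}(\d)
my @exactly_two  = grep { /\A\d{2}\z/   } @words;   # ksh93: {2}(\d)

print "between one and three digits: @one_to_three\n";
print "exactly two digits: @exactly_two\n";
```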

[Nov 22, 2012] Nooface TermKit Fuses UNIX Command Line Pipes With Visual Output

TermKit is a visual front-end for the UNIX command line. A key attribute of the UNIX command line environment is the ability to chain multiple programs with pipes, in which the output of one program is fed through a pipe to become the input for the next program, and the last program in the chain displays the output of the entire sequence - traditionally as ASCII characters on a terminal (or terminal window). The piping approach is key to UNIX modularity, as it encourages the development of simple, well-defined programs that work together to solve a more complex problem.

TermKit maintains this modularity, but adds the ability to display the output in a way that fully exploits the more powerful graphics of modern interfaces. It accomplishes this by separating the output of programs into two types: data output, which is intended for feeding into subsequent programs in the chain, and view output, for visually rich display in a browser window.
The result is that programs can display anything representable in a browser, including HTML5 media. The output is built out of generic widgets (lists, tables, images, files, progress bars, etc.) (see screen shot). The goal is to offer a rich enough set for the common data types of Unix, extensible with plug-ins. This YouTube video shows the interface in action with a mix of commands that produce both simple text-based output and richer visual displays. The TermKit code is based on Node.js, Socket.IO, jQuery and WebKit. It currently runs only on Mac and Windows, but 90% of the prototype functions work in any WebKit-based browser.

The XSB System Version 2.2 Volume 2 Libraries and Interfaces

XSB's POSIX Regular Expression and Wildcard Matching Packages
- Regular Expression Matching and Substitution
- Wildcard Matching and Globing

More about Syntax of Regular Expressions - Xerox XRCE

Regular Expression...someday Regular expression as a programming language. Is it possible

I began to learn about regular expressions about 3 years ago. At that time I had never heard of them, or even expected to come across such a thing. I was asked to learn Perl to solve some bioinformatics work, and I had no regex knowledge except for mathematical statements like x = { y | y is a subset of z } or y = {1,2,3}, and some basic knowledge of the Z language. Although these math statements do not closely resemble the regular expressions used in programming languages, their foundation is still a 'regular expression' (since a regular expression is about stating the regular behavior of the item that we want to specify). We express y = {1,2,3} as /[1-3]/. Anyway, from that point on I began to learn Perl, and slowly I was introduced to the m// operator, s/// and tr/// (text processing requires massive use of these operators).
What I like about regular expressions is their compactness. Techniques for simplifying code have been explored for a long time. One approach is to use functions (another concept that comes from math). Using a function (some use the word subroutine), we reduce code and simplify it by calling the function by name instead of rewriting the same code. With a similar purpose in mind, we use regex to simplify complex requirements by representing a set of rules within a simple statement, e.g. /[a-z]/. One should realize that we are representing many lines of code within a single statement. Just imagine: using regular expressions as a programming language, a million lines of code could be turned into just several lines of code (or symbols).

I also believe that regular expressions could be a language that is easier to remember and faster to write. This is because in a regular expression we use simple symbols to represent (possibly) complex rules. It has been proved that our brains remember things that we see visually more easily than things that are written or touched. Human brains also capture things in graphical form. Based on this, isn't it possible that we can remember a few simple symbols faster than a huge amount of text? And since we only need to write symbols, it will not take long to write code in regular expressions (except for complex rules).

A regular expression is specified using a finite set of symbols, such as '?' to represent existence and '+' to represent repetition, which makes it look encrypted. Programming languages were created to bring computer language (machine language) closer to natural language, so that it becomes easier for people to write code for computer programs. Based on this, it seems impossible for encrypted-looking code like a regular expression to be accepted as a high-level programming language. However, that is not an excuse. Even most programming languages today require some comments to clarify a program's purpose or explain what the code does. People might claim that some high-level languages are already self-explanatory (the code explains its purpose). However, many of us will find that this is not true in all cases. When a section of code becomes very complex, even the most self-explanatory language requires at least a few comments to describe the code. Some of today's regular expression implementations allow comments to be included in the regular expression. So it is not encrypted at all when the regular expression is combined with some extra comments. In the implementation there is no difference in code size, since comments are ignored.

I'm just writing down the general idea of how regular expressions could possibly be a programming language. There are still a lot of things that need to be considered, studied and experimented with. But I'm still hoping for this idea to come true.

[Mar 11, 2007] Sys Admin v16, i03 The Replacements

OK, this is starting to look ugly. Like a regex match, we can pull that apart with a trailing x:
s/
  (
    ^        # either beginning of line
    |        # or
    (?<=,)   # a single comma to the left
  )
  .*?        # as few characters as possible
  (
    (?=,)    # a single comma to the right
    |        # or
    $        # end of string
  )
/XXX/gx;
That's much easier to read (relatively speaking).
Like a regular expression match, we can use an alternate delimiter for the left and right sides of the substitution:
$_ = "hello";
s%ell%ipp%; # $_ is now "hippo"
The rules are a bit complicated, but it works precisely the way Larry Wall wanted it to work. If the delimiter chosen is not one of the special characters that begins a pair, then we use the character twice more to both separate the pattern from the replacement and to terminate the replacement, as the example above showed.
However, if we use the beginning character of a paired character set (parentheses, curly braces, square brackets, or even less-than and greater-than), we close off the pattern with the corresponding closing character. Then, we get to pick another delimiter all over again, using the same rules. For example, these all do the same thing:
s/ell/ipp/;
s%ell%ipp%;
s;ell;ipp;; # don't do this!
s#ell#ipp#; # one of my favorites
s[ell]#ipp#; # [] for pattern, # for replacement
s[ell][ipp]; # [] for both pattern and replacement
s<ell><ipp>; # <> for both pattern and replacement
s{ell}(ipp); # {} for pattern, () for replacement
No matter what the closing delimiter might be for either the pattern or the replacement, we can include the character literally by preceding it with a backslash:
$_ = "hello";
s/ell/i\/n/; # $_ is now "hi/no";
s/\/no/res/; # $_ is now "hires";
To avoid backslashing, pick a distinct delimiter:
$_ = "hello";
s%ell%i/n%; # $_ is now "hi/no";
s%/no%res%; # $_ is now "hires";
Conveniently, if a paired character is used, the pairs may be nested without invoking any backslashes:
$_ = "aaa,bbb,ccc,ddd,eee,fff,ggg";
s((^|(?<=,)).*?((?=,)|$))(XXX)g; # replace all fields with XXX
Note that even though the pattern contains closing parentheses, they are all paired with opening parentheses, so the pattern ends at the right place.
The right side of the substitution operation is generally treated as if it were a double-quoted string: variable interpolation and backslash interpretation is performed directly:
$replacement = "ipp";
$_ = "hello";
s/ell/$replacement/; # $_ is now "hippo"
The left side of a substitution is also treated as if it were a double-quoted string (with a few exceptions), and this interpolation happens before the result is evaluated as a regular expression:
$pattern = "ell";
$replacement = "ipp";
$_ = "hello";
s/$pattern/$replacement/; # $_ is now "hippo"
Using this form of pattern, Perl is forced to compile the regular expression at runtime. If this happens in a loop, Perl may need to recompile the regular expression repeatedly, causing a slowdown. We can give Perl a hint that the pattern is really a regular expression by using a regular expression literal:
$pattern = qr/ell/;
$replacement = "ipp";
$_ = "hello";
s/$pattern/$replacement/; # $_ is now "hippo"
The qr operation creates a Regexp object, which interpolates into the pattern with minimal fuss and maximal speed.
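A minimal sketch of the loop scenario described above: the patterns are compiled once with qr and then reused for every input line. The rule names and log lines are made up for illustration.

```perl
use strict;
use warnings;

# Compile each pattern once, outside the loop over the input lines.
my %rules = (
    ip_like => qr/\b\d{1,3}(?:\.\d{1,3}){3}\b/,
    error   => qr/\berror\b/i,
);

# Hypothetical sample input.
my @log = (
    'connect from 10.0.0.7',
    'ERROR: disk full',
    'all quiet',
);

my %hits;
for my $line (@log) {
    for my $name (sort keys %rules) {
        $hits{$name}++ if $line =~ $rules{$name};
    }
}

print ref($rules{error}), "\n";              # a qr// value is a Regexp object
print "$_: $hits{$_}\n" for sort keys %hits;
```

Note that flags such as /i travel with the compiled object, so each rule can carry its own modifiers.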

[Feb 21, 2007] Unix Review Tcl Scores High in RE Performance by Cameron Laird and Kathryn Soraiz

... Regular-Expressions.info is the Web place to go for tutorials and ongoing enthusiasm on the subject of REs, and the Wikipedia article on the subject does a good job of explaining how programmers see REs.

... There are parsing tasks that simply can't be done by REs, notably XML; others that can be written as REs, but only at the cost of extreme complexity (email addresses are an example); and still others where REs work quite well, but slower or more clumsily than such alternatives as scanf, procedural string parsing, glob-style patterns, the pattern matching such languages as Erlang or Icon build in, and even interesting special-purpose parsers other than RE, such as Paul McGuire's Pyparsing or Damian Conway's Parse-RecDescent. Christopher Frenz' book hints at the range of techniques available for practical use.

All this still doesn't exhaust what there is to know about Perl and its RE: Perl REs are actually "extended" REs, and there's controversy about the costs of that extension; Perl 6 is changing Perl RE's again; much of Perl's RE functionality is now readily available in "mainstream" languages such as C, Java, C#; and so on. There's a relative abundance of commenters on Perl, though, so we leave these subjects to others for now.

Performance curiosities

One final misconception about Perl's REs deserves mention, however: that its REs dominate all others in convenience, correctness, performance, and so on. As a recent discussion in the Lambda the Ultimate forum remarked, Perl's performance on a well-known RE benchmark is less than a quarter of that of the equivalent Tcl. This is not unusual; Tcl's RE engine is quite mature and fast.

The final point to make for this month about REs, though, is that many of these facts simply don't matter! We've seeded this column with an abundance of hyperlinks to outside discussions and explanations; read for yourself about the different implementations and uses of REs. What's remarkable, though, is how little consequence most of these details appear to have to working programmers.

The benchmark just mentioned, for example, and others that demonstrate Tcl can be hundreds of times faster than other languages in RE processing suggest that Tcl might be a favorite of programmers with big parsing jobs. It's just not so. Objectively, Tcl's RE performance is a strength, particularly in light of the fact that it correctly handles Unicode.

Almost no one cares. Cisco, Oracle, IBM, Siemens, Daimler, Motorola, and many other companies and government agencies use Tcl in "mission-critical" applications. We've interviewed managers from dozens of such development staffs, and never has a decision-maker volunteered that the speed of Tcl's RE engine affected choice of the language. Even the most passionate pieces of Tcl advocacy don't mention its RE engine.

Amazingly, Tcl insiders don't regard its RE engine as "wrung out". In fact, it's one of the more conservative parts of the Tcl implementation, and core maintainer Donal Fellows has abundant ideas for its improvement.

[Nov 24, 2006] freshmeat.net Project details for regular expression parser

regular expression parser is a C++ regexp parser that implements The Open Group specification Issue 6, IEEE Std 1003.1, 2004 Edition. It allows you to parse input using regular expressions, and to retrieve parsed sub-expression matches in a few steps.
Release focus: Initial freshmeat announcement

Changes:
This release supports wide chars and localization, although localization functionality has not yet been tested precisely.

[Sep 19, 2006] The Fishbowl Beware Regular Expressions

August 18, 2003 (The Fishbowl)

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -Jamie Zawinski, in comp.lang.emacs

Regular expressions are a very powerful tool. They're also very easy to mis-use. The biggest problem with regexps occurs when you attempt to use a series of regular expressions to substitute for an integrated parser.

I recently upgraded Movable Type, and in the process I installed Brad Choate's excellent MT-Textile plugin. MT-Textile gives you a vaguely wiki-like syntax for blog entries that rescues you from writing a mess of angle-brackets every time you want to write a post.

I love MT-Textile, but sadly the more comfortable I get with it, the more I realise its limitations. MT-Textile is built on top of a series of regular expressions, and as such, the more you try to combine different Textile markups, the more likely you are to confuse the parser and end up with something different to what you intended. Any parser built on top of multiple regular expressions gets confused very easily, depending on the order the regexps are applied in.
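The order dependence described above is easy to reproduce. The Perl sketch below uses two hypothetical markup rules (they are not MT-Textile's real rules) and applies them in both orders to the same input.

```perl
use strict;
use warnings;

# Two hypothetical markup rules (not MT-Textile's real ones):
#   *text*  -> <b>text</b>
#   /text/  -> <i>text</i>
sub bold   { my ($s) = @_; $s =~ s{\*(.+?)\*}{<b>$1</b>}g; return $s }
sub italic { my ($s) = @_; $s =~ s{/(.+?)/}{<i>$1</i>}g;   return $s }

my $src = '*bold* and /it/';

my $italic_first = bold(italic($src));   # italic rule runs before the bold rule
my $bold_first   = italic(bold($src));   # bold runs first; its </b> now contains a slash

print "italic first: $italic_first\n";
print "bold first:   $bold_first\n";
```

With the bold rule first, the slash inside the generated </b> tag is picked up by the italic rule, and the output is mangled. Every additional rule multiplies the number of such interactions that have to be checked.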

I ran into the same problem when I was running my own wiki. I started with a Perl wiki, which (like all Perl code) was highly dependent on regular expressions. I quickly found that the effort required to add new markup to the wiki, keeping in mind the way each regexp would interact with the previous and subsequent expressions, increased exponentially with the complexity of the expression language.

After a certain point, diminishing returns will kill you.

I'd like to propose the following rule:

Every regexp that you apply to a particular block of text reduces the applicability of regular expressions by an order of magnitude.

I won't pretend to be a great expert in writing parsers-I dropped out of University before we got to the compiler-design course-but after a point, multiple regular expressions will hurt you, and you're much better off rolling your own parser.
Posted to nerd, stories at August 18, 2003 11:46 PM

I feel the need to come to the defence of REs. :)
While I agree with your post in principle, what I think you're really advocating is the *structured* application of REs. If you go down the route of writing a proper parser (eg: using flex/bison or ANTLR) the first step is to write an ordered list of REs that will be used to tokenise the source text. The important step being that the list of REs is _clearly ordered_, and there are explicit rules on whether to keep matching subsequent REs or not depending on which REs have matched already. In this context, the application of further REs increases the complexity of your parser in something closer to a linear fashion.

I don't really follow Stephen Schmidt's comments about parsers being hard to extend and compile-time bound. The easiest way to parse a language is to specify the grammar based on tokens, and that's exactly what yacc/bison do to the tokenised output from [f]lex. The separation of tokenising and grammar building is precisely what makes these tools useful and powerful. And if you don't want to use C/C++ there's perl-bison and ANTLR (for Java).
Posted by: Stewart Johnson at August 22, 2003 01:40 PM (#link)
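The "clearly ordered list of REs" that the comment describes can be sketched in Perl with the \G anchor and the /gc modifiers. The token names and the toy input language are hypothetical; a real lexer generated by flex would differ in detail (for example, flex prefers the longest match).

```perl
use strict;
use warnings;

# Ordered token rules: earlier entries win, so keywords are tried before
# general identifiers. Rule names and the toy language are hypothetical.
my @rules = (
    [ WS      => qr/\s+/           ],
    [ NUMBER  => qr/\d+/           ],
    [ KEYWORD => qr/(?:if|then)\b/ ],
    [ IDENT   => qr/[A-Za-z_]\w*/  ],
    [ OP      => qr{[-+*/=<>]}     ],
);

sub tokenize {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;
  TOKEN:
    while (pos($src) < length($src)) {
        for my $rule (@rules) {
            my ($type, $re) = @$rule;
            # \G anchors at the current position; /c keeps pos() on failure.
            if ($src =~ /\G($re)/gc) {
                push @tokens, "$type:$1" unless $type eq 'WS';
                next TOKEN;              # first matching rule wins
            }
        }
        die "unexpected character at offset " . pos($src) . "\n";
    }
    return @tokens;
}

print join(' ', tokenize('if x1 > 42 then y = x1 + 1')), "\n";
```

Each rule is tried only at the current position and exactly one rule consumes each piece of input, so adding a rule cannot silently change what an earlier rule has already matched.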

PATTERN MATCHING version 1.1 by Dmitry A. Kazakov

Pattern matching is a powerful tool for syntax analysis. The main idea of pattern matching comes from the SNOBOL4 language (see the wonderful book THE SNOBOL4 PROGRAMMING LANGUAGE by R. E. Griswold, J. F. Poage and I. P. Polonsky). Some of the pattern expression atoms and statements were taken from there. One can find that patterns are very similar to the Backus-Naur forms. Compared with the regular expressions used by grep and egrep in UNIX, patterns are more powerful, but slower in matching.

This library is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this library; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

As a special exception, if other files instantiate generics from this unit, or you link this unit with other files to produce an executable, this unit does not by itself cause the resulting executable to be covered by the GNU General Public License. This exception does not however invalidate any other reasons why the executable file might be covered by the GNU Public License.

The current version works under Windows and UNIX.
Download (match_1_1.tgz tar + gzip, Windows users may use WinZip)

pcregrep - grep utility that uses perl 5 compatible regexes.

Perl-style regexps have many useful features that the standard POSIX ones don't; this is basically the same as grep but with the different regexp syntax.

The other reason for the existence of pcregrep is that its source code is an example of programming with pcre library.

[Nov 19, 2004] PCRE - Perl Compatible Regular Expressions

The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5. PCRE has its own native API, as well as a set of wrapper functions that correspond to the POSIX regular expression API. The PCRE library is free, even for building commercial software.

PCRE was originally written for the Exim MTA, but is now used by many high-profile open source projects, including Python, Apache, PHP, KDE, Postfix, Analog, and nmap. Other interesting projects using PCRE include Ferite, Onyx, Hypermail, and Askemos.

USENIX ;login - the dark side of regular expressions

Regular expressions in sed and commands that use them

Regular Expressions in grep

Vim Regular Expressions

I started this tutorial for one simple reason - I like regular expressions. Nothing compares to the satisfaction from a well-crafted regexp which does exactly what you wanted it to do :-). And yes, I have a life too. I hope it's passable as a foreword. Feel free to send me your comments, corrections and suggestions concerning this tutorial.

REX XML Shallow Parsing with Regular Expressions by Robert D. Cameron (School of Computing Science; Simon Fraser University) this guy tries to reinvent lexical analysis for XML ;-).

The syntax of XML is simple enough that it is possible to parse an XML document into a list of its markup and text items using a single regular expression. Such a shallow parse of an XML document can be very useful for the construction of a variety of lightweight XML processing tools. However, complex regular expressions can be difficult to construct and even more difficult to read. Using a form of literate programming for regular expressions, this paper documents a set of XML shallow parsing expressions that can be used as a basis for simple, correct, efficient, robust and language-independent XML shallow parsing. Complete shallow parser implementations of less than 50 lines each in Perl, JavaScript and Lex/Flex are also given.
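To give a feel for what a shallow parse produces, here is a deliberately simplified Perl sketch. It is not the REX expression from the paper: it handles only plain tags and text, and ignores comments, CDATA sections, processing instructions and '>' inside attribute values.

```perl
use strict;
use warnings;

# Simplified illustration only: one alternation that yields either a
# tag-like item or a run of text. The sample document is made up.
my $xml = '<note id="1">Hello <b>world</b></note>';

my @items = $xml =~ /(<[^>]*>|[^<]+)/g;

print "$_\n" for @items;
```

The result is a flat list of markup and text items in document order, with no attempt to check nesting.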

Steve Ramsay's Guide to Regular Expressions

If you've ever typed "cp *.html ../" at the UNIX command prompt, or entered "garden?" into a web-based search engine, you've already used a simple regular expression. Regular expressions ("regex's" for short) are sets of symbols and syntactic elements used to match patterns of text.

Even these simple examples testify to the power of regular expressions. In the first instance, you've copied all the files which end in ".html" (as opposed to copying them one by one); in the second, you've conducted a search not only for "garden," but for "garden, gardening, gardens, and gardeners" all at once.

For a tool with full regex support, metacharacters like "*" and "?" (or "wildcard operators," as they are sometimes called) are only the tip of the iceberg. Using a good regex engine and a well-crafted regular expression, one can easily search through a text file (or a hundred text files) searching for words that have the suffix ".html" (but only if the word begins with a capital letter and occurs at the beginning of the line), replace the .html suffix with a .sgml suffix, and then change all the lower case characters to upper case. With the right tools, this series of regular expressions would do just that:

s/(^[A-Z]{1})([a-z]+)\.html/\1\2.sgml/g
tr/a-z/A-Z/

As you might guess from this example, concision is everything when it comes to crafting regular expressions, and while this syntax won't win any beauty prizes, it follows a logical and fairly standardized format which you can learn to read and write easily with just a little bit of practice.
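The task described above (rename the .html suffix to .sgml for capitalised words at the beginning of a line, then upper-case the text) can be written as a runnable Perl sketch. The sample text is made up, and the /m modifier is an addition so that ^ matches at the start of every line.

```perl
use strict;
use warnings;

my $text = "Index.html is the start page\nsee Index.html and about.html\n";

# /m lets ^ match at the start of every line, not only at the start of the string.
(my $renamed = $text) =~ s/^([A-Z][a-z]+)\.html/$1.sgml/mg;

# tr/// then upper-cases every ASCII lower-case letter.
(my $upper = $renamed) =~ tr/a-z/A-Z/;

print $renamed;
print $upper;
```

Only the first "Index.html" is renamed; the occurrences on the second line are not at the beginning of a line and stay untouched.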

A Tao of Regular Expressions

A regular expression is a formula for matching strings that follow some pattern. Many people are afraid to use them because they can look confusing and complicated. Unfortunately, nothing in this write up can change that. However, I have found that with a bit of practice, it's pretty easy to write these complicated expressions. Plus, once you get the hang of them, you can reduce hours of laborious and error-prone text editing down to minutes or seconds. Regular expressions are supported by many text editors, class libraries such as Rogue Wave's Tools.h++, scripting tools such as awk, grep, sed, and increasingly in interactive development environments such as Microsoft's Visual C++.

Regular expression usage is explained by examples in the sections that follow. Most examples are presented as vi substitution commands or as grep file search commands, but they are representative examples and the concepts can be applied in the use of tools such as sed, awk, perl and other programs that support regular expressions. Have a look at Regular Expressions In Various Tools for examples of regular expression usage in other tools. A short explanation of vi's substitution command and syntax is provided at the end of this document.

New Regular Expression Features in Tcl 8.1

Regular expression man page

[Jan. 12, 1999] Java Regular Expressions -- Very nice page with a lot of useful links that should probably be consulted by any Java programmer.

Perl

Mastering Regular Expressions -- links to the additional material from the book by Jeffrey Friedl
Backtracking in regular expressions Also available from SPAN Backtracking in regular expressions
PERL5 Regular Expression Description
MakeRegex -- Freeware composes a regex-expression from a list of words. It had been inspired by the emacs elisp module make-regex.el, by Simon Marshall.
Perl Regular Expressions -- The perlre - Perl regular expressions man page
Perl Tutorial: String matching -- One of the most useful features of Perl (if not the most useful feature) is its powerful string manipulation facilities. At the heart of this is the regular expression (RE) which is shared by many other UNIX utilities.
PERL5 Regular Expression Description By Tom Christiansen
Regexp for URLs -- Abigail's collective effort to write a regular expression to match URLs.
Regular Expressions Give Perl its Luster -- short info card from Webreview.com
Regular Expressions [cmu.edu]
Regular Expressions [sunsite.auc.dk]

Tutorials

Mastering Regular Expressions free chapter: Chapter 4 The Mechanics of Expression Processing

Sys Admin v16, i03 The Replacements by Randal L. Schwartz

Gnu Regex manual from NASA
Perl Regular Expression Tutorial
Perl Regular Expression Simulator (includes Java applet for testing your own regexes)
Syntax of regexes

Steve Ramsay's Guide to Regular Expressions

Appendix Z - Regular Expressions -- from UNIX Shell Programming Fourth Edition

Regular Expressions [sunsite.auc.dk]
Regular Expressions Primer
Mastering Regular Expressions - O'Reilly's "Hip Owls" book
Regular Expressions [cmu.edu]

Perl5 Regular Expression Description

CSC 4170 Regular expressions - Topics: Primitive Regular Expressions; Regular Expressions; Languages Defined by Regular Expressions; Building Regular Expressions; Example Regular Expressions. Regular expressions come in many..
--http://www.netaxs.com/people/nerp/automata/regexp0.html

Regular Expressions - Last Updated on: Wed 16 Sep 1998 (1stDraft). The Shell / Regular Expressions. Regular expressions are much like wildcards in the sense that they match patterns of characters, but they are.

Lecture 10 : Regular Grammars, Regular Expressions & Finite State Automata
--http://www.cs.um.edu.mt/~hzarb/CSM201/notes/lecture10/lec...

Introduction to the Internet, Regular expressions, Page 1 - Chapter 6: Regular expressions, ex, awk, perl and find. Section 1: Regular expressions. Regular expressions provide a powerful method for matching patterns of characters. Regular expressions (REs) are..
--http://iris.geobio.elte.hu/iris/85321/study-guide/chap6/s...

Reference

perlre - Perl regular expressions

NAME
DESCRIPTION

Regular Expressions Reference

Book Chapters

Regular Expressions -- The chapter from Joseph N. Hall's book, Effective Perl Programming, in PDF. Very Good

Friedl, Jeffrey - author of Mastering Regular Expressions (O'Reilly's "Hip Owls" book).

Mastering Regular Expressions

Now that we have some background under our belt, let's delve into the mechanics of how a regex engine really goes about its work. Here we don't care much about the Shine and Finish of the previous chapter; this chapter is all about the engine and the drive train, the stuff that grease monkeys talk about in bars. We'll spend a fair amount of time under the hood, so expect to get a bit dirty with some practical hands-on experience.

Start Your Engines!

Let's see how much I can milk this engine analogy for. The whole point of having an engine is so that you can get from Point A to Point B without doing much work. The engine does the work for you so you can relax and enjoy the Rich Corinthian Leather. The engine's primary task is to turn the wheels, and how it does that isn't really a concern of yours. Or is it?

... ... ...

Finite automata

Intro to Tcl Regular Expressions

Linguist Noam Chomsky defined a hierarchy of languages, in terms of complexity. This four-level hierarchy, called the Chomsky hierarchy, corresponds precisely to four classes of machines. Each higher level in the hierarchy incorporates the lower levels: that is, anything that can be computed by a machine at the lowest level can also be computed by a machine at the next higher level.

How do we prove this? We prove it by constructing an interpreter for the lower level machine on the higher level one: if we can construct such an interpreter, then clearly the higher level machine is capable of computing anything the lower level machine can compute: it can simply compute it on the interpreter!

The levels of the Chomsky hierarchy are as follows:
MACHINE CLASS            LANGUAGE CLASS
-------------            --------------
finite automata          regular languages
pushdown automata        context-free languages
linear bounded automata  context-sensitive languages
Turing machines          recursively enumerable languages
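
To make the lowest level of the hierarchy concrete, here is a small Python sketch of a finite automaton. The language it accepts (binary strings containing an even number of 1s) and the names `TRANSITIONS` and `accepts` are chosen purely for illustration and do not come from the original text. The same language is described by the regular expression `(0*10*1)*0*`, so the sketch checks the automaton against Python's `re` module on all short inputs.

```python
import re
from itertools import product

# Deterministic finite automaton for "even number of 1s".
# States: "even" (start state, accepting) and "odd".
TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}

def accepts(word):
    """Run the automaton over word; True if it ends in the accepting state."""
    state = "even"
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return state == "even"

# An equivalent regular expression for the same language.
EVEN_ONES = re.compile(r"(0*10*1)*0*")

# Compare automaton and regex on every binary string of length 0..6.
for n in range(7):
    for letters in product("01", repeat=n):
        word = "".join(letters)
        assert accepts(word) == bool(EVEN_ONES.fullmatch(word))
```

The automaton needs only a fixed, finite amount of memory (its current state), which is what places regular languages at the bottom of the table above.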

CSC 4170 Regular expressions

Primitive Regular Expressions
Regular Expressions
Languages Defined by Regular Expressions
Building Regular Expressions
Example Regular Expressions
Regular expressions come in many forms. The syntax we use in class is much simpler than the syntax used by, say, the UNIX grep command, but both have equivalent descriptive power. (grep, by the way, is even used at the North Pole.)
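
The claim about equivalent descriptive power can be illustrated with a short Python sketch. It is an illustration rather than a proof: it writes one pattern with grep-style conveniences (a character class and `+`) and again using only alternation, concatenation and the star, which is presumably close to the simpler classroom syntax. The sample strings are invented.

```python
import re

# Convenient syntax: a character class plus the "+" quantifier.
sugared = re.compile(r"[0-3]+")

# The same language using only union (|), concatenation and star (*).
primitive = re.compile(r"(0|1|2|3)(0|1|2|3)*")

samples = ["", "0", "123", "3210", "4", "12a", "0033"]
for s in samples:
    # Both patterns must agree on whether the whole string matches.
    assert bool(sugared.fullmatch(s)) == bool(primitive.fullmatch(s))

matched = [s for s in samples if sugared.fullmatch(s)]
print(matched)
```

The richer syntax is shorter to write, but every convenience used here expands into the three basic operations.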

Here's a nice description of regular expressions in UNIX, specifically in regexp, and here's a more detailed explanation.

Here are UNIX man (manual) pages for regexp and for grep, egrep, and fgrep (link1, link2, link3). Regular expressions are used for searching in a number of editors such as ed, vi, and emacs. They have found their way into programming languages such as Tcl and Perl. Here's a paper on specifying search expressions in Perl, and here's one on using Perl to search the Web. Here's yet another writeup on using regular expressions in Perl.

XRCE MLTT Examples of Networks and Regular Expressions

Problems

Tom Christiansen on Irregular Expressions

Etc

The Last but not Least: Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand. ~Archibald Putt, Ph.D.

Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...

You can use PayPal to buy a cup of coffee for authors of this site

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. If you do not want to be tracked by Google, please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: March 12, 2019
