POSIX Regular Expressions
POSIX Bracket Expressions
POSIX regular expressions are a standardization of the capabilities of the regular expression engines used
in grep and AWK. Two engines are standardized:
- The POSIX Basic Regular Expression (BRE) engine. This is the default in many command-line tools. It is also
used in grep when it is invoked without any option (a bad idea). grep -P (a GNU extension) implements
Perl-compatible regular expressions, which are a step up from ERE, to say nothing of BRE,
and arguably should be the new default. Meanwhile you can create an alias to avoid using "classic" grep.
- The POSIX Extended Regular Expression (ERE) engine. This is a slight generalization of the
regular expression engine used in
AWK. GNU AWK can be viewed as the reference implementation. grep with the option -E
(or called as egrep) uses this engine. Again, it does not make sense to learn its idiosyncrasies.
Use grep -P instead.
POSIX introduced "bracket expressions", which are a special kind of
character class. POSIX
bracket expressions match one character out of a set of characters, just like regular character classes.
They use the same syntax with square brackets. A hyphen creates a range, and a caret at the start negates
the bracket expression.
One key syntactic difference is that the backslash is NOT a metacharacter in a POSIX bracket expression.
Unlike in Perl-compatible regular expressions (PCRE), in POSIX the regular expression [\d] matches a \ or a d.
You need to use [[:digit:]] to match a digit, which is frustrating. To match a
], put it as the first character after the opening [ or the negating ^. To
match a -, put it right before the closing ]. To match a ^, put it before
the final literal - or the closing ]. Put together, []\d^-] matches ],
\, d, ^ or -.
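The contrast with Perl-style engines can be checked with a short sketch. It assumes Python, whose re module is a Perl-style engine and not a POSIX one; Python is not mentioned in the original text and is used here only for illustration. In such an engine [\d] matches a digit, and the POSIX set []\d^-] has to be spelled with escapes.

```python
import re

# Python's re module is a Perl-style engine, NOT a POSIX one:
# inside a character class the backslash is still an escape character.
perl_style_digit = re.compile(r"[\d]")

# Perl-style spelling of the POSIX bracket expression []\d^-]:
# the ] and the \ must be escaped; ^ is not first and - is last,
# so both of those are literal without any escaping.
posix_set_in_perl_style = re.compile(r"[\]\\d^-]")

sample = r"]\d^-7x"

digits_found = perl_style_digit.findall(sample)
literals_found = posix_set_in_perl_style.findall(sample)

print(digits_found)
print(literals_found)
```

In a POSIX engine the unescaped form []\d^-] would select the same five literal characters that the second pattern selects here.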
The main purpose of the bracket expressions is that they adapt to the user's or application's locale.
A locale is a collection of rules and settings that describe language and cultural conventions, like
sort order, date format, etc. The POSIX standard also defines these locales.
POSIX-compliant
regular expression engines should implement POSIX bracket expressions. Some non-POSIX
regex engines also support POSIX character classes, but usually don't support collating sequences and character
equivalents. Regular expression engines that support
Unicode use Unicode properties
and scripts to provide functionality similar to POSIX bracket expressions. In Unicode regex engines,
shorthand character
classes like \w normally match all relevant Unicode characters, alleviating the need to
use locales.
Character Classes
Don't confuse the POSIX term "character class" with what is normally called a
regular expression character
class. [x-z0-9] is an example of what we call a "character class" and POSIX calls a "bracket
expression". [:digit:] is a POSIX character class, used inside a bracket expression like
[x-z[:digit:]]. These two regular expressions match exactly the same: a single character that
is either x, y, z or a digit. The class names must be written all lowercase.
POSIX bracket expressions can be negated. [^x-z[:digit:]] matches a single character that
is not x, y, z or a digit. A major difference between POSIX bracket expressions and the character classes
in other regex flavors is that POSIX bracket expressions treat the backslash as a literal character.
This means you can't use backslashes to escape the closing bracket (]), the caret (^) and the hyphen
(-). To include a caret, place it anywhere except right after the opening bracket. [x^] matches
an x or a caret. You can put the closing bracket right after the opening bracket, or the negating caret.
[]x] matches a closing bracket or an x. [^]x] matches any character that is not a
closing bracket or an x. The hyphen can be included right after the opening bracket, or right before
the closing bracket, or right after the negating caret. Both [-x] and [x-] match an
x or a hyphen.
Exactly which POSIX character classes are available depends on the POSIX locale. The following are
usually supported, often also by regex engines that don't support POSIX itself. I've also indicated
equivalent character classes that you can use in ASCII and
Unicode regular expressions
if the POSIX classes are unavailable. Some classes also have Perl-style
shorthand
equivalents.
Java does not support POSIX
bracket expressions, but does support POSIX character classes using the \p operator. Though
the \p syntax is borrowed from the syntax for
Unicode properties, the
POSIX classes in Java only match ASCII characters as indicated below. The class names are case sensitive.
Unlike the POSIX syntax which can only be used inside a bracket expression, Java's \p can be
used inside and outside bracket expressions.
POSIX | Description | ASCII | Unicode | Shorthand | Java
[:alnum:] | Alphanumeric characters | [a-zA-Z0-9] | [\p{L&}\p{Nd}] | | \p{Alnum}
[:alpha:] | Alphabetic characters | [a-zA-Z] | \p{L&} | | \p{Alpha}
[:ascii:] | ASCII characters | [\x00-\x7F] | \p{InBasicLatin} | | \p{ASCII}
[:blank:] | Space and tab | [ \t] | [\p{Zs}\t] | | \p{Blank}
[:cntrl:] | Control characters | [\x00-\x1F\x7F] | \p{Cc} | | \p{Cntrl}
[:digit:] | Digits | [0-9] | \p{Nd} | \d | \p{Digit}
[:graph:] | Visible characters (i.e. anything except spaces, control characters, etc.) | [\x21-\x7E] | [^\p{Z}\p{C}] | | \p{Graph}
[:lower:] | Lowercase letters | [a-z] | \p{Ll} | | \p{Lower}
[:print:] | Visible characters and spaces (i.e. anything except control characters, etc.) | [\x20-\x7E] | \P{C} | | \p{Print}
[:punct:] | Punctuation and symbols | [!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~] | [\p{P}\p{S}] | | \p{Punct}
[:space:] | All whitespace characters, including line breaks | [ \t\r\n\v\f] | [\p{Z}\t\r\n\v\f] | \s | \p{Space}
[:upper:] | Uppercase letters | [A-Z] | \p{Lu} | | \p{Upper}
[:word:] | Word characters (letters, numbers and underscores) | [A-Za-z0-9_] | [\p{L}\p{N}\p{Pc}] | \w |
[:xdigit:] | Hexadecimal digits | [A-Fa-f0-9] | [A-Fa-f0-9] | | \p{XDigit}
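When an engine lacks the POSIX classes, the ASCII column of the table can be substituted mechanically. The following is a minimal sketch of that idea in Python (an assumption of this example, not something the original text uses). The helper name expand_posix_classes is hypothetical, only a subset of the table is included, and the helper does not check that the [:name:] token really sits inside a bracket expression.

```python
import re

# ASCII equivalents copied from the table above (a subset, for brevity).
# Each value is the *content* of a class, written without outer brackets.
ASCII_EQUIVALENTS = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "A-Fa-f0-9",
}


def expand_posix_classes(pattern):
    """Replace [:name:] tokens with ASCII equivalents (hypothetical helper)."""

    def replace(match):
        name = match.group(1)
        if name not in ASCII_EQUIVALENTS:
            raise ValueError("unsupported POSIX class: " + name)
        return ASCII_EQUIVALENTS[name]

    # Naive: does not verify that the token is inside a bracket expression.
    return re.sub(r"\[:([a-z]+):\]", replace, pattern)


expanded = expand_posix_classes("[x-z[:digit:]]")
negated = expand_posix_classes("[^x-z[:digit:]]")
found = re.findall(expanded, "w x 5 Z")

print(expanded, negated, found)
```

The expansion is ASCII-only, so unlike a real POSIX class it does not adapt to the locale.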
Collating Sequences
A POSIX locale can have collating sequences to describe how certain characters or groups of characters
should be ordered. E.g. in Spanish, ll like in tortilla is treated as one character,
and is ordered between l and m in the alphabet. You can use the collating sequence
element [.span-ll.] inside a bracket expression to match ll. E.g. the regex torti[[.span-ll.]]a
matches tortilla. Notice the double square brackets. One pair for the bracket expression, and
one pair for the collating sequence.
I do not know of any regular expression engine that supports collating sequences, other than POSIX-compliant
engines that are part of a POSIX-compliant system.
Note that a fully POSIX-compliant regex engine will treat ll as a single character when
the locale is set to Spanish. This means that torti[^x]a also matches tortilla.
[^x] matches a single character that is not an x, which includes ll in the
Spanish POSIX locale.
In any other regular expression engine, or in a POSIX engine not using the Spanish locale, torti[^x]a
will match the misspelled word tortila but will not match tortilla, as [^x]
cannot match the two characters ll.
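The behavior of an engine without collating support can be tried in a few lines. This sketch assumes Python's re module, which is not a POSIX engine and has no locale collating elements, so it shows only the "any other regular expression engine" case described above.

```python
import re

# Python's re has no locale collating elements, so [^x] always consumes
# exactly one character; "ll" can never count as a single character here.
pattern = re.compile(r"torti[^x]a")

matches_tortila = pattern.fullmatch("tortila") is not None
matches_tortilla = pattern.fullmatch("tortilla") is not None

print(matches_tortila, matches_tortilla)
```

Running the same two matches against the engine you are evaluating is a quick way to see whether it treats ll as one collating element.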
Finally, note that not all regex engines claiming to implement POSIX regular expressions actually
have full support for collating sequences. Sometimes, these engines use the regular expression syntax
defined by POSIX, but don't have full locale support. You may want to try the above matches to see if
the engine you're using does. E.g.
Tcl's regexp command supports
collating sequences, but Tcl only supports the Unicode locale, which does not define any collating sequences.
The result is that in Tcl, a collating sequence specifying a single character will match just that character,
and all other collating sequences will result in an error.
Character Equivalents
A POSIX locale can define character equivalents that indicate that certain characters should be considered
as identical for sorting. E.g. in French, accents are ignored when ordering words. élève comes
before être which comes before événement. é and ê are both treated the same
as e, but l comes before t which comes before v. With the locale
set to French, a POSIX-compliant regular expression engine will match e, é, è
and ê when you use the character equivalent [=e=] in the bracket expression [[=e=]].
If a character does not have any equivalents, the character equivalence token simply reverts to the
character itself. E.g. [[=x=][=z=]] is the same as [xz] in the French locale.
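In engines without [=e=], a rough approximation is to compare base letters after Unicode decomposition. Note that this is a swapped-in technique, not what a POSIX locale does: it relies on Unicode NFD data rather than locale definitions, so the set of "equivalent" characters can differ. The sketch assumes Python, and the helper names base_letter and matches_equivalent are hypothetical.

```python
import unicodedata


def base_letter(char):
    """Decompose the character (NFD) and keep only its first code point."""
    return unicodedata.normalize("NFD", char)[0]


def matches_equivalent(char, base):
    """Rough stand-in for the POSIX token [=base=]; it is not locale-aware."""
    return base_letter(char) == base


# e, e-acute, e-grave, e-circumflex, x, z (escapes keep the source ASCII-only)
candidates = "e\u00e9\u00e8\u00eaxz"
accepted = [c for c in candidates if matches_equivalent(c, "e")]

print(accepted)
```

A character without accents decomposes to itself, which mirrors the fallback described above for [[=x=][=z=]].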
Like collating sequences, POSIX character equivalents are not available in any regex engine that
I know of, other than those following the POSIX standard. And those that do may not have the necessary
POSIX locale support.
Here too,
Tcl's regexp command supports character equivalents, but the Unicode locale, the only one Tcl supports,
does not define any character equivalents. This effectively means that [[=x=]] and [x]
are exactly the same in Tcl and will only match x, whichever character you use in place of
"x".
Basic Regular Expressions
The Basic Regular Expressions or BRE flavor is essentially the same as used
by the traditional grep command. This is pretty much the oldest regular
expression flavor still in use today. One thing that sets this flavor apart is
that most metacharacters require a backslash to give the metacharacter its
special meaning. Most other flavors, including POSIX ERE, use a backslash to suppress the
meaning of metacharacters. Using a backslash to escape a character that is never
a metacharacter is an error.
A BRE supports POSIX bracket expressions, which are similar to character classes in other
regex flavors, with a few special features. Shorthands are not supported. Other
features using the usual metacharacters are the dot to match any character except a line break,
the caret and dollar to match the start and end of the string, and the star to repeat the
token zero or more times. To match any of these characters literally, escape
them with a backslash.
The other BRE metacharacters require a backslash to give them their special
meaning. The reason is that the oldest versions of UNIX grep did not support
these. The developers of grep wanted to keep it compatible with existing regular
expressions, which may use these characters as literal characters. The BRE a{1,2} matches
a{1,2} literally, while a\{1,2\} matches a or aa. Some implementations support \? and \+
as an alternative syntax to \{0,1\} and \{1,\}, but \? and \+ are not part of the POSIX
standard. Tokens can be grouped with \( and \). Backreferences are the usual \1 through
\9. Only up to 9 groups are permitted. E.g. \(ab\)\1 matches abab, while (ab)\1
is invalid since there's no capturing group corresponding to the backreference
\1. Use \\1 to match \1 literally.
POSIX BRE does not support any other features. Even alternation is not supported.
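The inverted role of the backslash can be made concrete with a small translator from BRE syntax to Perl-style syntax. This is only a sketch under stated assumptions: it is written in Python (not used in the original text), the helper name bre_to_python is hypothetical, the input must not contain bracket expressions, and BRE corner cases such as a leading star or the non-POSIX \? and \+ extensions are not handled.

```python
import re

# Characters whose special and literal roles are swapped between
# POSIX BRE and Perl-style syntax.
SWAPPED = set("(){}")
# Characters that are special in Perl-style syntax but literal in POSIX BRE.
LITERAL_IN_BRE = set("|+?")


def bre_to_python(bre):
    """Translate a simple POSIX BRE into Python re syntax (hypothetical helper).

    Assumption: the pattern contains no bracket expressions.
    """
    out = []
    i = 0
    while i < len(bre):
        ch = bre[i]
        if ch == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            # \( \) \{ \} are operators in BRE: drop the backslash.
            # Everything else (\1, \., \\ and so on) is kept as written.
            out.append(nxt if nxt in SWAPPED else ch + nxt)
            i += 2
        else:
            # Bare ( ) { } | + ? are literals in BRE: escape them.
            needs_escape = ch in SWAPPED or ch in LITERAL_IN_BRE
            out.append("\\" + ch if needs_escape else ch)
            i += 1
    return "".join(out)


literal = bre_to_python(r"a{1,2}")
interval = bre_to_python(r"a\{1,2\}")
backref = bre_to_python(r"\(ab\)\1")

print(literal, interval, backref)
```

The three translated patterns correspond to the examples in the paragraph above: the first matches a{1,2} literally, the second matches a or aa, and the third matches abab.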
A regular expression pattern makes use of wildcard characters to represent one or more characters
in the data stream. There are plenty of instances in Linux where you can specify a wildcard character
to represent data you don't know about. A familiar example is using wildcard characters
with the Linux ls command for listing files and directories.
These file-name wildcards are expanded by the shell rather than by ls, and they form an even more
limited pattern language in which ? is used instead of the dot to match any single character.
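The correspondence between the wildcard ? and the regular expression dot can be illustrated as follows. This sketch assumes Python, whose standard fnmatch module implements shell-style wildcard matching; the file names are made up for the example.

```python
import fnmatch
import re

# In a shell wildcard (glob), ? stands for exactly one character.
# A regular expression spells the same thing with a dot, and a literal
# dot then has to be escaped.
glob_pattern = "file?.txt"
regex_pattern = r"file.\.txt"

names = ["file1.txt", "fileA.txt", "file10.txt", "file.txt"]

by_glob = [n for n in names if fnmatch.fnmatchcase(n, glob_pattern)]
by_regex = [n for n in names if re.fullmatch(regex_pattern, n)]

print(by_glob)
print(by_regex)
```

Both patterns select the same names here, but the wildcard language has no counterpart for groups, backreferences or intervals.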
Extended Regular Expressions
Extended regular expressions are mainly used in egrep (although the use
of grep -P is now preferable), sed (with the -E option) and AWK.
From Wikipedia
A regular expression, often called a pattern, is an expression used to specify a
set of strings required for a particular purpose. A simple way to specify a finite
set of strings is to list its elements or members. However, there are often more concise
ways to specify the desired set of strings. For example, the set containing the three strings
"Handel", "Händel", and "Haendel" can be specified by the pattern H(ä|ae?)ndel; we say that
this pattern matches each of the three strings.
In most formalisms, if there exists at least one regular expression that matches a particular
set then there exists an infinite number of other regular expressions that also match
it; the specification is not unique. Most formalisms provide the following operations
to construct regular expressions.
- Boolean "or"
  A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".
- Grouping
  Parentheses are used to define the scope and precedence of the operators (among other uses).
  For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set
  of "gray" or "grey".
- Quantification
  A quantifier after a token (such as a character) or group specifies how often that preceding
  element is allowed to occur. The most common quantifiers are the question mark ?, the
  asterisk * (derived from the Kleene star), and the plus sign + (Kleene plus).
  ? | The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".
  * | The asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.
  + | The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".
  {n} | The preceding item is matched exactly n times.
  {min,} | The preceding item is matched min or more times.
  {min,max} | The preceding item is matched at least min times, but not more than max times.
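The operations listed above can be tried out directly. The sketch below assumes Python's re module, which writes these operators the same way as ERE (without backslashes); the helper name matching and the sample strings are illustrative only.

```python
import re


def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


alternation = matching(r"gray|grey", ["gray", "grey", "groy"])
grouping = matching(r"gr(a|e)y", ["gray", "grey", "groy"])
optional = matching(r"colou?r", ["color", "colour", "colouur"])
star = matching(r"ab*c", ["ac", "abc", "abbbc"])
plus = matching(r"ab+c", ["ac", "abc", "abbbc"])
bounded = matching(r"a{2,3}", ["a", "aa", "aaa", "aaaa"])

for result in (alternation, grouping, optional, star, plus, bounded):
    print(result)
```

fullmatch is used so that each candidate must match from start to end, which keeps the quantifier limits visible in the results.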
These constructions can be combined to form arbitrarily complex expressions, much
like one can construct arithmetical expressions from numbers and the operations +,
−, ×, and ÷. For example, H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns which
match the same strings as the earlier example, H(ä|ae?)ndel.
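That the three Handel patterns describe the same set of strings can be checked by running each of them over a list of candidates. This is a sketch in Python (an assumption of the example); the two extra candidates "Hndel" and "Haeendel" are made up to show strings outside the set.

```python
import re

patterns = [r"H(ä|ae?)ndel", r"H(ae?|ä)ndel", r"H(a|ae|ä)ndel"]
candidates = ["Handel", "Händel", "Haendel", "Hndel", "Haeendel"]

# For every pattern, collect the set of candidates it matches in full.
accepted = [
    {c for c in candidates if re.fullmatch(p, c)}
    for p in patterns
]

for pattern, strings in zip(patterns, accepted):
    print(pattern, sorted(strings))
```

A finite list of candidates can only suggest, not prove, that two patterns are equivalent, but it is a practical sanity check when rewriting an expression.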
The precise syntax for regular expressions varies among tools and with context; more detail
is given in the Syntax section of the Wikipedia article.
Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org
was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP)
without any remuneration. This document is an industrial compilation designed and created exclusively
for educational use and is distributed under the Softpanorama Content License.
Original materials copyright belong
to respective owners. Quotes are made for educational purposes only
in compliance with the fair use doctrine.
Last modified: March 12, 2019