POSIX Regular Expressions

POSIX regular expressions are a standardization of the capabilities of the regular expression engines used in grep and AWK. Two flavors are standardized: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE).

POSIX introduced "bracket expressions", which are a special kind of character class. POSIX bracket expressions match one character out of a set of characters, just like regular character classes. They use the same syntax with square brackets. A hyphen creates a range, and a caret at the start negates the bracket expression.

One key syntactic difference is that the backslash is NOT a metacharacter in a POSIX bracket expression. Unlike in Perl-compatible regular expressions (PCRE), in POSIX the regular expression [\d] matches a \ or a d. You need to use [[:digit:]] to achieve the same effect, which is frustrating. To match a ], put it as the first character after the opening [ or the negating ^. To match a -, put it right before the closing ]. To match a ^, put it before the final literal - or the closing ]. Put together, []\d^-] matches ], \, d, ^ or -.

The main purpose of the bracket expressions is that they adapt to the user's or application's locale. A locale is a collection of rules and settings that describe language and cultural conventions, like sort order, date format, etc. The POSIX standard also defines these locales.

POSIX-compliant regular expression engines should implement POSIX bracket expressions. Some non-POSIX regex engines also support POSIX character classes, but usually don't support collating sequences and character equivalents. Regular expression engines that support Unicode use Unicode properties and scripts to provide functionality similar to POSIX bracket expressions. In Unicode regex engines, shorthand character classes like \w normally match all relevant Unicode characters, alleviating the need to use locales.

Character Classes

Don't confuse the POSIX term "character class" with what is normally called a regular expression character class. [x-z0-9] is an example of what we call a "character class" and POSIX calls a "bracket expression". [:digit:] is a POSIX character class, used inside a bracket expression like [x-z[:digit:]]. These two regular expressions match exactly the same: a single character that is either x, y, z or a digit. The class names must be written all lowercase.
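As a rough illustration of the claim that [x-z[:digit:]] and [x-z0-9] match the same characters, the sketch below rewrites POSIX class names into ASCII ranges so the result can be tried in Python's re module, which does not itself understand [:digit:]. The function name expand_posix_classes and the mapping table are assumptions made for this example. It covers only a handful of classes, assumes the ASCII equivalents, and does not verify that a class name really sits inside a bracket expression.

```python
import re

# ASCII equivalents for a few POSIX class names (C/POSIX locale assumed).
# The dictionary and function name are illustrative, not a standard API.
ASCII_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "space": r" \t\r\n\v\f",
    "upper": "A-Z",
    "xdigit": "A-Fa-f0-9",
}


def expand_posix_classes(pattern):
    """Replace each [:name:] with its ASCII range text.

    Class names must be lowercase; unknown names raise KeyError.
    """
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: ASCII_CLASSES[m.group(1)],
                  pattern)


expanded = expand_posix_classes("[x-z[:digit:]]")
print(expanded)                            # [x-z0-9]
print(bool(re.fullmatch(expanded, "7")))   # True
print(bool(re.fullmatch(expanded, "a")))   # False
```

Because the replacement happens on the pattern text, the negated form [^x-z[:digit:]] is handled the same way and becomes [^x-z0-9].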

POSIX bracket expressions can be negated. [^x-z[:digit:]] matches a single character that is not x, y, z or a digit. A major difference between POSIX bracket expressions and the character classes in other regex flavors is that POSIX bracket expressions treat the backslash as a literal character. This means you can't use backslashes to escape the closing bracket (]), the caret (^) and the hyphen (-). To include a caret, place it anywhere except right after the opening bracket. [x^] matches an x or a caret. You can put the closing bracket right after the opening bracket, or the negating caret. []x] matches a closing bracket or an x. [^]x] matches any character that is not a closing bracket or an x. The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match an x or a hyphen.
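The placement rules for ], ^, - and the backslash can be sketched as a small parser. The function parse_bracket below is hypothetical and only meant to make the rules concrete: it handles literal characters and ranges in code-point order (as in the C locale) and deliberately does not support [:class:], [.coll.] or [=equiv=] elements.

```python
def parse_bracket(expr):
    """Return (negated, members) for a simple POSIX bracket expression.

    Sketch only: literals and code-point ranges, no character classes,
    collating elements or equivalence classes.
    """
    if len(expr) < 3 or expr[0] != "[" or expr[-1] != "]":
        raise ValueError("not a bracket expression: %r" % expr)
    body = expr[1:-1]
    # A caret is the negation operator only right after the opening bracket.
    negated = body.startswith("^")
    if negated:
        body = body[1:]
    if not body:
        raise ValueError("empty bracket expression: %r" % expr)
    members = set()
    i = 0
    while i < len(body):
        c = body[i]
        # A closing bracket is literal only in the first position.
        if c == "]" and i > 0:
            raise ValueError("']' is only literal in first position")
        # A hyphen between two characters forms a range; a hyphen that is
        # first or last in the body is a literal.
        if i + 2 < len(body) and body[i + 1] == "-":
            lo, hi = ord(c), ord(body[i + 2])
            if lo > hi:
                raise ValueError("reversed range in %r" % expr)
            members.update(chr(n) for n in range(lo, hi + 1))
            i += 3
        else:
            members.add(c)  # a backslash lands here too: it is not an escape
            i += 1
    return negated, members


negated, members = parse_bracket(r"[]\d^-]")
print(negated, sorted(members))   # False ['-', '\\', ']', '^', 'd']
```

The same function gives identical results for [-x] and [x-], and reports [^]x] as a negated set containing ] and x.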

Exactly which POSIX character classes are available depends on the POSIX locale. The following are usually supported, often also by regex engines that don't support POSIX itself. I've also indicated equivalent character classes that you can use in ASCII and Unicode regular expressions if the POSIX classes are unavailable. Some classes also have Perl-style shorthand equivalents.

Java does not support POSIX bracket expressions, but does support POSIX character classes using the \p operator. Though the \p syntax is borrowed from the syntax for Unicode properties, the POSIX classes in Java only match ASCII characters as indicated below. The class names are case sensitive. Unlike the POSIX syntax which can only be used inside a bracket expression, Java's \p can be used inside and outside bracket expressions.

POSIX	Description	ASCII	Unicode	Shorthand	Java
`[:alnum:]`	Alphanumeric characters	`[a-zA-Z0-9]`	`[\p{L&}\p{Nd}]`		`\p{Alnum}`
`[:alpha:]`	Alphabetic characters	`[a-zA-Z]`	`\p{L&}`		`\p{Alpha}`
`[:ascii:]`	ASCII characters	`[\x00-\x7F]`	`\p{InBasicLatin}`		`\p{ASCII}`
`[:blank:]`	Space and tab	`[ \t]`	`[\p{Zs}\t]`		`\p{Blank}`
`[:cntrl:]`	Control characters	`[\x00-\x1F\x7F]`	`\p{Cc}`		`\p{Cntrl}`
`[:digit:]`	Digits	`[0-9]`	`\p{Nd}`	`\d`	`\p{Digit}`
`[:graph:]`	Visible characters (i.e. anything except spaces, control characters, etc.)	`[\x21-\x7E]`	`[^\p{Z}\p{C}]`		`\p{Graph}`
`[:lower:]`	Lowercase letters	`[a-z]`	`\p{Ll}`		`\p{Lower}`
`[:print:]`	Visible characters and spaces (i.e. anything except control characters, etc.)	`[\x20-\x7E]`	`\P{C}`		`\p{Print}`
`[:punct:]`	Punctuation and symbols	``[!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]``	`[\p{P}\p{S}]`		`\p{Punct}`
`[:space:]`	All whitespace characters, including line breaks	`[ \t\r\n\v\f]`	`[\p{Z}\t\r\n\v\f]`	`\s`	`\p{Space}`
`[:upper:]`	Uppercase letters	`[A-Z]`	`\p{Lu}`		`\p{Upper}`
`[:word:]`	Word characters (letters, numbers and underscores)	`[A-Za-z0-9_]`	`[\p{L}\p{N}\p{Pc}]`	`\w`
`[:xdigit:]`	Hexadecimal digits	`[A-Fa-f0-9]`	`[A-Fa-f0-9]`		`\p{XDigit}`

Collating Sequences

A POSIX locale can have collating sequences to describe how certain characters or groups of characters should be ordered. E.g. in Spanish, ll like in tortilla is treated as one character, and is ordered between l and m in the alphabet. You can use the collating sequence element [.span-ll.] inside a bracket expression to match ll. E.g. the regex torti[[.span-ll.]]a matches tortilla. Notice the double square brackets. One pair for the bracket expression, and one pair for the collating sequence.

I do not know of any regular expression engine that supports collating sequences, other than POSIX-compliant engines that are part of a POSIX-compliant system.

Note that a fully POSIX-compliant regex engine will treat ll as a single character when the locale is set to Spanish. This means that torti[^x]a also matches tortilla. [^x] matches a single character that is not an x, which includes ll in the Spanish POSIX locale.

In any other regular expression engine, or in a POSIX engine not using the Spanish locale, torti[^x]a will match the misspelled word tortila but will not match tortilla, as [^x] cannot match the two characters ll.
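Python's re module is one such non-POSIX engine with no notion of collating elements, so it can be used to check the behavior described above. This is only a quick sanity check in one engine, not a statement about POSIX engines themselves.

```python
import re

# [^x] matches exactly one character here, so it cannot cover "ll".
pattern = re.compile(r"torti[^x]a")

print(bool(pattern.fullmatch("tortila")))    # True
print(bool(pattern.fullmatch("tortilla")))   # False
```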

Finally, note that not all regex engines claiming to implement POSIX regular expressions actually have full support for collating sequences. Sometimes, these engines use the regular expression syntax defined by POSIX, but don't have full locale support. You may want to try the above matches to see if the engine you're using does. E.g. Tcl's regexp command supports collating sequences, but Tcl only supports the Unicode locale, which does not define any collating sequences. The result is that in Tcl, a collating sequence specifying a single character will match just that character, and all other collating sequences will result in an error.

Character Equivalents

A POSIX locale can define character equivalents that indicate that certain characters should be considered as identical for sorting. E.g. in French, accents are ignored when ordering words. élève comes before être which comes before événement. é, è and ê are all treated the same as e, but l comes before t which comes before v. With the locale set to French, a POSIX-compliant regular expression engine will match e, é, è and ê when you use the character equivalent [=e=] in the bracket expression [[=e=]].

If a character does not have any equivalents, the character equivalence token simply reverts to the character itself. E.g. [[=x=][=z=]] is the same as [xz] in the French locale.
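In an engine without locale support, the effect of [[=e=]] can be roughly approximated by comparing letters after their accents have been stripped. The sketch below uses Unicode canonical decomposition from Python's unicodedata module, which is a different mechanism from POSIX locale data and will not agree with every locale. The helper names base_form and in_equivalence_class are made up for this example.

```python
import unicodedata


def base_form(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def in_equivalence_class(ch, base):
    """Rough stand-in for [[=base=]]: same letter once accents are removed."""
    return base_form(ch) == base


# Sorting on the accent-free form reproduces the French ordering above.
words = ["être", "événement", "élève"]
print(sorted(words, key=base_form))

# e, é, è and ê fall into the class of "e"; x does not.
print([c for c in "eéèêx" if in_equivalence_class(c, "e")])

# A character without equivalents is only equivalent to itself.
print(in_equivalence_class("x", "x"), in_equivalence_class("z", "x"))
```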

Like collating sequences, POSIX character equivalents are not available in any regex engine that I know of, other than those following the POSIX standard. And those that do may not have the necessary POSIX locale support.

Here too, Tcl's regexp command supports character equivalents, but the Unicode locale, the only one Tcl supports, does not define any. This effectively means that [[=x=]] and [x] are exactly the same in Tcl and will only match x, and the same holds for any other character you try in place of "x".

Basic regular expression

The Basic Regular Expressions or BRE flavor is essentially the same as used by the traditional grep command. This is pretty much the oldest regular expression flavor still in use today. One thing that sets this flavor apart is that most metacharacters require a backslash to give the metacharacter its special meaning. Most other flavors, including POSIX ERE, use a backslash to suppress the meaning of metacharacters. Using a backslash to escape a character that is never a metacharacter has undefined results under POSIX and is an error in some implementations.

A BRE supports POSIX bracket expressions, which are similar to character classes in other regex flavors, with a few special features. Shorthands are not supported. Other features using the usual metacharacters are the dot to match any character except a line break, the caret and dollar to match the start and end of the string, and the star to repeat the token zero or more times. To match any of these characters literally, escape them with a backslash.

The other BRE metacharacters require a backslash to give them their special meaning. The reason is that the oldest versions of UNIX grep did not support these. The developers of grep wanted to keep it compatible with existing regular expressions, which may use these characters as literal characters. The BRE a{1,2} matches a{1,2} literally, while a\{1,2\} matches a or aa. Some implementations support \? and \+ as an alternative syntax to \{0,1\} and \{1,\}, but \? and \+ are not part of the POSIX standard. Tokens can be grouped with \( and \). Backreferences are the usual \1 through \9. Only up to 9 groups are permitted. E.g. \(ab\)\1 matches abab, while (ab)\1 is invalid since there's no capturing group corresponding to the backreference \1. Use \\1 to match \1 literally.
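The inverted escaping can be made concrete by translating a BRE into the Perl-style syntax of Python's re module, where the roles of escaped and unescaped braces and parentheses are swapped. The function bre_to_python is a hypothetical helper written for this example. It assumes the BRE contains no bracket expressions, ignores the special case of a leading star, and treats \? and \+ as plain escaped literals since they are not part of POSIX.

```python
import re


def bre_to_python(bre):
    """Translate a simple POSIX BRE into an equivalent Python re pattern."""
    out = []
    i = 0
    while i < len(bre):
        c = bre[i]
        if c == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            if nxt in "(){}":
                out.append(nxt)        # \( \) \{ \} are operators in BRE
            else:
                out.append(c + nxt)    # \1..\9, \\, \. etc. pass through
            i += 2
        elif c in "(){}+?|":
            out.append("\\" + c)       # literal in BRE, special in Python
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)


def bre_fullmatch(bre, text):
    return re.fullmatch(bre_to_python(bre), text) is not None


print(bre_fullmatch(r"a\{1,2\}", "aa"))       # True
print(bre_fullmatch(r"a{1,2}", "a{1,2}"))     # True
print(bre_fullmatch(r"\(ab\)\1", "abab"))     # True
print(bre_fullmatch(r"\\1", "\\1"))           # True
```

Translating (ab)\1 yields a pattern with a backreference but no group, which Python rejects with re.error, mirroring the invalid BRE.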

POSIX BRE does not support any other features. Even alternation is not supported.

A regular expression pattern uses wildcard characters to represent one or more characters in the data stream. There are plenty of instances in Linux where you can specify a wildcard character to represent data you don't know about. You've already seen an example of using wildcard characters with the Linux ls command for listing files and directories.

The filename patterns used with ls (which are expanded by the shell rather than by ls itself) form an even more limited pattern language, in which ? is used instead of the dot to match a single character.

Extended regular expression

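Extended regular expressions are mainly used in egrep (now written grep -E; GNU grep -P, which uses Perl-compatible syntax rather than ERE, is often preferred these days), sed (with the -E option; plain sed uses BRE) and AWK.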

A regular expression, often called a pattern, is an expression used to specify a set of strings required for a particular purpose. A simple way to specify a finite set of strings is to list its elements or members. However, there are often more concise ways to specify the desired set of strings. For example, the set containing the three strings "Handel", "Händel", and "Haendel" can be specified by the pattern H(ä|ae?)ndel; we say that this pattern matches each of the three strings.

In most formalisms, if there exists at least one regular expression that matches a particular set, then there exists an infinite number of other regular expressions that also match it; the specification is not unique. Most formalisms provide the following operations to construct regular expressions.

Boolean "or"

A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".

Grouping

Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" or "grey".
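The effect of grouping on the scope of the vertical bar can be checked in any engine that shares this syntax. The snippet below uses Python's re module as a stand-in for an ERE engine, and adds a third pattern, gra|ey, which is not from the text above, to show what happens when the alternation is written without parentheses. The helper name matches is an assumption of this example.

```python
import re


def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


for word in ["gray", "grey", "gra", "ey"]:
    print(word,
          matches("gray|grey", word),
          matches("gr(a|e)y", word),
          matches("gra|ey", word))
# gray True True False
# grey True True False
# gra False False True
# ey False False True
```

Without parentheses the bar splits the entire pattern, so gra|ey means "gra" or "ey" rather than a choice between the two vowels.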

Quantification

A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are the question mark ?, the asterisk * (derived from the Kleene star), and the plus sign + (Kleene plus).

? The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".

* The asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.

+ The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".

{n} The preceding item is matched exactly n times.

{min,} The preceding item is matched min or more times.

{min,max} The preceding item is matched at least min times, but not more than max times.
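The quantifiers listed above behave the same way in Python's re module, so they can be tried out there. The sample words for the brace quantifiers are my own additions, and the helper name accepted is hypothetical.

```python
import re

# Each pattern is paired with a few candidate strings to test against.
cases = {
    "colou?r": ["color", "colour", "colouur"],
    "ab*c": ["ac", "abc", "abbbc"],
    "ab+c": ["ac", "abc", "abbbc"],
    "ab{2}c": ["abc", "abbc", "abbbc"],
    "ab{2,}c": ["abc", "abbc", "abbbc"],
    "ab{1,2}c": ["ac", "abc", "abbc", "abbbc"],
}


def accepted(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]


for pattern, words in cases.items():
    print(pattern, accepted(pattern, words))
# colou?r ['color', 'colour']
# ab*c ['ac', 'abc', 'abbbc']
# ab+c ['abc', 'abbbc']
# ab{2}c ['abbc']
# ab{2,}c ['abbc', 'abbbc']
# ab{1,2}c ['abc', 'abbc']
```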

These constructions can be combined to form arbitrarily complex expressions, much like one can construct arithmetical expressions from numbers and the operations +, −, ×, and ÷. For example, H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns which match the same strings as the earlier example, H(ä|ae?)ndel.
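That the three Handel patterns describe the same set can be spot-checked by running each against a list of candidate strings. This uses Python's re module as a convenient engine; the non-matching candidates are my own additions, and a finite check like this only illustrates the equivalence rather than proving it.

```python
import re

patterns = ["H(ä|ae?)ndel", "H(ae?|ä)ndel", "H(a|ae|ä)ndel"]
candidates = ["Handel", "Händel", "Haendel", "Hndel", "Haeendel", "Hendel"]


def accepted_by(pattern):
    """Return the candidates that the pattern matches in full."""
    return [w for w in candidates if re.fullmatch(pattern, w)]


results = [accepted_by(p) for p in patterns]
for pattern, result in zip(patterns, results):
    print(pattern, result)

# All three patterns accept exactly the same candidates.
print(results[0] == results[1] == results[2])   # True
```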

The precise syntax for regular expressions varies among tools and with context.
