4.4 Quick Regular Expressions reference / Private Shell Documentation


4.4 Quick Regular Expressions reference

This page contains basic tips on using regular expressions. "Regular expression" is often abbreviated as regex or regexp (singular) or regexes (plural).

Simple word matching

The simplest regex is just a word or, more generally, a string of characters. A regex consisting of a single word matches any string that contains that word:

/World/ matches "Hello World"

In this topic we enclose a regex in slashes (e.g. /World/) to set it apart from the surrounding text. In this example, /World/ matches the second word in "Hello World".

/world/ does not match "Hello World" since regexes are case sensitive.
/o W/ matches "Hello World" since ' ' (space) is an ordinary char.
/World / does not match "Hello World" because there is no ' ' at the end.

Regexes are always matched at the earliest possible point in the string:

/o/ matches "Hello World" in 'Hello',
/hat/ matches "That hat is red" in 'That'.
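
The examples above can be checked with a short script. This is a minimal sketch that uses Python's re module as a stand-in; it is an assumption that Private Shell's regex engine behaves the same way for these simple patterns.

```python
import re

text = "Hello World"

# re.search scans the string and returns the leftmost match, or None.
found_world = re.search(r"World", text)
found_lower = re.search(r"world", text)    # case differs, so no match
found_space = re.search(r"o W", text)      # the space is an ordinary character
found_trail = re.search(r"World ", text)   # the text has no trailing space

# The earliest possible match wins.
first_o = re.search(r"o", text).start()                    # the 'o' in "Hello"
first_hat = re.search(r"hat", "That hat is red").start()   # the 'hat' in "That"

print(found_world.group(), found_lower, found_space.group(), found_trail)
print(first_o, first_hat)
```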

Not all characters can be used 'as is' in a match. Some characters, called metacharacters, are reserved for use in regex notation. The metacharacters are

{}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

/2+2/ does not match "2+2=4" since + is a metacharacter,
/2\+2/ matches "2+2=4", \+ is treated like an ordinary +.
/C:\\WIN/ matches "C:\WIN32".
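
To see the effect of escaping, here is a small sketch in Python (used only as an illustration; Private Shell itself takes the regex directly). The re.escape helper is a Python convenience and is not part of Private Shell.

```python
import re

# An unescaped '+' is a quantifier: "2+2" means one or more '2' followed by '2'.
plain = re.search(r"2+2", "2+2=4")

# With a backslash, '+' is matched literally.
escaped = re.search(r"2\+2", "2+2=4")

# A literal backslash is written as \\ inside the regex.
path = re.search(r"C:\\WIN", r"C:\WIN32")

# re.escape puts a backslash before metacharacters for us.
auto = re.escape("2+2")

print(plain, escaped.group(), path.group(), auto)
```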

Non-printable ASCII characters are represented by escape sequences. Common examples are \t for a tab, \n for a newline, and \r for a carriage return. Arbitrary bytes are represented by octal escape sequences, e.g., \033, or hexadecimal escape sequences, e.g., \x1B:

/0\t2/ matches "1000\t2000",
/\143\x61\x74/ matches "cat"; the escapes are just a roundabout way to spell cat.
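
The escape sequences can be tried out as follows. This sketch assumes Python's re module, which accepts the same \t, octal, and hexadecimal escapes inside a pattern.

```python
import re

# "\t" in the ordinary string literal is a real tab character;
# r"\t" in the raw pattern is the regex escape that matches it.
tab_match = re.search(r"0\t2", "1000\t2000")

# Octal \143 is 'c'; hexadecimal \x61 and \x74 are 'a' and 't'.
cat_match = re.search(r"\143\x61\x74", "cat")

print(repr(tab_match.group()), cat_match.group())
```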

With all of the regexes above, a match anywhere in the string counts as a match. To specify where the regex should match, we need the anchor metacharacters ^ and $. The anchor ^ means match at the beginning of the string, and the anchor $ means match at the end of the string, or before a newline at the end of the string. Some examples:

/keeper/ matches "housekeeper"
/^keeper/ does not match "housekeeper"
/keeper$/ matches "housekeeper"
/keeper$/ matches "housekeeper\n"
/^housekeeper$/ matches "housekeeper"
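
The anchor examples can be verified with a sketch like the one below. It uses Python's re module, whose ^ and $ behave as described here when no multi-line flag is set; the helper name matches is hypothetical.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

results = [
    matches(r"keeper", "housekeeper"),         # matches anywhere
    matches(r"^keeper", "housekeeper"),        # 'keeper' is not at the start
    matches(r"keeper$", "housekeeper"),        # 'keeper' is at the end
    matches(r"keeper$", "housekeeper\n"),      # $ also matches before a final newline
    matches(r"^housekeeper$", "housekeeper"),  # the whole string
]

print(results)
```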

Using character classes

A character class allows a set of possible characters, rather than just a single character, to match at a particular point in a regex. Character classes are denoted by brackets [...], with the set of characters to be possibly matched inside. Here are some examples:

/cat/ matches "cat"
/[bcr]at/ matches 'bat', 'cat', or 'rat'
/[cab]/ matches 'a' in "abc"

In the last statement, even though 'c' is the first character in the class, the earliest point at which the regex can match is 'a'.

/[yY][eE][sS]/ matches 'yes' in a case-insensitive way - 'yes', 'Yes', 'YES', etc.
/(?i)yes/ also matches 'yes' in a case-insensitive way

The last example shows a match with an (?i) modifier, which makes the match case-insensitive.
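
A brief sketch of these character-class examples, written in Python for illustration (an assumption: the (?i) modifier is placed at the very start of the pattern, which is where Python requires it).

```python
import re

# Each of 'bat', 'cat', 'rat' matches; 'hat' does not.
rhymes = re.findall(r"[bcr]at", "bat cat rat hat")

# The class order does not matter; the earliest position in the string wins.
earliest = re.search(r"[cab]", "abc").group()

words = ("yes", "Yes", "YES")
manual = [bool(re.search(r"[yY][eE][sS]", w)) for w in words]
inline = [bool(re.search(r"(?i)yes", w)) for w in words]

print(rhymes, earliest, manual, inline)
```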

Character classes also have ordinary and special characters, but the sets of ordinary and special characters inside a character class are different than those outside a character class. The special characters for a character class are -]\^ and are matched using an escape:

/[\]c]def/ matches ']def' or 'cdef'

The special character '-' acts as a range operator within character classes, so that the unwieldy [0123456789] and [abc...xyz] become the svelte [0-9] and [a-z]:

/item[0-9]/ matches 'item0' or ... or 'item9'
/[0-9a-fA-F]/ matches a hexadecimal digit

If '-' is the first or last character in a character class, it is treated as an ordinary character.

The special character ^ in the first position of a character class denotes a negated character class, which matches any character but those in the brackets. Both [...] and [^...] must match a character, or the match fails. For example:

/[^a]at/ doesn't match 'aat' or 'at', but matches all other 'bat', 'cat', '0at', '%at', etc.
/[^0-9]/ matches a non-numeric character
/[a^]at/ matches 'aat' or '^at'; here '^' is ordinary
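
The rules for special characters inside a class can be explored with the following sketch. It relies on Python's re module as a stand-in engine, and all_matches is a hypothetical helper introduced only for this illustration.

```python
import re

def all_matches(pattern, text):
    """Return every non-overlapping match of pattern in text, left to right."""
    return [m.group(0) for m in re.finditer(pattern, text)]

bracket = all_matches(r"[\]c]def", "]def cdef xdef")   # escaped ']' inside the class
items = all_matches(r"item[0-9]", "item0 item9 itemX") # '-' as a range operator
hex_digits = all_matches(r"[0-9a-fA-F]", "x1Fg")       # several ranges in one class
dash = all_matches(r"[-+]", "1-2+3")                   # leading '-' is ordinary
negated = all_matches(r"[^a]at", "aat bat 0at %at")    # leading '^' negates
caret = all_matches(r"[a^]at", "aat ^at bat")          # '^' elsewhere is ordinary

print(bracket, items, hex_digits, dash, negated, caret)
```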

There are several abbreviations for common character classes:

\d is a digit and represents [0-9],
\s is a whitespace character and represents [\ \t\r\n\f],
\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_],
\D is a negated \d; it represents any character but a digit [^0-9],
\S is a negated \s; it represents any non-whitespace character [^\s],
\W is a negated \w; it represents any non-word character [^\w],
The period '.' matches any character but "\n".

The \d\s\w\D\S\W abbreviations can be used both inside and outside of character classes. Here are some in use:

/\d\d:\d\d:\d\d/ matches a hh:mm:ss time format
/[\d\s]/ matches any digit or whitespace character
/\w\W\w/ matches a word char, followed by a non-word char, followed by a word char
/..rt/ matches any two chars, followed by 'rt'
/end\./ matches 'end.'
/end[.]/ is the same thing, matches 'end.'
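
Here is a sketch of the abbreviations in use, again with Python's re module as an assumed stand-in. Note that in Python 3 the \d, \s, and \w abbreviations also match non-ASCII characters by default; the re.ASCII flag restricts them to the sets listed above.

```python
import re

time_match = re.search(r"\d\d:\d\d:\d\d", "started at 12:34:56", re.ASCII).group()
digits_or_space = re.findall(r"[\d\s]", "a1 b2", re.ASCII)
word_nonword_word = re.search(r"\w\W\w", "a-b", re.ASCII).group()

# '.' matches any character except a newline.
two_any = re.search(r"..rt", "start").group()

# An escaped or bracketed period matches only a literal '.'.
literal_dot = re.search(r"end\.", "the end.").group()
bracket_dot = re.search(r"end[.]", "the end.").group()
no_dot = re.search(r"end\.", "endX")

print(time_match, digits_or_space, word_nonword_word, two_any, literal_dot, no_dot)
```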

The word anchor \b matches a boundary between a word character and a non-word character \w\W or \W\w:

/\bcat/ matches 'cat' in 'catenates' of "Housecat catenates house and cat"
/cat\b/ matches 'cat' in 'Housecat' of the same string
/\bcat\b/ matches the final 'cat' at the end of the same string

In the last example, the end of the string is considered a word boundary.
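
The positions at which \b lets the regex match can be printed with a sketch like this one (Python's re module is assumed to treat \b the same way as described above).

```python
import re

text = "Housecat catenates house and cat"

# Offsets are counted from 0.
start_boundary = re.search(r"\bcat", text).start()   # 'cat' in 'catenates'
end_boundary = re.search(r"cat\b", text).start()     # 'cat' in 'Housecat'
both = re.search(r"\bcat\b", text).start()           # the final word 'cat'

print(start_boundary, end_boundary, both)
```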

Matching this or that

We can match different character strings with the alternation metacharacter '|'. To match dog or cat, we form the regex /dog|cat/. As before, the regex is matched at the earliest possible point in the string. At each character position, the first alternative (dog) is tried first. If dog doesn't match, the next alternative (cat) is tried. If cat doesn't match either, then the match fails and we move to the next position in the string. Some examples:

/cat|dog|bird/ matches "cat" in "cats and dogs"
/dog|cat|bird/ matches "cat" in "cats and dogs"

Even though dog is the first alternative in the second regex, cat is able to match earlier in the string.

/c|ca|cat|cats/ matches "c" in "cats"
/cats|cat|ca|c/ matches "cats" in "cats"

At a given character position, the first alternative that allows the regex match to succeed will be the one that matches. Here, all the alternatives match at the first string position, so the first matches.
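
The ordering rules for alternation can be demonstrated with a short sketch. It assumes an engine that tries alternatives from left to right, as Python's re module does.

```python
import re

# The earliest position in the string wins, whatever the order of alternatives.
first = re.search(r"cat|dog|bird", "cats and dogs").group()
second = re.search(r"dog|cat|bird", "cats and dogs").group()

# At the same position, the first alternative that succeeds wins.
shortest_first = re.search(r"c|ca|cat|cats", "cats").group()
longest_first = re.search(r"cats|cat|ca|c", "cats").group()

print(first, second, shortest_first, longest_first)
```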

Grouping things and hierarchical matching

The grouping metacharacters () allow a part of a regex to be treated as a single unit. Parts of a regex are grouped by enclosing them in parentheses. The regex house(cat|keeper) means match house followed by either cat or keeper. Some more examples are:

/(a|b)b/ matches 'ab' or 'bb'
/(^a|b)c/ matches 'ac' at start of string or 'bc' anywhere

/house(cat|)/ matches either 'housecat' or 'house'
/house(cat(s|)|)/ matches either 'housecats' or 'housecat' or 'house'. Note groups can be nested.

/(19|20|)\d\d/ matches the null alternative '()\d\d' in "20" because '20\d\d' can't match.
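
The grouping examples can be reproduced with the sketch below. It is written in Python as an illustration only, and all_matches is a hypothetical helper that collects the full text of each match.

```python
import re

def all_matches(pattern, text):
    """Return the full text of every non-overlapping match, left to right."""
    return [m.group(0) for m in re.finditer(pattern, text)]

a_or_b = all_matches(r"(a|b)b", "ab bb cb")

# '^a' can only match at the start, so the second 'ac' is not found.
anchored = all_matches(r"(^a|b)c", "ac bc ac")

# Nested groups with empty alternatives make the suffixes optional.
houses = all_matches(r"house(cat(s|)|)", "house housecat housecats")

# '19' fails, '20' leaves no digits for \d\d, so the empty alternative is used.
year = re.search(r"(19|20|)\d\d", "20")

print(a_or_b, anchored, houses, repr(year.group(1)), year.group(0))
```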

Extracting matches

The grouping metacharacters () also allow the extraction of the parts of a string that matched. For each grouping, the part that matched inside goes into the special variables $1, $2, etc. They can be used just as ordinary variables in other Private Shell actions.

# Extract hours, minutes, seconds:
/(\d\d):(\d\d):(\d\d)/ matches the hh:mm:ss format; hours go into $1, minutes into $2, and seconds into $3.
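
As an illustration of extraction, here is a sketch in Python. Python has no $1, $2, $3 variables; the match object's group(1), group(2), and group(3) are used here as an assumed equivalent.

```python
import re

match = re.search(r"(\d\d):(\d\d):(\d\d)", "backup finished at 12:34:56")

hours = match.group(1)     # corresponds to $1
minutes = match.group(2)   # corresponds to $2
seconds = match.group(3)   # corresponds to $3

print(hours, minutes, seconds)
```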

If the groupings in a regex are nested, $1 gets the group with the leftmost opening parenthesis, $2 the next opening parenthesis, etc. For example, here is a complex regex and the matching variables indicated below it:

/(ab(cd|ef)((gi)|j))/
 1  2      34

Associated with the matching variables $1, $2, ... are the backreferences \1, \2, ... Backreferences are matching variables that can be used inside a regex:

/(\w\w\w)\s\1/ finds sequences like 'the the' in a string

$1, $2, ... should only be used outside of a regex, and \1, \2, ... only inside a regex.
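
The numbering of nested groups and the use of a backreference can be checked with the following sketch. It assumes Python's re module, where groups are numbered by their opening parenthesis in the same way and \1 has the same meaning inside a pattern.

```python
import re

# Groups are numbered by the position of their opening parenthesis.
nested = re.search(r"(ab(cd|ef)((gi)|j))", "abcdgi")
group1, group2, group3, group4 = nested.groups()

# \1 must repeat exactly what the first group matched.
doubled = re.search(r"(\w\w\w)\s\1", "Paris in the the spring")

print(group1, group2, group3, group4)
print(doubled.group(0), doubled.group(1))
```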

Matching repetitions

The quantifier metacharacters ?, *, +, and {} allow us to determine the number of repeats of a portion of a regex we consider to be a match. Quantifiers are put immediately after the character, character class, or grouping that we want to specify. They have the following meanings:

a? = match 'a' 1 or 0 times
a* = match 'a' 0 or more times, i.e., any number of times
a+ = match 'a' 1 or more times, i.e., at least once
a{n,m} = match at least n times, but not more than m times.
a{n,} = match at least n times
a{n} = match exactly n times

Here are some examples:

/[a-z]+\s+\d*/ matches a lowercase word, at least some space, and any number of digits
/(\w+)\s+\1/ matches doubled words of arbitrary length
/\d{2,4}/ matches at least 2 but not more than 4 digits; can be used to check the year in dates
/\d{4}|\d{2}/ is a better check: it matches exactly 4 or exactly 2 digits, so when anchored as /^(\d{4}|\d{2})$/ it rejects 3-digit years
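
The quantifiers can be compared side by side with a sketch like this one. It uses Python's re.fullmatch, which requires the whole string to match and so plays the role of anchoring with ^ and $; the sample strings are made up for illustration.

```python
import re

def whole(pattern, samples):
    """Report, for each sample, whether the pattern matches the entire string."""
    return [re.fullmatch(pattern, s) is not None for s in samples]

samples = ("ac", "abc", "abbc")
optional = whole(r"ab?c", samples)      # 'b' 0 or 1 times
any_number = whole(r"ab*c", samples)    # 'b' 0 or more times
at_least_one = whole(r"ab+c", samples)  # 'b' 1 or more times

years = ("99", "199", "1999", "19999")
loose = whole(r"\d{2,4}", years)        # accepts 3-digit years
strict = whole(r"\d{4}|\d{2}", years)   # rejects 3-digit years

word_space_digits = re.search(r"[a-z]+\s+\d*", "item   42").group(0)
doubled_word = re.search(r"(\w+)\s+\1", "hello hello world").group(0)

print(optional, any_number, at_least_one)
print(loose, strict, word_space_digits, doubled_word)
```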

These quantifiers will try to match as much of the string as possible, while still allowing the regex to match. So we have

/^(.*)(at)(.*)$/ matches 'the cat in the hat' and $1 = 'the cat in the h', $2 = 'at' and $3 = '' (0 matches)

The first quantifier .* grabs as much of the string as possible while still having the regex match. The second quantifier .* has no string left to match, so it matches 0 times.
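
This greedy behaviour can be observed directly. The sketch below uses Python's re module, whose quantifiers are greedy by default in the same way.

```python
import re

match = re.search(r"^(.*)(at)(.*)$", "the cat in the hat")

# The first .* takes everything up to the last possible 'at'.
before, middle, after = match.groups()

print(repr(before), repr(middle), repr(after))
```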
