|
4.4 Quick Regular Expressions reference
This page contains basic tips on using regular expressions. Regular expressions are often abbreviated as regex or regexp (singular) or regexes (plural).
Simple word matching
The simplest regex is just a word or, more generally, a string of characters. A regex consisting of a single word matches any string that contains that word:
    /World/ matches "Hello World"
In this topic we enclose a regex in two slashes (e.g. /World/) to set it apart from the surrounding text. In this example, /World/ matches the second word in "Hello World".
    /world/ does not match "Hello World", since regexes are case sensitive.
Regexes are always matched at the earliest possible point in the string:
    /o/ matches "Hello World" at the 'o' in 'Hello', not at the 'o' in 'World'.
Not all characters can be used 'as is' in a match. Some characters, called metacharacters, are reserved for use in regex notation. The metacharacters are:
    {}[]()^$.|*+?\
A metacharacter can be matched literally by putting a backslash before it:
    /2+2/ does not match "2+2=4", since + is a metacharacter.
    /2\+2/ matches "2+2=4", since \+ is treated as an ordinary +.
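The rules so far can be tried out with a short script. This is only a sketch that uses Python's `re` module as a stand-in engine (an assumption; the engine behind this page may differ in details), with the pattern written as a raw string instead of between slashes.

```python
import re

text = "Hello World"

# A plain word matches anywhere inside the string.
word_match = re.search(r"World", text)

# Matching is case sensitive, so lowercase "world" is not found.
lowercase_match = re.search(r"world", text)

# The earliest possible match wins: the 'o' in "Hello" (index 4),
# not the 'o' in "World" (index 7).
first_o = re.search(r"o", text)

# Unescaped, '+' is a metacharacter, so 2+2 looks for "22", "222", ...
unescaped = re.search(r"2+2", "2+2=4")

# Escaped, '\+' is a literal plus sign.
escaped = re.search(r"2\+2", "2+2=4")

print(word_match.group(), lowercase_match, first_o.start())
print(unescaped, escaped.group())
```

A failed search returns `None`, which is why `lowercase_match` and `unescaped` carry no match object.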
Non-printable ASCII characters are represented by escape sequences. Common examples are \t for a tab, \n for a newline, and \r for a carriage return. Arbitrary bytes are represented by octal escape sequences, e.g., \033, or hexadecimal escape sequences, e.g., \x1B:
    /0\t2/ matches "1000\t2000" (where \t in the string stands for a tab character).
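As a rough illustration of escape sequences, the sketch below again assumes Python's `re` module. The sample strings are made up for the example; `\033` and `\x1B` both denote the ESC character (decimal 27).

```python
import re

# "\t" inside a normal Python string literal is a real tab character.
text = "1000\t2000"

# In the raw pattern, the regex engine itself interprets \t as a tab.
tab_match = re.search(r"0\t2", text)

# A string that starts with the ESC character.
esc_text = "\x1b[0m"

# Octal and hexadecimal escapes name the same character.
octal_match = re.search(r"\033", esc_text)
hex_match = re.search(r"\x1B", esc_text)

print(tab_match.span(), octal_match.start(), hex_match.start())
```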
With all of the regexes above, the regex is considered a match if it matches anywhere in the string. To specify where it should match, we need the anchor metacharacters ^ and $. The anchor ^ means match at the beginning of the string, and the anchor $ means match at the end of the string, or before a newline at the end of the string. Some examples:
    /keeper/ matches "housekeeper"
    /^keeper/ does not match "housekeeper"
    /keeper$/ matches "housekeeper"
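The effect of the anchors can be checked with a few searches. This sketch assumes Python's `re` module, whose default `^` and `$` follow the same rules described above (including `$` matching before a newline at the end of the string).

```python
import re

word = "housekeeper"

# Unanchored: matches anywhere in the string.
unanchored = re.search(r"keeper", word)

# ^ requires the match to start at the beginning of the string.
at_start = re.search(r"^keeper", word)

# $ requires the match to finish at the end of the string.
at_end = re.search(r"keeper$", word)

# $ also matches just before a newline that ends the string.
before_newline = re.search(r"keeper$", word + "\n")

# Using both anchors forces the regex to cover the whole string.
whole_string = re.search(r"^housekeeper$", word)

for name, m in [("unanchored", unanchored), ("at_start", at_start),
                ("at_end", at_end), ("before_newline", before_newline),
                ("whole_string", whole_string)]:
    print(name, bool(m))
```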
Using character classes
A character class allows a set of possible characters, rather than just a single character, to match at a particular point in a regex. Character classes are denoted by brackets [...], with the set of characters to be possibly matched inside. For example:
    /cat/ matches "cat"
    /[bcr]at/ matches "bat", "cat", or "rat"
The order of the characters inside the brackets does not matter: even if 'c' is the first character in the class, the regex still matches at the earliest possible point in the string, which may be an 'a'.
    /[yY][eE][sS]/ matches 'yes' in a case-insensitive way: 'yes', 'Yes', 'YES', etc.
The same case-insensitive match can be written with the (?i) modifier, as in /(?i)yes/. Character classes also have ordinary and special characters, but the sets of ordinary and special characters inside a character class are different from those outside a character class. The special characters for a character class are -]\^ and are matched using an escape:
    /[\]c]def/ matches ']def' or 'cdef'
The special character '-' acts as a range operator within character classes, so that the unwieldy [0123456789] and [abc...xyz] become the svelte [0-9] and [a-z]:
    /item[0-9]/ matches 'item0' or ... or 'item9'
If '-' is the first or last character in a character class, it is treated as an ordinary character. The special character ^ in the first position of a character class denotes a negated character class, which matches any character but those in the brackets. Both [...] and [^...] must match a character, or the match fails:
    /[^a]at/ doesn't match 'aat' or 'at', but matches all others: 'bat', 'cat', '0at', '%at', etc.
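The character class rules above can be exercised as follows. This is a sketch that assumes Python's `re` module; the sample strings such as "item0 item9 itemX" are invented for the illustration.

```python
import re

# One class per letter accepts either case of that letter.
mixed_case = re.fullmatch(r"[yY][eE][sS]", "yEs")

# The inline (?i) modifier makes the whole pattern case-insensitive.
inline_flag = re.fullmatch(r"(?i)yes", "YES")

# An escaped ']' is an ordinary member of the class.
bracket_hits = re.findall(r"[\]c]def", "]def cdef xdef")

# '-' between two characters denotes a range.
item_hits = re.findall(r"item[0-9]", "item0 item9 itemX")

# A negated class must still consume exactly one character,
# so 'at' (nothing before 'at') and 'aat' both fail.
candidates = ["aat", "at", "bat", "0at", "%at"]
negated_hits = [s for s in candidates if re.search(r"[^a]at", s)]

print(bracket_hits, item_hits, negated_hits)
```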
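There are several abbreviations for common character classes:
    \d is a digit and represents [0-9]
    \s is a whitespace character (space, tab, newline, etc.)
    \w is a word character (alphanumeric or _)
    \D is a negated \d: any character but a digit
    \S is a negated \s: any non-whitespace character
    \W is a negated \w: any non-word character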
The \d, \s, \w, \D, \S, and \W abbreviations can be used both inside and outside of character classes. For example:
    /\d\d:\d\d:\d\d/ matches a hh:mm:ss time format
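A small sketch of the abbreviations in use, assuming Python's `re` module. The `re.ASCII` flag is an assumption made here so that \d means exactly [0-9]; without it Python also accepts non-ASCII digits. The sample strings are invented.

```python
import re

# \d\d:\d\d:\d\d describes an hh:mm:ss time.
time_re = re.compile(r"\d\d:\d\d:\d\d", re.ASCII)
time_match = time_re.search("backup finished at 23:59:07")

# The abbreviations also work inside a bracketed class:
# [\d\s] matches a digit or a whitespace character.
digits_and_spaces = re.findall(r"[\d\s]", "a1 b2", re.ASCII)

# \D is the negation of \d, so it matches any non-digit.
non_digits = re.findall(r"\D", "a1 b2", re.ASCII)

print(time_match.group(), digits_and_spaces, non_digits)
```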
The word anchor \b matches a boundary between a word character and a non-word character (\w\W or \W\w):
    /\bcat/ matches 'cat' in 'catenates' of "Housecat catenates house and cat"
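The word anchor can be explored on the same sentence. This sketch assumes Python's `re` module, where \b has the meaning described above; the variants /cat\b/ and /\bcat\b/ are added here only for comparison.

```python
import re

text = "Housecat catenates house and cat"

# \b before 'cat': the match must begin at the start of a word.
start_of_word = re.search(r"\bcat", text)

# \b after 'cat': the match must finish at the end of a word.
end_of_word = re.search(r"cat\b", text)

# \b on both sides: 'cat' must be a whole word.
whole_word = re.search(r"\bcat\b", text)

print(start_of_word.start())  # 9  ('cat' in 'catenates')
print(end_of_word.start())    # 5  ('cat' in 'Housecat')
print(whole_word.span())      # (29, 32), the final word
```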
The end of the string is also considered a word boundary, so \b can match right after a word that ends the string.
Matching this or that
We can match different character strings with the alternation metacharacter '|'. To match dog or cat, we form the regex /dog|cat/. As before, the regex engine tries to match at the earliest possible point in the string. At each character position, it tries the first alternative (dog) first. If dog doesn't match, it tries the next alternative (cat). If cat doesn't match either, the match fails at that position and the engine moves to the next position in the string. For example:
    /cat|dog|bird/ matches "cat" in "cats and dogs"
Even if dog is listed as the first alternative, as in /dog|cat|bird/, the regex still matches "cat" in "cats and dogs", because cat is able to match earlier in the string.
    /c|ca|cat|cats/ matches "c" in "cats"
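The interplay between string position and the order of alternatives can be sketched as follows, assuming Python's `re` module (which tries alternatives left to right, as described above). The reversed pattern /cats|cat|ca|c/ is added here only for contrast.

```python
import re

text = "cats and dogs"

# Position in the string comes first: 'cat' occurs before 'dog',
# so it wins regardless of where it is listed.
cat_first = re.search(r"cat|dog|bird", text)
dog_first = re.search(r"dog|cat|bird", text)

# At one and the same position, the leftmost alternative that
# lets the regex succeed is chosen.
shortest_listed_first = re.search(r"c|ca|cat|cats", "cats")
longest_listed_first = re.search(r"cats|cat|ca|c", "cats")

print(cat_first.group(), dog_first.group())
print(shortest_listed_first.group(), longest_listed_first.group())
```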
At a given character position, the first alternative that allows the regex match to succeed will be the one that matches. When /c|ca|cat|cats/ is matched against "cats", all the alternatives match at the first string position, so the first one, c, is the one that matches.
Grouping things and hierarchical matching
The grouping metacharacters () allow a part of a regex to be treated as a single unit. Parts of a regex are grouped by enclosing them in parentheses. The regex house(cat|keeper) means match house followed by either cat or keeper. Some more examples are:
    /(a|b)b/ matches 'ab' or 'bb'
    /house(cat|)/ matches either 'housecat' or 'house'
    /(19|20|)\d\d/ matches the null alternative '()\d\d' in "20", because '20\d\d' can't match
Extracting matches
The grouping metacharacters () also allow the extraction of the parts of a string that matched. For each grouping, the part that matched inside goes into the special variables $1, $2, etc. They can be used just as ordinary variables in other Private Shell actions. For example, to extract hours, minutes, and seconds from a hh:mm:ss time, group each field:
    /(\d\d):(\d\d):(\d\d)/ puts the hours in $1, the minutes in $2, and the seconds in $3
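Grouping and extraction can be sketched like this, assuming Python's `re` module. Python has no $1, $2 variables; `match.group(1)`, `match.group(2)`, and so on play that role here. The input strings are invented for the example.

```python
import re

# Grouping treats 'cat|keeper' as one unit following 'house'.
unit_match = re.search(r"house(cat|keeper)", "the housekeeper left")

# Each pair of parentheses captures the text it matched.
time_match = re.search(r"(\d\d):(\d\d):(\d\d)", "uptime 12:34:56")

# group(1), group(2), group(3) correspond to $1, $2, $3.
hours, minutes, seconds = time_match.group(1, 2, 3)

print(unit_match.group(1))      # keeper
print(hours, minutes, seconds)  # 12 34 56
```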
If the groupings in a regex are nested, $1 gets the group with the leftmost opening parenthesis, $2 the next opening parenthesis, etc. For example, here is a complex regex and the matching variable for each of its groups:
    /(ab(cd|ef)((gi)|j))/
    $1 = (ab(cd|ef)((gi)|j))
    $2 = (cd|ef)
    $3 = ((gi)|j)
    $4 = (gi)
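The numbering of nested groups can be verified directly. This sketch assumes Python's `re` module, where `match.groups()` returns the captures in the order of their opening parentheses; the two input strings are chosen here so that each branch of the alternations is exercised.

```python
import re

nested = re.compile(r"(ab(cd|ef)((gi)|j))")

# Groups are numbered by the position of their opening parenthesis.
first = nested.fullmatch("abcdgi")
second = nested.fullmatch("abefj")

print(first.groups())   # ('abcdgi', 'cd', 'gi', 'gi')

# Group 4, (gi), takes no part in the second match, so Python
# reports it as None.
print(second.groups())  # ('abefj', 'ef', 'j', None)
```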
Associated with the matching variables $1, $2, ... are the backreferences \1, \2, ... Backreferences are matching variables that can be used inside a regex:
    /(\w\w\w)\s\1/ finds sequences like 'the the' in a string
$1, $2, ... should only be used outside of a regex, and \1, \2, ... only inside a regex.
Matching repetitions
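The quantifier metacharacters ?, *, +, and {} allow us to determine the number of repeats of a portion of a regex we consider to be a match. Quantifiers are put immediately after the character, character class, or grouping that we want to specify. They have the following meanings:
    a? = match 'a' 1 or 0 times
    a* = match 'a' 0 or more times
    a+ = match 'a' 1 or more times
    a{n,m} = match 'a' at least n times, but not more than m times
    a{n,} = match 'a' at least n times
    a{n} = match 'a' exactly n times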
Here is an example:
    /[a-z]+\s+\d*/ matches a lowercase word, at least one whitespace character, and any number of digits
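The quantifiers can be tried out as follows. This is a sketch assuming Python's `re` module; the patterns `colou?r` and `a{2,3}` and all input strings are illustrative choices, not taken from this page.

```python
import re

# One or more lowercase letters, one or more whitespace
# characters, then zero or more digits.
pattern = re.compile(r"[a-z]+\s+\d*")
with_digits = pattern.search("port 8080")
without_digits = pattern.search("port  ")  # \d* matches zero digits

# '?' makes the preceding character optional (1 or 0 times).
words = ["color", "colour", "colouur"]
optional_u = [w for w in words if re.fullmatch(r"colou?r", w)]

# {2,3} allows at least 2 and at most 3 repetitions.
runs = ["a", "aa", "aaa", "aaaa"]
bounded = [r for r in runs if re.fullmatch(r"a{2,3}", r)]

print(repr(with_digits.group()), repr(without_digits.group()))
print(optional_u, bounded)
```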
These quantifiers will try to match as much of the string as possible, while still allowing the regex to match. For example:
    /^(.*)(at)(.*)$/ matches 'the cat in the hat' with $1 = 'the cat in the h', $2 = 'at', and $3 = '' (0 matches)
The first quantifier .* grabs as much of the string as possible while still having the regex match. The second quantifier .* has no string left for it, so it matches 0 times.
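The greedy behaviour described above can be reproduced with a short check, assuming Python's `re` module, whose quantifiers are greedy by default in the same way. `groups()` stands in for $1, $2, $3.

```python
import re

# The first .* takes as much as it can while still leaving
# an 'at' for the second group; the last .* gets nothing.
greedy = re.search(r"^(.*)(at)(.*)$", "the cat in the hat")

before, middle, after = greedy.groups()

print(repr(before))  # 'the cat in the h'
print(repr(middle))  # 'at'
print(repr(after))   # ''
```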
|