Regular expressions
Text-Matching Patterns
Regular expressions are the Linux term for text patterns that feature
"wild-card" characters. Wild-card characters are used in a number
of text manipulation utilities such as ed, vi, grep,
sed, tr, expr and awk.
It is important to realize that wild-card characters used in
text-oriented utilities
are not exactly the same as the shell's wild-card characters.
This is because text manipulation utilities require more powerful
wild-card expressions than the shell, so they are implemented a
little differently.
Within this tutorial, we will use the ed
text editor to
illustrate the pattern-matching expressions in most of our examples.
We will then conclude with a cross-reference of ed wild-cards
to shell wild-cards, since new users often confuse these two
similar but different sets.
The Dot (.): Any Character
In text utilities such as ed and grep, the single dot is
used to represent any single character. It is similar in behaviour
to the question mark in the shell. For instance, the expression
"r.ot" will match "root" and "riot", but not "rot" or "robot".
The dot is a placeholder for one and only one character, regardless
of what this character is. For instance, in ed, the command:
g/a.m/p
retrieves and displays all lines containing words like
"admin" and "chasm".
Note that the ed
syntax "g/re/p" stands for "globally, wherever you
find the regular expression re, print it". You may also recognize
the origin of grep's name!
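If you would like to experiment with the dot outside ed, the short sketch below uses Python's re module as a stand-in. This is an assumption made for the example only (the tutorial itself uses ed), but the dot has the same meaning there. The word lists and the helper name whole_word_matches are made up for illustration.

```python
import re

def whole_word_matches(pattern, words):
    """Return the words that match the pattern from first to last character."""
    return [w for w in words if re.fullmatch(pattern, w)]

# The dot stands for exactly one character, whatever it is.
words = ["root", "riot", "rot", "robot"]
dot_hits = whole_word_matches(r"r.ot", words)
print(dot_hits)

# Roughly what g/a.m/p does in ed: keep lines containing the pattern anywhere.
lines = ["the admin is away", "a deep chasm", "nothing here"]
am_hits = [ln for ln in lines if re.search(r"a.m", ln)]
print(am_hits)
```

Note that re.fullmatch tests a whole word, while re.search looks for the pattern anywhere in a line, which is closer to how ed and grep scan text.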
The Asterisk (*): Zero or More
The asterisk represents "zero or more occurrences of the previous
character".
This is a very subtle but very important distinction
from its meaning in the shell, where the asterisk represents "zero or more
occurrences of any character".
For instance, in ed and all other text-manipulation tools
(grep, sed, vi, etc.),
the expression "ro*t" will match "rt", "rot", "root",
or even "rooot", but will not match "robot"
as it would in the shell.
To obtain the shell's equivalent of the asterisk, you must specify
".*" instead.
Indeed, since the dot stands for any character, the expression
".*" stands for "zero or more occurrences of any character".
As a result, the expression "ro.*t" in text-oriented utilities
will match "rot", "root", "rooot", and "robot".
The character preceding the asterisk loses its individual presence
and becomes part of the character-asterisk pair, which is treated
as a single expression. For instance, the expression "rob*ot" will
match "root", even if the word "root" does not contain the letter
'b'. This is because the pair 'b*' is seen as a single expression
meaning "zero or more occurrences of the letter 'b'", which means
that zero occurrences of 'b' will match the expression.
These subtle points are very important if you want to be able to use
pattern-matching expressions effectively.
If you wanted to specify
"one or more occurrences of 'b'" (as opposed to "zero or more"),
you would have to specify a "b" before the "b*" expression, which
would guarantee the presence of one "b" followed by zero or more b's.
For example, the regular expression "robb*ot" would match
the word "robot" but not the word "root."
Similarly, the expression "Diann*e" would match both "Diane" and "Dianne"
while the expression "Anne*" would match both
"Ann" and "Anne" because "e*" means zero or more e's.
As you can see, the asterisk has a much different meaning in
regular expressions than it has for the shell, and this can cause
much aggravation to shell programmers who are not aware of this
difference.
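The asterisk examples in this section can be checked with a small sketch. We again assume Python's re module as a stand-in for ed; the helper name matching is hypothetical, and each word is tested as a whole.

```python
import re

words = ["rt", "rot", "root", "rooot", "robot"]

def matching(pattern):
    """Return the words matched in full by the regular expression."""
    return [w for w in words if re.fullmatch(pattern, w)]

star = matching(r"ro*t")          # zero or more o's between r and t
dot_star = matching(r"ro.*t")     # "ro", then anything, then t
optional_b = matching(r"rob*ot")  # the b may be absent
one_or_more_b = matching(r"robb*ot")  # one b, then zero or more b's

for name, hits in [("ro*t", star), ("ro.*t", dot_star),
                   ("rob*ot", optional_b), ("robb*ot", one_or_more_b)]:
    print(name, hits)
```

Only "ro.*t" and the patterns containing a literal b accept "robot"; plain "ro*t" does not, which is exactly the difference from the shell.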
Square Brackets: Lists and Ranges
The square brackets are used to specify a set of allowed characters
for a single character position. It is like a restricted "dot".
For instance, the expression "r[oi]ot" matches "root" and "riot",
because the second character may be either the letter 'o' or the
letter 'i'.
There can be any number of characters between the brackets, but
each bracketed expression is a placeholder for only one character at
a time. For instance, the expression "r[ioe][ons]t" will match
"root", "riot", "rest", and "rent".
A powerful feature of the brackets is the ability to specify ranges
in the ASCII character sequence using the dash. For instance, the
expression "[a-e]" is identical to "[abcde]", and the expression
"[0-6]" is the same as "[0123456]".
Ranges are often used to specify
upper-case or lower-case characters. For instance, the expression
"[A-Z]ohn" will match "John", but not "john".
Another feature of the square brackets is
the ability to exclude a given set
of characters (or a range) by specifying the caret (^) at the beginning
of the list.
For instance the expression "[^s]" matches any single
character except 's'. You can read this expression as "not s".
Consequently, the expression "re[^s]t" will
match "rent" but not "rest".
Similarly, the expression "[^ioe]" can be read as "not i, o, or e"
and will match any single character except those three.
The caret may also be used at the beginning of a range to
exclude all members of that range. For example, the expression "[^A-Z]"
matches any single character except upper-case letters.
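Here is a brief sketch of lists, ranges and negated lists, assuming Python's re module, whose bracket syntax matches what is described here. The helper name matching and the extra test word "rust" are our own additions.

```python
import re

def matching(pattern, words):
    """Return the words matched in full by the regular expression."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Each bracketed list stands for exactly one character position.
lists = matching(r"r[ioe][ons]t", ["root", "riot", "rest", "rent", "rust"])

# A range: any upper-case letter followed by "ohn".
ranges = matching(r"[A-Z]ohn", ["John", "john"])

# A leading caret inside the brackets negates the list.
negated = matching(r"re[^s]t", ["rent", "rest"])

print(lists, ranges, negated)
```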
NOTE:
Unlike text manipulation tools such as grep and ed, the
shell uses the exclamation mark as a negation operator instead of
the caret. For instance, to list all files not beginning with the
letter 'f', the shell would use:
$ ls [!f]*
Text utilities, on the other hand, would use ^[^f].* to represent
the same thing (note the use of ".*" instead of "*", and the leading
caret that anchors the pattern to the start of the line).
Another important note is that
the meaning of the caret inside square brackets is totally different
from its meaning outside the brackets, which we
will examine in the next section.
The Caret (^): Beginning of Line
The caret, when used as part of a regular expression,
stands for beginning of line.
The caret does not occupy any character position, but rather forces the
expression to be matched only at the beginning of a line in
text searches.
For instance, the following ed instruction will replace
"Mary" with "John" only when "Mary" is at the beginning of a line:
1,$s/^Mary/John/
Similarly, the following ed command will indent the block
of text from line 15 to line 25 by replacing the
"beginning of line" (the caret) with a few spaces:
15,25s/^/    /
This next instruction will remove leading spaces from all lines:
1,$s/^  *//
Note that we put two spaces in our search pattern between
the caret and the asterisk to
indicate we are looking for one space followed by zero or more
spaces.
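The three caret substitutions can be approximated as follows. This is a sketch under the assumption that Python's re module stands in for ed: the re.MULTILINE flag is needed so that the caret matches at the start of every line rather than only at the start of the text, and the sample text is invented.

```python
import re

text = "Mary met Mary\n   indented line\nMary again"

# Like 1,$s/^Mary/John/ : only a "Mary" at the start of a line is replaced.
renamed = re.sub(r"^Mary", "John", text, flags=re.MULTILINE)

# Like 1,$s/^  *// : one space followed by zero or more spaces, at line start.
stripped = re.sub(r"^  *", "", text, flags=re.MULTILINE)

# Like the indent command, applied to every line: the caret occupies no
# character position, so "replacing" it simply inserts text at line start.
indented = re.sub(r"^", "    ", "a\nb", flags=re.MULTILINE)

print(renamed)
print(stripped)
print(indented)
```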
The Dollar Sign ($): End of Line
The meaning of the dollar sign in regular expressions
is the exact opposite of the caret's: it stands
for "end of line." Like the caret, the dollar sign does not
occupy a character position, but rather forces the expression to match
the end of the line.
For example, this ed instruction replaces "John" with "Mary"
only when "John" is the last word on the line:
1,$s/John$/Mary/
Remember that in the range "1,$", the dollar sign means the last
line in the file, while in the regular expression "John$", the
dollar sign means the end of the line.
This next instruction removes all trailing blanks at the end
of lines in the entire file:
1,$s/  *$//
Again, please note that we used two spaces in the
regular expression to indicate "one or more spaces" at the end
of the line.
Here is an example of how you could use the caret and the dollar
sign together to delete all empty lines in a file, i.e. lines that
have nothing between the start of the line (indicated by the caret)
and the end of the line (indicated by the dollar sign):
g/^$/d
The expression above could be read as follows: "Globally, wherever
the beginning of the line is immediately followed by the end of the
line, delete the line."
Note that if there are lines in the file with no visible
text but that may contain one or more spaces, these lines would
remain intact since the command above would only delete the lines
that are truly empty.
How would you tell
ed to delete these lines as well?
Simply change the regular expression to indicate "beginning of
line, followed by zero or more spaces, followed by the end of the
line," like this:
g/^ *$/d
Note that we are specifying the pattern " *" featuring a
single space, which means "zero or more spaces." So, if a line is
truly blank (zero space), it will be matched; similarly, if a line
has one space or multiple spaces, it will also be matched. The
result is that all lines containing no visible text will be deleted
(unless some lines also contain tab characters or other invisible
characters, which we will leave as an exercise for you, dear Reader).
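To see the difference between the two deletion commands, consider this sketch, which assumes Python's re module in place of ed and a made-up list of lines. Filtering a list plays the role of ed's "g/.../d".

```python
import re

lines = ["first", "", "   ", "second", "\t", "third"]

# Like g/^$/d : drop only the lines that are truly empty.
no_empty = [ln for ln in lines if not re.search(r"^$", ln)]

# Like g/^ *$/d : also drop lines that hold nothing but spaces.
no_blank = [ln for ln in lines if not re.search(r"^ *$", ln)]

print(no_empty)
print(no_blank)
```

The line containing only a tab survives both filters, which is the situation the exercise above asks you to think about.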
Cross-Reference of Shell vs Text Wild-Card Characters
The Linux shell (i.e. the command prompt) uses a set of wild-card
characters for filename expansion which is somewhat similar but
not identical to regular expressions used by
text-manipulating utilities that we have just covered in this
tutorial.
Since it's easy for inexperienced users to confuse the two,
the following chart will hopefully be helpful in clarifying the
differences:
Shell | Text Utilities | Meaning |
? | . | Any single character |
* | .* | Zero or more of any character |
[abc] | [abc] | Any single character in this list |
[a-z] | [a-z] | Any single character in this range |
[!abc] | [^abc] | Any single character not in this list |
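Each row of the chart can be verified side by side. The sketch below assumes Python's fnmatch module as an approximation of shell filename matching and the re module as an approximation of the text utilities; the pattern pairs and the word list are our own examples.

```python
import fnmatch
import re

# (shell pattern, regular expression) pairs, one per row of the chart.
pairs = [
    ("r?ot", r"r.ot"),
    ("ro*t", r"ro.*t"),
    ("r[oi]ot", r"r[oi]ot"),
    ("[a-z]ohn", r"[a-z]ohn"),
    ("re[!s]t", r"re[^s]t"),
]
words = ["root", "riot", "rot", "robot", "john", "John", "rent", "rest"]

def shell_hits(pattern):
    """Words accepted by a shell-style wild-card pattern (case-sensitive)."""
    return [w for w in words if fnmatch.fnmatchcase(w, pattern)]

def regex_hits(pattern):
    """Words matched in full by a regular expression."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Every translated pair should select the same words.
agree = all(shell_hits(s) == regex_hits(r) for s, r in pairs)
print(agree)

# The trap: the very same text "ro*t" selects different words in each dialect.
shell_star = shell_hits("ro*t")
regex_star = regex_hits(r"ro*t")
print(shell_star, regex_star)
```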
Extended Expressions
The regular expressions we have covered so far are supported by
most text-processing utilities such as ed and grep.
However, additional and more powerful constructs are supported by
some utilities such as the ex editor, vi and
egrep ("extended grep").
This tutorial refers to them loosely as "extended" expressions
(not to be confused with the POSIX "extended regular expression" syntax);
we will examine some of the most useful ones here.
Matching Whole Words
A common requirement is to search for a string that constitutes a
whole word and not just part of one. For instance, you may need to
replace all instances of the word "ask" with "request" without
affecting words like "basket" or "task."
Two special constructs exist to give you control over this type of
search. The pair of characters "\<" can be used to specify the
beginning of a word, while "\>" indicates the end of a word.
As a result, searching for the regular expression "\<ask\>"
will retrieve all instances of the word "ask" when used as a whole
word without picking up the string "ask" when part of a larger
word.
IMPORTANT: These expressions are extensions rather than part of the
basic pattern syntax, so some older versions of grep do not support them.
GNU grep, the version found on most Linux systems, accepts them
in both grep and egrep (which stands for "extended grep").
Note that it is not necessary to use both "\<" and "\>" in
a pattern if you only want to specify either the beginning or the
end of a word in your search. For instance, if you want to
search for all instances of the name "John" in a file without
picking up instances of "Johnny," you would use the regular
expression "John\>" as in this example:
$ cat testfile
My friend John,
also known as Johnny,
is a jolly good fellow.
$ grep "John" testfile
My friend John,
also known as Johnny,
$ egrep "John\>" testfile
My friend John,
$ _
Note that this regular expression "understands" punctuation and is
able to determine that "John," (with the comma) constitutes a
whole word.
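The same searches can be sketched in Python, with one important caveat: Python's re module does not support "\<" or "\>". Its closest analogue is "\b", which matches at either edge of a word, so the code below is an approximation rather than a literal translation. The sentence used for the "ask" example is invented.

```python
import re

lines = ["My friend John,", "also known as Johnny,", "is a jolly good fellow."]

# Plain search: picks up "Johnny" as well.
plain = [ln for ln in lines if re.search(r"John", ln)]

# "John" followed by a word boundary: the comma ends the word, the "n" does not.
whole = [ln for ln in lines if re.search(r"John\b", ln)]

# Replace "ask" only when it stands alone as a word.
asked = re.sub(r"\bask\b", "request", "ask for a basket to finish the task")

print(plain)
print(whole)
print(asked)
```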
Similarly, the expression "\<paper" would match "paper" or
"papers" but not "newspaper."
Capturing Segments of Text
The round parentheses, when preceded with a backslash
(as in "\(" and "\)"), allow you to
bracket any arbitrary segment of text and refer to it in a
substitution.
For instance, let's assume we have a file containing the following text:
Contest Results
Dianne: 1st Place
Tyler: 2nd Place
Ryan: 3rd Place
Denise: Honorable Mention
Jack: Honorable Mention
Let's further assume that you need to edit this list so that the
contestant's position
is listed first, followed by the contestant's name, as in:
Contest Results
1st Place: Dianne
2nd Place: Tyler
3rd Place: Ryan
Honorable Mention: Denise
Honorable Mention: Jack
Using the constructs we are introducing in this section, you could do
this editing job with a single command in vi!
Here's how.
The simplest way to explain the round parentheses is to show you
the command for the editing job we just described,
and then to break it down one step at a time. So, assuming you are in
vi, you would press the colon to access the command line
and enter the following command:
1,$s/\(.*\): \(.*\)/\2: \1/
Looks intimidating, doesn't it? Fortunately, once we break it down,
it's quite simple.
The start of the command ("1,$s") means "From line 1 to the
last line, do the following substitution." This is basic ed
syntax which you are probably familiar with by now (if not, read
our ed tutorial
on this site).
The rest of the line is what may look a little puzzling at the moment.
Keep in mind that it boils down to this familiar substitution
syntax:
s/something/something else/
The difference here is that we have surrounded specific portions of
our wild-card patterns with round parentheses, each preceded with a backslash,
to capture and store those specific line fragments for later use.
The first line segment we capture is "\(.*\):" which
means all the characters up to the colon.
Note that we close the parentheses before the colon since
we do not want to capture that character as part of our segment.
For example, in the line "Dianne: 1st Place," we
capture the string "Dianne" (without the colon) between parentheses.
The second set of round parentheses captures all the characters that
come after a colon and a space, which on this line would be the
string "1st Place."
In the second part of the substitution, we refer to these line
fragments using positional parameters. Specifically, the expression
"\1" refers to the first fragment, "\2" refers to the second, and so on.
In our example, we only used two sets of parentheses, so we have two
fragments we can use. We are telling vi to replace the pattern
found in the first part of the substitution with "\2" (the second
fragment), followed by a colon and a space, followed by
"\1" which represents the first fragment.
As a result, vi will replace "Dianne: 1st Place"
with "\2: \1", meaning the string "1st Place" (represented by "\2"),
followed by a colon and a space, followed by the string "Dianne"
(represented by "\1").
Of course, since we ran this command on all the lines (as indicated
by the "1,$" prefix), all the lines matching the pattern will have
been converted by this single command.
You will notice that nothing happened to the very first line
containing the text "Contest Results" because that line did not match
the pattern we were looking for in our substitution command (it did
not contain a colon followed by a space), so the substitution was not
performed on that line.
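For readers who want to try the swap outside vi, here is a sketch assuming Python's re module. Be aware of a syntax difference: in Python the capturing parentheses are written without backslashes, while "\1" and "\2" in the replacement work as described above. The substitution is applied line by line to mimic the "1,$" range.

```python
import re

lines = [
    "Contest Results",
    "Dianne: 1st Place",
    "Tyler: 2nd Place",
    "Ryan: 3rd Place",
    "Denise: Honorable Mention",
    "Jack: Honorable Mention",
]

# Group 1 is everything before ": ", group 2 is everything after it.
# Lines without ": " do not match and are returned unchanged.
swapped = [re.sub(r"(.*): (.*)", r"\2: \1", ln, count=1) for ln in lines]

for ln in swapped:
    print(ln)
```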
Changing Case
Another use of the round parentheses is to change the case of
any portion of text we capture using a wild-card pattern.
This can be particularly useful when we want to capitalize certain
words or turn entire phrases to all-caps.
To use this feature, substitute a line fragment captured using the
round parentheses as described above with the same fragment preceded with
one of the following modifiers:
\u | First-Letter Upper-Case |
\U | Switch to All-Caps |
\l | First-Letter Lower-Case |
\L | Switch to All Lower-Case |
For instance, if we wanted to capitalize the title line ("Contest Results")
in our previous example, we would position the cursor on that line in
vi, then press the colon to enter command mode, and we would type
in the following instruction:
s/\(.*\)/\U\1/
In this substitution command, we are asking vi to
replace all characters on the line (.*) with the same thing (\1),
only all upper-case (\U).
We can use the same technique to
convert the string "Denise" to all-caps in
this sample text:
1st Place: Denise
In vi, we would issue the following instruction to
accomplish this task:
s/\(.*\): \(.*\)/\1: \U\2/
...which would result in this:
1st Place: DENISE
This example is almost identical to what we did earlier to reverse
the column order in our list, except this time we are
keeping the order intact (we specify "\1" first, followed by
a colon, a space, and "\2") but we precede the positional parameter
"\2" with the "\U" modifier to convert it to all-caps.
Naturally, this is a lot of effort for a simple text modification
on a single line, but
the power of these substitutions is really appreciated when you have
to perform a massive search-and-replace operation on a large file containing
hundreds of lines that would otherwise have to be modified manually.
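The case modifiers are a feature of vi's substitute command; Python's re module, which we assume here purely for illustration, has no "\U" or "\u". A comparable effect can be sketched by passing a function as the replacement. The helper name first_upper and the lower-case sample line are hypothetical.

```python
import re

def first_upper(s):
    """Mimic \\u: upper-case the first character only, leave the rest alone."""
    return s[:1].upper() + s[1:]

# Like s/\(.*\)/\U\1/ : the whole line, in upper case.
title = re.sub(r"(.*)", lambda m: m.group(1).upper(),
               "Contest Results", count=1)

# Like s/\(.*\): \(.*\)/\1: \U\2/ : keep the order, upper-case the second fragment.
winner = re.sub(r"(.*): (.*)",
                lambda m: m.group(1) + ": " + m.group(2).upper(),
                "1st Place: Denise", count=1)

# Like \u applied to the second fragment.
name_fixed = re.sub(r"(.*): (.*)",
                    lambda m: m.group(1) + ": " + first_upper(m.group(2)),
                    "1st place: denise", count=1)

print(title)
print(winner)
print(name_fixed)
```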