Regular expressions

Text-Matching Patterns

Regular expressions are the Linux term for text patterns that feature "wild-card" characters. Wild-card characters are used in a number of text manipulation utilities such as ed, vi, grep, sed, tr, expr and awk.

It is important to realize that wild-card characters used in text-oriented utilities are not exactly the same as the shell's wild-card characters. This is because text manipulation utilities require more powerful wild-card expressions than the shell, so they are implemented a little differently.

Within this tutorial, we will use the ed text editor to illustrate the pattern-matching expressions in most of our examples. We will then conclude with a cross-reference of ed wild-cards to shell wild-cards since new users are often confused between the two similar but different sets.

The Dot (.): Any Character

In text utilities such as ed and grep, the single dot is used to represent any single character. It is similar in behaviour to the question mark in the shell. For instance, the expression "r.ot" will match "root" and "riot", but not "rot" or "robot".

The dot is a placeholder for one and only one character, regardless of what this character is. For instance, in ed, the command:

g/a.m/p

would retrieve and display all lines containing words like "admin" and "chasm", since "a.m" matches the fragments "adm" and "asm".

Note that the ed syntax "g/re/p" stands for "globally, wherever you find the regular expression re, print it". You may also recognize the origin of grep's name!
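As a quick check of the dot's behaviour, here is a minimal sketch using grep rather than ed so that it can be run non-interactively. The word list and the function name match_dot are made up for illustration; the -x option asks grep to match the whole line, so each word is tested on its own.

```shell
# Sample words, one per line (hypothetical test data).
words='root
riot
rot
robot'

# The dot stands for exactly one character between "r" and "ot".
match_dot() {
    printf '%s\n' "$words" | grep -x 'r.ot'
}

match_dot   # prints "root" and "riot" only
```

"rot" is rejected because nothing is left to fill the dot's position, and "robot" is rejected because two characters sit where only one is allowed.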

The Asterisk (*): Zero or More

The asterisk represents "zero or more occurrences of the previous character".

This is a very subtle but very important distinction from its meaning in the shell, where the asterisk represents "zero or more occurrences of any character".

For instance, in ed and all other text-manipulation tools (grep, sed, vi, etc.), the expression "ro*t" will match "rt", "rot", "root", or even "rooot", but will not match "robot" as it would in the shell.

To obtain the shell's equivalent of the asterisk, you must specify ".*" instead. Indeed, since the dot stands for any character, the expression ".*" stands for "zero or more occurrences of any character". As a result, the expression "ro.*t" in text-oriented utilities will match "rot", "root", "rooot", and "robot".

The character preceding the asterisk loses its individual presence and becomes part of the character-asterisk pair, which is treated as a single expression. For instance, the expression "rob*ot" will match "root", even if the word "root" does not contain the letter 'b'. This is because the pair 'b*' is seen as a single expression meaning "zero or more occurrences of the letter 'b'", which means that zero occurrences of 'b' will match the expression.

These subtle points are very important if you want to be able to use pattern-matching expressions effectively.

If you wanted to specify "one or more occurrences of 'b'" (as opposed to "zero or more"), you would have to specify a "b" before the "b*" expression, which would guarantee the presence of one "b" followed by zero or more b's. For example, the regular expression "robb*ot" would match the word "robot" but not the word "root."

Similarly, the expression "Diann*e" would match both "Diane" and "Dianne", while the expression "Anne*" would match both "Ann" and "Anne" because "e*" means zero or more e's.

As you can see, the asterisk has a much different meaning in regular expressions than it has for the shell, and this can cause much aggravation to shell programmers who are not aware of this difference.
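The three asterisk patterns discussed above can be compared side by side with a short sketch. It assumes grep with the -x option (match the whole line); the word list and function names are hypothetical.

```shell
# Sample words, one per line (hypothetical test data).
words='rt
rot
root
rooot
robot'

# "o*" means zero or more of the letter "o" only.
star_literal() { printf '%s\n' "$words" | grep -x 'ro*t'; }

# ".*" means zero or more of any character, like the shell's "*".
star_any() { printf '%s\n' "$words" | grep -x 'ro.*t'; }

# "bb*" means one "b" followed by zero or more b's, i.e. one or more.
one_or_more_b() { printf '%s\n' "$words" | grep -x 'robb*ot'; }

star_literal    # rt, rot, root, rooot
star_any        # rot, root, rooot, robot
one_or_more_b   # robot
```

Note that "ro.*t" does not match "rt", because the literal "o" in the pattern is still required.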

Square Brackets: Lists and Ranges

The square brackets are used to specify a range of legal values for a single character position. It is like a restricted "dot". For instance, the expression "r[oi]ot" matches "root" and "riot", because the second character may be either the letter 'o' or the letter 'i'.

There can be any number of characters between the brackets, but each bracketed expression is a placeholder for only one character at a time. For instance, the expression "r[ioe][ons]t" will match "root", "riot", "rest", and "rent".

A powerful feature of the brackets is the ability to specify ranges in the ASCII character sequence using the dash. For instance, the expression "[a-e]" is identical to "[abcde]", and the expression "[0-6]" is the same as "[0123456]".

Ranges are often used to specify upper-case or lower-case characters. For instance, the expression "[A-Z]ohn" will match "John", but not "john".

Another feature of the square brackets is the ability to exclude a given set of characters (or a range) by specifying the caret (^) at the beginning of the list.

For instance the expression "[^s]" matches any single character except 's'. You can read this expression as "not s". Consequently, the expression "re[^s]t" will match "rent" but not "rest".

Similarly, the expression "[^ioe]" can be read as "not i, o, or e" and will match any single character except those three.

The caret may also be used at the beginning of a range to exclude all members of that range. For example, the expression "[^A-Z]" matches any single character except upper-case letters.
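Lists, ranges and exclusions can be tried out with the following sketch. The word list and function names are made up; LC_ALL=C is set as a precaution so that the range "[A-Z]" follows plain ASCII order regardless of the system's locale.

```shell
# Sample words, one per line (hypothetical test data).
words='root
riot
rest
rent
John
john'

# A list: the second character may be "o" or "i".
list_match() { printf '%s\n' "$words" | LC_ALL=C grep -x 'r[oi]ot'; }

# An exclusion: the third character may be anything except "s".
not_s_match() { printf '%s\n' "$words" | LC_ALL=C grep -x 're[^s]t'; }

# A range: the first character must be an upper-case letter.
range_match() { printf '%s\n' "$words" | LC_ALL=C grep -x '[A-Z]ohn'; }

list_match    # root, riot
not_s_match   # rent
range_match   # John
```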

NOTE:

Unlike text manipulation tools such as grep and ed, the shell uses the exclamation mark as a negation operator instead of the caret. For instance, to list all files not beginning with the letter 'f', the shell would use:

$ ls [!f]*

Text utilities, on the other hand, would use ^[^f].* to represent the same thing (note the use of ".*" instead of "*"). The leading caret, outside the brackets, anchors the match to the beginning of the line.

Another important note is that the meaning of the caret inside square brackets is totally different from its meaning outside the brackets, which we will examine in the next section.

The Caret (^): Beginning of Line

The caret, when used as part of a regular expression, stands for beginning of line. The caret does not occupy any character position, but rather forces the expression to be matched only at the beginning of a line in text searches.

For instance, the following ed instruction will only replace "Mary" with "John" when "Mary" is at the beginning of a line:

1,$s/^Mary/John/

Similarly, the following ed command will indent the block of text from line 15 to line 25 by replacing the "beginning of line" (the caret) with a few spaces:

15,25s/^/     /

This next instruction will remove leading spaces from all lines:

1,$s/^  *//

Note that we put two spaces in our search pattern between the caret and the asterisk to indicate we are looking for one space followed by zero or more spaces.
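The same caret patterns can be tried outside the editor. The sketch below uses sed instead of ed so that it runs non-interactively; sed's s command accepts the same patterns. The sample text and function names are hypothetical.

```shell
# Two sample lines (hypothetical); the second starts with three spaces.
sample() {
    printf 'Mary met Mary\n   indented line\n'
}

# Only the "Mary" at the beginning of the line is replaced.
replace_leading_mary() { sample | sed 's/^Mary/John/'; }

# One space followed by zero or more spaces, anchored at the start of the line.
strip_leading_spaces() { sample | sed 's/^  *//'; }

replace_leading_mary   # first line becomes "John met Mary"
strip_leading_spaces   # second line becomes "indented line"
```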

The Dollar Sign ($): End of Line

The meaning of the dollar sign in regular expressions is the exact opposite of the caret's: it stands for "end of line." Like the caret, the dollar sign does not occupy a character position, but rather forces the expression to match the end of the line.

For example, this ed instruction replaces "John" with "Mary" only when "John" is the last word on the line:

1,$s/John$/Mary/

Remember that in the range "1,$", the dollar sign means the last line in the file, while in the regular expression "John$", the dollar sign means the end of the line.

This next instruction removes all trailing blanks at the end of lines in the entire file:

1,$s/  *$//

Again, please note that we used two spaces in the regular expression to indicate "one or more spaces" at the end of the line.
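Here is a small sketch of both dollar-sign substitutions, again using sed in place of ed so it can be run from the command prompt. The sample lines and function names are made up; printf is used so that the trailing spaces are spelled out explicitly.

```shell
# Two sample lines (hypothetical); the second ends with three spaces.
sample() {
    printf 'John met John\ntrailing spaces   \n'
}

# Only the "John" at the end of the line is replaced.
replace_last_john() { sample | sed 's/John$/Mary/'; }

# One space followed by zero or more spaces, anchored at the end of the line.
strip_trailing_spaces() { sample | sed 's/  *$//'; }

replace_last_john       # first line becomes "John met Mary"
strip_trailing_spaces   # second line becomes "trailing spaces"
```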

Here is an example of how you could use the caret and the dollar sign together to delete all empty lines in a file, i.e. lines that have nothing between the start of the line (indicated by the caret) and the end of the line (indicated by the dollar sign):

g/^$/d

The expression above could be read as follows: "Globally, wherever the beginning of the line is immediately followed by the end of the line, delete the line."

Note that if there are lines in the file with no visible text but that may contain one or more spaces, these lines would remain intact since the command above would only delete the lines that are truly empty.

How would you tell ed to delete these lines as well? Simply change the regular expression to indicate "beginning of line, followed by zero or more spaces, followed by the end of the line," like this:

g/^ *$/d

Note that we are specifying the pattern " *" featuring a single space, which means "zero or more spaces." So, if a line is truly blank (zero spaces), it will be matched; similarly, if a line has one space or multiple spaces, it will also be matched. The result is that all lines containing no visible text will be deleted (unless some lines also contain tab characters or other invisible characters, which we will leave as an exercise for you, dear Reader).
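The difference between "^$" and "^ *$" is easy to see in a quick experiment. This sketch uses sed's d command, which deletes matching lines much like ed's g/re/d; the sample text and function names are hypothetical.

```shell
# Four sample lines (hypothetical): text, an empty line,
# a line holding three spaces, and text again.
sample() {
    printf 'first\n\n   \nlast\n'
}

# Deletes only the truly empty line; the spaces-only line survives.
delete_empty() { sample | sed '/^$/d'; }

# Deletes empty lines and lines made up of spaces only.
delete_blank() { sample | sed '/^ *$/d'; }

delete_empty   # three lines remain
delete_blank   # two lines remain: "first" and "last"
```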

Cross-Reference of Shell vs Text Wild-Card Characters

The Linux shell (i.e. the command prompt) uses a set of wild-card characters for filename expansion which is somewhat similar but not identical to regular expressions used by text-manipulating utilities that we have just covered in this tutorial.

Since it's easy for inexperienced users to confuse the two, the following chart will hopefully be helpful in clarifying the differences:

Shell     Text Utilities   Meaning
?         .                Any single character
*         .*               Zero or more of any characters
[abc]     [abc]            Any single character in this list
[a-z]     [a-z]            Any single character in this range
[!abc]    [^abc]           Any single character not in this list
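To see the two sets of wild-cards side by side, the sketch below tests the same word against a shell pattern and against a regular expression. The helper names glob_match and regex_match are made up for this illustration: the first relies on the shell's case statement, which applies filename-style wild-cards to a string, and the second relies on grep -x.

```shell
# Shell-style match: "case" applies the shell's wild-cards to a string.
# $1 = pattern, $2 = word. The pattern is left unquoted on purpose
# so that its wild-card characters stay active.
glob_match() {
    case "$2" in
        $1) echo yes ;;
        *)  echo no ;;
    esac
}

# Regex-style match: grep -x applies a regular expression to the whole word.
regex_match() {
    if printf '%s\n' "$2" | grep -qx "$1"; then echo yes; else echo no; fi
}

glob_match  'ro*t'  robot   # yes: "*" is any run of characters
regex_match 'ro*t'  robot   # no:  "o*" is a run of o's only
regex_match 'ro.*t' robot   # yes: ".*" is the regex equivalent
glob_match  'r?ot'  root    # yes: "?" is any single character
regex_match 'r.ot'  root    # yes: "." is the regex equivalent
```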

 

 

Extended Expressions

The regular expressions we have covered so far are supported by most text-processing utilities such as ed and grep. However, additional and more powerful constructs are supported by more advanced utilities such as the ex editor, vi and egrep ("extended grep").

These are called "extended regular expressions"; we will examine some of the most useful ones here.

Matching Whole Words

A common requirement is to search for a string that constitutes a whole word and not just part of one. For instance, you may need to replace all instances of the word "ask" with "request" without affecting words like "basket" or "task."

Two special constructs exist to give you control over this type of search. The pair of characters "\<" can be used to specify the beginning of a word, while "\>" indicates the end of a word. As a result, searching for the regular expression "\<ask\>" will retrieve all instances of the word "ask" when used as a whole word without picking up the string "ask" when part of a larger word.

IMPORTANT: These constructs are extensions rather than part of the POSIX standard, so support varies between implementations. Some versions of grep do not recognize them, although GNU grep accepts them in both grep and egrep (which stands for "extended grep").

Note that it is not necessary to use both "\<" and "\>" in a pattern if you only want to specify either the beginning or the end of a word in your search. For instance, if you want to search for all instances of the name "John" in a file without picking up instances of "Johnny," you would use the regular expression "John\>" as in this example:

$ cat testfile
My friend John,
also known as Johnny,
is a jolly good fellow.
$ grep "John" testfile
My friend John,
also known as Johnny,
$ egrep "John\>" testfile
My friend John,
$ _

Note that this regular expression "understands" punctuation and is able to determine that "John," (with the comma) constitutes a whole word.

Similarly, the expression "\<paper" would match "paper" or "papers" but not "newspaper."
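The "ask" and "paper" cases mentioned above can be verified with a short sketch. It assumes GNU grep, which recognizes "\<" and "\>"; other implementations may behave differently. The sample text and function names are hypothetical.

```shell
# Sample lines (hypothetical test data).
lines='please ask him
the basket is full
one more task'

words='paper
papers
newspaper'

# "ask" as a whole word only.
whole_word_ask() { printf '%s\n' "$lines" | grep '\<ask\>'; }

# "paper" at the beginning of a word only.
word_start_paper() { printf '%s\n' "$words" | grep '\<paper'; }

whole_word_ask     # please ask him
word_start_paper   # paper, papers
```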

Capturing Segments of Text

The round parentheses, when preceded with a backslash (as in "\(" and "\)"), allow you to bracket any arbitrary segment of text and refer to it in a substitution.

For instance, let's assume we have a file containing the following text:

Contest Results
Dianne: 1st Place
Tyler: 2nd Place
Ryan: 3rd Place
Denise: Honorable Mention
Jack: Honorable Mention

Let's further assume that you need to edit this list so that the contestant's position is listed first, followed by the contestant's name, as in:

Contest Results
1st Place: Dianne
2nd Place: Tyler
3rd Place: Ryan
Honorable Mention: Denise
Honorable Mention: Jack

Using the constructs we are introducing in this section, you could do this editing job with a single command in vi! Here's how.

The simplest way to explain how to use the round parentheses is to show you the command that performs the editing job we just mentioned, and then to break it down one step at a time. So, assuming you are in vi, you would press the colon to access the command line and enter the following command:

1,$s/\(.*\): \(.*\)/\2: \1/

Looks intimidating, doesn't it? Fortunately, once we break it down, it's quite simple.

The start of the command ("1,$s") means "From line 1 to the last line, do the following substitution." This is basic ed syntax which you are probably familiar with by now (if not, read our ed tutorial on this site).

The rest of the line is what may look a little puzzling at the moment. Keep in mind that it boils down to this familiar substitution syntax:

s/something/something else/

The difference here is that we have surrounded specific portions of our wild-card patterns with round parentheses preceded with a backslash to capture and store those specific line fragments for later use.

The first line segment we capture is "\(.*\):" which means all the characters up to the colon. Note that we close the parentheses before the colon since we do not want to capture that character as part of our segment. For example, in the line "Dianne: 1st Place," we capture the string "Dianne" (without the colon) between parentheses.

The second set of round parentheses captures all the characters that come after a colon and a space, which on this line would be the string "1st Place."

In the second part of the substitution, we refer to these line fragments using positional parameters. Specifically, the expression "\1" refers to the first fragment, "\2" refers to the second, and so on. In our example, we only used two sets of parentheses, so we have two fragments we can use. We are telling vi to replace the pattern found in the first part of the substitution with "\2" (the second fragment), followed by a colon and a space, followed by "\1" which represents the first fragment.

As a result, vi will replace "Dianne: 1st Place" with "\2: \1", meaning the string "1st Place" (represented by "\2"), followed by a colon and a space, followed by the string "Dianne" (represented by "\1").

Of course, since we ran this command on all the lines (as indicated by the "1,$" prefix), all the lines matching the pattern will have been converted by this single command.

You will notice that nothing happened to the very first line containing the text "Contest Results" because that line did not match the pattern we were looking for in our substitution command (it did not contain a colon followed by a space), so the substitution was not performed on that line.
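If you would like to try this substitution without opening vi, the same pattern works in sed, whose s command also supports "\(", "\)" and the "\1", "\2" references. This is only a sketch: the function name swap_columns is made up, and sed applies the command to every line by default, so the "1,$" prefix is not needed.

```shell
# The contest list from the example above.
results='Contest Results
Dianne: 1st Place
Tyler: 2nd Place
Ryan: 3rd Place
Denise: Honorable Mention
Jack: Honorable Mention'

# Capture the text before and after ": " and emit the two in reverse order.
swap_columns() {
    printf '%s\n' "$results" | sed 's/\(.*\): \(.*\)/\2: \1/'
}

swap_columns
```

The title line comes out unchanged, since it contains no colon followed by a space.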

Changing Case

Another use of the round parentheses is to change the case of any portion of text we capture using a wild-card pattern. This can be particularly useful when we want to capitalize certain words or turn entire phrases to all-caps.

To use this feature, substitute a line fragment captured using the round parentheses as described above with the same fragment preceded with one of the following modifiers:

\u First-Letter Upper-Case
\U Switch to All-Caps
\l First-Letter Lower-Case
\L Switch to All Lower-Case

For instance, if we wanted to convert the title line ("Contest Results") in our previous example to all-caps, we would position the cursor on that line in vi, press the colon to access the command line, and type in the following instruction:

s/\(.*\)/\U\1/

In this substitution command, we are asking vi to replace all characters on the line (.*) with the same thing (\1), only all upper-case (\U).

We can use the same technique to convert the string "Denise" to all-caps in this sample text:

1st Place: Denise

In vi, we would issue the following instruction to accomplish this task:

s/\(.*\): \(.*\)/\1: \U\2/

...which would result in this:

1st Place: DENISE

This example is almost identical to what we did earlier to reverse the column order in our list, except this time we are keeping the order intact (we specify "\1" first, followed by a colon, a space, and "\2") but we precede the positional parameter "\2" with the "\U" modifier to convert it to all-caps.

Naturally, this is a lot of effort for a simple text modification on a single line, but the power of these substitutions is really appreciated when you have to perform a massive search-and-replace operation on a large file containing hundreds of lines that would otherwise have to be modified manually.
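For such bulk operations it can be handy to run the same substitutions from the command prompt. The sketch below assumes GNU sed, which supports the "\U" and "\u" modifiers as an extension; other versions of sed may not. The function names are hypothetical.

```shell
# Convert a whole line to all-caps (GNU sed assumed for \U).
all_caps() { printf '%s\n' "$1" | sed 's/\(.*\)/\U\1/'; }

# Keep the first fragment as-is and convert the second to all-caps.
caps_after_colon() { printf '%s\n' "$1" | sed 's/\(.*\): \(.*\)/\1: \U\2/'; }

# Upper-case the first letter only (GNU sed assumed for \u).
first_upper() { printf '%s\n' "$1" | sed 's/\(.*\)/\u\1/'; }

all_caps 'Contest Results'             # CONTEST RESULTS
caps_after_colon '1st Place: Denise'   # 1st Place: DENISE
first_upper 'denise'                   # Denise
```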

 

 

