Regular Expressions - Page 2
February 9, 2001
"11:15. Restate my assumptions:
Mathematics is the language of nature.
Everything around us can be represented and understood through
numbers.
If you graph these numbers, patterns emerge. Therefore: There are
patterns everywhere in nature."
- Max Cohen in Pi, 1998
Whether or not you agree that Max's assumptions give rise to his
conclusion is your own opinion, but his case is much easier to
follow in the field of computers — there are certainly
patterns everywhere in programming.
Regular expressions allow us to look for patterns in our
data. So far we've been limited to checking a single value
against that of a scalar variable or the contents of an array or
hash. By using the rules outlined in this chapter, we can use
that one single value (or pattern) to describe what we're looking
for in more general terms: we can check that every sentence in a
file begins with a capital letter and ends with a full stop, find
out how many times James Bond's name is mentioned in
'Goldfinger', or learn if there are any repeated sequences of
more than five digits in the decimal representation of π.
However, regular expressions are a very big area — they're
one of the most powerful features of Perl. We're going to break
our treatment of them up into six sections:
- Basic patterns
- Special characters to use
- Quantifiers, anchors and memorizing patterns
- Matching, substituting, and transforming text using patterns
- Backtracking
- A quick look at some simple pitfalls
Generally speaking, if you want to ask Perl something about a
piece of text, regular expressions are going to be your first
port of call — however, there's probably one simple
question burning in your head.
What Are They?
The term "Regular Expression" (now commonly abbreviated to
"RegExp" or even "RE") simply refers to a pattern that follows
the rules of syntax outlined in the rest of this chapter. Regular
expressions are not limited to Perl — Unix utilities such
as sed and egrep use the same notation
for finding patterns in text. So why aren't they just called
'search patterns' or something less obscure?
Well, the actual phrase itself originates from the mid-fifties
when a mathematician called Stephen Kleene developed a notation
for manipulating 'regular sets'. Perl's regular expressions have
grown far beyond Kleene's original system, but some of his
notation remains, and the name has stuck.
Patterns
History lessons aside, it's all about identifying patterns in
text. So what constitutes a pattern? And how do you compare it
against something?
The simplest pattern is a word — a simple sequence of
characters — and we may, for example, want to ask Perl
whether a certain string contains that word. We can already do this
with the techniques we have seen: split the
string into separate words, then test each word to see whether it is
the one we're looking for. Here's how we might do that:
#!/usr/bin/perl
# match1.plx
use warnings;
use strict;

my $found = 0;
$_ = "Nobody wants to hurt you... 'cept, I do hurt people sometimes, Case.";

my $sought = "people";
foreach my $word (split) {
    if ($word eq $sought) {
        $found = 1;
        last;
    }
}
if ($found) {
    print "Hooray! Found the word 'people'\n";
}
Sure enough the program returns success:
>perl match1.plx
Hooray! Found the word 'people'
>
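The program works here, but it is worth peeking at the list that split hands to the loop. The following is a minimal sketch rather than part of the original listing; the file name and the @words array are ours, introduced only for illustration:

```perl
#!/usr/bin/perl
# splitwords.plx (hypothetical file name, for illustration only)
use warnings;
use strict;

$_ = "Nobody wants to hurt you... 'cept, I do hurt people sometimes, Case.";

# With no arguments, split breaks $_ apart on whitespace only,
# so any punctuation stays attached to the neighbouring word.
my @words = split;

# Print each piece in brackets so stray punctuation is easy to spot.
foreach my $word (@words) {
    print "[$word]\n";
}
```

Pieces such as [you...] and [sometimes,] come back with their punctuation still attached, so an eq test against the bare word would not succeed for them.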
But that's messy! It's complicated, and it's slow to boot! Worse
still, the split function (which breaks each of our
lines up into a list of 'words'; we'll see more of this
later in the chapter) actually keeps all the punctuation:
the string 'you' wouldn't be found in the
above, whereas 'you...' would. This looks like a
hard problem, but it should be easy. Perl was designed to make
easy tasks easy and hard things possible, so there should be a
better way to do this. This is how it looks using a regular
expression:
#!/usr/bin/perl
# match1.plx
use warnings;
use strict;

$_ = "Nobody wants to hurt you... 'cept, I do hurt people sometimes, Case.";
if ($_ =~ /people/) {
    print "Hooray! Found the word 'people'\n";
}
This is much, much easier and yields the same result. We place
the text we want to find between forward slashes. That's
the regular expression part: our pattern, what
we're trying to match. We also need to tell Perl which particular
string to look for that pattern in. We do this with the
=~ operator. This returns 1 if the pattern match was
successful (in our case, if the character sequence 'people'
was found in the string) and an empty string, which Perl treats
as false, if it wasn't.
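To see the result of =~ directly, we can store it in a variable instead of testing it straight away. This is only a sketch: the file name, the variables $text, $hit and $miss, and the second pattern /robots/ are ours, chosen for illustration.

```perl
#!/usr/bin/perl
# matchresult.plx (hypothetical file name, for illustration only)
use warnings;
use strict;

my $text = "I do hurt people sometimes, Case.";

# Store the result of each match rather than testing it immediately.
my $hit  = ($text =~ /people/);   # 'people' occurs in $text
my $miss = ($text =~ /robots/);   # 'robots' does not

print "people: ", ($hit  ? "matched" : "no match"), "\n";
print "robots: ", ($miss ? "matched" : "no match"), "\n";
```

The first line of output reports a match and the second does not, which is exactly the true-or-false answer that the if statement in match1.plx relies on.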
Beginning Perl