The Log-Analysis Script - Page 4
December 11, 2001
Now that the hostname lookups are taken care of, it's time to
write the log-analysis script. Example 8-2 shows the first
version of that script.
Example 8-2: log_report.plx, a web log-analysis script (first version)
#!/usr/bin/perl -w
# log_report.plx
# report on web visitors
use strict;
while (<>) {
    my ($host, $ident_user, $auth_user, $date, $time,
        $time_zone, $method, $url, $protocol, $status, $bytes) =
    /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
(\S+)" (\S+) (\S+)$/;
    print join "\n", $host, $ident_user, $auth_user, $date, $time,
        $time_zone, $method, $url, $protocol, $status,
        $bytes, "\n";
}
This first version of the script is simple. All it does is read
in lines via the <> operator, parse those
lines into their component pieces, and then print out the parsed
elements for debugging purposes. The line that does the printing
is interesting in that it uses Perl's join function, which you
haven't seen before. The join function is the opposite, so to
speak, of the split function: its first argument is a string that
is used to join the list comprising the rest of its arguments
into a single scalar. In other words, the Perl expression
join '-', 'a', 'b', 'c' would return the string a-b-c. In this
case, using \n to join the various elements parsed by our script
lets us print out a newline-separated list of those parsed items.
(Because the last item in the list is itself "\n", each record
is followed by a blank line.)
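To make the relationship between join and split concrete, here is a minimal sketch. The variable names $joined and @pieces are made up for this illustration and are not part of log_report.plx.

```perl
use strict;
use warnings;

# join: list -> string, with '-' placed between the elements
my $joined = join '-', 'a', 'b', 'c';
print "$joined\n";                      # prints a-b-c

# split: string -> list, breaking the string apart at each '-'
my @pieces = split /-/, $joined;
print scalar(@pieces), " pieces\n";     # prints 3 pieces

# joining with "\n" puts each element on its own line
print join("\n", @pieces), "\n";
```

The last statement is the same trick the script uses: the separator is a newline, so each element lands on a line of its own.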
The Mammoth Regular Expression
The real juicy part of this script, though, is that giant regular
expression used to parse each log file line into its component
parts. Here's that part of the script again:
my ($host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $url, $protocol, $status, $bytes) =
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
(\S+)" (\S+) (\S+)$/;
There are a couple of important things to note here. The first is
that it is actually fairly tricky to represent this regular
expression, which is meant to be on a single line, within the
limited width of this book's pages. It's particularly tricky in
this case because the spaces between the various elements are
important, but it's hard to keep track of those spaces when the
expression is broken to fit onto multiple lines. If you are going
to test this script yourself, be sure that your version of the
expression is all on one line, with a single space character
between the right parenthesis that ends the first line and the
left parenthesis that begins the second line. (Or you can just
download the example from the book's web site, at
http://www.elanus.net/book/, since the downloadable example
doesn't feature those problematic line breaks.) You also can
refer to the version of this expression created using the
/x modifier, which is described in the accompanying
sidebar, "Regular Expression Extensions," and use that
version instead of the one-line version given here.
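If you would like to check your one-line version before running it against real logs, a sketch along the following lines may help. The helper name parse_line and the sample log line are invented for this illustration; only the pattern itself comes from Example 8-2, written here on a single line.

```perl
use strict;
use warnings;

# A made-up log line, for illustration only.
my $line = '192.0.2.10 - - [11/Dec/2001:10:15:32 -0800] '
         . '"GET /index.html HTTP/1.0" 200 4523';

# parse_line is a hypothetical helper. It returns the eleven captured
# fields in list context, or an empty list if the line does not match.
sub parse_line {
    my ($text) = @_;
    return $text =~ /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+)$/;
}

my @fields = parse_line($line);
print scalar(@fields), " fields parsed\n";
print "host=$fields[0] url=$fields[7] status=$fields[9]\n";
```

If the spaces in your copy of the expression are right, the sample line should yield all eleven fields. An empty list usually means a space went missing where the expression was rejoined.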
Regular Expression Extensions
Putting the /x modifier at the end of a regular
expression lets you use regular expression
"extensions." This means that you can put
whitespace characters (like spaces, tabs, and newlines)
into the expression, and Perl will ignore them when trying
to make a match. (The one exception to this is inside a square-
bracketed character class, where literal whitespace characters
will still count.) To get a literal whitespace character outside
a character class you need to precede it by a backslash. Also,
you can embed comments in the expression by preceding them with
the hash symbol (#), just like you can with regular
Perl statements. The idea is that you can break your expression
across multiple lines and use indenting and comments in an effort
to make it more easily understood.
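As a small illustration of these rules, the following sketch matches the same made-up string four ways. The variable names are hypothetical and chosen only for this example.

```perl
use strict;
use warnings;

my $text = 'hello world';   # made-up sample text

# Without /x, the space in the pattern is an ordinary literal space.
my $plain = ($text =~ /hello world/) ? 1 : 0;

# With /x, bare whitespace and comments are skipped, so the space
# between the words is written as a backslash followed by a space.
my $extended = ($text =~ /
    hello     # first word
    \         # literal space
    world     # second word
/x) ? 1 : 0;

# Inside a character class, a space still counts under /x.
my $class = ($text =~ / hello [ ] world /x) ? 1 : 0;

# Leaving out the backslash turns the pattern into "helloworld".
my $forgot = ($text =~ / hello world /x) ? 1 : 0;

print "plain=$plain extended=$extended class=$class forgot=$forgot\n";
```

This should print plain=1 extended=1 class=1 forgot=0. The last case is the one to watch for when converting an existing expression to use /x.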
With a substitution expression, by the way, the /x
modifier applies only to the search pattern (the first half of
the expression). The replacement part (the second half) still
treats whitespace and the # sign as literal
characters.
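A short sketch of that difference, using a made-up timestamp string and hypothetical variable names: the search pattern is spread out and commented, while the replacement's spaces and # sign end up in the result unchanged.

```perl
use strict;
use warnings;

my $stamp = '11/Dec/2001:10:15:32';   # made-up sample value

# Copy $stamp into $spaced, then run the substitution on the copy.
# The pattern half honors /x; the replacement half " # " is literal.
(my $spaced = $stamp) =~ s/
    :        # the colon between the date and the time
/ # /x;

print "$spaced\n";   # prints 11/Dec/2001 # 10:15:32
```

Only the first colon is replaced, because the substitution has no /g modifier.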
Here's how you might use the /x modifier to
represent the regular expression in Example 8-2:
my ($host, $ident_user, $auth_user, $date, $time,
    $time_zone, $method, $url, $protocol, $status,
    $bytes) =
/                    # regexp begins
    ^                # beginning-of-string anchor
    (\S+)            # assigned to $host
    \                # literal space
    (\S+)            # assigned to $ident_user
    \                # literal space
    (\S+)            # assigned to $auth_user
    \                # literal space
    \[([^:]+)        # assigned to $date
    :                # literal :
    (\d+:\d+:\d+)    # assigned to $time
    \                # literal space
    ([^\]]+)         # assigned to $time_zone
    \]\ "            # literal string '] "'
    (\S+)            # assigned to $method
    \                # literal space
    (.+?)            # assigned to $url
    \                # literal space
    (\S+)            # assigned to $protocol
    "\               # literal string '" '
    (\S+)            # assigned to $status
    \                # literal space
    (\S+)            # assigned to $bytes
    $                # end-of-string anchor
/x;                  # regexp ends, with x modifier
Perl for Web Site Management