Storing the Data - Page 7
December 14, 2001
Now that we're successfully parsing out the individual elements
from each line in the log file, what are we going to do with
them? It's time to think about what sorts of things we want to
keep track of, and how to represent them in our data structure.
One good thing to keep track of is the time of the first and last
access processed. When printed out in our report, this will let
us see what range of time is covered by the analyzed log file
lines. Another obvious thing to keep track of is how many raw
hits are in the log file. Similarly, we can track the total
amount of data (in megabytes) sent out by the server, and the
number of HTML page views. We'll begin implementing these
features by adding the following to the top of the
log_report.plx script, just before the start of the
while loop that parses the log file lines:
my($begin_time, $end_time, $total_hits, $total_mb, $total_views);
This establishes a number of scalar variables that will be
visible from this point through the end of the script, and will
be used to store the various categories of information we're
interested in tracking.
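As a small illustration of why no explicit initialization is needed here, the following standalone sketch (not part of log_report.plx; the value 0.5 is made up for the demonstration) shows that variables declared in a my list start out undefined, that an undefined value tests as false, and that the ++ and += operators quietly treat it as zero:

```perl
use strict;
use warnings;

# Declared in one list, all five variables start out undefined.
my ($begin_time, $end_time, $total_hits, $total_mb, $total_views);

# An undefined value is false, so a test such as
# "unless ($begin_time)" succeeds until the variable is assigned.
print "begin_time is not set yet\n" unless $begin_time;

# ++ and += treat an undefined variable as 0 and do not trigger an
# "uninitialized value" warning, so no explicit "= 0" is required.
++$total_hits;
$total_mb += 0.5;    # 0.5 is an arbitrary demonstration value

print "total_hits: $total_hits\n";
print "total_mb: $total_mb\n";
```

One caveat under this approach: a counter that is never incremented stays undefined, so printing it later under warnings would complain. Initializing the counters to 0 is a reasonable alternative if that matters for the report.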
Now, at the end of the while loop, we'll comment out
that debugging print statement and add the new lines shown here
in order to store those various pieces of data:
    # print join "\n", $host, $ident_user, $auth_user, $date, $time,
    #                  $time_zone, $method, $url, $protocol, $status,
    #                  $bytes, $referer, $agent, "\n";

    unless ($begin_time) {
        $begin_time = "$date:$time";
    }
    $end_time = "$date:$time";
    ++$total_hits;
    $total_mb += ($bytes / (1024 * 1024));

    next if $url =~ /\.(gif|jpg|jpeg|png|xbm)$/i;
    # don't care about these for visit-tracking purposes

    ++$total_views;

    &store_line($host, $date, $time, $url, $referer, $agent);
}
We stick the assignment to $begin_time inside an
unless block that checks to see if the variable has
been assigned already, so it only gets assigned when the first
line of the log file is processed. The $end_time
variable is just overwritten with the current values of
$date and $time for every line, such
that we end up with the date and time of the last access when
we're done parsing the log file. Adding one to
$total_hits each time through the loop using the
auto-increment operator (++) is easy
enough to understand. $total_mb is assigned using
the interesting += operator, which does what you
would probably guess it does: it takes whatever number is on the
right and adds it to the contents of the variable on the left,
storing the new sum in the variable. It is thus the equivalent
of:
$total_mb = $total_mb + ($bytes / (1024 * 1024));
except it's a bit easier to write. Dividing $bytes
by the product of 1024 * 1024 simply converts that
number to megabytes. The next line uses that handy condensed form
of an if statement, in which the condition follows the action:
do something if something else is true. In this case, it says to bail out
and go to the next cycle through the while loop
(which in this case means going to the next line in the log file)
if the contents of $url end in .gif,
.jpg, .jpeg, .png, or
.xbm. This reflects the fact that we're only
interested in actual "page views" at this point, and
don't care about the image files whose requests also end up in
the log file. We could instead have used something like:
next unless $url =~ /\.html?$/;
which would skip to the next line from the log file unless the
current line's $url ended in .htm or
.html, but this would skip requests for CGI scripts
and for directories that return a default page such as
index.html. It probably makes sense to count those
requests in $total_views. Next, now that we've
gotten rid of those extraneous log file entries, it's time to add
one to the contents of $total_views. And finally, we
invoke a subroutine called &store_line with the
arguments $host, $date, $time,
$url, $referer, and
$agent. We'll be using that subroutine in an effort
to generate statistics on something more interesting: the
activities of the individual visitors to our site.
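To see the bookkeeping described above in action, here is a minimal standalone sketch. It is an illustration under stated assumptions, not the log_report.plx script itself: the summarize subroutine, the @records sample data, and the html_only counter are all hypothetical names invented for this example, the records are assumed to be already parsed into fields, and the call to &store_line is left out. It also assumes that a byte count may be logged as "-" when no data was sent, and counts that as zero so the division does not produce a warning.

```perl
use strict;
use warnings;

# Hypothetical, already-parsed records: [date, time, url, bytes].
my @records = (
    ['14/Dec/2001', '10:00:01', '/index.html',         2048],
    ['14/Dec/2001', '10:00:02', '/images/logo.gif',    1024],
    ['14/Dec/2001', '10:05:30', '/cgi-bin/search.cgi', '-'],
    ['14/Dec/2001', '10:07:45', '/docs/',              1048576],
);

# Hypothetical helper: applies the same per-line bookkeeping as the
# while loop in the text and returns the totals in a hash reference.
sub summarize {
    my @recs = @_;
    my ($begin_time, $end_time);
    my ($total_hits, $total_mb, $total_views, $html_only) = (0, 0, 0, 0);

    for my $rec (@recs) {
        my ($date, $time, $url, $bytes) = @$rec;

        # Only the first record sets the start of the time range.
        $begin_time = "$date:$time" unless $begin_time;
        $end_time   = "$date:$time";

        ++$total_hits;

        # Assumption: treat a non-numeric byte count such as "-" as 0.
        $bytes = 0 unless $bytes =~ /^\d+$/;
        $total_mb += $bytes / (1024 * 1024);

        # For comparison: what the stricter "HTML only" test would count.
        ++$html_only if $url =~ /\.html?$/;

        # Skip image requests; everything else counts as a page view.
        next if $url =~ /\.(gif|jpg|jpeg|png|xbm)$/i;
        ++$total_views;
    }

    return {
        begin_time  => $begin_time,
        end_time    => $end_time,
        total_hits  => $total_hits,
        total_mb    => $total_mb,
        total_views => $total_views,
        html_only   => $html_only,
    };
}

my $summary = summarize(@records);
print  "Range:      $summary->{begin_time} to $summary->{end_time}\n";
print  "Hits:       $summary->{total_hits}\n";
printf "Megabytes:  %.4f\n", $summary->{total_mb};
print  "Page views: $summary->{total_views}\n";
print  "HTML only:  $summary->{html_only}\n";
```

With this sample data, the image-exclusion test counts the CGI script and the directory request as page views, while the stricter .htm/.html test counts only /index.html, which is the difference between the two approaches discussed above.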