Parsing Web Access Logs
December 4, 2001
Web server access logs are an excellent source of information
about what your site's visitors are up to. The information on
separate visitors is all mixed together, though, and for all but
the smallest sites the raw access logs are too large to read
directly. What you need is log analysis software to make the
information in the log more easily accessible. You can buy
commercial log analysis software to do this, but Perl makes it
easy to write your own. The next three chapters describe how to
build such a home-grown log-analysis tool.
This chapter focuses on the first part of the process: extracting
and storing the information we're interested in. We talk about
log file structure, converting IP addresses, and creating regular
expressions capable of parsing web access logs. We also talk
about creating a suitable data structure for storing the
extracted data, so we can answer interesting questions about what
our site's visitors have been doing. Along the way we discuss the
difficulty of identifying those visitors in the web server's log
entries and devise an approach for extracting at least an
approximate version of that information.
The example continues in Chapter 9, which focuses on how to do
computations involving dates and times, and finishes in Chapter
10, which covers the specifics of how we manipulate the
"visit" information from our logs, as well as the
actual output of the finished report.
Log File Structure
Most web servers store their access log in what is called the
"common log format." Each time a user requests a file
from the server, a line containing the following fields is added
to the end of the log file:
- host: This is either the IP address (like
207.71.222.231) or the corresponding hostname (like
pm9-31.sba1.avtel.net) of the remote user requesting
the page. For performance reasons, many web servers are
configured not to do hostname lookups on the remote host. This
means that all you end up with in the log file is a bunch of IP
addresses. A bit later in this chapter, you'll develop a Perl
script that you can use to convert those IP addresses into
hostnames.
- identd result: This is a field for logging the
response returned by the remote user's identd server.
Almost no one actually uses this; in every web log I've ever
seen, this field is always just a dash (-).
- authuser: If you are using basic HTTP
authentication (which we'll be talking about in Chapter 19) to
restrict access to some of your web documents, this is where the
username of the authenticated user for this transaction will be
recorded. Otherwise, it will be just a dash (-).
- date and time: Next comes a date and time string
inside square brackets, like: [06/Jul/1999:00:09:12 -0700].
That's the day of the month, the abbreviated month
name, and the four-digit year, all separated by slashes. Next
come the time (expressed in 24-hour format, so 11:30 P.M. would
be 23:30:00) and a time-zone offset (in this example, -0700,
because the web server this log was from was using Pacific
Daylight Time, which is seven hours behind Universal
Time/Greenwich Mean Time).
- request: This is the actual request sent by the remote
user, enclosed in double quotes. Normally it will look something
like:
"GET / HTTP/1.0". The
GET part means it is a GET request (as opposed to a
POST or a HEAD request). The next part is the path of the URL
requested; in this case, the default page in the server's
top-level directory, as indicated by a single slash (/).
The last part of the request is the protocol being used, at the
time of this writing typically HTTP/1.0 or HTTP/1.1.
- status code: This is the status code returned by the
server; by definition this will be a three-digit number. A status
code of
200 means everything was handled okay,
304 means the document has not changed since the
client last requested it, 404 means the document
could not be found, and 500 indicates that there was
some sort of server-side error. (More detail on the various
status codes can be found in RFC 1945, which describes the
HTTP/1.0 protocol. See
http://www.w3.org/Protocols/rfc1945/rfc1945.)
- bytes sent: The number of bytes of data returned by the server,
not counting the response headers.
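To make the field layout concrete, here is a minimal sketch of splitting one common-log-format line into the fields just described. It is an illustration under stated assumptions, not the parsing code this chapter goes on to develop: the helper name parse_clf_line is hypothetical, and the sample line is assembled from the examples above with an invented byte count.

```perl
use strict;
use warnings;

# parse_clf_line is a hypothetical helper name used only for this sketch.
# It returns a hash reference of fields, or nothing if the line does not match.
sub parse_clf_line {
    my ($line) = @_;

    # One capture group per field, in the order the fields are logged.
    my ($host, $ident, $user, $when, $request, $status, $bytes) = $line =~ m{
        ^(\S+)\s+            # host
        (\S+)\s+             # identd result
        (\S+)\s+             # authuser
        \[([^\]]+)\]\s+      # date and time, brackets dropped
        "([^"]*)"\s+         # request, quotes dropped
        (\d{3})\s+           # status code
        (\d+|-)              # bytes sent; some servers log a dash for an empty body
        \s*$
    }x or return;

    # Break the timestamp into day, month name, year, time, and zone offset.
    my ($day, $month, $year, $hour, $min, $sec, $zone) =
        $when =~ m{^(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})\s+([-+]\d{4})$}
        or return;

    # The request is normally "METHOD path protocol", separated by spaces.
    my ($method, $path, $protocol) = split ' ', $request;

    return {
        host   => $host,   ident => $ident, user     => $user,
        day    => $day,    month => $month, year     => $year,
        hour   => $hour,   min   => $min,   sec      => $sec,  zone => $zone,
        method => $method, path  => $path,  protocol => $protocol,
        status => $status, bytes => $bytes,
    };
}

# Sample line built from the examples above; the byte count 1024 is invented.
my $sample = '207.71.222.231 - - [06/Jul/1999:00:09:12 -0700] "GET / HTTP/1.0" 200 1024';
my $entry  = parse_clf_line($sample) or die "Could not parse: $sample\n";

print "host=$entry->{host} status=$entry->{status} bytes=$entry->{bytes}\n";
print "date=$entry->{day} $entry->{month} $entry->{year} zone=$entry->{zone}\n";
print "method=$entry->{method} path=$entry->{path} protocol=$entry->{protocol}\n";
```

The sketch assumes the authuser field contains no spaces and the request contains no embedded double quotes; real logs occasionally break both assumptions, so a line that fails to match is better skipped and counted than treated as fatal.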
An extended version of this log format, often referred to as the
"combined" format, includes two additional fields at
the end:
- referer: The referring page, if any, as reported by
the remote user's browser. Note that referer is
consistently misspelled (with a single "r" in the
middle) in the HTTP specification, and in the name of the
corresponding environment variable.
- user agent: The user agent reported by the remote
user's browser. Typically, this is a string describing the type
and version of browser software being used.
Assuming you have control over your web server's configuration,
or can get your ISP to modify it for you, the combined format's
extra fields can provide some very interesting information about
the users visiting your site. The log analysis script described
in this chapter will work with either format, however.
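One way a script can accept either format is to treat the two extra fields as optional. The following is a rough sketch of that idea, not the chapter's own implementation: the helper name trailing_fields is hypothetical, and the referer and user agent strings in the sample line are invented for illustration.

```perl
use strict;
use warnings;

# trailing_fields is a hypothetical helper name used only for this sketch.
# It returns (referer, user agent), both undef for a common-format line.
sub trailing_fields {
    my ($line) = @_;

    # Anchor on the end of the request, the status code, and the byte count,
    # then optionally capture two more double-quoted strings.
    my ($referer, $agent) = $line =~ m{
        "\s+\d{3}\s+(?:\d+|-)           # end of request, status code, bytes sent
        (?:\s+"([^"]*)"\s+"([^"]*)")?   # optional referer and user agent
        \s*$
    }x;

    return ($referer, $agent);
}

# Sample lines; the byte count, referer, and user agent values are invented.
my $common   = '207.71.222.231 - - [06/Jul/1999:00:09:12 -0700] "GET / HTTP/1.0" 200 1024';
my $combined = $common
    . ' "http://www.example.com/links.html" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"';

for my $line ($common, $combined) {
    my ($referer, $agent) = trailing_fields($line);
    if (defined $agent) {
        print "combined: referer=$referer agent=$agent\n";
    }
    else {
        print "common: no referer or user agent logged\n";
    }
}
```

This sketch assumes neither extra field contains an embedded double quote. A browser that reports no referring page is typically logged as a dash, which would come back here as the string "-" rather than as undef.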
Perl for Web Site Management