Web Server Log Rotation and Analysis

Workshop Requirements

In order to make the most of this workshop you should have:


Unlike many of the other workshops, which are aimed at learning within a development environment, this workshop is really only of use in a production environment where you are interested in logs, hits and web traffic in general. That is not to say that you should not first practise in a development environment, but you will only see the benefit when the techniques are properly deployed.

This Virtual Workshop will cover how to filter unwanted information out of your web server logs, how to implement a log rotation strategy, and how to automate the creation of log analysis reports using a variety of software. As I run my production servers in a Linux environment, there may be a slight bias towards the Linux way of doing things (and because everything seems much easier to do with Linux), but efforts have been made throughout to ensure that the majority of methods work with Windows as well.

What Don't We Want to Keep?

This may seem like a strange question, but it is worth thinking about because Apache can and will record a hit for every file that is requested unless you tell it otherwise. This means that for one view of a web page, every file used on that page (images, CSS, external JavaScript etc.) will be recorded as a hit. Most log analysis software will sort out the actual page views when producing the stats, but unless you have a specific desire to look at data about images or stylesheets it is better not to record them at all. There are also other things you may not wish to record, such as search engine robots trawling your site, or worms checking for files that ship with IIS in an attempt to gain access to the server (we can adopt an air of superiority due to using Apache ;-). The effect of not recording all this extra data is that the log files are kept significantly smaller, as can be seen by comparing two test logs that each record one week's data.


-rw-r--r-- 1 keith keith 8967670 Mar 1 00:00 hits_log


-rw-r--r-- 1 keith keith 2562468 May 1 00:00 access_log

The unfiltered log is over three times the size of the filtered one. So, having decided that we don't want to record every hit on the server, we next need to set up the filters in the httpd.conf file. This is done by creating a custom environmental variable, the existence (or not) of which acts as a filter when given as an argument to the CustomLog directive (it can be preceded by the 'not' operator '!'). For example, if we have created a filter called 'mylogs':

CustomLog logs/access.log combined env=mylogs

This would log only those hits that matched the filter. Or, to ensure that everything EXCEPT the filter was logged:

CustomLog logs/access.log combined env=!mylogs

Obviously before a filter can be applied it must first be defined.


SetEnvIf

This directive is used to set the environmental variables that are used as a filter. Without wishing to get too complicated, a custom environmental variable can be set in three different ways:

  1. A literal value, e.g. myenv=keith
  2. The presence of a variable (which actually has a value of 1), e.g. myenv
  3. The removal of a set variable, e.g. !myenv

We are interested in the second option: simply creating the variable so that we can test whether it exists or not when it is passed to the CustomLog directive. This is done using the SetEnvIf directive to check the attributes of a hit on the server for elements we don't want to log. These attributes can include header values (e.g. User-Agent), environmental variables set by Apache when serving the file, or even a previous environmental variable set using SetEnvIf. The attributes set by Apache for each request include Remote_Host, Remote_Addr, Request_Method, Request_Protocol and Request_URI.

A simplified syntax for the SetEnvIf directive is as follows.

SetEnvIf    Attribute    match_condition    set_new_variable_name

So, for example, if you did not want to log Googlebot's visits to your site, you could use the User-Agent attribute to trigger a match and create an environmental variable called dontlog.

SetEnvIf    User-Agent   Googlebot          dontlog 

Or you could use Remote_Host; to match any bot from AltaVista you would enter:

SetEnvIf    Remote_Host  "sv.av.com$"       dontlog

Notice the use of the dollar sign '$'. This is part of the Perl-compatible regular expressions that Apache uses to allow flexible matching; in this instance it matches any hostname that ends in 'sv.av.com'. This is because AltaVista runs its robots from hosts such as trek32.sv.av.com or drone7.sv.av.com. At this point it is worth examining regular expressions a little closer.


Regular Expressions

Regular expressions are a powerful means of matching exactly what you require and are much more flexible than 'ordinary' wildcards. Below is a brief selection of the most common metacharacters and some example usage.

Meta Character   Purpose
^                Anchors the match to the start of the string
$                Anchors the match to the end of the string
foo|bar          Matches either foo or bar
(foo)            Matches foo as a group
[abc]            Matches any character in the set 'abc'
[a-z]            Matches any character in the range a-z
[^abc]           Matches any character that is not in the set 'abc'
\s               Matches a whitespace character

So let's consider two lines from a logfile:

spider2.cpe.ku.ac.th - - [08/Aug/2003:05:34:52 +0100] "GET /robots.txt HTTP/1.1" 200 169 "-" "SpiderKU/0.9"

and - - [08/Aug/2003:10:49:59 +0100] "GET / HTTP/1.0" 200 4313 "-" "Mozilla/4.0   (compatible; MSIE 6.0; Windows NT 5.1)"

So let's look at some examples using these two lines of the logfile:

To match the Remote_Host:

SetEnvIf  Remote_Host  "^spider"       dontlog

would catch the first line, as it begins with 'spider'. Specifying a range of letters only (upper and lower case) would again match only the first line, as the second line's remote host is an IP address and so contains no letters:

SetEnvIf  Remote_Host [a-zA-Z]         dontlog

Or, anchoring a negated range to the start of the string, so that only hosts beginning with something other than a letter (such as an IP address) match:

SetEnvIf  Remote_Host "^[^a-zA-Z]"     dontlog

To match a User-Agent:

SetEnvIf  User-Agent "MSIE"            dontlog

Or a requested page.

SetEnvIf Request_URI "^/$" dontlog

This matches only requests for the root of the site, since the pattern is anchored at both the start and the end of the URI.


SetEnvIfNoCase

This directive is the same as SetEnvIf above, except that the match is evaluated case-insensitively. This is particularly useful for banning images from our logs, as images created on Windows platforms often have uppercase extensions.

SetEnvIfNoCase Request_URI "\.(gif|jpg|png|css|js|ico|eot)$" dontlog

Note the use of the regex to match any file ending in .gif, .jpg and so on. (The alternation needs to be inside a single group; writing "(gif)|(jpg)|...|(eot)$" would anchor only the final alternative to the end of the string.)
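Before putting an extension pattern like this into httpd.conf, it can be sanity-checked from the shell; grep's case-insensitive extended regexes are a fair stand-in for Apache's matching here (the sample URIs are invented):

```shell
# Run the extension pattern over some sample request URIs;
# only the asset-like URIs should survive the filter
printf '%s\n' /images/Logo.GIF /index.html /site.css \
  | grep -Ei '\.(gif|jpg|png|css|js|ico|eot)$'
# → prints /images/Logo.GIF and /site.css; /index.html is not matched
```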

Example Robot Bans

Just to get you started here are some example robot exclusions:

SetEnvIfNoCase User-Agent  "Slurp/cat"        dontlog
SetEnvIfNoCase User-Agent  "Ask Jeeves/Teoma" dontlog
SetEnvIfNoCase User-Agent  "Googlebot"        dontlog
SetEnvIfNoCase Remote_Host "fastsearch.net$"  dontlog

Other things to ban

It is also useful to ban the Request_URIs of files requested by the Nimda or Code Red worms.

SetEnvIf Request_URI "^/default\.ida"  dontlog
SetEnvIf Request_URI "^/scripts"       dontlog
SetEnvIf Request_URI "^/c/winnt"       dontlog
SetEnvIf Request_URI "^/_mem_bin"      dontlog
SetEnvIf Request_URI "^/_vti_bin"      dontlog
SetEnvIf Request_URI "^/MSADC"         dontlog
SetEnvIf Request_URI "^/msadc"         dontlog
SetEnvIf Request_URI "^/d/winnt"       dontlog
SetEnvIf Request_URI "^/cmd\.exe$"     dontlog
SetEnvIf Request_URI "cmd\.exe"        dontlog
SetEnvIf Request_URI "root\.exe"       dontlog
SetEnvIf Request_URI "default\.ida"    dontlog

Watching the Robots

It is also a good idea to create a logfile that logs requests for the robots.txt file from bots and spiders NOT already filtered using the dontlog variable. This new log can then be examined monthly to add new exclusion entries to your httpd.conf file. To do this create a new environmental variable called robots by specifying a match for the robots.txt file...

SetEnvIf Request_URI "^/robots\.txt$" robots

...next check whether the dontlog environmental variable has already been set (to its default value of 1) and, if so, remove the newly created robots variable, as we don't want to record bots that are already being filtered.

SetEnvIf dontlog 1 !robots

... then create a robots_log file including those hits that set the robots variable..

CustomLog /var/log/httpd/robots_log combined env=robots

Of course, it would also be possible to create an automated process to collect the user agents or host names of those requesting the robots.txt file and add them to our 'dontlog' exclusion list, but this could cause errors (for example, if someone using Internet Explorer looked at your robots.txt) that would exclude legitimate users from your logs. It is also worth being 100% in control of the actual changes you make to the server.
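For the monthly review of robots_log, a short shell pipeline can summarise who has been fetching robots.txt. This sketch assumes the combined log format, in which the user agent is the sixth field when a line is split on double quotes (the hostnames and agents below are invented):

```shell
# Hypothetical robots_log in combined format (hosts and agents invented)
cat > robots_log <<'EOF'
crawler1.example.net - - [01/Aug/2003:01:00:00 +0100] "GET /robots.txt HTTP/1.1" 200 169 "-" "ExampleBot/1.0"
crawler2.example.net - - [02/Aug/2003:02:00:00 +0100] "GET /robots.txt HTTP/1.1" 200 169 "-" "ExampleBot/1.0"
spider.example.org - - [03/Aug/2003:03:00:00 +0100] "GET /robots.txt HTTP/1.1" 200 169 "-" "OtherSpider/2.1"
EOF

# Print each user agent (6th quote-delimited field), count the
# occurrences and list the most frequent first
awk -F'"' '{print $6}' robots_log | sort | uniq -c | sort -rn
```

The agents at the top of the list are the first candidates for new SetEnvIfNoCase exclusions.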

Final Note About Robots

One last thing: if you are running several web server machines, it is worth storing all the exclusions in an external 'robots.conf' file, which can easily be transferred between servers and only requires an Apache Include in the main httpd.conf files.

 Include conf/robots.conf

Log Rotation

Log rotation is necessary to stop logs from growing huge and also to allow us to perform analysis on them. There are two main methods that can be used for log rotation:

  1. Moving the logfile aside and restarting Apache.
  2. Piping the log output to a rotation program.

Either method can be used quite successfully depending on your requirements.

Moving the logfile

The first method is fairly self-explanatory and can easily be scripted (following the example from the Apache manual):

mv access_log access_log.old
mv error_log error_log.old
apachectl graceful
sleep 600
gzip access_log.old error_log.old

This simple method is also used frequently on Linux systems by the logrotate program, which moves the file and then issues a command in its 'postrotate' script to restart Apache. If you plan to use a different log rotation strategy on Linux, it is worth checking that your distribution has not already configured logrotate for you, by looking in /etc/logrotate.d/ (typically a file called apache or httpd) and commenting out any unwanted rotations (I personally think keeping the error_log rotation is OK):

#/var/log/httpd/access_log {
#    missingok
#    postrotate
#        /bin/kill -HUP `cat /var/run/httpd.pid 2>/dev/null` 2> /dev/null || true
#    endscript
#}

/var/log/httpd/error_log {
    missingok
    postrotate
        /bin/kill -HUP `cat /var/run/httpd.pid 2>/dev/null` 2> /dev/null || true
    endscript
}

Using Piped Logs

Simply put, a 'pipe' transfers the output of one command to another command using the '|' symbol between them; commands are read left to right. Apache can use this to send the output of the CustomLog directive to another program rather than to a file. This is done by specifying the pipe symbol, the program and any arguments the program may take, in place of the file name:

CustomLog "|/path/to/program arg1 arg2 etc" combined env=!dontlog
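Outside Apache, the same left-to-right mechanism can be seen with any two shell commands:

```shell
# printf emits three lines; the pipe hands them to wc -l, which counts them
printf 'one\ntwo\nthree\n' | wc -l
```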

To give an example, Apache comes with a rotatelogs program that takes a filename and the length of time in seconds (or a file size) for which to record logs. This creates files like /path/to/logfile.nnnn, where 'nnnn' is the system time at which the logfile was created. Thus we may construct the directive like so:

 CustomLog "|/usr/local/apache/bin/rotatelogs /var/log/access_log 86400" combined env=!dontlog

This tells the server to use the rotatelogs program, with /var/log/access_log as the base location and filename, and to create a new logfile every 24 hours, producing files such as (the suffix being the epoch time of rotation):

/var/log/access_log.1060214400
/var/log/access_log.1060300800
While this solution is quite good, there is a third-party program called cronolog that is, in my opinion, superior.

Using Cronolog

Cronolog is a better solution as it allows date variables to be used in the path and filename, and thus combines the strengths of the examples above. You can just let Apache and cronolog get on with serving files and creating logs without having to restart the server, and you also know the exact location and name of any old log file (system times are more obfuscated).

You can download cronolog from the cronolog website, and once installed it is simple to use: you place a series of format strings in the path to the logfile. These include:

String   Description                            Example
%b       Abbreviated month name                 Jan..Jul..Dec
%d       Day of the month                       01..15..31
%m       Month number                           01..07..12
%W       Week of the year, starting on Monday   00..24..53
%x       Date representation                    20/07/03
%y       Year without the century               00..73..99
%Y       Year with the century                  1989..2009

These strings (and there are more) can then be used as part of an argument to cronolog specifying the location of the log files:

CustomLog "|/usr/local/sbin/cronolog /var/log/httpd/%b-%y/access_log" combined env=!dontlog

This will create files like:

/var/log/httpd/Aug-03/access_log
Other common combinations include stacking the files in a hierarchal order according to date and frequency of log rotation.

Rotation Frequency   Format                         Example
Once a month         /var/log/%Y/%m/access_log      /var/log/2003/03/access_log
Once a week          /var/log/%Y/%W/access_log      /var/log/2003/23/access_log
Once a day           /var/log/%Y/%m/%d/access_log   /var/log/2003/03/02/access_log
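Because cronolog uses the same strftime(3) format strings as the date command, you can preview what any of these templates will expand to today without touching Apache:

```shell
# Preview today's expansion of two cronolog path templates
date "+/var/log/%Y/%m/access_log"
date "+/var/log/httpd/%b-%y/access_log"
```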

I used to prefer the month-year ('%b-%y') option but have recently adopted the format '/var/log/%m-%Y/access_log', as it ties in with one of the log analysis programs discussed below. For the rest of this workshop the path to the logs will use this method, i.e.:

/var/log/httpd/07-2003/access_log
That's about it for cronolog. If you like cronolog and use it, can I suggest buying one of the author's books by way of support; the Apache Pocket Reference is probably the most relevant (and the one I bought for this purpose).

Logfile Analysis

Having put all this effort into collecting logs, we probably want to do something more with them, so in this section we are going to look at three log analysis tools:

  1. Analog
  2. Report Magic
  3. AWStats

Any of these is good, but as I started out using Analog and later added Report Magic and AWStats, I tend to use all three. I will cover each below, but it is worth noting that there are far more comprehensive sources of information about their installation and use than this Virtual Workshop. What I will explain is my 'real-life' usage, which may (or may not) be of some help.


Analog

Analog "is a program to measure the usage on your web server. It tells you which pages are most popular, which countries people are visiting from, which sites they tried to follow broken links from, and all sorts of other useful information". I've used it for at least six years and it is getting better, with version 6 (currently in beta) supporting XHTML and CSS output.

For now you can download analog from here.

Once installed you can configure the output to suit your needs by editing the analog.cfg file (the location of which will differ depending on installation platform). The best way to do this is to use a test logfile and a test output location and continually run the analog program until you are happy with the results. Some of the changes I make to the analog.cfg file include:

# fairly obvious...
HOSTNAME "www.mysite.co.uk"
# as I use the Apache Combined logformat this is required
LOGFORMAT COMBINED

# I'm not interested in referrals within my site (although you might).
REFREPEXCLUDE http://www.mysite.co.uk/* 

# I ensure that Analog will include .shtml and .php files in the analysis
PAGEINCLUDE *.shtml
PAGEINCLUDE *.php

# In addition to the default reports I turn these on

# And I make sure that these reports are turned off

# I also include an external Search Engine Configuration List for Analog
# Maintained by Israel Hanukoglu @ 
# http://www.science.co.il/analog/SearchQuery.txt
CONFIGFILE SearchQuery.txt

Another addition is a list of robots that is maintained by the Wadsack-Allen Digital Group just in case we missed some in the filtering process.

It's that simple, but of course there may be other options you would like to enable or disable, and that is why testing with a sample logfile is a good idea. Once testing is complete you should comment out the two lines referring to LOGFILE and OUTFILE:

# LOGFILE /var/log/httpd/testlog/testlog.log 
# OUTFILE /usr/local/apache/htdocs/logs/testlog/analog.html

As the LOGFILE and OUTFILE locations are going to differ each time, you will pass these values as arguments to the analog program when it is run. Like most command-line programs, analog accepts values set by flags. These flags are generally of the form [-|+][character], toggling a report off or on. For example:

(-|+)D Toggle the Daily Report.

Or in a real world usage...

$ analog +D

...would override any directive in the analog.cfg file. As we should have set up the analog.cfg file to meet our requirements, there is no real reason to use most of these; however, there are three arguments of particular interest:

  1. +O, which specifies the OUTFILE
  2. -G and +g, which ignore the default configuration file and add another one respectively
  3. +C, which passes any configuration command directly on the command line

As you can imagine (and as we will see), this last option can be quite useful. For a full list of command-line options, see the analog man page. But first let's look at a simple example specifying the LOGFILE and OUTFILE:

$ analog /var/log/httpd/07-2003/access_log +O/usr/local/apache/htdocs/logfiles/jul-03/analog.htm

Also note the more 'URL friendly' location (using the month name rather than numbers) of the output path, so that when we view the logs it is at a location like:

http://www.mysite.co.uk/logfiles/jul-03/analog.htm
If we were running several virtual hosts we could also specify alternative locations for the files and use the +C flag to specify alternative HOSTNAME and REFREPEXCLUDE values.

$ analog /var/log/httpd/newsite/07-2003/access_log \
> +O/usr/local/apache/newsite/logfiles/jul-03/analog.htm \
> +C"HOSTNAME www.newsite.co.uk" \
> +C"REFREPEXCLUDE http://www.newsite.co.uk/*"

Of course, you could just have a separate config file for each virtual host and specify it using the -G (do not use the default config file) and +g (add a file to the configuration) flags...

$ analog -G -g"/etc/newsite.cfg"

The final thing worth noting about analog is that there are a number of helper applications that can aid configuration and output; they can be found on the Analog site.

Report Magic

As stated above, Report Magic does no analysis of its own; rather, it takes output from analog and uses it to produce nicer reports. Thus, before we even use Report Magic, we need to make analog produce a computer-readable data file of the analysis rather than its normal HTML output:

$ analog /var/log/httpd/07-2003/access_log \
> +O/usr/local/apache/logfiles/jul-03/rmagic.dat \
> +C"OUTPUT COMPUTER"

This creates the rmagic.dat file and outputs data that looks like this:

x       VE      analog 5.24
x       HN      Nov-2002
x       PS      2002    12      01      00      01
x       FR      2002    11      01      00      08
x       LR      2002    11      30      21      26
x       E7      2002    12      01      00      01
x       SR      2947
x       S7      680
x       PR      2788

Assuming that you have installed Report Magic (available from the Report Magic site), we can once more tweak the configuration, this time using a file called rmagic.ini, which should be created in the same directory as the rmagic.pl file. I tend to use samples/noframes.ini as the basis for my rmagic.ini, with a few changes:

# In and out files for testing 
File_In = /usr/local/apache/logfiles/testlog/rmagic.dat
File_Out = /usr/local/apache/htdocs/logfiles/testlog/rmagic.htm

# webmaster, URL and title details
Webmaster = webmaster@mysite.co.uk
Base_URL = http://www.mysite.co.uk
Title = Report Magic Report for Mysite

As with Analog above, once we have tested the configuration we can override File_In and File_Out with the appropriate monthly values at the command line:

$ rmagic.pl -statistics_File_In=/usr/local/apache/logfiles/07-2003/rmagic.dat \
> -reports_File_Out=/usr/local/apache/logfiles/jul-03/rmagic.htm

This will generate the Report Magic output and voilà, you have the nicer, prettier reports.


AWStats

AWStats is slightly different from Analog in that it creates its own datafiles from the log files. This allows dynamic (as well as static, if you wish) results, and thus also requires frequent updates (usually daily) to add new information from the logs to these datafiles. We will look at this in more detail, but first get AWStats installed and configured.

You can get AWStats from the AWStats site. Once installed, the configuration file lives in the cgi-bin and has the format awstats.host.conf, e.g. awstats.mysite.conf or awstats.newsite.conf. Once more the standard conf file is pretty much OK, except this time we ARE going to keep the reference to the logfile location, but use dynamic date references similar to those we used with cronolog, so that whenever it is run awstats always adds records from the current logfile. The format strings are explained in the sample awstats.model.conf file but are basically of the form string-nn, where nn means 'so many hours ago'. Using a value of zero means the current time and date. So, to specify a location that matches the cronolog output:

LogFile="/var/log/httpd/%MM-0-%YYYY-0/access_log"

We also need to specify the logformat type, which is Apache Combined...

LogFormat=1

...the host name of the web site...

SiteDomain="www.mysite.co.uk"

...and the directory where the datafiles will be created (the path here is just an example)...

DirData="/var/lib/awstats"
...and that's pretty much it. There are of course many more options that you can experiment with, but the defaults are enough to get the job done for now. Next we need to generate the datafiles (where the value given to -config matches the host part of the conf file name):

$ perl awstats.pl -config=mysite -update

We can now generate static output from the command line by specifying not only the config but also the output details:

$ perl awstats.pl -config=mysite -output -staticlinks > /usr/local/apache/htdocs/logfiles/jul-03/awstats.html

You can also generate 'real-time' reports via the browser by calling the awstats.pl script in the cgi-bin, passing name/value pairs to convey the same settings as you would on the command line:

http://www.mysite.co.uk/cgi-bin/awstats.pl?config=mysite
This will display the data for the current month. You can also add parameters to display a specific month, for example July 2003:

http://www.mysite.co.uk/cgi-bin/awstats.pl?config=mysite&month=07&year=2003
Or you can apply filters, such as a specific sub-directory (urlfilter) and a specific output type (e.g. urldetail). For example, this 'unix' one:

http://www.mysite.co.uk/cgi-bin/awstats.pl?config=mysite&output=urldetail&urlfilter=unix
Or alternatively you could construct a web page to pass these values to the script, as in one I recently deployed. The final thing to do is to schedule the update command so that it runs daily and adds new data to the datafiles. This can usually be done using either cron on Unix systems or the Task Scheduler on Windows.
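To give an idea of the scheduling, a crontab along these lines would do the job on a Unix system (the script name monthly_logs.pl and the paths are placeholders for wherever you install things):

```
# Daily at 00:10: fold the latest hits into the AWStats datafiles
10 0 * * * perl /usr/local/apache/cgi-bin/awstats.pl -config=mysite -update
# At 01:00 on the 1st of each month: run the report-generation script
# developed in the next section (name and location are hypothetical)
0 1 1 * * perl /usr/local/bin/monthly_logs.pl
```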

Scripting the Analysis

While the use of cronolog means we don't have to script the log rotation, it is still useful to script the monthly (or more frequent, if you wish) generation of the log analysis reports. While this could easily be done using shell or batch scripts, this workshop will use Perl as it is readily available for all platforms. We will run the script at 1am on the 1st of each month. The script needs to:

  1. Work out the dates to use in paths etc (for the previous month)
  2. Generate AWStats logdata for the period between the last update and the end of the month.
  3. Generate the Analog files
  4. Generate the Report Magic files
  5. Optionally generate static AWStats files (or the URL for dynamic monthly stats)
  6. Compress original logfile

You could also optionally create an index page for the month (using very simple templating) and add the month to an overall logs index.

  1. Create this months index page
  2. Add this month to the logs index.

As this is not a Perl tutorial I will not explain the code; rather, I'll just annotate the source. The following is for a Linux installation but can easily be altered for use on Windows etc. Note also that it uses the module Date::Format to get the date strings we want.

use Date::Format;
# Sets the variables used in the script
$logpath = "/var/log/httpd";                       # path to your server log folder
$webpath = "/usr/local/apache/htdocs/logfiles";    # path to the output directory
$analog = "/usr/bin/analog";                       # location of analog 
$rmagic = "/usr/local/bin/rmagic-2.15/rmagic.pl";  # location of rmagic
$logfile = "access_log";                           # name of logfile
$awstats = "/usr/local/apache/cgi-bin/awstats.pl"; # location of awstats
$awstatsCFG = "mysite";                            # awstats config domain
# Sorts out all the date stuff using time minus one day
$year = time2str("%Y", (time-86400));
$nummonth = time2str("%m", (time-86400));
$month = time2str("%b", (time-86400));
# adds logs from last month not added to awstats datafiles

$CmdLine = "$awstats -config\=$awstatsCFG -update -logfile=$logpath/$nummonth-$year/$logfile";
system ("$CmdLine");
# creates analog html page
$CmdLine = "$analog $logpath/$nummonth-$year/$logfile +O$webpath/$month-$year/analog.htm";
system ("$CmdLine");
# produces report magic dat file

$CmdLine = "$analog $logpath/$nummonth-$year/$logfile +O$webpath/$month-$year/rmagic.dat +C\"OUTPUT COMPUTER\"";
system ("$CmdLine");

# Runs rmagic

$CmdLine = "$rmagic -statistics_File_In=$webpath/$month-$year/rmagic.dat -reports_File_Out=$webpath/$month-$year/rmagic.htm";
system ("$CmdLine");

# creates URL for dynamic display of this months AWStats

$awstatsURL = "/cgi-bin/awstats.pl?month=$nummonth&year=$year&config=$awstatsCFG";

# compresses log file to save disk space then deletes it

$CmdLine = "tar -zcf $logpath/$nummonth-$year/access_log.tar.gz $logpath/$nummonth-$year/access_log;";
$CmdLine .= "rm -fr $logpath/$nummonth-$year/access_log";
system ("$CmdLine");
# write default page for the monthly directory
# using a rudimentary template file containing
# 'LOGSTUFF' as an HTML comment at the point at
# which to include the variable HTML
open (HTML, ">$webpath/$month-$year/default.htm");
open (TEMPLATE, "$webpath/template.txt");
while ($line = <TEMPLATE>) {
    if ($line =~ /LOGSTUFF/i) {
        print HTML "<h1><a href=\"\" class=\"hlinks\">Logs For $month-$year</a></h1>\n";
        print HTML "<ul>\n<li><a href=\"analog.htm\">Analog Analysis</a></li>\n";
        print HTML "<li><a href=\"rmagic.htm\">Report Magic Analysis</a></li>\n";
        print HTML "<li><a href=\"$awstatsURL\">AWStats Analysis</a></li>\n</ul>\n";
    } else {
        print HTML "$line";
    }
}
close (TEMPLATE);
close (HTML);
# append this month to the logfile index page,
# again using the LOGSTUFF comment in an
# existing HTML file
$CmdLine = "mv -f $webpath/default.htm $webpath/default.htm.old";
system ("$CmdLine");
open (HTML, ">$webpath/default.htm");
open (OLDHTML, "$webpath/default.htm.old");
while ($line = <OLDHTML>) {
    if ($line =~ /LOGSTUFF/i) {
        print HTML "<li><a href=\"$month-$year/\">$month-$year</a>&nbsp;&nbsp;\n";
        print HTML "<a href=\"$month-$year/analog.htm\">Analog Analysis</a>&nbsp;&nbsp;|&nbsp;&nbsp;\n";
        print HTML "<a href=\"$month-$year/rmagic.htm\">Report Magic Analysis</a>&nbsp;&nbsp;|&nbsp;&nbsp;\n";
        print HTML "<a href=\"$awstatsURL\">AWStats Analysis</a></li>\n";
        print HTML "$line"; # print the comment back out for use next month
    } else {
        print HTML "$line";
    }
}
close (OLDHTML);
close (HTML);

As you can see, this script is fairly straightforward: it just uses a series of system() calls to execute the same command-line instructions as above, but with date variables included. As with most of the Virtual Workshops, the code itself is very simple, as I hope a novice could understand it easily. If anyone feels the script can be improved (and it can!) then feel free to leave improvements in the comments below for more advanced users. I will not be making any changes to the above though (unless I've made a complete hash of it, of course ;-).

Also note: the 'template' files (template.txt and the existing default.htm) each contain an HTML comment to act as a placeholder:

<!-- LOGSTUFF --> 

So with all this done, you will end up with something a bit like this example.


I was also going to include a section on Peep, a fun application that plays sounds when there is log activity, but I think this workshop is long enough, so I'll call a halt here. Hopefully this has been of use (and if you've read this far, maybe it has). Now that I have purged myself of my obsession with logs, I'm not sure what else to do in this 'unix/server' section. Suggestions below...
