mgzta - collecting the data for evaluation

Requirements for the data to be collected

Before the requests to the web server and the effects of mod_gzip can be evaluated, the corresponding data of course have to be collected first.

For this purpose the usual Apache access_log information does not suffice, regardless of whether it is in Common or in Combined format.

For its analysis, mgzta expects a log file containing the following fields:

Apache parameter | Content | Supplied by
%r | 1st line of the HTTP request ("method URL protocol version") | Apache logging handler
%{Content-type}o | MIME Content-Type of the generated result document | HTTP result header
%>s | HTTP status code for serving this request | Apache logging handler
%B | effective size in bytes of the result document content as served (after any compression) | Apache logging handler
%{mod_gzip_result}n | mod_gzip status code | mod_gzip handler
%{mod_gzip_input_size}n | size in bytes of the input before compression (or 0 if no compression was done) | mod_gzip handler
%{mod_gzip_output_size}n | size in bytes of the output after compression (or 0 if no compression was done) | mod_gzip handler
%{User-agent}i | name of the HTTP UserAgent in use (browser) | HTTP request header

Fields of the created log file
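
To illustrate how the two mod_gzip size fields relate to each other, the following Python sketch computes the bytes saved and the saving in percent for a single log record. The helper name compression_savings is hypothetical and not part of mgzta; the sketch relies only on the convention from the table above that both sizes are 0 when no compression was done.

```python
from typing import Tuple


def compression_savings(input_size: int, output_size: int) -> Tuple[int, float]:
    """Return (bytes saved, saving in percent) for one log record.

    Hypothetical helper, not part of mgzta. Both sizes are 0 when
    mod_gzip did not compress the response; that counts as no saving.
    """
    if input_size <= 0:
        return 0, 0.0
    saved = input_size - output_size
    return saved, 100.0 * saved / input_size


if __name__ == "__main__":
    print(compression_savings(10000, 2500))  # a compressed response
    print(compression_savings(0, 0))         # a response sent uncompressed
```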

Unfortunately, two of these fields can be manipulated by user actions:

Strictly speaking, some distinct special characters are not legal within a URL; but in reality this doesn't prevent anyone from sending these characters to a server within an HTTP request, where they finally end up in the Apache log.

The same applies to the UserAgent string, which can be manipulated in some browsers or by filter programs (WebWasher etc.). Thus I'm afraid there is no absolutely secure method to split such a log line into its fields (to achieve that, one would have to make Apache URL-encode all characters within a URL, which then of course would no longer be the original URL of the request ...).

I tried to make the best of this situation by using the field separator '#', which normally isn't used in UserAgent strings and, because it separates the URL from link targets within a document, should never be part of any normal URL. In case of problems this separator can easily be modified inside the source code ...
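
As an illustration of how such a line could be split defensively, here is a minimal Python sketch. It is not mgzta's own parser; the names LINE_RE and parse_line are hypothetical. It assumes that the six inner fields never contain '#' and uses the numeric ones as anchors, so that a stray '#' in the request line or in the UserAgent string is absorbed by the first or the last field. This remains a heuristic: a deliberately crafted request can still fool it.

```python
import re

# Assumption: only the request (first) and UserAgent (last) fields may
# contain '#'. The request is matched non-greedily, so the first position
# at which the inner fields fit wins; the UserAgent takes the remainder.
LINE_RE = re.compile(
    r"^(?P<request>.*?)"
    r"#(?P<content_type>[^#]*)"
    r"#(?P<status>\d{3})"
    r"#(?P<bytes_sent>\d+)"
    r"#(?P<gzip_result>[^#]*)"
    r"#(?P<gzip_in>\d+|-)"      # '-' if the module set no value
    r"#(?P<gzip_out>\d+|-)"
    r"#(?P<user_agent>.*)$"
)


def parse_line(line):
    """Return a dict of the eight fields, or None if the line does not fit."""
    match = LINE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = "GET /doc#top HTTP/1.0#text/html#200#1234#OK#5000#1234#Mozilla/4.0 (#1)"
    print(parse_line(sample))
```

Lines that do not match return None, so they can be counted and reported instead of silently distorting the statistics.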

Excluding specific requests from calculation

Not everything delivered by the web server via HTTP really leaves your machine over the network line. In particular, there may be processes on this very machine which communicate via HTTP as well and should be excluded explicitly from the traffic analysis. These processes may well not support HTTP Content-Encoding.

There are different methods to detect such requests, for example by their User-Agent string or by their remote address.

If one is able to detect these requests, then the Apache module mod_setenvif allows setting environment variables depending on the contents of the HTTP headers of an incoming request. The definition of the log file may in turn contain a restriction to log only those entries that depend upon the existence (or some specific value) of an environment variable. By doing so one can in particular hide all requests that are not to be evaluated by mgzta.

Another type of request which might be excluded from the mgzta analysis is one where a specific application already creates compressed data. For these data we don't have any information about their original size before compression; but it might well be unfair to treat all these data as being 'uncompressed' in our statistics. On the other hand, if we exclude them from the evaluation, then they are missing when calculating the overall traffic ... in the end it must be left to the program user what to do with this kind of request. Therefore these requests should be collected inside the log file (and it might not be too easy to prevent that ...); mgzta itself offers a flag for switching between ignoring and counting these requests.
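
The effect of such a switch can be sketched in Python as follows. This is not mgzta's own code: Record, total_traffic and the parameter count_precompressed are hypothetical names, and the marker value in PRECOMPRESSED_RESULT is an assumption about how your mod_gzip version reports a response that already carried a Content-Encoding; check your own log for the actual value.

```python
from typing import Iterable, NamedTuple, Tuple


class Record(NamedTuple):
    bytes_sent: int   # %B
    gzip_result: str  # %{mod_gzip_result}n
    gzip_in: int      # %{mod_gzip_input_size}n
    gzip_out: int     # %{mod_gzip_output_size}n


# Assumption: status value for responses that were already compressed.
PRECOMPRESSED_RESULT = "DECLINED:HAS_CE"


def total_traffic(records: Iterable[Record],
                  count_precompressed: bool = True) -> Tuple[int, int]:
    """Return (bytes actually sent, bytes without mod_gzip compression)."""
    sent = 0
    uncompressed = 0
    for rec in records:
        if rec.gzip_result == PRECOMPRESSED_RESULT and not count_precompressed:
            continue  # ignore pre-compressed responses entirely
        sent += rec.bytes_sent
        # Without a known input size the response counts as uncompressed.
        uncompressed += rec.gzip_in if rec.gzip_in > 0 else rec.bytes_sent
    return sent, uncompressed


if __name__ == "__main__":
    log = [
        Record(2500, "OK", 10000, 2500),
        Record(800, PRECOMPRESSED_RESULT, 0, 0),
        Record(300, "SOME_OTHER_RESULT", 0, 0),  # illustrative value only
    ]
    print(total_traffic(log, count_precompressed=True))
    print(total_traffic(log, count_precompressed=False))
```

Counting the pre-compressed responses keeps the overall traffic complete but understates the saving; ignoring them does the opposite, which is why the choice is left to the user.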

Apache configuration directives for the mgzta log file

Altogether we need the following Apache configuration directives:

  <IfModule mod_gzip.c>
# =============================================================================
# detect internal requests (examples - adapt to your needs)
  SetEnvIf  User-Agent  my_own_agent    ignore-this-request
  SetEnvIf  Remote_Addr 123.123.123.123 ignore-this-request
# (to exclude these from statistical evaluation)
# =============================================================================
# log format for the mod_gzip traffic analysis program (mgzta)
# (some fields contain whitespace, others even '|' characters!)
  LogFormat "%r#%{Content-type}o#%>s#%B#%{mod_gzip_result}n#%{mod_gzip_input_size}n#%{mod_gzip_output_size}n#%{User-agent}i" mgzta
# %r                       = request             (1st line: method, URL, protocol)
# %{Content-type}o         = MIME content-type   ('o' = result [output] headers)
# %>s                      = HTTP status code    ('>' = final status, after internal redirects)
# %B   (not %b!)           = result bytes        (if no data: '0' instead of '-')
# %{mod_gzip_result}n      = mod_gzip status     ('n' = variable set by module)
# %{mod_gzip_input_size}n  = mod_gzip input      ('n' = variable set by module)
# %{mod_gzip_output_size}n = mod_gzip output     ('n' = variable set by module)
# %{User-agent}i           = browser name        ('i' = request [input] headers)
  CustomLog logs/mgzta_log mgzta env=!ignore-this-request
# (only log if not to be ignored)
# =============================================================================
  </IfModule>