Security warning: Please don't use any version older than 1.11!

The cache model of gzip_cnc

Objective

Usually the pages of a web space are read much more often than their content changes - at least as long as they are static pages.

On the other hand, the content of dynamic pages (such as the output of a search engine) can change with each new page request.

Compressed variants of static documents

In general you can't tell from a URL whether the corresponding content is static or dynamic - this depends solely on the web server configuration.

But the provider of a web space knows exactly which of his own pages are static and which are dynamic. He can therefore offer his static pages in compressed form as well - and if he has to support old browsers that cannot handle compressed content, he can provide both forms of each page and let the web server decide (based on the HTTP headers sent by the browser) which form to serve.
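The decision described above can be sketched in a few lines: serve the pre-compressed file only if the client's Accept-Encoding header announces gzip support and such a file actually exists. (This is an illustrative sketch with hypothetical names, not gzip_cnc's actual code.)

```python
import os

def choose_variant(path, accept_encoding):
    """Return (file to serve, Content-Encoding value or None).
    The compressed variant is assumed to live next to the original
    under the same name plus a ".gz" suffix."""
    gz_path = path + ".gz"
    if "gzip" in accept_encoding and os.path.exists(gz_path):
        return gz_path, "gzip"
    return path, None
```

A real handler would additionally set the Content-Type of the original file (not of the gzip wrapper) and a Vary: Accept-Encoding header so that caches keep both variants apart.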

Maintaining the compressed variants

Of course this means that each time the administrator changes the content of one of his pages, he has to remember to update its compressed form as well.

If he forgets, the web server will serve the current variant of the page to some users and an outdated variant to others, depending on their browser version and configuration - the web server cannot know that both files ought to contain semantically identical content.

Automation

Wouldn't it be more convenient if the web server (which has to decide which form to serve anyway) took over these two duties as well: creating the compressed variants and keeping them up to date?

The page provider should have to care about nothing but his original files.

Exactly this is what gzip_cnc provides: it extends the Apache server with a handler that can create, maintain and serve a compressed variant for each request it has been declared responsible for.

And this creation is performed as rarely as possible (to save CPU time) but as often as necessary (to avoid serving an outdated variant of the content).
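"As rarely as possible, as often as necessary" boils down to a file-timestamp comparison: regenerate the cached variant only when the original is newer than the cache entry. A minimal sketch (file names and layout are hypothetical, not gzip_cnc's actual ones):

```python
import gzip
import os

def serve_compressed(original, cache_file):
    """Create or refresh the gzip cache entry only when the original
    file has been modified since the entry was written."""
    stale = (not os.path.exists(cache_file)
             or os.path.getmtime(cache_file) < os.path.getmtime(original))
    if stale:
        os.makedirs(os.path.dirname(cache_file), exist_ok=True)
        with open(original, "rb") as src, gzip.open(cache_file, "wb") as dst:
            dst.write(src.read())
    return cache_file
```

On an unchanged original the function is a pure mtime comparison and touches no file content, which is what keeps the CPU cost per request low.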

Repository for the compressed document variants

To make use of the conditional serving of page contents that the Apache web server itself provides (Content Negotiation), gzip_cnc would have to store the compressed form within the directory where the original file resides.
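For comparison, Apache's native negotiation would expect something like the following configuration, with page.html and page.html.gz sitting side by side in the document directory (a hedged sketch of the standard mod_negotiation/mod_mime approach, not part of gzip_cnc):

```apache
# Let Apache pick among variants of the same resource
Options +MultiViews
# Tell it that a ".gz" suffix denotes a gzip-encoded variant
AddEncoding x-gzip .gz
```

It is precisely this requirement - compressed files living inside the document tree - that gzip_cnc avoids, as described below.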

But proceeding this way would have two disadvantages.

Therefore gzip_cnc instead completely rebuilds the document tree it is responsible for inside a separate cache directory tree and stores the compressed variants of the file contents only within this cache, hidden from the web server.

Directories and URLs

A web space may consist of a number of separate directory trees that are mapped into the common URL space based upon configuration directives of the web server.

If gzip_cnc oriented itself by the path name of the original file while maintaining its compressed variants, it would potentially have to mirror the complete directory tree of the operating system - within this directory tree itself! Even assuming that only a small part of the directory tree is actually visible to the web space's visitors (within the URL space), this would lead to unnecessarily long directory paths (with the additional danger of conflicting with a limit of the operating system in question).

Therefore, within its cache, gzip_cnc references not path names but the URLs of the original files, and mirrors the URL tree of the web space (that is, the complete tree from the root of the domain).
So if one file is accessible under several URLs, gzip_cnc creates a corresponding cache file for each of these URLs - unless the administrator, who himself created this situation by complex use of configuration directives, adds appropriate links within the cache tree at the points where two URLs should have identical meaning (on UNIX, for example, by using symbolic links).
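Mirroring the URL tree rather than the file-system tree can be sketched as a simple mapping from a request URL to a path below the cache root (names here are illustrative, not gzip_cnc's actual ones):

```python
import os
from urllib.parse import urlparse

def cache_path_for(url, cache_root):
    """Map a request URL to its cache entry: the URL path, rooted at
    the domain, becomes a relative path below the cache directory."""
    path = urlparse(url).path            # e.g. "/docs/index.html"
    relative = path.lstrip("/")
    return os.path.join(cache_root, relative + ".gz")
```

Two URLs pointing at the same physical file yield two distinct cache paths here, which is exactly the duplication described above.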

Potential improvements

One could imagine shortening the cache directory names even further by using a suitable hash function - the Apache module mod_proxy does this. If using the original URLs as path names within the cache tree should ever prove to be a problem, a corresponding modification could be made at this point without difficulty.
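Such a hash-based layout might look like the following: the URL is hashed, and the first characters of the digest select nested subdirectories so that no single directory accumulates too many files. (An illustrative sketch in the spirit of mod_proxy's cache layout, not its actual algorithm.)

```python
import hashlib

def hashed_cache_key(url, depth=2):
    """Derive a short, fixed-length cache path from a URL.
    The first `depth` hex digits fan the entries out over
    nested subdirectories."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    parts = [digest[i] for i in range(depth)] + [digest]
    return "/".join(parts)
```

The fan-out also addresses the directory-size concern raised below: with two levels of hex digits, entries spread over 256 subdirectories.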

Apart from that, it cannot generally be predicted how well the file system (which may have to hold a multitude of compressed files within the cache tree) copes with a large number of files in the same directory - a sequential search for the requested file might increase the CPU load on the server and thus slow down delivery.

(Michael Schröpl, 2002-08-04)