DWiki's caching system

DWiki has optional caching in order to speed up generating results repeatedly. DWiki uses a disk-based cache for this (although the interface is abstracted and alternate forms of caching may be introduced someday). There are three caches, which can be enabled separately: the renderer cache, a brute force page cache, and an in-memory brute force page cache that is only used if DWiki is running as a preforking SCGI server.

DWiki never removes the files of out of date cache entries from the disk cache; instead, it stops considering out of date ones to be valid. Cleaning out the detritus is left for an external process. ChrisSiebenmann considers this safer; giving a program an automated unlink() makes him nervous.

See ConfigurationFile for the options controlling the behavior of the caches.

In theory DWiki's caching is optional. In practice a decent sized DWiki is simply too slow without caching for some of the more expensive operations and caching becomes more or less a necessity. ChrisSiebenmann now believes that you should configure all levels of caching in basically any DWiki unless you have some unusual need and are sure.

The brute force cache

The brute force page cache is about as simple as you can get: it caches complete requests for a configured time (called a time-to-live, or TTL). That's it. The BFC is intended as a load-shedding measure when DWiki is under significant load, so it only acts under certain circumstances:

only on GET or HEAD requests.
only on requests without a Cookie: header.
requests only get put into the cache if the system seems loaded.

(For speed, when something is valid in the cache DWiki just serves it without checking the system load.)

A good BFC TTL is on the order of 30 seconds to three minutes or so; long enough to shed significant load if you are getting a lot of hits to a few pages and short enough that dynamic pages won't become too outdated. (And that waiting to see a comment show up or whatever is not too annoying.)

Because Atom syndication requests are among the most expensive pages to compute, the BFC can be set to give them a longer TTL than usual. There is a second TTL that can be set for Atom requests that aren't using conditional GET; the idea is that if requesters cannot be bothered to be polite, we can't be bothered to serve fresh content. Setting this option always caches the results of such requests, even if the load is low, which means that even people doing proper conditional GET requests will use the cached results for as long as their (lower) TTL says to.

It's actually faster to serve static pages from the static page server code than from the BFC, so the BFC doesn't try to cache static pages.

The two sides of the BFC

It's important to understand that the BFC does not check load when it is checking to see if something is in its cache. This means there are two stages to processing a request: deciding what TTL to use for cache checks, and deciding whether to cache something that was not current in the cache.

The TTL used is:

bfc-atom-nocond-ttl if this is an unconditional request for an Atom view, if set.
bfc-atom-ttl for Atom view requests in general, if set.
bfc-cache-ttl otherwise.

Pages enter the BFC cache either because the system seems to be loaded or because bfc-atom-nocond-ttl was set and they were an unconditional request for an Atom view.

Once something is in the cache, it will be served from the cache if it is not older than the check TTL. Different requests can use different check TTLs for the same cached page; for example, conditional GETs versus other requests for Atom views.

The in-memory cache

The in-memory cache is essentially a version of the brute force cache that holds pages in memory instead of on disk. It's only effective in environments where DWiki serves multiple requests from the same process; currently it's only used if DWiki is running as a preforking SCGI server. Because it holds pages in memory as page response objects, the in-memory cache is about the fastest way that DWiki can serve requests. In particular it's faster to serve static pages from the IMC than from disk, so unlike the BFC the IMC does cache static pages.

Because IMC entries disappear automatically and are essentially free to create, the IMC caches pages unconditionally when active (unlike the BFC). This means that it should normally have a relatively low TTL, often lower than the BFC's TTL. Note that because the IMC is before the BFC, it can load its cache from BFC cache hits.

For obvious reasons, it's pointless to set the IMC cache size to be larger than the number of requests a preforked SCGI process will serve before exiting.

To keep IMC memory usage under control, the IMC has a settable maximum page size that it will cache. Tune this as appropriate for your environment.

The IMC can be deliberately forced on with imc-force-on, in case you're running DWiki in some other preforking environment (for example as a WSGI application under a preforking WSGI server such as uWSGI).

Considerations for the IMC TTL

Under some setups, DWiki will only be running as a (preforking) SCGI server when it's under heavy load; in others DWiki is running this way all of the time, even when the load is light. Because the IMC unconditionally caches pages the latter situation can be annoying; it means that someone who, say, writes and posts a new comment may not see that comment until the IMC TTL expires. DWiki makes some attempt to bypass the IMC (and the BFC) in the common case of someone leaving a comment. However this is not perfect (in part because it requires the web browser to accept cookies from DWiki).

(This also applies if you're running DWiki in some other preforking environment and have forced the IMC on.)

If you're running DWiki full time in an IMC-on environment, you likely want to set a quite low IMC TTL, such as 15 to 30 seconds. If you're running DWiki with the IMC on only under heavy load you can set a higher IMC TTL, such as two minutes (120 seconds).

The renderer cache

The renderer cache is actually two caches. The renderer cache proper caches the output of various renderers (cf TemplateSyntax). The output is cached with a validator and the cached results are fully validated before they get used; this means that renderer cache entries do not normally use a TTL and in theory could be valid for years.

The (heuristic) generator cache caches the output of some expensive precursor generator routines. These cache entries only have heuristic validators, where DWiki can be fooled if people try hard enough. Generator cache entries do have a TTL, so that if the heuristic is fooled DWiki will pick up the new result sooner or later. Some cache entries can also explicitly invalidated by DWiki in a pretty reliable process; by default, these have a much longer TTL than plain heuristic cache entries. These are called 'flagged' (heuristic) generator entries and various ConfigurationFile settings controlling how they behave are render-heuristic-flagged-....

(Trivia: the 'flagged' name is because such entries are invalidated using a flag file, or more accurately a flag cache entry.)

Currently the main renderer cache caches the output of various wikitext to HTML rendering routines while the generator cache caches the results of various filesystem 'find all descendents' walks that are used to build lists of comments (for Atom comments feeds and some wikitext macros; this uses explicit invalidation) and lists of pages (for Atom feeds and various blog renderers such as blog::prevnext).

Unfortunately, a DWiki page that has comment or access restrictions must be cached separately for each DWiki user that views it. Under some situations this can result in a number of identical copies being cached under different names. If you want to avoid this, DWiki lets you turn off renderer caching for non-anonymous users.

Force-invalidating list of pages caches

The general validator for blog::prevnext cache entries is the modification time for all of the directories involved that had files in them at the time (the latter condition is for technical reasons). The heuristical validator checks that some of the file timestamps are still the same, but it can't check all of them and still be a useful cache.

So the easy way to invalidate this is to change the modification time of a directory involved, for example with touch.

The 'list of pages' cache is similarly invalidated by changing a directory modification time. Unlike the blog::prevnext case, the directory times are the only thing that this cache checks. This is a bit of a pity but the performance improvements from caching this information are very visible.

Disk space usage and directories

Much like comments, each page that has something cached for it becomes a subdirectory, with the various cached things in files. The different sorts of caches use different top-level directories under the cachedir, so you have paths like cachedir/bfc, cachedir/renderers, and cachedir/generators.

Because some results include absolute URLs that mention the current hostname, DWiki must maintain separate caches for each Host: header it sees in the BFC and the general renderers cache. These are handled as subdirectories in each cache directory, so cachedir/bfc/localhost/... and so on. Entries in the generator cache don't depend on the current Host: header, so there is only one (sub)cache for all requests, cachedir/generators/all/....

Generally the general renderers cache uses the largest amount of disk space, followed by the BFC, and the generator cache is the smallest.

If you're using caching (and as mentioned, you probably want to), you'll want to periodically trim the caches. ChrisSiebenmann just does this by hand every so often by removing the cache directories entirely; DWiki will then rebuild them as necessary.