= DWiki's caching system

DWiki has optional caching in order to speed up generating results
repeatedly. DWiki uses a disk-based cache for this (although the
interface is abstracted and alternate forms of caching may be introduced
someday). There are three caches, which can be enabled separately: the
renderer cache, a brute force page cache, and an in-memory brute force
page cache that is only used if DWiki is running as a preforking SCGI
server.

DWiki never removes the files of out of date cache entries from the
disk cache; instead, it stops considering out of date ones to be
valid. Cleaning out the detritus is left for an external process.
ChrisSiebenmann considers this safer; giving a program an automated
_unlink()_ makes him nervous.

See ConfigurationFile for the options controlling the behavior of
the caches.

In theory DWiki's caching is optional. In practice a decent sized DWiki
is simply too slow without caching for some of the more expensive
operations and caching becomes more or less a necessity. ChrisSiebenmann
now believes that you should configure all levels of caching in
basically any DWiki unless you have some unusual need and are sure.

== The brute force cache

The brute force page cache is about as simple as you can get: it caches
complete requests for a configured time (called a time-to-live, or TTL).
That's it. The BFC is intended as a load-shedding measure when DWiki is
under significant load, so it only acts under certain circumstances:

* only on _GET_ or _HEAD_ requests.
* only on requests without a _Cookie:_ header.
* requests only get put into the cache if the system seems loaded.

(For speed, when something is valid in the cache DWiki just serves
it without checking the system load.)

A good BFC TTL is on the order of 30 seconds to three minutes or so;
long enough to shed significant load if you are getting a lot of hits
to a few pages and short enough that dynamic pages won't become too
outdated. (And that waiting to see a comment show up or whatever is
not *too* annoying.)

Because Atom syndication requests are among the most expensive pages to
compute, the BFC can be set to give them a longer TTL than usual. There
is a second TTL that can be set for Atom requests that aren't using
conditional _GET_; the idea is that if requesters cannot be bothered to
be polite, we can't be bothered to serve fresh content. Setting this
option always caches the results of such requests, even if the load is
low, which means that even people doing proper conditional GET requests
will use the cached results for as long as their (lower) TTL says to.

It's actually faster to serve static pages from the static page
server code than from the BFC, so the BFC doesn't try to cache
static pages.

=== The two sides of the BFC

It's important to understand that the BFC does not check load when it
is checking to see if something is in its cache. This means there are
two stages to processing a request: deciding what TTL to use for cache
checks, and deciding whether to cache something that was not current in
the cache.

The TTL used is:
# _bfc-atom-nocond-ttl_ if this is an unconditional request for an Atom
  view, if set.
# _bfc-atom-ttl_ for Atom view requests in general, if set.
# _bfc-cache-ttl_ otherwise.

Pages enter the BFC cache either because the system seems to be
loaded or because _bfc-atom-nocond-ttl_ was set and they were an
unconditional request for an Atom view.

Once something is in the cache, it will be served from the cache if it
is not older than the check TTL. Different requests can use different
check TTLs for the same cached page; for example, conditional GETs
versus other requests for Atom views.

== The in-memory cache

The in-memory cache is essentially a version of the brute force cache
that holds pages in memory instead of on disk. It's only effective in
environments where DWiki serves multiple requests from the same process;
currently it's only used if DWiki is running as a preforking SCGI
server. Because it holds pages in memory as page response objects, the
in-memory cache is about the fastest way that DWiki can serve requests.
In particular it's faster to serve static pages from the IMC than from
disk, so unlike the BFC the IMC does cache static pages.

Because IMC entries disappear automatically and are essentially free to
create, the IMC caches pages unconditionally when active (unlike the
BFC). This means that it should normally have a relatively low TTL,
often lower than the BFC's TTL. Note that because the IMC is before the
BFC, it can load its cache from BFC cache hits.

For obvious reasons, it's pointless to set the IMC cache size to be
larger than the number of requests a preforked SCGI process will serve
before exiting.

To keep IMC memory usage under control, the IMC has a settable maximum
page size that it will cache. Tune this as appropriate for your
environment.

The IMC can be deliberately forced on with _imc-force-on_, in case
you're running DWiki in some other preforking environment (for example
as a WSGI application under a preforking WSGI server such as uWSGI).

=== Considerations for the IMC TTL

Under some setups, DWiki will only be running as a (preforking) SCGI
server when it's under heavy load; in others DWiki is running this
way all of the time, even when the load is light. Because the IMC
unconditionally caches pages the latter situation can be annoying; it
means that someone who, say, writes and posts a new comment may not see
that comment until the IMC TTL expires. DWiki makes some attempt to
bypass the IMC (and the BFC) in the common case of someone leaving a
comment. However this is not perfect (in part because it requires the
web browser to accept cookies from DWiki).

(This also applies if you're running DWiki in some other preforking
environment and have forced the IMC on.)

If you're running DWiki full time in an IMC-on environment, you likely
want to set a quite low IMC TTL, such as 15 to 30 seconds. If you're
running DWiki with the IMC on only under heavy load you can set a higher
IMC TTL, such as two minutes (120 seconds).

== The renderer cache

The renderer cache is actually two caches. The renderer cache proper
caches the output of various renderers (cf TemplateSyntax). The output
is cached with a validator and the cached results are fully validated
before they get used; this means that renderer cache entries do not
normally use a TTL and in theory could be valid for years.

The (heuristic) generator cache caches the output of some expensive
precursor generator routines. These cache entries only have heuristic
validators, where DWiki can be fooled if people try hard enough.
Generator cache entries do have a TTL, so that if the heuristic is
fooled DWiki will pick up the new result sooner or later. Some cache
entries can also explicitly invalidated by DWiki in a pretty reliable
process; by default, these have a much longer TTL than plain heuristic
cache entries. These are called 'flagged' (heuristic) generator entries
and various ConfigurationFile settings controlling how they behave are
_render-heuristic-flagged-..._.

(Trivia: the 'flagged' name is because such entries are invalidated
using a flag file, or more accurately a flag cache entry.)

Currently the main renderer cache caches the output of various wikitext
to HTML rendering routines while the generator cache caches the results
of various filesystem 'find all descendents' walks that are used to
build lists of comments (for Atom comments feeds and some wikitext
macros; this uses explicit invalidation) and lists of pages (for Atom
feeds and various blog renderers such as _blog::prevnext_).

Unfortunately, a DWiki page that has comment or access restrictions
must be cached separately for each DWiki user that views it. Under some
situations this can result in a number of identical copies being cached
under different names. If you want to avoid this, DWiki lets you turn
off renderer caching for non-anonymous users.

=== Force-invalidating list of pages caches

The general validator for _blog::prevnext_ cache entries is the
modification time for all of the directories involved that had files in
them at the time (the latter condition is for technical reasons). The
heuristical validator checks that some of the file timestamps are still
the same, but it can't check all of them and still be a useful cache.

So the easy way to invalidate this is to change the modification
time of a directory involved, for example with _touch_.

The 'list of pages' cache is similarly invalidated by changing a
directory modification time. Unlike the _blog::prevnext_ case, the
directory times are the only thing that this cache checks. This is a bit
of a pity but the performance improvements from caching this information
are very visible.

== Disk space usage and directories

Much like comments, each page that has something cached for it
becomes a subdirectory, with the various cached things in files. The
different sorts of caches use different top-level directories under the
_cachedir_, so you have paths like _~~cachedir~~/bfc_,
_~~cachedir~~/renderers_, and _~~cachedir~~/generators_.

Because some results include absolute URLs that mention the
current hostname, DWiki must maintain separate caches for each
_Host:_ header it sees in the BFC and the general renderers cache.
These are handled as subdirectories in each cache directory, so
_~~cachedir~~/bfc/localhost/..._ and so on. Entries in the generator
cache don't depend on the current _Host:_ header, so there is only one
(sub)cache for all requests, _~~cachedir~~/generators/all/..._.

Generally the general renderers cache uses the largest amount of disk
space, followed by the BFC, and the generator cache is the smallest.

If you're using caching (and as mentioned, you probably want to), you'll
want to periodically trim the caches. ChrisSiebenmann just does this by
hand every so often by removing the cache directories entirely; DWiki
will then rebuild them as necessary.