Improving Cacheability

At the moment the majority of my dynamic pages are not cacheable. Time to take a look at this.

Cacheable Pages

At present most (all?) static pages are cacheable. CSS and JS files are cacheable, as are images and other binary resources I have added the MIME/extension for.

In the nginx configuration for this site all .css, .jpg, .jpeg, and .png are passed to lighttpd on my home server as the backend.

Also, all files in /img/, /js/, and /css/ are passed directly to lighttpd on my home server.

Everything in /domains/ (currently non-existent) is configured in lighttpd to be non-cacheable.

In lighttpd I have configured .gif, .jpg, .jpeg, .png, .ico, .css, .js, .svg, .woff, .ttf, .otf, and .eot to be cacheable for 1 month.

All extensions listed have Pragma=Private and Cache-Control=private. Some extensions also have Access-Control-Allow-Origin=* and X-Content-Type-Options=nosniff.

Some extensions (images) have ETag disabled so that last modified time and URI are the only things that determine if a file has changed. It isn't that important to serve the latest version of an image file (e.g. if it has since been optimised) if the client already has a copy.
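
A quick way to confirm what lighttpd is actually sending for a static file is to inspect the response headers; the hostname and image path here are just illustrative placeholders:

# Fetch only the headers for a static file and pick out the caching-related ones.
curl -sI https://web.johncook.uk/img/example.png | grep -iE 'cache-control|pragma|access-control-allow-origin|x-content-type-options|etag|last-modified'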

ESI Includes

Another thing that is cacheable is ESI Includes. For example, /inc/esi/site_name is cached for 2 hours for clients (max-age) and 1 day for caching servers (s-maxage).

That means even if my backend goes down varnish will still know that the name of this site is WatfordJC UK. Technically only caching servers with ESI support should be fetching these URIs anyway, so max-age is currently irrelevant.
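
Checking the headers on one of those fragments is straightforward; something like the following should show both lifetimes (hostname assumed):

# Cache-Control for an ESI fragment; expect something along the lines of: max-age=7200, s-maxage=86400
curl -sI https://web.johncook.uk/inc/esi/site_name | grep -i cache-control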

While ESI Includes are a topic I need to revisit soon, they aren't the focus of this article.

PHP Pages

All content pages on this site have a similar list of included files. For the most part these included files rarely change.

The problem with giving a page an overall last modified time is that it requires a filemtime() lookup of every included file plus the requested file, returning the most recent of those timestamps.

In the case of this page, for example, that would be a lookup of the modification time of 7 included files plus the requested file. For the home page, that also means a lookup of the modification time of all the included posts.

I could generate a list of included files with timestamps, or iterate through the filelist looking for the latest modification time, but that sounds like a lot of disk accesses for every (uncached) page load.
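
As a rough sketch, that brute-force approach would mean something like this on every uncached page load (the file list here is illustrative):

# Stat every included file plus the requested file and keep the newest mtime.
LATEST=0
for FILE in inc/header.php inc/breadcrumbs.php inc/footer.php index.php; do
	MODIFIED=`stat -c %Y "$FILE"`
	[ "$MODIFIED" -gt "$LATEST" ] && LATEST=$MODIFIED
done
echo $LATEST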

I am not yet sure what the most elegant solution to this problem is, but I am thinking redis on my home server might be something worth looking at.

redis

Redis is an in-memory key-value store that is supposedly very fast.

This is what I'm currently thinking:

  • Global includes: path+filename -> modified time
  • Posts: path+filename -> modified time
  • Year: year -> modified time

I need to work out whether Redis and some bash scripting can do what I want.

Since find can recursively look for any files modified since a particular time, it should be possible to do what I want.

The question, though, is what is the best way to organise the keys and values?

Creating a Redis List

What I think I need is a hash per file for the modified time (keyed by filename) and a sorted set of filenames so I can quickly sort them by modification time.

FILE="inc/header.php"; MODIFIED=`stat -c %Y $FILE`; redis-cli hset file:$FILE modified $MODIFIED; redis-cli zadd includes $MODIFIED $FILE
FILE="inc/footer.php"; MODIFIED=`stat -c %Y $FILE`; redis-cli hset file:$FILE modified $MODIFIED; redis-cli zadd includes $MODIFIED $FILE
FILE="inc/breadcrumbs.php"; MODIFIED=`stat -c %Y $FILE`; redis-cli hset file:$FILE modified $MODIFIED; redis-cli zadd includes $MODIFIED $FILE
redis-cli hget file:`redis-cli zrevrange includes 0 0 | sed 's/$//'` modified | sed 's/$//'

What this does is create a Redis hash for each file with a modified value equal to its unix timestamp, and add the filename to a Redis sorted set with a score equal to that same timestamp.

ZREVRANGE returns the filename of the most recently modified file; I use sed to reduce that to just the filename, then ask Redis for the modification time stored against it. Running the result through sed again leaves me with a pure number equal to the unix timestamp of the most recently modified of all files in the 'includes' set.
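
For reference, the contents of the sorted set (and the score stored for a single file) can be checked like this:

# Every filename in the 'includes' sorted set with its score (unix timestamp).
redis-cli zrange includes 0 -1 WITHSCORES
# Score (modification timestamp) of one file.
redis-cli zscore includes inc/header.php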

Sortable Timestamp

Unix timestamps are just too difficult to translate to different formats. RFC 2822 timestamps, although close to RFC 2616 timestamps, aren't sortable. RFC 3339 timestamps are sortable and easily translatable to other formats.

RFC 3339 timestamps are also easy to use with find; for example, find inc/ -type f -newermt "1970-01-01" gives a list of files modified since the Unix epoch (useful for an initial file scan).

I think the best thing to do is to use the unix timestamp for the sorted set, and the RFC 3339 timestamp for the modified value in the hash.
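
Converting between the two representations is a one-liner with date (the script below uses the same approach to produce both formats for each file):

# Unix timestamp to RFC 3339 with nanosecond precision...
date -u --rfc-3339=ns -d @0
# ...and an RFC 3339 date back to a unix timestamp.
date -u -d "1970-01-01 00:00:00+00:00" +%s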

Bash Script

Bringing everything together I have come up with the following script.

nano ~/Scripts/johncook_co_uk-redis-includes.sh
#!/bin/sh
cd /home/www/var/www/johncook_co_uk

LATEST=`redis-cli zrevrange johncook.uk:includes 0 0`
LATESTMODTIME=`redis-cli hget johncook.uk:file:"$LATEST" modified`

#LATESTMODTIME="1970-01-01"

find inc/ -type f -newermt "$LATESTMODTIME" | while read -r line; do
	FILE="$line"
	MODTIME=`stat -c %y "$line"`
	MODTIMERFC=`date -u --rfc-3339=ns -d "$MODTIME"`
	MODTIMEUNIX=`date -u -d "$MODTIME" +%s`
	cat <<EOF | redis-cli
MULTI
hset johncook.uk:file:"$FILE" modified "$MODTIMERFC"
zadd johncook.uk:includes "$MODTIMEUNIX" "$FILE"
EXEC
EOF
done

I am using MULTI and EXEC so that there are no race conditions during updates. It isn't essential for this use case, but it is wise to make these updates immediately after one another.

The commented out variable setting is in case I need to go through all the files instead of just those modified recently (e.g. to repopulate the database).
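
Before moving on to the PHP side, it is worth checking that the data actually made it into Redis:

# Most recently modified include and its unix timestamp score.
redis-cli zrevrange johncook.uk:includes 0 0 WITHSCORES
# RFC 3339 modification time stored in the hash for one file.
redis-cli hget johncook.uk:file:inc/header.php modified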

Having got the data into the database, the next step is to use PHP to create Last-Modified headers.

PHP Code

I am going to be using the php5-redis module, thus:

sudo apt-get update
sudo apt-get dist-upgrade
sudo apt-get install php5-redis
sudo ~/Scripts/sync-webroot-after-updates.sh

I already have some caching code scattered around, so the easiest way to add Redis lookups will be to adjust what I have already coded.

First, I need to create a function to do Redis lookups (like what I'm doing at the start of the bash script). I also want to ensure that I am reusing as much code as possible.

function lookup_lastmodified($category) {
	$redis = new Redis();
	
	$redis->pconnect('127.0.0.1');
	$latest_file = $redis->zRevRange("johncook.uk:".$category, 0, 0)[0];
	$latest_modtime = $redis->hGet('johncook.uk:file:'.$latest_file, 'modified');
	$latest_modtime_unix = DateTime::createFromFormat("Y-m-d H:i:s.u???T",$latest_modtime)->format('U');
	return $latest_modtime_unix;
}

I also need to modify my cache_headers function so that it uses the latest date out of the includes and the requested page.

function cache_headers($cache_public,$client_cache_for,$client_revalidate,$proxy_cache_for,$proxy_revalidate) {
	$if_modified = isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) ? $_SERVER['HTTP_IF_MODIFIED_SINCE'] : "nothing";
	$last_modified = gmdate("D, d M Y H:i:s",max(lookup_lastmodified("includes"),filemtime($_SERVER['DOCUMENT_ROOT'].$_SERVER['SCRIPT_NAME'])))." GMT";
	header("Last-Modified: ".$last_modified);
	$caching = $cache_public ? "Public" : "Private";
	$c_cache_secs = $client_cache_for !== false ? ", max-age=".$client_cache_for : "";
	$c_revalidate = $client_revalidate ? ", must-revalidate" : "";
	$p_cache_secs = $proxy_cache_for !== false ? ", s-maxage=".$proxy_cache_for : "";
	$p_revalidate = $proxy_revalidate ? ", proxy-revalidate" : "";
	header("Cache-Control: ".$caching.$c_cache_secs.$c_revalidate.$p_cache_secs.$p_revalidate);
	if ($if_modified == $last_modified) {
		header("HTTP/1.1 304 Not Modified");
		header("Status: 304 Not Modified");
		exit();
	}
}

Finally, I need to call cache_headers before any non-header content is sent to the browser. I have a check in htmlheader.php so that NSFW content returns a 404 on Web.JohnCook.UK and exits, and I want that to happen before there is any chance of sending a 304. Therefore, underneath my include_it() function I have added a switch test.

switch ($script_name) {
	// Test if page is one that can be cached based on __FILE__
	case "/gallery":
	case "/gallery/3d-gardening-photos":
	case "/music":
	case "/links":
	case "/about":
	case "/status":
		cache_headers(true,3600,true,3600,true);
		break;
}

There is still some work to do. I need to modify things so that cache_headers can check multiple categories (e.g. on the home page I need to check includes, articles, and blogs), but for the moment I have improved the cacheability of some pages.

I have also commented out the previous code in /robots.php and replaced it with an include of htmlheader.php and a cache_headers() command.
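
A simple way to test the result from the command line is a pair of requests, the second one conditional (hostname assumed, with the If-Modified-Since value being whatever Last-Modified the first request returned):

# First request: note the Last-Modified and Cache-Control headers.
curl -sI https://web.johncook.uk/about
# Second request: replaying the Last-Modified value should result in a 304 Not Modified.
curl -sI -H "If-Modified-Since: Sun, 14 Jun 2015 00:00:00 GMT" https://web.johncook.uk/about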

Future Ideas

At some point I plan on adding more functionality to the bash scripts (includes/articles/blogs) so that I can add cache invalidation.

For example, when something in /inc/ changes I could invalidate the cache of all the top level pages (/, /articles, /blogs, /status, etc.).

When / changes, I could invalidate the home page in my two varnish instances and in Cloudflare.
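
I have not written any of that yet, but the commands involved would be along these lines (the ban expression is only a sketch, and the Cloudflare zone ID, e-mail, and API key are placeholders that would need checking against the v4 API documentation):

# Invalidate the home page in a local varnish instance.
varnishadm "ban req.http.host == web.johncook.uk && req.url ~ ^/$"
# Purge the home page from Cloudflare's cache.
curl -X DELETE "https://api.cloudflare.com/client/v4/zones/ZONE_ID/purge_cache" \
	-H "X-Auth-Email: user@example.com" -H "X-Auth-Key: API_KEY" \
	-H "Content-Type: application/json" \
	--data '{"files":["https://web.johncook.uk/"]}'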

Thinking about it, there is no need to compare multiple lists looking for the latest date. I can just add the /inc pages to johncook.uk:articles as well.

Having added my bash scripts for /inc, /articles, and /blogs to my crontab, I can probably do away with using filemtime() at some point. Not yet though, because to do that I will need to make sure every PHP page on my site is listed in the database.
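
The crontab entries themselves are nothing special; one line per script, with the path and the five minute interval here being assumptions:

# crontab -e
*/5 * * * * $HOME/Scripts/johncook_co_uk-redis-includes.sh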

Further Modifications

I developed the above further by bringing all the file scanning into a single bash script, and added a full file scan of the site so that every file's modification time is in the Redis database.

With every file in the site's folder structure having an entry in the database, I could remove the filemtime() check of the current file.

With a further optimisation I brought the total number of lookups per page to two—one for the requested page, and one for the most relevant group of files.

New Bash Script

nano ~/Scripts/johncook.uk-redis.sh
#!/bin/sh
cd /home/www/var/www/johncook_co_uk

#---

LATEST=`redis-cli zrevrange johncook.uk:includes 0 0`
LATESTMODTIME=`redis-cli hget johncook.uk:file:"$LATEST" modified`

#LATESTMODTIME="1970-01-01"

find inc/ -type f -newermt "$LATESTMODTIME" | while read -r line; do
	FILE="$line"
	MODTIME=`stat -c %y "$line"`
	MODTIMERFC=`date -u --rfc-3339=ns -d "$MODTIME"`
	MODTIMEUNIX=`date -u -d "$MODTIME" +%s`
	cat <<EOF | redis-cli
MULTI
hset johncook.uk:file:"$FILE" modified "$MODTIMERFC"
zadd johncook.uk:includes "$MODTIMEUNIX" "$FILE"
zadd johncook.uk:articles "$MODTIMEUNIX" "$FILE"
zadd johncook.uk:blogs "$MODTIMEUNIX" "$FILE"
zadd johncook.uk:home "$MODTIMEUNIX" "$FILE"
EXEC
EOF
done

#---

LATEST=`redis-cli zrevrange johncook.uk:articles 0 0`
LATESTMODTIME=`redis-cli hget johncook.uk:file:"$LATEST" modified`

#LATESTMODTIME="1970-01-01"

find articles/ -type f -newermt "$LATESTMODTIME" | while read -r line; do
	FILE="$line"
	MODTIME=`stat -c %y "$line"`
	MODTIMERFC=`date -u --rfc-3339=ns -d "$MODTIME"`
	MODTIMEUNIX=`date -u -d "$MODTIME" +%s`
	cat <<EOF | redis-cli
MULTI
hset johncook.uk:file:"$FILE" modified "$MODTIMERFC"
zadd johncook.uk:articles "$MODTIMEUNIX" "$FILE"
zadd johncook.uk:home "$MODTIMEUNIX" "$FILE"
EXEC
EOF
done

#---

LATEST=`redis-cli zrevrange johncook.uk:blogs 0 0`
LATESTMODTIME=`redis-cli hget johncook.uk:file:"$LATEST" modified`

#LATESTMODTIME="1970-01-01"

find blogs/ -type f -newermt "$LATESTMODTIME" | while read -r line; do
	FILE="$line"
	MODTIME=`stat -c %y "$line"`
	MODTIMERFC=`date -u --rfc-3339=ns -d "$MODTIME"`
	MODTIMEUNIX=`date -u -d "$MODTIME" +%s`
	cat <<EOF | redis-cli
MULTI
hset johncook.uk:file:"$FILE" modified "$MODTIMERFC"
zadd johncook.uk:blogs "$MODTIMEUNIX" "$FILE"
zadd johncook.uk:home "$MODTIMEUNIX" "$FILE"
EXEC
EOF
done

#---

LATEST=`redis-cli zrevrange johncook.uk:files 0 0`
LATESTMODTIME=`redis-cli hget johncook.uk:file:"$LATEST" modified`

#LATESTMODTIME="1970-01-01"

find -type f -newermt "$LATESTMODTIME" | while read -r line; do
	FILE="$line"
	MODTIME=`stat -c %y "$line"`
	MODTIMERFC=`date -u --rfc-3339=ns -d "$MODTIME"`
	MODTIMEUNIX=`date -u -d "$MODTIME" +%s`
	cat <<EOF | redis-cli
MULTI
hset johncook.uk:file:"$FILE" modified "$MODTIMERFC"
zadd johncook.uk:files "$MODTIMEUNIX" "$FILE"
EXEC
EOF
done

New PHP Functions

By modifying my cache_headers() PHP function, adding a new parameter for the category to look up (includes, articles, blogs, or home in the above bash script), I can look up the latest modification time of the relevant category directly without hard-coding it in the cache_headers() function.

For the latest modification time of a particular file, I just use zScore to get the score ($MODTIMEUNIX in the bash script) of that file.

For the latest modification time of a category, I use zRevRange with scores to get the file in that category with the highest score (latest modification time), which returns an array of filename to score. I then use array_values on the resultant array and return the zeroth item, the score.
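
The redis-cli equivalents of those two lookups, for comparison (the filename is illustrative):

# zRevRange with scores: most recently modified file in the 'home' category, plus its score.
redis-cli zrevrange johncook.uk:home 0 0 WITHSCORES
# zScore: score (unix timestamp) of the requested page; note the leading "./" added by the full-site find.
redis-cli zscore johncook.uk:files ./about.php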

nano inc/htmlheader.php
function cache_headers($cache_public,$client_cache_for,$client_revalidate,$proxy_cache_for,$proxy_revalidate,$category) {
	$if_modified = isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) ? $_SERVER['HTTP_IF_MODIFIED_SINCE'] : "nothing";
	$last_modified = gmdate("D, d M Y H:i:s",max(lookup_lastmodified($category),lookup_file_lastmodified($_SERVER['SCRIPT_NAME'])))." GMT";
	header("Last-Modified: ".$last_modified);
	$caching = $cache_public ? "Public" : "Private";
	$c_cache_secs = $client_cache_for !== false ? ", max-age=".$client_cache_for : "";
	$c_revalidate = $client_revalidate ? ", must-revalidate" : "";
	$p_cache_secs = $proxy_cache_for !== false ? ", s-maxage=".$proxy_cache_for : "";
	$p_revalidate = $proxy_revalidate ? ", proxy-revalidate" : "";
	header("Cache-Control: ".$caching.$c_cache_secs.$c_revalidate.$p_cache_secs.$p_revalidate);
	if ($if_modified == $last_modified) {
		header("HTTP/1.1 304 Not Modified");
		header("Status: 304 Not Modified");
		exit();
	}
}

function lookup_lastmodified($category) {
	$redis = new Redis();
	$redis->pconnect('127.0.0.1');
	$latest_modtime_unix = array_values($redis->zRevRange("johncook.uk:".$category, 0, 0, true))[0];
	return $latest_modtime_unix;
}

function lookup_file_lastmodified($file) {
	$redis = new Redis();
	$redis->pconnect('127.0.0.1');
	$latest_modtime_unix = $redis->zScore("johncook.uk:files",".".$file);
	return $latest_modtime_unix;
}

Vary by User-Agent

At present pages on WatfordJC.UK & JohnCook.UK are dynamic PHP pages. Whether or not they go through Cloudflare, they do go through nginx, two varnish caches, and lighttpd.

Neither lighttpd nor nginx is configured for caching. The varnish caches located on my home server and vps3 combine the pages with any ESI includes.

At the moment there are several different versions of each page that can be cached: a copy for the latest browsers and bots, a copy for each outdated browser (e.g. Chrome < 43), and a duplicate of each of those depending on whether or not the browser has the fonts cookie (bots are treated as having the fonts cookie).
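
The duplication is easy to see by requesting the same URL with different User-Agent strings and looking at the Vary header (hostname assumed):

# With Vary: User-Agent in play, each of these gets its own cached copy.
curl -sI -A "Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0" https://web.johncook.uk/ | grep -i vary
curl -sI -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://web.johncook.uk/ | grep -i vary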

An Internet Explorer specific header can be moved to a <meta> tag. Although doing so will add a few bytes to every page loaded with a browser other than IE (which would previously have received the header anyway), removing the vary by User-Agent would save far more bytes between my nginx and any intermediate caching proxy (e.g. Cloudflare).

Everything else that is browser-specific I should be able to move out of the page itself and into something loaded by the page, such as JSON.