Creating Sitemaps - Part 2:
RSS Sitemap

With XML sitemap generation now in place, creating an RSS feed of new pages should be fairly simple.

Recap and Plan

In part one of this series, Creating Sitemaps, I took my HTML sitemap and scripted a way to turn it into an XML sitemap.

Looking at Bing Webmaster Tools, Bing doesn't like the XML sitemap for some reason.

Also, the XML sitemap is modified whenever a page (or an include) is modified. When I'm working on the backend, that can be quite often.

An RSS sitemap feed could be used for just new pages. When a new page is created, there is a new item in the RSS feed.

This also won't be that difficult once I Google the correct syntax for several commands.

At the moment I delete the temporary parsed lynx dump file after it has been used. If I instead move it, I can compare it with the dump from the next run: any lines that appear only in the new dump are new pages.
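The comparison can be tried in isolation. The following is a minimal sketch, not the script's own code: the script further down uses sort with uniq -u, which also reports lines that exist only in the old dump, whereas comm -13 reports only lines unique to the new one. The file names old.txt and new.txt and the paths in them are made up for illustration.

```shell
#!/bin/sh
# Sketch: list lines that appear only in the newer of two link dumps.
# old.txt / new.txt are hypothetical stand-ins for the previous and
# current parsed lynx dumps.
LC_ALL=C; export LC_ALL          # same collation for sort and comm

workdir=$(mktemp -d)
printf '%s\n' /about /articles /blogs > "$workdir/old.txt"
printf '%s\n' /about /articles /articles/new-page /blogs > "$workdir/new.txt"

# comm needs sorted input. -1 hides lines unique to the old file and
# -3 hides lines common to both, leaving lines unique to the new file.
sort "$workdir/old.txt" > "$workdir/old.sorted"
sort "$workdir/new.txt" > "$workdir/new.sorted"
new_pages=$(comm -13 "$workdir/old.sorted" "$workdir/new.sorted")

echo "$new_pages"
rm -r "$workdir"
```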

Useful Information

Date and Time for pubDate

RSS feeds use RFC 822 datetime strings. The date command has the -R argument which, according to the man page, outputs RFC 2822 formatted datetime strings.

At a glance, the difference between RFC 822 and RFC 2822 in terms of formatting the date and time is that RFC 822 specifies 2-digit years, whereas RFC 2822 specifies 4-digit years (and accepts 2-digit years only as obsolete syntax).

This actually looks like an error in the RSS 2.0 specification. The specification itself says RFC 822 with the exception that four digit years are allowed (and preferred). Surely specifying RFC 2822 would have been easier?

— John Cook (@WatfordJC) October 29, 2015
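To see what date -R produces, here is a small sketch that assumes GNU date (the -d and -R options are not available in every date implementation). The timestamp is an arbitrary example value, not one taken from the site.

```shell
#!/bin/sh
# Sketch, assuming GNU date: format a fixed UTC timestamp the way the
# script formats pubDate values.
pubdate=$(date -u -R -d "2015-10-29 12:00:00 UTC")
echo "$pubdate"
```

The result has a four-digit year and a numeric +0000 zone, which is the form the RSS 2.0 specification prefers.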

Test if New RSS File is Needed

$RSSUPDATE is set to 0 near the top of the script. If there is a new link (that gets successfully fetched by wget) $RSSUPDATE is set to 1.

If wget fails, that link in the sitemap is skipped over. This will make the RSS sitemap less prone to errors than the XML sitemap, which yesterday had a link that didn't exist because the HTML sitemap had a typo.

A new RSS file is not created if $RSSUPDATE is 0, and it therefore doesn't receive an updated modified time (making HTTP 304 Not Modified status codes more likely).
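The flag logic can be sketched on its own. In this sketch fetch_page is a hypothetical stand-in for the wget call, and the link paths are invented; the only point is how the flag and the skip interact.

```shell
#!/bin/sh
# fetch_page is a hypothetical stand-in for wget: it fails for one path.
fetch_page() {
	[ "$1" != "/typo-page" ]
}

links=$(mktemp)
printf '%s\n' /articles/one /typo-page /articles/two > "$links"

RSSUPDATE=0
ADDED=""
# Redirecting the file into the loop (rather than piping into it) keeps
# the loop in the current shell in dash and bash, so RSSUPDATE is still
# set after "done".
while read -r line; do
	fetch_page "$line" || continue	# skip links that cannot be fetched
	RSSUPDATE=1
	ADDED="$ADDED $line"
done < "$links"
rm "$links"

if [ "$RSSUPDATE" -eq 1 ]; then
	echo "rebuild feed for:$ADDED"
else
	echo "nothing new; leave the feed file untouched"
fi
```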

Newer Items at Top

Some RSS clients do not sort by date/time; they just display items in the order they appear in the RSS file.

By piping the code for the new item to cat, it is possible to prepend new text to a file. This does, however, require writing the result to a temporary file and then renaming the temporary file to the original filename.
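A minimal sketch of the prepend step, using a throwaway file created with mktemp and placeholder one-line items rather than the real .rss.part file:

```shell
#!/bin/sh
# Sketch: prepend a new line to an existing file via a temporary file.
part=$(mktemp)
printf '%s\n' '<item>old</item>' > "$part"

# "-" makes cat read standard input first, then the existing file.
# The output cannot go straight to "$part", because the redirection
# would truncate it before cat reads it.
printf '%s\n' '<item>new</item>' | cat - "$part" > "$part.tmp" && mv "$part.tmp" "$part"

first_line=$(head -n 1 "$part")
line_count=$(wc -l < "$part")
echo "$first_line"
rm "$part"
```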

Trimming Old Items

Each item takes up 7 lines in the .rss.part file (file only contains items) so it is simple to get the 30 most recent items when creating the RSS XML file.

By using head -n210 I am able to pull the latest 30 items.
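The arithmetic is 30 items × 7 lines = 210 lines. The sketch below checks that on generated data; the item contents and the example.invalid URLs are placeholders, and it only holds as long as every item really is exactly 7 lines.

```shell
#!/bin/sh
# Sketch: keep the newest 30 items from a file of 7-line items.
ITEM_LINES=7
MAX_ITEMS=30

part=$(mktemp)
i=1
while [ "$i" -le 35 ]; do
	cat << EOF >> "$part"
 <item>
  <title>Page $i</title>
  <description>Description $i</description>
  <link>https://example.invalid/page-$i</link>
  <pubDate>Thu, 29 Oct 2015 12:00:00 +0000</pubDate>
  <guid isPermaLink="true">https://example.invalid/page-$i</guid>
 </item>
EOF
	i=$((i + 1))
done

# 7 * 30 = 210 lines, which ends exactly on a closing </item> line.
limit=$((ITEM_LINES * MAX_ITEMS))
kept=$(head -n "$limit" "$part" | grep -c '<item>')
closed=$(head -n "$limit" "$part" | grep -c '</item>')
echo "kept $kept items, $closed closed"
rm "$part"
```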

Valid XML for Empty RSS Feeds

There are a lot of possible protocol/hostname combinations for accessing my sites, and there is the possibility of someone reaching the feed from a hostname that is neither web.johncook.uk nor web.watfordjc.uk.

An XML document must have a root node.

To ensure that both the XML and RSS sitemaps are valid XML in most cases I have done the following:

  • If the script crashes (errors) while running, the new file doesn't replace the current file, because the mv command is the last step.
  • The <rss> and <urlset> elements are outside of the PHP if/else if statements.

Invalid XML is still possible. During testing I accidentally forgot to rename a closing tag and ended up with an opening tag and no matching closing tag. That shouldn't happen now, however.
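The "build elsewhere, move last" idea can be sketched as follows. The file names, the channel content, and the use of set -e are assumptions for illustration; the real script does not use set -e.

```shell
#!/bin/sh
# Sketch: build the new file under a temporary name and only replace the
# live file as the very last step.
set -e	# assumption: stop at the first failing command

outdir=$(mktemp -d)
tmpfile="$outdir/feed.xml.tmp"
final="$outdir/feed.xml"

# The previous version stays in place while the new one is being built.
printf '%s\n' '<rss version="2.0"></rss>' > "$final"

{
	echo '<?xml version="1.0" encoding="utf-8"?>'
	echo '<rss version="2.0">'	# root element is always written
	echo '<channel><title>Example</title><link>https://example.invalid/</link><description>Example</description></channel>'
	echo '</rss>'
} > "$tmpfile"

# Only reached if everything above succeeded.
mv -f "$tmpfile" "$final"
```

Because the root element is written unconditionally, a request that matches neither hostname would still receive a document with a root node.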

Character Entities

After the code was in use for a few days Google Webmaster Tools started reporting an error with my RSS feed.

The problem? I use &ldquo; and &rdquo; in place of " in the meta description and og:description of a page because ASCII double quotes would require either escaping them or using HTML entities to avoid a problem in my code.

Had I used UTF-8 for the content of the HTML (as it is declared) rather than HTML entities I wouldn't have had this issue.

The fact is I have a British English keyboard layout, and typing &copy; is easier than Googling for, and copying/pasting, © whenever I want to use it.

After a bit of Googling I found out about the recode program. If text containing &copy; is piped to recode html..utf-8, the HTML character entities in it are converted to their UTF-8 equivalents.

While good as a first step, recode also converts entities such as &amp; and &lt; back into characters that are significant in XML markup.

The only characters that need converting to XML entities are & (because it indicates the start of entity markup with a semi-colon indicating the end) and < (because it indicates the start of element markup which ends with a >).

The XML character entities are &amp; (&), &lt; (<), &gt; (>), &quot; ("), and &apos; (').

As I will never be using &apos; (apostrophe/single quote) in the meta tags because I use double quotes to quote attributes, I have decided to pipe the output of recode to sed and convert the other four characters that have XML character entities to character entities.

After doing that, my code is now valid again, although I will have to wait to see if there are any issues with my new code.
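The sed half of that pipeline can be tested without recode installed. This is a sketch of the escaping step only; xml_escape is a helper name invented here, and the input string is an arbitrary example.

```shell
#!/bin/sh
# Sketch: escape the four characters the script converts to XML entities.
# & must be handled first, otherwise the & in &lt; etc. would be escaped
# a second time. In a sed replacement a bare & means "the matched text",
# hence the backslash.
xml_escape() {
	sed 's,&,\&amp;,g;s,<,\&lt;,g;s,>,\&gt;,g;s,",\&quot;,g'
}

escaped=$(printf '%s\n' 'Tom & "Jerry" <3 > 2' | xml_escape)
echo "$escaped"
```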

A New Workflow

It isn't really that much different than how I was doing things before I added an XML sitemap.

The difference, however, is that with an XML (and now RSS) sitemap being generated whenever a new page is added to the site, I need to make sure pages I'm still working on aren't added before they are ready.

The first step in ensuring that was by only adding pages to the XML sitemap that have been added to the HTML sitemap.

Not only does that mean that I have to update the HTML sitemap file before the XML sitemap is updated, but it also means I'm more likely to keep the HTML sitemap up to date as Googlebot now uses the XML and RSS sitemaps based upon it.

The second step was modifying my script that updates Redis with the latest file modification times. By excluding directories I am now able to work on draft pages on the live site without them being recorded in Redis.
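The Redis update script is not shown here, so the following is only a guess at how directories could be excluded, using find with -prune. The directory name drafts and the file names are hypothetical.

```shell
#!/bin/sh
# Sketch: list .php files while skipping a drafts directory entirely.
# "drafts" is a hypothetical name; the real excluded directories are
# not given in this article.
LC_ALL=C; export LC_ALL

site=$(mktemp -d)
mkdir -p "$site/articles" "$site/drafts"
: > "$site/index.php"
: > "$site/articles/published.php"
: > "$site/drafts/unfinished.php"

# -prune stops find from descending into ./drafts; every other regular
# file ending in .php is printed with a leading "./".
tracked=$(cd "$site" && find . -path ./drafts -prune -o -type f -name '*.php' -print | sort)

echo "$tracked"
rm -r "$site"
```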

The Updated Sitemap XML/RSS Generation Script

#!/bin/sh

LATEST=`redis-cli zrevrange johncook.uk:files 0 0`
LATESTMODTIME=`redis-cli hget johncook.uk:file:"$LATEST" modified`
LATESTMODTIMEUNIX=`date -u -d "$LATESTMODTIME" +%s`
SITEMAPLASTMOD=`redis-cli hget johncook.uk:file:./links/sitemap.xml.php modified`
RSSPUBDATE=`date -u -R -d "$LATESTMODTIME"`
RSSUPDATE=0

if [ `date -u -d "$LATESTMODTIME" +%s` -eq `date -u -d "$SITEMAPLASTMOD" +%s` ]; then
	exit 0
elif [ `date -u -d "$LATESTMODTIME" +%s` -lt `date -u -d "$SITEMAPLASTMOD" +%s` ]; then
	exit 1
fi

INCLUDESLASTMODFILE=`redis-cli zrevrange johncook.uk:includes 0 0`
INCLUDESLASTMOD=`redis-cli hget johncook.uk:file:"$INCLUDESLASTMODFILE" modified`
HOMELASTMODFILE=`redis-cli zrevrange johncook.uk:home 0 0`
HOMELASTMOD=`redis-cli hget johncook.uk:file:"$HOMELASTMODFILE" modified`
ARTICLESLASTMODFILE=`redis-cli zrevrange johncook.uk:articles 0 0`
ARTICLESLASTMOD=`redis-cli hget johncook.uk:file:"$ARTICLESLASTMODFILE" modified`
BLOGSLASTMODFILE=`redis-cli zrevrange johncook.uk:blogs 0 0`
BLOGSLASTMOD=`redis-cli hget johncook.uk:file:"$BLOGSLASTMODFILE" modified`

cd /home/thejc/Scripts/


if [ ! -w /home/thejc/Scripts/sitemap_links_web_johncook_uk.rss.part ]; then
	touch /home/thejc/Scripts/sitemap_links_web_johncook_uk.rss.part
fi

if [ ! -w /home/thejc/Scripts/sitemap_links_web_watfordjc_uk.rss.part ]; then
	touch /home/thejc/Scripts/sitemap_links_web_watfordjc_uk.rss.part
fi

cat << EOF > sitemap.xml.php
<?php
\$if_modified = isset(\$_SERVER['HTTP_IF_MODIFIED_SINCE']) ? \$_SERVER['HTTP_IF_MODIFIED_SINCE'] : "nothing";
\$last_modified = gmdate("D, d M Y H:i:s","$LATESTMODTIMEUNIX")." GMT";
header("Cache-Control: Public, max-age=900, must-revalidate, s-maxage=600, proxy-revalidate");
header("X-Robots-Tag: noindex, nosnippet, noarchive, noodp");
if (\$if_modified == \$last_modified) {
	header("HTTP/1.1 304 Not Modified");
	header("Status: 304 Not Modified");
	exit();
}
header("Content-Type: application/xml;charset=UTF-8");
header("Last-Modified: ".\$last_modified);
?>
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<?php if (\$_SERVER['HTTP_HOST'] == "web.johncook.uk") { ?>
EOF

tr -d "\n\r" < /home/www/var/www/johncook_co_uk/links/complete-sitemap.php | perl -pe 's/<\?php if \(\$site_name != "John Cook UK"\) { \?\>[\s\S]*?} \?>//g' | lynx -stdin -dump -listonly | grep file | sed 's,.*file://,,' > sitemap_links_web_johncook_uk.tmp

# 's,^/$,/index,; s,$,.php,; s,^,.,'

while read line; do
	FILE="$line"
	FILELOCAL=`echo "$line" | sed 's,^/$,/index,; s,$,.php,; s,^,.,' -`
	FILEMOD=`redis-cli hget johncook.uk:file:"$FILELOCAL" modified`
	PRIORITY="0.0"

	case "$FILE" in
/)
	if [ `date -u -d "$HOMELASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$HOMELASTMOD"
	fi
	PRIORITY="1.0"
;;
/archives/johncook-[0-9][0-9][0-9][0-9])
	if [ `date -u -d "$HOMELASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$HOMELASTMOD"
	fi
	PRIORITY="0.9"
;;
/gallery|/music|/links|/about|/status|/downloads)
	if [ `date -u -d "$INCLUDESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$INCLUDESLASTMOD"
	fi
	PRIORITY="0.8"
;;
/articles/?*/?*|/blogs/?*/?*|/gallery/?*|/archives/?*|/links/?*|/?*/?*/?*)
	if [ `date -u -d "$INCLUDESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$INCLUDESLASTMOD"
	fi
	PRIORITY="0.5"
;;
/articles/?*)
	if [ `date -u -d "$ARTICLESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$ARTICLESLASTMOD"
	fi
	PRIORITY="0.7"
;;
/articles)
	if [ `date -u -d "$ARTICLESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$ARTICLESLASTMOD"
	fi
	PRIORITY="0.8"
;;
/blogs/?*)
	if [ `date -u -d "$BLOGSLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$BLOGSLASTMOD"
	fi
	PRIORITY="0.7"
;;
/blogs)
	if [ `date -u -d "$BLOGSLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$BLOGSLASTMOD"
	fi
	PRIORITY="0.8"
;;
/music/?*)
	if [ `date -u -d "$INCLUDESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$INCLUDESLASTMOD"
	fi
	PRIORITY="0.7"
;;
*)
	if [ `date -u -d "$INCLUDESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$INCLUDESLASTMOD"
	fi
	PRIORITY="0.1"
;;
	esac

	FILEMODW3C=`date -u -d "$FILEMOD" +%FT%T`"+00:00"
	cat << EOF >> sitemap.xml.php
	<url>
		<loc>https://web.johncook.uk$FILE</loc>
		<lastmod>$FILEMODW3C</lastmod>
		<priority>$PRIORITY</priority>
	</url>
EOF
done < sitemap_links_web_johncook_uk.tmp

cat << EOF >> sitemap.xml.php
<?php } ?>
EOF

##=========================##

sort /home/thejc/Scripts/sitemap_links_web_johncook_uk.tmp /home/thejc/Scripts/sitemap_links_web_johncook_uk.tmp2 | uniq -u > sitemap_links_web_johncook_uk.rss.tmp

mv -f sitemap_links_web_johncook_uk.tmp sitemap_links_web_johncook_uk.tmp2

while read line; do
	FILELOCAL=`echo "$line" | sed 's,^/$,/index,; s,$,.php,; s,^,.,' -`
	FILEMOD=`redis-cli hget johncook.uk:file:"$FILELOCAL" modified`
	FILEMODRFC=`date -u -R -d "$FILEMOD"`

	# Skip a link that cannot be fetched instead of abandoning the remaining links
	wget -q -O sitemap_links_web_johncook_uk.rss.link_src "https://web.johncook.uk$line" || continue
	RSSUPDATE=1

	ARTICLE_TITLE=`sed -n '/meta/s/.*property="og:title"\s\+content="\([^"]\+\).*/\1/p' sitemap_links_web_johncook_uk.rss.link_src | head -n1 | recode html..utf-8 | sed 's,&,\&amp;,g;s,",\&quot;,g;s,<,\&lt;,g;s,>,\&gt;,g'`
	ARTICLE_DESCRIPTION=`sed -n '/meta/s/.*property="og:description"\s\+content="\([^"]\+\).*/\1/p' sitemap_links_web_johncook_uk.rss.link_src | head -n1 | recode html..utf-8 | sed 's,&,\&amp;,g;s,",\&quot;,g;s,<,\&lt;,g;s,>,\&gt;,g'`
	ARTICLE_URL=`sed -n '/meta/s/.*property="og:url"\s\+content="\([^"]\+\).*/\1/p' sitemap_links_web_johncook_uk.rss.link_src | head -n1`

	rm sitemap_links_web_johncook_uk.rss.link_src

	cat << EOF | cat -s - /home/thejc/Scripts/sitemap_links_web_johncook_uk.rss.part > sitemap_links_web_johncook_uk.rss.part.tmp && mv sitemap_links_web_johncook_uk.rss.part.tmp sitemap_links_web_johncook_uk.rss.part
 <item>
  <title>$ARTICLE_TITLE</title>
  <description>$ARTICLE_DESCRIPTION</description>
  <link>$ARTICLE_URL</link>
  <pubDate>$FILEMODRFC</pubDate>
  <guid isPermaLink="true">$ARTICLE_URL</guid>
 </item>
EOF

done < sitemap_links_web_johncook_uk.rss.tmp
rm sitemap_links_web_johncook_uk.rss.tmp

#=========================#

cat << EOF >> sitemap.xml.php
<?php if (\$_SERVER['HTTP_HOST'] == "web.watfordjc.uk") { ?>
EOF

cat /home/www/var/www/johncook_co_uk/links/complete-sitemap.php | lynx -stdin -dump -listonly | grep file | sed 's,.*file://,,' > sitemap_links_web_watfordjc_uk.tmp

while read line; do
	FILE="$line"
	FILELOCAL=`echo "$line" | sed 's,^/$,/index,; s,$,.php,; s,^,.,' -`
	FILEMOD=`redis-cli hget johncook.uk:file:"$FILELOCAL" modified`
	PRIORITY="0.0"
	IGNOREFILE=`head -n37 "/home/www/var/www/johncook_co_uk/$FILELOCAL" | egrep "\$NSFW = \"(NULL|NSFW)\""`
if [ "$IGNOREFILE" != "" ]; then

	case "$FILE" in
/)
	if [ `date -u -d "$HOMELASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$HOMELASTMOD"
	fi
	PRIORITY="1.0"
;;
/archives/johncook-[0-9][0-9][0-9][0-9])
	if [ `date -u -d "$HOMELASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$HOMELASTMOD"
	fi
	PRIORITY="0.9"
;;
/gallery|/music|/links|/about|/status|/downloads)
	if [ `date -u -d "$INCLUDESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$INCLUDESLASTMOD"
	fi
	PRIORITY="0.8"
;;
/articles/?*/?*|/blogs/?*/?*|/gallery/?*|/archives/?*|/links/?*|/?*/?*/?*)
	if [ `date -u -d "$INCLUDESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$INCLUDESLASTMOD"
	fi
	PRIORITY="0.5"
;;
/articles/?*)
	if [ `date -u -d "$ARTICLESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$ARTICLESLASTMOD"
	fi
	PRIORITY="0.7"
;;
/articles)
	if [ `date -u -d "$ARTICLESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$ARTICLESLASTMOD"
	fi
	PRIORITY="0.8"
;;
/blogs/?*)
	if [ `date -u -d "$BLOGSLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$BLOGSLASTMOD"
	fi
	PRIORITY="0.7"
;;
/blogs)
	if [ `date -u -d "$BLOGSLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$BLOGSLASTMOD"
	fi
	PRIORITY="0.8"
;;
/music/?*)
	if [ `date -u -d "$INCLUDESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$INCLUDESLASTMOD"
	fi
	PRIORITY="0.7"
;;
*)
	if [ `date -u -d "$INCLUDESLASTMOD" +%s` -gt `date -u -d "$FILEMOD" +%s` ]; then
		FILEMOD="$INCLUDESLASTMOD"
	fi
	PRIORITY="0.1"
;;
	esac

	FILEMODW3C=`date -u -d "$FILEMOD" +%FT%TZ`
	cat << EOF >> sitemap.xml.php
	<url>
		<loc>https://web.watfordjc.uk$FILE</loc>
		<lastmod>$FILEMODW3C</lastmod>
		<priority>$PRIORITY</priority>
	</url>
EOF

fi

done < sitemap_links_web_watfordjc_uk.tmp

cat << EOF >> sitemap.xml.php
<?php } ?>
EOF

##=========================##

sort /home/thejc/Scripts/sitemap_links_web_watfordjc_uk.tmp /home/thejc/Scripts/sitemap_links_web_watfordjc_uk.tmp2 | uniq -u > sitemap_links_web_watfordjc_uk.rss.tmp

mv -f sitemap_links_web_watfordjc_uk.tmp sitemap_links_web_watfordjc_uk.tmp2

while read line; do
	FILELOCAL=`echo "$line" | sed 's,^/$,/index,; s,$,.php,; s,^,.,' -`
	FILEMOD=`redis-cli hget johncook.uk:file:"$FILELOCAL" modified`
	FILEMODRFC=`date -u -R -d "$FILEMOD"`

	# Skip a link that cannot be fetched instead of abandoning the remaining links
	wget -q -O sitemap_links_web_watfordjc_uk.rss.link_src "https://web.watfordjc.uk$line" || continue
	RSSUPDATE=1

	ARTICLE_TITLE=`sed -n '/meta/s/.*property="og:title"\s\+content="\([^"]\+\).*/\1/p' sitemap_links_web_watfordjc_uk.rss.link_src | head -n1 | recode html..utf-8 | sed 's,&,\&amp;,g;s,",\&quot;,g;s,<,\&lt;,g;s,>,\&gt;,g'`
	ARTICLE_DESCRIPTION=`sed -n '/meta/s/.*property="og:description"\s\+content="\([^"]\+\).*/\1/p' sitemap_links_web_watfordjc_uk.rss.link_src | head -n1 | recode html..utf-8 | sed 's,&,\&amp;,g;s,",\&quot;,g;s,<,\&lt;,g;s,>,\&gt;,g'`
	ARTICLE_URL=`sed -n '/meta/s/.*property="og:url"\s\+content="\([^"]\+\).*/\1/p' sitemap_links_web_watfordjc_uk.rss.link_src | head -n1`

	rm sitemap_links_web_watfordjc_uk.rss.link_src
		
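	# [ a = pattern ] compares strings literally and never matches a glob,
	# so use case to skip items whose canonical URL is on another hostname.
	case "$ARTICLE_URL" in
		https://web.watfordjc.uk/*) ;;
		*) continue ;;
	esac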

	cat << EOF | cat -s - /home/thejc/Scripts/sitemap_links_web_watfordjc_uk.rss.part > sitemap_links_web_watfordjc_uk.rss.part.tmp && mv sitemap_links_web_watfordjc_uk.rss.part.tmp sitemap_links_web_watfordjc_uk.rss.part
 <item>
  <title>$ARTICLE_TITLE</title>
  <description>$ARTICLE_DESCRIPTION</description>
  <link>$ARTICLE_URL</link>
  <pubDate>$FILEMODRFC</pubDate>
  <guid isPermaLink="true">$ARTICLE_URL</guid>
 </item>
EOF

done < sitemap_links_web_watfordjc_uk.rss.tmp
rm sitemap_links_web_watfordjc_uk.rss.tmp

#=======================#

echo -n "</urlset>" >> sitemap.xml.php

touch -d "$LATESTMODTIME" sitemap.xml.php
mv -f sitemap.xml.php /home/www/var/www/johncook_co_uk/links/

#=======================#

if [ $RSSUPDATE -eq 1 ]; then

cat << EOF > sitemap-rss.xml.php
<?php
\$if_modified = isset(\$_SERVER['HTTP_IF_MODIFIED_SINCE']) ? \$_SERVER['HTTP_IF_MODIFIED_SINCE'] : "nothing";
\$last_modified = gmdate("D, d M Y H:i:s","$LATESTMODTIMEUNIX")." GMT";
header("Cache-Control: Public, max-age=600, must-revalidate, s-maxage=600, proxy-revalidate");
header("X-Robots-Tag: noindex, nosnippet, noarchive, noodp");
if (\$if_modified == \$last_modified) {
        header("HTTP/1.1 304 Not Modified");
        header("Status: 304 Not Modified");
        exit();
}
header("Content-Type: application/rss+xml;charset=UTF-8");
header("Last-Modified: ".\$last_modified);
?>
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<?php if (\$_SERVER['HTTP_HOST'] == "web.johncook.uk") { ?>
<channel>
 <title>John Cook UK</title>
 <description>New canonical pages created at John Cook UK.</description>
 <link>https://web.johncook.uk/</link>
 <lastBuildDate>$RSSPUBDATE</lastBuildDate>
 <pubDate>$RSSPUBDATE</pubDate>
 <language>en-GB</language>
 <ttl>20</ttl>
EOF
head -n 210 /home/thejc/Scripts/sitemap_links_web_johncook_uk.rss.part >> sitemap-rss.xml.php
echo "</channel>" >> sitemap-rss.xml.php
cat << EOF >> sitemap-rss.xml.php
<?php } else if (\$_SERVER['HTTP_HOST'] == "web.watfordjc.uk") { ?>
<channel>
 <title>WatfordJC UK</title>
 <description>New canonical pages created at WatfordJC UK.</description>
 <link>https://web.watfordjc.uk/</link>
 <lastBuildDate>$RSSPUBDATE</lastBuildDate>
 <pubDate>$RSSPUBDATE</pubDate>
 <language>en-GB</language>
 <ttl>20</ttl>
EOF
head -n 210 /home/thejc/Scripts/sitemap_links_web_watfordjc_uk.rss.part >> sitemap-rss.xml.php
echo "</channel>" >> sitemap-rss.xml.php
echo "<?php } ?>" >> sitemap-rss.xml.php
echo "</rss>" >> sitemap-rss.xml.php

touch -d "$LATESTMODTIME" sitemap-rss.xml.php
mv -f sitemap-rss.xml.php /home/www/var/www/johncook_co_uk/links/

fi