New Domain Names - Part 3:
Googlebot and the Google Index

In part 3 of this series of articles I look at Googlebot and the Google Index.

Recap

At the end of the previous article in this series, New Domain Names - Part 2: Moving URLs, I created a permanent (HTTP/1.1 301 Moved Permanently) redirect for two articles.

Since then I have created redirects for most of the content that is located on either JohnCook.UK or WatfordJC.UK, with some entire sections of WatfordJC.co.uk (such as the image gallery) being fully redirected to URLs at Web.JohnCook.UK.

Until I have re-created all of the old content on the new domains, my old site will look like it is in a state of flux with a visitor not knowing which links are still internal and which links will result in a redirection to one of the new domains.

Let's get on with the purpose of this article: Googlebot and the Google Index.

Googlebot

Googlebot is a spider that follows (or revisits) links so that when you Google something there is a chance that, if someone has written about it, a link to it is returned to you.

There are many Googlebots, since a single machine is not capable of indexing the entire indexable Web, and there are several types of Googlebot: the Mobile Googlebots (Googlebot-Mobile), the Images Googlebots (Googlebot-Image), the Web Search Googlebots (Googlebot), and others.

Googlebots visit a page, much like humans do, and analyse the data on the page, also like humans do. Unlike humans, however, Googlebot does a lot more with a page than we do: it runs the page through an algorithm that, almost magically, turns it into a link that can be found by a search in under a second.

Well, it does if Google deems the content worth indexing and you know the exact search term that will return that page in the Google index on the first page of Google results.

Crafting a page (or site) so it is seen by Google's algorithm as the most relevant for a search term has spawned an entire industry that focusses on Search Engine Optimisation (or SEO for short). As you can imagine, SEOs use SEO so they rank high in Google's Index for... SEO.

If a Search Engine Optimiser (also SEO for short) has done all the optimising they can and still is not in the top 10 results when you Google SEO, that shows just how limited SEO is.

Search Engine Optimisation cannot work miracles, but there are best practices.


Search Engine Optimisation

The Web has changed a lot over the years.

If you compare the source code of this page to one I created before XHTML came along you will likely see some differences because XHTML changed "self-closing tags" so that a forward slash was required. For example <img ...> became <img ... />. In HTML5 we call these "void elements" as they do not require an "end tag", and the forward slash is now optional.

Compare this page to one I created back in the '90s, and you will undoubtedly notice another difference: almost all of my HTML elements these days are in lower case, whereas they used to be in upper case.

HTML5 element names are case-insensitive, yet I continue to use lower case. Why? For the same reason a line break (<br />) might still include a forward slash: muscle memory (and my mind) have made it so that some things now look more natural. <BODY> just looks wrong to me these days.

These two changes might not seem like much to humans, but a robot/spider needs to be able to understand the essence of a Web page. Table layouts and Flash intro pages have given way to CSS and optimising for robots. SEO at its simplest is structuring pages so that humans and robots alike can see the essence of a page.

HTML5 and Web Page Optimisation

If I were just starting to learn HTML now I wouldn't have some remnants of habits left over from the old days. Instead, I have to learn HTML5 as if it is something new even though I have been using paragraph (<p>Paragraph</p>) tags for over a decade.

Some things have changed, some new things have come along, and some things are no more. HTML5 has brought with it some accessibility improvements, so a page optimised and using HTML5 elements correctly is not only accessible for humans and Googlebots, but for screen readers and other accessibility programs as well.

In order to optimise Web pages for Googlebot, I need to know what the rules of HTML5 are. For that, I need to do some Googling.


Page Structure and Sectioning Elements

HTML5 has introduced sectioning elements. Here is a list of these new elements, taken from How to Use The HTML5 Sectioning Elements:

  • <header>
  • <main>
  • <footer>
  • <section>
  • <article>
  • <aside>
  • <nav>

These new elements have rules, and the newest element (<main>) is only permitted to be used once per page.
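
To make the hierarchy concrete, here is a minimal sketch of a page that uses all seven elements, with just one <main> (the headings, text, and comments are placeholders rather than my actual markup):

<!DOCTYPE html>
<html lang="en">
<head>
  <title>Example page</title>
</head>
<body>
  <main>
    <article>
      <header>
        <h1>Article title</h1>
      </header>
      <section>
        <h1>A section of the article</h1>
        <p>Section content.</p>
      </section>
      <aside>
        <p>Content related to the article.</p>
      </aside>
      <footer>
        <p>Posted by the author.</p>
      </footer>
    </article>
  </main>
  <nav><!-- site navigation --></nav>
  <footer>
    <p>Copyright and licence information.</p>
  </footer>
</body>
</html>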

The reason I am bringing this up now is that I told Googlebot to index https://Web.JohnCook.UK/ (and all linked pages) yesterday, and today the Google results for site:JohnCook.UK are a bit of a mess. Perhaps if I were using the sectioning elements correctly the results would look consistent.

So let us take each sectioning element, and determine what each should be used for by going in the opposite direction: what sectioning element should surround this thing?

Site Navigation Bar

The site navigation bar is at the top of every page. If you're using a device with a small screen it is collapsed. If you don't have CSS (and/or JavaScript) enabled it is at the bottom of the page. It is also located at the bottom of the page if you are viewing the page source.

I have one known bug with the navigation bar: the padlock icon does not have link text. It is on my list of things to do, so I will come back to that another time.

Where do site-wide navigation bars belong in HTML5?

It obviously belongs inside <nav>Navigation</nav> tags. Does this <nav> element belong directly inside the <body> element?

The answer appears to be: yes. The <header>Header</header> element is meant for the introductory text of a section, and as for the <footer>Footer</footer> element:

The footer element represents the footer for the section it applies to.

W3C, footer

To understand this, we need to know what a section is. I have listed above what are sectioning elements, but we also need to know what are sectioning roots.

The following is a list of sectioning roots:

  • <blockquote>
  • <body>
  • <details>
  • <fieldset>
  • <figure>
  • <td>

Therefore, if we put the <nav> element inside a <footer> element that is a direct descendant of <body>, it will be considered a footer for the <body> element.

From the look of things, it would be correct to place the site navigation within the footer for the body element. It should, however, be kept within a nav element as it is semantically the correct thing to do.

Site Footer

The site footer includes the copyright and license information for the site. As has already been established, this does indeed belong inside a <footer> element that is a direct descendant of <body>.

As I would consider the copyright and license information to be more important than links to elsewhere on the site, the site navigation <nav> should come after the copyright and license paragraphs within the <footer>.
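
Putting those two conclusions together, the end of each page should look something like this (the wording and link targets are placeholders; the real footer contains more than this):

<body>
  <!-- page content -->
  <footer>
    <p>Copyright © John Cook.</p>
    <p>Licence information for the content of this site.</p>
    <nav>
      <ul>
        <li><a href="/">Home</a></li>
        <li><a href="/articles">Articles</a></li>
        <li><a href="/blogs">Blogs</a></li>
        <li><a href="/gallery">Gallery</a></li>
      </ul>
    </nav>
  </footer>
</body>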

Alert Boxes

Every page on the site has 3 alert boxes. The first is for global alerts which are alerts for all domains this site serves. The second is for site alerts which are site-wide alerts for the requested domain. The third is for client alerts which are alerts pertinent to the user requesting the page, such as information about font loading, reminders about upgrading to newer browser versions, et cetera.

The reason I see the Google results for this site as inconsistent is that some of the search results snippets contain the text from the global alert rather than the description for the URL.

Obviously the content of the alert boxes is not part of the content of the requested page. But if I want to draw attention to them (they are alerts, after all) then they do belong at the top of the page in their current position; I tested numerous positions and decided that the current location is the best place visually.

These alert boxes are not part of the document, so do not belong within <main>. The reason I haven't yet used <main> on my site is because of these alerts, which should be outside of the article body (semantically) but inside the article body (visually). Would <aside> be perfect for this use?

The answer appears to be: no. <aside> within an <article> is considered to be related to the article, therefore using <aside> would be semantically incorrect.

There is, however, the new hidden attribute. I don't know how Googlebot will deal with it, so there is only one option remaining: delete the content of the alert boxes, populate the content using JavaScript, and only do so if the user-agent is not a Googlebot.

After modifying some PHP scripts, and modifying varnish so a User-Agent string that includes "Googlebot" gets passed on to lighttpd as "Crawler" (as opposed to being stripped unless the browser is old), my site now appears to look OK using Fetch as Googlebot.

I have kept the divs and paragraphs for the alert boxes for the time being, but have removed the content as well as implementing tests in PHP so the JavaScript for font loading and updating the alert boxes is not included in the page. I will have to revisit this when I start moving the inline JavaScript into files that I include.
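
As a rough sketch of the approach (the IDs, the alert text, and the way the script is gated are illustrative; my real markup and inline JavaScript differ):

<div id="global-alerts" class="alert"></div>
<div id="site-alerts" class="alert"></div>
<div id="client-alerts" class="alert"></div>
<!-- The PHP only includes the following script when the normalised
     User-Agent is not "Crawler". -->
<script>
document.getElementById('client-alerts').textContent =
  'Web fonts are still loading; the page may briefly change appearance.';
</script>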

Articles and Sections

The next thing on every page of this site is an <article> or <section> element that encompasses the entirety of the page. As I have already determined where the tricky things belong, this part is simple. I need to edit every page on my site and wrap the outermost article/section element with a main element.

With a bit of CSS tweaking, I have now got my site navigation bar back to how it was, the copyright/license how it was, and links inside the site navigation bar functioning as they were.

Previously, the <nav> element for site navigation was not within a parent element that had a class of row; now it is, and that required some tweaking. Since the navigation links are just links, their appearance and behaviour changed after moving the <nav> inside the footer. But, since the content of each page is now going to be within a <main> element, I can prefix those .row CSS selectors with main so they no longer affect the footer.

That just leaves me with the majority of the site looking ugly because no <main> element exists on the page. No, wait, every page on this site has an include before the main content (headertail.php) and after the main content (copyright.php) so if I add <main> to the end of the former and </main> to the beginning of the latter the entire site will have <main>Main</main> tags.
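
In other words, the two includes bracket the per-page content roughly like this (heavily simplified; the real includes contain far more than shown):

<!-- ...end of headertail.php -->
<main>

<!-- per-page content: the outermost <article> or <section> -->
<article>
  <header><h1>Page title</h1></header>
  <p>Page content.</p>
</article>

<!-- start of copyright.php... -->
</main>
<footer>
  <p>Copyright and licence information.</p>
  <nav><!-- site navigation --></nav>
</footer>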

Articles

The <article>Article</article> element is a lot like the <section>Section</section> element. An <article> element is meant for content that can be considered self-contained and reusable, whereas a <section> element is meant for content that is related.

You know an <article> should be used if the content could have a pubdate attribute (published date).

I will use a library of books as an example. Each book has a published date, therefore each book could be considered an article.

A bookshelf itself does not have a published date, so would be a section.

A type of bookshelf, such as one where the works of an author are collected, could have a published date if you were to consider the date of first collation, and a modified date when the books on the shelf are rearranged, or additions/deletions to the shelf are made.
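
Marked up, the analogy might look like this (the titles and dates are invented for illustration):

<section>
  <h1>Bookshelf: collected works of one author</h1>
  <article>
    <h1>The First Book</h1>
    <footer>Published <time datetime="1990-01-01">1 January 1990</time></footer>
  </article>
  <article>
    <h1>The Second Book</h1>
    <footer>Published <time datetime="1995-06-15">15 June 1995</time></footer>
  </article>
</section>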

Take, for instance, my pages /articles and /blogs. At the moment I manually update them, and group things that are related together.

For example, this article will be grouped as /blogs (Home/Blogs), /blogs/website (Home/Blogs/Website), and "New Domain Names" (Home/Blogs/Website/New Domain Names) if I decide there are now enough posts in this series to create a sub-group under Website.

Sections

Both /blogs and /articles look wrong with a "Posted by" footer, as do /gallery and /status. Thus those 4 shall be converted to <section>s if they are not already, and the posted/modified dates removed.

On the other hand, /music, /links and /about look right with a "Posted by" footer, so I will convert those to <article>s if they are not already, and keep the posted/modified dates.

The home page is a special case. It has an introductory section for which a posted/modified date looks correct, but it also includes pieces of content from elsewhere on the site that themselves have posted/modified dates. For the time being, the Home Page post is within a <section> element that is scoped as a schema.org Article, with the post itself inside a <div>.

Also within the <section> on the Home Page are numerous <article>s, which are snippets of individual articles. As they are somewhat related to the Home Page, I shall leave things as is for the time being.

I have just made another tweak to the site, and this page is the first that I know of to utilise it fully/properly. A <section> within an <article> can best be described as a piece of content that has a natural start and end. The way I have marked that is twofold: the <h1>s within <section>s are styled the same way as the <h1> that is the main heading for a page, and I have inserted an <hr> between sections.

The heading Googlebot was, before I sectioned this page, a level 2 heading. It became a level 1 heading after sectioning, but after styling I asked myself if a horizontal rule truly belongs before that heading.

Although the Googlebot section could be self-contained, it is closely related to the previous section. Thus I gave Googlebot a level two heading also. Since the Recap section also doesn't look right with a horizontal rule above it, it is within the <article> element but not within its own <section> element.
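
The structure of this page therefore ends up roughly like this (headings abbreviated and content replaced with comments):

<article>
  <header>
    <h1>New Domain Names - Part 3: Googlebot and the Google Index</h1>
  </header>
  <h2>Recap</h2>
  <!-- paragraphs -->
  <h2>Googlebot</h2>
  <!-- paragraphs -->
  <hr>
  <section>
    <h1>Search Engine Optimisation</h1>
    <!-- paragraphs -->
  </section>
  <hr>
  <section>
    <h1>Page Structure and Sectioning Elements</h1>
    <!-- paragraphs -->
  </section>
</article>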

So, that is how I am going to determine whether to section up an article. If part of an article has a heading and reads more like a new chapter, where a mention of "above" after this point almost certainly refers to something below this heading (i.e. within the section), where "previous" might be used to refer to something before this point, and where a horizontal rule above the heading looks logical, then it is obviously a section of the article.

It is not possible to "break-out" of a section once it has been entered. To return to a higher level, such as the same level of the Googlebot heading, there is no styling that would indicate it. A horizontal rule can indicate the area between sections, but not the end of a section and the return to the previous level in a hierarchy (at least to my mind).

Thus, it may well be difficult to write in a structured way such that sectioning makes more sense than having level 6 headings.

Breadcrumbs

Unlike typical Web sites, I have breadcrumbs appearing multiple times on the Home Page. When an <article> is included there, it brings its breadcrumbs with it, as on my site breadcrumbs have multiple uses.

For example, the breadcrumb on this page for the current page is not styled like a link, although it does have a link icon next to it.

Hover over it with a mouse (or tab to it) and it is styled like a link but without the clicky hand cursor (navigating to the current page sounds redundant).

Thus, if you click the link to this article on the Home Page (the breadcrumb) you not only get brought to this page but the link you clicked is still there, albeit not styled in link colours (yet the icon and the black text, as opposed to the blue/purple text for the other breadcrumbs, might make you wonder if it is also a hyperlink).

What use is a link to the current page? Right-clicking. Dragging. In Firefox you can drag a link to the tab collection and open it in a new tab. Although you can do the same by double-/triple-clicking the URL bar text to highlight it, or by pressing CTRL+L, you might want to use the right-click context menu for something you have added to your browser that is not available from the URL bar or from right-clicking the current tab (e.g. Firefox does not have a duplicate tab option).

Breadcrumbs for the current page obviously belong in a <nav>Navigation</nav> element, within the <header> element of the page. I am currently using unordered lists, however.
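
As a sketch (the class name and the href for this page are simplified placeholders, and the icon and Microdata are omitted), the breadcrumb block looks something like this:

<header>
  <nav>
    <ul class="breadcrumb">
      <li><a href="/">Home</a></li>
      <li><a href="/blogs">Blogs</a></li>
      <li><a href="/blogs/website">Website</a></li>
      <li><a href="/blogs/website/new-domain-names-3">New Domain Names - Part 3</a></li>
    </ul>
  </nav>
</header>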


HTML5 Outline Algorithm

HTML5 also brings with it the HTML5 Outline Algorithm. The best way I have found to think of this is to imagine a doctype within a doctype. The html doctype tells a user agent, such as a Web browser, that the document should be read as HTML5. The HTML5 Outline Algorithm kicks in when the user agent comes across certain tags.

Thus, if you don't include these tags, your browser will process the page in the HTML5 equivalent of quirks mode. Everything will be within the sectioning root "body", and the Outline Algorithm will play no part in the rendering of the page.

Video: HTML5 overview: Introducing the outline algorithm | lynda.com

In the above video the outlining algorithm is briefly described. There is, however, a cause for more confusion...

If every sectioning element should have a heading, what about navigation bars? More to the point, I have just wrapped the main content of every page within a <main> element that does not have a <h1> child, because the heading for the page is within another sectioning element!

So, here is a rather odd situation. The <main> element is to replace the ARIA role="main", but most main content will already be within an <article> or <section>. Is <main> a more specific type of <article>, like <article> is a more specific type of <section>?

Should I get rid of the <main></main> tags I have just added to my site and instead edit the source code for every page changing the main (outermost) article/section so it has a role of main?

Should I replace the <article> tags with <main> tags and give them role="main" and itemscope itemtype="http://schema.org/Article"?

When the main content of the page (i.e. excluding footers, headers, navigation blocks, and sidebars) is all one single self-contained composition, the content should be marked up with a main element and the content may also be marked with an article, but it is technically redundant in this case (since it's self-evident that the page is a single composition, as it is a single document).

W3C, HTML5, Sections

The answer appears to be: no. In fact, upon inspection of the extremely long HTML5 specification (at the time of reading), <main> is "flow content" rather than "sectioning content".
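
In other words, I can keep the <main></main> tags I have just added as a wrapper around the outermost sectioning element, roughly like this (the role attribute is a belt-and-braces addition for older assistive technology, and is my suggestion rather than something the spec demands; the itemtype is the one I already use):

<main role="main">
  <article itemscope itemtype="http://schema.org/Article">
    <header>
      <h1>Page title</h1>
    </header>
    <p>Page content.</p>
  </article>
</main>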

The following list is straight from The Sections section of the HTML5 Specification:

  • <body> - Category: Sectioning root; Content model: Flow content.
  • <article> - Categories: Flow content, Sectioning content, Palpable content; Content model: Flow content (must have no <main> descendants).
  • <section> - Categories: Flow content, Sectioning content, Palpable content; Content model: Flow content.
  • <nav> - Categories: Flow content, Sectioning content, Palpable content; Content model: Flow content (must have no <main> descendants).
  • <aside> - Categories: Flow content, Sectioning content, Palpable content; Content model: Flow content (must have no <main> descendants).
  • <h1>-<h6> - Categories: Flow content, Heading content, Palpable content; Content model: Phrasing content.
  • <header> - Categories: Flow content, Palpable content; Content model: Flow content (must have no <header>, <footer>, or <main> descendants).
  • <footer> - Categories: Flow content, Palpable content; Content model: Flow content (must have no <header>, <footer>, or <main> descendants).
  • <address> - Categories: Flow content, Palpable content; Content model: Flow content (must have no Heading content or Sectioning content descendants; must have no <header>, <footer>, or <address> descendants).

With the above list, I can now generate a list of elements in the Sectioning content category:

  1. <article>
  2. <section>
  3. <nav>
  4. <aside>

As for sectioning roots, we have one as far as the sectioning page of the spec is concerned: <body>.

We also have the elements in the Flow content category:

  1. <article>
  2. <section>
  3. <nav>
  4. <aside>
  5. <h1>-<h6>
  6. <header>
  7. <footer>
  8. <address>

What is flow content? Pretty much everything in the body of an HTML document, with the exception of a few things, like some metadata. We can ignore this list for the time being.

We have <h1>-<h6> which are the only things in the HTML5 Sections section that are Heading content elements.

Heading content is that which defines the header of a section, whether that section is explicitly marked up with a Sectioning content element or merely implied by the <h1>-<h6> element itself.

What I think this means is that you don't need a <body> tag, because if you have an HTML doctype, and you go from the <head> element straight to a <h1>, the <body> element will be implied.

I'm not sure how it would apply to multiple <h1>s in a document without explicit sectioning elements in practice, however.
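
For example, as far as I can tell the following is a conforming document even though there is no <body> start tag; the parser opens the body element when it reaches the <h1>:

<!DOCTYPE html>
<html lang="en">
<head>
  <title>Implied body</title>
</head>
<h1>A heading</h1>
<p>The body element starts here by implication.</p>
</html>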

Palpable content basically means an element should have at least one child node that is not empty or hidden. Apparently exceptions are made if the element is a placeholder that will be filled by a script (e.g. my alert boxes).

Phrasing content is pretty much the text of a document and the elements you wrap around text at the phrase level. A <h1>, for example, might contain the same text as a <p>, but the chosen element adds meaning (e.g. heading versus paragraph).

The HTML5 Outlining Algorithm, as with Microdata, is a big subject that deserves its own article. For now, I think I understand the basics.


Microdata, Rich Snippets, and Social Media

I am writing an article that deals with HTML5, Microdata, and Rich Snippets, so I won't go into that much depth here.

Rich Snippets are, basically, the extras Google Search shows in the search results to make a link richer. Whether it be breadcrumbs, a picture or video next to the link, or a picture of the author, these things make the search results less dull.

When it comes to Search Engine Optimisation (SEO), Rich Snippets are also recommended because richer links are more likely to get clicked than bog-standard links.

Microdata is a way of adding metadata and making the content of a page more machine-readable (semantic). Rich Snippets use Microdata (among other things) to help the Google Algorithm decide whether a page is relevant to a search term or not.

Open Graph and Twitter Cards, used by Facebook and Twitter (among others), are additional metadata added to a page so that when sharing a link on a social network you get more than a link. It is the Social Media equivalent of rich snippets.

Unfortunately, all of this metadata and Microdata is not standardised. Twitter falls back on Open Graph for some things, but you still need a twitter username to say who wrote the page. Google wants an author to be linked in the way of a Google+ profile. If you want an image to appear next to your link in Google, Facebook, and Twitter you probably need 3 different sized images... plus 2 or 3 more for an image of the author.

Thankfully, most of this can be automated. If you already have something inside <h1> and it is repeated (perhaps partly) inside <title> then it can be repeated elsewhere, such as og:title. Likewise, your <meta name="description" content="something blah"> tag contains content (something blah) that can be repeated elsewhere, such as og:description. Likewise the canonical link tag and the og:url meta tag.
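
As an illustration (the description and URL here are invented placeholders), the duplication in the <head> looks like this:

<head>
  <title>New Domain Names - Part 3: Googlebot and the Google Index</title>
  <meta name="description" content="Part 3 of this series: Googlebot and the Google Index.">
  <link rel="canonical" href="https://web.johncook.uk/blogs/website/new-domain-names-3">
  <meta property="og:title" content="New Domain Names - Part 3: Googlebot and the Google Index">
  <meta property="og:description" content="Part 3 of this series: Googlebot and the Google Index.">
  <meta property="og:url" content="https://web.johncook.uk/blogs/website/new-domain-names-3">
  <meta name="twitter:card" content="summary">
  <meta name="twitter:site" content="@JohnCookUK">
</head>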

Microdata is cool. It also adds bloat to code. But code bloat is compressible if it is repeated a lot, such as itemprop=, because the more often it occurs the more likely it is to be compressed (depending on compression algorithm).

Microdata adds more semantic meaning to HTML5. Although some things would appear redundant to us humans (such as marking up an <article> with the schema.org schema of Article), by letting machines know the code follows the rules, Googlebots and the like can see the essence of our pages more easily once support has been added to them.

An image gallery is a good example, as a web crawler in the old days would just see a load of <img> tags. If an <img> was surrounded by <a> tags it could be inferred that the image is either (a) a thumbnail, or (b) an image that contains something relevant to the linked document/file (e.g. a download icon). Without an alt attribute for an image of a download icon, search engines would have trouble guessing.

With Microdata, you can surround images with a Sectioning element given an ImageGallery or CollectionPage scope. Give the images (or figures) the relevant itemprops and Googlebots and other spiders no longer need to guess: they can see that something is a thumbnailUrl, contentUrl, caption, et cetera.
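
A minimal sketch of that (the file names and caption are placeholders, and the exact properties depend on which schema.org type you pick):

<section itemscope itemtype="http://schema.org/ImageGallery">
  <h1 itemprop="name">Example gallery</h1>
  <figure itemprop="associatedMedia" itemscope itemtype="http://schema.org/ImageObject">
    <a itemprop="contentUrl" href="/images/example-full.jpg">
      <img itemprop="thumbnailUrl" src="/images/example-thumb.jpg" alt="Example image">
    </a>
    <figcaption itemprop="caption">A caption describing the image.</figcaption>
  </figure>
</section>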

Although none of the things in this section are strictly necessary, making pages easier for machines to understand whilst making them more enticing for humans to click/tap/activate is part of SEO.

First you make content worth reading, then you make it accessible to search engines, then you try and optimise things so people looking for something think your link is relevant. If a lot of links are relevant, it might then just come down to which link looks prettier and/or more enticing.

Redirecting to the Canonical Domain

It is perhaps too early to tell how Google does things, but given that some of the results for site:JohnCook.UK show non-https URLs, it appears Google ignores canonical URLs/links.

Therefore I am going to need several redirects in place:

  • https://www.watfordjc.com -> https://web.watfordjc.uk
  • http://www.watfordjc.com -> https://web.watfordjc.uk
  • https://watfordjc.com -> https://web.watfordjc.uk
  • http://watfordjc.com -> https://web.watfordjc.uk
  • https://web.watfordjc.co.uk -> https://web.watfordjc.uk
  • http://web.watfordjc.co.uk -> https://web.watfordjc.uk
  • http://johncook.co.uk -> https://web.johncook.uk
  • https://johncook.co.uk -> https://web.johncook.uk
  • http://web.johncook.co.uk -> https://web.johncook.uk
  • https://web.johncook.co.uk -> https://web.johncook.uk

Yes, I am going to need to bring a couple of hostnames back from the graveyard in order to get SSL certificates for (and make redirects from) watfordjc.com.

One of the issues I currently have is that nginx does not appear to be configured for terminating all http connections, meaning that connections not over https are bypassing my varnish cache.

I think the only way I will be able to resolve that will be to shift lighttpd from my public IPv6 addresses to my private ULA IPv6 addresses.

I also need to set a date for when 149.255.99.49 will stop being web.watfordjc.co.uk on port 443 and start being web.johncook.uk instead.

I have just tested https://www.watfordjc.com using openssl as follows:

openssl s_client -connect www.watfordjc.com:443 -servername www.watfordjc.com
GET / HTTP/1.1
Host: www.watfordjc.com:443

The response I got back?

HTTP/1.1 307 Temporary Redirect
Server: nginx/1.7.9
Date: Sun, 18 Jan 2015 21:55:49 GMT
Content-Length: 0
Connection: keep-alive
Location: https://web.watfordjc.co.uk/news.php

DONE

Obviously, this is because https://www.watfordjc.com is the same port and IP as https://web.watfordjc.co.uk, and web.watfordjc.co.uk is the default for that IP address (deliberate decision).

My .com is still alive because I haven't fully killed it off. As my .com is pretty much duplicate content in Google for my .co.uk, and I intend to kill both of them off, I think it would be best (once I have transferred all my old content) to 301 all the URLs at both domains to their new locations.

In fact, some of the Google results for site:watfordjc.com are now showing breadcrumbs and no https, and a quick test does indeed show that the link in Google for one such URL is for a page that (at watfordjc.co.uk) is 301'd. Why is it still in the Google index, then? Perhaps more time is needed.

I have trillions of IPv6 IP addresses, with a /116 dedicated to each protocol. Thus, for Web sites, I have 2^(128-116) = 2^12 = 4,096 IP addresses available. Unfortunately, my VPS provider does not supply a /64 (or a /80) so for native (rather than using 6in4 tunnels) IPv6 I only have 2 IPv6 IP addresses - a limitation meaning that if I use native IPv6 I will need to require SNI support.

Googlebot supports SNI, but at the time of writing (according to Qualys) neither BingBot nor Yahoo Slurp do. So I have a dilemma, as even if Bing and Yahoo support IPv6, most home connections do not. I do have 3 IPv4 IP addresses, however.

The most logical thing would appear to be to use one IPv4 IP address for web.watfordjc.co.uk (and watfordjc.com), and another for web.watfordjc.uk.

Although doing that would make web.watfordjc.uk the default domain for TLS on that IP address, that domain contains the pages I have flagged as NSFW (Not Suitable For Work) so using that domain when the requested domain is unknown is unlikely to result in an erroneous 404.

As I want to keep my webmail on the same IP address it is on, what I am going to have to do is move 149.255.97.82:443 from a default domain of web.johncook.co.uk to a default domain of web.watfordjc.uk. That will not be such a bad thing, as johncook.co.uk was only a placeholder for WatfordJC.UK and JohnCook.UK and I will need to redirect that domain anyway.

So, the plan is as follows:

  • 149.255.99.49: Web.WatfordJC.co.uk
  • 149.255.97.82: Web.WatfordJC.UK

In order to do what I want (nginx in front of varnish in front of lighttpd), however, I need to move http to nginx as well. That might prove problematic as at present it is assumed that all connections to varnish must be coming from nginx and must, therefore, be HTTPS.

It doesn't look like getting varnish to add whether a connection is https or not to the hash is that difficult, and passing the scheme from nginx to varnish also appears simple.

The problem: I have hundreds of lines of messy code in numerous Web server configuration files (half of which are not even being used) and I just don't know how to do what I want.

Well, if I want to move http, then I will need the IP addresses for one of the domains I'm moving. Let's take a look at JohnCook.co.uk:

  1. 149.255.97.82
  2. 2001:470:1f09:38d::80:c

A quick grep "^\+.*149.255.97.82.*$" /etc/service/tinydns/root/data | wc -l shows I have 6 A-Records for that IP address. Likewise, a grep "^3.*0080000c.*$" /etc/service/tinydns/root/data | wc -l gave 6 AAAA-Records for that IP address.

Fortunately, and surprisingly, the only domains I want on that IPv4 address are the ones already on it. Also, I already have the varnish and lighttpd stuff ready to go, because varnish connects to lighttpd's other (non-published) IPv6 address for those domains.

Actually, it did take a bit of work in nginx, varnish, lighttpd, and the PHP code for these sites, but the end result appears to be working how I want it to for the moment. I have added more crawlers/spiders to the regex check in varnish, and varnish is now in front of all three domains.

My varnish cache normalises the requested domain (so it is lower case if it is an expected domain), User-Agent (all IE, outdated Opera, outdated Firefox, outdated IE, outdated Chrome, outdated IceWeasel, Crawlers, everything else deemed generic), Cookies (fonts=1, or no cookies), protocol (https, http), and Accept-Encoding (gzip or uncompressed).

Thus all crawlers should be served the same page: a stripped-down regular page with the fonts loaded in the <head> on the first load and no cookie set, as a Flash Of Unstyled Content (FOUC) doesn't really matter to non-humans. As I don't need to load fonts using JavaScript for Googlebots, I don't need to send the font-loading JavaScript.

On the occasions when I update the varnish .vcl script to add more outdated browser versions, the number of cached copies of a page will reflect visitors' browser choices - so if a lot of people haven't updated Chrome, a page with an "outdated Chrome" alert will be cached and served to those users.

As time goes on my check will gradually become outdated and people will restart Chrome so that eventually most people will be served the generic page.


Finishing Touches and Testing

With a couple more tweaks, and after whitelisting my domains for Twitter Cards, I posted my first tweet in the early hours GMT (i.e. the quiet time for the Internet) and did some further testing whilst keeping an eye on my server logs.

Thankfully all the hits were from bots, so I made some further tweaks adding twitter:site to the metadata (pointing to @JohnCookUK unless the NSFW flag is set, in which case it gets set to @WatfordJC), and used the testing tool to test the URL again.

I refreshed twitter several times (unusual - I usually only refresh when something goes wrong) and noticed that the card updated if I used the testing tool again. I finally got twitter looking how I wanted it, and moved on to Facebook.

After eventually guessing my correct password, I went back in my browser to Google and clicked the link for the Facebook Open Graph Object Debugger testing tool. I pasted in the URL I had pasted in twitter and was pleasantly surprised: everything looked fine.

I looked at my Facebook profile page and there was my tweet, the twitter t.co shortened URL, the title of the page, the description of the page, the canonical URL of the page, and to the left a large image of me shaving, with "@WatfordJC on Twitter" underneath showing where the Facebook post had come from.

I went back to the Facebook debugger, pasted in the shortened twitter URL, and Facebook came back with the fact that twitter not only uses a 301 redirect, but that Facebook will honour the canonical URL for that page.

A further bonus is that the link on Facebook remains the t.co URL, so it is feasible that I'll be able to monitor clicks from Twitter Card analytics.

Whilst watching the logs, I noticed someone attempting to use my server as an open proxy. Curious, I tested it myself. Although I couldn't GET http://bbc.co.uk HTTP/1.1 using any Host:, I did notice a flaw in my coding: I hadn't set a default domain.

That's correct: connecting over TLS and giving an invalid host resulted in my server taking that hostname (or the URI from the GET) and making it the canonical link of the page. That looks open to potential abuse (point the A record of a domain at my IP and it could screw things up!).

So with a bit more tweaking in varnish, I added an else statement so if the hostname is not one that is expected, it is assumed to be web.watfordjc.uk. With a couple more tweaks to the PHP code, everything looked good.

While I was doing all this I had a bit of bad PHP code and was surprised that it failed gracefully - the top menu didn't load, nor did the fonts, and for a while the twitter card metadata was replaced with an HTML comment saying there was an error, but the content of the page still loaded.

I also had an issue with the padlock icon (far right of the site navigation menu) not working properly and it took me a couple of hours until I realised it was because I hadn't modified my nginx configuration so the backend was lighttpd rather than varnish.

Whilst trying to fix that, I fixed the empty link text issue. Unfortunately, I do not yet know enough to be able to replace the duplicated text in the links and the list item titles, so have replaced one accessibility issue with another.

The issue I have is this: how do I have a title attribute for the list items so that mousing over shows what the icon means when all the text is hidden on medium sized displays without screen readers reading the title attribute?

If screen readers ignore the title attribute then it is fine. Although keyboard (and touchscreen) users probably don't have access to the title tooltip, tabbing to (or long-tapping) the icon will show the URL either in the status bar or in a "what do you want to do with this link?" dialog. Since the URLs are descriptive (e.g. /articles, /blogs, /status) and are in fact what is shown in the link text on larger screens, I don't think it is that much of an issue unless screen readers read the text twice.
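
For reference, the pattern in question is roughly this (the class names are placeholders for whatever displays the icon and hides the text at medium widths, and aria-hidden on the decorative icon is my assumption rather than necessarily my current markup):

<li title="Articles">
  <a href="/articles">
    <span class="menu-icon" aria-hidden="true"></span>
    <span class="hidden-on-medium">Articles</span>
  </a>
</li>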

Rich Snippets and the Google Index

I have just dug a little deeper into the actual Google results, and copy/pasted one of the links that has breadcrumbs (but no https:// prefix to the domain) into a shell to visit it with lynx.

I am actually surprised by the result: although the breadcrumbs do not indicate that the link is for a secure site, Google does actually redirect to the correct (canonical) URL.

I am no longer concerned about Google not showing https:// in front of URLs that have breadcrumbs in the Google index, as it does look like Google have deliberately decided to strip the https:// to make the breadcrumbs look prettier. That is fine with me.

https://watfordjc.com/ still lives in Google, and is actually the top result for 'John Cook Watford'. I still have some content to recreate from that domain (now known as web.watfordjc.co.uk) but if what I have read about "link juice" and 301 redirects is true, then 301'ing watfordjc.com to web.johncook.uk should mean my new site will be on the first page of Google results by the time I let my old domains expire.