New Domain Names - Part 4:
Redirects and Deindexing

In part 4 of this series of articles, I look at redirecting and deindexing the old domains.

Recap

In the previous article in this series, New Domain Names - Part 3: Googlebot and the Google Index, I modified the code of this site so that it was more semantic and equally suitable for Web spiders and humans.

Since January, Googlebot et al. have been crawling the site, and most of the content here is now indexed.

With less than 3 months to go until watfordjc.com expires, and less than 5 months until watfordjc.co.uk expires, I have reached the point where redirecting everything on the old domains is a priority.

Redirects

Although watfordjc.com has had a domain-wide redirect in place for many months since I resurrected it, some pages have been indexed under https://watfordjc.com. As I have transferred content across to this site and added redirects, the number of indexed links for https://watfordjc.com has gradually fallen, from a 12-month peak of 37 pages to the current 11 pages in Google's index.

With the final content transferred across, watfordjc.com was 301 redirected to watfordjc.co.uk, and almost every URL at watfordjc.co.uk returned either a 301 Moved Permanently redirect to the new URL at this site or a 410 Gone (content permanently deleted) status code.
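Expressed in nginx, those two behaviours look roughly like this (a sketch: the moved/deleted page paths and the example.net hostname standing in for this site are placeholders, not real URLs from either domain):

	# watfordjc.com: domain-wide permanent redirect to watfordjc.co.uk
	server {
		server_name watfordjc.com;
		return 301 https://watfordjc.co.uk$request_uri;
	}

	server {
		server_name watfordjc.co.uk;

		# A page that has moved to this site
		location = /moved-page.php {
			return 301 https://example.net/moved-page;
		}

		# A page whose content has been deleted for good
		location = /deleted-page.php {
			return 410;
		}
	}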

I then tried to use the Google Webmaster Tools site move feature, but because / at watfordjc.co.uk had a 307 redirect to /news.php, the tool wouldn't let me get through all the steps. Eventually I worked out how to replace the redirect with an internal rewrite in nginx, so that a GET of / returned the same content as a GET of /news.php:

	# Internal rewrite: serve the content of /news.php for a GET of /,
	# without sending the client a redirect
	location = / {
		rewrite "^/$" /news.php;
	}
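Because the rewrite is internal, nginx serves the /news.php content for / itself, and the client (including Googlebot) sees a plain 200 response rather than a 307, which seems to be what the site move tool expects of the destination site.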

Webmaster Tools finally accepted the site move, so now it should only be a matter of time before watfordjc.com disappears from Google's index.

robots.txt

It is probably worth mentioning at this point that some URLs could have a problem due to robots.txt.

A Disallow rule in robots.txt basically tells spiders such as Googlebot "you can't visit this page, but you may include it in your index".

A meta robots noindex tag (or its X-Robots-Tag header equivalent), on the other hand, tells spiders "you can visit this page, but you can't include it in your index".

The problem? If a URL is included in robots.txt, the meta tag/header will never be seen by the spiders because they will not visit the page. That can result in search engines including links to pages on your site that you'd rather weren't there, because you told the spiders they can't crawl your page rather than telling them they can't index your page.
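For example, a robots.txt rule like the following (using the print pages discussed below) means Googlebot will never request /print.php, so it will never see a noindex tag or header on that page, yet it may still index the bare URL:

	User-agent: *
	Disallow: /print.php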

After realising this error, I removed the 410'd pages from robots.txt (where applicable) so that Googlebot et al. can request each URL and see that the page no longer exists.

While not an issue on watfordjc.com, watfordjc.co.uk has a lot of hidden similar results in Google because Googlebot indexed (without crawling) the print version of every page on the site. /print.php is not only no longer in robots.txt, but it is also 410'd (this site doesn't have print versions, it has print-only styles).

I'm going to give Googlebot a month, and if anything is still listed for watfordjc.com I'll turn on a site-wide "noindex, nosnippet, noarchive" X-Robots-Tag header. If that doesn't work, then 2 weeks before domain expiry I'll use Webmaster Tools to "temporarily" delete all pages and, with the exception of robots.txt, 410 the entire domain.
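If it comes to that, both measures can be sketched in nginx along these lines (the root path is a placeholder, and the "always" parameter of add_header needs nginx 1.7.5 or later so that the header is also sent with 410 responses):

	server {
		server_name watfordjc.com;
		root /var/www/watfordjc.com; # placeholder path

		# Step 1: tell spiders not to index, snippet, or archive anything
		add_header X-Robots-Tag "noindex, nosnippet, noarchive" always;

		# Step 2, two weeks before expiry: robots.txt stays fetchable...
		location = /robots.txt {
		}

		# ...but every other URL on the domain is gone
		location / {
			return 410;
		}
	}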

Deindexing

In theory there are several ways to remove URLs from Google's index.

  • Use URL removal in Webmaster Tools and remove "/" (entire site).
  • Return a 410 Gone status code.
  • Return a 404 Not Found status code (Googlebot might come back just to double-/triple-check).
  • Return a 301 redirect (the old URL should, in theory, eventually be replaced by the new URL).
  • Remove the A/AAAA record(s) for the (sub)domain/hostname.
  • Use change of address in Webmaster Tools.

I've implemented 301 redirects and 410 Gone status codes on watfordjc.com and used the change of address tool in Webmaster Tools.

After a bit of fluctuation, there are currently 6 links in Google for the domain, with 10 pages indexed in Webmaster Tools for the https:// domain and 1 (the home page) for the http:// domain.

As for the .co.uk domain, there are 34 links in Google (including similar results), with 3 total indexed for the http:// domain and 64 indexed for the https:// domain.

It is now a matter of waiting to see if the changes have the intended effect.