Are infinite loops, dirty architecture and too many indexed URLs ruining your website?
Crawl Rank and Crawl Tank
One of my own projects is a site which I founded in 2008. As a result of negative experiences hiring web developers and SEO service providers, I undertook a process of self-education in order to know enough about web development and SEO to ensure I would get value for money and speak the same language as my suppliers. Throughout this self-learning exercise, my interest in organic search grew and I made the move into the field professionally, working for leading digital marketing agencies and eventually running a team, whilst continuing to grow my site.
The site started to take off, gaining increased organic search visibility and traffic. The time seemed right to commit more effort, so I left “employment” in 2012 to go it alone with my project. Not long after this move, I decided to add an additional dimension to the site to extend upon existing natural search traffic. If executed well, I anticipated that this could explode organic visits from longer-tail queries, which had always been my main focus. I’d always made good use of XML sitemaps historically, and after adding the additional layer my team went to work resubmitting sitemaps to alert Googlebot to our new sections.
Site Indexation Levels Rising
As expected, site indexation levels began to rise in Google SERPs. At first the rise in indexation was steady, but picked up pace until it reached around the 1,500,000 level. We were excited by the significant increase in indexed pages in Google. Surely the more pages in the index, the better? The more ‘mud at the wall’ the greater chance of receiving traffic based on increased impressions and click-throughs.
Declining Crawl Stats
We expected Googlebot’s daily visits to our site to increase too. But in contrast we started to see a decline in our crawl stats in Webmaster Tools and strange URL patterns emerging in our landing page analytics. The higher the indexation levels rose, the greater the decline in pages crawled per day. At its lowest point, under 0.1% of the site’s indexed pages were being visited daily by Googlebot, according to Webmaster Tools (NB: GWT is not entirely accurate but gave us enough indication that there was an issue). Something had clearly gone horribly wrong.
A WARNING
It did not take us long to create or find this issue, but it’s taken us a long time to begin a recovery. This article serves as a word of warning to others and will provide a checklist of important items to monitor regularly as part of an ongoing crawl optimisation process.
What was the problem?
It turned out that when we’d added our additional dimension to the site and linked everything up, our dev team had not checked the variables which were pulled in to create the dynamic content, and we were inadvertently spinning nonsensical new content URLs out into the SERPs. It wasn’t going to stop either. We’d created an ‘infinite loop’ of sorts which would just continue to run and run.
For those less familiar with an infinite loop, here’s Wikipedia’s definition: “An infinite loop is a sequence of instructions in a computer program which loops endlessly, either due to the loop having no terminating condition, having one that can never be met, or one that causes the loop to start over…”
Before you think that these types of issues don’t affect any of your sites and are isolated to parameter-based dynamic sites only, consider this: an infinite loop can happen on even the smallest website. For example, a WordPress internal link written without the http:// prefix is treated as a relative path, producing URLs such as http://www.mysite.com/www.mysite.com/www.mysite.com. On a large-scale site churning out dynamic URLs, an infinite loop can be, and usually is, disastrous if left unchecked for long.
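The WordPress example can be reproduced with a minimal Python sketch. It assumes the page contains a link whose href is missing its scheme (here "www.mysite.com/", a hypothetical value) and that the server answers 200 OK for every resulting path, so a crawler keeps following the same link from each new page:

```python
from urllib.parse import urljoin

def follow_relative_link(start_url, href, hops):
    """Resolve the same href repeatedly, as a crawler would on each new page."""
    urls = []
    current = start_url
    for _ in range(hops):
        # Without a scheme, the href is resolved as a path relative to the current page
        current = urljoin(current, href)
        urls.append(current)
    return urls

# Hypothetical malformed link: the "http://" prefix is missing
crawled = follow_relative_link("http://www.mysite.com/", "www.mysite.com/", 3)
for url in crawled:
    print(url)
```

Each hop yields a new, longer URL, which is why the crawl never terminates unless the server refuses to render the nested paths.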
In architecture terms this is called a ‘circular dependency’, caused by creating and linking content without considering the outcome when the links are picked up by search crawlers. The problem was that Google saw this as unique content and was indexing it, but in reality no-one was ever going to search for the type of pages that we’d created, as they made no sense. Without giving too much away, they were akin to saying “find red shoes in Manchester for green shoes”. Googlebot didn’t realise that this didn’t make sense, but who looks for red shoes in Manchester for green shoes?
Worse than that, because of the cross-module internal linking we’d created, these pages started to rank in the long tail, but for the wrong queries, often out-ranking our target landing pages. We were receiving multiple listings (I’m talking nearly the complete page 1 of SERPs on many long tail, low search volume queries), but for the wrong pages. The more the churn continued, the less relevant the pages became.
Why had this problem occurred? Surely the pages should have thrown a 404 page not found and we’d have identified this sooner? Nope. Because the additional dimension had been created and therefore the parameters existed, the logic was telling the page to render with a 200 OK server response code, and we’d missed some validations. Natural search traffic began to decline drastically, with Google getting ‘lost’ in our random URLs, relevance dropping with each crawl, and Google unable to fathom which pages to rank for what.
Vulnerable to Google Panda
The thin, irrelevant content which we had inadvertently created made us vulnerable to the rolling Panda algorithm. NOTE TO SELF: Don’t allow anyone to submit XML sitemaps at any type of scale until you’ve thoroughly checked what they contain. Doing so is rather like driving a Porsche on a learner driver’s licence.
Google Penguin Came Along
To compound matters, in May 2013 Penguin came along and all those cheap and cheerful directories, which we’d submitted to historically, came back to bite us. Removing and disavowing meant that our main category pages, from an external link graph perspective, weren’t worth jack any longer, and with over a million and a half thin and nonsensical pages in the index we were in a dire situation. One could say, “Why not get the offending URLs removed from the index?” It was more complicated than that: the URLs didn’t match a particular pattern and, to be frank, we didn’t want to take the risk. Likewise, “Why not canonicalise?” Here’s why: the content we’d produced wasn’t even similar to what we’d created before, and again it would prove problematic technically given the setup of the URL structure. To ascertain SEO value and link equity we had to use URL mapping, one to one.
Over Indexation In Google – SEO Death
By now, we’d rectified the offending code and removed all of our XML sitemaps, but still had the issue of over a million and a half pages in the index. So, we went to Google and started to search for inspiration. Initially, what we found appeared to be disastrous. It was an article titled “TOO MANY URLS = SEO DEATH”. The author of the piece had experienced the same as us. He referred to it as “Royally tanking a test site by allowing Google to crawl too many URLs of thin content without adding any more inbound links or non-thin content”.
Crawl Budget And Crawl Rank
Further research led us to the subject of ‘crawl budget’ and ‘crawl rank’ in a piece from July 2013 on The Blind Five Year Old Blog.
Here, A.J. Kohn looks at the crawl budget and a potential ‘crawl rank’ factor. Kohn refers to Eric Enge’s interview with Matt Cutts when looking for the best description of crawl budget:
“The best way to think about it is that the number of pages that we crawl is roughly proportional to your PageRank. So if you have a lot of incoming links on your root page, we’ll definitely crawl that. Then your root page may link to other pages, and those will get PageRank and we’ll crawl those as well. As you get deeper and deeper in your site, however, PageRank tends to decline. Another way to think about it is that the low PageRank pages on your site are competing against a much larger pool of pages with the same or higher PageRank. There are a large number of pages on the web that have very little or close to zero PageRank. The pages that get linked to a lot tend to get discovered and crawled quite quickly. The lower PageRank pages are likely to be crawled not quite as often.”
That was before Caffeine, you may say, but A.J. Kohn argues that it’s still relevant, and Cutts’ Webmaster video from October 2013 seems to support it. Kohn’s findings over the past couple of years are that pages crawled less often won’t rank well and that ‘time since last crawl’ had a bearing on rankings for pages with low or no PageRank. This is less relevant for pages with PageRank, which will automatically be crawled often (confirmed in Matt’s October 2013 video, and his interview with Eric Enge back in 2010).
Kohn refers to ranking factors which affect pages with no or low PageRank as ‘crawl rank’, i.e. those crawled most often ranked higher than competitor pages crawled less. It turned out he was not alone in his theorising. Kohn found that others were also tracking rankings on pages with no or low PageRank based on last crawled date.
According to Kohn, the big takeaway from the ‘crawl rank’ theory is: “You win if you can get your low pagerank (or no pagerank) pages crawled more often than your competition”.
Did this mean that you lose ‘page crawl rank’ if your low to no PageRank pages are almost never crawled? It seems plausible based upon his theory.
Did We Lose Crawl Rank?
Immediately prior to the ‘infinite loop’ issue our site was ranking on the long tail for thousands of terms, outranking market leading competitors. We’d never attempted to compete for head terms. Had positive ‘crawl rank’ led to this in the first place? By this logic, inadvertently indexing thousands of nonsensical, poorly coded URLs and the subsequent decline to an almost non-existent daily crawl meant that very few of our low to no PageRank pages were being visited regularly by Googlebot. With a crawl rate down to less than 0.1% of the URLs in the index, how could they be? A.J. Kohn’s quote “It’s not that Google will penalise you, it’s the opportunity cost for dirty architecture based on a finite crawl budget” seemed to hit the nail right on the head for our site. I’d say losing crawl rank was a significant factor, and to recover we needed to re-optimise our internal linking structures, in addition to optimising Googlebot’s crawl experience overall and getting the most value for our ‘budget’. That’s what we started to do.
How Could We Fix It?
- Remove existing XML sitemaps
- Fix the offending ‘looping’ code
- Implement a hard 404 on parameters / variables that did not exist / did not validate
- Download and check server logs for Googlebot activity
- Hundreds of thousands of broken links resulted with nowhere to go. Naturally, adding this number of individual 301 redirects via .htaccess was not an option, and using wildcards and pattern matching didn’t fit with the kind of errors generated and had the potential to create even more disastrous results. We had to build a database to handle the obsoletes, then map everything across
- Download and mark as fixed broken links from GWT as they appeared
- Extensive internal crawling for internal links to offending URLs and remove / relink
- 301 redirect offending URLs after developing database to handle for custom site
- Recreate, check and submit correct XML sitemaps to ‘flush’ out offending URLs from index
- Temporarily lift everything up a level – remove thin categories with a view to rebuild
- Remove .htaccess created subfolders to retain authority and reduce ‘depth’
- Remove thin site sections – block other areas with robots.txt
- Re-implement cross module internal linking to aid recovery of lower level pages via ‘crawl rank’
- Implement upper level cross module internal linking to boost Penguin’d higher level pages following link removal and rebuild authority from within
- Ensure most important pages were high up in internal link list in GWT.
- Check for legacy (‘back in the day’) parameters and old code on .htaccess and 301 redirect to correct parameters (while also considering issues with an overloaded .htaccess file on Apache)
- Build out thin ‘Panda-vulnerable’ pages
- Amend / edit any over optimised pages
- Look to build relationships between site sections and specific pages (ontology) – again, avoiding overkill
- Revisit our crawl optimisation checklist often
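The database-backed, one-to-one redirect mapping mentioned in the list above could look something like the following minimal sketch. It is not our actual implementation: the table name, column names and example paths are all hypothetical, and an in-memory SQLite database stands in for whatever store your site uses.

```python
import sqlite3

def build_redirect_db(mappings):
    """Create an in-memory table mapping each obsolete path to exactly one target."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE redirects (old_path TEXT PRIMARY KEY, new_path TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO redirects VALUES (?, ?)", mappings)
    conn.commit()
    return conn

def resolve(conn, path):
    """Return (status, location): a 301 if the path is mapped, otherwise a hard 404."""
    row = conn.execute(
        "SELECT new_path FROM redirects WHERE old_path = ?", (path,)
    ).fetchone()
    if row:
        return 301, row[0]
    return 404, None

# Hypothetical mapping of a nonsensical URL to its nearest sensible page
conn = build_redirect_db([
    ("/red-shoes/manchester/green-shoes", "/red-shoes/manchester"),
])
print(resolve(conn, "/red-shoes/manchester/green-shoes"))
print(resolve(conn, "/never-existed"))
```

A lookup table like this avoids both an overloaded .htaccess file and the risk of a wildcard rule matching more than intended.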
CRAWL OPTIMISATION – A CHECKLIST
Before you start rushing out looking for external links and embarking on a content marketing exercise (potentially diluting your site with irrelevant ‘non-theme’ topics), ensure that you’ve covered these crawl optimisation essentials and revisit the list often.
FIND OUT WHERE GOOGLEBOT GOES AND KEEP WATCHING
At its most basic level, check your Webmaster Tools pages crawled per day screen daily. Ideally, get access to server logs, download them and begin to track and monitor how often Googlebot comes to your site, the response codes it receives and which URLs are being downloaded. (Tools such as ‘Splunk’ will help with this and Kohn refers to applications such as SEO Clarity for ‘out of the box’ solutions to this). You’ll likely find areas through this exercise where you can ‘trim off’ wasted crawl budget so that Googlebot gets fed the URLs that you really want them to visit. If you can turn this into a process which becomes automated through working with your site’s developers even better.
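If you have no log analysis tool to hand, a rough first pass can be scripted. The sketch below assumes your server writes the common Apache/Nginx ‘combined’ log format and identifies Googlebot by user agent string only (which can be spoofed, so a reverse DNS check would be needed to be certain). The sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Assumes the "combined" access log format; adjust the pattern for other formats
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines):
    """Count Googlebot requests per day and per response code."""
    per_day = Counter()
    per_status = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        per_day[match.group("day")] += 1
        per_status[match.group("status")] += 1
    return per_day, per_status

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    '66.249.66.1 - - [10/Oct/2013:13:55:36 +0000] "GET /red-shoes/manchester HTTP/1.1" 200 2326 "-" "%s"' % GOOGLEBOT_UA,
    '66.249.66.1 - - [10/Oct/2013:13:56:01 +0000] "GET /red-shoes/manchester/green-shoes HTTP/1.1" 404 512 "-" "%s"' % GOOGLEBOT_UA,
    '203.0.113.9 - - [10/Oct/2013:14:00:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
per_day, per_status = googlebot_hits(sample)
print(dict(per_day), dict(per_status))
```

Run against real logs (for example, by passing an open file object as `lines`), the per-day counts give you an independent check on the Webmaster Tools crawl stats.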
ENSURE URLS RETURN THE CORRECT SERVER RESPONSE AND KEEP CHECKING
Check server logs and crawl your site regularly to check for changes which may have inadvertently occurred during the latest developer sprint. Don’t leave these things to chance.
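One way to make this check repeatable is to keep a list of URLs with the response code each should return, and compare it with what your crawler reports after every release. A minimal sketch, where both dictionaries are hypothetical inputs (the observed codes would come from your own crawl tool):

```python
def unexpected_responses(expected, observed):
    """Return {url: (expected_status, observed_status)} for every mismatch."""
    problems = {}
    for url, want in expected.items():
        got = observed.get(url)  # None means the URL was not seen in this crawl
        if got != want:
            problems[url] = (want, got)
    return problems

# Hypothetical expectations and crawl results
expected = {
    "/red-shoes/manchester": 200,
    "/red-shoes/manchester/green-shoes": 404,
    "/old-category": 301,
}
observed = {
    "/red-shoes/manchester": 200,
    "/red-shoes/manchester/green-shoes": 200,  # a nonsensical page rendering again
}
print(unexpected_responses(expected, observed))
```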
ENSURE THAT DYNAMIC VARIABLES VALIDATE & WATCH OUT FOR INFINITE LOOPS
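If your site will render a page for any combination of parameters, it’s likely you’ll have trouble sooner or later. Ensure that proper validation is incorporated so that a hard 404 is returned whenever a parameter or variable does not exist or does not validate.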
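As an illustration only, the validation might be sketched like this. The URL pattern (/product/location), the allowed values and the function name are all assumptions made for the example, not a description of any particular site’s routing:

```python
# Hypothetical catalogue: the values that are valid for each dimension
VALID_PRODUCTS = {"red-shoes", "green-shoes"}
VALID_LOCATIONS = {"manchester", "leeds"}

def status_for(path):
    """Return 200 only for /<product>/<location>; anything else is a hard 404."""
    segments = [s for s in path.split("/") if s]
    if len(segments) != 2:
        return 404  # too few or too many dimensions, e.g. a looping URL
    product, location = segments
    if product in VALID_PRODUCTS and location in VALID_LOCATIONS:
        return 200
    return 404

print(status_for("/red-shoes/manchester"))
print(status_for("/red-shoes/manchester/green-shoes"))
```

The important design choice is that the page is rejected unless every variable validates, rather than rendered unless something obviously fails.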
DON’T BE AFRAID OF HARD 404’s – GIVE A 410 RESPONSE WHERE NECESSARY – AVOID SOFT 404’s
Hard 404s help Google to learn your architecture (John Mueller). If you’re confident that you’re never going to want that particular URL crawled again (or you’ve inadvertently created nonsensical URLs), give a 410 (‘gone, never return’) response. Avoid giving 301 redirects just for the sake of it: you’re sending Googlebot back up your architecture and potentially missing out on getting those lower level pages crawled during Googlebot’s visit. Likewise, consider using expires after headers on thin and time dependent content such as auctions or job listings so that Googlebot can spend time where you really want it to.
Soft 404s: don’t waste valuable crawl time on these. Always return a hard 404 server response code. As the Google team say in the Webmaster Guidelines, it’s like a giraffe wearing a name tag that says “dog.” Just because it says it’s a dog, doesn’t mean it’s actually a dog. Make sure dogs are called dogs, and giraffes are called giraffes on your site.
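A rough way to hunt for soft 404s in your own crawl data is to look for pages that return 200 while their content says the page is missing. The sketch below is a heuristic under stated assumptions: the phrase list is hypothetical and should be replaced with the wording your own templates use, and the second function simply illustrates the 404 versus 410 choice described above.

```python
# Hypothetical phrases; use the wording from your own error templates
NOT_FOUND_PHRASES = ("page not found", "no results found", "no longer available")

def is_soft_404(status, body):
    """True when a page says 'not found' in its content but still returns 200."""
    text = body.lower()
    return status == 200 and any(phrase in text for phrase in NOT_FOUND_PHRASES)

def status_for_missing(path, permanently_removed):
    """Pick 410 for URLs that should never come back, 404 for anything else missing."""
    return 410 if path in permanently_removed else 404

print(is_soft_404(200, "<h1>Sorry, page not found</h1>"))
print(status_for_missing("/red-shoes/manchester/green-shoes",
                         {"/red-shoes/manchester/green-shoes"}))
```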
CHECK XML SITEMAPS – THOROUGHLY
Before you submit XML sitemaps via Google Webmaster Tools, take a moment to check what they contain. Ensure that you’ve neither inadvertently picked up on an infinite loop due to poor parameter handling, nor picked up on URLs which may live in another one of your sitemaps. Those few moments casting an eye over the file before you submit it could make all the difference to where Googlebot goes on your site and the level of importance that is placed upon a URL.
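Casting an eye over a large file is easier with a little automation. The following sketch parses a sitemap with Python’s standard library, flags URLs whose path repeats a segment (one possible symptom of a looping template, and only a heuristic), and lists URLs which appear in more than one sitemap. The example URLs and file names are invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlparse

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Extract every <loc> value from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

def suspicious_urls(urls):
    """Flag URLs whose path repeats a segment (heuristic for looping templates)."""
    flagged = []
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        if len(segments) != len(set(segments)):
            flagged.append(url)
    return flagged

def duplicates_across(sitemaps):
    """Return URLs that appear in more than one sitemap (name -> list of URLs)."""
    counts = Counter(u for urls in sitemaps.values() for u in set(urls))
    return sorted(u for u, n in counts.items() if n > 1)

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/red-shoes/manchester</loc></url>
  <url><loc>http://www.example.com/shoes/shoes/manchester</loc></url>
</urlset>"""
urls = sitemap_urls(sample)
print(suspicious_urls(urls))
```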
CATEGORISE XML SITEMAPS
Having sitemaps called sitemap1.xml, sitemap2.xml etc is all well and good on a small site with maybe a few hundred URLs, but simply won’t cut the mustard on larger sites. A decent XML sitemap generator or crawling tool should allow you to produce sitemaps categorised by site section, product type, category, etc simply by allowing you to exclude strings which match a certain pattern or list. You’ll then easily be able to identify areas of concern which need work. Use tools such as Deep Crawl and XML Unlimited Sitemap Generator to achieve this with greater automation.
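If your tool of choice can’t do this, the grouping itself is simple to script. A minimal sketch, assuming the first folder in the URL path is a sensible proxy for the site section (the file naming scheme is hypothetical):

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_section(urls):
    """Group URLs into one sitemap file name per top-level site section."""
    groups = defaultdict(list)
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        section = segments[0] if segments else "root"
        groups["sitemap-%s.xml" % section].append(url)
    return dict(groups)

# Invented example URLs
groups = group_by_section([
    "http://www.example.com/",
    "http://www.example.com/shoes/red/manchester",
    "http://www.example.com/shoes/green/leeds",
    "http://www.example.com/blog/crawl-rank",
])
for name, members in sorted(groups.items()):
    print(name, len(members))
```

With one sitemap per section, the indexed versus submitted figures in Webmaster Tools point you straight at the section with a problem.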
GAIN ACCESS TO TESTING / DEV ENVIRONMENT BEFORE TEMPLATE CHANGES GO LIVE
Developers are great, but even they make mistakes. A dodgy line of code or a parameter picked up on accidentally can prove disastrous when you’re churning out thousands of parameter based, dynamic URLs from templates as standard. Make sure that you have access to a testing platform as an SEO and ask to check everything before changes are uploaded to live. Remember, a single file upload could destroy your whole natural search campaign.
ENSURE YOUR IMPORTANT PAGES HAVE THE MOST INTERNAL LINKS
Visit Google Webmaster Tools and check the ‘internal links’ section under ‘Search Traffic’. Are the pages you want to rank at the top of the list? Is that sitewide image link with ‘get a quote’ to a thin quote form page sitting above your primary hero target pages? Are your blog and its categories higher than your commercial pages? If so, it’s likely that they are what Google will be returning for queries over your ‘money’ pages, and you could have a problem. With a finite crawl budget assigned to your site you need to ensure that your main pages are very clearly at the top of this list. Look at ways in which you can change this and do it.
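You can approximate the same report from your own crawl data. The sketch below counts unique internal links pointing at each page, given (source, target) pairs exported from a crawler. The input format and example paths are assumptions, and Google’s own counting method may well differ.

```python
from collections import Counter

def inlink_counts(links):
    """Count unique internal inlinks per target from (source, target) pairs."""
    unique = {(source, target) for source, target in links if source != target}
    return Counter(target for _, target in unique)

# Invented crawl export: a sitewide 'get a quote' link beats the money page
links = [
    ("/", "/quote"), ("/shoes", "/quote"), ("/blog", "/quote"),
    ("/", "/shoes"), ("/blog", "/shoes"),
    ("/shoes", "/shoes"),  # self-links are ignored
]
counts = inlink_counts(links)
print(counts.most_common())
```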
UNDERSTAND AND MANAGE PARAMETERS AND URL REWRITES
Find out which parameters are used to pull in dynamic content, which URLs have been rewritten, which variables are adding content to your site, and whether any of this has changed over time. There may still be old parameters floating around which amount to the same content as other URLs you have indexed. Visit the ‘parameter handling’ section of Google Webmaster Tools to see how Googlebot is handling these and intervene if necessary.
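To find URLs which amount to the same page, one approach is to normalise each URL by dropping parameters that don’t change the content and sorting the rest. In this sketch the list of ignorable parameters is entirely hypothetical; only you know which of your parameters alter what is rendered.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical parameters that do not change the content of a page
IGNORED_PARAMS = {"sessionid", "sort", "utm_source"}

def normalise(url):
    """Drop ignorable parameters and sort the rest so equivalent URLs compare equal."""
    parts = urlparse(url)
    params = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunparse(parts._replace(query=urlencode(params), fragment=""))

a = normalise("http://www.example.com/shoes?colour=red&sort=price&sessionid=abc")
b = normalise("http://www.example.com/shoes?sessionid=1&colour=red")
print(a)
print(a == b)
```

Grouping your indexed or logged URLs by their normalised form shows how many variants of each page Googlebot is spending its time on.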
USE ROBOTS.TXT WELL
Block pages which are unlikely to rank, or which you don’t want to rank, via robots.txt. After checking server logs, get your head around regular expressions and use these to your advantage (bear in mind that robots.txt itself supports only the * and $ wildcards for Googlebot, not full regular expressions). There are a number of simple cheat sheets to get you started, such as this one: http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf. Avoid blocking images (although it’s worth considering loading these and other media from a different subdomain for site speed), as you could miss out on valuable visibility in the images index.
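Before uploading a new robots.txt it is worth testing which URLs it actually blocks. The sketch below uses Python’s standard urllib.robotparser with a made-up set of rules and URLs. Note that this parser does simple prefix matching and, as far as I’m aware, does not implement Google’s wildcard extensions, so treat it as a sanity check for plain rules only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block internal search results and a thin site section
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /thin-section/
"""

def blocked_urls(robots_txt, urls, agent="Googlebot"):
    """Return the subset of urls that the given rules disallow for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]

urls = [
    "http://www.example.com/red-shoes/manchester",
    "http://www.example.com/search/red-shoes",
    "http://www.example.com/thin-section/page-1",
]
print(blocked_urls(ROBOTS_TXT, urls))
```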
AVOID PHONEY .HTACCESS FOLDERS
Don’t try to be smart with .htaccess. Remember that those subfolders created at .htaccess level could come back to bite you in the longer term. People may link to them, you’re effectively adding another hop for Googlebot, splitting equity between each folder, putting unjustified load on your server and potentially diluting authority across URLs. That folder called /in/ you created to rank for location based terms is just a waste of time in the long term (and a stop word), and you may find you’ll need to 301 redirect a load of URLs in the future. Even worse if it’s a real folder. What happens if you ever decide to get rid of it? There’ll no doubt be a lot of files to shuffle around, and a technical nightmare of internal link re-optimisation and restructuring too.
FLATTEN ARCHITECTURE WHERE POSSIBLE – THINK FLAT AND FAT (but not too flat and fat)
AVOID DEEP ARCHITECTURES
Avoid deep architectures – if legacy development makes flattening the architecture more of a long term plan, utilise XML sitemaps and html sitemaps to shorten paths to deep pages if necessary (making use of categorised sitemaps for relevance and site section trouble shooting with ease). Canonicalise or flatten paginated results, near or exact duplicate output.
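To see how deep your pages really are, you can compute the minimum number of clicks from the home page to every other page using a breadth-first search over your internal link graph. A minimal sketch, assuming you have exported the graph from a crawler as a dictionary of page to linked pages (the example graph is invented):

```python
from collections import deque

def click_depths(links, start):
    """Return the minimum number of clicks from start to each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in a BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Invented link graph; the last page is three clicks from the home page
links = {
    "/": ["/shoes", "/blog"],
    "/shoes": ["/shoes/red"],
    "/shoes/red": ["/shoes/red/manchester"],
}
depths = click_depths(links, "/")
print(depths)
```

Pages with a high depth, or pages missing from the result altogether, are the candidates for shorter paths via categorised sitemaps or cross module links.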
AVOID A JUMBLE SALE – TOO FLAT
Don’t over flatten though by adding a mega menu with every size, shape and colour of your ecommerce product you can think of and potentially dilute every page on your site with hundreds and hundreds of links out of each page. Whilst your higher authority pages may be able to live with this, you’re jeopardising those valuable secondary term pages lower down in your site’s architecture. Whilst there may be the urge to show all your wares on the front page of your site – avoid the ‘jumble sale’ effect, where there is no order or logic (nor relationship) between the parts of your site by flattening to an extremity.
VISIT WEBMASTER TOOLS DAILY (AT LEAST)
Look at everything. Apart from server logs, it’s pretty much our bird’s-eye view of what Googlebot is seeing and how your site is being perceived. Check the ‘content keywords’ interface, for example, to see what topics Googlebot is tying in with your site. If it’s not what you expect to rank for then there’s a problem. You may even find that spammers have been posting comments in your blog and effectively changing the perceived topic of your site (drastic example, but it happens). Google Webmaster Tools is hugely under-utilised by SEOs and holds many of the secrets to the basic SEO of old. Get to know every aspect and feature that is available via this valuable and often overlooked resource.
OPTIMISE YOUR INTERNAL STRUCTURE – UTILISE CROSS MODULE LINKING IF NECESSARY
Work on internal link optimisation to ensure that low to no PageRank pages get crawled more often, either via XML sitemaps or via related cross module linking if necessary. Refer to the Internal links screen in Google Webmaster Tools to monitor this. Make sure that lower level pages have easy (and many) crawl access points, but be careful to ensure that this doesn’t mean they outrank your categories. Continually refer back to GWT internal links to check your progress. Pass authority horizontally via relationships as well as down the ‘tree’ to children of the page above. Look for highly relevant relationships that you can tie between pages and queries. Think “Where is the Ontology?” (relationship link).
Definition: In computer science and information science, an ontology formally represents knowledge as a set of concepts within a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts. (There are clues in ‘content keywords’ under variants in Google Webmaster Tools to help with this.)
Not everything can be fixed immediately, and the technical enhancements you made to your site ‘back in the day’ which you thought were quick wins could prove to be difficult to turn around. Googlebot won’t be rushed, but we can influence it to go where we want it to on our site via crawl optimisation techniques and by understanding its experience. As Mike King said, “Googlebot is always your last persona.” So ensure you understand what you’re dishing out to it. Sometimes you simply have to go backwards in order to go forwards when it comes to organic search. Before you start content marketing like crazy in the pursuit of likes, shares and tweets, and hunting for new links to replace those you had to remove or disavow following Penguin, take a look at how you can build relationships between your own internal site sections first (fall back on your own site’s trusted internal links). Make best use of the valuable and oft-missed link opportunities within your own site, and present them in a way that Googlebot enjoys.
Brighton SEO Deck
The deck shared on Slideshare.