How We Can Help You With Crawl Optimisation

Smart Crawl Optimisation Analysis and Strategy

Search engines naturally seek to crawl the web efficiently, and are thought to visit web pages which change frequently, or which matter more to search engine users, more often than less important pages. Search engine crawlers also visit servers (hosting websites) based upon how many simultaneous connections (hits from bots) the server can tolerate. This is to ensure that neither the server nor the websites it hosts are harmed.

On larger sites we often find that search engines are either not finding pages or not visiting them frequently. This may be because there are blockages, because there are issues with content or structural quality on the site, or because the site is slow.

Our team have extensive experience in analysing and interpreting crawl frequency, and then providing recommendations to open up blockages, to reduce duplication of content (so that signals are combined and strength is consolidated for search engine crawlers), and to optimise speed. It doesn’t necessarily follow that you will rank higher in search engines as a result of crawl optimisation, but combining otherwise competing signals so the strength is focused on non-duplicate URLs (web pages) cannot be negative. Nor can unblocking content-rich areas which search engines cannot access because of web page directives added during website development. Nor can speeding up a website, or sections of it, so that search engine crawlers can get round more URLs (web pages) from their scheduled ‘crawling list’ when they visit.

These are the areas we focus on with crawl rate optimisation analysis and strategy development as part of our core technical SEO services.  We use a range of tools to analyse and identify where and when search engine crawlers are visiting and redirect them to more appropriate URLs.  We help search engines understand which parts of the site we consider important and where they should focus their crawling activity.

What Type of Googlebots Crawl My Website?



There are many Googlebots which may come to visit your website.

The Googlebot family is as follows:

  • APIs-Google
  • Adsense
  • AdsBot Mobile Web Android
  • AdsBot Mobile Web
  • AdsBot
  • Googlebot Images – Crawls to gather URLs for Google Images section of Universal search
  • Googlebot News – Crawls to gather URLs for Google News
  • Googlebot Video – The googlebot which crawls video documents / video files.  Crawls to gather URLs (web pages / documents) for Google’s video search
  • Googlebot (Desktop) – Organic search googlebot – crawls web pages as a desktop user
  • Googlebot (Smartphone) – Organic search mobile googlebot – crawls web pages emulating a mobile user with a mobile user-agent
  • Mobile Adsense – The mobile crawler for the Adsense platform rather than the organic SEO mobile crawler
  • Mobile Apps Android

Googlebots and Google Search Console Crawl Stats

If you frequent Google Search Console you might be surprised to discover that the ‘crawl stats’ section does not only show the Googlebots used for organic search (SEO). Every member of the Googlebot family may have left a footprint in those stats. Of course, your site may not receive frequent visits from all of the Googlebots. For example, if you are not running any kind of paid search (PPC) then you would be unlikely to receive frequent visits from AdsBot. Likewise, if you are not running AdSense campaigns you would be far less likely to receive visits from the AdSense bot.
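To see which members of the Googlebot family are actually visiting, one option is to count user-agent strings from your own server logs. The sketch below is illustrative only: the token list is our assumption of commonly seen Google user-agent tokens and should be checked against Google's published crawler list, the function names are hypothetical, and user-agent strings can be spoofed, so a match is not proof that a hit came from Google.

```python
from collections import Counter

# Assumed user-agent tokens, most specific first, so that
# "Googlebot-Image" is not counted as plain "Googlebot".
CRAWLER_TOKENS = [
    ("AdsBot-Google-Mobile", "AdsBot Mobile Web"),
    ("AdsBot-Google", "AdsBot"),
    ("Mediapartners-Google", "AdSense"),
    ("APIs-Google", "APIs-Google"),
    ("Googlebot-Image", "Googlebot Images"),
    ("Googlebot-News", "Googlebot News"),
    ("Googlebot-Video", "Googlebot Video"),
    ("Googlebot", "Googlebot (organic search)"),
]


def classify_user_agent(user_agent):
    """Return a label for the first matching token, or 'Other'."""
    for token, label in CRAWLER_TOKENS:
        if token in user_agent:
            return label
    return "Other"


def count_crawler_hits(user_agents):
    """Count hits per crawler label for an iterable of user-agent strings."""
    return Counter(classify_user_agent(ua) for ua in user_agents)
```

Feeding this the user-agent column of an access log gives a rough per-bot breakdown to compare with the crawl stats report.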


What is Indexing and Crawling?


Indexing is the process undertaken by search engines to ‘file’ every document they come across in a systematic order in storage for quick retrieval later.  Liken this to a card index system.  Search engines use what is called an ‘inverted index’ to file documents (web pages, PDFs, URLs) with each word (known as tokens) discovered marked as being on particular documents.  The index is called upon when a search engine user enters a query (either by written text or voice), and the purpose of the search engine then is to return the most relevant response as quickly as possible from the storage system.  What is considered the most relevant document is beyond the scope of crawl optimisation, but it goes without saying that unless a document is in the index it cannot be retrieved as a relevant result.


Crawling is a process used by search engines as part of information retrieval systems. Crawlers (also known as spiders or bots) traverse the web via links, or using other discovery systems, both to discover new web pages and to revisit updated web pages in order to keep the search engine index up to date. The URLs which have been discovered but not yet visited are known as the ‘crawl frontier’. Existing web pages being revisited for substantive change will likely be marked as ‘already seen’. Google patents describe an element within the search engine crawling architecture called the ‘Already Seen Test’.

Search engine crawlers periodically revisit existing web pages based upon likelihood of change frequency according to a schedule (rather like an air traffic control system for bots), to ensure efficiency.  Crawling the web is expensive for search engines and therefore systems are built for scale and to manage resources as efficiently as possible.
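The discovery process described above can be sketched as a queue of URLs waiting to be visited (a frontier) plus an ‘already seen’ check. This is a minimal illustration over an in-memory link graph, not a description of how any search engine implements it; `crawl` and `get_links` are hypothetical names, and a real crawler would fetch pages over the network, respect robots.txt and schedule revisits.

```python
from collections import deque


def crawl(seed_urls, get_links):
    """Breadth-first traversal using a frontier queue and an already-seen set.

    get_links is any callable that returns the outgoing links of a URL.
    """
    frontier = deque(seed_urls)     # discovered, waiting to be visited
    already_seen = set(seed_urls)   # every URL ever added to the frontier
    visit_order = []
    while frontier:
        url = frontier.popleft()
        visit_order.append(url)
        for link in get_links(url):
            if link not in already_seen:  # the 'already seen' test
                already_seen.add(link)
                frontier.append(link)
    return visit_order
```

For example, with `LINKS = {"/": ["/a", "/b"], "/a": ["/c"]}`, calling `crawl(["/"], lambda url: LINKS.get(url, []))` visits each page once, even when several pages link to the same URL.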



What is Indexing in a Search Engine?

Indexing in a search engine is the process of organising discovered web pages and storing them in a meaningful way so that when a search engine user enters a query the system (search engine storage) can quickly find what is needed and return the most relevant pages.  Web pages are collected via crawling the web and then the contents of the web pages are broken down into tokens (tokenization / tokenisation) which are very much like individual words.  The words discovered already (and new ones) are then marked as being found in particular documents.  For example, cat might be found in thousands of documents.  Cat would be given an identifier and the documents containing cat would also be given identifiers with each document containing ‘cat’ stored in the index information system.  Again, consider the index to be similar to a card index system used by a librarian to quickly find a requested book, but in this case, at a word level.
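The ‘cat’ example can be made concrete with a toy inverted index. This is a minimal sketch under simple assumptions: documents are plain strings keyed by an ID, tokens are lower-cased runs of letters and digits, and the function names are ours. Real search engine indexes are far more sophisticated.

```python
import re
from collections import defaultdict


def tokenise(text):
    """Split text into lower-case word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each token to the set of document IDs which contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenise(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every token in the query."""
    tokens = tokenise(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results
```

Looking up a word is then a single dictionary access rather than a scan of every document, which is what makes retrieval fast.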

What is crawl budget?

‘Crawl budget’ is a term used frequently in SEO, but in reality this is not a term used internally at Google by their search engineers.  However, it is referred to by Google and Bing when talking to SEOs.   Crawl budget as a concept is really a combination of two different notions from a search engine perspective.  These two elements or conceptual notions are host load and scheduling.  Scheduling is also based on another notion which is that of ‘crawl demand’.

Host load

The first notion is called ‘host load’. Host load is based around ‘crawling politeness’. One of the major ‘rules of the web’ (larger than simply Google or even all of the search engines, as the web as a community is much bigger than this) is never to hurt the servers which sites sit on whilst crawling between documents (nodes), which we will mainly think of as web pages. This rule of ‘good behaviour’ is called ‘crawling politeness’. It’s akin to being a good bot citizen of the web, with the bot controllers also being good citizens of the web. ‘Crawling politeness’ refers to the number of simultaneous downloads of all types of documents (HTML pages, CSS files, JavaScript files, even images and PDFs); in fact, any kind of file at all which Google’s crawlers (bots) can download over several connections at the same time. Tying these together, host load essentially means “how much can we crawl with our bots politely (without hurting the server)?”, i.e. how much can the host (typically an IP address rather than a domain) handle at one time?

Scheduling (Crawl demand)

The second part of crawl budget is ‘scheduling’ or ‘crawl demand’. The web is huge and growing all the time. According to Internet Live Stats there are over 1.5 billion websites online right now (probably many more, since you’re reading this after we added this text). Crawling all of the web pages in over 1.5 billion websites is an extensive task. If you also consider the number of individual web pages, and even user generated content such as tweets (each effectively a web page too) which are indexed and crawlable, the task becomes enormous. Search engines therefore need to build systems which are scalable and efficient.

A crawling schedule is therefore developed over time, based on how important web pages are and how often they change. If your web page rarely changes it’s likely to be crawled less frequently than web pages which frequently change in substantive ways (things that make a real difference to search engine users, such as the price on an e-commerce page). If your web page is not getting crawled often it doesn’t automatically follow that your page is lower quality; however, there is often a correlation between many low quality pages on a site and low crawling. These pages tend to be low in importance and readily available (i.e. there are a lot of them and they are mostly the same, meeting the same query clusters as each other).

Over time, as your web pages are monitored for change and importance, the search engines build a picture of how often these pages should be revisited. When these two factors (host load and crawl demand) are combined we end up with the concept we, as SEOs, have come to know as ‘crawl budget’.

What is Host Load in Web Crawling?

Host load in web crawling is based on “how much can this server / IP address handle?”  It is important that web crawlers behave like good citizens of the web and do not damage the servers they visit.  If a web server is struggling to handle fast crawling or several connections from web crawlers at once then the web crawling may be pulled back to adjust to the capacity of the server.  If the web server suffers such problems as server errors or returns server response codes such as 500 or 503 which indicate an issue with loading, the web crawler will again pull back from crawling to avoid causing any further problems.  These response codes are noted too against the document IDs which were being downloaded by the web crawler.  Therefore, it makes sense to ensure you have enough capacity on the server / IP address (host) to handle simultaneous connections and also to ensure your site does not have coding or database loading issues when search engines (or humans) come to call.
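The ‘pull back’ behaviour can be pictured as a crawler lengthening the pause between requests when the server shows signs of strain, and shortening it again once responses are healthy. The sketch below is purely illustrative: the doubling and 10% recovery factors, the delay limits and the function name are all our assumptions, and search engines do not publish the exact rules they use.

```python
# Response codes treated as signs of server strain (taken from the text above).
SERVER_TROUBLE = {500, 503}


def next_crawl_delay(current_delay, status_code, min_delay=0.1, max_delay=60.0):
    """Return the delay in seconds to wait before the next request to a host."""
    if status_code in SERVER_TROUBLE:
        # Server is struggling: back off sharply, up to a ceiling.
        return min(current_delay * 2, max_delay)
    # Healthy response: recover gradually, down to a floor.
    return max(current_delay * 0.9, min_delay)
```

The asymmetry is deliberate in this sketch: backing off quickly and recovering slowly keeps the crawler on the polite side when a server is only intermittently healthy.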

What is Crawl Scheduling?

Crawl Scheduling in web crawling is the ordering of documents (documents can be web page URLs, images, cascading style sheets CSS, javascript files JS, PDF or other documents), and allocation of visiting priority and frequency by search engine web crawlers.  Crawl scheduling is mainly based on the importance and change frequency of the documents.
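As a rough illustration of ordering by importance and change frequency, the sketch below scores each URL and pops the highest score first from a priority queue. Multiplying the two values is an arbitrary assumption made for the example, as are the 0 to 1 ranges and the function names; it is not how any search engine is known to calculate priority.

```python
import heapq


def crawl_priority(importance, change_frequency):
    """Higher score means crawl sooner. Both inputs are assumed to be 0..1."""
    return importance * change_frequency


def build_schedule(pages):
    """Order URLs by priority.

    pages is an iterable of (url, importance, change_frequency) tuples.
    """
    heap = []
    for url, importance, change_frequency in pages:
        # heapq is a min-heap, so negate the score to pop the highest first.
        heapq.heappush(heap, (-crawl_priority(importance, change_frequency), url))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

With this scoring, an important page which changes often (a home page, say) is scheduled ahead of a static, low-importance page such as terms and conditions.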

What is Crawl Demand?

What is the Crawl Rate?

The crawl rate in web crawling is how quickly a web crawler crawls through your website.  It is based on how many pages are accessed per second by Google’s search engine crawler when traversing the site.  The fastest speed at which traversal occurs is 10 times per second.  However, it is important to note that all websites and their host load capacity are different so there is not necessarily a ‘standard’ crawl rate for a particular type of site.  Each site and its components as well as its hosting is unique.

How Fast Can Google Crawl My Website?

How Can Crawl Optimisation Help My Site?

Can Googlebot Crawl Javascript?

Is Crawling My Website More Difficult With a Javascript Framework?

Javascript can certainly make crawling your webpages more complicated for search engines, and even more so when it comes to reading the content on the pages, which could negatively impact your SEO.

Crawl optimisation via speed improvements

If you have a slow website, Googlebot or other search engine crawlers arriving with their scheduled ‘shopping list’ of website pages to visit may not get round all the URLs on the list.  They may also not have time to spend on pages which are also important and could be discovered on these visits.  This could impact your site’s indexing.  By understanding where search engines are visiting and how frequently, identifying issues with slow loading pages, and how to improve these, we can help you get maximum effectiveness from crawling.

To Index or Not To Index

There may be some areas of your site which are not very useful for visitors to land on directly from search engine results pages. That is not to say that they do not add some type of value to visitors once they are on the site (for example, for some aspects of navigation). There may also be some pages on your site which you don’t feel are good enough quality to be a direct landing page for visitors arriving from search engines. Search engines may also agree with this. These types of pages might not be good enough to be indexed, and search engines may choose not to include them in their results pages. However, low quality pages, when crawled, could be dragging your site down in search engine results, so for now it might be best to noindex these pages. We’ll help you identify these pages and advise whether to try to improve them or noindex them so that search engines put them to one side.

We study crawling a lot and share our knowledge with the industry

We’ve been a little bit obsessed with web search engine crawling for quite a few years.  We’ve shared our knowledge within the industry and spoken at conferences both in the UK and internationally (as far and wide as the United States and Australia).  We’ve contributed to industry publications around the subject of crawl frequency, URL scheduling and how these can impact website migrations on larger sites.  So, we know a thing or two about this topic.  We’re well placed to help you get the best when search engine bots come to call.

Duplicate, near duplicate and similar content

There’s a lot of confusion in the SEO world when it comes to duplicate, near duplicate and similar content. There are even some SEOs who advise clients to remove all aspects of duplication on their website. This is often a mistake, as duplication is often tied more to the search engine query than to the content on the page. Does the content on more than one page meet the informational need behind the query with the same intent and in the same context? If so, regardless of what is on the page, this could be considered a case where duplication occurs. Exact duplicate pages are filtered out by search engines before they even get indexed anyway, because search engines are thought to have a ‘dup-checker’ in their search engine ‘boiler-room’ (anatomy). This merely uses a ‘content checksum’ (a kind of binary calculation) which, when returned, says whether two (or more) pages are exactly the same. If these two (or more) pages are exactly the same, all but one of them will be dropped and one will be chosen to be shown in search engine results. However, the web page chosen by the search engine may not always be the one which we, as brands and website owners, would prefer the search engine to show. Crawl optimisation can also help with this, as we can signal to search engines which of the URLs should be chosen as the ‘canonical’ (the chosen one).
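The ‘content checksum’ idea can be illustrated by hashing each page body and grouping URLs whose hashes match. This is only a sketch: SHA-256 is our choice for the example rather than anything search engines are known to use, the function names are hypothetical, and a checksum like this only catches exact duplicates, not near duplicates or similar content.

```python
import hashlib


def content_checksum(body):
    """Return a fixed-length fingerprint of a page body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def group_exact_duplicates(pages):
    """Group URLs whose bodies are byte-for-byte identical.

    pages is a dict of url -> body. Only groups of two or more are returned.
    """
    groups = {}
    for url, body in pages.items():
        groups.setdefault(content_checksum(body), []).append(url)
    return [urls for urls in groups.values() if len(urls) > 1]
```

Each group returned is a set of URLs competing to be the one shown, which is where choosing and signalling a preferred canonical URL comes in.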

SEO Services

Explore more of our digital services

Penalty recovery

Backlink analysis

Crawl rate optimisation

More Reasons To Choose Us As Crawl Optimisation Partner

Get maximum value from search engine crawling

Identify the best places for Googlebot and other crawlers to visit and emphasise these to them

Speed up your site

Speeding up your site as part of crawl optimisation can never be a bad thing

Send better quality signals

We’ll help you identify the weak areas in your site and help you put together a strategy to either improve or exclude this content for better quality signals when search engines come to call.

Ready to start winning with crawl?

Call us on:

0161 241 5151