Crawling – What is it?

Crawling is the process search engines use to gather documents for their information retrieval systems. Crawlers (also known as spiders or bots) traverse the web by following links, or by using other discovery systems, both to discover new web pages and to revisit known web pages so that the search engine index stays up to date. URLs which have been discovered but not yet crawled are known as the ‘crawl frontier’. Existing web pages which are revisited to check for substantive change will likely be marked as ‘already seen’. Google patents describe an element within the search engine crawling architecture called the ‘Already Seen Test’.
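To make link traversal and the ‘already seen’ check more concrete, here is a minimal sketch in Python. It is an illustration under our own assumptions and not how any search engine is actually built: the `LINK_GRAPH` dictionary is a hypothetical stand-in for the web, the frontier is a simple queue of URLs waiting to be fetched, and the already seen test is a set lookup.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
LINK_GRAPH = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def crawl(seed, link_graph):
    """Traverse links breadth-first and return URLs in the order they were fetched."""
    frontier = deque([seed])  # URLs waiting to be fetched
    seen = {seed}             # the 'already seen' test is a set membership check
    fetched = []
    while frontier:
        url = frontier.popleft()
        fetched.append(url)   # a real crawler would download the page here
        for link in link_graph.get(url, []):
            if link not in seen:  # only queue links that have not been seen before
                seen.add(link)
                frontier.append(link)
    return fetched


crawl_order = crawl("https://example.com/", LINK_GRAPH)
```

Each URL is fetched once even though several pages link to it, because the already seen check stops it from being queued a second time.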

What is crawl budget?

‘Crawl budget’ is a term used frequently in SEO, but it is unlikely to be a term used internally by search engineers at Google or at any other search engine. Both Google and Bing do, however, refer to crawl budget when they communicate with the SEO world.

Crawl budget as a concept is really a combination of two different notions from a search engine perspective: host load and scheduling. Scheduling is in turn based on a further notion, that of ‘crawl demand’. In the SEO world, crawl budget relates to how much time or activity search engine crawlers are likely to apportion to a particular domain (host).

Why does crawl budget exist?

Whilst search engine resources are extensive, they are not infinite, and the amount of content on the web keeps growing. Some content (and therefore some URLs) is more important to users than other content. Search engines seek to be efficient in their crawling strategies when traversing the web, and prioritise web pages which either change more frequently or are more important to search engine users. Search engine crawlers also visit servers (which host websites) based upon how many simultaneous connections (hits from bots) the server can tolerate. This is to ensure that neither the server nor the websites it hosts are harmed.

What is host load and why does it impact crawling?

The first notion is called ‘host load’, and it is based around ‘crawling politeness’. One of the major ‘rules of the web’ (which applies well beyond Google, or even all of the search engines, as the web as a community is much bigger than this) is never to hurt the servers which sites sit on whilst crawling between documents (nodes), which we will think of mainly as web pages. This rule of ‘good behaviour’ is called ‘crawling politeness’. It’s akin to the bot, and the bot’s controllers, being good citizens of the web.

In practice, ‘crawling politeness’ concerns the number of simultaneous downloads of all types of documents (HTML pages, CSS files, JavaScript files, even images and PDFs, in fact any kind of file which can be downloaded) and the number of connections Google’s crawlers (bots) open at the same time. Tying these together, host load is essentially “how much can we crawl with our bots politely (without hurting the server)?”, i.e. how much can the host (typically an IP address rather than a domain) handle at one time? Note that this capacity is shared amongst all of the subdomains belonging to a domain rather than being allocated to a single subdomain.

Host load in web crawling is based on “how much can this server / IP address handle?” It is important that web crawlers behave like good citizens of the web and do not damage the servers they visit. If a web server is struggling to handle fast crawling, or several simultaneous connections from web crawlers, then crawling may be pulled back to match the capacity of the server. If the web server returns server error response codes such as 500 or 503, which indicate that the server is having problems, the web crawler will again pull back from crawling to avoid causing any further problems. These response codes are also noted against the IDs of the documents which were being downloaded by the web crawler. It therefore makes sense to ensure you have enough capacity on the server / IP address (host) to handle simultaneous connections, and also to ensure your site does not have coding or database loading issues when search engines (or humans) come to call.
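The ‘pull back’ behaviour can be sketched as a simple rule for the delay between requests to one host. This is a hypothetical illustration only: the function name, the doubling on server errors, the 10% speed-up on success and the delay limits are all our own assumptions, not published search engine behaviour.

```python
def adjust_crawl_delay(current_delay, status_code, min_delay=0.1, max_delay=60.0):
    """Return the next delay (in seconds) between requests to a single host.

    Assumed policy: back off sharply on 5xx server errors such as 500 or 503,
    and speed up gently while the server keeps responding without errors.
    """
    if 500 <= status_code < 600:
        # The server is struggling, so double the wait (up to a ceiling).
        return min(max_delay, current_delay * 2)
    # The server coped, so shorten the wait a little (down to a floor).
    return max(min_delay, current_delay * 0.9)


# Simulate a crawler that starts at one request per second
# and receives a 503 partway through its visit.
delay = 1.0
delay_history = []
for status in [200, 200, 503, 503, 200]:
    delay = adjust_crawl_delay(delay, status)
    delay_history.append(round(delay, 3))
```

Backing off quickly and recovering slowly is a deliberately cautious choice here, since the cost of overloading a server is higher than the cost of crawling slightly too slowly.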

What is crawl scheduling?

Crawl scheduling in web crawling is the ordering of documents (which can be web pages, images, cascading style sheets (CSS), JavaScript (JS) files, PDFs or other documents), together with the allocation of visiting priority and frequency by search engine web crawlers. Crawl scheduling is mainly based on the importance and change frequency of the documents. Scheduling is a common operation in computer science, used to prioritise and manage repetitive tasks held in data structures such as queues. A basic crawling queue works on a FIFO (first in, first out) basis, but the queue in web crawling is likely to be circular (i.e. the queue never ends, because documents which have been visited rejoin it to be revisited later).
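A never-ending crawl queue can be sketched in a few lines of Python. Note the assumptions: rather than a strict FIFO queue, this sketch uses a priority queue ordered by the time each page is next due, and a single hypothetical ‘revisit interval’ per page stands in for both importance and change frequency. It is an illustration, not a description of any search engine’s scheduler.

```python
import heapq


def schedule(pages, horizon):
    """Return the (hour, url) visits made up to and including `horizon`.

    `pages` maps each URL to a hypothetical revisit interval in hours.
    """
    # Every page is due at hour 0; the heap always yields the page due soonest.
    queue = [(0, url) for url in sorted(pages)]
    heapq.heapify(queue)
    visits = []
    while queue and queue[0][0] <= horizon:
        due, url = heapq.heappop(queue)
        visits.append((due, url))
        # Re-queue the page for its next visit, so the queue never empties.
        heapq.heappush(queue, (due + pages[url], url))
    return visits


# A frequently changing news page and a rarely changing about page.
visits = schedule({"/news": 1, "/about": 4}, horizon=4)
visited_urls = [url for _, url in visits]
```

Over the four simulated hours the news page is visited far more often than the about page, which is the effect scheduling by change frequency is meant to have.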

What is crawl scheduling and crawl demand?

The second part of crawl budget is ‘scheduling’ or ‘crawl demand’. The web is huge and growing all the time. According to Internet Live Stats there are over 1.5 billion websites online right now (probably many more, since you’re reading this after we added this text). Crawling all of the web pages on more than 1.5 billion websites is an extensive task. If you also consider the number of individual web pages, and even user generated content such as tweets (each effectively a web page too) which is crawlable and indexed, the task becomes enormous. Search engines therefore need to build systems which are scalable and efficient.

A crawling schedule is developed over time, based on how important web pages are and how often they change. If your web page rarely changes it’s likely to be crawled less frequently than web pages which frequently undergo substantive changes (things that make a real difference to search engine users, such as the price on an e-commerce page). If your web page is not getting crawled often it doesn’t automatically follow that your page is lower quality; however, there is often a correlation between many low quality pages on a site and low crawling. These pages tend to be low in importance and readily available (i.e. there are a lot of them and they are mostly the same, meeting the same query clusters as each other). Over time, as your web pages are monitored for change and importance, the search engines build a picture of how often these pages should be revisited.

When these two factors (host load and crawl demand) are combined we end up with the concept we, as SEOs, have come to know as ‘crawl budget’.
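The idea of building a picture of how often a page should be revisited can be illustrated with a simple adaptive rule. This is a sketch under our own assumptions (halve the interval when a substantive change is seen, double it when nothing has changed, within hypothetical limits); real search engines do not publish their scheduling logic.

```python
def next_revisit_interval(current_hours, changed, min_hours=1.0, max_hours=720.0):
    """Return the next revisit interval in hours for one page.

    Assumed rule: revisit sooner after a substantive change,
    and wait longer when the page was found unchanged.
    """
    if changed:
        return max(min_hours, current_hours / 2)
    return min(max_hours, current_hours * 2)


# A page first revisited daily: unchanged twice, then a change is detected.
interval = 24.0
interval_history = []
for changed in [False, False, True]:
    interval = next_revisit_interval(interval, changed)
    interval_history.append(interval)
```

In this example the interval stretches from 24 to 96 hours while the page stays the same, then drops back to 48 hours once a change is seen.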

What is the ‘Crawl Rate’?

The crawl rate in web crawling is how quickly a web crawler crawls through your website. It is measured by how many requests per second Google’s search engine crawler makes when traversing the site. The fastest speed at which traversal occurs is 10 requests per second. However, it is important to note that all websites and their host load capacity are different, so there is not necessarily a ‘standard’ crawl rate for a particular type of site. Each site, its components and its hosting are unique.
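If you have server logs, an approximate crawl rate can be worked out from the timestamps of a crawler’s requests. The sketch below is a rough illustration: the sample timestamps are made up, and dividing the number of requests by the length of the observed window is our own simplification.

```python
from datetime import datetime


def crawl_rate(timestamps):
    """Return average requests per second across the observed window.

    `timestamps` is a list of ISO 8601 strings, one per crawler request.
    """
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    if len(times) < 2:
        return 0.0  # not enough requests to measure a rate
    window_seconds = (times[-1] - times[0]).total_seconds()
    if window_seconds == 0:
        return float(len(times))  # all requests landed in the same instant
    return len(times) / window_seconds


# Hypothetical crawler hits taken from a server log.
SAMPLE_HITS = [
    "2024-01-01T10:00:00",
    "2024-01-01T10:00:00.500000",
    "2024-01-01T10:00:01",
    "2024-01-01T10:00:01.500000",
    "2024-01-01T10:00:02",
]
requests_per_second = crawl_rate(SAMPLE_HITS)
```

Here five requests arrive across a two second window, giving 2.5 requests per second.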

What is Indexing and Crawling?

Indexing

Indexing is the process undertaken by search engines to ‘file’ every document they come across in a systematic order in storage for quick retrieval later. Liken this to a card index system. Search engines use what is called an ‘inverted index’ to file documents (web pages, PDFs and other URLs), in which each word discovered (known as a token) is recorded against the documents it appears in. The index is called upon when a search engine user enters a query (either by written text or voice), and the purpose of the search engine then is to return the most relevant response as quickly as possible from the storage system. What is considered the most relevant document is beyond the scope of crawl optimisation, but it goes without saying that unless a document is in the index it cannot be retrieved as a relevant result.

What is Indexing in a Search Engine?

Indexing in a search engine is the process of organising discovered web pages and storing them in a meaningful way, so that when a search engine user enters a query the system (search engine storage) can quickly find what is needed and return the most relevant pages. Web pages are collected by crawling the web, and the contents of the web pages are then broken down into tokens (tokenization / tokenisation), which are very much like individual words. Each word (whether already known or new) is then marked as being found in particular documents. For example, ‘cat’ might be found in thousands of documents. ‘Cat’ would be given an identifier, the documents containing ‘cat’ would also be given identifiers, and the identifiers of those documents would be stored against ‘cat’ in the index. Again, consider the index to be similar to a card index system used by a librarian to quickly find a requested book, but in this case at a word level.
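A toy inverted index shows the idea. This is a minimal sketch with assumptions of our own: the tokeniser simply lowercases text and splits on anything that is not a letter or digit, the three sample documents are invented, and a query returns only documents containing every query word. Real search engine indexes are far more sophisticated.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Break text into lowercase word-like tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical documents keyed by document ID.
DOCUMENTS = {
    1: "The cat sat on the mat",
    2: "A cat and a dog",
    3: "Dogs chase balls",
}
index = build_inverted_index(DOCUMENTS)
cat_documents = index["cat"]
cat_and_dog = search(index, "cat dog")
```

Looking up ‘cat’ goes straight to documents 1 and 2 without scanning every document, which is the librarian’s card index at a word level.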

Search engine crawlers periodically revisit existing web pages based upon likelihood of change frequency according to a schedule (rather like an air traffic control system for bots), to ensure efficiency.  Crawling the web is expensive for search engines and therefore systems are built for scale and to manage resources as efficiently as possible.

What Types of Googlebot Crawl My Site?

There are many Googlebots which may come to visit your website.

The Googlebot family is as follows:

  • APIs-Google
  • AdSense
  • AdsBot Mobile Web Android
  • AdsBot Mobile Web
  • AdsBot
  • Googlebot Images – Crawls to gather URLs for Google Images section of Universal search
  • Googlebot News – Crawls to gather URLs for Google News
  • Googlebot Video – Crawls video documents / video files to gather URLs (web pages / documents) for Google’s video search
  • Googlebot (Desktop) – Organic search Googlebot – crawls web pages as a desktop user
  • Googlebot (Smartphone) – Organic search mobile Googlebot – crawls web pages emulating a mobile user with a mobile user-agent
  • Mobile AdSense – The mobile crawler for the AdSense platform rather than the organic SEO mobile crawler
  • Mobile Apps Android

Googlebots and Google Search Console Crawl Stats

If you frequent Google Search Console you might be surprised to discover that the crawling statistics in the ‘crawl stats’ section do not only show the Googlebots used for organic search (SEO). Every member of the Googlebot family may have left a footprint in those stats. Of course, your site may not receive frequent visits from all of the Googlebots. For example, if you are not running any kind of paid search (PPC) then you would be unlikely to receive frequent visits from AdsBot. Likewise, if you are not using AdSense you would be far less likely to receive visits from the AdSense bot.
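One way to see which members of the family are visiting is to tally the user-agent strings in your own server logs. The sketch below is illustrative only: the sample user-agent strings are examples we have written for the purpose, the token list is not exhaustive, and treating any Googlebot user-agent containing ‘Mobile’ as the smartphone crawler is a simplifying assumption. User-agents can also be spoofed, so a match is not proof that the visit came from Google.

```python
from collections import Counter

# Most specific tokens first, because "Googlebot" is also a substring
# of tokens such as "Googlebot-Image".
CRAWLER_TOKENS = [
    ("AdsBot-Google-Mobile", "AdsBot Mobile Web"),
    ("AdsBot-Google", "AdsBot"),
    ("Mediapartners-Google", "AdSense"),
    ("APIs-Google", "APIs-Google"),
    ("Googlebot-Image", "Googlebot Images"),
    ("Googlebot-News", "Googlebot News"),
    ("Googlebot-Video", "Googlebot Video"),
    ("Googlebot", "Googlebot"),
]


def classify_user_agent(user_agent):
    """Return the Googlebot family member a user-agent string appears to be."""
    for token, family in CRAWLER_TOKENS:
        if token in user_agent:
            if family == "Googlebot":
                # Assumption: the smartphone crawler identifies as a mobile browser.
                if "Mobile" in user_agent:
                    return "Googlebot (Smartphone)"
                return "Googlebot (Desktop)"
            return family
    return "Other"


# Example user-agent strings, as they might appear in a server log.
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.0.0 Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Googlebot-Image/1.0",
    "AdsBot-Google (+http://www.google.com/adsbot.html)",
    "Mediapartners-Google",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/120.0",
]
hits_by_family = Counter(classify_user_agent(ua) for ua in SAMPLE_USER_AGENTS)
```

A tally like this makes it easier to separate organic search crawling from ads-related crawling when interpreting overall crawl statistics.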

Does JavaScript Slow Down Crawling?

JavaScript can certainly make crawling your web pages more complicated for search engines, and even more so when it comes to reading the content on the pages, which could negatively impact your SEO.

Does a Slow Website Affect Crawling?

If you have a slow website, Googlebot or other search engine crawlers arriving with their scheduled ‘shopping list’ of pages to visit may not get round all the URLs on the list. They may also not have time to spend on other important pages which could have been discovered during these visits. This could impact your site’s indexing. By understanding where search engines are visiting and how frequently, identifying slow loading pages and working out how to improve them, we can help you get maximum effectiveness from crawling.

Should All Pages Be Indexed?

There may be some areas of your site which are not very useful for visitors to land on directly from search engine results pages. That is not to say that they do not add some type of value to visitors when they are on the site (for example, for some aspects of navigation). There may also be some pages on your site which you don’t feel are good enough quality to be a direct landing page for visitors arriving from search engines. Search engines may also agree with this. These types of pages might not be good enough to be indexed, and search engines may choose not to include them in their results pages. However, low quality pages, when crawled, could be dragging your site down in search engine results, so for now it might be best to noindex these pages. We’ll help you identify these pages and advise whether to try to improve them or noindex them so that search engines put them to one side.
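A page is usually marked noindex either with a robots meta tag in the HTML or with an `X-Robots-Tag` HTTP response header. As a rough way to audit which pages carry the directive, the sketch below checks both. It is illustrative only: the function and class names are our own, the HTML snippets are made up, and the headers are assumed to be supplied as a plain dictionary keyed exactly as `X-Robots-Tag`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr_map.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def is_noindexed(html, headers=None):
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header_value = (headers or {}).get("X-Robots-Tag", "")
    header_directives = {d.strip().lower() for d in header_value.split(",")}
    return "noindex" in parser.directives or "noindex" in header_directives


# Hypothetical pages: one low value page marked noindex, one normal page.
filter_page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
product_page = "<html><head><title>Blue widget</title></head></html>"
filter_page_noindexed = is_noindexed(filter_page)
product_page_noindexed = is_noindexed(product_page)
```

Running a check like this across a crawl of your own site gives a quick list of which pages are currently excluded, which can then be compared against the pages you actually want indexed.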