How We Can Help You With Crawl Optimisation
Smart Crawl Optimisation Analysis and Strategy
Search engines naturally seek to be efficient when crawling the web, and are thought to prioritise pages which either change more frequently or are more important to search engine users than other pages. Search engine crawlers also pace their visits to a server (hosting websites) based on how many simultaneous connections (hits from bots) that server can tolerate. This is to avoid overloading the server, and the websites it hosts.
On larger sites we often find that search engines are either not finding pages or not visiting them frequently, because there are blockages, because there are issues with content or structural quality on the site, or because the site is slow.
Our team have extensive experience in analysing and interpreting crawl frequency, and in providing recommendations to open up blockages, reduce duplication of content (combining otherwise competing signals so that strength is consolidated for search engine crawlers), and optimise speed. It doesn’t necessarily follow that you will rank higher in search engines as a result of crawl optimisation, but it certainly follows that focusing otherwise competing signals on non-duplicate URLs (web pages) cannot be negative. Nor can unblocking areas which are rich in content but which search engines cannot access because of directives added during website development. And nor can speeding up a website, or sections of it, so that search engine crawlers can get round more URLs (web pages) from their scheduled ‘crawling list’ when they visit.
These are the areas we focus on with crawl rate optimisation analysis and strategy development as part of our core technical SEO services. We use a range of tools to analyse and identify where and when search engine crawlers are visiting and redirect them to more appropriate URLs. We help search engines understand which parts of the site we consider important and where they should focus their crawling activity.
What Type of Googlebots Crawl My Website?
There are many Googlebots which may come to visit your website.
The Googlebot family is as follows:
- AdsBot Mobile Web Android
- AdsBot Mobile Web
- Googlebot Images – Crawls to gather URLs for the Google Images section of Universal search
- Googlebot News – Crawls to gather URLs for Google News
- Googlebot Video – Crawls video documents / video files to gather URLs for Google’s video search
- Googlebot (Desktop) – The organic search Googlebot – crawls web pages as a desktop user
- Googlebot (Smartphone) – The organic search mobile Googlebot – crawls web pages emulating a mobile user with a mobile user-agent
- Mobile AdSense – The mobile crawler for the AdSense platform rather than the organic SEO mobile crawler
- Mobile Apps Android
Googlebots and Google Search Console Crawl Stats
If you frequent Google Search Console you might be surprised to discover that it isn’t only the Googlebots used for organic search (SEO) which appear in the ‘crawl stats’ section. Every member of the Googlebot family may have left a footprint in those stats. Of course, your site may not receive frequent visits from all of the Googlebots. For example, if you are not running any paid search (PPC) activity you would be unlikely to receive frequent visits from ‘Googlebot Adsbot’, and if you are not running AdSense campaigns you would be far less likely to receive visits from ‘Adsense Bot’.
What is Indexing and Crawling?
Indexing is the process undertaken by search engines to ‘file’ every document they come across, in a systematic order, in storage for quick retrieval later. Liken this to a card index system. Search engines use what is called an ‘inverted index’ to file documents (web pages, PDFs, URLs), with each word discovered (known as a token) marked as appearing in particular documents. The index is called upon when a search engine user enters a query (by written text or voice), and the purpose of the search engine is then to return the most relevant response as quickly as possible from that storage system. What is considered the most relevant document is beyond the scope of crawl optimisation, but it goes without saying that unless a document is in the index it cannot be retrieved as a relevant result.
Crawling is a process used by search engines as part of their information retrieval systems. Crawlers (a.k.a. spiders or bots) traverse the web via links, or using other discovery systems, both to discover new web pages and to revisit updated web pages in order to keep the search engine index up to date. The undiscovered web pages are known as ‘the crawl frontier’. Web pages which have already been crawled will likely be marked as ‘already seen’; Google patents clearly show an element within the search engine crawling architecture called the ‘Already Seen Test’.
Search engine crawlers periodically revisit existing web pages according to a schedule based on how likely each page is to have changed (rather like an air traffic control system for bots), to ensure efficiency. Crawling the web is expensive for search engines, so their systems are built for scale and to manage resources as efficiently as possible.
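To make the frontier and the ‘already seen’ test concrete, here is a minimal sketch of a breadth-first crawler. It is our own simplification, not Google’s architecture; `fetch_links` is a hypothetical callable standing in for fetching and parsing a page’s HTML.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Minimal breadth-first crawler sketch with an 'already seen' test.

    fetch_links(url) is a hypothetical callable returning the URLs
    linked from a page; a real crawler would fetch and parse HTML.
    """
    frontier = deque(seed_urls)   # undiscovered pages: the 'crawl frontier'
    seen = set(seed_urls)         # the 'already seen' test
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:  # skip URLs already queued or visited
                seen.add(link)
                frontier.append(link)
    return crawled
```

The seen-set is what stops a crawler going round in circles on sites where pages link back to each other, which is almost all of them.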
What is Indexing in a Search Engine?
Indexing in a search engine is the process of organising discovered web pages and storing them in a meaningful way, so that when a search engine user enters a query the system (search engine storage) can quickly find what is needed and return the most relevant pages. Web pages are collected by crawling the web, and their contents are then broken down into tokens (tokenisation / tokenization), which are very much like individual words. Each token is marked as appearing in particular documents. For example, ‘cat’ might be found in thousands of documents: ‘cat’ would be given an identifier, the documents containing it would also be given identifiers, and each document containing ‘cat’ would be recorded in the index information system. Again, consider the index to be similar to the card index a librarian uses to quickly find a requested book, but in this case at a word level.
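The ‘cat’ example above can be sketched in a few lines. This is a deliberately tiny model of an inverted index (token to set of document identifiers); real search engines tokenise far more carefully and store positions, frequencies and much more.

```python
def build_inverted_index(documents):
    """Tiny sketch of an inverted index: token -> set of document ids.

    `documents` maps a document id to its text. Splitting on
    whitespace is a crude stand-in for real tokenisation.
    """
    index = {}
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index

docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "crawl optimisation for search engines",
}
index = build_inverted_index(docs)
# Looking up 'cat' returns the ids of every document containing that token
```

This is why retrieval is fast: the engine never re-reads the documents at query time, it simply looks the token up in the index.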
What is crawl budget?
‘Crawl budget’ is a term used frequently in SEO, but in reality this is not a term used internally at Google by their search engineers. However, it is referred to by Google and Bing when talking to SEOs. Crawl budget as a concept is really a combination of two different notions from a search engine perspective. These two elements or conceptual notions are host load and scheduling. Scheduling is also based on another notion which is that of ‘crawl demand’.
Scheduling (Crawl demand)
The second part of crawl budget is ‘scheduling’, or ‘crawl demand’. The web is huge and growing all the time. According to Internet Live Stats there are over 1.5 billion websites online right now (probably many more by the time you read this). Crawling all of the web pages across those 1.5 billion-plus websites is an extensive task, and if you also consider the number of individual web pages, and even user-generated content such as tweets (effectively web pages too, indexed and crawlable), the task becomes humungous. Search engines therefore need to build systems which are scalable and efficient.

A schedule of crawling is developed over time, based on how important web pages are and how often they change. If your web page rarely changes, it is likely to be crawled less frequently than pages which frequently change in substantive ways (changes that make a real difference to search engine users, such as price on an e-commerce page). If your web page is not getting crawled often it doesn’t automatically follow that it is lower quality, but there is often a correlation between a site having many low-quality pages and low crawling. Such pages tend to be low in importance and readily available (i.e. there are a lot of them and they are mostly the same, meeting the same query clusters as each other). Over time, as your web pages are monitored for change and importance, the search engines build a picture of how often these pages should be revisited.

When these two factors (host load and crawl demand) are combined, we end up with the concept we as SEOs have come to know as ‘crawl budget’.
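As a rough illustration of how demand and host load interact, here is a toy scheduler. The importance and change-rate scores, and the way they are combined, are entirely our own invented simplification for the sake of the example, not how Google actually schedules crawling.

```python
import heapq

def schedule(pages, budget):
    """Toy crawl scheduler sketch: pick the next `budget` URLs to crawl.

    `pages` is a list of (url, importance, change_rate) tuples with
    scores in [0, 1] -- hypothetical inputs; real schedulers combine
    many more signals. Higher importance x change_rate stands in for
    'crawl demand'; `budget` stands in for host load (how many fetches
    the server can comfortably take on this visit).
    """
    scored = [(-(importance * change_rate), url)
              for url, importance, change_rate in pages]
    heapq.heapify(scored)  # min-heap on negated score = max-heap on score
    return [heapq.heappop(scored)[1]
            for _ in range(min(budget, len(scored)))]
```

The point of the sketch is the shape of the trade-off: when the per-visit budget is small, low-importance, rarely-changing pages simply never make the list.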
What is Host Load in Web Crawling?
What is Crawl Scheduling?
What is Crawl Demand?
What is the Crawl Rate?
How Fast Can Google Crawl My Website?
How Can Crawl Optimisation Help My Site?
Crawl optimisation via speed improvements
If you have a slow website, Googlebot or other search engine crawlers arriving with their scheduled ‘shopping list’ of pages to visit may not get round all the URLs on the list, and may not have time to spend on other important pages which could be discovered on these visits. This could impact your site’s indexing. By understanding where search engines are visiting and how frequently, identifying slow-loading pages, and working out how to improve them, we can help you get maximum effectiveness from crawling.
To Index or Not To Index
There may be some areas of your site which are not actually very useful as direct destinations from search engine results pages. That is not to say they add no value to visitors once they are on the site (for example, for some aspects of navigation). There may also be pages you don’t feel are good enough quality to be a direct landing page for visitors from search engines, and search engines may agree: they may choose not to include them in their results pages. However, low-quality pages, when crawled, could be dragging your site down in search engine results, so for now it might be best to noindex these pages. We’ll help you identify them and advise on whether to improve these pages or noindex them so that search engines put them to one side.
We study crawling a lot and share our knowledge with the industry
We’ve been a little bit obsessed with web search engine crawling for quite a few years. We’ve shared our knowledge within the industry and spoken at conferences both in the UK and internationally (as far and wide as the United States and Australia). We’ve contributed to industry publications around the subject of crawl frequency, URL scheduling and how these can impact website migrations on larger sites. So, we know a thing or two about this topic. We’re well placed to help you get the best when search engine bots come to call.
Duplicate, near duplicate and similar content
There’s a lot of confusion in the SEO world when it comes to duplicate, near-duplicate and similar content. There are even some SEOs who advise clients to remove all aspects of duplication from their website. This is often a mistake, because duplication is often tied more to the search engine query than to the content on the page: does the content on more than one page meet the informational need behind the query with the same intent and context? If so, regardless of what is on the page, this could be considered a case of duplication. Exact duplicate content is filtered out by search engines before pages even get indexed, because search engines are thought to have a ‘dup-checker’ in their ‘boiler room’ (anatomy). This simply computes a ‘content checksum’ (a kind of binary calculation) which, when compared, says whether two or more pages are exactly the same. If they are, all but one will be dropped and one will be chosen to be shown in search engine results. However, the web page chosen by the search engine may not always be the one which we, as brands and website owners, would prefer it to show. Crawl optimisation can also help with this, as we can signal to search engines which of the URLs should be chosen as the ‘canonical’ (the chosen one).
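The checksum idea can be sketched very simply. This is an illustration of the general technique, not the actual algorithm any search engine uses; we use SHA-256 as a stand-in for whatever content checksum a real dup-checker computes.

```python
import hashlib

def content_checksum(html: str) -> str:
    """Checksum over page content, standing in for a 'dup-checker'.

    Only byte-identical content matches; near duplicates need fuzzier
    techniques (e.g. shingling or simhash), which this sketch omits.
    """
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def drop_exact_duplicates(pages):
    """Keep one URL per unique checksum, mimicking how exact duplicates
    are filtered before indexing. `pages` maps URL -> HTML content."""
    chosen = {}
    for url, html in pages.items():
        chosen.setdefault(content_checksum(html), url)  # first URL seen wins
    return set(chosen.values())
```

Note that in this sketch the kept URL is simply the first one encountered, which mirrors the problem described above: the version the system keeps is not necessarily the one you would have picked, which is exactly why canonical signals matter.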
Explore more of our digital services
Crawl Rate Optimisation
More Reasons To Choose Us As Crawl Optimisation Partner
Get maximum value from search engine crawling
Identify the best places for Googlebot and other crawlers to visit and emphasise these to them
Speed up your site
Speeding up your site as part of crawl optimisation can never be a bad thing
Send better quality signals
We’ll help you identify the weak areas in your site and help you put together a strategy to either improve or exclude this content for better quality signals when search engines come to call.
Ready to start winning with crawl?
Call us on:
0161 241 5151