Googlebot

The Googlebot is one of Google's most important tools and is therefore highly relevant for search engine optimization. The term "Googlebot" generally refers to all of the search engine's web crawlers: the crawler that simulates a desktop user, the one that represents mobile users, and the bots for news and images. The bot's task is to automatically scan the World Wide Web and add web pages to the Google Index. The Googlebot thus ensures that a website can be found as a search result in the SERPs and is displayed when a user searches. Google primarily uses the mobile crawler to index pages on the web (mobile-first indexing).
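To illustrate the difference between these crawler variants, the following sketch distinguishes a desktop from a smartphone Googlebot request by its User-Agent header. This is only a rough illustration: the sample strings are shortened stand-ins for the user agents Google documents (which can change over time), and the classification rule is an assumption, not an official check.

```python
# A minimal sketch of telling the desktop and smartphone Googlebot apart by
# their User-Agent header. Sample strings are abbreviated illustrations;
# "Chrome/W.X.Y.Z" stands for a changing browser version.
def classify_googlebot(user_agent: str) -> str:
    ua = user_agent.lower()
    if "googlebot" not in ua:
        return "not Googlebot"
    # The smartphone crawler identifies itself with Android/Mobile tokens.
    if "android" in ua or "mobile" in ua:
        return "Googlebot Smartphone"
    return "Googlebot Desktop"

if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 "
        "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/W.X.Y.Z Safari/537.36",
    ]
    for ua in samples:
        print(classify_googlebot(ua))
```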
When the Googlebot visits a website, it downloads the current version of its files and updates them in the Google Index. How often a bot revisits a website depends, among other things, on the number of external backlinks and the importance of the page. The Google robot navigates along the links that exist between websites. By submitting a sitemap.xml file, crawling by the Googlebot can be sped up and improved considerably. In Google Search Console (formerly Google Webmaster Tools), the webmaster specifies which sitemap.xml file should be used and where it is located; since a single sitemap may contain at most 50,000 URLs, larger sites must split it into several sub-files and reference them from a sitemap index.
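As a rough illustration of that 50,000-URL limit, the sketch below splits a hypothetical URL list into several sitemap files and writes a sitemap index that references them. The file names, URLs, and domain are placeholders, not an official tool or format requirement beyond the standard sitemap schema.

```python
# A minimal sketch: split a large URL set into sitemap files of at most
# 50,000 URLs each, plus a sitemap index. All names/URLs are hypothetical.
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000

def write_sitemaps(urls, base_url="https://www.example.com"):
    """Write sitemap-1.xml, sitemap-2.xml, ... and a sitemap-index.xml."""
    sitemap_files = []
    for i in range(0, len(urls), MAX_URLS_PER_SITEMAP):
        chunk = urls[i:i + MAX_URLS_PER_SITEMAP]
        name = f"sitemap-{i // MAX_URLS_PER_SITEMAP + 1}.xml"
        entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in chunk)
        with open(name, "w", encoding="utf-8") as f:
            f.write(f'<?xml version="1.0" encoding="UTF-8"?>\n'
                    f'<urlset xmlns="{SITEMAP_NS}">\n{entries}\n</urlset>\n')
        sitemap_files.append(name)

    # The index file is the one submitted in Google Search Console.
    index_entries = "\n".join(
        f"  <sitemap><loc>{base_url}/{name}</loc></sitemap>" for name in sitemap_files
    )
    with open("sitemap-index.xml", "w", encoding="utf-8") as f:
        f.write(f'<?xml version="1.0" encoding="UTF-8"?>\n'
                f'<sitemapindex xmlns="{SITEMAP_NS}">\n{index_entries}\n</sitemapindex>\n')

if __name__ == "__main__":
    # Hypothetical URL list; a real site would generate this from its CMS or database.
    example_urls = [f"https://www.example.com/page-{n}" for n in range(1, 120_001)]
    write_sitemaps(example_urls)
```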
How the Googlebot accesses a page on the web
The Googlebot crawls over HTTP/1.1 or, if the website supports it, over HTTP/2. The protocol version has no effect on a page's ranking; however, HTTP/2 conserves computing resources for both the website and the crawler. The bot crawls only the first 15 MB of an HTML file. Resources referenced in the HTML, such as images, videos, JavaScript, or CSS, are retrieved separately. Once the 15 MB limit is reached, the Googlebot stops crawling and considers only those first 15 MB for indexing. For most sites, the robot accesses the server no more than once every few seconds on average; due to delays, however, the rate may appear slightly higher over short periods.
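The 15 MB cutoff can be mimicked with a few lines of standard-library Python: the sketch below fetches a page and keeps only the first 15 MB of the HTML response, discarding anything beyond it. The URL and the User-Agent string are placeholders; this is not how Googlebot itself is implemented, only an illustration of the size limit.

```python
# A minimal sketch of the 15 MB cutoff: only the first 15 MB of the HTML
# response are kept for further processing. The URL is a placeholder.
import urllib.request

FETCH_LIMIT_BYTES = 15 * 1024 * 1024  # only the first 15 MB of HTML are considered

def fetch_html_with_limit(url: str) -> str:
    request = urllib.request.Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urllib.request.urlopen(request) as response:
        # Read at most 15 MB; anything beyond this would be ignored for indexing.
        raw = response.read(FETCH_LIMIT_BYTES)
    return raw.decode("utf-8", errors="replace")

if __name__ == "__main__":
    html = fetch_html_with_limit("https://www.example.com/")
    print(f"Fetched {len(html)} characters (capped at roughly 15 MB of raw HTML)")
```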
Are all pages of a web domain indexed by Googlebot?
Not every page is visited by Google's web crawler. This is partly due to dynamically generated content, such as URLs with PHP session IDs, which the Googlebot can index only with difficulty or not at all. In addition, the webmaster can exclude certain pages from crawling via the robots.txt file. In this file, the webmaster specifies whether and in what form a page may be visited by the Googlebot. However, robots.txt does not prevent a page from being accessed directly, nor does it protect or encrypt its content.
If certain pages of a website are not to be crawled, the webmaster must tell the Googlebot so in the robots.txt file. This is important, among other things, so that the web crawler does not waste time on unimportant pages. The background is that the program has only a limited time budget (crawl budget) for each website, which depends, among other things, on the relevance of the site. Depending on this budget, the Googlebot can read more or fewer subpages of a domain. The goal is to exclude unimportant pages so that the Google crawler reads the most important subpages.
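As a sketch of how such rules are interpreted, the following standard-library Python snippet parses a hypothetical robots.txt and checks whether the Googlebot may fetch two example URLs. The paths, URLs, and directives are invented for illustration; real exclusions depend on the structure of the site in question.

```python
# A minimal sketch: check whether the Googlebot is allowed to crawl a URL
# under a hypothetical robots.txt. The rules and URLs are illustrative only.
from urllib import robotparser

# Hypothetical robots.txt: unimportant sections are excluded so that the
# crawl budget is spent on the important pages.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /internal-search/
Disallow: /session/
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in ("https://www.example.com/products/",
            "https://www.example.com/internal-search/?q=test"):
    allowed = parser.can_fetch("Googlebot", url)
    print(f"{url} -> {'crawlable' if allowed else 'blocked by robots.txt'}")
```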
Hints for the bot: setting follow and nofollow links on a page
When you set a link, you can add a rel="nofollow" attribute to it; links without this attribute are treated as follow links by default. The attribute signals whether the Googlebot should follow the link and include the target page in the index. This is only a hint for the bot, although it usually respects it. Nevertheless, you should not simply link to lots of low-quality websites or to and from bad neighborhoods and assume that a nofollow attribute gets you off the hook: such poor link building can still be detected and penalized by Google despite the nofollow links.
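The sketch below shows, using Python's standard-library HTML parser, how a crawler could separate follow from nofollow links on a page. The sample HTML and URLs are purely illustrative, and the rule applied (any link without rel="nofollow" counts as follow) mirrors the default behavior described above.

```python
# A minimal sketch: extract links from HTML and split them into follow and
# nofollow links. The sample HTML is illustrative only.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.follow_links = []
        self.nofollow_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        rel_values = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            self.nofollow_links.append(href)
        else:
            # Links without rel="nofollow" are treated as follow links by default.
            self.follow_links.append(href)

SAMPLE_HTML = """
<p><a href="https://www.example.com/guide">Trusted guide</a></p>
<p><a href="https://spam.example.net" rel="nofollow">Untrusted link</a></p>
"""

extractor = LinkExtractor()
extractor.feed(SAMPLE_HTML)
print("follow:", extractor.follow_links)
print("nofollow:", extractor.nofollow_links)
```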