Let’s start with the crawling process. Google visits pages using software often referred to as a crawler or a spider, which downloads the page. Then the indexing process begins: collecting, analyzing, and saving the data of pages. When a search engine user enters a phrase, the index is searched and the best results are displayed. More information regarding Google Search can be found in Google’s official documentation on how Search works.
An example from a Pulno report. These errors will cause indexation issues.
Search engines have limited resources for crawling websites. Webmasters who create websites with interesting content get more visits from Google’s robots. Crawl budget is the amount of resources that a search engine allocates to crawling a given website. Making sure that your crawl budget is as big as possible is recommended.
One of our case studies indicated a significant improvement once about 70% of low-quality content had been removed and the website had been sped up. This clearly shows that the crawl budget of search engines should be optimized.
List of factors affecting crawl budget:
Example content of a robots.txt file:
User-agent: *
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/
Robots.txt informs a web crawler about which areas of the website should not be processed.
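How a crawler interprets these rules can be checked directly with Python’s standard urllib.robotparser module. The sketch below feeds it a shortened version of the example rules above and asks which paths a crawler may fetch; the blog path is a made-up example.

```python
# Minimal sketch: testing robots.txt rules with Python's stdlib parser.
from urllib.robotparser import RobotFileParser

# A shortened version of the example rules above.
rules = """\
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches the wildcard group, so disallowed paths are blocked
# while everything else remains crawlable.
print(parser.can_fetch("Googlebot", "/administrator/index.php"))  # False
print(parser.can_fetch("Googlebot", "/blog/crawl-budget"))        # True
```

Running such a check before deploying a new robots.txt is a cheap way to avoid accidentally blocking pages that should be crawled.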
A broken robots.txt might cause indexing errors. Common issues include:
Page indexing may be managed at the page level with the “robots” meta tag or the X-Robots-Tag HTTP header. The robots meta tag can take many parameters; in the context of this article, the most important one is "noindex".
When search engines detect the “noindex” parameter in the X-Robots-Tag header or in the robots meta tag, they will not index the page content. Consequently, that page will not be visible in search results. The noindex directive is useful for pages we do not want indexed (e.g. duplicated content or pages under construction). In most other cases, leaving the “noindex” parameter in place is a mistake. Checking whether you really want to block search engines from a page is recommended!
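The detection logic described above can be sketched as a small helper: it reports a page as non-indexable if either the X-Robots-Tag header or a robots meta tag contains "noindex". The HTML snippets and header values below are illustrative, and a production crawler would use a real HTML parser rather than a regex.

```python
# Hedged sketch: does this page carry a "noindex" signal?
import re

def has_noindex(x_robots_tag, html):
    """Return True if the header or a robots meta tag contains 'noindex'."""
    if x_robots_tag and "noindex" in x_robots_tag.lower():
        return True
    # Naive scan for <meta name="robots" ...> tags; illustrative only.
    for match in re.finditer(r'<meta[^>]+name=["\']robots["\'][^>]*>', html, re.I):
        if "noindex" in match.group(0).lower():
            return True
    return False

print(has_noindex("noindex, nofollow", ""))                              # True
print(has_noindex(None, '<meta name="robots" content="noindex">'))       # True
print(has_noindex(None, '<meta name="robots" content="index,follow">'))  # False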
Link depth information in Pulno
Link depth may be defined as the number of clicks required to reach a given page from the home page. Creating a proper structure for a big website is a great challenge; it is much easier for smaller domains with just a couple dozen pages. In simplified terms, when pages are linked internally from the home page, its linking power is divided among them equally. On bigger domains, the number of internal links the home page can hold is limited, so some pages will not be linked from it, which results in less linking power for those pages. Most often the home page has the best backlinks, so planning the website’s structure so that all pages can be reached in as few clicks as possible is recommended. We suggest adding links to important pages as close to the home page as possible.
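Link depth as defined above can be computed with a breadth-first search over the internal-link graph, starting from the home page. The sketch below uses a small, entirely hypothetical site structure; the URLs are made up for illustration.

```python
# Minimal sketch: link depth = shortest click path from the home page,
# computed with breadth-first search over a hypothetical link graph.
from collections import deque

links = {
    "/": ["/blog", "/pricing"],
    "/blog": ["/blog/crawl-budget", "/blog/robots-txt"],
    "/pricing": [],
    "/blog/crawl-budget": ["/pricing"],
    "/blog/robots-txt": [],
}

def link_depths(graph, home="/"):
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target not in depths:  # first visit = shortest click path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

print(link_depths(links))
```

Pages that never appear in the result are orphans: they cannot be reached from the home page at all, which is usually an even bigger problem than high link depth.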
Password-protected pages will not be indexed because search engines do not know the passwords. Password protection is a good solution for pages that are under construction. However, if a page is meant to be indexed, make sure it is not password protected.
Search engines display their results based on what they are able to find and index. Making sure that your website is available to Google, Bing, DuckDuckGo, and other search engines is recommended.
Jacek Wieczorek is the co-founder of Pulno. Since 2006, he has been optimizing and managing websites that generate traffic counted in hundreds of thousands of daily visits.
Audit any website for crawlability issues. Free trial!