Indexing large numbers of documents

HOW DOES GOOGLE FIND WEBSITES?

An Ahrefs study shows that over 90% of Internet content gets no traffic from Google[2]. So how does Google find websites[3],[4]?

Let’s start with the crawling process. Google visits pages using software commonly referred to as a crawler or a spider, which downloads the page. Then the indexing process begins: the page data is collected, analyzed and stored[1]. When a search engine user enters a phrase, the index is searched and the best results are displayed. More information about Google Search can be found in Google’s documentation[3],[4].


TIPS FOR BETTER INDEXING

  • create fast websites
  • create mobile-friendly websites
  • publish up-to-date and useful content
  • create good page titles
  • create good page headers
  • make sitemaps available to search engines (see the example sitemap after this list)
  • make sure every page is linked internally
  • avoid broken pages
  • eliminate server errors
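
For instance, one of the tips above is to make a sitemap available. A minimal XML sitemap (all URLs and dates below are placeholders) might look like the sketch below; it can be submitted in Google Search Console or referenced from robots.txt with a Sitemap: line.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2020-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/contact/</loc>
  </url>
</urlset>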


An example from a Pulno report. These errors will cause indexation issues.


CRAWL BUDGET

Search engines have limited resources for crawling websites. Crawl budget is the amount of those resources that search engines allocate to a given website. Webmasters who create websites with interesting content get more visits from Google’s robots, so making the crawl budget as big as possible is recommended.

One of our case studies showed a significant improvement in:

  • the number of pages indexed per day
  • rankings (from positions 22-25 to position 6 for the main keyword)

after about 70% of low-quality content was removed and the website was sped up. This clearly shows that crawl budget should be optimized.

List of factors affecting crawl budget:

  • page speed - the faster a website is, the more often Google can visit it
  • attaching session IDs to URLs (see the example below)
  • duplicate content
  • broken pages
  • hacked content
  • spam
  • low quality pages
  • redirects (several consecutive redirects in particular)
  • popularity of a website
  • relevancy of a website
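
For instance, session IDs appended to URLs make a crawler fetch the same content under many different addresses (the URLs below are placeholders):

https://www.example.com/product?sessionid=123
https://www.example.com/product?sessionid=456

One common remedy is a canonical tag in the page’s <head>, so that search engines consolidate the duplicates into the preferred address:

<link rel="canonical" href="https://www.example.com/product">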



ROBOTS.TXT

Robots.txt is one of the first files Google requests while crawling a website[5],[6]. It should be available in the root directory of the domain, e.g.

https://www.bizdb.co.uk/robots.txt

Example robots.txt content:

User-agent: *
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/

Robots.txt informs a web crawler about which areas of the website should not be processed.



A broken robots.txt might cause indexing errors. Common issues include:

  • blocking access to JavaScript files[6]
  • blocking access to images
  • blocking access to CSS files

Allowing all search engines access to these files[7] is recommended (see the sketch below).
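
A possible sketch of such a robots.txt, assuming a site that blocks a couple of private directories (the directory names are placeholders). Google and most major crawlers honor Allow rules and the * and $ wildcards, although these are not part of the original robots.txt specification:

User-agent: *
Disallow: /cache/
Disallow: /tmp/
Allow: /*.css$
Allow: /*.js$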


ROBOTS META TAG AND X-ROBOTS-TAG

Page indexing may be managed at the page level with the “robots” meta tag or the X-Robots-Tag HTTP header. The robots meta tag accepts many parameters[8]. In the context of this article, the most important one is "noindex".

<link rel="stylesheet" href="/assets/layerslider/css/layerslider.css" type="text/css">
<meta name="robots" content="noindex, follow">
<script type="text/javascript">

If search engines detect the “noindex” parameter in the X-Robots-Tag header or in the robots meta tag, they will not index the page content. Consequently, that page will not be visible in search results. The noindex directive is useful for pages that should not be indexed (e.g. duplicate content or pages under construction). In most cases, however, a leftover “noindex” parameter is a mistake. Check whether you really want to block search engines from indexing a page!
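
The same directive can also be delivered as an HTTP response header, which is useful for non-HTML resources such as PDF files. A minimal sketch of such a response:

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex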


LINK DEPTH

Link depth information in Pulno

Link depth may be defined as the number of clicks required to reach a given page from the home page. Creating a proper structure for a big website is a great challenge; it is much easier for smaller domains with just a couple dozen pages. In simplified terms, when pages are linked internally from the home page, its linking power is divided among them. On bigger domains, the number of internal links on the home page is limited, so some pages will not be linked from it and will receive less linking power. The home page usually has the best backlinks, so plan the website’s structure in such a way that every page can be reached in as few clicks as possible. We suggest adding links to important pages as close to the home page as possible, as in the simplified example below.
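
A simplified illustration with placeholder URLs:

https://www.example.com/                  - home page (depth 0)
https://www.example.com/shoes/            - linked directly from the home page (depth 1)
https://www.example.com/shoes/model-x/    - reached via the category page (depth 2)

The deeper a page sits, the less internal linking power reaches it.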


PASSWORDS


Password-protected pages will not be indexed because search engines do not know the passwords. Password protection is a good solution for pages that are under construction. However, if a page is meant to be indexed, make sure it is not password protected.


CONCLUSION

Search engines display their results based on what they are able to find and index. Make sure your website is available to Google, Bing, DuckDuckGo and other search engines.


Sources:

  1. https://www.shoutmeloud.com/google-crawling-and-indexing.html
  2. https://ahrefs.com/blog/search-traffic-study/
  3. https://support.google.com/webmasters/answer/70897?hl=en
  4. https://www.google.com/search/howsearchworks/
  5. https://support.google.com/webmasters/answer/6062608?hl=en
  6. https://yoast.com/dont-block-css-and-js-files/
  7. https://searchengineland.com/google-search-console-warnings-issued-for-blocking-javascript-css-226227
  8. https://developers.google.com/search/reference/robots_meta_tag?hl=en



Jacek Wieczorek is the co-founder of Pulno. Since 2006, he has been optimizing and managing websites that generate traffic counted in hundreds of thousands of daily visits. 

