A part of SEO is to make sure Google can crawl your website and index all your primary pages. Google is sometimes picky about what pages on a site it will index.
If you have a website indexation problem and want to get more of your pages indexed on Google, read on.
Check out the new Indexation Report In Google Search Console
This is sure to be an invaluable addition to Google Search Console for some larger sites.
If you submit an XML sitemap file in Search Console, Google will help you better understand why certain pages are not indexed.
As you can see, Google goes to great lengths in 2018 to help you to identify indexation problems on your website, including, in this example:
|Error||Submitted URL marked ‘noindex’|
|Error||Server errors (5xx)|
|Error||Submitted URL blocked by robots.txt|
|Error||Submitted URL seems to be a Soft 404|
|Excluded||Excluded by ‘noindex’ tag|
|Excluded||Page with redirect|
|Excluded||Duplicate page without canonical tag|
|Excluded||Crawled – currently not indexed|
|Excluded||Discovered – currently not indexed|
|Excluded||Blocked by robots.txt|
|Excluded||Alternate page with proper canonical tag|
|Excluded||Submitted URL not selected as canonical|
|Valid||Indexed, not submitted in sitemap|
|Valid||Submitted and indexed|
How To Access The New Indexation Report In Search Console
The new Google Search Console is not available to everyone yet.
If you don’t get access to the new version in your country, you may be able to use this hack:
Replace the “www.whatever.com” domain in the url above to access the new Search console.
Read my notes on how to submit a site to Google Search Console.
Will Google Index every page on Your Website?
QUOTE: “We never index all known URLs, that’s pretty normal. I’d focus on making the site awesome and inspiring, then things usually work out better“. John Mueller, 2018
Some URLs are not that important to Google, some are duplicates, some have conflicting indexation instructions and some pages are low-quality or even spammy.
Read my article on what Google considers low-quality pages.
Does Google crawl an XML sitemap and does it crawl the entire sitemap once it starts?
A question was asked in a recent Google Hangout by someone with a website indexation problem:
QUOTE: “How often does Google crawl an XML sitemap and does it crawl the entire sitemap once it starts?”
An XML sitemap is inclusive, not exclusive.
QUOTE: “sitemap files do help us to better understand a website and to better figure out which parts of website need to be recrawled so specifically if you have information in like the last modification date that really helps us to figure out which of these pages are new or have changed that need to be recrawled.” John Mueller Google
There will be URLs on your site that are not in the XML sitemap that Google will crawl and index. There are URLs in your XML sitemap that Google will probably crawl and not index.
QUOTE: “if you’re looking at the sitemap files in search console you have information on how many URLs are indexed from those sitemap files the important part there is that we look at exactly the URL that you list in the sitemap file so if we index the URL with a different parameter or with different upper or lower case or a slash at the end or not then all of that matters for for that segment file so that that might be an issue to kind of look out there” John Mueller 2017
QUOTE: “in the sitemap file we primarily focus on the last modification date so that’s that’s what we’re looking for there that’s where we see that we’ve crawled this page two days ago and today it has changed therefore we should recrawl it today we don’t use priority and we don’t use change frequency in the sitemap file at least at the moment with regards to crawling so I wouldn’t focus too much on priority and change frequency but really on the more factual last modification date information an RSS feed is also a good idea with RSS you can use pubsubhubbub which is a way of getting your updates even faster to Google so using pubsubhubbub is probably the fastest way to to get content where you’re regularly changing things on your site and you want to get that into Google as quickly as possible an RSS feed with pubsubhubbub is is a really fantastic way to get that done.” John Mueller Google 2017
QUOTE: “so a [XML] sitemap file helps us to understand which URLs on your website are new or have recently changed so in the second file you can specify a last modification date and with that we can kind of judge as we need to crawl next to make sure that we’re not like behind in keeping up with your website’s indexing so if you have an existing website and you submit a sitemap file and the sitemap file has realistic change dates on there then in an ideal case we would look at that and say oh we know about most of these URLs and here are a handful of URLs that we don’t know about so we’ll go off and crawl those URLs it’s not the case that submitting a sitemap file will replace our normal crawling it essentially just adds to the existing crawling that we do“. John Mueller 2017
Can I put my sitemap file into separate smaller files? Yes.
QUOTE: “Another thing that sometimes helps is to split the sitemap files up into separate chunks of logical chunks for your website so that you understand more where pages are not being indexed and then you can see are the products not being indexed or the categories not being indexed and then you can drill down more and more and figure out where where there might be problems that said we don’t guarantee indexing so just because a sitemap file has a bunch of URLs and it doesn’t mean that we will index all of them that’s still kind of something to keep in mind but obviously you can try to narrow things down a little bit and see where where you could kind of improve that situation.” John Mueller, 2017
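As a rough illustration of splitting a sitemap into logical chunks, the sketch below groups URLs by their first path segment so that each group could be written to its own sitemap file. The function names, the `sitemap-<section>.xml` naming pattern and the example.com URLs are my own assumptions, not anything Google prescribes.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_urls_by_section(urls):
    """Group URLs by first path segment, e.g. /products/x -> 'products'."""
    groups = defaultdict(list)
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        # URLs with no path segment (the home page) go into a 'root' group.
        section = segments[0] if segments else "root"
        groups[section].append(url)
    return dict(groups)


def sitemap_filenames(groups):
    """Hypothetical naming scheme: one sitemap file per section."""
    return {section: f"sitemap-{section}.xml" for section in sorted(groups)}


if __name__ == "__main__":
    urls = [
        "https://www.example.com/",
        "https://www.example.com/products/blue-widget",
        "https://www.example.com/categories/widgets",
    ]
    print(sitemap_filenames(group_urls_by_section(urls)))
```

With one file per section, the indexed counts Search Console reports for each sitemap should make it easier to see whether it is, say, products or categories that are being left out.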
The URL is naturally important in an XML sitemap. The only other XML sitemap attribute you should really be concerned about is the DATE LAST MODIFIED. You can ignore the FREQUENCY attribute:
QUOTE – “we don’t use that at all ….no we only use the date in the [XML] sitemap file “ John Mueller, Google 2017
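A minimal sketch of a sitemap that carries only the URL and the last modification date might look like this. The `build_sitemap` helper and the example.com pages are illustrative assumptions; the element names (`urlset`, `url`, `loc`, `lastmod`) and the namespace come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified_date) pairs.

    Only <loc> and <lastmod> are written; change frequency and
    priority are deliberately left out.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = last_modified.isoformat()
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        ("https://www.example.com/", date(2018, 5, 1)),
        ("https://www.example.com/about", date(2018, 4, 12)),
    ]))
```

The dates should reflect when the page really changed, since the quotes above suggest Google relies on them being realistic.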
How many times a week is the index status data in search console updated?
It is updated 2-3 times a week.
Should you use sitemaps with last modified for expired content?
Expired pages can be picked up quickly if you use a last modified date.
Why Doesn’t Google Crawl and Index My Website XML Sitemap Fully?
QUOTE: “So we don’t guarantee indexing. So just because something is in a sitemap file isn’t a guarantee that we will actually index it. It might be completely normal that we don’t actually index all of those pages… that even if you do everything technically correct there’s no guarantee that we will actually index everything.” John Mueller, 2018
I have looked at a lot of sites with such indexation problems. In my experience the most common reasons for poor indexation levels of a sitemap on a site with thousands or millions of pages are:
- doorway pages
- thin pages
Pages that are almost guaranteed to get into Google’s index have one common feature: They have unique content on them.
In short, if you are building doorway type pages without unique content on them, Google won’t index them all properly. If you are sloppy, and also produce thin pages on the site, Google won’t exactly reward that behaviour either.
QUOTE: “with regards to product pages not being indexed in Google again that’s something where maybe that’s essentially just working as intended where we just don’t index everything from them from any website. I think for most websites if you go into the sitemap section or the indexing section you’ll see that we index just a fraction of all of the content on the website. I think for any non-trivial sized website indexing all of the content would be a very big exception and I would be very surprised to see that happen.” John Mueller, Google 2017
Google rewards (in 2018) a smaller site with fat, in-depth pages a lot more than a larger site with millions of thinner pages.
Perhaps Google can work out how much unique text a particular site has on it and weighs that score with the number of pages the site produces. Who knows.
The important takeaway is: “Why would Google let millions of your autogenerated pages rank, anyway?”
QUOTE: “really create something useful for users in individual locations maybe you do have some something unique that you can add there that makes it more than just a doorway page“. John Mueller, Google 2017
Google Not Indexing URLs In Your Sitemap? Creating New Sitemaps Won’t Help
It is unlikely that modifying your XML sitemaps alone will result in more pages on your site being indexed if the reason the URLs are not indexed in the first place is quality-related:
QUESTION: “I’ve 100 URLs in a xml sitemap. 20 indexed and 80 non-indexed. Then I uploaded another xml sitemap having non-indexed 80 URLs. So same URL’s in multiple sitemap. . . Is it a good practice? Can it be harmful or useful for my site?”
and from Google:
QUOTE: “That wouldn’t change anything. If we’re not indexing 80 URLs from a normal 100 URL site, that sounds like you need to work on the site instead of on sitemap submissions. Make it awesome! Make every page important!” John Mueller, 2018
Most links in your XML Sitemap should be Canonical and not redirects
Google wants final destination URLs and not links that redirect to some other location.
QUOTE: “in general especially, for landing pages…. we do recommend to use the final destination URL in the sitemap file a part of the reason for that is so that we can report explicitly on those URLs in search console …. and you can look at the indexing information just for that sitemap file and that’s based on the exact URLs that you have there. The other reason we recommend doing that is that we use a sitemaps URLs as a part of trying to understand which URL should be the canonical for a piece of content so that is the URL that we should show in the search results and if the sitemap file says one URL and it redirects to a different URL then you’re giving us kind of conflicting information.” John Mueller, Google 2018
QUOTE: “actually the date the last modification date of some of these URLs because with that date we can figure out do we need to recall these URLS to figure out what is new or what is different on these URLs or are these old URLs that basically we might already know about we decided not to index so what I would recommend doing there is creating an XML sitemap file with the dates with the last modification dates just to make sure that Google has all of the information that it can get.” John Mueller, Google 2018
Read my article on managing redirects properly on a site.
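One way to spot sitemap URLs that redirect is to request each one without following redirects and note any 3xx response. The sketch below uses only the Python standard library; `head_status` and `find_redirecting_urls` are names I have made up for illustration, and it assumes your server answers HEAD requests the same way it answers GET.

```python
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following 3xx responses."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def head_status(url, timeout=10):
    """Return (status_code, location_header) for a single HEAD request."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, None
    except urllib.error.HTTPError as err:
        # Unfollowed redirects (and 4xx/5xx responses) surface here.
        return err.code, err.headers.get("Location")


def find_redirecting_urls(urls, fetch=head_status):
    """Map each redirecting URL to the location it points at."""
    redirects = {}
    for url in urls:
        status, location = fetch(url)
        if 300 <= status < 400:
            redirects[url] = location
    return redirects
```

Anything this returns is a candidate for replacing with its final destination URL in the sitemap. The `fetch` parameter is there so the check can be run against recorded responses rather than the live site.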
Sometimes non-canonical versions of your URLs are indexed instead
QUOTE: “I would recommend doing there is double-checking the URLs themselves and double-checking how they’re actually indexed in Google so it might be that we don’t actually index the URL as you listed in the sitemap file but rather a slightly different version that is perhaps linked in within your website so like I mentioned before the trailing slash is very common or ducked up the non WWW(version) – all of those are technically different URLs and we wouldn’t count that for the sitemap as being indexed if we index it with a slightly different URL.” John Mueller, Google 2018
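To double-check this, you could compare the URLs in your sitemap against a list of URLs you know to be indexed and look for near misses. The sketch below is an assumption-laden illustration: `indexed_urls` stands in for whatever export or manual check you have available, and the "loose" comparison ignores only the protocol, a leading `www.`, upper/lower case, a trailing slash and query parameters.

```python
from urllib.parse import urlsplit


def loose_key(url):
    """Reduce a URL to host + path, ignoring the common variations."""
    parts = urlsplit(url.lower())
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    path = parts.path.rstrip("/") or "/"
    return host + path


def find_variant_matches(sitemap_urls, indexed_urls):
    """Map each sitemap URL to a differently written URL that is indexed."""
    exact = set(indexed_urls)
    indexed_by_key = {}
    for url in indexed_urls:
        indexed_by_key.setdefault(loose_key(url), url)

    variants = {}
    for url in sitemap_urls:
        if url in exact:
            continue  # indexed exactly as listed, nothing to report
        match = indexed_by_key.get(loose_key(url))
        if match is not None:
            variants[url] = match
    return variants
```

Each pair returned is a sitemap URL that would not be counted as indexed for that sitemap file, even though a slightly different version of the page is in the index.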
It is ‘normal’ for Google NOT to index all the pages on your site.
What Is the maximum size limit of an XML Sitemap?
QUOTE: “We support 50 megabytes for a sitemap file, but not everyone else supports 50 megabytes. Therefore, we currently just recommend sticking to the 10 megabyte limit,“ John Mueller, Google 2014
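A quick size check along these lines could be run before a sitemap is published. The two limits are taken from the 2014 quote above and are passed in as parameters, so treat them as assumptions to adjust; the function name and return labels are my own.

```python
MEGABYTE = 1024 * 1024


def sitemap_size_status(xml_text, recommended=10 * MEGABYTE, hard_limit=50 * MEGABYTE):
    """Classify a sitemap by its uncompressed size in bytes."""
    size = len(xml_text.encode("utf-8"))
    if size > hard_limit:
        return "over_hard_limit"
    if size > recommended:
        return "over_recommended"
    return "ok"
```

A file that comes back as anything other than "ok" is a candidate for splitting into smaller sitemap files.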
Google wants to know when primary page content is updated, not when supplementary page content is modified – “if the content significantly changes, that’s relevant. If the content, the primary content, doesn’t change, then I wouldn’t update it.“
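One way to follow that advice is to store a fingerprint of each page's primary content and only move the last modified date when the fingerprint changes. This is a sketch under my own assumptions: the `record` dictionary is a hypothetical stand-in for however your CMS stores page data, and a hash detects any change to the main content, not just a "significant" one, so you may want a stricter test.

```python
import hashlib
from datetime import date


def content_fingerprint(main_content):
    """Hash the primary content, ignoring differences in whitespace."""
    normalised = " ".join(main_content.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def updated_record(record, main_content, today):
    """Return the page record, bumping 'lastmod' only if the content changed.

    record is a dict with 'fingerprint' and 'lastmod' keys.
    """
    fingerprint = content_fingerprint(main_content)
    if fingerprint == record["fingerprint"]:
        return record  # sidebar, footer or ad changes never reach this hash
    return {"fingerprint": fingerprint, "lastmod": today}
```

Because only the main content is hashed, changes to supplementary content such as navigation or related links leave the date alone.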
Why Is The Number Of Indexed URLs in Search Console Dropping?
Google has probably worked out you are creating doorway-type pages with no-added-value.
QUOTE: “The Panda algorithm may continue to show such a site for more specific and highly-relevant queries, but its visibility will be reduced for queries where the site owner’s benefit is disproportionate to the user’s benefit.” Google
Page Quality & Site Quality
Google measures quality on a per-page basis and also looks at the site overall (with site-wide quality being affected by the quality of individual pages).
Do noindexed pages have an impact on site quality?
Only indexable pages have an impact on site quality. You can use a noindex on low-quality pages if page quality cannot be improved.
QUOTE: “If you if you have a website and you realize you have low-quality content on this website somewhere then primarily of course we’d recommend increasing the quality of the content if you really can’t do that if there’s just so much content there that you can’t really adjust yourself if it’s user-generated content all of these things then there there might be reasons where you’d say okay I’ll use a no index for the moment to make sure that this doesn’t affect the bigger picture of my website.” John Mueller, Google 2017
You should only be applying noindex to pages as a temporary measure at best.
Google wants you to improve the content that is indexed to improve your quality scores.
Read my article on how to create seo-friendly content.
What Are The Low-Quality Signals Google Looks For?
QUOTE: “Low quality pages are unsatisfying or lacking in some element that prevents them from achieving their purpose well. These pages lack expertise or are not very trustworthy/authoritative for the purpose of the page.” Google Quality Evaluator Guidelines, 2017
These include but are not limited to:
- Lots of spammy comments
- Low-quality content that lacks EAT signals (Expertise + Authority + Trust)
- NO Added Value for users
- Poor page design
- Malicious, harmful or deceptive practices detected
- Negative reputation
- Auto-generated content
- No website contact information
- Fakery or INACCURATE information
- Website not maintained
- Pages just created to link to others
- Pages lack purpose
- Keyword stuffing
- Inadequate customer service pages
- Sites that use practices Google doesn’t want you to use
Pages can get a neutral rating too.
Pages that have “Nothing wrong, but nothing special” about them don’t “display characteristics associated with a High rating”, and that puts you in the middle ground – probably not a sensible place to be a year or so down the line.
Read my article on what are low-quality pages to Google.
What Are The High-Quality Characteristics of a Web Page?
QUOTE: “High quality pages are satisfying and achieve their purpose well.” Google Quality Evaluator Guidelines, 2017
The following are examples of what Google calls ‘high-quality characteristics’ of a page and should be remembered:
- “A satisfying or comprehensive amount of very high-quality” main content (MC)
- Copyright notifications up to date
- Functional page design
- Page author has Topical Authority
- High-Quality Main Content
- Positive Reputation or expertise of website or author (Google yourself)
- Very helpful SUPPLEMENTARY content “which improves the user experience.“
- Google wants to reward ‘expertise’ and ‘everyday expertise’ or experience so make this clear (Author Box?)
- Accurate information
- Ads can be at the top of your page as long as they do not distract from the main content on the page
- Highly satisfying website contact information
- Customised and very helpful 404 error pages
- Evidence of expertise
- Attention to detail
If Google can detect investment in time and labour on your site – there are indications that they will reward you for this (or at least – you won’t be affected when others are, meaning you rise in Google SERPs when others fall).
Read my article on what are high-quality pages to Google.
Help Google Help You Index More Pages
Minimise the number of doorway-type pages on your site
You will need another content strategy. If you are forced to employ this type of page, you need to do it in a better way.
QUOTE: “if you have a handful of locations and you have unique valuable content to provide for those individual locations I think providing that on your website is perfectly fine if you have hundreds of locations then putting out separate landing pages for every city or every region is almost more like creating a bunch of doorway pages so that’s something I really discourage” John Mueller Google 2017
Are you making ‘doorway pages’ and don’t even know it? See my notes on what are doorway pages to Google.
Minimise the number of thin pages on your site
You will need to check how sloppy your CMS is. Make sure it does not inadvertently produce pages with little to no unique content on them (especially if you have ads on them).
QUOTE: “John says to avoid lots of “just automatically generated” pages and “if these are pages that are not automatically generated, then I wouldn’t see that as web spam.” Conversely then “automatically generated” content = web spam? It is evident Googlebot expects to see a well formed 404 if no page exists at a url.” Shaun Anderson, Hobo
Read my article on what are thin pages to Google.
Have your site produce proper 404 pages
This will prevent the automatic creation of thin pages and could help protect against negative SEO attacks, too.
QUOTE: “Tell visitors clearly that the page they’re looking for can’t be found. Use language that is friendly and inviting. Make sure your 404 page uses the same look and feel (including navigation) as the rest of your site. Consider adding links to your most popular articles or posts, as well as a link to your site’s home page. Think about providing a way for users to report a broken link. No matter how beautiful and useful your custom 404 page, you probably don’t want it to appear in Google search results. In order to prevent 404 pages from being indexed by Google and other search engines, make sure that your webserver returns an actual 404 HTTP status code when a missing page is requested.” Google, 2018
For more information, read how to create useful 404 pages.
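The key technical point in the Google quote above is that a missing page must return a real 404 status code, not a friendly page served with a 200. As a minimal sketch, here is a tiny WSGI application that does this; the `KNOWN_PAGES` table and the HTML are placeholders I have invented, and a real site would do the lookup in its CMS or framework.

```python
# Hypothetical content store: path -> HTML body.
KNOWN_PAGES = {
    "/": "<h1>Home</h1>",
    "/about": "<h1>About</h1>",
}

NOT_FOUND_HTML = (
    "<h1>Page not found</h1>"
    "<p>Sorry, we could not find that page. "
    'Try the <a href="/">home page</a>.</p>'
)


def app(environ, start_response):
    """Serve known pages with 200 and everything else with a real 404."""
    path = environ.get("PATH_INFO", "/")
    body = KNOWN_PAGES.get(path)
    if body is None:
        status, body = "404 Not Found", NOT_FOUND_HTML
    else:
        status = "200 OK"
    payload = body.encode("utf-8")
    start_response(status, [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]
```

To try it locally you could serve `app` with `wsgiref.simple_server.make_server` from the standard library and request a URL that does not exist; the response should show the custom page along with a 404 status.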
Block your internal search function on your site.
QUOTE: “Use the robots.txt file on your web server to manage your crawling budget by preventing crawling of infinite spaces such as search result pages. Keep your robots.txt file up to date.” Google (2017)
This will prevent the automatic creation of thin pages and could help protect against negative SEO attacks, too. Read a beginner’s guide to robots.txt files.
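As a small illustration, the sketch below holds a robots.txt rule that blocks an internal search path and uses Python's built-in `urllib.robotparser` to confirm the rule behaves as expected. The `/search` path is an assumption; use whatever URL pattern your own site search produces.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /search is assumed to be the internal search path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text allows crawling the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/search?q=widgets",
        "https://www.example.com/about",
    ):
        print(url, is_crawlable(ROBOTS_TXT, url))
```

Running a check like this before deploying a robots.txt change is a cheap way to make sure you have not blocked pages you want crawled.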
Use canonicals properly
QUOTE: “If your site contains multiple pages with largely identical content, there are a number of ways you can indicate your preferred URL to Google. (This is called “canonicalization”.)” Google
This will help consolidate signals in the correct pages. See how to use canonical link elements properly.
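To audit how canonicals are set, you can pull the canonical link element out of a page's HTML and compare it with the URL you expect. This is a minimal sketch using the standard library's HTML parser; the class and function names are my own, and it only looks at the `<link rel="canonical">` element, not at HTTP headers.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

A page whose declared canonical differs from the URL listed in your sitemap is sending the kind of conflicting information described earlier.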
Use proper pagination control on paginated sets of pages
This will help with duplicate content issues.
QUOTE: “Use rel="next" and rel="prev" links to indicate the relationship between component URLs. This markup provides a strong hint to Google that you would like us to treat these pages as a logical sequence, thus consolidating their linking properties and usually sending searchers to the first page.” Google
Read my article on how to use pagination properly.
Use proper indexation control on pages
Some pages on your site may need a meta noindex tag on them.
Identify your primary content assets and improve them instead of optimising low-quality pages (which will get slapped in a future algorithm update).
Read my article on how to use the meta robots noindex tag.
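If you want to see which of the pages you submit in your sitemap carry a noindex, a check along these lines may help. It is a sketch under my own assumptions: `pages` is a hypothetical mapping of URL to already-downloaded HTML, and only the meta robots tag is inspected, so a noindex sent in an `X-Robots-Tag` HTTP header would be missed.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect directives from <meta name="robots"> or <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if tag == "meta" and name in ("robots", "googlebot"):
            content = attributes.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def has_noindex(html):
    """Return True if the page's meta robots tag includes noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


def submitted_but_noindexed(pages):
    """pages: dict of sitemap URL -> HTML. Returns the URLs marked noindex."""
    return sorted(url for url, html in pages.items() if has_noindex(html))
```

Any URL this returns is being submitted for indexing and blocked from it at the same time, so it belongs either out of the sitemap or without the noindex.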
How To Deal With Search Console Indexation Report Errors
How to deal with “Submitted URL marked ‘noindex’” and “Excluded by ‘noindex’ tag” notifications in Search Console
Why are you creating pages and asking Google to noindex them? There is always a better way than to noindex pages. Review the pages you are making and check they comply with Google guidelines e.g. are not doorway pages. Check if technically there is a better way to handle noindexed pages.
How to handle “Page with redirect” notifications in Search Console
Why do you have URLs in your sitemap that are redirects? This is not ideal. Review and remove the redirects from the sitemap.
What does “Indexed, not submitted in sitemap” mean in Search Console?
It means Google has crawled your site and found more pages than you have in your sitemap. Depending on the number of pages indicated, this could be a non-issue or a critical issue.
Make sure you know the types of pages you are attempting to get indexed and the page types your CMS produces.
How to deal with “Duplicate page without canonical tag”, “Alternate page with proper canonical tag” and “Submitted URL not selected as canonical” notifications in Search Console
Review how you use canonical link elements throughout the site.
How To Deal With “Crawl anomaly” notifications in Search Console:
QUOTE: “An unspecified anomaly occurred when fetching this URL. This could mean a 4xx- or 5xx-level response code; try fetching the page using Fetch as Google to see if it encounters any fetch issues. The page was not indexed.” Google, 2018
How To Deal With Crawled – currently not indexed:
QUOTE: “The page was crawled by Google, but not indexed. It may or may not be indexed in the future; no need to resubmit this URL for crawling.”
These could be problematic. You should check to see if pages you want indexed are included in this list of URLs. If they are, this could be indicative of a page quality issue.
Read this official article for a full list of new features in the Google Search Console Indexation Report.
What Are The Best SEO Tools for Website Crawl & Indexation Problems in 2018?
If you are not technically minded, we can analyse and optimise your website for you as part of our fixed price SEO service.