Separating the wheat from the chaff.
Being ‘indexed’ is important. If a page isn’t indexed, it cannot be returned by Google in its search engine results pages (SERPs).
While getting as many pages as possible indexed in Google was historically a priority for an SEO, Google is now rating the quality of pages on your site and the type of pages it is indexing. So bulk indexation is no guarantee of success – in fact, it’s a risk in 2016 to index all pages on your site, especially if you have a large, sprawling site.
If you have a lot of low-quality pages (URLs) indexed on your site compared to high-quality pages (URLs), Google has told us it is marking certain sites down for that.
Some URLs simply no longer belong in Google’s index as part of your website content.
Do I need to know which pages are indexed?
No. Knowing is useful, of course, but largely unnecessary. Indexation is never a guarantee of traffic.
Some SEOs scrape Google to get indexation data on a website. I’ve never bothered with that. Most sites I work with have XML sitemap files, so an obvious place to start looking at such issues is Google Search Console.
Google will tell you how many pages you have submitted in a sitemap, and how many pages are indexed. It will not tell you which pages are indexed, but if there is a LARGE discrepancy between SUBMITTED and INDEXED, it’s very much worth digging deeper.
If Google is de-indexing large swaths of your content that you have actually submitted as part of an xml sitemap, then a problem is often afoot.
Unfortunately, with this method you don’t get to see the pages the CMS produces outwith the XML sitemap – so this is not a full picture of the ‘health’ of your website.
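To see both sides of that gap, you can compare the URL list from your XML sitemap with the URL list from an actual crawl. A minimal sketch, assuming you have already exported both lists from your own sitemap and crawl tool (the example URLs here are hypothetical):

```python
# Compare URLs submitted in an XML sitemap with URLs a crawler
# actually discovered on the site. File exports and URLs below are
# placeholders – substitute your own sitemap and crawl data.

def compare_url_sets(sitemap_urls, crawled_urls):
    """Return (orphans, outwith): sitemap URLs the crawl never found,
    and crawled URLs the CMS produces outside the sitemap."""
    sitemap = set(sitemap_urls)
    crawled = set(crawled_urls)
    orphans = sitemap - crawled   # submitted, but not linked internally
    outwith = crawled - sitemap   # generated by the CMS, never submitted
    return orphans, outwith

sitemap = ["https://example.com/", "https://example.com/services/"]
crawl = ["https://example.com/", "https://example.com/services/",
         "https://example.com/tag/page-2/"]   # e.g. CMS-generated tag page
orphans, outwith = compare_url_sets(sitemap, crawl)
print(outwith)  # URLs to investigate for thin or auto-generated content
```

The `outwith` set is usually where the surprises live – template pages you never meant to publish, let alone submit to Google.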
Identifying Dead Pages
I usually start with a performance analysis that involves merging data from a physical crawl of a website with analytics data and Webmaster Tools data. A content-type analysis will identify the type of pages the CMS generates, and a content performance analysis will gauge how well each section of the site performs.
If you have 100,000 pages on a site, and only 1,000 pages get organic traffic from Google over a 3-6 month period – you can make the argument 99% of the site is rated as ‘crap’ (at least as far as Google rates pages these days).
I group pages like these together as ‘dead pages’ for further analysis – deadweight, or ‘dead’ for short.
The thinking is if the pages were high-quality, they would be getting some kind of organic traffic.
Identifying which pages receive no organic visitors over a sensible timeframe is a quick, if noisy, way to separate pages that obviously WORK from pages that DONT – and will help you clean up a large portion of redundant URLs on the site.
It helps to see page performance in the context of longer timeframes as some types of content can be seasonal, for instance, and produce false positives over a shorter timescale. It is important to trim content pages carefully – and there are nuances.
Experience can educate you when a page is high-quality and yet receives no traffic. If the page is thin, but is not manipulative, is indeed ‘unique’ and delivers on a purpose with little obvious detectable reason to mark it down, then you can say it is a high-quality page – just with very little search demand for it. Ignored content is not the same as ‘toxic’ content.
False positives aside, once you identify the pages receiving no traffic, you very largely isolate the type of pages on your site that Google doesn’t rate – for whatever reason. A strategy for these pages can then be developed.
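The classification itself is simple once the crawl and analytics data are merged on URL. A sketch of the idea – the thresholds here are illustrative assumptions to tune per site, not Google guidance:

```python
# Classify crawled URLs by organic sessions over a 3-6 month window.
# Thresholds are hypothetical examples – adjust them for your site.

DEAD_THRESHOLD = 0    # no organic visits at all in the window
POOR_THRESHOLD = 10   # a trickle of traffic; still underperforming

def classify_pages(crawled_urls, organic_sessions):
    """organic_sessions maps url -> session count; URLs absent from
    analytics are treated as receiving zero organic visits."""
    report = {}
    for url in crawled_urls:
        sessions = organic_sessions.get(url, 0)
        if sessions <= DEAD_THRESHOLD:
            report[url] = "DEAD"
        elif sessions <= POOR_THRESHOLD:
            report[url] = "POOR"
        else:
            report[url] = "ALIVE"
    return report

sessions = {"/services/": 540, "/tag/page-2/": 0, "/old-news/": 4}
print(classify_pages(["/services/", "/tag/page-2/", "/old-news/"], sessions))
```

Remember the false positives discussed above: a ‘DEAD’ label from this kind of script is a prompt for manual review, not an automatic deletion list.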
Identifying Content That Can Potentially Hurt Your Rankings
As you review the pages, you’re probably going to find pages that include:
- out of date, overlapping or irrelevant content
- collections of pages not paginated properly
- indexable pages that shouldn’t be indexed
- stub pages
- indexed search pages
- pages with malformed HTML and broken images
- auto-generated pages with little value
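Several of the page types above can be flagged automatically from a crawl export. A rough sketch, assuming your crawler exports URL, title and word count – the thresholds and URL patterns are assumptions to adapt to your own CMS:

```python
# Flag likely problem pages from a crawl export: stubs by word count,
# template duplication by repeated titles, and indexed internal search
# results by URL pattern. Thresholds and patterns are illustrative.

from collections import Counter

def flag_problem_pages(pages, min_words=150):
    """pages: list of dicts with 'url', 'title' and 'word_count' keys."""
    title_counts = Counter(p["title"] for p in pages)
    flags = {}
    for p in pages:
        issues = []
        if p["word_count"] < min_words:
            issues.append("stub/thin")
        if title_counts[p["title"]] > 1:
            issues.append("duplicate title")
        if "?s=" in p["url"] or "/search" in p["url"]:
            issues.append("indexed search page")
        if issues:
            flags[p["url"]] = issues
    return flags
```

Again, this only shortlists candidates – whether a flagged page is genuinely low-quality, or just ignored, still needs a human eye.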
You will probably find ‘dead’ pages you didn’t even know your CMS produced (hence why an actual crawl of your site is required, rather than just working from a list of URLs from an XML sitemap, for instance).
Those pages need to be cleaned up, Google has said. And remaining pages should:
‘stand on their own’ – John Mueller, Google
Google doesn’t like auto-generated pages in 2016, so you don’t want Google indexing these pages in a normal fashion. Judicious use of the ‘noindex,follow’ directive in robots meta tags, and sensible use of the canonical link element, are required implementations on most sites I see these days.
The aim in 2016 is to have as few ‘low-quality’ pages on a site as possible, using as few dated SEO techniques as possible.
The pages that remain after a URL clear-out can be reworked and improved.
In fact – they MUST BE improved if you are to win more rankings and get more Google organic traffic in future.
This is time-consuming – just like Google wants it to be. You need to review DEAD pages with a forensic eye and ask:
- Are these pages high-quality and very relevant to a search term?
- Are these pages free of content duplicated elsewhere on the site?
- Do these pages offer substantial unique text content, rather than being automatically generated?
- Is the purpose of this page met WITHOUT sending visitors to another page (as with doorway pages)?
- Will these pages ever pick up natural links?
- Is the intent of these pages to inform first, rather than to profit from organic traffic through advertising?
- Are these pages FAR superior to the competition currently in Google for the search term you want to rank for? This is actually very important.
If the answer to any of the above is NO – then it is imperative you take action to minimise the number of these types of pages on your site.
What about DEAD pages with incoming backlinks or a lot of text content?
Bingo! Use 301 redirects (or canonical link elements) to point any asset you have with some value to Googlebot at an equivalent, up-to-date section of your site. Do NOT just redirect these pages to your homepage.
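Building that old-to-new mapping by hand is tedious on a large site. One crude way to draft it is string similarity on URL paths – a sketch using the standard library’s `difflib`; the example URLs are hypothetical and any generated map should be reviewed by hand before the 301s go live:

```python
# Draft a 301 redirect map from dead URLs to the closest live
# equivalent, instead of dumping everything on the homepage.
# Path similarity is a crude heuristic – always review by hand.

import difflib

def build_redirect_map(dead_urls, live_urls, cutoff=0.6):
    mapping = {}
    for url in dead_urls:
        match = difflib.get_close_matches(url, live_urls, n=1, cutoff=cutoff)
        if match and match[0] != url:
            mapping[url] = match[0]   # 301 candidate: old -> new
        # no confident match: consider serving a 410 Gone instead
    return mapping

dead = ["/services/seo-audit-2014/"]
live = ["/services/seo-audit/", "/contact/"]
print(build_redirect_map(dead, live))
# {'/services/seo-audit-2014/': '/services/seo-audit/'}
```

Pages with no sensible equivalent are better removed cleanly than force-redirected somewhere irrelevant.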
Rework available content before you bin it
High-quality content is expensive – so rework content when it is available. Medium-quality content can almost always be made higher quality – in fact, a page is hardly ever finished in 2016. EXPECT to come back to your articles every six months to improve them and keep them moving in the right direction.
Sensible grouping of content types across the site can often leave you with substantial text content that can be reused and repackaged. Content originally spread thinly over multiple pages, consolidated into one page and reworked around a topic, tends to have a considerably more successful time of it in Google SERPs in 2016.
Well, it does if the page you make is useful and has a purpose other than just to make money.
REMEMBER – DEAD pages are only one aspect of a site review. There’s going to be a large percentage of any site that gets a little organic traffic but still severely underperforms, too – tomorrow’s DEAD pages. I call these POOR pages in my reviews.
Minimise Low-Quality Content & Overlapping Text Content
Google may well be able to recognise ‘low-quality’ a lot better than it recognises ‘high-quality’ – so if ‘low-quality’ pages make up most of your site, they are potentially what you are actually going to be rated on – now, or in the future – NOT your high-quality content.
This has been more or less explained by Google spokespeople like John Mueller. He is constantly on about ‘folding’ thin pages together these days (and I can say that certainly has a positive impact on many sites).
While his advice in this instance might be specifically about UGC (user generated content like forums) – I am more interested in what he has to say when he talks about the algorithm looking at the site “overall” and how it ‘thinks’ when it finds a mixture of high-quality pages and low-quality pages.
And Google has clearly said in print:
‘low-quality content on part of a site can impact a site’s ranking as a whole’
Avoid Google’s punitive algorithms
Fortunately, we don’t actually need to know and fully understand the ins-and-outs of Google’s algorithms to know what the best course of action is.
The sensible thing, in light of Google’s punitive algorithms, is simply not to let Google index (or, more accurately, rate) low-quality pages on your site. And certainly – stop publishing new ‘thin’ pages. Don’t put your site at risk.
If pages get no organic traffic anyway, are out-of-date for instance, and improving them would take a lot of effort and expense, why let Google index them normally, if by rating them it impacts your overall score? Clearing away the low-quality stuff lets you focus on building better stuff on other pages that Google will rank in 2016 and beyond.
Ideally you would have a giant site and every page would be high-quality – but that’s not practical.
A myth is that pages need a lot of text to rank. They don’t, but a lot of people still try to make text bulkier and unique page-to-page.
While that theory is sound (when focused on a single page, where the intent is to deliver utility content to a Google user), using old-school SEO techniques across many pages of an especially large site seems to amplify site-quality problems after recent algorithm changes – so this type of optimisation, without keeping an eye on overall site quality, is self-defeating in the long run.