Need Help With Your SEO?

SEO Consultancy Services

Duplicate Content SEO Advice From Google

“Important: The Lowest rating is appropriate if all or almost all of the MC (main content) on the page is copied with little or no time, effort, expertise, manual curation, or added value for users. Such pages should be rated Lowest, even if the page assigns credit for the content to another source.” – Google Search Quality Evaluator Guidelines, March 2017

TLDR: ‘Duplicate content’ is NOT mentioned once in the recently published Search Quality Raters Guidelines. ‘Copied content’ is. Semantics aside, Google evidently treats duplicate content differently from copied content, with the difference being the INTENT and nature of the duplicated text.

Duplicated content is commonplace on many websites, is usually free from malicious intent, and is not manipulative in itself. Copied content, by contrast, can be penalised algorithmically or manually. Duplicate content is not penalised, but it is often not an optimal set-up for pages, either. Be VERY careful ‘spinning’ copied text to make it unique!

If you need help identifying if any problematic content is on your website, our SEO audit can give you answers. View SEO service prices.

If you are doing SEO yourself, read on, as I attempt to break down a common SEO misunderstanding.

UPDATE:

Quote: "This is an awesome summary of duplicate content & web-search." John Mueller, Google Webmaster Trends Analyst

Table of Contents

Google Q&A on Duplicate Content

In the video above (June 17 2016), Google’s Andrey Lipattsev was adamant: Google DOES NOT have a duplicate content penalty.

It’s a good video for a refresher on the subject of duplicated content on your site.

He clearly wants people to understand it is NOT a penalty if Google discovers your content is not unique and doesn’t rank your page above a competitor’s page.

Also, as John Mueller points out, Google picks the best option to show users depending on who they are and where they are. So sometimes, your duplicate content will appear to users where relevant.

This latest advice from Google is useful in that it clarifies Google’s position, which I quickly paraphrase below:

  • There is no duplicate content penalty
  • Google rewards UNIQUENESS and the signals associated with ADDED VALUE
  • Google FILTERS duplicate content
  • Duplicate content can slow Google down in finding new content
  • XML sitemaps are just about the BEST technical method of helping Google discover your new content
  • Duplicate content is probably not going to set your marketing on fire
  • Google wants you to concentrate signals in canonical documents, and it wants you to focus on making these canonical pages BETTER for USERS.
  • For SEO, the real issue is not necessarily the abundance of duplicate content on a website. It is the absence of the positive signals that unique content and added value provide that will fail to help you rank faster and better in Google.
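The XML sitemap point above is easy to act on. Below is a minimal sketch, using only the Python standard library, of generating a sitemap that lists only your preferred canonical URLs (the example URLs are hypothetical), so Googlebot spends its crawl budget on the pages you actually want indexed:

```python
# Minimal XML sitemap generator. Listing ONLY canonical URLs helps Google
# discover new content without wasting crawl budget on duplicate URLs.
# The URLs below are hypothetical examples.
from xml.etree.ElementTree import Element, SubElement, tostring

def build_sitemap(canonical_urls):
    urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url in canonical_urls:
        entry = SubElement(urlset, "url")
        SubElement(entry, "loc").text = url
    return tostring(urlset, encoding="unicode")

print(build_sitemap([
    "https://www.example.com/",
    "https://www.example.com/services/",
]))
```

Submit the generated file in Search Console (Webmaster Tools) and reference it from robots.txt with a Sitemap: line.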

A sensible strategy for SEO would still appear to be to reduce Googlebot crawl expectations and consolidate ranking equity and potential in high-quality canonical pages, and you do that by minimising duplicate or near-duplicate content.

A self-defeating strategy would be to ‘optimise’ low-quality or non-unique pages or present low-quality pages to users.

QUOTE: "Do SEO? Then read this now: I'm only about 300 words in, had to share. Great piece by @Hobo_Web"

Duplicate Content SEO Best Practice

Webmasters are confused about ‘penalties’ for duplicate content, which is a natural part of the web landscape: Google claims there is NO duplicate content penalty, yet rankings can apparently be negatively impacted by what look like ‘duplicate content’ problems.

The reality in 2017 is that if Google classifies your duplicate content as THIN content, or MANIPULATIVE BOILER-PLATE or NEAR-DUPLICATE ‘SPUN’ content, then you probably DO have a severe problem that violates Google’s website performance recommendations, and this ‘violation’ will need to be cleaned up – if, of course, you intend to rank high in Google.

Google wants us to understand, in 2017, that MANIPULATIVE BOILER-PLATE or NEAR DUPLICATE ‘SPUN’ content is NOT ‘duplicate content’.

Duplicate content is not necessarily ‘spammy’ to Google.

Copied content, on the other hand, can be. For example:

Content which is copied, but changed slightly from the original. This type of copying makes it difficult to find the exact matching original source. Sometimes just a few words are changed, or whole sentences are changed, or a “find and replace” modification is made, where one word is replaced with another throughout the text. These types of changes are deliberately done to make it difficult to find the original source of the content. We call this kind of content “copied with minimal alteration.” – Google Search Quality Evaluator Guidelines, March 2017

At the ten minute mark in a recent video, John Mueller of Google also clarified, with examples, that there is:

“No duplicate content penalty” but “We do have some things around duplicate content … that are penalty worthy”.

 

What Is Duplicate Content?

Here is a definition from Google:

Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin…

It’s crucial to understand that if, in 2017, as a Webmaster, you republish posts, press releases, news stories or product descriptions found on OTHER sites, then your pages are very definitely going to struggle to gain traction in Google’s SERPs (search engine results pages).

Google doesn’t like using the word ‘penalty’, but if your entire site is made entirely of republished content, Google does not want to rank it above others who provide more of a ‘value add’ – and that can be in many areas.

If you have a multiple site strategy selling the same products – you are probably going to cannibalise your traffic in the long run, rather than dominate a niche, as you used to be able to do.

This is all down to how a search engine filters duplicate content found on other sites – and the experience Google aims to deliver for its users – and its competitors.

Mess up duplicate content handling on a website, and it might look like a penalty, as the end result is the same – important pages that once ranked might not rank again – and new content might not get crawled as fast as a result.

Your website might even get a ‘manual action’ for thin content.

Worst-case scenario: your website is hit by the Google Panda algorithm.

A good rule of thumb is: do NOT expect to rank high in Google with content found on other, more trusted sites, and don’t expect to rank at all if all you are using is automatically generated pages with no ‘value add’.

While there are exceptions to the rule, (and Google certainly treats your OWN duplicate content on your OWN site differently), your best bet in ranking in 2017 is to have one single (canonical) version of content on your site with rich, unique text content that is written specifically for that page.

Google wants to reward RICH, UNIQUE, RELEVANT, INFORMATIVE and REMARKABLE content in its organic listings – and it has raised the quality bar over the last few years.

If you want to rank high in Google for valuable key phrases and for a long time – you better have good, original content for a start – and lots of it.

A very interesting statement in a recent webmaster hangout was “how much quality content do you have compared to low-quality content”. That indicates Google is looking at this ratio. John says to identify “which pages are high-quality, which pages are lower quality, so that the pages that do get indexed are really the high-quality ones”.

Gary Illyes chipped in recently (Feb 2017):

DYK Google doesn’t have a duplicate content penalty, but having many URLs serving the same content burns crawl budget and may dilute signals

What is Boilerplate Content?

Wikipedia says of ‘boilerplate’ content:

Boilerplate is any text that is or can be reused in new contexts or applications without being greatly changed from the original. WIKI

…and Google says to:

Minimize boilerplate repetition

Google is very probably looking to see if your pages ‘stand on their own’ – as John Mueller is fond of saying.

How would they do that algorithmically? Well, they could look to see if text blocks on your pages were unique to the page, or were very similar blocks of content to other pages on your site.

If this ‘boilerplate’ content is the content that makes up the PRIMARY content of multiple pages – Google can easily filter to ignore – or penalise – this practice.

The sensible move would be to listen to Google – and minimise – or at least diffuse – the instances of boilerplate text, page-to-page on your website.

Note that THIN CONTENT exacerbates SPUN BOILERPLATE TEXT problems on a site, as thin content just creates more pages that can only be filled with boilerplate text – itself a problem.

E.g. if a product has 10 URLs – one URL for each colour of the product, for instance – then the TITLE, META DESCRIPTION & PRODUCT DESCRIPTION (and other elements on the page) for these extra pages will probably rely on BOILERPLATE techniques to create them, and in doing so, you create 10 URLs on the site that do ‘not stand on their own’ and essentially duplicate text across pages.
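You can get a rough, do-it-yourself signal of this problem. The sketch below (the product descriptions are hypothetical) compares the main text of two pages using Python's standard-library difflib; a ratio near 1.0 between supposedly distinct pages is a hint that they do not 'stand on their own':

```python
# Rough boilerplate check: how similar is the main text of two pages?
# A ratio near 1.0 means the pages are near-duplicates of each other.
from difflib import SequenceMatcher

def similarity(text_a, text_b):
    return SequenceMatcher(None, text_a, text_b).ratio()

# Hypothetical colour-variant product descriptions
red = "Acme Widget, available in red. Free UK delivery on all widgets."
blue = "Acme Widget, available in blue. Free UK delivery on all widgets."

print(round(similarity(red, blue), 2))  # very close to 1.0
```

Run across every pair of pages on a small site, this quickly surfaces the templated text that should be consolidated or rewritten.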

It’s worth listening to John Mueller’s recent advice on this point. He clearly says that the practice of making your text more ‘unique’, using low-quality techniques is:

“probably more counter productive than actually helping your website”

If you have many pages of similar content on your site, Google might have trouble choosing the page you want to rank, and it might dilute your capability to rank for what you do want to rank for.

For instance, if you have ‘PRINT-ONLY’ versions of web pages (Joomla used to have major issues with this), that can end up displaying in Google instead of your web page if you’ve not handled it properly.

That’s probably going to have an impact on conversions and link building – for instance. Poorly implemented mobile sites can cause duplicate content problems, too.

Should I Rewrite Product Descriptions To Make The Text Unique?

Probably.

Whatever you do, beware ‘spinning the text’ – Google might have an algorithm or two focused on that sort of thing:

Google: John Mueller says "Do NOT spin text on your website"

How Does Google Rate ‘Copied’ Main Content?

This is where you are swimming upstream in 2017. Copied content is not a long-term strategy for creating a unique page better than your competitors’ pages.

In the latest Google Search Quality Evaluator Guidelines that were published on 14 March 2017, Google states:

7.4.5 Copied Main Content

Every page needs MC. One way to create MC with no time, effort, or expertise is to copy it from another source. Important: We do not consider legitimately licensed or syndicated content to be “copied” (see here for more on web syndication). Examples of syndicated content in the U.S. include news articles by AP or Reuters.

The word “copied” refers to the practice of “scraping” content, or copying content from other non-affiliated websites without adding any original content or value to users (see here for more information on copied or scraped content).

If all or most of the MC on the page is copied, think about the purpose of the page. Why does the page exist? What value does the page have for users? Why should users look at the page with copied content instead of the original source? Important: The Lowest rating is appropriate if all or almost all of the MC on the page is copied with little or no time, effort, expertise, manual curation, or added value for users. Such pages should be rated Lowest, even if the page assigns credit for the content to another source.

and

7.4.6 More About Copied Content

All of the following are considered copied content:

● Content copied exactly from an identifiable source. Sometimes an entire page is copied, and sometimes just parts of the page are copied. Sometimes multiple pages are copied and then pasted together into a single page. Text that has been copied exactly is usually the easiest type of copied content to identify.

● Content which is copied, but changed slightly from the original. This type of copying makes it difficult to find the exact matching original source. Sometimes just a few words are changed, or whole sentences are changed, or a “find and replace” modification is made, where one word is replaced with another throughout the text. These types of changes are deliberately done to make it difficult to find the original source of the content. We call this kind of content “copied with minimal alteration.”

● Content copied from a changing source, such as a search results page or news feed. You often will not be able to find an exact matching original source if it is a copy of “dynamic” content (content which changes frequently). However, we will still consider this to be copied content. Important: The Lowest rating is appropriate if all or almost all of the MC on the page is copied with little or no time, effort, expertise, manual curation, or added value for users. Such pages should be rated Lowest, even if the page assigns credit for the content to another source.

Is There A Penalty For Duplicate Content On A Website?

Google has given us some explicit guidelines when it comes to managing duplication of content.

[Image: filtering of duplicate content]

John Mueller clearly states in the video where I grabbed the above image:

“We don’t have a duplicate content penalty. It’s not that we would demote a site for having a lot of duplicate content.”

and

“You don’t get penalized for having this kind of duplicate content”

…in which he was talking about very similar pages. John says to “provide… real unique value” on your pages.

I think that could be understood as: Google is not compelled to rank your duplicate content.

If it ignores it, it’s different from a penalty. Your original content can still rank, for instance.

An e-commerce SEO tip from John: you might have variations of a product – “colors… for [a] product page, but you wouldn’t create separate pages for that”. With these types of pages, you are “always balancing having really, really strong pages for these products, versus having, kind of, medium strength pages for a lot of different products”.

John says:

“one kind of really, really strong generic page” trumps “hundreds” of mediocre ones.

If “essentially, they’re the same, and just variations of keywords”, that should be OK, but if you have ‘millions’ of them – Googlebot might think you are building doorway pages, and that IS risky.

Generally speaking, Google will identify the best pages on your site if you have a decent on-site architecture and unique content.

The advice is to avoid duplicate content issues if you can and this should be common sense.

Google wants (and rewards) original content – it’s a great way to push up the cost of SEO and create a better user experience at the same time.

Google doesn’t like it when ANY TACTIC is used to manipulate its results, and republishing content found on other websites is a common practice of a lot of spam sites.

Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. Google.

You don’t want to look anything like a spam site; that’s for sure – and Google WILL classify your site… as something.

The more you can make it look like a human made every page on a page-by-page basis, with content that doesn’t appear exactly in other areas of the site, the more Google will ‘like’ it.

Google does not like automation when it comes to building the main content of a text-heavy page; that’s clear in 2017.

I don’t mind multiple copies of articles on the same site – as you find with WordPress categories or tags – but I wouldn’t have tags and categories, for instance, and expect them to rank well on a small site with a lot of higher-quality competition, and especially not when targeting the same keyword phrases in a way that can cannibalise your rankings.

I prefer to avoid repeated unnecessary content on my site, and when I do have 100% automatically generated or syndicated content on a site, I tell Google NOT to index it with a noindex in meta tags or X-Robots-Tag, or block it out completely in robots.txt.
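For reference, the noindex mechanism mentioned above is a standard snippet placed in the page's <head> (the equivalent HTTP header for non-HTML files is X-Robots-Tag: noindex):

```html
<!-- Ask search engines not to index this page. Crawling is still allowed,
     which is what lets the tag be seen and obeyed. -->
<meta name="robots" content="noindex">
```

Remember that a robots.txt Disallow rule blocks crawling, not indexing, so a blocked URL can still appear in results; for content you never want indexed, the meta tag or header is usually the better tool.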

I am probably doing the safest thing, as that could be seen as manipulative if I intended to get it indexed.

Google won’t thank you, either, for spidering a calendar folder with 10,000 blank pages on it, or a blog with more categories than original content – why would they?

Quote: "Duplicate Content SEO Advice From Google - My vote for best SEO post this week."

How Does Google Rate Content ‘Deliberately Duplicated Across Domains’?

It can see it as manipulative:

“…in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic.

Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results.

Google tries hard to index and show pages with distinct information.

This filtering means, for instance, that if your site has a “regular” and “printer” version of each article, and neither of these is blocked with a noindex meta tag, we’ll choose one of them to list.

In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved.

As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results.” – Google

If you are trying to compete in competitive niches, you need original content that’s not found on other pages in the same form on your site – and THIS IS EVEN MORE IMPORTANT WHEN THAT CONTENT IS FOUND ON OTHER PAGES ON OTHER WEBSITES.

Google isn’t under any obligation to rank your version of content – in the end, it depends on whose site has the most domain authority or the most links coming to the page.

Well, historically at least – today in 2017 – it is often the page that satisfies users the most.

If you want to avoid being filtered by duplicate content algorithms, produce unique content.

How To Manage Content Spread Across Multiple Domains

How To Manage Duplicate Content When Reporting News Stories

How To Deal With Content Spread Across Multiple TLDs?

What is ‘Near-Duplicate’ Content, to Google?

When asked on Twitter, Googler Gary Illyes responded (2017):

Think of it as a piece of content that was slightly changed, or if it was copied 1:1 but the boilerplate is different.

Based on research papers, it might be the case that once Google detects a page is a near duplicate of something else, it is going to find it hard to rank this page against the source.
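Those research papers generally describe 'shingling': breaking a page's text into overlapping word sequences and comparing the resulting sets. A minimal sketch in Python (the example sentences are hypothetical, and real systems hash the shingles for scale):

```python
# Near-duplicate detection via w-shingling: split each text into
# overlapping 3-word sequences and compare the sets with Jaccard similarity.
def shingles(text, size=3):
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    set_a, set_b = shingles(a), shingles(b)
    return len(set_a & set_b) / len(set_a | set_b)

original = "duplicate content is content that appears in more than one place"
spun = "duplicated content is material that appears in more than one location"

print(round(jaccard(original, spun), 2))  # → 0.29
```

Even clumsy 'spinning' leaves shared shingles behind, which is why changed-a-few-words copies remain detectable against the source.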

Can Duplicate Content Rank in Google?

Yes. There are strategies where this will still work, in the short term.

Opportunities are (in my experience) reserved for local and long tail SERPs where the top ten results page is already crammed full of low-quality results, and the SERPs are shabby – certainly not a strategy for competitive terms.

There’s not a lot of traffic in long-tail results unless you do it en masse, and that could invite further site-quality issues, but sometimes it’s worth exploring whether using very similar content with geographic modifiers (for instance) on a site with some domain authority has opportunity.

Very similar content can be useful across TLDs too. A bit spammy, but if the top ten results are already a bit spammy…

If low-quality pages are performing well in the top ten of an existing long tail SERP – then it’s worth exploring – I’ve used it in the past.

I always thought if it improves user experience and is better than what’s there in those long tail searches at present, who’s complaining?

It’s not exactly best-practice SEO in 2017, and I’d be nervous about creating any low-quality pages on your site these days.

Too many low-quality pages might cause you site-wide issues in the future, not just page-level issues.

Original Content Is King, they say

Stick to original content, found on only one page on your site, for best results – especially if you have a new/young site and are building it page by page over time… and you’ll get better rankings and more traffic to your site (affiliates too!).

Yes – you can be creative and reuse and repackage content, but if I am asked to rank a page, I will always require original content on it.

Should I Block Google From Indexing My Duplicate Content?

 

No. There is NO NEED to block your own duplicate content.

There was a useful post in the Google forums a while back with advice from Google on how to handle very similar or identical content:

“We now recommend not blocking access to duplicate content on your website, whether with a robots.txt file or other methods” John Mueller

John also goes on to give some good advice about how to handle duplicate content on your own site:

  1. Recognize duplicate content on your website.
  2. Determine your preferred URLs.
  3. Be consistent on your website.
  4. Apply 301 permanent redirects where necessary and possible.
  5. Implement the rel=”canonical” link element on your pages where you can.
  6. Use the URL parameter handling tool in Google Webmaster Tools where possible.
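Steps 2 and 3 – determining preferred URLs and being consistent – can be partly automated. Here is a sketch using Python's standard library; the choice of https, the www host, and the tracking parameters stripped are example decisions for a hypothetical site, not Google requirements:

```python
# Normalise URL variants to one preferred form: force https and the www
# host, lower-case the host, drop tracking parameters and trailing slashes.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def preferred_url(url):
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host  # example preference: the www version
    path = parts.path.rstrip("/") or "/"
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING_PARAMS])
    return urlunsplit(("https", host, path, query, ""))

print(preferred_url("HTTP://Example.com/Page/?utm_source=rss"))
# → https://www.example.com/Page
```

Running every internal link through one function like this keeps step 3 (consistency) honest across templates.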

Webmaster guidelines on content duplication used to say:

Consider blocking pages from indexing: Rather than letting Google’s algorithms determine the “best” version of a document, you may wish to help guide us to your preferred version. For instance, if you don’t want us to index the printer versions of your site’s articles, disallow those directories or make use of regular expressions in your robots.txt file. Google

but now Google is pretty clear they do NOT want us to block duplicate content, and that is reflected in the guidelines.

Google does not recommend blocking crawler access to duplicate content (dc) on your website, whether with a robots.txt file or other methods.

If search engines can’t crawl pages with dc, they can’t automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages.

A better solution is to allow search engines to crawl these URLs, but mark them as duplicates by using the rel="canonical" link element, the URL parameter handling tool, or 301 redirects.

In cases where DC leads to us crawling too much of your website, you can also adjust the crawl rate setting in Webmaster Tools.

DC on a site is not grounds for action on that site unless it appears that the intent of the DC is to be deceptive and manipulate search engine results.

If your site suffers from DC issues, and you don’t follow the advice listed above, we do a good job of choosing a version of the content to show in our search results.

You want to minimise duplicate content rather than block it; I find the best solution to handling a problem is on a case-by-case basis.

Sometimes I will block Google when using OTHER people’s content on pages. I never block Google from working out my own content.

Google says it needs to detect an INTENT to manipulate Google to incur a penalty, and you should be OK if your intent is innocent, BUT it’s easy to screw up and LOOK as if you are up to something fishy.

It is also easy to fail to get the benefit of proper canonicalisation and consolidation of relevant primary content if you don’t do basic housekeeping, for want of a better turn of phrase.

Is A Mobile Site Counted As Duplicate Content?

How To Use Canonical Link Elements Properly

Google also recommends using the canonical link element to help minimise content duplication problems, and this is one of the most powerful tools at our disposal.

If your site contains multiple pages with largely identical content, there are a number of ways you can indicate your preferred URL to Google. (This is called “canonicalization”.)

Google SEO – Matt Cutts from Google shared tips on the rel=”canonical” tag (more accurately, the canonical link element) that the three top search engines now support. Google, Yahoo!, and Microsoft have all agreed to work together in a

“joint effort to help reduce duplicate content for larger, more complex sites, and the result is the new Canonical Tag”.

Example Canonical Tag From Google Webmaster Central blog:

<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />

You can put this link tag in the head section of the problem URLs if you think you need it.

I add a self-referring canonical link element as standard these days – to ANY web page.

Is rel=”canonical” a hint or a directive? 
It’s a hint that we honor strongly. We’ll take your preference into account, in conjunction with other signals, when calculating the most relevant page to display in search results.

Can I use a relative path to specify the canonical, such as <link rel=”canonical” href=”product.php?item=swedish-fish” />?
Yes, relative paths are recognized as expected with the <link> tag. Also, if you include a <base> link in your document, relative paths will resolve according to the base URL.

Is it okay if the canonical is not an exact duplicate of the content?
We allow slight differences, e.g., in the sort order of a table of products. We also recognize that we may crawl the canonical and the duplicate pages at different points in time, so we may occasionally see different versions of your content. All of that is okay with us.

What if the rel=”canonical” returns a 404?
We’ll continue to index your content and use a heuristic to find a canonical, but we recommend that you specify existent URLs as canonicals.

What if the rel=”canonical” hasn’t yet been indexed?
Like all public content on the web, we strive to discover and crawl a designated canonical URL quickly. As soon as we index it, we’ll immediately reconsider the rel=”canonical” hint.

Can rel=”canonical” be a redirect?
Yes, you can specify a URL that redirects as a canonical URL. Google will then process the redirect as usual and try to index it.

What if I have contradictory rel=”canonical” designations?
Our algorithm is lenient: We can follow canonical chains, but we strongly recommend that you update links to point to a single canonical page to ensure optimal canonicalization results.

Can this link tag be used to suggest a canonical URL on a completely different domain?
Update on 12/17/2009: The answer is yes! We now support a cross-domain rel=”canonical” link element.

Tip – Redirect old, out of date content to new, freshly updated articles on the subject, minimising low-quality pages and duplicate content while at the same time, improving the depth and quality of the page you want to rank.

See our page on 301 redirects – https://www.hobo-web.co.uk/how-to-change-domain-names-keep-your-rankings-in-google/.

Tips from Google

As with everything Google does, Google has had its critics over its use of duplicate content on its own site for its own purposes:

[Image: example of a scraper site]

 

There are some steps you can take to proactively address duplicate content issues, and ensure that visitors see the content you want them to:

Use 301s: If you’ve restructured your site, use 301 redirects (“RedirectPermanent”) in your .htaccess file to smartly redirect users, Googlebot, and other spiders. (In Apache, you can do this with an .htaccess file; in IIS, you can do this through the administrative console.)
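On an Apache server, those .htaccess rules might look something like this (the domain and paths are hypothetical examples); IIS admins would configure the equivalent redirects through the administrative console:

```apache
# Hypothetical example: permanently redirect a restructured URL,
# and consolidate the non-www hostname onto the preferred www domain.
RewriteEngine On

# Old page moved to a new canonical URL
Redirect 301 /old-page/ https://www.example.com/new-page/

# Send example.com requests to www.example.com, preserving the path
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]
```

Server-side 301s like these pass users and crawlers to one canonical URL, which is exactly the consolidation Google describes.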

Be consistent: Try to keep your internal linking consistent. For example, don’t link to http://www.example.com/page/ and http://www.example.com/page and http://www.example.com/page/index.htm.

I would also ensure your links are all the same case, and avoid capitalisation and lower case variations of the same URL.

This type of duplication can be quickly sorted by keeping internal linking consistent and by proper use of canonical link elements.

Use top-level domains: To help us serve the most appropriate version of a document, use top-level domains whenever possible to handle country-specific content. We’re more likely to know that http://www.example.de contains Germany-focused content, for instance, than http://www.example.com/de or http://de.example.com.

Google also tells Webmasters to choose a preferred domain to rank in Google:

Use Webmaster Tools to tell us how you prefer your site to be indexed: You can tell Google your preferred domain (for example, http://www.example.com or http://example.com).

…although you should ensure you handle such redirects server side, with 301 redirects redirecting all versions of a URL to one canonical URL (with a self-referring canonical link element).

Minimize boilerplate repetition: For instance, instead of including lengthy copyright text on the bottom of every page, include a very brief summary and then link to a page with more details. In addition, you can use the Parameter Handling tool to specify how you would like Google to treat URL parameters.

Understand If Your CMS Produces Thin Content or Duplicate Pages

Google says:

Understand your content management system: Make sure you’re familiar with how content is displayed on your website. Blogs, forums, and related systems often show the same content in multiple formats. For example, a blog entry may appear on the home page of a blog, in an archive page, and in a page of other entries with the same label.

WordPress, Magento, Joomla, Drupal – they all come with slightly different duplicate content (and crawl equity performance) challenges.

Will Google Penalise You For Syndicated Content?

No. When it comes to publishing your content on other websites:

Syndicate carefully: If you syndicate your content on other sites, Google will always show the version we think is most appropriate for users in each given search, which may or may not be the version you’d prefer. However, it is helpful to ensure that each site on which your content is syndicated includes a link back to your original article. You can also ask those who use your syndicated material to use the noindex meta tag to prevent search engines from indexing their version of the content.

The problem with syndicating your content is you can never tell if this will ultimately cost you organic traffic.

If it is on other websites – they might be getting ALL the positive signals from that content – not you.

It’s also worth noting that Google still clearly says in 2017 that you CAN put links back to your original article in posts that are republished elsewhere.

But you need to be careful with that too – as those links could be classified as unnatural links.

A few years ago, I made an observation: I think links that feature on duplicate posts that have been stolen – duplicated and republished – STILL pass anchor text value (even if it is a light boost).

In this example, someone stripped all the links out of my ‘what is SEO’ post and published the article as his own.

Well, he stripped out all the links apart from one link he missed:


[Image: snippet of the republished article]

 


Yes, the link to http://www.duny*.com.pk/ was actually still pointing to my home page.

This gave me an opportunity to look at something…..

The article itself wasn’t 100% duplicate – there was a small intro text, as far as I could see. It was clear by looking at Copyscape just how much of the article was unique and how much was duplicate.

So this was a three-year-old article republished on a low-quality site, with a link back to my site within a portion of the page that is clearly duplicate text.

I would have *thought* Google just ignored that link.

But no, Google did return my page for the following query (at the time):

Google SERP

The Google Cache notification (below) is no longer available, but it was a good little tool for digging a little deeper into how Google works:

Google Cache

… which indicated that Google will count links (AT SOME LEVEL) even on duplicate articles republished on other sites – probably depending on the search query, and the quality of the SERP at that time (perhaps even taking into consideration the quality score of the site with the most trust?).

I have no idea if this is the case even today.

Historically, syndicating your content via RSS and encouraging folk to republish it got you links that counted, on some level (which might be useful for long-tail searches).

Google is quite good at identifying the original article, especially if the site it’s published on has a measure of trust. I’ve never had a problem syndicating my content via RSS and letting others cross-post… but I do like at least a link back, nofollow or not.

Original Articles Come Top (usually)

The bigger problem with content syndication in 2017 is unnatural links and whether or not Google classifies your intent as manipulative.

If Google does class your intent to rank high with unnatural links, then you have a much more serious problem on your hands.

Does Google Penalise ‘Thin’ Content On A Website?

Yes. Google also offers advice about ‘thin’ content:

Avoid publishing stubs: Users don’t like seeing “empty” pages, so avoid placeholders where possible. For example, don’t publish pages for which you don’t yet have real content. If you do create placeholder pages, use the noindex meta tag to block these pages from being indexed.

and

Minimize similar content: If you have many pages that are similar, consider expanding each page or consolidating the pages into one. For instance, if you have a travel site with separate pages for two cities, but the same information on both pages, you could either merge the pages into one page about both cities or you could expand each page to contain unique content about each city.

The key things to understand about duplicate content on your web pages are:

  • ‘Duplicate content’ on your own website is not necessarily ‘copied content’
  • Duplicate content is a normal part of the churn of the web. Google will rank it – for a time. Human or machine-generated, there is a lot of it, and Google has a lot of experience handling it; there are many circumstances in which Google finds duplicate content on websites. Not all duplicate content is a bad thing.
  • If a page ranks well but Google deems it a manipulative, low-quality use of duplicate content with no added value, Google can demote the page – using manual or algorithmic actions.
  • There is a very thin line between reasonable duplicate content and thin content. This is where the confusion comes in.
  • Google explicitly states they don’t have a duplicate content penalty – but they do have a ‘thin content’ manual action… that looks and feels a lot like a penalty.

They also have Google Panda, an algorithm specifically designed to weed out low-quality content on websites.

How To Deal With Pagination Problems On Your Website

 

Paginated pages are not duplicate content, but often, it would be more beneficial to the user to land on the first page of the sequence.

Folding pages in a sequence and presenting a canonical URL for a group of pages has numerous benefits.

If you think you have paginated content problems on your website, it can be a frightening prospect to try and fix.

It is actually not that complicated.

Google knows that ‘Sites paginate content in various ways’ and it is used to dealing with this type of problem on different types of sites:

  • News and/or publishing sites often divide a long article into several shorter pages.

  • Retail sites may divide the list of items in a large product category into multiple pages.

  • Discussion forums often break threads into sequential URLs.

While Google says you can ‘do nothing‘ with paginated content, that might be taking a risk in a number of areas, and part of SEO in 2017 is to focus on ranking a canonical version of a URL at all times.

What you do to handle paginated content will depend on your circumstances.

A better recommendation on offer is to:

Specify a View All page. Searchers commonly prefer to view a whole article or category on a single page. Therefore, if we think this is what the searcher is looking for, we try to show the View All page in search results. You can also add a rel=”canonical” link to the component pages to tell Google that the View All version is the version you want to appear in search results. GOOGLE

and

Use rel="next" and rel="prev" links to indicate the relationship between component URLs. This markup provides a strong hint to Google that you would like us to treat these pages as a logical sequence, thus consolidating their linking properties and usually sending searchers to the first page. GOOGLE

You can also use meta robots ‘noindex,follow‘ directions on certain types of paginated content (I do), however, I would recommend you think twice before actually removing such content from Google’s index IF those URLs (or a portion of those URLs) generate a good amount of traffic from Google, and there is no explicit need for Google to follow the links to find content.

If a page is getting traffic from Google but needs to come out of the index, then I would ordinarily rely on an implementation that included the canonical link element (or redirect).

Ultimately, this depends on the situation and the type of site you are dealing with.

QUOTE: "These in depth articles from @Hobo_Web are excellent - well, well worth reading them to build your fundamental SEO knowledge. For free!"

How To Use Rel=Next & Rel=Previous Markup, Properly

Pagination can be a tricky concept and it is easy to mess up. The above video offers advice direct from Google.

Here are some notes to help you also:

  • The ‘rel=“next” and rel=“previous” markup is the standard for indicating paginated pages (and this can have any page parameters included)
  • Rel Canonical is for duplicate content, e.g. for session IDs OR for content which is a ‘superset’, e.g. a superset VIEW ALL page. If Google’s algorithms PREFER a page with a ‘misused’ canonical, they will pick the page they (or their users) prefer. Canonical is a HINT, not a DIRECTIVE Google is required to honour. A page removed from Google’s index because of a misused canonical will reappear in time if Google prefers it; Google will ignore the rel=canonical on that page and index it normally.
    Google does NOT completely ignore an entire site’s canonical instructions if misused – rather, it seems to do this URL by URL – but it might impact traffic levels in core updates, PAGE TO PAGE, as Google better understands your site, especially when the pages its USERS prefer (when matched to a query) are different to the URLs YOU specify in rel=canonical. This is what Google means by a ‘hint’, not a directive.
  • The ‘rel=“next” and rel=“previous” markup destination URLs should all return a 200 OK response header.
  • Page one in the series should have rel=next markup, but no rel=prev code, with the opposite being true of the last page in the series.
  • Pages in a series should have a self-referencing canonical tag unless they specify a ‘view all page’. rel=“canonical” tells Google to index ONLY the content on the page specified in the rel=“canonical”.
  • rel=“canonical” – Do NOT canonicalise component pages in a series to the first page.

You ONLY use rel=“canonical” to point to a VIEW ALL page if one is present; OTHERWISE, all pages SHOULD have a SELF-referencing canonical tag.

“In cases of paginated content, we recommend either a rel=canonical from component pages to a single-page version of the article, or to use rel=”prev” and rel=”next” pagination markup.” – GOOGLE

A common mistake web developers make is to add a rel=canonical to the first page in the series, or to add NOINDEX to component pages in the series.

“While it’s fine to set rel=”canonical” from a component URL to a single view-all page, setting the canonical to the first page of a parameter-less sequence is considered improper usage.” – GOOGLE

“When you implement rel=”next” and rel=”prev” on component pages of a series, we’ll then consolidate the indexing properties from the component pages and attempt to direct users to the most relevant page/URL. This is typically the first page. There’s no need to mark page 2 to n of the series with noindex.” – GOOGLE

A Google spokesperson explains why this is not optimal here: https://youtu.be/njn8uXTWiGg?t=11m52s
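Putting those rules together, the head of a component page in a paginated series might look like the sketch below (the example.com URLs are hypothetical). Page 2 carries a self-referencing canonical plus both rel="prev" and rel="next"; page 1 would omit rel="prev" and the last page would omit rel="next".

```html
<!-- Hypothetical page 2 of a paginated category at example.com -->
<head>
  <title>Widgets – Page 2</title>
  <!-- Self-referencing canonical: do NOT canonicalise component pages to page 1 -->
  <link rel="canonical" href="https://www.example.com/widgets/page/2/">
  <!-- Declare the sequence so Google can consolidate the component pages -->
  <link rel="prev" href="https://www.example.com/widgets/">
  <link rel="next" href="https://www.example.com/widgets/page/3/">
</head>
```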

RE: Time-sequential Series of Pages (e.g. in a blog) Google offers this advice:

QUESTION:

“Should I use the rel next/prev into [sic] the section of a blog even if the two contents are not strictly correlated (but they are just time-sequential)?” 

ANSWER: “In regard to using rel=”next” and rel=”prev” for entries in your blog that “are not strictly correlated (but they are just time-sequential),” pagination markup likely isn’t the best use of your time — time-sequential pages aren’t nearly as helpful to our indexing process as semantically related content, such as pagination on component pages in an article or category. 

It’s fine if you include the markup on your time-sequential pages, but please note that it’s not the most helpful use case.” – GOOGLE

So, for internal pages that are ordered by date of publishing, it is probably better to just let Google crawl these, and use NOINDEX,FOLLOW on such pages (blog sub-pages, date-based archives etc.) to remove them from the index.

Further reading: https://webmasters.googleblog.com/2012/03/video-about-pagination-with-relnext-and.html

Should You Block Google from Crawling Internal Search Result Pages?

Yes. According to Google.

Google wants you to use robots.txt to block internal search results. Google recommends “not allowing internal search pages to be indexed”.

While there are ways around this guideline that do not produce ‘infinite search spaces’, letting Google index and rank your internal search pages is a VERY risky manoeuvre (over time) if you are in a competitive industry.

These recommendations are actually in webmaster guidelines.

“Use the robots.txt file on your web server to manage your crawling budget by preventing crawling of infinite spaces such as search result pages. Keep your robots.txt file up to date.” Google (2017)
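For example, a site whose internal search results live under a /search/ path (or behind a query parameter) might use something like the sketch below in robots.txt – the exact paths are assumptions and must be adjusted to match your own URL structure:

```text
# Block crawling of internal search result pages ("infinite spaces")
User-agent: *
Disallow: /search/
# Google supports * wildcards in robots.txt patterns
Disallow: /*?s=
Disallow: /*?q=
```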

Letting Google crawl and index your internal search results pages is inefficient from a crawling and indexing perspective.

Such pages cause “problems in search” for Google, and Google has a history of ‘snapping back’ on companies who break such guidelines to their profit.

Are Uppercase and Lowercase URLs Counted as TWO different pages to Google?

Yes. Uppercase and lowercase versions of a URL are classed as TWO different pages for Google.

Best practice has long been to force lowercase URLs on your server, and be consistent when linking to internal pages on your website and use only lowercase URLs when creating internal links.

The video below offers recent (2017) confirmation of this challenge – with the advice being to use canonicals or redirects to fix this issue, whichever is more efficient from a crawling and indexing perspective (which I think to be 301 redirects in this instance, where necessary, plus an overhaul of the internal linking structure):
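On an Apache server with mod_rewrite, one common way to force lowercase URLs is a RewriteMap using the built-in tolower function. This is a sketch only – note that the RewriteMap directive must live in the server or virtual-host configuration, not in .htaccess:

```apache
# In the server/virtual-host config (RewriteMap is not permitted in .htaccess)
RewriteMap lc int:tolower

RewriteEngine On
# If the requested path contains any uppercase characters...
RewriteCond %{REQUEST_URI} [A-Z]
# ...301-redirect to the lowercase equivalent of the same path
RewriteRule (.*) ${lc:$1} [R=301,L]
```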

…and when asked recently (2017) on Twitter, former Googler Matt Cutts replied:

Is there ANY SEO-relevant reason to force lowercase urls?

Matt replied:

Unifying links on a single URL

Does The Google Panda Algorithm Penalise Duplicate Content?

Part of the Google Panda algorithm is focused on thin pages and (many think) on the ratio of good-quality to low-quality content on a site, with user feedback to Google acting as a proxy for satisfaction levels.

In the original announcement about Google Panda we were specifically told that the following was a ‘bad’ thing:

Does the site have duplicate, overlapping, or redundant articles?

If, as we are told, Google is rating your pages on content quality (or the lack of it) and on user satisfaction at some level, and a lot of your site is duplicate content that provides no positive user-satisfaction feedback to Google, then that may be a problem too.

Google offers some advice on thin pages (emphasis mine):

Here are a few common examples of pages that often have thin content with little or no added value:

1. Automatically generated content
2. Thin affiliate pages
3. Content from other sources. For example: scraped content or low-quality guest blog posts
4. Doorway pages

Everything I’ve bolded in the last two quotes is essentially about what many SEOs have traditionally labelled (incorrectly) as ‘duplicate content’.

This might be ‘semantics’, but Google calls that type of duplicate content ‘spam’.

Google is even more explicit when it tells you how to clean up this ‘violation’:

Next, follow the steps below to identify and correct the violation(s) on your site: Check for content on your site that duplicates content found elsewhere.

So beware.

Google says there is NO duplicate content penalty, but if Google classifies your duplicate content as copied content, thin content or boilerplate, spun text, then you MAY WELL have a problem!

A serious challenge, if your entire site is built like that.

And how Google rates thin pages changes over time, with a quality bar that is always going to rise and that your pages need to keep up with.

Especially if rehashing content is what you do.

Google Panda does not penalise a site for duplicate content, but it does measure site and content ‘quality’. Google Panda actually DEMOTES a site where it determines an intent to manipulate the algorithms.

Google Panda:

“measures the quality of a site pretty much by looking at the vast majority of the pages at least. But essentially allows us to take quality of the whole site into account when ranking pages from that particular site and adjust the ranking accordingly for the pages. [Google Panda] is an adjustment. Basically, we figured that site is trying to game our systems, and unfortunately, successfully. So we will adjust the rank. We will push the site back just to make sure that it’s not working anymore.” Gary Illyes, Google Spokesperson

To read more about Google Panda, you can check out my article on Google Panda and site quality issues.

TIP – Look out for soft 404 errors in Google Webmaster tools (now called Google Search Console) as examples of pages Google are classing as low-quality, user-unfriendly thin pages.

How To Check For Duplicate Content On A Website?

An easy way to find duplicate content is to use Google.

Just take a piece of text content from your site and put it “in quotes” as a search in Google.

Google will tell you how many pages in its index of the web contain that piece of content.

The best-known online duplicate content checker tool is Copyscape, and I particularly like this little tool too, which checks the duplicate content ratio between two selections of text.
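If you just want a rough, local check of how similar two selections of text are (in the spirit of such tools, not a replacement for them), Python's standard library can sketch it – the sample strings below are made up for illustration:

```python
from difflib import SequenceMatcher

def duplicate_ratio(text_a: str, text_b: str) -> float:
    """Return a rough 0-1 similarity score between two selections of text."""
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()

original = "Duplicate content is content that appears on more than one page."
copied = original  # a word-for-word copy
rewritten = "Completely different text about an unrelated subject entirely."

print(duplicate_ratio(original, copied))     # identical text scores 1.0
print(duplicate_ratio(original, rewritten))  # dissimilar text scores much lower
```

A score near 1.0 suggests the two selections are substantially duplicated; a low score suggests the text is largely unique.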

If you find evidence of plagiarism, you can file a DMCA or contact Google, but I haven’t ever bothered with that, and many folks have republished my articles over the years.

I once found my article (word for word) in someone else’s paid advert in a printed magazine!

Comments:

A few marketers and Google spokespeople have commented on this article on social media (their comments are presented as images throughout this article).

More reading


  • SEO Audit

    Our SEO Audits are designed to help identify underperformance on your site and help manage, consolidate and improve the website (in a legitimate way, by meeting Google's guidelines) in order to get more traffic and sales from Google.

    See Prices