Duplicate Content SEO Best Practice
Webmasters are often confused about penalties for duplicate content, which is a natural part of the web landscape. Google claims there is NO duplicate content penalty, yet rankings can apparently be hurt by what look like duplicate content problems.
The reality in 2016 is that if Google classifies your duplicate content as THIN content, or BOILER-PLATE content, then you DO have a severe problem that violates Google’s website performance recommendations and this ‘violation’ will need to be ‘cleaned’ up.
At the ten minute mark in the above video, John Mueller of Google has recently clarified, with examples, that there is:
“No duplicate content penalty” but “We do have some things around duplicate content … that are penalty worthy“
What Is Duplicate Content?
Here is a definition from Google:
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin…..
It’s crucial to understand that if, in 2016, as a Webmaster you republish posts, press releases, news stories or product descriptions found on other sites, then your pages are very definitely going to struggle to gain traction in Google’s SERPs (search engine results pages).
Google doesn’t like using the word ‘penalty’ but if your entire site is made up entirely of republished content – Google does not want to rank it.
If you have a multiple site strategy selling the same products – you are probably going to cannibalise your traffic in the long run, rather than dominate a niche, as you used to be able to do.
This is all down to how a search engine filters duplicate content found on other sites – and the experience Google aims to deliver for its users – and its competitors.
Mess up with duplicate content on a website, and it might look like a penalty as the end-result is the same – important pages that once ranked might not rank again – and new content might not get crawled as fast as a result.
Your website might even get a ‘manual action’ for thin content.
Worst case scenario, your website is hit by the Google Panda algorithm.
A good rule of thumb is: do NOT expect to rank high in Google with content found on other, more trusted sites, and don’t expect to rank at all if all you are using is automatically generated pages with no ‘value add’.
While there are exceptions to the rule, (and Google certainly treats your OWN duplicate content on your OWN site differently), your best bet in ranking in 2016 is to have one single version of content on your site with rich, unique text content that is written specifically for that page.
Google wants to reward RICH, UNIQUE, RELEVANT, INFORMATIVE and REMARKABLE content in its organic listings – and it’s raised the quality bar over the last few years.
If you want to rank high in Google for valuable key phrases and for a long time – you better have good, original content for a start – and lots of it.
A very interesting statement in a recent webmaster hangout was “how much quality content do you have compared to low-quality content“. That indicates Google is looking at this ratio. John says to identify “which pages are high-quality, which pages are lower quality so that the pages that do get indexed are really the high-quality ones.“
What is Boilerplate Content?
Wikipedia says of ‘boilerplate’ content:
Boilerplate is any text that is or can be reused in new contexts or applications without being greatly changed from the original. WIKI
…and Google says to:
Minimize boilerplate repetition
Google is very probably looking to see if your pages ‘stand on their own‘ – as John Mueller is fond of saying.
How would they do that algorithmically? Well, they could look to see if text blocks on your pages were unique to the page, or were very similar blocks of content to other pages on your site.
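As a rough illustration of that idea (my own sketch, not anything Google has published), you could measure how much of each page is made up of text blocks that also appear on other pages of the same site. The function names, the one-line-per-block split and the sample pages below are all assumptions made for the example.

```python
from collections import Counter


def split_blocks(page_text):
    """Split a page into normalised text blocks (assumption: one block per non-empty line)."""
    return [" ".join(line.lower().split()) for line in page_text.splitlines() if line.strip()]


def boilerplate_share(pages):
    """Return, per URL, the fraction of its words that sit in blocks also found on another page."""
    page_blocks = {url: split_blocks(text) for url, text in pages.items()}

    # Count how many distinct pages each block appears on.
    seen_on = Counter()
    for blocks in page_blocks.values():
        seen_on.update(set(blocks))

    shares = {}
    for url, blocks in page_blocks.items():
        total = sum(len(block.split()) for block in blocks)
        shared = sum(len(block.split()) for block in blocks if seen_on[block] > 1)
        shares[url] = shared / total if total else 0.0
    return shares


# Hypothetical pages: only the middle line is repeated across both URLs.
pages = {
    "/red-widget": "Red widget\nFree delivery on all orders over 50 pounds\nA bright red widget for the kitchen",
    "/blue-widget": "Blue widget\nFree delivery on all orders over 50 pounds\nA calm blue widget for the office",
}
shares = boilerplate_share(pages)
```

A page whose share is close to 1.0 is made almost entirely of repeated blocks, which is the kind of page that probably does not ‘stand on its own’.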
If this ‘boilerplate’ content is the content that makes up the PRIMARY content of multiple pages – Google can easily filter to ignore – or penalise – this practice.
The sensible move would be to listen to Google – and minimise – or at least diffuse – the instances of boilerplate text, page-to-page on your website.
Note that THIN CONTENT exacerbates BOILERPLATE TEXT problems on a site – as THIN CONTENT just creates more pages that can only be created with boilerplate text – itself, a problem.
E.G. – if a product has 10 URLs – one URL for each colour of the product, for instance – then the TITLE, META DESCRIPTION & PRODUCT DESCRIPTION (and other elements on the page) for these extra pages will probably rely on BOILERPLATE techniques to create them, and in doing so – you create 10 URLs on the site that do ‘not stand on their own’ and essentially duplicate text across pages.
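One common way to fold such variant URLs together is to point them all at a single product URL with the canonical link element. Here is a minimal sketch of working out that target, assuming the colour is carried in a query parameter called colour or color (both names, and the example URLs, are hypothetical).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these query parameters only select a variant of the same product.
VARIANT_PARAMS = {"colour", "color"}


def product_canonical(url):
    """Return the URL with variant-only parameters (and any fragment) removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in VARIANT_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


variant = "http://www.example.com/widget?colour=red&size=m"
target = product_canonical(variant)
```

Every colour URL then resolves to the same target, which is the one page you would want Google to rank.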
It’s worth listening to John Mueller’s recent advice on this point. He clearly says that the practice of making your text more ‘unique’, using low-quality techniques is:
“probably more counter productive than actually helping your website”
If you have many pages of similar content on your site, Google might have trouble choosing the page you want to rank, and it might dilute your capability to rank for what you do want to rank for.
For instance, if you have ‘PRINT-ONLY’ versions of web pages (Joomla used to have major issues with this), that can end up displaying in Google instead of your web page if you’ve not handled it properly. That’s probably going to have an impact on conversions – for instance. Poorly implemented mobile sites can cause duplicate content problems, too.
Is There A Penalty For Duplicate Content On A Website?
Google has given us some explicit guidelines when it comes to managing duplication of content.
John Mueller clearly states in the video where I grabbed the above image:
“We don’t have a duplicate content penalty. It’s not that we would demote a site for having a lot of duplicate content.”
“You don’t get penalized for having this kind of duplicate content”
…in which he was talking about very similar pages. John says to “provide… real unique value” on your pages.
I think that could be understood to mean that Google is not compelled to rank your duplicate content.
If it ignores it, it’s different from a penalty. Your original content can still rank, for instance.
An e-commerce SEO tip from John concerns:
“variations of product colors… for product page, but you wouldn’t create separate pages for that.” With these types of pages you are “always balancing is having really, really strong pages for these products, versus having, kind of, medium strength pages for a lot of different products.“
“one kind of really, really strong generic page” trumps “hundreds” of mediocre ones.
If “essentially, they’re the same, and just variations of keywords” that should be ok, but if you have ‘millions‘ of them – Googlebot might think you are building doorway pages, and that IS risky.
Generally speaking, Google will identify the best pages on your site if you have a decent on-site architecture and unique content.
The advice is to avoid duplicate content issues if you can and this should be common sense.
Google wants (and rewards) original content – it’s a great way to push up the cost of SEO and create a better user experience at the same time.
Google doesn’t like it when ANY TACTIC is used to manipulate its results, and republishing content found on other websites is a common practice of a lot of spam sites.
Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. Google.
You don’t want to look anything like a spam site; that’s for sure – and Google WILL classify your site… as something.
The more you can make it look like a human built every page on a page by page basis with content that doesn’t appear exactly in other areas of the site – the more Google will like it. Google does not like automation when it comes to building a website; that’s clear in 2016.
I don’t mind multiple copies of articles on the same site – as you find with WordPress categories or tags, but I wouldn’t have tags and categories, for instance, and expect them to rank well on a small site with a lot of higher quality competition, and especially not targeting the same keyword phrases.
I prefer to avoid repeated unnecessary content on my site, and when I do have automatically generated content on a site, I tell Google not to index it with a noindex in the meta robots tag or the X-Robots-Tag HTTP header.
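For illustration, both noindex signals look like this. The helper below is a hypothetical sketch of my own; how you attach the tag or the header depends entirely on your CMS or server.

```python
def robots_signals(auto_generated):
    """Return the noindex signals for a page, or empty values if it should be indexable.

    The function and its return shape are assumptions for this example;
    the meta tag and the X-Robots-Tag header are the two standard forms.
    """
    if not auto_generated:
        return {"meta": "", "headers": {}}
    return {
        # Goes inside the <head> of the HTML page.
        "meta": '<meta name="robots" content="noindex">',
        # Sent as an HTTP response header, useful for non-HTML files such as PDFs.
        "headers": {"X-Robots-Tag": "noindex"},
    }


signals = robots_signals(auto_generated=True)
```

You only need one of the two on any given URL, and the page must stay crawlable for Google to see either of them.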
I am probably doing the safest thing, as that could be seen as manipulative if I intended to get it indexed.
Google won’t thank you, either, for spidering a calendar folder with 10,000 blank pages on it, or a blog with more categories than original content – why would they?
…in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results. Google tries hard to index and show pages with distinct information. This filtering means, for instance, that if your site has a “regular” and “printer” version of each article, and neither of these is blocked with a noindex meta tag, we’ll choose one of them to list. In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results. GOOGLE
If you are trying to compete in competitive niches, you need original content that’s not found on other pages in the same form on your site, and THIS IS EVEN MORE IMPORTANT WHEN THAT CONTENT IS FOUND ON OTHER PAGES ON OTHER WEBSITES.
Google isn’t under any obligation to rank your version of content – in the end, it depends on whose site has got the most domain authority or most links coming to the page.
Avoid competing unnecessarily with these duplicate pages: always rewrite your content if you think the content will appear on other sites (especially if you are not the first to ‘break it’ if it’s news).
How To Check For Duplicate Content On A Website
An easy way to find duplicate content is to use Google. Just take a piece of text content from your site and put it “in quotes” as a search in Google. Google will tell you how many pages in its index of the web contain that piece of content. The best known online duplicate content checker tool is Copyscape and I particularly like this little tool too, which checks the duplicate content ratio between two selections of text.
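If you want a quick local check of the ratio between two selections of text, Python’s standard difflib module can give you a rough figure. This is a minimal sketch of my own and makes no claim to match how Copyscape or Google measure similarity; the function name and sample sentences are made up for the example.

```python
from difflib import SequenceMatcher


def duplicate_ratio(text_a, text_b):
    """Return a 0.0 to 1.0 similarity score based on matching runs of words."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # ratio() is 2 * matched_words / total_words across both texts.
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


score = duplicate_ratio("the quick brown fox", "the quick brown dog")
```

Here three of the four words match in each text, so the score comes out at 0.75. Identical texts score 1.0 and texts with no words in common score 0.0.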
If you find evidence of plagiarism, you can file a DMCA or contact Google, but I haven’t ever bothered with that, and many folks have republished my articles over the years.
I even found my article in a paid advert in a magazine before.
A Dupe Content Strategy?
There are strategies where this will still work, in the short term. Opportunities are (in my experience) reserved for long tail SERPs where the top ten results page is already crammed full of low-quality results, and the SERPs are shabby – certainly not a strategy for competitive terms.
There’s not a lot of traffic in long tail results unless you do it en masse and that could invite further site quality issues, but sometimes it’s worth exploring whether using very similar content with geographic modifiers (for instance) on a site with some domain authority offers an opportunity. Very similar content can be useful across TLDs too. A bit spammy, but if the top ten results are already a bit spammy…
If low-quality pages are performing well in the top ten of an existing long tail SERP – then it’s worth exploring – I’ve used it in the past. I always thought if it improves user experience and is better than what’s there in those long tail searches at present, who’s complaining?
Too many low-quality pages might cause you site-wide issues in the future, not just page-level issues.
Original Content Is King, they say
Stick to original content, found on only one page on your site, for best results – especially if you have a new/young site and are building it page by page over time… and you’ll get better rankings and more traffic to your site (affiliates too!).
Yes – you can be creative and reuse and repackage content, but I always make sure if I am asked to rank a page I will require original content on the page.
There is NO NEED to block your own Duplicate Content
There was a useful post in Google forums a while back with advice from Google how to handle very similar or identical content:
“We now recommend not blocking access to duplicate content on your website, whether with a robots.txt file or other methods” John Mueller
John also goes on to give some good advice about how to handle duplicate content on your own site:
- Recognize duplicate content on your website.
- Determine your preferred URLs.
- Be consistent within your website.
- Apply 301 permanent redirects where necessary and possible.
- Implement the rel=”canonical” link element on your pages where you can. (Note – the canonical link element can be used across multiple sites/domains too.)
- Use the URL parameter handling tool in Google Webmaster Tools where possible.
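The first few steps in that list (pick a preferred URL, be consistent, 301 everything else to it) can be sketched as a small piece of logic. This is only an illustration under my own assumptions: the preferred host, the list of host aliases and the index.html rule are hypothetical, and in practice the redirect itself would live in your server configuration.

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_SCHEME = "http"                           # assumption for this sketch
PREFERRED_HOST = "www.example.com"                  # hypothetical preferred domain
HOST_ALIASES = {"example.com", "www.example.com"}   # hosts that serve the same site


def preferred_url(url):
    """Map any known variant of a URL to the single preferred version."""
    parts = urlsplit(url)
    if parts.hostname not in HOST_ALIASES:
        return url  # not our site, leave it alone
    path = parts.path or "/"
    if path.endswith("/index.html"):
        # Treat /folder/index.html and /folder/ as the same page.
        path = path[: -len("index.html")]
    return urlunsplit((PREFERRED_SCHEME, PREFERRED_HOST, path, parts.query, ""))


def redirect_for(url):
    """Return (301, target) when the URL is not the preferred one, else None."""
    target = preferred_url(url)
    return None if target == url else (301, target)


decision = redirect_for("http://example.com/index.html")
```

In this example the non-www home page with index.html is sent to http://www.example.com/ with a 301, while a URL that is already in its preferred form needs no redirect.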
Webmaster guidelines on content duplication used to say:
Consider blocking pages from indexing: Rather than letting Google’s algorithms determine the “best” version of a document, you may wish to help guide us to your preferred version. For instance, if you don’t want us to index the printer versions of your site’s articles, disallow those directories or make use of regular expressions in your robots.txt file. Google
but now Google is pretty clear they do NOT want us to block duplicate content, and that is reflected in the guidelines.
Google does not recommend blocking crawler access to duplicate content (dc) on your website, whether with a robots.txt file or other methods. If search engines can’t crawl pages with dc, they can’t automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages. A better solution is to allow search engines to crawl these URLs, but mark them as duplicates by using the rel="canonical" link element, the URL parameter handling tool, or 301 redirects. In cases where DC leads to us crawling too much of your website, you can also adjust the crawl rate setting in Webmaster Tools. DC on a site is not grounds for action on that site unless it appears that the intent of the DC is to be deceptive and manipulate search engine results. If your site suffers from DC issues, and you don’t follow the advice listed above, we do a good job of choosing a version of the content to show in our search results.
You want to minimise dupe content, rather than block it. I find the best solution to handling a problem is on a case by case basis. Sometimes I will block Google.
Google says it needs to detect an INTENT to manipulate Google to incur a penalty, and you should be OK if your intent is innocent, BUT it’s easy to screw up and LOOK as if you are up to something fishy.
It is also easy to fail to get the benefit of proper canonicalisation and consolidation of relevant primary content if you don’t do basic housekeeping, for want of a better turn of phrase.
Advice on content spread across multiple domains:
Content Spread Across Multiple TLDs
Mobile SEO Advice
Canonical Link Element Best Practice
Google also recommends using the canonical link element to help minimise content duplication problems.
If your site contains multiple pages with largely identical content, there are a number of ways you can indicate your preferred URL to Google. (This is called “canonicalization”.)
Google SEO – Matt Cutts from Google shared tips on the rel=”canonical” tag (more accurately – the canonical link element) that the 3 top search engines now support. Google, Yahoo!, and Microsoft have all agreed to work together in a
“joint effort to help reduce duplicate content for larger, more complex sites, and the result is the new Canonical Tag”.
Example Canonical Tag From Google Webmaster Central blog:
<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
You can put this link tag in the head section of the problem URLs if you think you need it.
I add a self-referring canonical link element as standard these days – to ANY web page.
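If you want to audit whether a page actually carries a self-referring canonical, the standard library HTML parser is enough for a quick check. A minimal sketch, assuming you already have the page HTML as a string; the class and function names are my own.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> element in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            self.canonicals.append(attributes.get("href"))


def find_canonicals(html_text):
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonicals


def is_self_referring(html_text, page_url):
    """True only when the page declares exactly one canonical and it is the page itself."""
    return find_canonicals(html_text) == [page_url]


page_url = "http://www.example.com/product.php?item=swedish-fish"
page_html = '<head><link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" /></head>'
ok = is_self_referring(page_html, page_url)
```

Requiring exactly one canonical also catches pages that end up with two conflicting link elements, which happens more often than you might think when plugins and themes both add one.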
Is rel=”canonical” a hint or a directive?
It’s a hint that we honor strongly. We’ll take your preference into account, in conjunction with other signals, when calculating the most relevant page to display in search results.
Can I use a relative path to specify the canonical, such as <link rel=”canonical” href=”product.php?item=swedish-fish” />?
Yes, relative paths are recognized as expected with the <link> tag. Also, if you include a <base> link in your document, relative paths will resolve according to the base URL.
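To see what a relative canonical would resolve to, you can use the same standard URL resolution rules that browsers follow. A small sketch using Python’s urljoin; the page URLs and the /catalog/ base are made-up examples.

```python
from urllib.parse import urljoin


def resolve_canonical(page_url, href, base_href=None):
    """Resolve a canonical href against the page URL, or against <base> if one is set."""
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, href)


# Hypothetical page URL and <base> value for illustration.
without_base = resolve_canonical(
    "http://www.example.com/shop/list.php?sort=price",
    "product.php?item=swedish-fish",
)
with_base = resolve_canonical(
    "http://www.example.com/shop/list.php?sort=price",
    "product.php?item=swedish-fish",
    base_href="http://www.example.com/catalog/",
)
```

The first call resolves inside /shop/ and the second inside /catalog/, which shows how easily a relative path can point somewhere you did not intend. That is a good reason to prefer absolute URLs in canonicals.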
Is it okay if the canonical is not an exact duplicate of the content?
We allow slight differences, e.g., in the sort order of a table of products. We also recognize that we may crawl the canonical and the duplicate pages at different points in time, so we may occasionally see different versions of your content. All of that is okay with us.
What if the rel=”canonical” returns a 404?
We’ll continue to index your content and use a heuristic to find a canonical, but we recommend that you specify existent URLs as canonicals.
What if the rel=”canonical” hasn’t yet been indexed?
Like all public content on the web, we strive to discover and crawl a designated canonical URL quickly. As soon as we index it, we’ll immediately reconsider the rel=”canonical” hint.
Can rel=”canonical” be a redirect?
Yes, you can specify a URL that redirects as a canonical URL. Google will then process the redirect as usual and try to index it.
What if I have contradictory rel=”canonical” designations?
Our algorithm is lenient: We can follow canonical chains, but we strongly recommend that you update links to point to a single canonical page to ensure optimal canonicalization results.
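To find chains on your own site before Google has to untangle them, you can follow each page’s declared canonical until it stops changing. This is a sketch of my own and says nothing about how Google resolves chains internally; the mapping of URLs to canonicals is assumed to come from your own crawl.

```python
def final_canonical(url, canonical_of, max_hops=10):
    """Follow canonical declarations from url; return the end of the chain or None on a loop.

    canonical_of maps each crawled URL to the canonical URL it declares.
    """
    seen = [url]
    while url in canonical_of and canonical_of[url] != url:
        url = canonical_of[url]
        if url in seen or len(seen) > max_hops:
            return None  # contradictory designations (a loop) or an overly long chain
        seen.append(url)
    return url


# Hypothetical crawl data: /a points to /b, /b points to /c, /c is self-referring.
chain = {"/a": "/b", "/b": "/c", "/c": "/c"}
end = final_canonical("/a", chain)

# Two pages that name each other as canonical.
loop = {"/x": "/y", "/y": "/x"}
broken = final_canonical("/x", loop)
```

Any page whose end of chain differs from its declared canonical is a candidate for updating so that it points straight at the single canonical page.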
Can this link tag be used to suggest a canonical URL on a completely different domain?
**Update on 12/17/2009: The answer is yes! We now support a cross-domain rel=”canonical” link element.**
Tip – Redirect old, out of date content to new, freshly updated articles on the subject, minimising low-quality pages and duplicate content while at the same time, improving the depth and quality of the page you want to rank. See our page on 301 redirects – http://www.hobo-web.co.uk/how-to-change-domain-names-keep-your-rankings-in-google/.
Tips from Google
As with everything Google does – Google has had its own critics about its use of duplicate content on its own site for its own purposes:
There are some steps you can take to proactively address duplicate content issues, and ensure that visitors see the content you want them to. Use 301s: If you’ve restructured your site, use 301 redirects (“RedirectPermanent”) in your .htaccess file to smartly redirect users, Googlebot, and other spiders. (In Apache, you can do this with an .htaccess file; in IIS, you can do this through the administrative console.)
Be consistent: Try to keep your internal linking consistent. For example, don’t link to
I would also ensure your links are all the same case, and avoid capitalisation and lower case variations of the same URL.
This type of duplication can be quickly sorted by keeping internal linking consistent and by proper use of canonical link elements.
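Case variations are easy to spot if you have a list of the internal links found on your site. A minimal sketch, assuming that list comes from your own crawl; the function name and URLs are made up.

```python
from collections import defaultdict


def case_variants(urls):
    """Group URLs that differ only by letter case and return the groups with more than one form."""
    groups = defaultdict(set)
    for url in urls:
        groups[url.lower()].add(url)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Hypothetical internal links collected from a crawl.
links = [
    "http://www.example.com/Page",
    "http://www.example.com/page",
    "http://www.example.com/other",
]
variants = case_variants(links)
```

Each group it returns is a set of URLs you would want to link to in one consistent form, with the others redirected or canonicalised to it.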
Use top-level domains: To help us serve the most appropriate version of a document, use top-level domains whenever possible to handle country-specific content. We’re more likely to know that
http://www.example.de contains Germany-focused content, for instance, than
Google also tells Webmasters to choose a preferred domain to rank in Google:
Use Webmaster Tools to tell us how you prefer your site to be indexed: You can tell Google your preferred domain (for example,
…although you should ensure you handle such redirects server side, with 301 redirects redirecting all versions of a URL to one canonical URL (with a self-referring canonical link element).
Minimize boilerplate repetition: For instance, instead of including lengthy copyright text on the bottom of every page, include a very brief summary and then link to a page with more details. In addition, you can use the Parameter Handling tool to specify how you would like Google to treat URL parameters.
Understand Your CMS
Understand your content management system: Make sure you’re familiar with how content is displayed on your website. Blogs, forums, and related systems often show the same content in multiple formats. For example, a blog entry may appear on the home page of a blog, in an archive page, and in a page of other entries with the same label.
WordPress, Magento, Joomla, Drupal – they all come with slightly different duplicate content (and crawl equity performance) challenges.
Syndicating Content Comes At A Risk
When it comes to publishing your content on other websites:
Syndicate carefully: If you syndicate your content on other sites, Google will always show the version we think is most appropriate for users in each given search, which may or may not be the version you’d prefer. However, it is helpful to ensure that each site on which your content is syndicated includes a link back to your original article. You can also ask those who use your syndicated material to use the noindex meta tag to prevent search engines from indexing their version of the content.
The problem with syndicating your content is you can never tell if this will ultimately cost you organic traffic.
If it is on other websites – they might be getting ALL the benefit – not you.
It’s also worth noting that Google still clearly says in 2016 that you can put links back to your original article in posts that are republished elsewhere. But you need to be careful with that too – as those links could be classified as unnatural links.
A few years ago I made an observation that, I think, shows links on duplicate posts that have been stolen – duplicated and republished – STILL pass anchor text value (even if it is a light boost).
Take this cheeky beggar…. – he nicked the ‘what is SEO’ post I created, stripped out all my links (cheek!) and published the article as his own.
Well he stripped out all the links apart from one link he missed:
Yes, the link to http://www.duny*.com.pk/ was actually still pointing to my home page.
This gave me an opportunity to look at something…..
The article itself wasn’t 100% duplicate – there was a small intro text as far as I can see. It was clear by looking at Copyscape just how much of the article is unique and how much is duplicate.
So this was a 3-year-old article republished on a low-quality site with a link back to my site within a portion of the page that’s clearly dupe text.
I would have *thought* Google just ignored that link.
But no, Google did return my page for the following query (at the time):
The Google Cache notification (now no longer available) sometimes tells fibs, but it was pretty accurate this time:
… which looks to me as if Google will count links (AT SOME LEVEL) even on duplicate articles republished on other sites – probably depending on the search query, and the quality of the SERP at that time (perhaps even taking into consideration the quality score of the site with the most trust?).
I’d imagine this to be the case even today.
How to take advantage of this?
Well, you get an idea of just how much original text you need to add to a page for that page to pass some kind of anchor text value (perhaps useful for article marketers). And in this case, it’s not much! Kind of lazy, though. And certainly not good enough in 2016.
It seems that syndicating your content via RSS and encouraging folk to republish your content will get you links that count, on some level (which might be useful for longer tail searches). I still always make sure even duplicate (in essence) press releases and articles we publish are ‘unique’ at some level.
Google is quite good at identifying the original article especially if the site it’s published on has a measure of trust – I’ve never had a problem with syndication of my content via RSS and let others cross post…. but I do like at least a link back, nofollow or not.
The bigger problem with content syndication in 2016 is unnatural links and whether or not Google classifies your intent as manipulative.
Thin Content Classifier
Google also says this about ‘thin’ content:
Avoid publishing stubs: Users don’t like seeing “empty” pages, so avoid placeholders where possible. For example, don’t publish pages for which you don’t yet have real content. If you do create placeholder pages, use the noindex meta tag to block these pages from being indexed.
Minimize similar content: If you have many pages that are similar, consider expanding each page or consolidating the pages into one. For instance, if you have a travel site with separate pages for two cities, but the same information on both pages, you could either merge the pages into one page about both cities or you could expand each page to contain unique content about each city.
The key takeaways about duplicate content are these:
Duplicate content is part of the normal churn of the web. Google will rank it – for a time. Human or machine generated, there is a lot of it – and Google has a lot of experience handling it and there are many circumstances where Google finds duplicate content on websites. Not all duplicate content is a bad thing.
If a page ranks well and Google finds it a manipulative use of duplicate content, Google can demote the page if it wants to. If it is deemed the intent is manipulative and low-quality with no value add, Google can take action on it – using manual or algorithmic actions.
There is a very thin line between reasonable duplicate content and thin content. This is where the confusion comes in.
Google explicitly states they don’t have a duplicate content penalty – but they do have a ‘thin content’ manual action… that looks and feels a lot like a penalty. They also have Google Panda.
How To Deal With Pagination Problems On Your Website
Paginated pages are not duplicate content, but often, it would be more beneficial to the user to land on the first page of the sequence. Folding pages in a sequence and presenting a canonical URL for a group of pages has numerous benefits.
If you think you have paginated content problems on your website, it can be a frightening prospect to try and fix.
It is actually not that complicated.
Google knows that ‘Sites paginate content in various ways.’ and it is used to dealing with this type of problem on different types of sites like:
News and/or publishing sites often divide a long article into several shorter pages.
Retail sites may divide the list of items in a large product category into multiple pages.
Discussion forums often break threads into sequential URLs.
While Google says you can ‘do nothing‘ with paginated content, that might be taking a risk in a number of areas, and part of SEO in 2016 is to focus on ranking a canonical version of a URL at all times.
What you do to handle paginated content will depend on your circumstances.
A better recommendation on offer is to:
Specify a View All page. Searchers commonly prefer to view a whole article or category on a single page. Therefore, if we think this is what the searcher is looking for, we try to show the View All page in search results. You can also add a rel=”canonical” link to the component pages to tell Google that the View All version is the version you want to appear in search results. GOOGLE
Use rel="next" and rel="prev" links to indicate the relationship between component URLs. This markup provides a strong hint to Google that you would like us to treat these pages as a logical sequence, thus consolidating their linking properties and usually sending searchers to the first page. GOOGLE
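For a paginated series, the link elements for each component page can be generated from the page number. A minimal sketch, assuming a ?page=N URL pattern with page one living at the bare URL; both the pattern and the function name are assumptions for the example, so adapt them to how your own site builds its URLs.

```python
def pagination_links(base_url, page, last_page):
    """Return the rel="prev" and rel="next" link elements for one page in a series."""

    def url_for(number):
        # Assumption: page 1 is the bare URL, later pages use ?page=N.
        return base_url if number == 1 else "{}?page={}".format(base_url, number)

    tags = []
    if page > 1:
        tags.append('<link rel="prev" href="{}" />'.format(url_for(page - 1)))
    if page < last_page:
        tags.append('<link rel="next" href="{}" />'.format(url_for(page + 1)))
    return tags


middle = pagination_links("http://www.example.com/article", 2, 3)
```

The first page in a series only gets a next link and the last page only gets a prev link, so a single-page article ends up with neither.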
You can also use meta robots ‘noindex,follow‘ directions on certain types of paginated content (I do), however, I would recommend you think twice before actually removing such content from Google’s index IF those URLs (or a portion of those URLs) generate a good amount of traffic from Google, and there is no explicit need for Google to follow the links to find content.
If a page is getting traffic from Google but needs to come out of the index, then I would ordinarily rely on an implementation that included the canonical link element.
Ultimately, this depends on the situation and the type of site you are dealing with.
Problems With Google Panda
Part of the Google Panda algorithm is focused on thin pages and the ratio of good-quality content to low-quality content on a site.
In the original announcement about Google Panda we were specifically told that the following was a ‘bad’ thing:
Does the site have duplicate, overlapping, or redundant articles?
If Google is rating your pages on content quality, or lack of it, as we are told, and user signals – on some level – and a lot of your site is duplicate content that gets no user signal – then that may be a problem too.
Google offers some advice on thin pages (emphasis mine):
Here are a few common examples of pages that often have thin content with little or no added value: 1 . Automatically generated content, 2. Thin affiliate pages 3. Content from other sources. For example: Scraped content or low-quality guest blog posts. 4. Doorway pages
Everything I’ve bolded in the last two quotes is essentially about duplicate content.
Google is even more explicit when it tells you how to clean up this ‘violation’:
Next, follow the steps below to identify and correct the violation(s) on your site: Check for content on your site that duplicates content found elsewhere.
So beware. Google says there is NO duplicate content penalty, but if Google classifies your duplicate content as thin content, boilerplate or spun text, then you DO have a problem!
A serious problem if your entire site is built like that.
And how Google rates thin pages changes over time, with a quality bar that is always going to rise and that your pages need to keep up with.
Especially if rehashing content is what you do.
TIP – Look out for soft 404 errors in Google Webmaster Tools as examples of pages Google is classing as low-quality, user-unfriendly thin pages.