Duplicate Content Issues and Scrapers

by Hari Varrier on January 5, 2009

An interesting article by Sven Naumann, from Google’s Search Quality Team, indicates that there are two kinds of duplicate content:
Internal – Identical content appears in more than one location on your site.
External – Your content appears on other sites, outside your own.

The article even states that duplicate content, whether internal or external, doesn’t negatively affect a site. My testing shows otherwise. Most “duplicate content” issues revolve around your site and your site only. External duplicate content issues are rare, even though Google would want you to believe otherwise. For example, search Google for a recent news story: hundreds of results will carry the exact word-for-word story from the AP.
Another example: if two pages on your site have different content but the exact same Title and Description, Google will mark one of them as a duplicate. Conversely, if two pages have the exact same content but unique Titles and Descriptions, Google will treat them as unique. I doubt this will last, as Google should fix this issue by the end of the year.
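To check your own site for the shared Title and Description case described above, a short script can help. The following is only a minimal sketch: the `pages` dictionary (URL mapped to HTML source) and the names `HeadParser` and `find_shared_titles` are assumptions made for illustration, and how you collect the HTML is up to you.

```python
from collections import defaultdict
from html.parser import HTMLParser


class HeadParser(HTMLParser):
    """Collects the <title> text and the meta description of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def find_shared_titles(pages):
    """Group URLs by (title, description) and keep groups with 2+ URLs.

    `pages` maps a URL to its HTML source (hypothetical input format).
    """
    groups = defaultdict(list)
    for url, html in pages.items():
        parser = HeadParser()
        parser.feed(html)
        key = (parser.title.strip().lower(), parser.description.lower())
        groups[key].append(url)
    return {key: urls for key, urls in groups.items() if len(urls) > 1}
```

Any group this returns is a set of pages that should probably get their own Titles and Descriptions. The comparison is case-insensitive by choice; whether a search engine compares them that way is not something this sketch can confirm.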
Here is what you can do to combat internal duplicate content:
• www-Protection: Be sure your .htaccess file contains the fix that redirects the non-www version of your domain to the www version (Apache servers only).
• Use 301s: If you have modified your site, use 301s in your .htaccess file to redirect users, Googlebot, and other spiders to the proper page.
• Block Bots: Never give Google the power to make a decision you should make. You choose which version of the document you want indexed and block the bots (by use of the robots.txt file) from the other versions, such as “Printer Friendly” versions.
• Link Consistently: Use relative or absolute links for your internal linking, but not both.
• Top Level Domains (TLDs): TLDs help Google serve the best version of the document for country-specific content. www.mydomain.de indicates a German document better than www.mydomain.com/de.
• Preferred Domain Feature: In webmaster tools, there is a feature that allows you to indicate to Google which version you prefer to show in the SERPs. This is not a replacement for the non-www redirect.
• Avoid Repetition: If you have a lengthy copyright, disclosure, or other text that is required on every page of your site, think about putting the text in an image and serving it via external CSS or JS.
• Understand your CMS: Be familiar with how content is displayed on your site, blog, and/or forum. Be aware that your CMS may show the same content in various formats and locations. Work with your provider on ways of solving this duplicate content problem, whether that means disallowing bots from a page or removing the option to view in another format altogether.
• Syndicated Content: Include a link back to your site within the content.
• Titles & Descriptions: Confirm you have unique titles and descriptions on every page. This is a good practice even if Google fixes the issue listed above.
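Several of the points above (www-Protection, 301s, linking consistently) come down to every internal link resolving to one preferred host. As a rough way to audit that, here is a minimal sketch. The function names and the `preferred_host` parameter are assumptions made for illustration, and the sketch ignores ports and treats only the www and non-www forms of the same domain as "your site".

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def _strip_www(host):
    """Return the host without a leading 'www.' prefix."""
    return host[4:] if host.startswith("www.") else host


def canonical_url(link, page_url, preferred_host):
    """Resolve `link` against `page_url` and force the preferred host
    on links that point at the same site (www or non-www form)."""
    parts = urlsplit(urljoin(page_url, link))
    host = (parts.hostname or "").lower()
    if _strip_www(host) == _strip_www(preferred_host.lower()):
        parts = parts._replace(netloc=preferred_host)
    return urlunsplit(parts)


def links_off_preferred_host(links, page_url, preferred_host):
    """Return the absolute form of each link that points at the
    non-preferred version of the site."""
    flagged = []
    for link in links:
        absolute = urljoin(page_url, link)
        if absolute != canonical_url(link, page_url, preferred_host):
            flagged.append(absolute)
    return flagged
```

Links flagged here are candidates for fixing in your templates. The 301 redirect in .htaccess is still needed, since this only audits the links you control.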
As for external duplicate content, Sven claims that Google looks “at various signals to determine which site is the original” and “that you shouldn’t be very concerned about seeing negative effects on your site’s presence on Google for site scraping.” In my testing, I’ve seen that this is not the case: your site can be negatively affected in rankings and in traffic. Remember, traffic is related to rankings. Yes, there are other aspects of getting the click, such as a compelling title and description, but ranking is also part of the equation, because few if any users go past the top ten results.
In more instances than Google would care to admit, the scraper outranks the original content. Here’s what to do if that happens:
• Confirm your content and pages are accessible and have not been blocked by your robots.txt file or by any meta tags on the page.
• Ensure your site is well within the guidelines set forth in the Webmaster Guidelines.
• Review your sitemap to see if you made any changes to the content that was scraped. Changes to the original could make it appear as the counterfeit version in Google’s eyes.
• File a DMCA Request if you have fixed the above and see no change in the rankings.
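The first check in the list above, confirming that the original page is not blocked, can be roughly automated. The sketch below uses Python's standard `urllib.robotparser` to test a URL against the text of a robots.txt file and a small parser to look for a `noindex` robots meta tag. The function names are assumptions made for illustration, and the standard-library parser may not interpret every rule exactly the way Googlebot does.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def is_blocked_by_robots(robots_txt, url, user_agent="Googlebot"):
    """Return True if the robots.txt text disallows `url` for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


class _RobotsMetaParser(HTMLParser):
    """Sets `noindex` when a robots or googlebot meta tag contains 'noindex'."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "meta" and name in ("robots", "googlebot"):
            if "noindex" in (attrs.get("content") or "").lower():
                self.noindex = True


def has_noindex_meta(html):
    """Return True if the page HTML carries a noindex robots meta tag."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    return parser.noindex
```

The same `is_blocked_by_robots` function can be turned around to confirm that versions you do want hidden, such as “Printer Friendly” pages, are actually disallowed.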
