I recently ran into an interesting SEO problem on a project which has led to a question I just don't know the answer to:

How do you display syndicated content without triggering Google's duplicate content flag?

Hmm... intriguing.

Background

To explain the problem more fully (without giving out any project specifics), imagine you have a website. You probably already do, so that shouldn't be hard. Now, imagine you fill that website with original content. Again, that shouldn't be hard. For the sake of example, let's assume you run a news blog where you comment on the important stories of the day.

Next, you figure that your readers also want to read important related facts about the news story. The Associated Press (AP) syndicates its content, and through its API you can pull in independently checked related facts about whatever your original content deals with. So far, so good.

Unfortunately, a thousand other news blogs also pull in the same AP content alongside their original (and some not-so-original) content. Now, when the Googlebot crawls your site, it finds the same content there as it does in a thousand other places. Suddenly, you're marked with a 'duplicate content' black flag and all the lovely Google juice you got from writing original articles has been taken away. Boo.

Your first thought might be to reach for rel="canonical", but that only applies to entire pages. We need something that affects just a portion of the page.

Solution

What you need is a way to include the content in your page when a visitor views it (providing extra value for readers) while preventing Google from reading it (which is what hurts your search ranking). Fortunately, there are some methods for doing this. One involves putting the content in an external JS file that is blocked in your robots.txt to prevent Google from reading it. A similar method involves putting the content in an external HTML file and including it via an iframe, again preventing crawling via robots.txt. When the reader visits the page, the content is there; when Google visits, it isn't.
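As a rough sketch of how that might look (the paths and file names here are invented for illustration), you block the syndicated file in robots.txt and then include it client-side:

 # robots.txt - keep crawlers away from the syndicated snippets
 User-agent: *
 Disallow: /syndicated/

 <!-- in the page: include the blocked file so visitors still see it -->
 <iframe src="/syndicated/ap-facts.html"></iframe>
 <!-- or, for the JS variant, a script that writes the facts into the page -->
 <script src="/syndicated/ap-facts.js"></script>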

The Problem with the Solution

Both of the techniques mentioned here involve an extra HTTP request. You are including an external file, so the visitor's browser has to go back to your server, grab the file and include it. This isn't a huge problem for most sites, but when you're dealing with high-traffic, highly-optimised websites, every file transferred counts. If you've gone to all the trouble of combining your background images into sprites, why waste unnecessary connections on content?

Yahoo's Solution

Yahoo have a nice solution to this problem. If you add class="robots-nocontent" to any element, the Yahoo spider (called 'Slurp') will ignore its content. Perfect. This does, however, only work for Yahoo. Not perfect.
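In markup, that looks something like this (a minimal sketch):

 <div class="robots-nocontent">
  <p>Syndicated AP facts would go here; Slurp skips this block.</p>
 </div>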

My solution

My attempt at solving this problem, which combines SEO with front-end performance, was inspired by the technique GMail uses to deliver JS to mobile devices. In their article, Google describes delivering JS that they don't want run immediately as part of the initial payload. They figure that the cost of serving a single, slightly larger HTTP response is less than the delay of retrieving the code on demand.
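As I understand that technique, the deferred code ships inside comment blocks and only gets evaluated when it's needed. A rough illustration of the idea (not GMail's actual code; loadDeferred is a name I've made up):

 <script id="deferred" type="text/javascript">
 /*
  alert('Expensive feature code lives here, unexecuted on page load.');
 */
 </script>
 <script>
  // When the feature is first needed, strip the comment markers and evaluate.
  function loadDeferred() {
   var src = document.getElementById('deferred').text;
   eval(src.replace('/*', '').replace('*/', ''));
  }
 </script>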

I embed the HTML in a JS comment in the original page, then process it on DOMReady to strip out the comment markers and inject the markup wherever it is supposed to go (identified by the data-destination attribute). I'm doing this on a page which already loads jQuery, so it can all be accomplished with a simple bit of code.

<script type="text/html" class="norobot" data-destination=".content-destination">
 /*!
  <p>This content is hidden on load but displayed afterwards using javascript.</p>
 */
</script>
<div class="content-destination"></div>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.5/jquery.min.js"></script>
<script>
 // On DOMReady, strip the comment markers and move the markup into place.
 $(function() {
  $('.norobot').each(function() {
   var _this = $(this);
   $(_this.data('destination')).html(_this.html().replace('/*!','').replace('*/',''));
  });
 });
</script>

Notes on the code

You may have noticed the type="text/html" attribute on the script block. That's to get around the fact that jQuery parses and executes any script blocks it finds when moving elements around (in an appendTo() or an html(), for example). With a non-JavaScript type, neither the browser nor jQuery executes the block; it's treated as inert template content.

Also, the opening JS comment here begins with /*!. The exclamation mark is a directive telling any minifiers you run over the code not to remove this comment block.

This is also available in a standalone page.

This is all a very long setup for my initial question. Does Google read this and, if so, does this affect duplicate content rankings?

Plus and Minus

  • Minus: the duplicate content is definitely in the page.
  • Plus: it's hidden in JavaScript.
  • Minus: we're using JavaScript to serve different content to users and to Google.
  • Plus: we're showing less to Google than to users; spam techniques show more to increase keyword matches.
  • Plus: faster response thanks to a single HTTP request (Google likes fast pages).

Obviously, we could add an extra step of obfuscating the 'hidden' content by reversing it or encoding it. This would definitely hide it from Google, and it would be trivial to undo the process before showing it to the user, but is this step necessary? Part of my reasoning for concluding that Google ignores JS comments is that thousands of sites include the same standard licences bundled with their JS library of choice and don't get penalised. This may, of course, be specific to licences, though.
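If that extra step were needed, a minimal sketch using base64 might look like this (relying on the browser's built-in btoa()/atob(), which are fine for plain ASCII content; in reality the encoding would happen server-side so the plain text never appears in the page source):

 <script>
  // btoa() here just stands in for a server-side encoding step.
  var encoded = btoa('<p>Syndicated AP facts would go here.</p>');

  // On DOMReady, decode and inject, mirroring the comment-stripping version.
  $(function() {
   $('.content-destination').html(atob(encoded));
  });
 </script>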

I can find no definitive answer anywhere on this subject. If you have any good references, please let me know. Alternatively, if you happen to know Matt Cutts, ask him for me. If I get any conclusive answer, I'll update here.