thingsinjars

  • 3 Oct 2011

    Whitehat Syndication

    When I refer to Google or Googlebot here, I could really be talking about any web search or web crawler. When I refer to Yahoo, I actually mean Yahoo. Don't get confused.

    I recently ran into an interesting SEO problem on a project which has led to a question I just don't know the answer to:

    How do you display syndicated content without triggering Google's duplicate content flag?

    Hmm... intriguing.

    Background

    To explain the problem more fully (without giving out any project specifics), imagine you have a website. You probably already do so that shouldn't be hard. Now, imagine you fill that website full of original content. Again, that shouldn't be hard. For the sake of example, let's assume you run a news blog where you comment on the important stories of the day.

    Next, you figure that your readers also want to read important related facts about the news story. Associated Press (AP) syndicates its content and through the API, you can pull in independently-checked related facts about whatever your original content deals with. So far, so good.

    Unfortunately, a thousand other news blogs also pull in the same AP content alongside their original (and some not-so-original) content. Now, when the Googlebot crawls your site, it finds the same content there as it does in a thousand other places. Suddenly, you're marked with a 'duplicate content' black flag and all the lovely Google juice you got from writing original articles has been taken away. Boo.

    Your first thought might be to reach for the rel="canonical" attribute but that really only applies to entire pages. We need something that only affects a portion of the page.

    Solution

    What you need to do is find a way to include the content in your page when a visitor views it (providing extra value for readers) but prevent Google from reading it (which would hurt your search ranking). Fortunately, there are some methods for doing this. One involves putting the content in an external JS file which is disallowed in your robots.txt to prevent Google from reading it. Another, similar method involves putting the content in an external HTML file and including it as an iframe, again preventing crawling via robots.txt. When the reader visits the page, the content is there; when Google visits, it isn't.
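    As a rough sketch of the first method: the file path /syndicated/related-facts.js, the element id related-facts and the function name below are all made up for illustration and are not from the project. The robots.txt rule is shown in the leading comment; the script itself just writes the syndicated markup into a placeholder.

```javascript
// robots.txt (hypothetical path), asking crawlers not to fetch the file:
//
//   User-agent: *
//   Disallow: /syndicated/

// Contents of /syndicated/related-facts.js (hypothetical name).
// The page loads it with an ordinary <script src="..."> tag.
var RELATED_FACTS_HTML = '<p>Independently-checked related facts go here.</p>';

// Writes the syndicated markup into a placeholder element.
// `doc` is passed in so the function can be exercised without a browser.
function injectRelatedFacts(doc, targetId, html) {
  var target = doc.getElementById(targetId);
  if (!target) {
    return false; // nothing to fill on this page
  }
  target.innerHTML = html;
  return true;
}

// In the browser this runs as soon as the file has loaded.
if (typeof document !== 'undefined') {
  injectRelatedFacts(document, 'related-facts', RELATED_FACTS_HTML);
}
```

    The page would need an empty element with the matching id placed before the script tag. The iframe variant would use the same Disallow rule, just pointing at the HTML file instead.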

    The Problem with the Solution

    Both of the techniques mentioned here involve an extra HTTP request. You are including an external file so the visitor's browser has to go to your server, grab the file and include it. This isn't a huge problem for most sites but when you're dealing with high-traffic, highly-optimised websites, every file transferred counts. If you go to all the trouble of turning all your background images into sprites, why waste extra connections on content?

    There is an extra problem with the JS solution in that the content isn't available to visitors that don't have JS enabled but as I've mentioned before, that's only a problem if any of your visitors actually have JS disabled. Only you know that for sure.

    Yahoo's Solution

    Yahoo have a nice solution to this problem. If you include the attribute class="robots-nocontent" on any element, the Yahoo spider (called 'slurp') will ignore the content. Perfect. This does, however, only work for Yahoo. Not perfect.

    My solution

    My attempt at solving this problem, which is a combination of SEO and high front-end performance, was inspired by the technique Gmail uses to deliver JS to mobile devices. As described in their article, Google includes JS that they don't want run immediately in the initial payload rather than fetching it later. They figure that the cost of serving a single slightly larger HTTP response is less than the delay in retrieving data on demand.

    I use HTML embedded in a JS comment in the original page which is then processed on DOMReady to strip out the comments and inject it into wherever it is supposed to be (identified by the data-destination attribute). I'm doing this on a page which already loads jQuery so this can all be accomplished with a simple bit of code.

    <script type="text/html" class="norobot" data-destination=".content-destination">
     /*!
      <p>This content is hidden on load but displayed afterwards using javascript.</p>
     */
    </script>
    <div class="content-destination"></div>
    <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.5/jquery.min.js"></script>
    <script>
     // On DOM ready, copy each hidden block into its destination element
     $(function() {
      $('.norobot').each(function() {
       var _this = $(this); // var keeps _this out of the global scope
       // Strip the comment markers, leaving only the HTML
       $(_this.data('destination')).html(_this.html().replace('/*!','').replace('*/',''));
      });
     });
    </script>

    Notes on the code

    You may have noticed the type="text/html" attribute on the script block. That's to get around the fact that jQuery parses and executes any script blocks it finds when moving elements around (in an appendTo() or an html(), for example). Because the type isn't a JavaScript one, neither the browser nor jQuery tries to run the block; it is treated as inert template code.

    Also, the opening JS comment here begins /*!. The exclamation mark in this is a directive to any minifiers you might use on the code to tell them not to remove this comment block.
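    For pages that don't already load jQuery, the same unwrapping step can be sketched in plain JavaScript. The function names unwrapCommentedHtml and revealHiddenContent are my own for illustration, and this version also trims the leftover whitespace, which the jQuery snippet doesn't bother with.

```javascript
// Strips the first "/*!" opener and the last "*/" closer from the text
// of a template block, returning the HTML that was hidden inside.
function unwrapCommentedHtml(text) {
  var start = text.indexOf('/*!');
  var end = text.lastIndexOf('*/');
  if (start === -1 || end === -1 || end < start + 3) {
    return text; // not wrapped in a comment, leave untouched
  }
  return text.slice(start + 3, end).replace(/^\s+|\s+$/g, '');
}

// Finds every .norobot block and copies its unwrapped HTML into the
// element named by its data-destination attribute.
function revealHiddenContent(doc) {
  var blocks = doc.querySelectorAll('.norobot');
  for (var i = 0; i < blocks.length; i++) {
    var selector = blocks[i].getAttribute('data-destination');
    var destination = selector ? doc.querySelector(selector) : null;
    if (destination) {
      destination.innerHTML = unwrapCommentedHtml(blocks[i].textContent);
    }
  }
}

// Browser only: run once the DOM has been parsed.
if (typeof document !== 'undefined') {
  document.addEventListener('DOMContentLoaded', function () {
    revealHiddenContent(document);
  });
}
```

    Keeping the string handling in its own function means it can be checked without a page at all, which is handy given how easy it is to get the comment markers slightly wrong.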

    This is also available in a standalone page.

    This is all a very long setup for my initial question. Does Google read this and, if so, does this affect duplicate content rankings?

    Plus and Minus

    • Minus: The duplicate content is definitely in the page.
    • Plus: It's hidden in JavaScript.
    • Minus: We're using JavaScript to serve different content to users and to Google.
    • Plus: We're showing less to Google than to users. Spam techniques show more to increase keyword matches.
    • Plus: Faster response due to a single HTTP request (Google likes fast pages).

    Obviously, we could add an extra step of obfuscating the 'hidden' content by reversing it or encoding it. This would definitely hide it from Google and it would be trivial to undo the process before showing it to the user but is this step necessary? Part of my reasoning for concluding that Google ignores JS comments is that thousands of sites include the same standard licences bundled with their JS library of choice and don't get penalised. This may, of course, be specific to licences, though.
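    To make the extra obfuscation step concrete, here is a minimal sketch of the reversing idea. The function names are hypothetical, and whether any crawler is actually fooled by this is exactly the open question; the code only shows that the round trip is cheap.

```javascript
// Reverses a string by code point so that characters outside the
// Basic Multilingual Plane are not split in half.
function reverseText(text) {
  return Array.from(text).reverse().join('');
}

// Done once on the server (or at build time) before the content is
// embedded in the page.
function obfuscate(html) {
  return reverseText(html);
}

// Done in the browser just before the content is injected.
function deobfuscate(hidden) {
  return reverseText(hidden);
}

var original = '<p>Related facts</p>';
var hidden = obfuscate(original);
console.log(hidden);              // >p/<stcaf detaleR>p<
console.log(deobfuscate(hidden)); // <p>Related facts</p>
```

    Array.from needs a reasonably modern browser; on older ones, split('') would do for plain ASCII content.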

    I can find no definitive answer anywhere on this subject. If you have any good references, please let me know. Alternatively, if you happen to know Matt Cutts, ask him for me. If I get any conclusive answer, I'll update here.

    Development, Geek

  • 9 Sep 2011

    Testing

    The web is a visual medium. Well, mostly.

    There's no better way to test a visual medium than by looking at it. Look at your site in as many browsers as you can. If you've already got as many browsers installed on your development computer as you can fit, get another computer and install some more. Either that or run a Virtual Machine.

    Definition: Virtual Machine (VM)

    Applications like VirtualBox, VMware or Parallels allow you to run an entire computer within your computer. It is a self-contained system that doesn't interact with your own machine, meaning you can have IE6 installed on one VM, IE7 on another and IE8 on a third. All running in a window on your iMac. Shiny.

    If you can't do that easily, you could use one of the growing number of browser testing services. These are server rooms packed with computers running Virtual Machines and automated systems to which you supply a URL, wait a few moments and get shown an image (or several hundred images) showing your URL in different browsers on different platforms. Some of the more sophisticated services allow you to scroll down a long page or specify different actions, text entry or mouse events you want to see triggered. These services can be exceptionally useful when it comes to developing HTML e-mails as there are some rare and esoteric e-mail clients out in the wild. Litmus does an excellent job at virtualised testing for HTML e-mails. On that note, the Campaign Monitor library of free HTML e-mail templates is a great place to start, learn and possibly finish when working on an HTML e-mail.

    There is also a place for automated testing for some things. Recently, there has been a bit of a movement away from validating code as the purpose of web development is not to make it 'check a box' on a merely technical level, it is to get the message across via the design however possible. However, validation is still the best and easiest way to check your syntax. Syntax errors are still the main cause for mistakes appearing in your web sites and are the easiest thing to fix. Don't assume IE is wrong. Again, if you're keen on HTML e-mails, here's a great post on the Litmus blog.

    This article is modified from a chapter in a book Andrew and I were writing a couple of years ago about web development practical advice. Seeing as we both got too busy to finish it, I'm publishing bits here and there. If you'd like to see these in book form, let me know.

    Geek, Development, Guides

  • 29 Aug 2011

    Don’t be seduced by the drag-and-drop side

    You don’t have to be a survivor of the vi and Emacs holy wars to appreciate the beauty of fully hand-crafted code. There was a bit of attention a couple of weeks ago on the soon-to-be-launched Adobe Muse which lets you “Design and publish HTML websites without writing code”. If you want to be a kick-ass developer, you must realise that tools like this aren't designed for you. They're designed for people who want to do what you can do but either don't have the time or the inclination to learn how. Although drag 'n' drop applications do lower the barrier to entry for creating a website, there is still a need for web developers to know exactly what's going on in their pages.

    I'm not saying that there will never be a time when visual design can be automatically translated into as good a product as a hand-crafted site, I'm just saying it’s not yet.

    In much the same way as with JavaScript (See “You must be able to read before you get a library card”), building your HTML using that extra level of abstraction might work for almost every situation but will eventually leave you stuck somewhere you don’t want to be. By all means, pimp up your text editor with all manner of handy tools, shortcuts and code snippets but make sure you still know exactly what each bit of markup means and does. If you structure your code well (more on that in a later post), your natural feel for the code will be as good a validator as anything automated (by which I mean prone to errors and worth double-checking).

    Learn the keyboard shortcuts. If you learn nothing else from this, learn the importance of keyboard shortcuts. You might start off thinking you'll never need a key combination to simply swap two characters around but one day, you'll find yourself in the middle of a functino reaching for ctrl-T.

    Also, there is no easy way to tell if a text editor is fit for you until you have tried it, looking at screenshots won’t work. You don't need to build an entire project to figure out whether or not you're going to get on with a new text editor, just put together a couple of web pages, maybe write a jQuery plugin. Do the key combinations stick in your head or are you constantly looking up the same ones again and again? Do you have to contort your hand into an unnatural and uncomfortable claw in order to duplicate a line?

    The final thing to cover about text editors is that it's okay to pay for them. Sure, we all love free software. “Free as in pears and free as in peaches” (or whatever that motto is) but there are times when a good, well-written piece of software will cost money. And that's okay. You're a web developer. You are going to be spending the vast proportion of your day using this piece of software. If the people that made it would like you to spend $20 on it, don't instantly balk at the idea. Think back to the idea of web developers as artisan craftsmen. You're going to be using this chisel every day to carve out patterns in stone. Every now and then, you might need to buy your own chisel.

    This article is modified from a chapter in a book Andrew and I were writing a couple of years ago about web development practical advice. Seeing as we both got too busy to finish it, I'm publishing bits here and there. If you'd like to see these in book form, let me know.

    Geek, Opinion, Guides

  • 4 Aug 2011

    You must be able to read before you get a library card

    I like JavaScript. JS. ECMAScript. Ol' Jay Scrizzle as nobody calls it.

    I also like jQuery. jQ. jQuizzie. jamiroQuery. Whatever.

    Ignoring the stoopid nicknames I just made up, did you notice how I referred to JavaScript and jQuery separately? I didn't say "I like JavaScript, there are some really great lightbox plugins for it" just the same as I didn't say "I wish there was a way to do indexOf in jQuery".

    I'm regularly amazed at how many new (and some not-so-new) web developers either think they know JavaScript because they know jQuery or wish there was a way to do something in jQuery that they read about in an article about JavaScript. jQuery is a library written to make coding in JavaScript easier. It's made in JavaScript so you can say "jQuery is JavaScript" but only in the same way that "Simon is male". To confuse jQuery with all of JavaScript is the same as saying "Simon is all men" (don't worry, there's still only one of me).
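    To make the distinction concrete: indexOf and charAt are already part of the language, so no library is involved at all. A tiny illustration (the variable names are just for the example):

```javascript
var libraries = ['jQuery', 'Prototype', 'MooTools'];

// Array.prototype.indexOf is plain JavaScript (ES5), not a jQuery feature.
var position = libraries.indexOf('Prototype'); // 1

// String methods such as charAt and indexOf are built in too.
var libraryName = 'jQuery';
var firstLetter = libraryName.charAt(0);           // 'j'
var whereIsQuery = libraryName.indexOf('Query');   // 1
```

    The one wrinkle is that older browsers such as IE8 lack the array version of indexOf, which is the sort of gap a library papers over for you.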

    For most web site or web app development, I do recommend using a library. Personally, I've used jQuery and Prototype extensively and decided I prefer jQuery. Libraries are designed to make coding faster and more intuitive and they can be a great productivity aid. You can get a lot more done quicker. There is a downside, however.

    Downside

    If you're doing what the library was intended to help with, great. Slide this panel up, pop open a modal window, scroll to the bottom of the page and highlight that header. Brilliant. The difficulties come when you're either trying to do something the library wasn't intended to do or something nobody's thought of before or you're just plain doing something wrong. If you are fluent in your library of choice but don't know the JavaScript underpinnings, your usual debugging tools can only help you so far. There will come a point where there's an impenetrable black-box where data goes in and something unexpected comes out. Okay, it's probably still data but it's unexpected data.

    Don't let this point in the process be the discouragement. This is where the fun bit is.

    Learning to read

    Library authors are very clever groups of people. Often large groups. Reading through the unminified source of a library can be an awesomely educational experience as it's usually the culmination of many years' best practice. If you want a nice introduction to some of the cool things in jQuery, for instance, check out these videos from Paul Irish:

    • http://paulirish.com/2010/10-things-i-learned-from-the-jquery-source/
    • http://paulirish.com/2011/11-more-things-i-learned-from-the-jquery-source/

    I've dug around in jQuery many, many times to try and figure out why something does or doesn't do what it should or shouldn't. The most detailed investigation was probably Investigating IE's innerHTML during which nothing was solved but I found out some cool stuff.

    Learning to write

    The best way to get your head around libraries is to write your own. Yes, there are literally millions of them (not literally) out there already but you don't need to aim for world dominance, that's not the point of writing your own. Start simply, map the dollar symbol to document.getElementById. Done. You've written a tiny library.

    function $(id){ 
        return document.getElementById(id);
    }

    Now you can add some more stuff. Maybe you could check to see if the thing passed to the $ is already an element or if it's a string. That way, you could be a bit more carefree about how you pass things around.

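    function $(id){ 
      // Already an element (or other DOM node)? Hand it straight back.
      // The id check guards against null or undefined being passed in.
      if(id && id.nodeType) {
        return id;
      } else {
        // Otherwise treat it as an id string
        return document.getElementById(id);
      }
    }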

    Add in a couple of AJAX methods, some array manipulation and before you know it, you've got a full-blown web development toolkit.
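    As one possible next step, here is a sketch of bolting a small array helper onto the same $ function. The helper name $.pluck is my own choice for illustration and isn't part of jQuery's API; the sample data is made up too.

```javascript
function $(id) {
  if (id && id.nodeType) {
    return id;
  }
  return document.getElementById(id);
}

// Functions are objects, so utilities can hang off $ itself.
// $.pluck collects one named property from every item in a list.
$.pluck = function (list, key) {
  var result = [];
  for (var i = 0; i < list.length; i++) {
    result.push(list[i][key]);
  }
  return result;
};

var posts = [
  { title: 'Whitehat Syndication', year: 2011 },
  { title: 'Testing', year: 2011 }
];
console.log($.pluck(posts, 'title')); // ['Whitehat Syndication', 'Testing']
```

    Hanging helpers off the $ function keeps everything under one global name, which is roughly how the big libraries organise their utilities.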

    If you're wanting a boilerplate to start your library off, I recommend Adam Sontag's Boilerplate Boilerplate.

    Here's your Library Card

    By now, you've rooted around in the jQuery undergrowth, dug through some of Moo's AJAX and pulled apart Prototype's string manipulation. You've written your own mini library, gotten a bit frustrated and wished you had a community of several thousand contributors to make it more robust. Now you're ready to start getting irked every time someone on Forrst asks if there's a jQuery plugin for charAt. Enjoy.

    This article is modified from a chapter in a book Andrew and I were writing a couple of years ago about web development practical advice. Seeing as we both got too busy to finish it, I'm publishing bits here and there. If you'd like to see these in book form, let me know.

    Geek, Opinion, Javascript, Guides

Simon Madine (thingsinjars)

@thingsinjars.com

Hi, I’m Simon Madine and I make music, write books and code.

I’m the Engineering Lead for komment.

© 2025 Simon Madine