thingsinjars

  • 3 Oct 2011

    Whitehat Syndication

    When I refer to Google or Googlebot here, I could really be talking about any web search or web crawler. When I refer to Yahoo, I actually mean Yahoo. Don't get confused.

    I recently ran into an interesting SEO problem on a project which has led to a question I just don't know the answer to:

    How do you display syndicated content without triggering Google's duplicate content flag?

    Hmm... intriguing.

    Background

    To explain the problem more fully (without giving out any project specifics), imagine you have a website. You probably already do so that shouldn't be hard. Now, imagine you fill that website full of original content. Again, that shouldn't be hard. For the sake of example, let's assume you run a news blog where you comment on the important stories of the day.

    Next, you figure that your readers also want to read important related facts about the news story. Associated Press (AP) syndicates its content and through the API, you can pull in independently-checked related facts about whatever your original content deals with. So far, so good.

    Unfortunately, a thousand other news blogs also pull in the same AP content alongside their original (and some not-so-original) content. Now, when the Googlebot crawls your site, it finds the same content there as it does in a thousand other places. Suddenly, you're marked with a 'duplicate content' black flag and all the lovely google juice you got from writing original articles has been taken away. Boo.

    Your first thought might be to reach for the rel="canonical" attribute but that really only applies to entire pages. We need something that only affects a portion of the page.

    Solution

    What you need to do is find a way to include the content in your page when a visitor views it (providing extra value for readers) but prevent Google from reading it (hurting your search ranking). Fortunately, there are some methods for doing this. One involves having the content in an external JS file which is listed in your robots.txt to prevent Google from reading it. Another similar method involves having the content in an external HTML and including it as an iframe, again, preventing crawling via robots.txt. When the reader visits the page, the content is there, when Google visits, it isn't.

    The Problem with the Solution

    Both of the techniques mentioned here involve an extra HTTP request. You are including an external file so the visitor's browser has to go to your server, grab the file and include it. This isn't a huge problem for most sites but when you're dealing with high-traffic, highly-optimised websites, every file transferred counts. You go to all the trouble of turning all background images into sprites, why waste extra unnecessary connections on content?

    There is an extra problem with the JS solution in that the content isn't available to visitors that don't have JS enabled but as I've mentioned before, that's only a problem if any of your visitors actually have JS disabled. Only you know that for sure.

    Yahoo's Solution

    Yahoo have a nice solution to this problem. If you include the attribute class="robots-nocontent" on any element, the Yahoo spider (called 'slurp') will ignore the content. Perfect. This does, however, only work for Yahoo. Not perfect.

    My solution

    My attempt at solving this problem which is a combination of SEO and high front-end performance was inspired by the technique GMail uses to deliver JS to mobile devices. In their article, Google delivers JS that they don't want run immediately in the initial payload. They figure that the cost of serving a single slightly larger HTTP request is less than the delay in retrieving data on demand.

    I use HTML embedded in a JS comment in the original page which is then processed on DOMReady to strip out the comments and inject it into wherever it is supposed to be (identified by the data-destination attribute). I'm doing this on a page which already loads jQuery so this can all be accomplished with a simple bit of code.

    <script type="text/html" class="norobot" data-destination=".content-destination">
     /*!
      <p>This content is hidden on load but displayed afterwards using javascript.</p>
     */
    </script>
    <div class="content-destination"></div>
    <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.5/jquery.min.js"></script>
    <script>
     $('.norobot').each(function() {
      _this = $(this);
      $(_this.data('destination')).html(_this.html().replace('/*!','').replace('*/',''));
     });
    </script>

    Notes on the code

    You may have noticed the type="text/html" attribute on the script block. That's to get around the fact that jQuery parses and executes any script blocks it finds when moving elements around (in an appendTo() or an html(), for example. Adding this attribute tells jQuery to process this as template code.

    Also, the opening JS comment here begins /*!. The exclamation mark in this is a directive to any minifiers you might use on the code to tell them not to remove this comment block.

    This is also available in a standalone page.

    This is all a very long setup for my initial question. Does Google read this and, if so, does this affect duplicate content rankings?

    Plus and Minus

    • Minus: The duplicate content is definitely in the page.
    • Plus: It's hidden in JavaScript
    • Minus: we're using JavaScript to serve different content to users and to google.
    • Plus: we're showing less to google than users. Spam techniques show more to increase keyword matches.
    • Plus: faster response due to a single http request (Google likes fast pages)

    Obviously, we could add an extra step of obfuscating the 'hidden' content by reversing it or encoding it. This would definitely hide it from google and it would be trivial to undo the process before showing it to the user but is this step necessary? Part of my reasoning for concluding that Google ignores JS comments is that thousands of sites include the same standard licences bundled with their JS library of choice and don't get penalised. This may, of course, be specific to licences, though.

    I can find no definitive answer anywhere on this subject. If you have any good references, please let me know. Alternatively, if you happen to know Matt Cutts, ask him for me. If I get any conclusive answer, I'll update here.

    Development, Geek

  • 9 Sep 2011

    Testing

    The web is a visual medium. Well, mostly.

    There's no better way to test a visual medium than by looking at it. Look at your site in as many browsers as you can. If you've already got as many browsers installed on your development computer as you can fit, get another computer and install some more. Either that or run a Virtual Machine.

    Definition: Virtual Machine (VM)

    Applications like VirtualBox, VMWare or Parallels allow you to run an entire computer within your computer. It is a self-contained system that doesn't interact with your own machine meaning you can have IE6 installed on one VM, IE7 on another and IE8 on a third. All running in a window on your iMac. Shiny.

    If you can't do that easily, you could use one of the growing number of browser testing services. These are server rooms packed with computers running Virtual Machines and automated systems to which you supply a URL, wait a few moments and get shown an image (or several hundred images) showing your URL in different browsers on different platforms. Some of the more sophisticated services allow you to scroll down a long page or specify different actions, text entry or mouse events you want to see triggered. These services can be exceptionally useful when it comes to developing HTML e-mails as there are some rare and esoteric e-mail clients out in the wild. Litmus does an excellent job at virtualised testing for HTML e-mails. On that note, the Campaign Monitor library of free HTML e-mail templates is a great place to start, learn and possibly finish when working on an HTML e-mail.

    There is also a place for automated testing for some things. Recently, there has been a bit of a movement away from validating code as the purpose of web development is not to make it 'check a box' on a merely technical level, it is to get the message across via the design however possible. However, validation is still the best and easiest way to check your syntax. Syntax errors are still the main cause for mistakes appearing in your web sites and are the easiest thing to fix. Don't assume IE is wrong. Again, if you're keen on HTML e-mails, here's a great post on the Litmus blog.

    This article is modified from a chapter in a book Andrew and I were writing a couple of years ago about web development practical advice. Seeing as we both got too busy to finish it, I'm publishing bits here and there. If you'd like to see these in book form, let me know.

    Geek, Development, Guides

  • 25 Jul 2011

    Psychopathic megalomaniac vs Obsequious sycophant?

    - or -

    Tabs vs Spaces?

    Which are you? Is there any middle ground?

    In any modern, code-friendly text editor, you can set tab size. This means that one tab can appear to be as wide as a single space character or - more commonly - 2 or 4 spaces. It's even possible in some to set any arbitrary size so there's nothing stopping you setting tabs to be 30 spaces and

    format() {
                                  your code() {
                                                                like;
                                  }
                                  this;
    }

    However you do it, the point is that using tabs allows the reader personal preference.

    Indenting your code using spaces, however, is completely different. A space is always a space. There's actually a simple conversion chart:

    • 1 = One
    • 2 = Two
    • 3 = Three
    • and so on.

    The fundamental difference here is that the person reading the code has no choice about how it's laid out. Tabs means you get to read the code how you want to, spaces means you read the code how I want you to. It's a subtle difference but an important one.

    Space indenting is therefore perfectly suited to psychopathic megalomaniacs who insist their way is right while tab indenting is for obsequious sycophants who are putting the needs of others above themselves. Sure, there may be lots of grades in-between but why let moderation get in the way of a barely coherent point?

    Curiously, neither of these extremes feels a great need to comment their code. One sees no need as their code is for themselves and they already understand it, the other assumes that if they can understand it, its purpose is so obvious as to be trivial. Both are wrong here.

    There's unfortunately no real middle ground here. Teams must agree amongst themselves what side of the fence they fall down on. It is entirely possible to set many text editors to automatically convert tabs to spaces or spaces to tabs on file save but if you're doing that, you'd better hope your favourite diff algorithm ignores white space otherwise version control goes out the window.

    This article is modified from a chapter in a book Andrew and I were writing a couple of years ago about web development practical advice. Seeing as we both got too busy to finish it, I'm publishing bits here and there. If you'd like to see these in book form, let me know.

    Opinion, Development, Geek, Guides

  • 11 Jul 2011

    Torch

    Concept

    Another proof-of-concept game design prototype. This is kind of a puzzle game. Ish. It's mostly a simple maze game but one in which you can't see the walls. You can see the goal all the time but you are limited to only being able to see the immediate area and any items lit up by your torch. You control by touching or clicking near the torch. The character will walk towards as long as you hold down. You can move around and it'll follow.

    Torch v0.1

    The first 10 levels are very easy and take practically no time at all. After that, they get progressively harder for a while before reaching a limit (somewhere around level 50, I think). The levels are procedurally generated from a pre-determined seed so it would be possible to share high scores on progression or time taken without having to code hundreds of individual levels.

    Items that could be found around the maze (but aren't included in this prototype) include:

    • Spare torches which can be dropped in an area and left to cast a light
    • Entrance/exit swap so that you can retrace your steps to complete the level
    • Lightning which displays the entire maze for a few seconds (but removes any dropped torches)
    • Maze flips which flip the maze horizontally or vertically to disorient you.

    I worked on this for a few months (off and on) and found it to be particularly entertaining with background sound effects of dripping water, shuffling feet with every step, distant thunder rumbling. It can be very atmospheric and archaeological at times.

    Half-way through a level, two torches dropped
    Torch, the game

    Slightly technical bit

    The game loop and draw controls are lifted almost line-for-line from this Opera Dev article on building a simple ray-casting engine. I discarded the main focus of the article - building a 3D first-person view - and used the top-down mini map bit on its own. The rays of light emanating from the torch are actually the rays cast by the engine to determine visible surfaces. It's the same trick as used in Wolfenstein 3D but with fewer uniforms. It's all rendered on a canvas using basic drawing functionality.

    The audio is an interesting, complicated and confusing thing. If I were starting again, I'd look at integrating an existing sound manager. In fact, I'd probably use an entire game engine (something like impact.js, most likely) but it was handy to remember how I used to do stuff back in the days when I actually made games for a living. Most of all, I'd recommend not looking too closely at the code and instead focusing on the concept.

    Go, make.

    As with Knot from my previous blog post, I'm not going to do anything with this concept as I'm about to start a big, new job in a big, new country and I wanted to empty out my ‘To do’ folder. The code's so scrappy, I'm not going to put it up on GitHub along with my other stuff, I seriously do recommend taking the concept on its own. If you are really keen on seeing the original code, it's all unminified on the site so you can just grab it from there.

    The usual rules apply, if you make something from it that makes you a huge heap of neverending cash, a credit would be nice and, if you felt generous, you could always buy me a larger TV.

    Torch v0.1

    Ideas, Geek, Development, Toys

  • newer posts
  • older posts

Categories

Toys, Guides, Opinion, Geek, Non-geek, Development, Design, CSS, JS, Open-source Ideas, Cartoons, Photos

Shop

Colourful clothes for colourful kids

I'm currently reading

Projects

  • Awsm Street – Kid's clothing
  • Stickture
  • Explanating
  • Open Source Snacks
  • My life in sans-serif
  • My life in monospace
Simon Madine (thingsinjars)

@thingsinjars.com

Hi, I’m Simon Madine and I make music, write books and code.

I’m the Engineering Lead for komment.

© 2025 Simon Madine