Archiving Tweets - Technical

Twitter has an amazing amount of data and there's no end to the number of ideas a coffee-fuelled mind can come up with to take advantage of that data. One of the greatest advantages of Twitter – its immediacy, its currency, its of-the-moment-ness – is, however, also one of the drawbacks. It is very easy to find out what people are thinking about now, yesterday or a few days ago but trying to find something from more than a week ago is surprisingly tricky. The standard, unauthenticated search only returns the last 7 days of results, beyond that, there's nothing. You can get around this by building custom applications which authenticate themselves correctly with Twitter and provide a level of traceability but that's quite tricky.

The easiest way to archive search results for your own personal use is actually surprisingly simple. Every search results page includes an RSS feed link1. This will be updated everytime there are new results. Simply subscribe to this feed with a feed reader (Google Reader is simple and free) and your results will be archived permanently. This is great if you're searching for personal interest stuff but doesn't work so well if you want to share the results with the public.

This was the problem I was presented with when I was asked to build a tweet-archiving mechanism for Museum140's Museum Memories (#MusMem) project. Jenni wanted some kind of system that would grab search results, keep them permanently and display them nicely. I didn't want to create a fully OAuth-enabled, authenticated search system simply because that seemed like more work than should really be necessary for such a simple job. Instead, I went down the RSS route, grabbing the results every 10 minutes and storing them in a database using RSSIngest by Daniel Iversen. The system stores the unique ID of each Tweet along with the time it was tweeted and the search term used to find it. The first time a tweet is displayed, a request is made to the Twitter interface to ask for all the details, not only of the tweet, but also of the user who tweeted it. These are then stored in the database as well. This way, we don't make too many calls to Twitter and we don't get blocked.

If you want your own Tweet Archive, I've put the code on GitHub for anyone to use. It requires PHP, MySQL and, ideally, a techie-type to set it up.

Archiving Tweets - Non-technical

With the technical side out of the way, we're left with the human issues to deal with. If you're automatically saving all results with a particular phrase, all a malicious person needs to do is include that phrase in a spam-filled tweet or one with inappropriate language and suddenly, it's on your site, too. If you aren't going to individually approve each tweet before it is saved, you must keep a vigilant eye on it.

The other thing which has turned out to be a problem is the Signal-to-Noise ratio. When we initially decided on the hashtag #MusMem, nobody else was using it. To use Social Media parlance, there was no hashtag collision. The idea was to encourage people to use it when they wanted their memories stored in the MemoryBank. Unfortunately, it is now being used by anyone tweeting anything related to Museums and Memories. This is particularly troublesome at the moment as this month is International Museum month, one of the themes of which is ‘Memory’ (which is why we built the MemoryBank in the first place). This means that the permanent memories we want stored (the Signal) are getting lost in the morass of generic Museum Memories (the Noise). There is no way to solve this problem algorithmically. If we are to thin it down, we actually need to manually edit the several thousand entries stored.

If anyone can think of a solution to this issue, please let everybody know – the world needs you.