-
-
Explanating Experiment Follow-up
A year ago, I started a little experiment (and not one of my usual web experiments).
I decided to give away my book, Explanating, for free. The website has links to download the book without charge and also links through to Amazon and Lulu. I asked people to download the free one then, if they enjoyed it, they could come back and buy it.
Twelve months later, I now have the answer.
- Free: 315
- Amazon: 2
- Lulu: 0
Erm. Yeah. Not quite as successful as I would have liked. Still, the point was to measure the 'conversion rate'. Admittedly, it's a small sample size but it would appear to be about 0.6%. At this rate, I only need another 97,465,886 people to download the free one and I'll have made £1,000,000! Sweet!
-
Location-based time
Inspired by the simplicity of implementing a proximity search using MongoDB, I found myself keen to try out another technology.
It just so happened that I was presented with a fun little problem the other day. Given a latitude and longitude, how do I quickly determine what the time is? Continuing the recent trend, I wanted to solve this problem with Node.JS.
Unsurprisingly, there's a lot of information out there about timezones. Whenever I've worked with timezones in the past, I've always gotten a little bit lost so this time, I decided to actually read a bit and find out what was supposed to happen. In essence, if you're doing this sort of task. you do not want to have to figure out the actual time yourself. Nope. It's quite similar to one of my top web dev rules:
Never host your own video.
(Really, never deal with video yourself. Pay someone else to host it, transcode it and serve it up. It'll will always work out cheaper.)
What you want to do when working with timezones is tie into someone else's database. There are just too many rules around international boundaries, summer time, leap years, leap seconds, countries that have jumped over the international date line (more than once!), islands whose timezone is 30 minutes off the adjacent ones...
To solve this problem, it needs to be split into two: the first part is to determine which timezone the coordinate is in, the second is the harder problem of figuring out what time it is in that timezone. Fortunately, there are other people who are already doing this. Buried near the back of the bottom drawer in very operating system is some version of the
tz database
. You can spend hours reading up about it, its controversies and history on Wikipedia if you like. More relevantly, however, is what it can do for us in this case. Given an IANA timezone name – "America/New_York", "Asia/Tokyo" – you can retrieve the current time from the system'stz database
. I don't know how it works. I don't need to know. It works.Node
Even better for reaching a solution to this problem, there's a node module that will abstract the problem of loading and querying the database. If you use the
zoneinfo
module, you can create a new timezone-aware Date object, pass the timezone name to it and it will do the hard work. awsm. The module wasn't perfect, however. It loaded the system database synchronously usingfs.readFileSync
which is I/O blocking and therefore a Bad Thing. Boo.10 minutes later and Max had wrangled it into using the asynchronous, non-blocking
fs.ReadFile
. Hooray!Now all I needed to do was figure out how to do the first half of the problem: map a coordinate to a timezone name.
Nearest-Neighbour vs Point-in-Polygon
There are probably more ways to solve this problem but these were the two routes that jumped to mind. The tricky thing is that the latitude and longitude provided could be arbitrarily accurate. A simple lookup table just wouldn't work. Of course, the core of the problem was that we needed to figure out the answer fast.
Nearest Neighbour
- Create a data file containing a spread of points across the globe, determine (using any slow solution) the timezone at that point.
- Load the data into an easily searchable in-memory data-structure (such as a k-d tree)
- Given a coordinate, find the nearest existing data point and return its value.
Point in Polygon
- Create a data file specifying the geometry of all timezones.
- Given a coordinate, loop over each polygon and determine whether this coordinate is positioned inside or outside the polygon.
- Return the first containing polygon
This second algorithm could be improved by using a coarse binary search to quickly reduce the number of possible polygons that contain this point before step 2.
Despite some kind of qualification in mathematic-y computer-y stuff, algorithm analysis isn't my strong point. To be fair, I spent the first three years of my degree trying to get a record deal and the fourth trying to be a stand-up comedian so we may have covered complexity analysis at some point and I just didn't notice. What I do know, however, is that k-d trees are fast for searching. Super fast. They can be a bit slower to create initially but the point to bear in mind is that you only load it once while you search for data lots. On the other hand, while it's a quick task to load the geometry of a small number of polygons into memory, determining which polygon a given point is in can be slow, particularly if the polygons are complex.
Given this vague intuition, I settled on the first option.
If I wanted to create a spread of coordinates and their known timezones from scratch, it might have been an annoyingly slow process but, the Internet being what it is, someone already did the hard work. This gist contains the latitude and longitude for every city in the world and what IANA timezone it is in. Score! A quick regex later and it looks like this:
module.exports = [ {"latitude": 42.50729, "longitude": 1.53414, "timezone": "Europe/Andorra"}, {"latitude": 42.50779, "longitude": 1.52109, "timezone": "Europe/Andorra"}, {"latitude": 25.56473, "longitude": 55.55517, "timezone": "Asia/Dubai"}, {"latitude": 25.78953, "longitude": 55.9432, "timezone": "Asia/Dubai"}, {"latitude": 25.33132, "longitude": 56.34199, "timezone": "Asia/Dubai"}, etc…
All that's left is to load that into a k-d tree and we've got a fully-searchable, fast nearest neighbour lookup.
Source
The source for this node module is, of course, available on GitHub and the module itself is available for install via npm using:
npm install coordinate-tz
When combined with the
zoneinfo
module (or, even better, this async fork of the module), you can get a fast, accurate current time lookup for any latitude and longitude.Not a bad little challenge for a Monday evening.
-
Building a Proximity Search
This is the detailed post to go with yesterday's quick discussion about proximity search. All the code is available on GitHub.
This assumes a bit of NodeJS knowledge, a working copy of homebrew or something similar.
Install
- MongoDB -
brew install mongodb
- NodeJS
- NPM (included in NodeJS installer these days)
These are included in the
package.json
but it can't hurt to mention them here:npm install twitter
(node twitter streaming API library)npm install mongodb
(native mongodb driver for node)npm install express
(for convenience with API later)
Start
mongod
in the background. We don't quite need it yet but it needs done at some point, may as well do it now.Create a Twitter App
Fill out the form Then press the button to get the single-user access token and key. I love that Twitter does this now, rather than having to create a full authentication flow for single-user applications.
ingest.js
(open the ingest.js file and read along with this bit)
Using the basic native MongoDB driver, everything must be done in the
database.open
callback. This might lead to a bit of Nested Callback Fury but if it bothers you or becomes a bit too furious for your particular implementation, there are a couple of alternative Node-MongoDB modules that abstract this out a bit.// Open the proximity database db.open(function() { // Open the post collection db.collection('posts', function(err, collection) { // Start listening to the global stream twit.stream('statuses/sample', function(stream) { // For each post stream.on('data', function(data) { if ( !! data.geo) { collection.insert(data); } }); }); }); });
Index the data
The hard work has all been done for us: Geospatial Indexing in MongoDB. That's a good thing.
Ensure the system has a Geospatial index on the tweets.
db.posts.ensureIndex({"geo.coordinates" : "2d"})
Standard Geospatial search query:
db.posts.find({"geo.coordinates": {$near: [50, 13]}}).pretty() (find the closest points to (50,13) and return them sorted by distance)
By this point, we've got a database full of geo-searchable posts and a way to do a proximity search on them. To be fair, it's more down to mongodb than anything we've done.
Next, we extend the search on those posts to allow filtering by query
db.posts.find({"geo.coordinates": {$near: [50, 13]}, text: /.*searchterm.*/}).pretty()
API
Super simple API, we only have two main query types:
/proximity?latitude=55&longitude=13
/proximity?latitude=55&longitude=13&q=searchterm
Each of these can take an optional
'callback'
parameter to enablejsonp
. We're using express so the callback parameter and content type for returning JSON are both handled automatically.api.js
(open the api.js file and read along with this bit)
This next chunk of code contains everything so don't panic.
db.open(function() { db.collection('posts', function(err, collection) { app.get('/proximity', function(req, res) { var latitude, longitude, q; latitude = parseFloat(req.query["latitude"]); longitude = parseFloat(req.query["longitude"]); q = req.query["q"]; if (/^(-?d+(.d+)?)$/.test(latitude) && /^(-?d+(.d+)?)$/.test(longitude)) { if (typeof q === 'undefined') { collection.find({ "geo.coordinates": { $near: [latitude, longitude] } }, function(err, cursor) { cursor.toArray(function(err, items) { writeResponse(items, res); }); }); } else { var regexQuery = new RegExp(".*" + q + ".*"); collection.find({ "geo.coordinates": { $near: [latitude, longitude] }, 'text': regexQuery }, function(err, cursor) { cursor.toArray(function(err, items) { writeResponse(items, res); }); }); } } else { res.send('malformed lat/lng'); } }); }); });
If you've already implemented the
ingest.js
bit, the majority of thisapi.js
will be fairly obvious. The biggest change is that instead of loading the data stream then acting upon each individual post that comes in, we're acting on URL requests.app.get('/proximity', function(req, res) {
For every request on this path, we try and parse the query string to pull out a latitude, longitude and optional query parameter.
if (/^(-?d+(.d+)?)$/.test(latitude) && /^(-?d+(.d+)?)$/.test(longitude)) {
If we do have valid coordinates, pass through to Mongo to do that actual search:
collection.find({ "geo.coordinates": { $near: [latitude, longitude] } }, function(err, cursor) { cursor.toArray(function(err, items) { writeResponse(items, res); }); });
To add a text search into this, we just need to add one more parameter to the
collection.find
call:var regexQuery = new RegExp(".*" + q + ".*"); collection.find({ "geo.coordinates": { $near: [latitude, longitude] }, 'text': regexQuery }
This makes it so simple, making it it kind of feels like cheating. Somebody else did all the hard work first.
App.net Proximity
This works quite well on the App.net Global Timeline but it'll really become useful once the streaming API is switched on.
Of course, the code is all there. If you want to have a go yourself, feel free.
- MongoDB -
-
Proximity Search
Now that geolocated posts are beginning to show up around app.net, I found myself wondering about proximity search. Twitter provides one themselves for geotagged tweets. What a proximity search does, essentially, is provide results from a data set ordered by increasing distance from a given location. This can be further enhanced by combining it with a text search either before or after the distance sorting. This would give you a way to search for a certain query within a certain area.
When I first started thinking about the tech required for a proximity search, I remembered Lukas Nowacki back in our old Whitespace days implementing the Haversine formula in MySQL (Alexander Rubin has a good overview of how to do this). As much as I love my trigonometry and logarithms, I must admit, I was looking around for a simpler solution. Actually, I was looking around for a copy-paste solution, to be honest. I may even have spent some time going down that route if Max hadn't pointed me in the direction of MongoDB.
I'd been putting off digging into NoSQL databases for a while because, well, I had no real reason to. Recently, I've either been focused on front-end dev or hacking away at Java and never really had any good reason to investigate any of these new-fangled technologies get off my lawn you kids.
MongoDB
After 10 minutes of messing around with Mongo, I pretty much just found myself saying “No... way. There's no way that's actually working” I'm sure those of you experience with document-oriented databases are rolling your eyes right now but for those few of us left with an entirely relational concept of databases, let me just explain it like this: you know those things you want to do with a database that are just a hassle of multiple joins and confusing references? Document databases do some of those things really really well.
The biggest selling point for me, however was the native geospatial indexing. That pretty much made the majority of my proximity search complete. All I needed to do was wrap it in a nice interface and call it a day...
I'll follow up tomorrow with a more detailed 'How-to' guide.