Efficiently paging geospatial data with MongoDB — forwards and backwards (Part 1)

Photo by NASA on Unsplash

In this article series I will describe some essential techniques for querying geospatial data in MongoDB, which can be useful if you want your app or API to provide access to information ordered based on distance from some specific location. For example:

  • businesses ( restaurants, shops etc. ) or other places of interest,
  • other users of the app ( as in the case of dating apps ).

The techniques will allow your service to scale and remain efficient as they enable constant time access to the data (regardless of the amount of data in your database) and minimise caching required on the client side.

What you will learn

In the following sections I will show you how to:

  1. store location data ( longitude, latitude pairs ) in MongoDB documents
  2. query such documents using the location data, with results sorted by distance from a specified point ( from closest to furthest )
  3. efficiently page through results of such queries

You should be able to follow this tutorial even if you haven’t used MongoDB before. On the other hand if the first 2 points sound familiar, you might want to skip straight to section 3.

We’ll be using Node.js and the official Javascript MongoDB driver. Code snippets will be in Coffeescript 2.

If you want to run the code examples locally, clone the accompanying repo, and follow the instructions in the README.md.

Storing location data.

For longitude, latitude pairs we need to use the GeoJSON Point object format.

The doc above is what you’d insert into a mongo collection, it has a field called location which value is a GeoJSON Point object. The name of the field is immaterial, we could have used any other valid key name instead of location. We could also nest the GeoJSON deeper into the object structure, however putting type and coordinates at the top level of doc wouldn’t work. It’s also possible to store multiple GeoJSON objects in one document, for example:

Note that:

  • longitude comes first ( this is a reverse to what you might be used to from, for example, google maps queries )
  • longitude values need to be between -180 and 180 ( both inclusive )
  • latitude values need to be between -90 and 90 ( both inclusive )

Now that you know the basics, lets generate some data to work with.

We’ve created 6 documents, with locations starting at the equator, increasing the latitude in equal intervals, while keeping longitude fixed at 0. Here are the points plotted on a sphere:

See code used to generate this on JSFiddle. Note: you can click and drag on the above graphic to rotate the sphere.

Note that in the following code examples we will omit the boilerplate required to obtain a mongodb collection object and insert documents into it. In the accompanying repo, this boilerplate has been contained in the with_collection and with_collection_with_points helper functions.

Querying for documents based on location.

To query documents based on their distance from a specific point we are going to use the $near query operator. But before we do this, we need to create a 2dsphere index on the field containing the GeoJSON Point objects.

runnable example on github

Setting the background option is very important when creating an index on a live database, since by default createIndex will block all other operations on the database while the index is being created ( which might take a while if the collection is big ).

Now the basic query, which returns all documents sorted by distance ( great-circle distance to be precise ) from point [ 0, 0 ], based on data at location field in the documents:

runnable example on github

Unless your intention is to process all documents in the collection ( in which case you’ll probably call .stream instead of .toArray ), you’ll want to limit, the number of returned documents. Here’s how its done:

runnable example on github

On the illustration below, the white point marks the location used in the above query ( [ 0, 0 ] ) and the points circled in green represents the results of the query with limit set to 2.

see code used to generate this on JSFiddle

Paging through results, from closest to furthest ( forwards ).

Note: the technique discussed in this section has been previously described by A. Jesse Jiryu Davis, who implemented the MondoDB feature which makes this technique possible. His article goes into details of why this method is performant, so its worth a read, however code examples are in python, therefore we’ll go through it step by step here for the benefit of the Node.js community.

Building on the query we defined in the previous section, the simplest way to implement paging would be to use limit to control the page/batch size, and the skip method to set the desired page offset. For example:

runnable example on github

The above query will return the 2nd page of results when querying our test data, as illustrated below.

see code used to generate this on JSFiddle

However the performance of skip decreases linearly as the offset increases ( as demonstrated in the article mentioned above ), since the MongoDB server needs to scan through all the query results from the beginning until the offset is reached.

A constant time alternative involves using the $minDistance query operator, to exclude results which lay within given radius from the query point.

Assuming that we know the distance between the query point and the furthest document from a given page of results, we could query for the next page as follows:

But where do we get the distance from? We could try to calculate it using a hand rolled implementation of a formula for a distance between two points on a sphere. Or use something like getDistance from Geolib. However we’d have to make sure that our chosen implementation matches the implementation used by MongoDB. Fortunately we don’t have to go through all that hassle since we can ask MongoDB to attach a dynamically calculated distance to each document in the query results. We’ll just have to convert our find query into an equivalent aggregation pipeline, using $geoNear stage with its distanceField option.

Now we can use the calculated_distance property of the last document from the query results ( since they are sorted by distance, in ascending order ), to fetch the next page. Lets use the above function to fetch first two pages.

runnable example on github

And here is a visualisation of the results of the second call to fetch_page, with the yellow circle enclosing the area excluded from the query by using minDistance.

caption: see code used to generate this on JSFiddle

However, this is not exactly what we want: the last document of the previous page is included in the next page, since minDistance only excludes documents, which distance is smaller then given value. To prevent this, we need to add the following query to our $geoNear aggregation stage.

runnable example on github

The query uses the $nin operator to skip documents with given _ids when gathering the results. Are we done? Nearly! Consider the following set of points.

Notice that the 2nd and 3rd points lay at exactly the same distance from point [ 0, 0 ]. Now, if we set the query_point to [ 0, 0 ] and page_size to 3, what would the result of fetching the second page be? Assuming that the points returned by the first query were ordered as in the array above ( [ 0, -15 ] last ), here is how it would look like.

see code used to generate this on JSFiddle

This is because neither minDistance nor $nin exclude the document with coordinates [ 0, 15 ]. Therefore instead of just using the _id of the last document, we need to collect _ids of all documents which distance is equal to distance of last_doc. Putting it all together:

Given current_page here’s how you would fetch the next one:

runnable example on github

Note that the logic in get_last_distance and get_ids_to_exclude will most likely be executed on the client, hence it was not included in the fetch_page function, which expects precomputed exclude_ids and last_distance to be passed in. This is to minimize the amount of data sent over the wire. Alternatively the server could include a HTTP Link Header with all information necessary to fetch the next page, in which case all of the above code would be executed on the server.

Finally we need to handle a case where there are so many documents with the same distance, that the last document in a page ends up having the same distance as the last_distance used to fetch that page. For example, when fetching 3rd page from the following set of points ( with query_point = [ 0, 0 ] and page_size = 2 ):

Using the above implementation we would get points [ [ 0, -15 ], [ 0, 30 ] ] instead of desired [ [ 0, 30 ], [ 0, 45 ] ], since _id of the last point from page 1 ( [ 0, -15 ] ), is not excluded even though its distance is the same as minDistance. Therefore in such cases, we need to carry exclude_ids from previous query to the next.

runnable example on github

It is worth noting that in the extreme case off all documents having the same distance, the size of exclude_ids array, and hence the amount of data sent over the wire on each request, would increase linearly as we progress through the pages. On top of this, query performance itself would degrade in similar fashion ( and my bet is that it would be worse then the naive implementation relying solely on skip ). To mitigate this, instead of accumulating exclude_ids, we could keep count of documents to skip ( and use it in conjunction with minDistance ). This would keep the size of the request constant, and the query performance within bounds of the simple skip solution. However keeping track of _ids to exclude has some advantages over using skip in cases where queried data is highly dynamic ( when changes are expected to happen in between fetches ), which we might discuss in future articles.

Finally, I’d like to draw your attention to the fact that the above method doesn’t facilitate jumping directly to a specific page ( without fetching all pages in between ). However this can be achieved by adding appropriate skip ( equal to page_size * pages_to_skip_count ) to the query. Using skip will obviously incur a performance penalty proportional to the amount of documents skipped, but there is really no way around it ( unless instead of specifying amount of pages/documents to skip, your use case could do with with using minDistance to skip an unknown number of documents ). Fortunately we only need to pay this price once per page jump, since once the desired page is fetched using skip, the next one can be queried for based using only minDistance and exclude_ids.

Conclusion

We have demonstrated how to store and query geospatial data in mongodb, and discussed how to efficiently page through large amount of such data ( from closest location to furthest ). In Part 2 of this series, you will learn how to use a nifty trick to page through locations in reverse order.

Don’t forget clone the accompanying repo and play with code examples which can be easily ran from the command line.

read original article here