Building an Address Autocomplete web service in Elasticsearch and Go (Part 1)

Apr 13 2015

I'm always amazed at all of the things that Elasticsearch can do while staying relatively simple to use. Recently I wanted to try something new with the search engine, so I wondered: how could it be used to create an address autocomplete web service?

Perhaps you've noticed such services on websites: as you start typing something into an input box, like maybe the city you live in, the input box displays suggestions. If you select one of the suggestions it populates the input box so you don't have to type it manually. I haven't looked for studies to back this up, but I'd bet that users love this kind of feature because it lets them enter monotonous information more quickly and get on with their day. In applications that require an address, perhaps to zoom to a point on a map, suggestions are even more important. Geocoding (converting an address to a point on a map) is not an exact science, and users type all kinds of things into address inputs. If the application doesn't find a location match for what they typed, they may give up on the app out of frustration (or, even worse, they can't avoid using it and curse the developers every time it fails). An address autocomplete input helps prevent this by guiding the user as they type, giving hints along the way as to what is an acceptable input.

Because every new character typed will potentially require another call to such an endpoint, performance is important. Elasticsearch, and the Lucene library it uses under the hood, are quite performant, so they are a perfect fit for this use case. But how?

After looking at Elasticsearch's docs for a bit, I noticed a Completion Suggester feature that returns similar-looking terms (in our case addresses) for a given input. To try it out I inserted some documents (a NoSQL equivalent to rows) into an addresses index (equivalent to a table, but schema-less). Interaction with the engine occurs over the built-in REST API, which I've grown to love because it follows HTTP conventions: use GET to fetch documents, PUT to add new ones and DELETE to remove them.
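
For example, once a document is in the index, fetching it back by ID is just a GET (using the addresses index and address document type that I set up later in this post):

GET http://localhost:9200/addresses/address/1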

After inserting some addresses, I used the /_analyze resource to see how the text analysis behaved when I tried different analyzers and tokenizers, kind of like an EXPLAIN in SQL. For indexing, I settled on the standard analyzer, which indexed the addresses as they were inserted and gave good results.
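
To give you an idea of how I compared them: you can send sample text to the _analyze endpoint along with the analyzer you want to test, and it returns the tokens that would be produced. This is just an illustration (the address is an arbitrary example):

GET http://localhost:9200/addresses/_analyze?analyzer=standard
108 WHARTON ST

Looking at the token lists makes it obvious when an analyzer is splitting or dropping parts of an address in ways you don't want.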

For the search analyzer (which splits the query input based on a set of rules), after some experimentation it seemed that I needed a combination of the simple and whitespace analyzers. I needed to split the address on whitespace and convert the input to lowercase; however, the simple analyzer divides on non-letters, which was screwing up pretty much every address search because addresses usually start with integers. I realized that I had to create my own custom analyzer for this case.

To do that, I first created a new index. Inserting your first document into an index implicitly creates the index, but you can also create it explicitly:

PUT http://localhost:9200/addresses
{
    "acknowledged": true
}

(I won't be including Elasticsearch responses from here on out because they're usually something similar to this one if all is well)

Next I had to define my mapping. Although you don't need a schema per se to start inserting and querying documents in Elasticsearch, you'll often want more control over the types of the fields you're entering and how the data is indexed. A mapping is how you define this before you insert any data.

Before I could even define my mapping, however, I had to define the custom analyzer that I spoke of earlier so that I could use it in my eventual mapping. To do this (so much preamble!) I first had to close the index so that its analysis settings could be changed:

POST http://localhost:9200/addresses/_close

PUT http://localhost:9200/addresses/_settings
{
  "index": {
    "analysis": {
      "tokenizer": {
        "whitespace": { "type": "whitespace"}
      },
      "analyzer": {
        "address": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["trim", "lowercase"]
        }
}}}}

Without going through every line, what I'm doing in the second request is creating my custom analyzer named "address": it splits on whitespace, trims the tokens and lowercases them, hitting the sweet spot between the simple and whitespace analyzers that I need for properly dealing with addresses.
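
Because the index was closed to make that settings change, it needs to be reopened before anything else:

POST http://localhost:9200/addresses/_open

If you want to sanity-check the new analyzer, the same _analyze trick from earlier works here too (again, the address is just a sample); it should come back as the tokens 108, wharton and st:

GET http://localhost:9200/addresses/_analyze?analyzer=address
108 WHARTON ST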

And now I can (finally!) define my mapping:

PUT http://localhost:9200/addresses/address/_mapping
{
  "address": {
    "properties": {
      "name": { "type": "string" },
      "suggest": {
        "type": "completion",
        "index_analyzer": "standard",
        "search_analyzer": "address",
        "payloads": true,
        "preserve_separators": false
}}}}

Here I'm telling Elasticsearch to use the standard analyzer at index time and my new "address" analyzer at search time, defining the suggest field as a completion type and enabling payloads attached to each document. We'll see later where the payloads come in handy.

Finally, we are ready to insert some address data! How about the street addresses for my fair city of Philadelphia? Luckily, the City of Philadelphia OIT department publishes a big CSV of standardized addresses on GitHub that'll work perfectly because they are considered clean instances of actual addresses.

Here's how a document insert looks:

PUT http://localhost:9200/addresses/address/1

{
  "name" : "108 WHARTON ST",
  "suggest" : {
    "input": "108 WHARTON ST",
    "output": "108 WHARTON ST",
    "payload" : { "lon" : -75.146863501, "lat": 39.931288703 }
  }
}
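
A quick aside: inserting hundreds of thousands of addresses one PUT at a time gets slow, so in practice you'd likely reach for Elasticsearch's bulk API instead. A rough sketch, reusing the same example address (you'd repeat the action line/document line pair for each row of the CSV):

POST http://localhost:9200/addresses/address/_bulk
{ "index": { "_id": "1" } }
{ "name": "108 WHARTON ST", "suggest": { "input": "108 WHARTON ST", "output": "108 WHARTON ST", "payload": { "lon": -75.146863501, "lat": 39.931288703 } } }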

You'll notice that our payload object here includes the latitude and longitude coordinates as provided in the original CSV file. While these fields don't affect the search results, they are attached to every address returned. My demo later shows how this is helpful: essentially, if a user selects a suggestion as the address they want to go with, there's no need to make another request to get the location details because you already have them. You can imagine other fields that you might want to add here, depending on your use case.

Finally, once we have all of our data in Elasticsearch, we can get suggestions like this:

POST http://localhost:9200/addresses/_suggest
{
  "address-suggest" : {
    "text" : "1234 m",
    "completion" : {
      "field" : "suggest",
      "size": 10
    }
  }
}

{"_shards":{"total":5,"successful":5,"failed":0},
"address-suggest":[
  {"text":"1234 m","offset":0,"length":6,"options":[
    {"text":"1234 MAGEE AVE","score":1.0,"payload":{"lon":-75.07945141,"lat":40.04414173}},
    {"text":"1234 MARKET ST","score":1.0,"payload":{"lon":-75.16097759,"lat":39.95166992}},
    ...
    {"text":"1234 MERCY ST","score":1.0,"payload":{"lon":-75.1669501,"lat":39.92437999}}
]}]}

Pretty cool, huh?

Check out a demo app here (UPDATE 11/5/15: AWS costs can add up, so I've turned off this server for now) that includes the functionality I've been working towards. On the backend I've got a Go app (so great to code an API with!) in front of an instance of Elasticsearch. The performance of the endpoint is pretty good considering there are over 600K addresses and it's currently running on an AWS EC2 micro instance. All of the code is available on GitHub, including instructions, and I welcome pull requests.
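
The Go layer doesn't have to be much more than a thin proxy around the _suggest call shown above. The repo has the real implementation; what follows is only a minimal sketch of the idea, where the /autocomplete route, the struct names and the hard-coded localhost URL are all my own assumptions:

package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

// suggestResult mirrors the parts of the _suggest response we care about.
type suggestResult struct {
	AddressSuggest []struct {
		Options []struct {
			Text    string `json:"text"`
			Payload struct {
				Lon float64 `json:"lon"`
				Lat float64 `json:"lat"`
			} `json:"payload"`
		} `json:"options"`
	} `json:"address-suggest"`
}

func autocomplete(w http.ResponseWriter, r *http.Request) {
	text := r.URL.Query().Get("text")
	if text == "" {
		http.Error(w, "missing text parameter", http.StatusBadRequest)
		return
	}

	// Build the same completion suggester query used earlier in the post.
	query := map[string]interface{}{
		"address-suggest": map[string]interface{}{
			"text": text,
			"completion": map[string]interface{}{
				"field": "suggest",
				"size":  10,
			},
		},
	}
	body, err := json.Marshal(query)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Forward the query to Elasticsearch.
	resp, err := http.Post("http://localhost:9200/addresses/_suggest", "application/json", bytes.NewReader(body))
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	var result suggestResult
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Pass the suggestions (text plus lon/lat payload) straight through to the client.
	w.Header().Set("Content-Type", "application/json")
	if len(result.AddressSuggest) > 0 {
		json.NewEncoder(w).Encode(result.AddressSuggest[0].Options)
		return
	}
	json.NewEncoder(w).Encode([]struct{}{})
}

func main() {
	http.HandleFunc("/autocomplete", autocomplete)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

With the data loaded, a request like GET /autocomplete?text=1234+m would return the same options array (suggestion text plus coordinates) seen in the _suggest response above.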

You'll see that the results aren't exactly ideal at this point; for one, there must have been something going on with North/N and South/S in the dataset (or perhaps it's my fault). Also, ordering can be important. Currently there is no sorting because the index doesn't really have anything to sort on. Chances are, though, that if you type "M" you probably want "Market" or another major street that starts with "M" instead of a side alley. The Streets Dept's centerline dataset includes a CLASS field that gives a numerical weight to each street segment (major road, minor, etc.). Coupled with the completion suggester's weight field, you could order by street class to place more common streets at the top of the suggestion list (a sketch of what that might look like is below). This would require a level of GIS wizardry that I'm rusty on, so a pull request adding that column to the data would be appreciated (hint hint)!
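
For the curious, weight is just one more field on the suggest object at index time; suggestions with higher weights float to the top of the list. The earlier insert might then look something like this (the weight value here is made up; in reality it would be derived from the CLASS field after joining it onto the address data):

PUT http://localhost:9200/addresses/address/1
{
  "name" : "108 WHARTON ST",
  "suggest" : {
    "input": "108 WHARTON ST",
    "output": "108 WHARTON ST",
    "weight": 4,
    "payload" : { "lon" : -75.146863501, "lat": 39.931288703 }
  }
}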

There are probably other strategies for improving the results (such as aliasing), so this is a work in progress for now that gives me a chance to dive deeper into Elasticsearch. What's interesting to me about this project is that even though it's essentially text-based information retrieval, you can apply GIS concepts to get better results. Additionally, like most traditional Elasticsearch use cases, it's a situation in which a backend API service can be thought of from a usability standpoint: you'll want to configure it differently depending on the application that will consume this web service.

The next blog post in this series will discuss how I used Docker containers to make deployment of this web service easy.

Discuss with me on Twitter.

Send a pull request for this post on GitHub.

Dave Walk is a software developer, basketball nerd and wannabe runner living in Philadelphia. He enjoys constantly learning and creating solutions with Go, JavaScript and Python. This is his website.