
Autocomplete in ElasticSearch - array aggregations

The problem at hand

Recently, while working with Elasticsearch, I ran into a small challenge I couldn’t handle for some time, so I decided to write a little blog post about it.
I was trying to build an autocomplete feature that suggests to the user how the phrase they typed could be completed. For that purpose I used aggregations, available through the Elasticsearch API. An aggregation groups documents into buckets, where each bucket is described by a key and holds the documents that belong to it.
Let’s say that you have three blog post documents that look like this:

{
	"title": "Blog post 1",
	"body": "Blog post about …",
	"category": "programming" 
},
{
	"title": "Blog post 2",
	"body": "Blog post about …",
	"category": "cooking" 
},
{
	"title": "Blog post 3",
	"body": "Blog post about …",
	"category": "programming" 
}

It would be nice to get all the categories of the blog posts, and you can do that with aggregations. This way your search results would include an additional property:

{
	"aggregations": {
		"categories": {
			"buckets": [
				{
					"key": "programming",
					"doc_count": 2 
				},
				{
					"key": "cooking",
					"doc_count": 1 
				}
			]
		}
	}
}

Convenient, isn’t it?

Implementation

The index I was working with keeps a list of companies. Each company has a lot of properties, and one of those properties is locations.
The locations property is an array field:

["United States", "New York", "Buffalo"]

I wanted to gather those locations and use them for autocomplete. However, as I quickly learned, it was impossible to do so with this particular data structure.
The aggregation query looked like this:

{
	"aggs": {
		"locations": {
			"filter": {
				"bool": {
					"must": {
						"query": {"match": {"locations": "ger"}}
					} 
				}
			},
			"aggs": {
				"filtered_locations": {
					"terms": {
						"field": "locations",
						"size": 10
					}
				}
			}
		}
	}
}

This query should return 10 aggregated terms from the locations field, starting with ger.
The problem was that the field I was trying to aggregate was an array, and because of that the collection of buckets consisted of all the values in the array, regardless of the term in the filter query. To show you an example, let’s say we have two companies and the index looks like this:

{
	"name": "Digester",
	"site": "www.digester.com",
	"locations": ["United States", "New York", "Buffalo"]
},
{
	"name": "Gloomer",
	"site": "www.gloomer.com",
	"locations": ["Germany", "Bavaria", "Munich"]
}

If I used the aggregation query from above, I would get results like this:

{
	"aggregations": {
		"locations": {
			"filtered_locations": {
				"buckets": [
					{
						"key": "Germany",
						"doc_count": 1
					},
					{
						"key": "Bavaria",
						"doc_count": 1
					},
					{
						"key": "Munich",
						"doc_count": 1
					}
				]
			}
		}
	}
}

That’s because I was filtering for the documents whose locations matched ger and then gathered aggregations from the locations field of those documents. With the data arranged as an array, the aggregation suggested all three values of the array: "Germany", "Bavaria" and "Munich". At first I thought it was a nice bit of functionality, because typing a country name suggested the cities and states of that country as well. As it quickly turned out, it was not so nice.

When the number of documents in the index started to grow, the results got messy, because aggregation buckets are ordered by document count, and that created some major issues. When a user wanted to search for the United Kingdom and started to type united, the United Kingdom could never make it onto the list. The reason was simple: there is another country that starts with "United" - the United States - which has many states and many cities. Because the locations field was an array, every document matching united also contributed its other locations (cities and states) to the aggregation. Ordered by document count, the United Kingdom had too few companies to make the list of the 10 top locations, because locations like New York, California etc. jumped in before it. In the best-case scenario the United Kingdom would appear, but only after New York or some other unrelated term, which seems a little odd.

Solution: Take one!

So the thing to do was to get rid of the bucket keys that did not match the query term. At first I tried the include option, which lets you specify a regex pattern for the terms that should be included in the results, but it quickly revealed some downsides. It doesn’t work very well when your query characters are lower-cased but your aggregated terms are upper-cased or capitalized - generally, whenever the case of the query and the case of the aggregated term don’t match. There are some tricks, like using Java regex flags, but in my experience that is the wrong way. Don’t go there, or you’ll be baffled by the results you get, and it is quite problematic to debug later.
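For illustration, this is the kind of include filter I mean - a sketch only; the CASE_INSENSITIVE flag is one of the Java regex flags I’d advise against, and the exact syntax depends on your Elasticsearch version:

```json
{
	"aggs": {
		"filtered_locations": {
			"terms": {
				"field": "locations",
				"size": 10,
				"include": {
					"pattern": "ger.*",
					"flags": "CASE_INSENSITIVE"
				}
			}
		}
	}
}
```

Note that the pattern is matched against the whole indexed term, which is exactly where the mismatch between the typed query and the stored values bites.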

Solution: Take two!

My next try was to use significant_terms.
At the beginning the results were quite relevant, but after some time I hit problems with that too, so I quickly abandoned this path.
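For the record, that attempt simply swapped the terms aggregation for significant_terms, keeping the same structure as the query above - a sketch:

```json
{
	"aggs": {
		"locations": {
			"filter": {
				"bool": {
					"must": {
						"query": {"match": {"locations": "ger"}}
					}
				}
			},
			"aggs": {
				"filtered_locations": {
					"significant_terms": {
						"field": "locations",
						"size": 10
					}
				}
			}
		}
	}
}
```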

Solution: Take three!

Next! Save the locations as an array of objects. Sadly, this didn’t give me the expected results either. As a matter of fact, the results were exactly the same, due to the way documents are stored in Elasticsearch. When you have a list of users like this:

{
	"users": [
		{
			"first_name": "Joe",
			"last_name": "Smith"
		},
		{
			"first_name": "Stanley",
			"last_name": "Ipkis"
		}
	]
}

after saving it becomes:

{
	"users.first_name": ["Joe", "Stanley"],
	"users.last_name": ["Smith", "Ipkis"]
}

What I tried to do was to store locations like this:

{
    "locations": [
		{"name": "United States"},
		{"name": "New York"},
		…
	]
}

And after it was flattened by Elasticsearch it became:

{"locations.name": ["United States", "New York", …]}

So this is exactly the same, but now instead of querying the locations field I needed to query the locations.name field and get the aggregates from locations.name. No gain at all. Well… you could think so, but that wouldn’t be exactly true.

Solution: Final take!

As it turned out, the data arrangement from the previous solution led me to change the mapping type for the locations property. The mapping I should have used was the nested type. After I mapped locations with the nested type, reindexed the data, and changed the query a little, I got the results I wanted. That’s because this type of mapping, unlike the previous one, makes Elasticsearch store the properties of the document in the following way:

[
	{
		"users.first_name": "Joe",
		"users.last_name": "Smith"
	},
	{
		"users.first_name": "Stanley",
		"users.last_name": "Ipkis"
	}
]
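A minimal sketch of a mapping that produces this layout for the locations field - the exact syntax (and whether you use keyword or a plain not-analyzed string for name) depends on your Elasticsearch version:

```json
{
	"mappings": {
		"properties": {
			"locations": {
				"type": "nested",
				"properties": {
					"name": {"type": "keyword"}
				}
			}
		}
	}
}
```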

This layout was exactly what I needed. I tweaked my aggregation query to treat locations as a nested type, because nested fields require a slightly different query structure, and voilà! The autocomplete suggests the values properly. Now, after typing ger, you don’t get the cities and states (I suppose that could still be done somehow, because I still think it’s a nice functionality), and when you type united, the United Kingdom shows up properly in the list of aggregation buckets.
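For completeness, a sketch of the tweaked query - it assumes locations is mapped as nested with a name sub-field, the matching_locations aggregation name is my own, and the exact filter syntax varies between Elasticsearch versions:

```json
{
	"aggs": {
		"locations": {
			"nested": {"path": "locations"},
			"aggs": {
				"matching_locations": {
					"filter": {
						"match": {"locations.name": "ger"}
					},
					"aggs": {
						"filtered_locations": {
							"terms": {
								"field": "locations.name",
								"size": 10
							}
						}
					}
				}
			}
		}
	}
}
```

The nested aggregation scopes everything inside it to the individual location objects, so the filter keeps only the locations matching ger, and the terms sub-aggregation never sees the unrelated cities and states from the same array.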
