In a previous post I talked about how Elasticsearch can break catastrophically. In this post I'll talk about ways that it can run out of resources and slow down queries.

What resources?

CPU time is super important to us, but IO seems to be what causes us more trouble.

What requires disk IO

Queries in Lucene are mostly memory-mapped reads of write-once segments, which is pretty friendly to disk caching. As such, anything that disrupts disk caching hurts us: a large merge, an _optimize, or swapping an alias from one active index to another.
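
To make that concrete, here's roughly what those cache-disrupting operations look like against the REST API of the 1.x-era Elasticsearch we were running (the index and alias names are made up for illustration):

    # Merge down to one segment (_optimize in 1.x, later renamed _forcemerge);
    # this rewrites segments and churns the disk cache while it runs.
    curl -XPOST 'localhost:9200/enwiki_content_new/_optimize?max_num_segments=1'

    # Swap an alias from one active index to another; queries suddenly land on
    # segments that may not be in the disk cache yet.
    curl -XPOST 'localhost:9200/_aliases' -d '{
      "actions": [
        {"remove": {"index": "enwiki_content_old", "alias": "enwiki_content"}},
        {"add":    {"index": "enwiki_content_new", "alias": "enwiki_content"}}
      ]
    }'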

Positional queries require loading term positions from the postings. See the codecs documentation in Lucene for the other bits of data being loaded.
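
The classic positional query is a phrase query. Something like this sketch (the index and field names are just examples) forces Lucene to read and decode positions for every document matching the terms:

    curl -XGET 'localhost:9200/enwiki_content/_search' -d '{
      "query": {
        "match_phrase": {
          "text": "battle of hastings"
        }
      }
    }'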

Deleted documents are read from disk but don't really have much CPU cost, just IO cost. They also affect the scores. So having too many deleted documents sitting around in the index causes us trouble as well.
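
You can keep an eye on deleted documents with the cat API; docs.deleted per index is the number to watch (the column names here are from the 1.x _cat output):

    # Show live vs. deleted document counts and on-disk size per index.
    curl 'localhost:9200/_cat/indices?v&h=index,docs.count,docs.deleted,store.size'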

The spinning disk incident

It'd be fair for you to ask when this has ever caused us trouble. Well, a little over a year ago we added spinning disks to the Elasticsearch cluster and knocked it out. You see, I didn't know that we had SSDs in the other servers, so the new server with a spinning disk looked just like the old servers to me. Same specs otherwise. Yeah. No. I learned how to sudo hdparm -I /dev/sda real fast.
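
For the record, the check looks something like this (the device name is just an example):

    # Spinning disks report an RPM figure here; SSDs report "Solid State Device".
    sudo hdparm -I /dev/sda | grep -i 'rotation rate'

    # Or ask the kernel: 1 means rotational (spinning), 0 means not (SSD).
    cat /sys/block/sda/queue/rotational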

This was all my fault but it brings up a real problem: when a node is sick it can slow down all queries, and it can slow them down enough to cause a brownout. That is what happened in that incident. The trouble is that all the queries sent to the sick node slow down, taking up resources on the node acting as the client node for each query. Elasticsearch sends those requests asynchronously so they don't tie up room in a thread pool, but the WMF uses a pool counter to prevent overrunning Elasticsearch with thousands of concurrent queries, and the slow queries take up room in that. Ultimately the pool counter tripping is what showed up as a brownout. Without the pool counter I suspect we'd have hit errors anyway, just caused by exhaustion of some other resource.

Do you even timeout?

The usual way to work around issues like this is with timeouts. In a perfect world the client node would time out the request to the data node in, say, 500ms, we'd give the user whatever result we had, and we'd log a nasty message about the bad server. You'd have alerts on those logs and manually or automatically remove the server from the cluster. It's bad to give users degraded results when one of the nodes is sick, but it's better than losing the whole service. This kind of thinking takes you to the fun world of routing algorithms and the tradeoffs between "least load" or "most free executors" and "least recently used" or random routing algorithms. But I'm not knowledgeable enough to get into those, so I'll get back to Elasticsearch.

WMF sets its timeouts very long for two reasons:

  1. Some requests genuinely take longer and we're ok with that. Requests with phrase searches, regular expressions, and wildcards all take much longer than requests without. We could attempt to detect those queries in CirrusSearch but it'd require yet more regexes. So we play it safe and use a longer timeout.
  2. Timeouts in Elasticsearch are best effort since there is no facility in Java to kill a thread or for a thread to set a timeout for itself that will work 100% of the time. So the only way to get a 100% reliable timeout is to add it on the client side. That's suboptimal because the query continues to run, consuming server resources not tracked by the pool counter, and because client side timeouts are incapable of giving partial results. We still set both a client side and an Elasticsearch side timeout; the client side one gives us a 100% reliable cutoff, but it is even longer than the Elasticsearch side timeout to give us a chance at partial results (see the sketch after this list).
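
In practice the two layers look roughly like this (the numbers are illustrative, not our production settings): a best-effort timeout inside the search request itself, plus a hard timeout on the HTTP call that is deliberately longer:

    # The "timeout" in the body is enforced by Elasticsearch on a best-effort
    # basis; if it trips, the response has "timed_out": true plus whatever
    # partial results were collected. curl's --max-time is the hard client-side
    # cutoff, set longer so partial results get a chance to come back.
    curl --max-time 25 -XGET 'localhost:9200/enwiki_content/_search' -d '{
      "timeout": "20s",
      "query": {
        "match": {"text": "some search terms"}
      }
    }'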

If heap exhaustion doesn't get us, IO will

Bad nodes aside, it seems to me that after bugs causing heap exhaustion, IO is our greatest enemy. Once the data is in the disk cache Elasticsearch really flies. Term queries are so fast they never show up. Most of what I see when I take a thread snapshot of a system under load is phrase queries. For us, most queries have a phrase rescore attached to improve relevance. The size of that rescore window is just about the most important performance variable on our cluster. Primarily what that phrase query is doing is reading and decoding term positions.
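
A phrase rescore looks roughly like this (field name, window size, and weights are illustrative, not our exact production query). The phrase query only runs over the top window_size hits from each shard, so that window controls how many term positions get read and decoded:

    curl -XGET 'localhost:9200/enwiki_content/_search' -d '{
      "query": {
        "match": {"text": "battle of hastings"}
      },
      "rescore": {
        "window_size": 512,
        "query": {
          "rescore_query": {
            "match_phrase": {"text": "battle of hastings"}
          },
          "query_weight": 1.0,
          "rescore_query_weight": 2.0
        }
      }
    }'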

Beyond that, when I was performance testing Elasticsearch against the search logs from our old search system I kept finding IO related things to fix. The IO issues we found with the fast-vector-highlighter are why I wrote the experimental highlighter: term vectors are just too expensive. They take up a ton of disk space and loading them is slow.
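
For context, the fast-vector-highlighter only works on fields with term vectors stored in the mapping, which is where that disk space and load cost comes from. A sketch (the index, type, and field names are illustrative):

    # Storing term vectors with positions and offsets bloats the index on disk.
    curl -XPUT 'localhost:9200/enwiki_content' -d '{
      "mappings": {
        "page": {
          "properties": {
            "text": {"type": "string", "term_vector": "with_positions_offsets"}
          }
        }
      }
    }'

    # Highlighting with the fvh then has to load those term vectors at query time.
    curl -XGET 'localhost:9200/enwiki_content/_search' -d '{
      "query": {"match": {"text": "battle of hastings"}},
      "highlight": {"fields": {"text": {"type": "fvh"}}}
    }'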