As useful as Elasticsearch is, sometimes it breaks, and when it breaks you have to know how it's broken before you can fix it. This post talks about two failure modes that can cause catastrophic breakage across multiple servers.

GC death spiral

Elasticsearch runs in the JVM, which has no way of limiting the amount of memory that a single thread can own and makes it difficult to even measure how much memory a thread owns. This is a side effect of its shared memory concurrency model - essentially all data written by one thread is visible to all other threads, modulo some esoteric checkpoints around volatile variables. This gets us into trouble when bugs in Elasticsearch allow queries to take up immense amounts of RAM. In my experience the most likely place you'll hit this is in regex and wildcard queries, and that's mostly been fixed by the max_determinized_states patch which went out in Elasticsearch 1.4.3 and Wikimedia-Extra 0.0.2. Wikimedia-Extra only contains the fix for its source_regex filter. Anyway, that almost certainly hasn't made these queries bulletproof, just less likely to eat lots of memory.
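For reference, this is roughly what leaning on that cap looks like from the query side. The index and field names here are placeholders, and while max_determinized_states is a real regexp query parameter in later releases, I haven't checked exactly which 1.x version first exposed it in the query DSL:

# A deliberately nasty pattern against a placeholder index and field. On a patched
# Elasticsearch the determinized automaton is capped (10000 states is the default
# I remember) and the query is rejected instead of quietly eating RAM.
curl -XPOST 'localhost:9200/enwiki_content/_search?pretty' -d '{
  "query": {
    "regexp": {
      "title": {
        "value": "[ft]o[a-z]*o.*[nt]ion",
        "max_determinized_states": 10000
      }
    }
  }
}'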

I've heard tell that aggregations used to consume troubling amounts of memory, but the Elastic folks worked very hard around Elasticsearch 1.0.0 to push all the work they do into pre-allocated slabs and to kill queries that consume too many of those slabs.
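The thing that kills those queries is a circuit breaker. If I'm remembering the 1.4-era setting name right, the request breaker is dynamic, so you can tighten it on a live cluster while you hunt for the offender - a sketch, not gospel:

# Drop the request breaker from its default (40% of heap, as I recall) to 30%
# until things calm down.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.breaker.request.limit": "30%"
  }
}'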

This isn't a worse problem because most of Lucene's data is held off heap and iterated with classes like TermsEnum. Those abstract away the fancy data structures used by the index and whether or not they are on disk. Lucene does a good job of accounting for memory too - see RamUsageEstimator and Accountable.

You can identify this state a few ways. The easiest is to look at Elasticsearch's logs for WARN messages about long garbage collections. You can get pretty immediate feedback with jstat as well:

sudo -u elasticsearch bash -c 'jstat -gcutil `cat /var/run/elasticsearch.pid` 3s 100'

This is what an Elasticsearch under very low load looks like:

vagrant@mediawiki-vagrant:~$ sudo -u elasticsearch bash -c 'jstat -gcutil `cat /var/run/elasticsearch.pid` 3s 100'
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT   
  0.00 100.00  39.06  30.01  99.68      5    0.264     0    0.000    0.264
  0.00 100.00  39.06  30.01  99.68      5    0.264     0    0.000    0.264
  0.00 100.00  39.21  30.01  99.68      5    0.264     0    0.000    0.264
  0.00 100.00  39.21  30.01  99.68      5    0.264     0    0.000    0.264
  0.00 100.00  40.45  30.01  99.68      5    0.264     0    0.000    0.264
  0.00 100.00  40.45  30.01  99.68      5    0.264     0    0.000    0.264
  0.00 100.00  40.45  30.01  99.68      5    0.264     0    0.000    0.264
  0.00 100.00  40.45  30.01  99.68      5    0.264     0    0.000    0.264
  0.00 100.00  40.59  30.01  99.68      5    0.264     0    0.000    0.264
  0.00 100.00  40.59  30.01  99.68      5    0.264     0    0.000    0.264
  0.00 100.00  40.59  30.01  99.68      5    0.264     0    0.000    0.264
  0.00 100.00  40.59  30.01  99.68      5    0.264     0    0.000    0.264
  0.00 100.00  40.59  30.01  99.68      5    0.264     0    0.000    0.264
  0.00 100.00  40.74  30.01  99.69      5    0.264     0    0.000    0.264
  0.00 100.00  40.92  30.01  99.69      5    0.264     0    0.000    0.264
  0.00 100.00  55.55  30.01  99.75      5    0.264     0    0.000    0.264
  0.00 100.00  93.86  30.01  99.37      5    0.264     0    0.000    0.264
  0.00 100.00  99.70  30.01  99.45      6    0.264     0    0.000    0.264
 77.65   0.00   9.73  33.27  99.51      6    0.295     0    0.000    0.295
 77.65   0.00  14.23  33.27  99.56      6    0.295     0    0.000    0.295
 77.65   0.00  21.31  33.27  99.71      6    0.295     0    0.000    0.295
 77.65   0.00  23.59  33.27  99.71      6    0.295     0    0.000    0.295
 77.65   0.00  25.85  33.27  99.72      6    0.295     0    0.000    0.295
 77.65   0.00  28.16  33.27  99.72      6    0.295     0    0.000    0.295
 77.65   0.00  31.68  33.27  99.73      6    0.295     0    0.000    0.295
 77.65   0.00  40.80  33.27  99.78      6    0.295     0    0.000    0.295
 77.65   0.00  51.56  33.27  99.38      6    0.295     0    0.000    0.295
 77.65   0.00  64.15  33.27  99.40      6    0.295     0    0.000    0.295
 77.65   0.00  81.41  33.27  99.49      6    0.295     0    0.000    0.295
 77.65   0.00  83.52  33.27  99.50      6    0.295     0    0.000    0.295
  0.00  63.53  30.17  33.27  99.53      7    0.304     0    0.000    0.304
  0.00  63.53  36.28  33.27  99.55      7    0.304     0    0.000    0.304
  0.00  63.53  42.20  33.27  99.58      7    0.304     0    0.000    0.304
  0.00  63.53  46.75  33.27  99.60      7    0.304     0    0.000    0.304
  0.00  63.53  58.61  33.27  99.63      7    0.304     0    0.000    0.304
  0.00  63.53  62.99  33.27  99.65      7    0.304     0    0.000    0.304
  0.00  63.53  84.71  33.27  99.67      7    0.304     0    0.000    0.304
 88.50   0.00  20.66  33.27  99.68      8    0.311     0    0.000    0.311
 88.50   0.00  73.06  33.27  99.73      8    0.311     0    0.000    0.311
 88.50   0.00  82.40  33.27  99.76      8    0.311     0    0.000    0.311

And here is an Elasticsearch that I hosed with a bad wildcard query:

vagrant@mediawiki-vagrant:~$ sudo -u elasticsearch bash -c 'jstat -gcutil `cat /var/run/elasticsearch.pid` 3s 100'
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT   
  0.00  54.18  30.79  34.57  99.48      9    0.326     0    0.000    0.326
  0.00  54.18  37.29  34.57  99.50      9    0.326     0    0.000    0.326
  0.00  54.18  45.62  34.57  99.51      9    0.326     0    0.000    0.326
  0.00  54.18  54.77  34.57  99.53      9    0.326     0    0.000    0.326
  0.00  54.18  56.46  34.57  99.53      9    0.326     0    0.000    0.326
  0.00  54.18  65.12  34.57  99.54      9    0.326     0    0.000    0.326
  0.00  54.18  74.89  34.57  99.55      9    0.326     0    0.000    0.326
  0.00  54.18  90.38  34.57  99.58      9    0.326     0    0.000    0.326
 50.46   0.00   5.46  34.57  99.59     10    0.335     0    0.000    0.335
 50.46   0.00  26.70  34.57  99.61     10    0.335     0    0.000    0.335
 50.46   0.00  31.71  34.57  99.63     10    0.335     0    0.000    0.335
 50.46   0.00  83.15  34.57  99.65     10    0.335     0    0.000    0.335
    ---- Sent it a degenerate but not devastating wildcard here ----
 67.93   0.00  48.53  44.54  99.67     16    0.486     0    0.000    0.486
 67.93   0.00  61.87  44.54  99.67     16    0.486     0    0.000    0.486
 67.93   0.00  63.58  44.54  99.67     16    0.486     0    0.000    0.486
 67.93   0.00  76.98  44.54  99.67     16    0.486     0    0.000    0.486
 67.93   0.00  89.32  44.54  99.68     16    0.486     0    0.000    0.486
      ---- Sent it an even worse one, still not bad enough ----
 56.12   0.00  46.31  52.53  99.68     20    0.627     0    0.000    0.627
  0.00  88.85  65.28  45.96  60.00     35    0.950     2    0.146    1.096
  0.00 100.00   0.00  79.67  60.00     51    1.275     3    0.162    1.437
  0.00  54.70  74.03  82.39  59.85     63    1.653     5    0.246    1.899
 71.22   0.00  39.71  77.94  59.86     76    1.998     9    0.324    2.323
  0.00  42.65  73.86  88.48  59.86     87    2.289    13    0.443    2.732
 67.77   0.00  53.12  89.33  59.86     96    2.486    17    0.575    3.061
 67.77   0.00  61.62  89.33  59.86     96    2.486    17    0.575    3.061
 67.77   0.00  62.63  79.61  59.86     96    2.486    18    0.670    3.156
 96.01   0.00 100.00  94.39  59.86    101    2.690    22    0.836    3.526
 82.84   0.00  29.93  55.20  59.66    113    2.811    22    1.284    4.095
 77.26   0.00  57.59  58.67  59.67    125    3.092    24    1.349    4.441
                         ---- Almost! ----
  0.00 100.00  58.03  71.92  59.67    138    3.239    24    1.349    4.588
  0.00 100.00  33.35  69.32  59.67    150    3.501    28    1.453    4.955
100.00   0.00   0.00  82.93  59.67    163    3.712    31    1.512    5.224
  0.00  93.19  52.65  79.21  59.67    174    3.862    35    1.557    5.419
  0.00 100.00 100.00  89.67  59.67    185    4.079    41    1.746    5.825
 89.71   0.00  12.43  86.08  59.67    194    4.185    45    2.369    6.554
 98.72   0.00   1.45  88.68  59.67    203    4.272    50    3.108    7.380
 39.39   0.00  49.60  93.27  59.67    211    4.393    54    4.195    8.588
100.00 100.00 100.00  99.88  59.67    222    4.502    57    4.472    8.974
  0.00   0.00   0.01  96.33  59.67    222    7.076    59    5.014   12.090
  0.00   0.00 100.00  99.65  59.67    227    7.198    63    6.518   13.715
                 ---- And here it's beyond dead. ----
100.00   0.00 100.00 100.00  59.67    228    7.198    68    8.421   15.618
100.00   0.00 100.00 100.00  59.67    228    7.198    72   10.213   17.410
100.00   0.00 100.00 100.00  59.67    228    7.198    77   11.944   19.142
  0.00   0.00  21.06  99.99  59.67    228    7.198    82   14.290   21.488
100.00   0.00 100.00 100.00  59.68    228    7.198    86   15.696   22.893
100.00   0.00 100.00 100.00  59.68    228    7.198    90   17.789   24.987
100.00   0.00 100.00 100.00  59.68    228    7.198    95   19.587   26.784
100.00   0.00 100.00 100.00  59.68    228    7.198   101   21.878   29.076

Now to really understand this you'd have to read a ton about how garbage collection works in Java, but there are a few ways to recognize that it's hosed:

  1. GCT is the number of seconds spent doing garbage collection. When the server is healthy, only the young GCs stop the world. When it's sad, both full and young GCs do. Anyway, if GCT increases by an appreciable percentage of the time between probes (3 seconds here) then you know you are hosed. (There's a rough sketch of automating this check after the list.)

  2. When the O column stays at or near 100 even after a full GC. You know a full GC has happened because the FGC column has counted up. Young GCs don't even try to clean the old generation, so don't get worried about those. Just the full GCs.

  3. When the FGC column increases quickly. In production you expect a full garbage collection every few hours. In a system with less RAM it is going to be more frequent, but it still shouldn't be several times a second.

  4. When three of the first four columns are at 100.0 you are done. It's fine for one or two of them to be at 100.0 so long as they go down after the next GC. When three out of four are at 100.0 then your memory is full.

  5. Ignore the P column. It's almost never a problem no matter what number it has.
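If you'd rather not eyeball the GCT column, here's a rough sketch of the check from the first item, piping the same jstat output through awk. The one second threshold (a third of the probe interval) is my own judgment call:

# Flag any 3 second probe interval where more than a second went to GC.
sudo -u elasticsearch bash -c 'jstat -gcutil `cat /var/run/elasticsearch.pid` 3s 100' |
  awk 'NR > 2 && $NF - prev > 1 { printf "spent %.2fs of the last 3s in GC\n", $NF - prev }
       NR > 1 { prev = $NF }'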

If the node is in this state there is nothing to do but restart it. You'll still want to figure out how it got there in the first place so you can fix the root cause. For that you should look at the query logs. You can get a thread dump and look there, but it's likely to be difficult to tell much from it. You can, and, if you have your wits about you, should get a heap dump. With 30GB heaps they are hard to analyze, but you can at least try.
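For the record, the stock JDK tools will get you both dumps, reusing the pid file from the jstat examples above. The output paths are just placeholders, and on a truly wedged JVM you may need to add -F to jmap:

# Thread dump - cheap, grab a few of them a few seconds apart.
sudo -u elasticsearch bash -c 'jstack `cat /var/run/elasticsearch.pid`' > /tmp/elasticsearch-threads.txt

# Heap dump - slow and pauses the JVM, but the node is already toast anyway.
sudo -u elasticsearch bash -c 'jmap -dump:format=b,file=/tmp/elasticsearch-heap.hprof `cat /var/run/elasticsearch.pid`'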

The best part is that when this happens, it often happens on one node per shard. English Wikipedia is sharded 7 ways on our cluster, so a poison request to enwiki will knock out seven nodes.

Master is deaf and dumb

If the current master node doesn't respond to requests to join the cluster then restarting nodes can, well, leave them out of the cluster and get you into worse trouble. This can happen if your master is also a data or client node and you hit it with something that puts it in a GC death spiral (see above). The way to avoid this is to use dedicated master nodes. The WMF is waiting for something like deb support for this.
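For reference, a dedicated master is just a node that holds no data, plus enough master-eligible nodes that losing one doesn't matter. A minimal elasticsearch.yml sketch, assuming three dedicated masters:

# On the three dedicated master boxes (the count of three is an assumption):
node.master: true
node.data: false

# Everywhere, so a lone wedged master can't split the cluster - a majority of
# master-eligible nodes, so 2 when there are 3 of them:
discovery.zen.minimum_master_nodes: 2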

It's also possible there might be bugs in Elasticsearch around the master's task list, especially in older versions. Versions before 1.6 are known to have their cluster state update thread get clogged up when data nodes are slow to respond to requests and you have a lot of shards. There are synchronous, per shard, per data node messages sent as part of the shard assignment process, so if you have any machines in a GC death spiral you'll have to restart all of them for the master to get unstuck. In 1.6 this has been changed to a synchronous but parallel set of requests, which should make the master more resilient to bad nodes if not totally resilient to them.
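You can at least watch the symptom directly. The pending tasks API shows the master's queue of cluster state updates, and a long queue full of old entries means that thread is wedged:

# A pile of tasks with large time_in_queue values is a bad sign.
curl -s 'localhost:9200/_cluster/pending_tasks?pretty'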

The trouble with this problem is that it's hard to figure out what the cause is. Luckily there is one surefire way to fix it: restart the whole cluster. It's not a good way, but when your cluster is so hosed that half the nodes aren't responding and the masters aren't letting restarted nodes join the cluster, it's the best you are going to get. You do that by stopping Elasticsearch on all the nodes and then starting them. I shut them all down like so:

for server in $(cat servers); do ssh $server sudo puppet-run -disable; done
for server in $(cat servers); do ssh $server sudo /etc/init.d/elasticsearch stop; done

I verify they are all stopped like so:

for server in $(cat servers); do ssh $server ps aux \| grep elastic \| grep java; done

I often do that in another window to make sure that puppet isn't restarting them behind my back (sneaky!). Anyway, then you can start them again like so:

for server in $(cat servers); do ssh $server sudo /etc/init.d/elasticsearch start; done

I'm sure there are fancier things than for loops and ssh but when things are totally sideways you reach for what you know. And this is what I can write on 0 hours of sleep.

The cluster will recover all its primary shards quickly - WMF's come back in two or three minutes. The replica shards will always have to be cloned from a primary shard. That takes time proportional to the number of modification operations performed on the shard since the last time it was copied from that primary. Since that is hard to predict, you can use the amount of time since the last cluster restart (rolling or full) as a rough proxy. Six hours isn't unheard of for the WMF cluster.
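While you wait, the cat APIs give a decent running view of the recovery. A sketch, assuming you can reach some node on port 9200:

# Overall status (red/yellow/green) and shard counts, refreshed every few seconds.
watch -n 5 curl -s 'localhost:9200/_cat/health?v'

# Replicas still copying from their primaries, and how far along they are.
curl -s 'localhost:9200/_cat/recovery?v' | grep -v done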

What to do with all that data that you lost during the full cluster restart

For the most part Elasticsearch won't lose any data during the full cluster restart, but all write operations sent to it while it is down will fail. Hopefully those operations are simply retried - that is the point of wrapping them in a job queue, after all. Assuming that doesn't work you can always rebuild the whole index. For Cirrus you can do that with the forceSearchIndex.php script. It has parameters to do a time range but don't use those - they're too slow. Use the id range parameters. This is how I do it:

screen
mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki $wiki --queue --buildChunks 10000 | sed 's/^php /mwscript extensions\/CirrusSearch\/maintenance\//' | tee rebuild.$wiki.sh
bash rebuild.$wiki.sh

For enwiki that takes about a week, mostly because MediaWiki is slow.