Introduction

I'm trying something new to me: giving a presentation from a blog post instead of slides. My hope is that this will be more useful for folks who don't see the presentation or who just want a reference. So while this is paired with my Wings Stay On presentation at GOTO Amsterdam 2015, you can read it on its own.

So the subtitle of this presentation, "how we replaced search on Wikipedia and didn’t break anything", contains two lies. Never believe someone when they tell you they didn't break anything. Never. The other lie is in the Wikipedia part: we didn't just replace search on Wikipedia. We did it on Wiktionary, Wikidata, Wikimedia Commons, Wikiquote, and Wikipedia AND some more projects that I don't have the energy to list. Most of those have an instance per language so that's about 888 wikis across 265 languages.

Now if you've been internet stalking me you probably are thinking "oh, he's reusing his presentation from Elastic{ON} 2015." and you'd be right up until this point. You can go watch that presentation if you want to get an overview of how we use Elasticsearch. But that slot was 25 minutes and this one is 50, so it'd be unconscionable to try and reuse that presentation.

Instead I'm going to talk about how we went about not breaking things. Since this is the Elasticsearch track I'll go into some of the Elasticsearch features we used but I'm going to be more general and not every point I make is going to be about Elasticsearch.

Why we chose Elasticsearch

Before I go into how we did this, I think it's worth mentioning why we chose Elasticsearch over Solr or hand rolling something backed by Lucene.

Active community of users and developers

Our old search system, lsearchd, used Lucene directly and was written entirely for MediaWiki and relied on some arcane, poorly documented MediaWiki extension. There were a couple of folks that had set it up outside of the Foundation, but very few. There wasn't an active community of developers either.

When I was first evaluating Elasticsearch I had a bash script that did something silly and it uncovered a bug in Elasticsearch. I filed it. Someone at Elastic confirmed it. I sent a pull request. And we talked about it. I modified some stuff. And then they merged it. It was painless and it convinced me that the developers were good to work with.

And there is obviously a community of users. I see them on discuss.elastic.co and I've met a few of you in person at meetups and conventions.

Automatic shard allocation

We were quite enamoured with automatic shard allocation. We liked that the cluster would just take care of rebuilding itself if we removed a node and that we didn't have to pay attention to where the index ends up. The allocation algorithms aren't perfect and I've personally become pretty familiar with how they work and made pull requests against them, but having automatic shard allocation is a huge peace of mind. It just works most of the time and when it doesn't there are workarounds. By comparison, we'd forgotten how to move shards from server to server using the old search system, so doing it would have been an exercise in digital archaeology, much less doing it without downtime.
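
As an example of the kind of workaround I mean, telling the cluster to drain all the shards off of a node before you take it down is a single settings call. This is the standard allocation exclusion setting; the IP address is obviously made up:

# Tell the cluster to move all shards away from the node at this IP.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "10.0.0.42"
  }
}'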

Cute, useful tools

This is kind of a backwards reason, in that we didn't know it at the time, but Elasticsearch has a number of cute, useful tools that you really want to have when working with Lucene. The _analyze api is a great example. So are the _cat apis. You might not realize that you want them until you have them and you'll wonder how you lived without them.

None of these are rocket science to build yourself on top of Lucene but you don't have to build them! They already exist.
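
If you haven't played with them, here's roughly what they look like from curl:

# See exactly how an analyzer tokenizes some text.
curl -XGET 'localhost:9200/_analyze?analyzer=english&pretty' -d 'The wings stay on'

# Get a quick, human readable view of cluster health and shard placement.
curl -XGET 'localhost:9200/_cat/health?v'
curl -XGET 'localhost:9200/_cat/shards?v'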

API for setting up mapping and analysis

At the time we were evaluating tools, Solr didn't have an API for setting up the mapping. You had to copy files to machines. Elasticsearch had a nice API for it.
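
As a sketch of what that looks like - the index, type, and field names here are made up for illustration, not our real mapping:

# Create an index with analysis settings and a mapping in a single API call.
curl -XPUT 'localhost:9200/demo_wiki' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_en": { "type": "english" }
      }
    }
  },
  "mappings": {
    "page": {
      "properties": {
        "title": { "type": "string", "analyzer": "text_en" },
        "text":  { "type": "string", "analyzer": "text_en" }
      }
    }
  }
}'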

And then we figured out that there was an HTTP GET you could send Solr to delete an index. I pasted the link to my buddy Chad in IRC and he nuked his development index when his IRC client tried to give him a preview of the page. And we walked away from Solr. I'm sure that GET isn't still in Solr but that's the story.

Slow and careful

Alright, let's get back to how we didn't break many things. One of our primary techniques was to be slow and careful.

No forking Elasticsearch

We ended up sending a decent handful of patches to Elasticsearch and Lucene, and we could have gotten to production faster by forking Elasticsearch, but we chose to wait for releases, mostly because we had a two person development team and maintaining backports was way more work than we were willing to do. Now that we're a somewhat larger team it'd be possible, but we still really appreciate the review process. We've never had any code merged into Elasticsearch without some changes suggested by reviewers anyway. Much of our work is in plugins these days and we've done a few backports to those, so this is less important.

I should mention here that Elasticsearch 1.6's synced flush should allow us to do weekly deploys of those plugins once we figure out how to pause our update stream. We're super excited for that because our current restart process takes three days.
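
For the curious, triggering a synced flush is a single call against the whole cluster; this is the 1.6 API:

# Write a sync id to every idle shard so replicas can recover quickly after a restart.
curl -XPOST 'localhost:9200/_flush/synced?pretty'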

Brave, brave beta communities

A few communities were really excited by the new search and opted in long before everyone else. Much thanks to the Italian and Catalan Wikipedias. They dealt with us learning Elasticsearch and gave us tons of good feedback.

We also deployed CirrusSearch to mediawiki.org very early in the process. Its primary users are MediaWiki developers and administrators and they are very willing to complain. It was great.

BetaFeatures

Soon after those initial deploys some other folks at the Foundation built and installed the BetaFeatures extension, which lets users opt in to individual features. They built it for VisualEditor or Hovercards or something. Anyway, we let users opt in to the replacement search. We had to be careful about which wikis we allowed users to opt in on because we didn't have many servers at first, nowhere near enough to even index all the wikis.

This was hugely successful. Our first users came in this way and we had thousands of people do this over a year. We fixed so so so many issues this way. This process, and the year of soak time, was critical in making sure we supported the myriad workflows used by Editors and WikiGnomes.

Let me stop here and define some terms. Editors are folks that edit the wiki. It's a general term. WikiGnome is a sillier, more specific term for editors that like to make small quality fixes like switching direct links to archive sites or updating the wiki pages of a piece of software when a new version is released. My favorite WikiGnomes are the folks that correct spelling errors. These folks often keep a stable of searches and fire them off every once in a while to find new instances of their spelling error. These folks are super sensitive to stemming. And to transclusion.

Let's take another detour and talk a bit about how MediaWiki works so I can define transclusion. Wiki pages are built using markup text and templates. For example, this is some text from the Barack Obama article:

constitutional scholar [[Laurence Tribe]] while at Harvard for two
years.<ref>{{YouTube|wzmmBZ7i4BQ}}</ref>

Those {{ and }} are the template markers, YouTube is the template name, and wzmmBZ7i4BQ is the template parameter. Here is the template source. Templates can be very complex - even written in Lua. But that doesn't matter for our purposes.

Wiki by wiki rollout

I'll get back to transclusion in a minute but let's unroll the stack back to ways we were intentionally slow and talk about wiki by wiki rollout. In a sense this is the obvious way to do it because we had to index the wikis and we could only allow users to opt in to the new search once the wiki had been indexed for the first time.

At some point we decided that the opt-in phase was done. We hadn't heard feedback from the community about issues for a while and it was time to switch everyone over and see if they uncovered new issues. And we did that on a wiki by wiki basis. For a while we'd cut over fifty or sixty a week. Then we'd do just one or two a day. For the last six we did one on Tuesday, the next on Thursday, and so on. This helped us estimate load from the cutover.

What we learned from our beta users

insource

When we started replacing search one of the design goals was that normal search would search the rendered version of the page rather than the wikitext. Within days of deploying it to mediawiki.org people complained that they still wanted a way to search in wikitext. So we implemented the insource operator to search in the page's source rather than the page's rendered text. To do this, of course, you have to index both.
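
Conceptually that means every page gets indexed with both versions of its content. Something roughly like this - the field names are illustrative, not necessarily our exact mapping:

# Index a document with both the rendered text and the raw wikitext.
curl -XPUT 'localhost:9200/demo_wiki/page/12345' -d '{
  "title": "Barack Obama",
  "text": "constitutional scholar Laurence Tribe while at Harvard for two years.",
  "source_text": "constitutional scholar [[Laurence Tribe]] while at Harvard for two years.<ref>{{YouTube|wzmmBZ7i4BQ}}</ref>"
}'

The insource operator queries the wikitext field and normal search hits the rendered text.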

PascalCase and camelCase and snake_case and CONSTANT_CASE

Some of the mediawiki.org users complained that they used to be able to find $wgNamespaceAliases when they searched for namespaceAlias. This used to work because the page for $wgNamespaceAliases contains this:

{{ {{TNTN|SettingSummary}}
|name=NamespaceAliases
|version_min=1.10.0

That's right! They were finding a template parameter. Rather than make some special template parameters searchable we just made the analyzer segment on case changes and underscores. We only did it for English, sadly.
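
I won't claim this is exactly our analysis chain, but the standard way to get that kind of splitting is the word_delimiter token filter. A minimal sketch:

# An analyzer that splits tokens on case changes and underscores, keeping the original too.
curl -XPUT 'localhost:9200/demo_wiki' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "split_words": {
          "type": "word_delimiter",
          "split_on_case_change": true,
          "preserve_original": true
        }
      },
      "analyzer": {
        "code_aware_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["split_words", "lowercase", "kstem"]
        }
      }
    }
  }
}'

With that, $wgNamespaceAliases produces tokens like wg, namespace, and aliases (plus the original), so a search for namespaceAlias can stem its way to a match.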

Hyphens

The old search system differentiated between foo-bar and foo bar. When you searched for foo-bar you'd only get foo-bar and never foo bar, but when you searched for foo bar you'd get both. This behavior is really useful if you want to replace foo-bar with foo bar but otherwise not super useful. And implementing it would have required an extension and a custom analyzer, and when this came up we didn't have the capability to deploy extensions. Remember that first lie I said I'd get back to? Well, we broke this behavior. We ended up implementing a brute force regex search instead. And it was slow and horrible; pretty much unusable. But in the meantime we'd figured out how to deploy plugins. So we wrote a plugin to accelerate regex search. That was a fun trip. It's still not perfect - and still kind of slow, but way way way faster. Like, actually usable faster. Besides, people had been asking for regex search for years.
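
If you're curious what that looks like from the search box, regex searches ride on the insource syntax, so the replace-foo-bar workflow becomes something like:

insource:/foo-bar/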

Testing

Integration testing

The other side of all this discovery is making sure that we don't break the features we've already implemented. For the most part testing our search comes in six flavors like this:

                        Unit         Integration
CirrusSearch            Very little  Lots
Elasticsearch Plugins   Some         Lots
Elasticsearch           Lots         Tons

Every row's integration tests rely on features from all lower rows. So CirrusSearch's integration tests rely on Elasticsearch plugin features and Elasticsearch features. For a concrete example, look at the CirrusSearch tests for finding unicode variants or its highlighting tests.

For the most part we concentrate on integration testing driven by a browser or an api client because:

  1. We want to be able to upgrade Elasticsearch and plugins with confidence. We couldn't do that if we mocked Elasticsearch out of the testing.
  2. Unit testing MediaWiki Extensions is hard. There is a lot of setup code to run and lots to learn.
  3. I didn't know how to easily run a single unit test with the right bits of MediaWiki infrastructure all stood up. I literally didn't know the magic unix invocation. Now we have a makefile to make it easy for me.
  4. Maybe they'll be useful if we ever have to replace search. Again. This time around would have been a ton easier if the old search had had integration tests.

Anyway, that first point is probably valid for any project and I'm a firm believer in it. Of course I don't think you should mock the database when you test a DAO or Repository object, but I'm just picky that way.

Load testing

We wrote a dead simple, probably incorrect python script to replay the logs from the old search system against the new one. With sampling rates and stuff. It worked well. When we first started it we could handle about 20% of the load that we needed. Horrible! We used two techniques to figure out what was up:

  1. Run the hot_threads api manually and automatically. We have a script that runs it every once in a while and logs the results (there's a sketch of this after the list). This works ok. The problem is that Elasticsearch figures out which threads are hot and then collects the stack traces. For our queries that means that the stack traces will frequently be for threads that were hot but are now inactive.
  2. Just hit it with jstack and redirect the resulting stack trace to a file. SIGQUIT will dump the stack trace to stdout but we want them bundled. Anyway, we wrote a simple bash/grep thing to attempt to classify the threads.
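
The scripts themselves are tiny. Ours are a bit fancier, but the idea is roughly this:

# Poll hot_threads every 30 seconds and keep the results, timestamped.
while true; do
  date >> hot_threads.log
  curl -s 'localhost:9200/_nodes/hot_threads' >> hot_threads.log
  sleep 30
done

# The jstack variant: dump all the stacks from the Elasticsearch process to a file.
jstack $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) > stacks.txt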

From there we found the hotspots: the phrase suggester, the fast vector highlighter, or an mvel bug, for example. Some we fixed on our own, some we upstreamed, and some we had to fix on both sides. For the most part when we hit a nasty performance issue we'd just send a patch and the Elastic team was wonderful at reviewing and merging them. I'm pretty sure we did wait for Elasticsearch to fix some of our issues but I remember the ones we fixed better because I stared at that code.

And yes, we did that load testing against production because we didn't have non-production hardware at the right scale. We were careful not to overwhelm it and make things slow and horrible for beta users. You can do that too, but be very very careful.
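
For reference, the replay idea really is as dumb as it sounds. The real script is Python and our log format is different, but in shell the spirit of it is:

# Replay logged queries against the new cluster at roughly a 10% sample rate.
while read -r query; do
  if [ $((RANDOM % 10)) -eq 0 ]; then
    curl -s -o /dev/null -w '%{time_total}\n' \
      -G 'localhost:9200/demo_wiki/_search' --data-urlencode "q=$query"
  fi
done < old-search-queries.log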

MediaWiki-Vagrant

MediaWiki-Vagrant is super important for us! It's the simplest way to get a very production-like environment up and running in development. Getting up and running to work on our search is as simple as installing VirtualBox and Vagrant and then:

git clone --recursive https://gerrit.wikimedia.org/r/mediawiki/vagrant
cd vagrant
./setup.sh  # or setup.bat on Windows
vagrant up
... wait wait wait...
vagrant enable-role cirrussearch
vagrant provision
... wait wait wait...

That'll download an Ubuntu VM, spin it up, clone our code, set up Apache, install MySQL, install Redis, install Elasticsearch, install some Elasticsearch plugins, create a couple of wikis, and index them. From there you can run the integration tests or import parts of a wiki or just create pages.

Plugins

I alluded to plugins earlier and now I'd like to brag about them for just a minute. Please humor me.

We've started to do plugin development because:

  1. We don't have to wait for Elastic to review, merge, and release the changes. We can do all that!
  2. Some things just don't belong in Elasticsearch's core.
  3. We can backport some Elasticsearch or Lucene changes in a plugin and get them now now now.

We have three plugins:

  1. experimental highlighter - a nice highlighter that is super pluggable
  2. wikimedia extra - a grab bag of neat stuff like trigram accelerated regex search, some backports, native scripts, and a safer query wrapper.
  3. swift repository - save backups to the shared filesystem we already have deployed

We're pretty proud of them! Each one represented a significant milestone for us - our highlighter got us speed, the extra plugin's accelerated regexes got us acceptance from more editors, and the swift repository is going to get us backups soon. We couldn't have succeeded without being able to make plugins.

We use a few Elastic-provided plugins as well - mostly ICU. Still super useful.

What's been a pain

Jobs

MediaWiki has a job queue system and having a job queue is really really nice when working with Elasticsearch. You can retry jobs if you break things. You can pause them for a rolling restart. Nice stuff. But they add complexity when testing - instead of the one second refresh time you have to wait for the jobs to finish and then wait for the refresh. Ugh!

Some bugs can bring down many nodes

Because we send every request to each shard, a poison query poisons many nodes. Now, obviously we should fix any bugs that let users run poison queries. But it'd be nice if poison queries didn't knock out 7 of our 31 nodes.

Installing plugins

We don't allow our production servers to access the internet and you shouldn't either. That means that we can't use the plugin installer. We have to use our own hand rolled process. Compared to just installing the deb packages this is a pain. And we still haven't streamlined the process very well, but that is our fault.

Wrap up

So to summarize, we owe our reasonable degree of success to:

  1. Beta testers who quickly pointed out problems and put up with less than ideal search for a while.
  2. Phased rollout so we didn't overwhelm the cluster and to limit the number of critical bugs filed by folks we broke who didn't beta test.
  3. Canary deployments to mediawiki.org.
  4. Lots of integration testing to keep from breaking things we've already decided are important.
  5. Load testing to find hot spots.
  6. Vagrant for helping us to build prod-like environments.
  7. The plugin infrastructure so we can quickly fix important issues without forking Elasticsearch.
  8. The Elastic team for making wonderful software and working with us to make it better.