Understanding and tuning your Solr caches

Mark, Founder & Expert Button Pusher of Teaspoon Consulting

To obtain maximum query performance, Solr stores several different pieces of information using in-memory caches. Result sets, filters and document fields are all cached so that subsequent, similar searches can be handled quickly.

The caches aren't magic, however, so it's important to understand what they do, and to tune them to suit your levels of search activity. This article describes the different caches, looks at the amount of memory they'll typically use, and discusses how to know whether you've sized them correctly.

Search your Solr index for velvet pants and you might find it takes tens or hundreds of milliseconds to get results back to you. Try the same search again and get the same results in only a few milliseconds. What you're seeing here is caching: both at the operating system level (as the operating system caches the blocks of the Solr index that you just hit), and also within Solr itself.

The types of caches

Solr caches several different types of information to ensure that similar queries don't repeat work unnecessarily. There are three major caches:

- The query cache, which stores the document IDs matching recently seen queries.
- The filter cache, which stores the document IDs matched by each filter.
- The document cache, which stores the stored fields of recently fetched documents.

In the following sections we'll look at these different caches and the amounts of memory that each is likely to use. Then we'll look at some tools you can use to determine whether these caches are sized correctly.

The query cache

Solr (well, Lucene, technically) handles a search for velvet pants in the following way:

- Look up the term velvet in the inverted index, producing the set of IDs of documents containing that term.
- Do the same for the term pants.
- Combine the two sets into a single set of matching document IDs, which becomes the result set.

As queries become more complex, Solr has to do more work. If you search for velvet pants propellerheads -category:apparel, Solr must now find four sets of document IDs, combine the first three into a single set, then subtract any document IDs that matched the last (negated) term.
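The set arithmetic described above can be sketched in a few lines of Python. The per-term document ID sets here are invented for illustration; real Lucene posting lists are compressed on-disk structures, not in-memory sets.

```python
# A toy model of how a query's result set is assembled from per-term
# document ID sets. All of the IDs below are made up for illustration.
postings = {
    "velvet": {1, 4, 7, 9},
    "pants": {2, 4, 7, 12},
    "propellerheads": {4, 7, 30},
    "category:apparel": {4, 9, 12},
}

# velvet pants propellerheads -category:apparel
# (treating the three positive terms as required, for simplicity)
result = postings["velvet"] & postings["pants"] & postings["propellerheads"]
result -= postings["category:apparel"]

print(sorted(result))  # document 7 matches all three terms and isn't apparel
```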

Thanks to Zipf's Law, it's often the case that certain queries are performed over and over again, with a long tail of queries that will only be seen once. Look at a log of searches performed on a popular website and you will probably find that the most popular search query is seen twice as frequently as the next-most-popular query. Follow the list of queries down and you will find it quickly trails off into obscurity, with the vast majority of queries appearing only once.

For those often-repeated queries, Solr's query cache can make a big difference. Each time it calculates a set of matching document IDs, Solr stores the result in its query cache. If another query for velvet pants propellerheads -category:apparel comes along moments later, Solr can smugly pull the answer from its cache and do hardly any work at all.

The query cache isn't just for popular queries, though. Interfaces offering paginated search results benefit greatly from the query cache too: from Solr's point of view, a user clicking "Next page" repeatedly is very much like a user performing the same query over and over again, and the query cache can reduce the work required to fetch subsequent pages.
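The core idea of the query cache can be sketched as a small least-recently-used map from query strings to document ID sets. This is a toy model, not Solr's actual implementation; the entry limit and eviction policy shown here are illustrative assumptions.

```python
from collections import OrderedDict

class QueryCache:
    """A toy LRU cache mapping query strings to sets of document IDs."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self.entries = OrderedDict()

    def get(self, query):
        if query in self.entries:
            self.entries.move_to_end(query)  # mark as recently used
            return self.entries[query]
        return None  # cache miss: Solr would now compute the result set

    def put(self, query, doc_ids):
        self.entries[query] = doc_ids
        self.entries.move_to_end(query)
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict least recently used

cache = QueryCache(max_entries=2)
cache.put("velvet pants", {4, 7})
cache.put("corduroy jacket", {9})
cache.get("velvet pants")          # hit: "velvet pants" is now most recent
cache.put("silk scarf", {12})      # over capacity: evicts "corduroy jacket"
print(cache.get("corduroy jacket"))  # None: it was evicted
```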

Under the hood, Solr stores its query cache as arrays of integers, so each cache entry will be (roughly) the number of bytes for the query string itself, plus 8 bytes per document in the result set. To give a very back-of-the-napkin example: if a typical query matches 10,000 documents, each cache entry will weigh in at around 8 × 10,000 bytes, or about 80KB.

If you configure Solr's query cache to hold up to 1,000 entries, that's about 80MB of memory required to store that cache.
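The arithmetic behind that estimate, assuming a typical result set of 10,000 documents per cached query:

```python
# Back-of-the-napkin query cache sizing. The 10,000-document result set
# is an assumed "typical" query; adjust for your own workload.
bytes_per_doc_id = 8
docs_per_result = 10_000
cache_entries = 1_000

bytes_per_entry = bytes_per_doc_id * docs_per_result   # ~80KB per entry
total_bytes = bytes_per_entry * cache_entries          # ~80MB for the cache

print(f"{total_bytes / 1_000_000:.0f}MB")  # prints "80MB"
```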

The filter cache

The filter cache is a close friend of the query cache. Filters provide a way of limiting search results to a subset of documents. For example, a Solr query like:

 q=velvet pants&fq=category:apparel
 # "fq" stands for "filter query", and tells Solr to build a filter.

will return documents matching the query velvet pants, but only if they have a category field containing apparel.

You might think that you could just as well write:

 q=velvet pants AND category:apparel

and you would mostly be right. But there are a couple of benefits to using a filter instead:

- A filter doesn't affect relevance scoring: it only includes or excludes documents, leaving the ranking to be determined by the main query alone.
- The set of document IDs a filter produces can be cached and cheaply re-used across many different queries.

Conceptually, you can think of a filter as a giant set of document IDs. To handle a filtered query, Solr does the query as normal to produce its set of document IDs, then intersects that result with the set of document IDs belonging to the filter. Any document in the result set but not in the filter gets discarded, and what's left is our filtered search result.
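In Python terms, applying a filter is just another set intersection. The document IDs below are invented for illustration:

```python
# The query's matches and the filter's matches, as sets of document IDs.
# Both sets are made up for this example.
query_matches = {2, 4, 7, 12, 19}   # documents matching "velvet pants"
apparel_filter = {1, 4, 7, 9, 12}   # documents with category:apparel

# Anything in the result set but not in the filter gets discarded.
filtered_result = query_matches & apparel_filter
print(sorted(filtered_result))  # [4, 7, 12]
```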

Notice that once you've got your set of document IDs for category:apparel, you can use that set again and again to handle different queries with the same filter. If a user performs three searches within the "apparel" category:

 q=velvet pants&fq=category:apparel
 q=corduroy jacket&fq=category:apparel
 q=silk scarf&fq=category:apparel

then the filter can be applied to the last two searches at virtually no cost. This makes filters ideal for cases where you know users will search within predictable subsets of the collection, such as:

- product categories in an online store
- access restrictions based on the current user's permissions
- document types or languages in a mixed collection

This re-use of filters is supported by the filter cache: once Solr has built the set of document IDs required for a filter, it stores it in the filter cache and re-uses it where possible.

Like the query cache, the memory use of the filter cache is potentially quite large. Solr represents the document IDs in a filter as a bit-string containing one bit per document in your index. If your index contains one million documents, each filter will require one million bits of memory—around 125KB. For a filter cache sized to hold 1,000 cache entries, that's in the area of 120MB.
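The filter cache arithmetic, sketched in Python (the one-million-document index is an assumed size for the example):

```python
# Filter cache sizing: one bit per document in the index, per cached filter.
num_docs = 1_000_000   # documents in the index (assumed for this example)
cache_entries = 1_000

bytes_per_filter = num_docs // 8                        # 125,000 bytes per filter
total_mb = bytes_per_filter * cache_entries / (1024 * 1024)

print(f"each filter: ~{bytes_per_filter / 1024:.0f}KB, "
      f"cache total: ~{total_mb:.0f}MB")
```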

The document cache

The final cache of interest is the document cache. When you query Solr, you don't just want document IDs: you want titles, product names, authors, descriptions, or any other number of descriptive fields. During indexing, we ask Solr to keep these bits of information as "stored fields" on each document, allowing us to get them back in our search results.

When Solr sends back search results, it sends along the requested stored fields for each document. To get these, it must separately read them from the index, querying the on-disk data structures to find the stored fields corresponding to each document ID. This is likely to be slow compared with reading the same data from memory.

If certain documents are requested frequently, Solr can save itself a lot of trouble by keeping their stored fields in memory. That's the role of the document cache—to hold the stored fields of commonly accessed documents. Generally speaking, the document cache isn't as performance critical as the other two caches seen so far. It's unusual for the same document to be fetched multiple times, so the hit rates on the document cache are often quite low.

If you have many stored fields, or large stored values, then you will probably want to keep your document cache relatively small, as this sort of data can consume a large amount of memory.
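A quick sketch of why large stored fields matter for document cache sizing. The field names and average sizes below are assumptions chosen for illustration:

```python
# Rough document cache sizing: the per-entry cost is the combined size of
# each document's stored fields. All field sizes here are assumed averages.
stored_fields = {
    "title": 100,         # bytes
    "author": 50,
    "description": 2_000,
    "full_text": 50_000,  # storing large fields is where memory goes fast
}

bytes_per_doc = sum(stored_fields.values())
cache_entries = 1_000
total_mb = bytes_per_doc * cache_entries / 1_000_000

print(f"~{total_mb:.0f}MB")  # a 1,000-entry cache already costs ~52MB here
```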

Tuning Solr's caches

A warning!

Newcomers to Solr are often tempted to make the caches larger than they need to be. After all, if your machine has lots of memory, why not assign most of it to Solr and make the caches massive? Doing this can actually hurt performance rather than helping:

- Memory assigned to Solr's JVM heap is memory the operating system can't use for its own disk cache, which Solr relies on for fast index access.
- Larger heaps mean longer garbage collection pauses.
- Oversized caches take longer to warm whenever a new searcher is opened.

If in doubt, favour caching less instead of more, and only increase your caches when you have a demonstrated need to do so.

Measuring cache effectiveness

The trick, then, is to work out whether your caches are paying for themselves. The best place to look for this information is Solr's web interface, which provides broad statistics on the different caches discussed so far. The particulars may vary depending on your version of Solr, but in Solr 4 you will find statistics by browsing to your Solr URL, then clicking Core Selector → [core name] → Plugins / Stats.

Here you will find sections for each of the caches described above, plus a few extras.

Each set of statistics has a number of different metrics. To determine the effectiveness of a cache, the most interesting figures are:

- lookups: the number of times the cache was consulted
- hits: the number of lookups that found an entry in the cache
- hitratio: the proportion of lookups that were hits
- evictions: the number of entries thrown out to make room for new ones
- size: the number of entries currently held in the cache
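The hit ratio reported by Solr is simply hits divided by lookups. The numbers below are invented for illustration:

```python
# Computing a cache hit ratio from Solr's reported statistics.
# Both figures here are made-up example values.
lookups = 12_500
hits = 9_875

hitratio = hits / lookups
print(f"{hitratio:.2f}")  # prints "0.79": 79% of lookups avoided real work
```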

The ultimate measure of a cache's performance is its hit ratio. You will need to experiment to find your optimal cache sizes, but keep an eye on your hit ratios to make sure you're making things better (not worse). Some tips:

- A low hit ratio combined with a high eviction count suggests the cache is too small: entries are being thrown away before they can be re-used.
- A low hit ratio with few evictions suggests your query patterns simply don't repeat; a bigger cache won't help, and you may as well shrink it.
- A high hit ratio with no evictions may mean the cache is larger than it needs to be.

It's not an exact science, but with a little experimentation and attention to detail you can make big improvements to your overall performance. Happy searching!