Understanding and tuning your Solr caches

Mark, Founder & Expert Button Pusher of Teaspoon Consulting

To obtain maximum query performance, Solr stores several different pieces of information using in-memory caches. Result sets, filters and document fields are all cached so that subsequent, similar searches can be handled quickly.

The caches aren't magic, however, so it's important to understand what they do, and to tune them to suit your levels of search activity. This article describes the different caches, looks at the amount of memory they'll typically use, and discusses how to know whether you've sized them correctly.

Search your Solr index for velvet pants and you might find it takes tens or hundreds of milliseconds to get results back to you. Try the same search again and get the same results in only a few milliseconds. What you're seeing here is caching: both at the operating system level (as the operating system caches the blocks of the Solr index that you just hit), and also within Solr itself.

The types of caches

Solr caches several different types of information to ensure that similar queries don't repeat work unnecessarily. There are three major caches:

- The query cache, which stores the document IDs matching recently seen queries.
- The filter cache, which stores the document IDs matched by each filter.
- The document cache, which stores the stored fields of recently fetched documents.

In the following sections we'll look at these different caches and the amounts of memory that each is likely to use. Then we'll look at some tools you can use to determine whether these caches are sized correctly.

The query cache

Solr (well, Lucene, technically) handles a search for velvet pants in the following way:

- Look up the term velvet in the inverted index, producing the set of IDs of documents containing that term.
- Do the same for the term pants.
- Combine the two sets into a single set of matching document IDs, which becomes the result set.

As queries become more complex, Solr has to do more work. If you search for velvet pants propellerheads -category:apparel, Solr must now find four sets of document IDs, combine the first three into a single set, then subtract any document IDs that matched the last (negated) term.
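The set arithmetic described above can be sketched in a few lines of Python. The per-term document ID sets here are invented for illustration; real Lucene posting lists are compressed on-disk structures, not in-memory sets.

```python
# A toy model of how a query's result set is assembled from per-term
# document ID sets. All of the IDs below are made up for illustration.
postings = {
    "velvet": {1, 4, 7, 9},
    "pants": {2, 4, 7, 12},
    "propellerheads": {4, 7, 30},
    "category:apparel": {4, 9, 12},
}

# velvet pants propellerheads -category:apparel
# (treating the three positive terms as required, for simplicity)
result = postings["velvet"] & postings["pants"] & postings["propellerheads"]
result -= postings["category:apparel"]

print(sorted(result))  # document 7 matches all three terms and isn't apparel
```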

Thanks to Zipf's Law, it's often the case that certain queries are performed over and over again, with a long tail of queries that will only be seen once. Look at a log of searches performed on a popular website and you will probably find that the most popular search query is seen twice as frequently as the next-most-popular query. Follow the list of queries down and you will find it quickly trails off into obscurity, with the vast majority of queries appearing only once.

For those often-repeated queries, Solr's query cache can make a big difference. Each time it calculates a set of matching document IDs, Solr stores the result in its query cache. If another query for velvet pants propellerheads -category:apparel comes along moments later, Solr can smugly pull the answer from its cache and do hardly any work at all.

The query cache isn't just for popular queries, though. Interfaces offering paginated search results benefit greatly from the query cache too: from Solr's point of view, a user clicking "Next page" repeatedly is very much like a user performing the same query over and over again, and the query cache can reduce the work required to fetch subsequent pages.
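The core idea of the query cache can be sketched as a small least-recently-used map from query strings to document ID sets. This is a toy model, not Solr's actual implementation; the entry limit and eviction policy shown here are illustrative assumptions.

```python
from collections import OrderedDict

class QueryCache:
    """A toy LRU cache mapping query strings to sets of document IDs."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self.entries = OrderedDict()

    def get(self, query):
        if query in self.entries:
            self.entries.move_to_end(query)  # mark as recently used
            return self.entries[query]
        return None  # cache miss: Solr would now compute the result set

    def put(self, query, doc_ids):
        self.entries[query] = doc_ids
        self.entries.move_to_end(query)
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict least recently used

cache = QueryCache(max_entries=2)
cache.put("velvet pants", {4, 7})
cache.put("corduroy jacket", {9})
cache.get("velvet pants")          # hit: "velvet pants" is now most recent
cache.put("silk scarf", {12})      # over capacity: evicts "corduroy jacket"
print(cache.get("corduroy jacket"))  # None: it was evicted
```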

Under the hood, Solr stores its query cache as arrays of integers, so each cache entry will be (roughly) the number of bytes for the query string itself, plus 8 bytes per document in the result set. To give a very back-of-the-napkin example: if a typical query matches 10,000 documents, each cache entry will weigh in at around 8 × 10,000 bytes, or about 80KB.

If you configure Solr's query cache to hold up to 1,000 entries, that's about 80MB of memory required to store that cache.
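The arithmetic behind that estimate, assuming a typical result set of 10,000 documents per cached query:

```python
# Back-of-the-napkin query cache sizing. The 10,000-document result set
# is an assumed "typical" query; adjust for your own workload.
bytes_per_doc_id = 8
docs_per_result = 10_000
cache_entries = 1_000

bytes_per_entry = bytes_per_doc_id * docs_per_result   # ~80KB per entry
total_bytes = bytes_per_entry * cache_entries          # ~80MB for the cache

print(f"{total_bytes / 1_000_000:.0f}MB")  # prints "80MB"
```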

The filter cache

The filter cache is a close friend of the query cache. Filters provide a way of limiting search results to a subset of documents. For example, a Solr query like:

 q=velvet pants&fq=category:apparel
 # "fq" stands for "filter query", and tells Solr to build a filter.

will return documents matching the query velvet pants, but only if they have a category field containing apparel.

You might think that you could just as well write:

 q=velvet pants AND category:apparel

and you would mostly be right. But there are a couple of benefits to using a filter instead:

- A filter doesn't affect relevance scoring: it only includes or excludes documents, leaving the ranking to be determined by the main query alone.
- The set of document IDs a filter produces can be cached and cheaply re-used across many different queries.

Conceptually, you can think of a filter as a giant set of document IDs. To handle a filtered query, Solr does the query as normal to produce its set of document IDs, then intersects that result with the set of document IDs belonging to the filter. Any document in the result set but not in the filter gets discarded, and what's left is our filtered search result.
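In Python terms, applying a filter is just another set intersection. The document IDs below are invented for illustration:

```python
# The query's matches and the filter's matches, as sets of document IDs.
# Both sets are made up for this example.
query_matches = {2, 4, 7, 12, 19}   # documents matching "velvet pants"
apparel_filter = {1, 4, 7, 9, 12}   # documents with category:apparel

# Anything in the result set but not in the filter gets discarded.
filtered_result = query_matches & apparel_filter
print(sorted(filtered_result))  # [4, 7, 12]
```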

Notice that once you've got your set of document IDs for category:apparel, you can use that set again and again to handle different queries with the same filter. If a user performs three searches within the "apparel" category:

 q=velvet pants&fq=category:apparel
 q=corduroy jacket&fq=category:apparel
 q=silk scarf&fq=category:apparel

then the filter can be applied to the last two searches at virtually no cost. This makes filters ideal for cases where you know users will search within predictable subsets of the collection, such as:

- product categories in an online store
- access restrictions based on the current user's permissions
- document types or languages in a mixed collection

This re-use of filters is supported by the filter cache: once Solr has built the set of document IDs required for a filter, it stores it in the filter cache and re-uses it where possible.

Like the query cache, the memory use of the filter cache is potentially quite large. Solr represents the document IDs in a filter as a bit-string containing one bit per document in your index. If your index contains one million documents, each filter will require one million bits of memory—around 125KB. For a filter cache sized to hold 1,000 cache entries, that's in the area of 120MB.
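The filter cache arithmetic, sketched in Python (the one-million-document index is an assumed size for the example):

```python
# Filter cache sizing: one bit per document in the index, per cached filter.
num_docs = 1_000_000   # documents in the index (assumed for this example)
cache_entries = 1_000

bytes_per_filter = num_docs // 8                        # 125,000 bytes per filter
total_mb = bytes_per_filter * cache_entries / (1024 * 1024)

print(f"each filter: ~{bytes_per_filter / 1024:.0f}KB, "
      f"cache total: ~{total_mb:.0f}MB")
```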

The document cache

The final cache of interest is the document cache. When you query Solr, you don't just want document IDs: you want titles, product names, authors, descriptions, or any other number of descriptive fields. During indexing, we ask Solr to keep these bits of information as "stored fields" on each document, allowing us to get them back in our search results.

When Solr sends back search results, it sends along the requested stored fields for each document. To get these, it must separately read them from the index, querying the on-disk data structures to find the stored fields corresponding to each document ID. This is likely to be slow compared with reading the same data from memory.

If certain documents are requested frequently, Solr can save itself a lot of trouble by keeping their stored fields in memory. That's the role of the document cache—to hold the stored fields of commonly accessed documents. Generally speaking, the document cache isn't as performance critical as the other two caches seen so far. It's unusual for the same document to be fetched multiple times, so the hit rates on the document cache are often quite low.

If you have many stored fields, or large stored values, then you will probably want to keep your document cache relatively small, as this sort of data can consume a large amount of memory.
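A quick sketch of why large stored fields matter for document cache sizing. The field names and average sizes below are assumptions chosen for illustration:

```python
# Rough document cache sizing: the per-entry cost is the combined size of
# each document's stored fields. All field sizes here are assumed averages.
stored_fields = {
    "title": 100,         # bytes
    "author": 50,
    "description": 2_000,
    "full_text": 50_000,  # storing large fields is where memory goes fast
}

bytes_per_doc = sum(stored_fields.values())
cache_entries = 1_000
total_mb = bytes_per_doc * cache_entries / 1_000_000

print(f"~{total_mb:.0f}MB")  # a 1,000-entry cache already costs ~52MB here
```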

Tuning Solr's caches

A warning!

Newcomers to Solr are often tempted to make the caches larger than they need to be. After all, if your machine has lots of memory, why not assign most of it to Solr and make the caches massive? Doing this can actually hurt performance rather than helping:

- Memory assigned to Solr's JVM heap is memory the operating system can't use for its own disk cache, which Solr relies on for fast index access.
- Larger heaps mean longer garbage collection pauses.
- Oversized caches take longer to warm whenever a new searcher is opened.

If in doubt, favour caching less instead of more, and only increase your caches when you have a demonstrated need to do so.

Measuring cache effectiveness

The trick, then, is to work out whether your caches are paying for themselves. The best place to look for this information is Solr's web interface, which provides broad statistics on the different caches discussed so far. The particulars may vary depending on your version of Solr, but in Solr 4 you will find statistics by browsing to your Solr URL, then clicking Core Selector → [core name] → Plugins / Stats.

Here you will find sections for each of the caches described above, plus a few extras.

Each set of statistics has a number of different metrics. To determine the effectiveness of a cache, the most interesting figures are:

- lookups: the number of times the cache was consulted
- hits: the number of lookups that found an entry in the cache
- hitratio: the proportion of lookups that were hits
- evictions: the number of entries thrown out to make room for new ones
- size: the number of entries currently held in the cache
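The hit ratio reported by Solr is simply hits divided by lookups. The numbers below are invented for illustration:

```python
# Computing a cache hit ratio from Solr's reported statistics.
# Both figures here are made-up example values.
lookups = 12_500
hits = 9_875

hitratio = hits / lookups
print(f"{hitratio:.2f}")  # prints "0.79": 79% of lookups avoided real work
```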

The ultimate measure of a cache's performance is its hit ratio. You will need to experiment to find your optimal cache sizes, but keep an eye on your hit ratios to make sure you're making things better (not worse). Some tips:

- A low hit ratio combined with a high eviction count suggests the cache is too small: entries are being thrown away before they can be re-used.
- A low hit ratio with few evictions suggests your query patterns simply don't repeat; a bigger cache won't help, and you may as well shrink it.
- A high hit ratio with no evictions may mean the cache is larger than it needs to be.

It's not an exact science, but with a little experimentation and attention to detail you can make big improvements to your overall performance. Happy searching!