Key-value stores are no longer confined to caching small objects over a local network. memcached is today deployed in many different environments, including geographically diverse datacenters, data warehouses and machine learning training systems. These new deployments hold items sized from kilobytes to megabytes, making it valuable to stitch devices together or to tier devices of different speeds.
Our previous post introduced Extstore, an extension to memcached which allows moving cache data onto flash storage. Working with Intel and Packet, we tested extstore on high-speed flash SSDs and Optane NVMe devices. However, these tests left many interesting questions unanswered:
- How can we manage latency expectations at various throughputs?
- How does extstore perform as the number of devices increases?
In this post we will attempt to answer these questions, running tests under both higher scrutiny and worse conditions than before. memcached can process over 50
million keys per second on a 48 core machine using only RAM and heavy batching. While impressive, real world
scenarios are filled with overhead: small individual requests, syscalls,
context switching, and so on. To avert potential benchmarketing, we’ll attempt to replicate these nuances when testing across RAM, flash and Optane devices.
Tail latency can
cause a minority of slow backend requests to impact a majority of user traffic. While caching systems can improve response
times, services with a high response cardinality can
instead be further impacted: a low hit rate can mean the time taken to query the
cache service is just a time tax added to the missed requests.
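This time-tax effect can be sketched with back-of-the-envelope numbers (all latencies below are invented for illustration, not measurements from these tests):

```python
# Hypothetical latencies, purely illustrative.
cache_lat_ms = 0.5     # time to query the cache service
backend_lat_ms = 20.0  # time to query the backend on a miss

def expected_latency(hit_rate):
    # Every request pays the cache lookup; misses also pay the backend.
    return cache_lat_ms + (1.0 - hit_rate) * backend_lat_ms

# High hit rate: the cache helps.
print(round(expected_latency(0.95), 2))
# Low hit rate: most requests pay cache lookup plus backend anyway.
print(round(expected_latency(0.20), 2))
```

With a 20% hit rate, average latency barely improves over going straight to the backend, while every request still pays the cache round trip.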
Shared cache pools now exist behind services with many backends. This
covers some but not all use cases, as backends have increasingly large responses
that can overwhelm RAM-backed caches. The geographic expansion of applications
can also drive cost, as every service has to be deployed in every location.
To attempt to solve these issues by expanding storage space, extstore was tested in two configurations. For the benchmark we used “Just a Bunch of
Devices” (JBOD): all devices create one large pool of
pages, which spreads reads and writes evenly. JBOD support was released
in memcached 1.5.10.
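As a sketch of the JBOD layout, extstore takes one `ext_path` option per device; the mount points, sizes, and tuning values below are invented for illustration, not our exact test configuration:

```shell
# Repeating ext_path creates one JBOD pool of pages spread across
# all listed devices. Paths and sizes here are examples only.
memcached -m 4096 -I 8m \
  -o ext_path=/mnt/optane0/extstore:700G \
  -o ext_path=/mnt/optane1/extstore:700G \
  -o ext_path=/mnt/optane2/extstore:700G
```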
Another approach we did not benchmark was tiered storage. With extstore, stored data are
organized into logical buckets. These buckets can be grouped onto specific devices,
allowing users to deploy hardware with a mix of small/fast and large/cheap
NVMe devices. Even networked block devices could be used.
The draft can be followed here; it will be detailed in future posts.
Testing used tag 4 of mc-crusher, under the “test-optane” script.
The test hardware was a server with dual 16-core Intel Xeons
(32 total cores), 192G RAM, a 4TB flash SSD, and 3x 750G Optane drives.
CPU and memory were bound to NUMA node 0, performing like a single-socket
server. This should be close to what users deploy with, and will
allow us to focus our findings.
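For reference, binding a process to a single NUMA node's cores and memory can be done with `numactl`; the memory size and thread count below are illustrative, not the exact test invocation:

```shell
# Restrict memcached to node 0's CPUs and local memory
# (values are placeholders, not the benchmark's settings).
numactl --cpunodebind=0 --membind=0 memcached -m 98304 -t 16
```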
Our other testing focused on throughput with sampled request latency, using heavy batching
over a few TCP connections to reach fetch rates of millions of keys per second. This test focuses on a worst-case scenario in both single-device and JBOD
configurations: instead of a few connections
blasting pipelined requests, we have hundreds of connections periodically
sending sets of individual GET commands.
A RAM-only test serves as a baseline, showing our worst-case configuration
without touching extstore at all. It’s important to compare the
percentiles and scatter
plots of RAM versus the drives in question to determine the impact of
moving cache to disk.
- Item values are all 512 bytes in size. With buffered IO, performance was the
same up to 4k, but smaller values were used to reduce the start time of each test.
- Tests were scaled by a target request rate, increasing by 50k gets per second.
- GET requests were pipelined together in stacks of 50 from the bench client to the server, which
mainly saves syscalls for the benchmark client.
- Every key fetch must be individually requested from extstore. If allowed to
batch, extstore can process many IO requests with fewer syscalls.
- Memcached will make at least one network syscall for each GET response.
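The pipelined stacks described above can be sketched as plain ascii-protocol bytes; the key names are invented:

```python
def build_pipeline(keys):
    # Each GET is a complete ascii-protocol command; concatenating them
    # lets the client flush the whole batch with a single write() call.
    return b"".join(b"get %s\r\n" % key.encode() for key in keys)

# A stack of 50 hypothetical keys, mirroring the benchmark's batch size.
batch = build_pipeline("key%d" % i for i in range(50))
```

The server still parses and answers each GET individually, which is why batching mainly saves syscalls on the client side.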
This lands us in a middle ground, with a bias toward the worst case. Since most
users have some level of batching via multi-key fetches, this benchmark should
understate typical performance.
We plot the latency percentiles as the
target request rate increases. We also have a point cloud of every
sample taken over time. This shows
outliers and how consistent the performance is during each test.
It’s important to stress that using percentile averages to measure performance
is problematic. We use them in the first graph to identify trends, then
complement them with the latency
summaries and a scatter point graph of the full
duration of the test. Without this, it’s possible to have tests which look
fine in the line graph, but in reality have severe outliers or drop all traffic for several
seconds.
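The latency summaries can be derived from raw samples; a minimal nearest-rank percentile sketch, with invented sample values:

```python
def percentile(samples, pct):
    # Nearest-rank percentile: sort, then index into the sorted list.
    ordered = sorted(samples)
    rank = max(0, int(len(ordered) * pct / 100.0 + 0.5) - 1)
    return ordered[min(rank, len(ordered) - 1)]

# Invented latency samples in microseconds; one severe outlier.
samples = [120, 95, 110, 3000, 105, 130, 90, 100, 115, 125]
print(percentile(samples, 50))  # median looks healthy
print(percentile(samples, 99))  # tail is dominated by the outlier
```

A median of ~110us with a p99 of 3000us illustrates exactly the failure mode above: an averaged line graph hides the outlier entirely.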
- Partway through the graph, average latencies take a huge jump: this is
the peak performance rate of that device configuration. For example, between 500k and 550k requests per second
the SSD is no longer increasing throughput, and requests begin queueing.
- When moving to multiple devices, the latency cliff comes later but is
worse than with a single device. Three Optane drives perform almost the same as two, due to lock contention on the filesystem buffer cache.
- As you dial the percentile back, to the 90th and below, multiple devices continue to
outperform single devices; meaning we get more outliers as contention rises.
- The benchmark does not have perfect scheduling. Many outliers
(especially under the RAM test) can happen due to periodic dogpiling of
requests. This can mean the 95th-percentile results are closer to what users
would actually experience.
- If your entire request rate is below the latency cliffs, and the latency is
acceptable, you can reduce RAM very aggressively. The Optane drive is
much closer to RAM’s performance under this worst-case scenario, but the SSD
performs admirably as well.
- If your request rate is above the latency cliff, adding RAM is a
straightforward fix. Most workloads heavily bias towards recency,
which means moving a relatively small percentage of hot items to RAM can greatly reduce pressure on
the storage devices.
- NUMA can help give access to more RAM and CPU, but single socket machines give the most consistent latency.
- Future work to improve internal batching and support for direct IO could have
a huge impact on performance.
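The recency bias mentioned above can be sketched with a Zipf popularity model; the key count and skew parameter are assumptions for illustration, not measured workload data:

```python
def zipf_weights(n, s=1.0):
    # Unnormalized Zipf popularity: the item at rank r gets weight 1/r^s.
    return [1.0 / (r ** s) for r in range(1, n + 1)]

def hot_fraction(n, hot_pct, s=1.0):
    # Fraction of all requests that land on the hottest hot_pct of keys.
    w = zipf_weights(n, s)
    hot = int(n * hot_pct)
    return sum(w[:hot]) / sum(w)

# With 1M keys under s=1, keeping only the hottest 5% in RAM still
# captures well over half of the request stream.
print(round(hot_fraction(1_000_000, 0.05), 2))
```

Under these assumed parameters the hottest 5% of keys absorb roughly three quarters of requests, which is why a modest RAM tier takes so much pressure off the flash devices.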
The extra tests shown here demonstrate a worst-case baseline for the
performance of a single machine with one or more
devices. With request rates around 500,000 per second at under a
millisecond of average latency, most workloads fit comfortably. While expanding
disk space works well, further development is needed to improve throughput
with multiple high-performance devices.