Every producer of storage software (including what is hidden inside “hardware”) has spent years and many engineering hours developing its cache algorithms. Disks have cache; controllers have cache; subsystems have cache; operating systems have cache. The better the algorithm, the better the ratio of cache hits to cache misses. Ultimately, though, the decision of what to hold in cache is a guess. It may be a well-informed and well-thought-out guess, but it is still a guess.
Let’s start with a definition of Cache from Wikipedia:
“In computing, a cache is a component that stores data so future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation, or the duplicate of data stored elsewhere. A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests can be served from the cache, the faster the system performs.”
As this blog has discussed before, mechanical disks, spinning rust, are slow. For random accesses, it takes milliseconds to move a drive’s heads from one location to another (mechanical, or seek, latency) and then, on average, 2-5 milliseconds more for the needed sector to spin around and pass under the heads (rotational latency). SSD latency is often one to two orders of magnitude lower; RAM is lower still by a similar margin, and the cache inside a CPU lower again. By comparison, a spinning disk is quite slow.
So caching some often-read data sounds good: a cache hit is served from a much faster, closer, lower-latency resource, while a cache miss goes all the way to the actual non-volatile data repository. For a cache miss, a process probably needs to wait milliseconds.
Writes to storage are often cached as well, allowing a process to continue before the slow work of getting data onto a disk completes. Write caching is trickier, however, because the application and the operating system are led to believe that a write operation has completed when the data has not really made it all the way to its final destination. If something bad happens (e.g., power loss) before the data reaches that destination, data can be lost and file systems can be corrupted.
To prevent data loss, RAM used for write caching is often backed by a battery or capacitor to keep the data viable for some amount of time, in hopes that when operations recover, that “data on the fly” can be safely written to its destination. Most RAM solutions are now backed with Flash to keep the data valid indefinitely. SSDs used for write cache are typically mirrored to prevent data loss due to the failure of a single SSD. Cached writes, also called a write-back cache, must be handled properly so that no data is lost during a fault and subsequent recovery. In most cases, understanding that process in advance and having a set of procedures to deal with such a possibility makes it quite safe to use.
What could be wrong with all of that? For many applications, almost nothing at all. That statement, though, was “for many applications”, not “for all applications”. Why? Because for some applications the difference between a cache hit and a cache miss is extremely significant. For a given solution, the latency (and resulting throughput) for a cache hit or a cache miss can be calculated, so that there is a deterministic answer for worst case performance in each situation.
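That calculation can be written down directly. Here is a minimal sketch in Python; the hit and miss latencies below are illustrative assumptions, not measurements from any particular product:

```python
# Sketch: effective (expected) and worst-case latency for one cache layer.
# Both latency figures are assumed values chosen only for illustration.

CACHE_HIT_US = 100.0    # assumed latency of a hit in a RAM cache, microseconds
CACHE_MISS_US = 8000.0  # assumed latency of a miss served by a spinning disk

def effective_latency_us(hit_ratio: float) -> float:
    """Expected latency: a weighted average of the hit and miss costs."""
    return hit_ratio * CACHE_HIT_US + (1.0 - hit_ratio) * CACHE_MISS_US

def worst_case_latency_us() -> float:
    """Deterministic bound: assume every access misses the cache."""
    return CACHE_MISS_US

for h in (0.99, 0.90, 0.50):
    print(f"hit ratio {h:.2f}: expected {effective_latency_us(h):.0f} us, "
          f"worst case {worst_case_latency_us():.0f} us")
```

Note that only the worst-case number is deterministic: it holds regardless of what the hit ratio turns out to be, which is exactly why it matters in the discussion that follows.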
It is much harder, though, to calculate the hit ratio. Hit ratios from synthetic benchmarks are barely relevant, if relevant at all. Benchmarks that run the actual application against similar data, or against a subset of the real data, are not very relevant either: the odds of finding data from a much smaller data set in cache are much better than they will ever be in production. The only way to really know is to run the actual application in the actual environment with the actual data set. While that caveat is true for most benchmarks, it is not always very practical.
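The subset effect is easy to demonstrate with a toy simulation. The sketch below (the LRU policy, cache size, and data-set sizes are all assumptions chosen for illustration) replays uniformly random access traces against a fixed-size cache: the same cache that hits roughly half the time on a small benchmark subset hits only a few percent of the time on the full data set.

```python
# Sketch: how a smaller data set inflates the measured hit ratio.
# All sizes and the LRU policy are illustrative assumptions.
from collections import OrderedDict
import random

def lru_hit_ratio(trace, capacity):
    """Replay an access trace against an LRU cache; return the hit ratio."""
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)      # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)

random.seed(0)
CACHE_SLOTS = 1_000
full   = [random.randrange(100_000) for _ in range(50_000)]  # full data set
subset = [random.randrange(2_000)   for _ in range(50_000)]  # small subset
print(f"full data set: {lru_hit_ratio(full, CACHE_SLOTS):.1%} hits")
print(f"small subset : {lru_hit_ratio(subset, CACHE_SLOTS):.1%} hits")
```

Neither number says anything about the real workload; it only shows how sensitive the measurement is to the size of the data set being benchmarked.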
To complicate things even further, one cannot really expect those hit ratios to be static over time: data evolves; data usage patterns evolve; applications evolve; user requirements and user behavior evolve; even cache algorithms can evolve when updates are applied. Finally, since there are several levels of caching being used in many situations (disk, controller, subsystem, operating system), how close or how far is the cached data from the requester?
Even a known hit ratio for the real application with real data only yields a statistical expectation: how many I/Os will hit and be serviced in the short hit time, and how many will miss and be serviced in the much longer miss time. It does not provide insight into the performance of any particular access. If the access patterns leading up to a specific request were fully known, and the size of the cache and the cache algorithm were understood, a reasonable prediction could be made. But a basic result of communication theory is that there is no communication if the data was already known at the receiving end; by the same token, all of that information is probably not known, or at least not easily analyzed, just before any particular random access.
The result is that caching yields nondeterministic behavior. True, cache-miss behavior is probably deterministic: for a truly random access that no level of cache predicted, worst-case latency can be calculated, along with the resulting bandwidth, and typical latency may be estimated from an analysis of actual usage on actual systems. But calculating the specific latency of a particular I/O access, in advance, is just not possible.
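As a back-of-the-envelope illustration, that worst case can be computed from a disk’s mechanics. The seek time and drive parameters below are assumptions for a hypothetical 7,200 RPM drive, not the specifications of any real product:

```python
# Sketch: deterministic worst-case latency for a hypothetical 7,200 RPM disk.
# The seek figure and transfer size are assumed values for illustration.

FULL_SEEK_MS = 15.0                      # assumed worst-case head movement
RPM = 7200
FULL_ROTATION_MS = 60_000 / RPM          # one revolution takes ~8.33 ms
WORST_ROTATIONAL_MS = FULL_ROTATION_MS   # sector just passed under the head

worst_case_ms = FULL_SEEK_MS + WORST_ROTATIONAL_MS
worst_case_iops = 1000 / worst_case_ms   # back-to-back worst-case accesses

print(f"worst-case latency: {worst_case_ms:.2f} ms")
print(f"worst-case random 4 KiB throughput: {worst_case_iops * 4:.0f} KiB/s")
```

However pessimistic those numbers look, they are the only ones that hold for every single access, which is the point of a deterministic bound.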
Caching is good, but as we saw, the only deterministic answer for any particular access is to assume, as a worst case, that the data is not held in any cache. Some applications and some environments require a deterministic guarantee. How does one improve the deterministic latency for storage I/O? Today, the answer is to use arrays of enterprise-class NAND Flash SSDs as the primary storage. Sure, there are lots of layers in between that may try to help by caching. Some will get it right and some will still make assumptions that worked much better for rotating mechanical disks, but in the end, the deterministic worst-case answer is now tens or hundreds of microseconds instead of milliseconds or tens of milliseconds.
By next year, NVMe SSD products based on 3D XPoint™ technology will probably be the answer. And that technology will improve year over year. So the answer is never final. The answer is just, what is the best we can do within our budget now?