Every producer of storage software (including the software hidden inside of “hardware”) has spent years and many engineering hours developing its cache algorithms. Disks have cache; controllers have cache; subsystems have cache; operating systems have cache. The better the algorithm, the better the ratio of cache hits to cache misses. Ultimately, though, the decision of what to keep in cache is a guess. It may be a well-informed and well-thought-out guess, but it is still a guess.
Let’s start with a definition of Cache from Wikipedia:
“In computing, a cache is a component that stores data so future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation, or the duplicate of data stored elsewhere. A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests can be served from the cache, the faster the system performs.”
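That definition can be sketched as a tiny lookup layer in front of a slow store. Below is a minimal least-recently-used (LRU) sketch; the names `slow_fetch` and `capacity` are illustrative, not taken from any particular product:

```python
from collections import OrderedDict

def slow_fetch(key):
    """Stand-in for the slow backing store (e.g., a disk read)."""
    return f"data-for-{key}"

class LRUCache:
    """Hits are served locally; misses fall through to the slow store."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = slow_fetch(key)               # the expensive path
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value

cache = LRUCache(capacity=2)
cache.get("a"); cache.get("b"); cache.get("a")
print(cache.hits, cache.misses)  # the second "a" is a hit: 1 hit, 2 misses
```

The eviction policy (here, LRU) is exactly the “guess” discussed above: the cache is betting that recently used data will be requested again.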
As this blog has discussed before, mechanical disks, spinning rust, are slow. For random accesses, it takes milliseconds to move a drive’s heads from one location to another (mechanical latency) and then, on average, 2-5 milliseconds more to wait for the needed sector to spin around and pass under the heads (rotational latency). SSD latency is often 1-2 orders of magnitude lower; RAM is that much faster again; and the cache inside a CPU is faster still. By comparison, a spinning disk is quite slow.
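The rotational component is easy to work out from the spindle speed: on average the needed sector is half a revolution away, so the wait is half of one rotation period. A quick check against a few common (assumed) spindle speeds:

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency: half of one rotation period,
    since on average the needed sector is half a revolution away."""
    return 0.5 * (60.0 / rpm) * 1000.0  # 60/rpm = seconds per revolution

for rpm in (5400, 7200, 10000, 15000):
    print(f"{rpm:>5} RPM -> {avg_rotational_latency_ms(rpm):.2f} ms average")
```

A 15,000 RPM drive averages 2 ms and a 5,400 RPM drive about 5.6 ms, which is where the 2-5 millisecond figure for typical drives comes from.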
So caching often-read data sounds good: a cache hit comes from a much faster / closer / lower-latency resource than a cache miss, which comes from the actual non-volatile data repository. For a cache miss, a process probably needs to wait milliseconds.
Writes to storage are also often cached, to allow a process to continue before the long process of getting data onto a disk completes. Write caching is trickier, however, because the application and the operating system are led to believe that a write operation has completed when the data has not really made it all of the way to its final destination. If something bad happens (e.g., a power loss) before the data reaches that destination, data can be lost and file systems can be corrupted.
To prevent data loss, RAM used for write caching is often backed by a battery or capacitor that keeps the data viable for some amount of time, in hopes that when operations recover, that “data in flight” can be safely written to its destination. Most RAM solutions are now backed with Flash to keep the data valid indefinitely. SSDs used for write cache are typically mirrored to prevent data loss due to the failure of a single SSD. Cached writes, also called a write-back cache, must be handled properly so that no data is lost during a fault and the subsequent recovery. In most cases, understanding that process in advance and having a set of procedures to deal with such a possibility makes it quite safe to use.
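The risk described above can be sketched in a few lines. This is a toy model, not any particular product’s design: writes are acknowledged immediately, persisted later, and anything still dirty at crash time vanishes.

```python
class WriteBackCache:
    """Toy write-back cache: acknowledge now, persist later."""
    def __init__(self):
        self.dirty = {}    # acknowledged but not yet persisted
        self.storage = {}  # the "real" non-volatile destination

    def write(self, key, value):
        self.dirty[key] = value          # fast ack: data is NOT safe yet

    def flush(self):
        self.storage.update(self.dirty)  # slow path: persist everything
        self.dirty.clear()

    def crash(self):
        self.dirty.clear()               # power loss: un-flushed writes vanish

wb = WriteBackCache()
wb.write("a", 1)
wb.write("b", 2)
wb.flush()          # "a" and "b" reach non-volatile storage
wb.write("c", 3)    # acknowledged, but still only in the cache...
wb.crash()
print(sorted(wb.storage))  # → ['a', 'b'] ; "c" was lost
```

Battery, capacitor, or Flash backing exists precisely to keep that `dirty` set alive across the crash so it can still be flushed on recovery.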
What could be wrong with all of that? For many applications, almost nothing at all. That statement, though, was “for many applications”, not “for all applications”. Why? Because for some applications the difference between a cache hit and a cache miss is extremely significant. For a given solution, the latency (and resulting throughput) for a cache hit or a cache miss can be calculated, so that there is a deterministic answer for worst case performance in each situation.
It is much harder, though, to calculate that hit ratio. Ratios from synthetic benchmarks are barely relevant, if relevant at all. Benchmarks based on the actual application but on similar data, or a subset of the data, are not very relevant either: the odds of finding data from a much smaller data set in cache are much better. The only way to really know is to run the actual application in the actual environment with the actual data set. While that caveat is true for most benchmarks, it is not always very practical.
To complicate things even further, one cannot really expect those hit ratios to be static over time: data evolves; data usage patterns evolve; applications evolve; user requirements and user behavior evolve; even cache algorithms can evolve when updates are applied. Finally, since there are several levels of caching being used in many situations (disk, controller, subsystem, operating system), how close or how far is the cached data from the requester?
Even a known hit ratio for the real application with real data only yields a statistical expectation: how many I/Os will hit and be serviced in some short time A, and how many will miss and be serviced in a much longer time B. It does not provide insight into the performance of any particular access. If the data access patterns leading up to a specific request, the size of the cache, and the cache algorithm were all known, a reasonable prediction could be made. But the basis of communications theory is that there is no communication if the data is already known at the receiving end; all of that information is probably not known, or at least not easily analyzed, just before any particular random access.
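That expectation is simple to state: with hit ratio p, hit time A, and miss time B, the average latency is p·A + (1−p)·B, yet no individual access is guaranteed either value. A sketch with assumed figures (100 µs from cache, 8 ms from disk):

```python
def expected_latency_us(hit_ratio, hit_us, miss_us):
    """Average latency over many I/Os; says nothing about any single access."""
    return hit_ratio * hit_us + (1.0 - hit_ratio) * miss_us

# Assumed figures: A = 100 µs (cache hit), B = 8000 µs (disk miss).
for p in (0.99, 0.90, 0.50):
    print(f"hit ratio {p:.2f} -> {expected_latency_us(p, 100, 8000):.0f} µs average")
```

Note how sensitive the average is to the hit ratio: dropping from 99% to 90% hits roughly quintuples the average latency, which is why a hit ratio measured on a shrunken benchmark data set can be so misleading.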
The result is that caching results in nondeterministic behavior. True, cache miss behavior is probably deterministic. On a truly random access that was not predicted in any level of cache, we can calculate the worst-case behavior. Worst-case latency can be calculated, along with the resulting bandwidth. And typical latency may be estimated from an analysis of actual usage on actual systems. Calculation of the specific latency of a particular I/O access, in advance, is just not possible.
Caching is good, but as we saw, the only deterministic answer for any particular access is to assume as a worst case that the data is not being held in any cache. Some applications and some environments require a deterministic measurement. How does one improve the deterministic latency for storage I/O? Today, the answer is to use storage with arrays of enterprise-class NAND Flash SSDs for the primary storage. Sure there are lots of layers in between that may try to help by caching. Some will get it right and some will still make assumptions that worked much better for rotating mechanical disks, but in the end, the deterministic worst-case answer is now tens or hundreds of microseconds instead of milliseconds or tens of milliseconds.
By next year, NVMe SSD products based on 3D XPoint™ technology will probably be the answer. And that technology will improve year over year. So the answer is never final. The answer is just, what is the best we can do within our budget now?
A couple more thoughts:
Some more about “read-ahead cache”.
Read-ahead caching is one technique for coping with the enormous latencies of rotating magnetic media. The concept is: if the drive has already completed a seek to this cylinder, read the rest of the track while the heads are there, because it is reasonable to guess that the data just after this request will be requested next, or soon anyway.
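The guess can be sketched as: on a miss at block n, fetch blocks n through n+k in one I/O. The block interface and the read-ahead depth here are illustrative, not from any particular driver:

```python
def read_blocks(start, count):
    """Stand-in for one physical I/O returning `count` consecutive blocks."""
    return {b: f"block-{b}" for b in range(start, start + count)}

class ReadAheadCache:
    """On a miss, speculatively fetch the next few blocks too."""
    def __init__(self, readahead=8):
        self.readahead = readahead
        self.cache = {}

    def get(self, block):
        if block not in self.cache:
            # The seek is already paid for: grab the requested block plus
            # the next `readahead` blocks while the heads are there.
            self.cache.update(read_blocks(block, 1 + self.readahead))
        return self.cache[block]

ra = ReadAheadCache(readahead=4)
ra.get(100)               # miss: one I/O fetches blocks 100-104
print(103 in ra.cache)    # → True: a sequential follow-up would now hit
```

For sequential workloads on a spinning disk this is nearly free; for random workloads it pollutes the cache with blocks that will never be requested.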
Read-ahead caching is something that often happens at several levels in the stack. That can cause a cascade of lower levels in the stack reading even farther ahead assuming that the request received did not already include any read-ahead.
In days gone by, disks were described by their geometry: heads × cylinders × sectors per track. And so read-ahead, for a single physical disk, could be attempted reasonably. For more than a decade, though, that uniform geometry has not accurately described disks. Each platter is divided into a number of zones from the center of the platter to the outer edge; because there is more area in the outer zones, manufacturers put more sectors on outer tracks than they do on inner tracks. The only things that can really be known are the number of logical blocks and the assumption that a disk can move from one logical block to the next without inordinate seek overhead.
It is really worse than that, though. Most enterprise systems do not directly access physical disks. Requests pass through RAID controllers or SAN infrastructure with RAID on the other end. All of that further obscures whatever understanding an operating system or application might have of which requests cause seeks.
Related to this is the concept of “seek optimization,” also known in some places as the elevator algorithm. File systems and block I/O layers will often re-order their requests to storage so as to minimize seeks. The thought is: if I am going to seek past already-requested location Y on my way to location X, which was requested first, then stop at Y on the way by and reduce the average seek latency. Most elevators adopt a similar approach, re-ordering their stops at intervening floors rather than strictly honoring the order in which the calls were made.
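The re-ordering idea can be sketched as: service requests in position order along the current sweep direction rather than in arrival order. This is a simplified one-sweep SCAN; real schedulers are considerably more elaborate:

```python
def elevator_order(head_pos, requests):
    """Simplified SCAN: sweep upward from the head position,
    then come back for the requests left behind."""
    ahead  = sorted(r for r in requests if r >= head_pos)
    behind = sorted((r for r in requests if r < head_pos), reverse=True)
    return ahead + behind

# Arrival order would seek 50 -> 95 -> 12 -> 60 -> 33 -> 88;
# SCAN visits intervening locations "on the way by" instead.
print(elevator_order(50, [95, 12, 60, 33, 88]))
# → [60, 88, 95, 33, 12]
```

The payoff depends entirely on request positions mapping to physical head positions, which is exactly the assumption that RAID and SAN layers break.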
Seek optimization can work well with a single disk. If the RAID geometry is unknown to the system, though, seek optimization is largely futile. Some file systems, such as XFS and ext4, can be informed of the RAID geometry at format time, and this helps the file system arrange its metadata, but it still does not help the block driver below that level access real data any better.
With all-SSD storage, which has no mechanical latency, the best approach is almost always to disable read-ahead caching and disable seek optimization. In other words, instruct the storage layers to skip all of the optimization they have developed over the decades to make rotating magnetic media not so horrible.