In a recent article, The Next Platform’s Timothy Prickett Morgan argues that “cache is king.” Discussing the topic in hardware terms, he examines the implications of adding another layer to the L1, L2, and L3 cache hierarchy that current processor architectures generally use. The concept: adding L4, another cache layer, to the hierarchy.
Focusing on hardware solutions is certainly one approach. But one could also argue that extending the cache hierarchy intelligently in software can increase overall performance too. Further, a software approach is not bound by the traditional limitations of silicon footprint or the physical constraints of power and heat dissipation.
The good news is that this isn’t just theoretical; it’s happening today. Let me back up a bit and explain the issue in layman’s terms.
A cache of cash!
Consider walking into a bank branch to cash a check, and assume there is no line in front of the teller. You present the check, suitably endorsed. The teller checks it over, opens a drawer with cash in it, counts out the appropriate amount, and hands it to you along with a receipt.
But what happens if the check is large, and there is not enough cash in the drawer to cover it? Well, the teller disappears into a back room of the branch and returns with more cash from the reserve the branch holds for just this scenario. The trip to the back room adds to the time it takes for the bank to service the customer. We call this added time latency.
But what happens if the check is so large that there isn’t enough cash in the bank branch to cover the check? Well, the bank branch manager has to get some additional cash from other branches or institutions to cover the check. However long this takes, it certainly is going to be longer than drawing it from cash reserves held at the local bank branch.
But what happens if the check is large enough that there aren’t enough cash reserves in all the bank branches combined to cover the check? That’s when the Federal Reserve Bank steps in and covers the cash need, assuming of course that the bank is a member of the Federal Reserve System.
Now you have an idea of what L1 (the teller’s cash drawer), L2 (the total cash at the bank branch), and L3 (the sum of all the cash in all the bank branches) are, and what might be called L4 (the Federal Reserve). Is L4 different from main memory? It’s really just a question of semantics, and that’s all I’ll say about that now.
Generalizing the algorithm
If we look at the bank transaction within a computing paradigm, the abstract algorithm that covers deposits and withdrawals is the same at every cache level. The only differences between cache levels are the physical realization of storing the cache elements and the corresponding latencies. It’s also the case that a branch manager, a bank, or even the Federal Reserve might adjust its cash requirements as it learns how patterns of cash usage change. Likewise, the size of a software cache need not be fixed in the way a hardware cache might have to be.
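To make the "same algorithm at every level" point concrete, here is a minimal Python sketch. It is an illustration only: the CacheLevel class, the build_bank_hierarchy helper, the least-recently-used eviction policy, and the latency numbers are all assumptions made for this example, not a description of any real hardware or product.

```python
from collections import OrderedDict


class CacheLevel:
    """One level of a hierarchy; every level runs the same lookup algorithm."""

    def __init__(self, name, capacity, latency, backing=None):
        self.name = name
        self.capacity = capacity      # number of entries this level may hold
        self.latency = latency        # illustrative cost units, not measurements
        self.backing = backing        # next (slower) level, or None for the last
        self.entries = OrderedDict()  # ordered from least to most recently used

    def get(self, key):
        """Return (value, total_latency) for key, filling this level on a miss."""
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key], self.latency
        if self.backing is None:
            raise KeyError(key)
        value, cost = self.backing.get(key)  # same algorithm, one level down
        self.entries[key] = value
        self._evict()
        return value, self.latency + cost

    def resize(self, capacity):
        """A software cache can change size at run time."""
        self.capacity = capacity
        self._evict()

    def _evict(self):
        # Drop least recently used entries until the level fits its capacity.
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)


def build_bank_hierarchy():
    """Hypothetical four-level hierarchy mirroring the bank analogy."""
    reserve = CacheLevel("federal reserve", capacity=10**6, latency=1000)
    reserve.entries.update({n: f"account-{n}" for n in range(100)})
    branches = CacheLevel("all branches", capacity=32, latency=100, backing=reserve)
    branch = CacheLevel("branch", capacity=8, latency=10, backing=branches)
    return CacheLevel("drawer", capacity=2, latency=1, backing=branch)
```

With these assumed numbers, the first request for a key pays the latency of every level it passes through, and a repeat request is served from the drawer alone. Calling resize on any level changes its capacity without touching the lookup code, which is the flexibility a fixed hardware cache lacks.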
The number of software cache levels in the hierarchy is not fixed by hardware. Indeed, this is truly self-evident when one considers that hardware might have three levels of cache, but a modern operating system might manage its own set of caches, for example to store pages of memory or blocks of disk storage. Further, an SQL database might have its own set of caches used for its own purposes, such as storing indices, recovery data, or other data it sees being retrieved or updated on a frequent basis. At any architecture level, the considerations are mostly the same.
A software-defined approach: L4 cache for all
Now, consider a software-defined approach that utilizes the cache layers that are built into the hardware. This is TidalScale’s approach, and we use it to build a distributed virtual machine: a virtual machine whose physical realization spans multiple physical servers. We call this a Software-Defined Server. In the virtual machine, we define what we call guest virtual processors and guest virtual memory. These guest resources become operational by mapping them to real physical processors and real physical memory as needed, on a demand-driven basis. If no re-mapping ever needed to occur, which is highly unlikely, the mapping would carry no performance penalty, since it is largely implemented in hardware; the physical and virtual would coincide for both memory and processors. So, it’s important to keep re-mapping to a minimum. TidalScale’s Hyperkernel, which enables the creation and optimization of Software-Defined Servers, uses machine learning algorithms to achieve this.
Now, getting back to the cache levels, how might we think about an L4 cache? It’s actually pretty simple: Each server’s DRAM is an L4 cache of the virtual machine. It’s algorithmically invisible except for observed differences in latencies, which we work to minimize.
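As a rough illustration of demand-driven mapping and why re-mapping is the cost to watch, consider the toy Python model below. It is not the Hyperkernel: the DemandMapper class, its method names, and the naive policy of always moving a page to whichever node touches it are assumptions for this sketch, and it involves no machine learning.

```python
class DemandMapper:
    """Toy model of demand-driven mapping across physical nodes.

    Assumption for illustration: a guest page is bound to a node on first
    touch and is moved whenever a different node touches it.
    """

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.page_home = {}   # guest page number -> node currently holding it
        self.local_hits = 0   # accesses that needed no re-mapping
        self.remaps = 0       # accesses that forced a page to move

    def access(self, node, page):
        """Record an access to a guest page from a node; return what happened."""
        if node not in self.nodes:
            raise ValueError(f"unknown node: {node}")
        home = self.page_home.get(page)
        if home is None:
            self.page_home[page] = node   # first touch: map on demand
            return "mapped"
        if home == node:
            self.local_hits += 1          # physical and virtual already coincide
            return "local"
        self.page_home[page] = node       # naive policy: move page to the accessor
        self.remaps += 1
        return "remapped"
```

For example, after `m = DemandMapper(["node-a", "node-b"])`, touching page 1 twice from node-a yields "mapped" and then "local", while a later touch from node-b yields "remapped" and increments `m.remaps`. A smarter placement policy would aim to keep that counter low.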
So TidalScale’s Software-Defined Server technology already delivers an L4 cache. In other words, TidalScale utilizes a cache-only design.
Extending the hierarchy
Might we consider an L5 cache? Certainly. Consider a TidalScale configuration containing an additional physical server whose processors either do not exist, or are not used. Only the physical memory of the server is used. The guest physical memory is extended to include the physical memory of this additional server. We can call this additional memory an L5 cache.
One advantage of this L5 cache is that it could be built from either DRAM or persistent flash memory. As seen by an application or a database, this additional memory appears to be byte-addressable, just like all the other memory in the running system. No changes to an operating system or an application are needed. Everything just works as before.
Further, one could have several of these additional servers. The TidalScale architecture does not limit a Software-Defined Server to any fixed number of them. Finally, it’s easy to conceive of a mode of operation in which one could add or remove these additional servers dynamically, just as one could add or remove normal servers.
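The bookkeeping behind adding and removing memory-only servers can be sketched in a few lines of Python. The Node and SoftwareDefinedServer classes here are hypothetical names invented for this example, the sizes are made up, and the sketch deliberately omits the hard part of removal, which is relocating the pages a departing server holds.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    memory_gib: int
    cpus: int = 0  # 0 models a memory-only server (the "L5" tier)


class SoftwareDefinedServer:
    """Toy inventory of the nodes backing one virtual machine."""

    def __init__(self):
        self.nodes = {}

    def add(self, node):
        if node.name in self.nodes:
            raise ValueError(f"duplicate node: {node.name}")
        self.nodes[node.name] = node

    def remove(self, name):
        # A real system would first move this node's pages elsewhere.
        return self.nodes.pop(name)

    def guest_memory_gib(self):
        """Guest physical memory is the sum of every node's contribution."""
        return sum(n.memory_gib for n in self.nodes.values())

    def guest_cpus(self):
        return sum(n.cpus for n in self.nodes.values())

    def tier(self, name):
        """Label a node's DRAM as L4 if its processors are used, else L5."""
        return "L4" if self.nodes[name].cpus > 0 else "L5"
```

Adding a `Node("mem-1", memory_gib=1024)` to a server built from two 512 GiB compute nodes doubles `guest_memory_gib()` while leaving `guest_cpus()` unchanged, which is the L5 extension described above in miniature.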
There’s no need to ‘invent’ L4 cache
Multiple levels of cache hierarchy are not speculation. They are already here. It’s just a question of how we think about the problem.