Computers fail. We wish it weren’t so, but it is. Personal computers fail more often than we would like. Sometimes the cause is physical damage (e.g. “I dropped my laptop” or “the dog chewed on my iPhone”), and sometimes it is a component failure, such as a failing memory module or battery.
Servers in a data center fail too, whether they are in a public or a private cloud. Those failures are often less visible, because the data centers were designed for high reliability from the start, and there is an operations staff that can often respond quickly. The larger cloud providers have taken steps to ensure that users are usually shielded from these failures. That shielding may take advantage of features built into modern applications, particularly databases, that ensure transparent recovery in the presence of either application or hardware failures.
Hardware reliability mechanisms rely on early detection and can then use a combination of redundancy and shadowing, which can be applied to many components, including networks, mass storage, and error-correcting memory. Software mechanisms might include features like consistency checking and error logging, failover, and transaction restart and rollback for recovery.
At TidalScale, we have the ability to aggregate multiple commodity physical servers into what appears to be a large scale-up virtual server. Based on configuration parameters, the virtual server can have the sum of all the processors, all the memory, all the network ports, and all the storage devices of the physical servers. We call these virtual servers Software-Defined Servers.
When we started talking about Software-Defined Servers, people wanted to know what the impact would be on reliability. It was a very natural, obvious question to ask. After all, simple math would dictate that if each physical server has 99.9% hardware reliability, aggregating 40 of them would yield only about 96% reliability (0.999^40 ≈ 0.961), assuming the aggregate fails whenever any one server fails. This is clearly not acceptable. But it’s also not an accurate rendition of the problem. There is more to the story.
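The back-of-the-envelope arithmetic can be checked with a minimal sketch. It assumes that server failures are independent and that the aggregate is down whenever any single server is down; the function name is ours, chosen for illustration.

```python
def series_reliability(per_server: float, n_servers: int) -> float:
    """Probability that all n independent servers are up at the same time."""
    return per_server ** n_servers


# Naive reliability of an aggregate built from 99.9%-reliable servers.
for n in (1, 10, 20, 40):
    print(f"{n:>2} servers: {series_reliability(0.999, n):.4f}")
```

Under these assumptions the figure for 40 servers comes out at roughly 0.961, which is the naive number the rest of this article argues against.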
First of all, the source of unreliability might be hardware, or it might be software. A commonly held belief is that modern hardware is far more reliable than modern, increasingly complex and distributed software. If an integrated system is running fewer instances of software, it stands to reason that there are fewer sources of unreliable behavior. With TidalScale, for example, there is a single instance of an operating system running in a TidalScale cluster, not one instance of an operating system running on each server. So, it stands to reason that if there are 20 servers, running a single instance of an operating system, that’s going to be more reliable than 20 separate servers running 20 operating system instances. For this article, let’s take that as a given.
Similar things can be said for application instances. Over time, more and more error recovery is being built into software layers. Often this happens through redundancy, quick failover recovery, and relaxed guarantees on software (like “eventual consistency”, even when strong consistency is what is really desired). If applications have already taken steps to increase reliability, they can continue to be used in a TidalScale Software-Defined Server without modification.
Minimizing the impact of hardware failures
But, we asked ourselves, is there a different, perhaps a better way of minimizing the impact of hardware failures?
To answer that question, consider that the TidalScale software automatically migrates virtual resources around a set of physical servers, under control of a software layer of machine intelligence that sits below the operating system. Virtual processors, which, as far as a guest operating system is concerned, are physical processors, migrate to where they are needed. Virtual pages of memory, which, as far as a guest operating system is concerned, are physical pages of memory, also migrate to where they are needed.
Mass storage, of course, cannot migrate virtually since it’s physical and stateful (there are actually bits on each storage device), but memory buffers associated with mass storage devices can migrate, and I/O requests can be made virtual. And, if mass storage is remote, this becomes a non-problem; there is no local storage to migrate. The key point is that processor and memory migration take place in a way that is completely invisible to the guest operating system. This happens extremely quickly and is many orders of magnitude faster than can be accomplished with human intervention.
By using masquerading techniques, the same can be said for network interface cards (NICs).
TidalScale’s approach to recovery
So, TidalScale has had for some time now sufficient mechanisms to deal with hardware failures. It’s just a question of putting the mechanisms together in the right way. Consider today’s car. When there is a problem with the car’s oil level, a red light alerts the driver that there is a problem and that service is required. The car does not immediately stop. The alert has been raised. The same thing happens with today’s servers. If one physical server of our software-defined server is exhibiting some errors, would it be possible to bring the physical server down and either fix it or replace it without having to disrupt or reboot the entire software-defined server and the applications running on it?
Let’s take a specific example. Suppose we have a rack of 20 physical servers. And let’s also suppose one is a hot spare, and 19 are running as a single software-defined server. Modern motherboards expose a lot of diagnostic telemetry data. They can alert us when the temperature on a motherboard is rising above normal acceptable limits. They can alert us when DRAM error rates are increasing, and when error rates on NICs are increasing. They can also tell us other things, such as whether or not we’re running an out-of-date BIOS on a server that needs to have some security patches installed.
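As a rough illustration of how such telemetry might be turned into alerts, here is a minimal sketch. The field names and threshold values are hypothetical, chosen only for the example; they are not TidalScale’s actual interface, and real limits depend on the hardware.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Telemetry:
    """Hypothetical snapshot of one server's health readings."""
    temperature_c: float
    dram_errors_per_hour: float
    nic_errors_per_hour: float
    bios_up_to_date: bool


# Illustrative limits only (assumptions, not vendor values).
MAX_TEMPERATURE_C = 85.0
MAX_DRAM_ERRORS_PER_HOUR = 10.0
MAX_NIC_ERRORS_PER_HOUR = 100.0


def health_alerts(t: Telemetry) -> List[str]:
    """Return a list of human-readable alerts for readings outside the limits."""
    alerts = []
    if t.temperature_c > MAX_TEMPERATURE_C:
        alerts.append("temperature above limit")
    if t.dram_errors_per_hour > MAX_DRAM_ERRORS_PER_HOUR:
        alerts.append("DRAM error rate rising")
    if t.nic_errors_per_hour > MAX_NIC_ERRORS_PER_HOUR:
        alerts.append("NIC error rate rising")
    if not t.bios_up_to_date:
        alerts.append("BIOS out of date")
    return alerts


# A server with a failing fan: only the temperature reading is out of range.
print(health_alerts(Telemetry(91.0, 2.0, 5.0, True)))
```

Like the oil light in a car, a non-empty alert list does not stop the server; it only signals that service is required.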
For illustration, consider this example. If there is a problem with a fan, temperature on that server may increase. This can trigger the following actions:
We can bring the spare server online. We simply tell the other servers that it’s available. It’s now part of the software-defined server and can accept virtual processors and active guest pages of memory that the other servers decide to send it, as if it had been there since the system was booted.
We can quarantine the failing server. We do this by telling all the other servers that we’re planning to remove it, and none of the other servers should consider migrating virtual processors to it or migrating active pages of memory to it.
We then tell the failing server to evict its active virtual processors and its active pages of memory to other servers. This can be done quickly.
When the evictions are complete, the failing server is no longer part of the software-defined server. It’s not generating any requests, and it’s not receiving any requests. We can power it down. Then we can fix it or replace it in the rack and make it the new spare.
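The four actions above can be sketched as a toy model. This is only an illustration of the ordering of the steps: the class, its method names, and the round-robin eviction policy are our assumptions, not the hyperkernel’s implementation, and "resources" stands in for both virtual processors and active pages of memory.

```python
class SoftwareDefinedServer:
    """Toy model: maps each physical server name to the resources it hosts."""

    def __init__(self, placement):
        self.placement = {name: list(res) for name, res in placement.items()}
        self.quarantined = set()

    def add_server(self, name):
        # Step 1: the spare joins and may now receive resources.
        self.placement[name] = []

    def quarantine(self, name):
        # Step 2: no other server may migrate resources to this one.
        self.quarantined.add(name)

    def evict(self, name):
        # Step 3: move everything off the server (round-robin, hypothetical policy).
        targets = [s for s in self.placement
                   if s != name and s not in self.quarantined]
        if not targets:
            raise RuntimeError("no healthy server available")
        for i, resource in enumerate(self.placement[name]):
            self.placement[targets[i % len(targets)]].append(resource)
        self.placement[name] = []

    def remove_server(self, name):
        # Step 4: an empty server can leave and be powered down.
        if self.placement[name]:
            raise RuntimeError("server still holds active resources")
        del self.placement[name]
        self.quarantined.discard(name)


def replace_failing_server(sds, failing, spare):
    """Run the four steps in order, without rebooting the modeled system."""
    sds.add_server(spare)
    sds.quarantine(failing)
    sds.evict(failing)
    sds.remove_server(failing)


sds = SoftwareDefinedServer({"s1": ["vcpu0", "vcpu1"], "s2": ["vcpu2"]})
replace_failing_server(sds, failing="s1", spare="spare")
print(sds.placement)
```

Bringing the spare online before quarantining the failing server means the eviction always has somewhere to send resources, and the server count never drops below what it was at the start.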
So, we started with 19 servers in the software-defined server and a spare. And after this process is complete we end up with 19 servers and a spare. We did all this without having to reboot the software-defined server! The servers themselves have become hot swappable. Fred Weber, a close associate of TidalScale, suggests that we could make the claim that all components have become “hot swappable”.
Conceptual Fault Tolerance
The key point is this: instead of trying to come as close as possible to the reliability of bare metal, we have presented a thought experiment that shows that it is, in fact, feasible to make a TidalScale software-defined server far more reliable than a bare metal server.
This has been tried before at companies like Tandem and Stratus. Those companies spent many millions of dollars to modify hardware, operating systems, and databases to achieve higher levels of reliability. With TidalScale, higher reliability is achieved using the basic mechanisms already present in the TidalScale hyperkernel.
In the near future, we will make this capability available to our customers.
But, the story gets better. Since we have the ability to add and subtract servers from a running software-defined server without having to reboot it, why not monitor the load on an operating system, with activity monitors that already exist, to identify when a software-defined server is in danger of getting overloaded? It might be that memory utilization is increasingly high, or that the load on the processors is very high.
Once we identify this condition, we can add another server to the system to help alleviate the excess load. We do this, of course, without disrupting the running system.
And, oh by the way, it works in the other direction also. If we identify a software-defined server that is underutilized, we can subtract servers as well.
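A minimal sketch of such a decision rule follows. The function name and the threshold values are hypothetical and serve only to illustrate the idea; utilizations are expressed as fractions between 0 and 1.

```python
def scaling_decision(cpu_util: float, mem_util: float,
                     high: float = 0.85, low: float = 0.30) -> str:
    """Decide whether to add a server, remove one, or leave the system alone.

    The thresholds are illustrative assumptions, not measured values.
    """
    if cpu_util > high or mem_util > high:
        return "add"      # either resource is close to overload
    if cpu_util < low and mem_util < low:
        return "remove"   # both resources are underutilized
    return "hold"


print(scaling_decision(0.92, 0.40))  # processors heavily loaded
print(scaling_decision(0.10, 0.15))  # system mostly idle
print(scaling_decision(0.50, 0.60))  # normal operation
```

The rule is deliberately asymmetric: one overloaded resource is enough to add a server, but a server is removed only when both processors and memory are underutilized.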
Automatic Dynamic Thin Provisioning
But, the story gets even better. If humans can monitor the system and instruct the software-defined server to add or subtract servers, why can’t the software-defined server do it automatically? Answer: it can.
We do not claim to have achieved a 100% level of reliability. What we have shown is that, because of other features already in place at TidalScale, we can take advantage of preventative maintenance features in the hardware to achieve a reliability level in excess of that of a single physical server.