The TidalScale Blog

    Why Not Just Build a Bigger Box?

    Dr. Ike Nassi founded TidalScale on the premise of aggregating the resources available in one to many commodity servers so they can handle huge database, graph, simulation and analytics computations entirely in memory. 

    The mantra was to scale the system to the problem, rather than investing in expensive large systems or adding to the software burden by forcing users to carve up their problems to fit available resources. In the process, that ability promised to address a major shortcoming in data centers, where servers, unlike storage and networking, have always been an essentially fixed resource.

    Still, we often hear people ask, “If I need a 4TB system, why don’t I just buy one – or build one?” The size might be 1 or 2 terabytes, but could be 8 or 10. The question is a worthy one, and it’s loaded with subtlety. What if 2TB suffices today but you end up needing 5TB next quarter? Or, what if you overestimated and 2.5TB would suffice (and your quarterly budget is up for review)? 

    Building spectacular virtual machines to handle big problems is just one facet of the TidalScale value proposition. The flexibility to split a 4TB system into a pair of 2TB servers, or to lash it together with additional compute nodes to form a 7TB server, goes to the true core of our mission: to bring flexibility to modern data centers by right-sizing servers on the fly to handle any workload. In this context, a rack of servers, or a row of racks, is a menagerie to be deployed as Software-Defined Servers in a wide range of sizes.

    The Question

    But this blog is about single instances. So, why not just build a conventional, physical 4TB box if you need one? Or 10 or 16TB?

    If you’ve ever shopped online for a server, you know the ritual questionnaire. Here’s a taste:

    • How many CPU sockets on the board?
    • How many cores per CPU?
    • How much cache?
    • What is the CPU clock rate?
    • How many DIMM slots on the board for memory?
    • What DIMM sizes are available and at what speed and cost?

    The questions are interrelated, and they are tied to your application. They also beget others, including some that are essential to building a better big box. For instance, what ratio of CPU cores per gigabyte of memory will deliver the optimal application performance?

    A 6TB Example

    Our interest here is large, in-memory calculations, so let’s configure a 6TB box. How many DIMMs will we need? Today, 32GB DIMMs are a cost-effective choice for large-memory systems, but they don’t have the memory density we need for this monster. We choose 64GB DIMMs, and pay a premium. A little arithmetic says we need sixteen 64GB DIMMs per terabyte, for a total of 96 DIMMs to reach 6TB.

    Now, how many CPUs (sockets, not cores) do we need? A typical x86 CPU designed for use one or two to a board can access up to 12 DIMMs. But that’s not enough. We’ve got to go with CPUs designed for deployment at four or more to a board. They typically access 24 DIMMs, so we’ll take four to have a total of 96 DIMMs. At this point we choose our core count to match our needs. The options range from 4 to 24 cores per CPU. Suppose we choose 18 cores, for a system total of 72. That’s 12 cores per terabyte, not an especially high ratio.
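
    The arithmetic above can be checked with a short sketch. The function name and parameters are illustrative, and a terabyte is taken as 1024GB, which is what "sixteen 64GB DIMMs per terabyte" implies.

```python
import math

GB_PER_TB = 1024  # assumption: binary terabytes, so 16 x 64GB = 1TB


def size_physical_box(target_tb, dimm_gb, dimms_per_cpu, cores_per_cpu):
    """Return the DIMM count, socket count, total cores and cores per TB."""
    dimms = math.ceil(target_tb * GB_PER_TB / dimm_gb)
    sockets = math.ceil(dimms / dimms_per_cpu)
    cores = sockets * cores_per_cpu
    return {
        "dimms": dimms,
        "sockets": sockets,
        "cores": cores,
        "cores_per_tb": cores / target_tb,
    }


# The 6TB example: 64GB DIMMs, 24 DIMMs per CPU, 18 cores per CPU.
box = size_physical_box(target_tb=6, dimm_gb=64, dimms_per_cpu=24, cores_per_cpu=18)
print(box)

# The same box with 32GB DIMMs, to show why they lack the density.
box_32gb = size_physical_box(target_tb=6, dimm_gb=32, dimms_per_cpu=24, cores_per_cpu=18)
print(box_32gb["dimms"])
```

    With 32GB DIMMs the same target needs 192 DIMMs, twice what four 24-DIMM CPUs can address, which is why the example pays the premium for 64GB parts.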

    We’ll probably need a 4U (rack unit) chassis to hold all this hardware, along with whatever networking and storage devices we add.

    One System or Four Systems?

    We roughly configured a 6TB server with 4 CPUs totaling 72 cores. It’s a little more complex under the hood. Each CPU has its own memory controller attached to 24 of the 96 total DIMMs. From one point of view, our potent system is really four distinct 1.5TB systems.

    [Figure: diagram of the 4-socket, 6TB system, with each CPU attached to its own 1.5TB of memory]

    What ties this picture together is QuickPath Interconnect, or QPI, a point-to-point interconnect that allows each CPU to reach the blocks of memory beyond its immediate addressability.

    While the system may support three high-speed QPI links, a CPU pays added latency any time it reaches beyond its directly attached memory. Ultimate performance is therefore an artifact of how the memory is filled; it does not improve with usage.
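
    A rough way to see why memory placement matters is to estimate the average access latency. The sketch below assumes data is spread evenly across the sockets and accessed uniformly, so only 1/N of accesses are local on an N-socket board. The two latency figures are hypothetical placeholders chosen for illustration, not measurements of any QPI system.

```python
# Hypothetical latencies, for illustration only (not measured values).
LOCAL_NS = 100.0   # assumed latency to directly attached memory
REMOTE_NS = 150.0  # assumed latency to memory behind another CPU


def local_fraction_uniform(sockets):
    """Fraction of accesses that stay local if data is spread evenly."""
    return 1 / sockets


def average_latency_ns(local_fraction, local_ns=LOCAL_NS, remote_ns=REMOTE_NS):
    """Weighted average of local and remote access latency."""
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns


latency_by_sockets = {
    sockets: average_latency_ns(local_fraction_uniform(sockets))
    for sockets in (1, 2, 4)
}
for sockets, latency in latency_by_sockets.items():
    print(f"{sockets} socket(s): {latency:.1f} ns average")
```

    Under these assumptions three quarters of the accesses on the four-socket box are remote. Raising the local fraction, by keeping threads near their data, is the only term in this formula that software can influence.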

    Some observations on our build-out:

    • A 4U chassis carries four CPUs, 18 cores each, and 6TB of DRAM
    • Each CPU addresses 1.5TB directly, relying on QPI to reach the whole memory array
    • We needed 64GB DIMMs to achieve a full 6TB
    • We needed 4 CPUs to address the 96 DIMMs
    • Growing the system to 7TB or 8TB is not feasible
    • Shrinking the system to 4TB or less requires careful rearrangement of the memory and CPUs

    How a 6TB Software-Defined Server Compares

    TidalScale simplifies the construction of a 6TB server. Here’s a quick recipe:

    • A good building block is a 1U chassis with two CPU sockets and 24 DIMM slots
    • Use cheaper 32GB DIMMs to reach 768GB per unit; choose CPUs with 12 cores for a total of 24 cores per unit
    • Assemble 8 units to achieve 6TB with a total of 192 cores, which is 32 cores per terabyte

    In these units, each CPU addresses 384GB directly, with QPI to reach the other 384GB in the chassis.
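
    The recipe can be sketched the same way. The constants come from the building block described above; the helper names are illustrative, and rounding up to whole units is an assumption about how one would size a server for an arbitrary target.

```python
import math

GB_PER_TB = 1024  # assumption: binary terabytes

# Building block from the recipe: 1U, two sockets, 24 slots of 32GB DIMMs.
UNIT_DIMM_SLOTS = 24
UNIT_DIMM_GB = 32
UNIT_SOCKETS = 2
UNIT_CORES_PER_CPU = 12

UNIT_MEMORY_GB = UNIT_DIMM_SLOTS * UNIT_DIMM_GB   # 768GB per unit
UNIT_CORES = UNIT_SOCKETS * UNIT_CORES_PER_CPU    # 24 cores per unit


def size_software_defined_server(target_tb):
    """Return the whole units needed for target_tb and the resulting totals."""
    units = math.ceil(target_tb * GB_PER_TB / UNIT_MEMORY_GB)
    memory_tb = units * UNIT_MEMORY_GB / GB_PER_TB
    cores = units * UNIT_CORES
    return {
        "units": units,
        "memory_tb": memory_tb,
        "cores": cores,
        "cores_per_tb": cores / memory_tb,
    }


six_tb = size_software_defined_server(6)
seven_tb = size_software_defined_server(7)
print(six_tb)
print(seven_tb)
```

    Because memory and cores grow together one unit at a time, the ratio stays at 32 cores per terabyte at any size, and a 7TB target is met by adding units rather than redesigning the box.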

    How do the two systems compare? This table lays out the basic numbers.

    [Table: side-by-side comparison of the 6TB physical server and the 6TB Software-Defined Server]

    The Software-Defined Server is built from “sweet-spot” components in the current market, using cheaper DIMMs (per gigabyte) and cheaper processors. The Software-Defined Server offers real flexibility, and when implemented using TidalScale’s HyperKernel software, it requires no modifications whatsoever to your OS or your application. In the larger context of a rack, a row, or a data center, Software-Defined Servers are a big win.

    It’s interesting to look at the performance implications of the QPI design. The 4-socket physical system diagrammed earlier bears a vague resemblance to a poor man’s Software-Defined Server: four units of CPU plus memory connected by an interconnect. Yes, QPI is a very fast interconnect, but it lacks the machine learning TidalScale employs to reduce latency: grouping compute threads with their data according to the application’s behavior, then automatically migrating them across the aggregated resources to wherever that workload performs best. On a single-box system, the more CPUs there are, the more latency there will be. There is no similar optimization, no pooling of workloads around the resources (cores or DRAM, for instance) those workloads need most. There is just an array of CPUs and growing latency that degrades the net value of your big box.

    Build Big or Scale Big?

    Deciding whether to build big or scale big with a Software-Defined Server is an interesting question. Our intuition may suggest that a single, integrated box is best. But in many contexts (and in environments where different workloads tax different aspects of a large system at different times), the ability to right-size servers on the fly is the superior option. It offers far greater opportunity to handle varying workloads, spikes in transaction traffic, the demands of new applications and services, and all the other unpredictable inevitabilities of managing a modern data center.

    A TidalScale Software-Defined Server uses fast, commodity CPUs and DRAM, leverages machine learning to drive down processing latency and self-optimize performance, and guarantees the freedom to scale up and down easily as requirements change.

    This blog is dedicated to the memory of Carl A. Gjeldum (1922-2017).