300x Performance Gains Without Changing a Line of Code

In Gary Smerdon's last post, he listed eight ways Software-Defined Servers can help reduce OpEx and CapEx while helping data center managers extract maximum use and value from existing IT resources.

As vital as these benefits are to IT, operations, finance and other areas, the ability to scale your system to the size of your problem is just as beneficial to scientists and analysts – the people on the front lines of big data analytics.

If you fall into that camp, then you're probably familiar with the dreaded "memory cliff." That's the point at which your problem size overtakes the amount of DRAM available on your server. When you fall over the memory cliff, your system starts paging from storage, which can bring performance to a crawl. Consider that the access latency of DRAM is a speedy 50ns. But when you fall off the cliff and put your problem in the hands of slower storage, the performance cost is enormous. Put into human terms, hitting the memory cliff is like seeing your weekly spend on groceries skyrocket from $50 (that's DRAM) to $150,000 (NVMe Flash, the next fastest media). Imagine what that does to application performance.
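The grocery analogy can be made concrete with a little back-of-envelope arithmetic. The sketch below takes the 50ns DRAM figure from above and assumes roughly 150µs for an NVMe read (a 3,000x gap, matching the $50 vs. $150,000 comparison – your hardware's exact numbers will differ), then estimates the average slowdown as a growing fraction of accesses falls over the cliff:

```python
# Back-of-envelope slowdown when a workload spills from DRAM to storage.
# 50 ns DRAM latency is from the post; ~150 us per NVMe read is an
# illustrative assumption -- real figures vary by device.
DRAM_NS = 50
NVME_NS = 150_000

def relative_cost(fraction_paged: float) -> float:
    """Average access cost relative to all-in-DRAM, given the fraction
    of accesses that miss DRAM and hit storage instead."""
    avg_ns = (1 - fraction_paged) * DRAM_NS + fraction_paged * NVME_NS
    return avg_ns / DRAM_NS

print(relative_cost(0.0))   # everything fits in DRAM: 1.0x
print(relative_cost(0.1))   # paging just 10% of accesses: ~300x slower
print(relative_cost(1.0))   # fully over the cliff: 3000x slower
```

Note how little paging it takes to wreck performance: even a 10% miss rate makes the average access roughly 300 times more expensive.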

As data grows in volume, variety and velocity, the memory cliff will only draw closer. So it's no wonder users of R, TERR, Spark, Python and other tools are seeking out solutions that don't involve complex sharding of data, algorithms that compress the data space, or downsizing data to fit the constraints of available systems. These workarounds can delay insights or force data scientists to live without them altogether.

There are compelling reasons why this is unacceptable.

  • Analysts are well aware that finer granularity improves the accuracy of predictions, but as granularity increases, so do RAM requirements. The memory cliff is a barrier to insight.
  • Scaling back data often causes analysts to miss crucial relationships – sometimes with disastrous results.
  • IoT, sensor, and mobile data is fueling startling data growth, which can be worrisome for analysts who know that working with more recent information can lead to better predictions.

Software-Defined Servers make it possible to prevent these costly oversights, and they do so without onerous costs or unnecessary complexity. As Gary described last week, Software-Defined Servers combine multiple servers into a single system that gives users access to all the memory, cores and I/O associated with those servers. The Software-Defined Server can present itself as a single large system or can be easily configured as nodes of different sizes to handle varied workloads. This is all achieved on demand, with standard hardware and with no changes to software.
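Conceptually, the guest operating system simply sees the sum of the underlying nodes' resources. A minimal sketch of that aggregation, using hypothetical node sizes (three 128GB, 24-core boxes chosen purely for illustration):

```python
# Hypothetical sketch: the resources a Software-Defined Server presents
# to the guest OS are the aggregate of its member nodes.
# Node sizes below are illustrative, not a TidalScale requirement.
nodes = [
    {"ram_gb": 128, "cores": 24},
    {"ram_gb": 128, "cores": 24},
    {"ram_gb": 128, "cores": 24},
]

total_ram = sum(n["ram_gb"] for n in nodes)      # 384 GB
total_cores = sum(n["cores"] for n in nodes)     # 72 cores

print(f"Guest OS sees one system with {total_ram} GB RAM and {total_cores} cores")
```

The point is that capacity grows by adding nodes, not by replacing the whole machine with a bigger one.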

The implications for actual application performance can be profound. Today, memory constraints create performance problems when tools like Spark, Python and Open Source R are scaled across traditional clusters. Spark experiences multiple problems when running across a cluster, including memory issues and random errors (and demands significant expertise when those issues arise). As for Python Random Forest, standing up a cluster is, as with Spark, a challenging task for all but the most expert users, requiring constant monitoring and optimization. And R stores all objects in memory – its Achilles' heel – which means the dreaded memory cliff can appear quickly.

When run on a Software-Defined Server, however, those performance-sapping complexities, errors and limitations disappear. No more down-sampling. No more sharding across clusters. No more re-coding applications. Data scientists get to focus on their actual job, not on learning how to get big problems to run across clusters.

Let’s look at what this means.

One issue that arises when trying to run a model in R is that the data footprint in memory is often much larger than the footprint on disk. In a patient expense prediction test we ran on the Centers for Medicare & Medicaid Services (CMS) public use data set, a set of 20 .csv samples, each 6.1GB on disk (122GB in all), totaled more than 680GB once loaded into memory – an expansion of more than 5x.
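That expansion factor is worth knowing before you load anything. A rough capacity planner, using the factor implied by the CMS numbers above (your data's factor will differ with column types and encodings):

```python
# Rough planner: how much RAM will an R data set need?
# The ~5.6x expansion factor is derived from the CMS example in the post
# (20 files x 6.1 GB on disk -> ~680 GB in memory); treat it as a sample
# point, not a universal constant.
DISK_GB_PER_FILE = 6.1
N_FILES = 20
EXPANSION = 680 / (DISK_GB_PER_FILE * N_FILES)   # ~5.6x

def memory_needed_gb(disk_gb: float, expansion: float = EXPANSION) -> float:
    """Estimate the in-memory footprint of data that occupies disk_gb on disk."""
    return disk_gb * expansion

total_disk_gb = DISK_GB_PER_FILE * N_FILES       # 122 GB on disk
print(round(memory_needed_gb(total_disk_gb)))    # ~680 GB once loaded
```

If that estimate exceeds your server's DRAM, you are headed for the cliff.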

For most, this is likely to lead to a harrowing trip over the memory cliff. But with a Software-Defined Server – at TidalScale, we call it a TidalPod – getting more memory simply means adding more nodes to the TidalPod. We see this when we test models in Random Forest (see line 205 here), a mainstay for R users. Note how this Random Forest graph doesn't show what you'd normally see when you keep loading more and more data into memory, namely an exponential curve upward. What you see on the TidalPod is near-linear scaling. No cliffs, and no processes slowing to a crawl, even while, under the TidalScale covers, this workload is being smeared across five servers.
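A toy model captures the two regimes in that graph: runtime grows gently while the data fits in (aggregate) memory, then blows up once a fixed-RAM box starts paging. All constants here are illustrative inventions, not benchmark figures:

```python
# Toy model of the two regimes: near-linear runtime while data fits in
# memory vs. runtime exploding once a fixed-RAM server pages.
# The 0.01 h/GB base rate and 300x paging penalty are hypothetical.
def runtime_hours(data_gb: float, ram_gb: float,
                  paging_penalty: float = 300.0) -> float:
    base = data_gb * 0.01                    # in-memory cost, linear in data size
    if data_gb <= ram_gb:
        return base                          # fits: stay on the linear curve
    spilled = (data_gb - ram_gb) / data_gb   # fraction of accesses that page
    return base * (1 + spilled * (paging_penalty - 1))

print(runtime_hours(100, 128))   # fits on one 128 GB box
print(runtime_hours(100, 640))   # five 128 GB nodes: same gentle curve
print(runtime_hours(400, 128))   # over the cliff on a single box
```

Adding nodes moves `ram_gb` outward, so the workload stays on the gentle branch of the curve instead of falling onto the steep one.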

Using our CMS-based, full-featured Open Source R benchmark, let's see how running a problem entirely in memory on a TidalPod with five 128GB nodes compares to trying to run that same problem on a typical bare metal server with 128GB of RAM. As you can see, the single bare metal server hits the memory cliff right where you'd expect, and execution time soars to the point where it would take more than five months to run the problem. The TidalPod, in contrast, runs roughly 300 times faster. That's the in-memory performance difference that a Software-Defined Server can make.

[Figure: Comparative R performance, bare metal vs. TidalPod]
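As a back-of-envelope check on those figures (taking the "more than five months" and "roughly 300 times faster" claims at face value, and assuming a 30-day month):

```python
# Sanity arithmetic: if the paging bare-metal run takes ~5 months and the
# in-memory TidalPod run is ~300x faster, the TidalPod finishes in about
# half a day. Month length is an assumption (30 days).
HOURS_PER_MONTH = 30 * 24            # 720 hours
bare_metal_hours = 5 * HOURS_PER_MONTH   # ~3600 hours over the cliff
speedup = 300

tidalpod_hours = bare_metal_hours / speedup
print(tidalpod_hours)   # ~12 hours entirely in memory
```

In other words, the difference is not "faster batch job" but "answer this week vs. answer next quarter."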

This is especially good news for organizations equipped with numerous “sweet spot” servers – down-to-earth configurations chosen for their price/performance virtues – that could be marshaled into service as part of a memory- or core-rich TidalPod. With Software-Defined Servers, your army of sweet spot servers instantly becomes more useful, more scalable and more flexible than even your largest big data system.

For more details on the performance advantages of Software-Defined Servers, check out our new webinar, where I look more closely at these and other examples. We also have related blog posts on Tips for Scaling Up Open Source R and The Secret to Keeping your R Code Simple. And we presented our benchmark results at the useR! conference in 2016 – you can access the R whitepaper here.

Be sure to check back next week to learn how all these benefits come together in a Software-Defined Server solution that’s astoundingly easy to use.

Take TidalScale for a Test Drive

Topics: TidalScale, software-defined server, in-memory performance