INTRODUCTION - R IN-MEMORY
When we started searching for large-scale open source R benchmarks, we were surprised to find few good workloads for multi-terabyte TidalScale systems. We ended up writing our own R benchmark, which let us scale R workloads to arbitrarily large in-memory sizes. Along the way we learned a few tips and tricks for running large workloads with open source R, and we thought we'd share them.
BRIEF REVIEW OF OPEN SOURCE R
R is an analytic and statistical programming language whose use is spreading rapidly as organizations sharpen their ability to understand and learn from the data they amass. Many operations in R are memory intensive, and analysts and data scientists often struggle to keep their working data sets within the limits of a single computer to avoid the dreaded "Memory Cliff". Customers report that once R hits the Memory Cliff it becomes effectively unusable, and they are forced either to downsample their data or to rewrite their application completely in Python for a scale-out architecture (if the problem can be split across multiple machines).
BRIEF REVIEW OF TIDALSCALE
The TidalScale HyperKernel lets applications scale up beyond the limits of a single physical server. Unlike traditional scale-up or scale-out approaches, which require either purchasing new hardware or rewriting software to run across clusters, TidalScale’s approach creates a Software-Defined Server that runs on existing commodity hardware and doesn’t require any changes to operating systems or application software. The HyperKernel’s software-defined scalability effectively removes the Memory Cliff and delivers a scalable, high-performance platform for running R analytic loads against datasets that exceed the capacity of any one of the physical servers on which the HyperKernel runs.
LESSONS LEARNED RUNNING LARGE R WORKLOADS ON TIDALSCALE
We learned several lessons in the process of running this benchmark:
- It is easy to use R on a TidalScale system: The scripts we wrote on our laptops ran unmodified on our very large TidalScale system.
- Data exploration is fast on a TidalScale system: The iterative process of exploring what relationships exist in the data is tremendously faster when the complete data set is sitting in system RAM.
- Control garbage collection explicitly: Precise control of garbage collection improved throughput for this workload. Through trial and error we found it was best to:
  - Turn garbage collection off for file load and join, and then
  - Turn garbage collection on for all subsequent analysis steps.
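The two-phase pattern above (load and join with minimal collection, then analyze with collection active) might be sketched as follows. How the benchmark itself toggled collection is not shown here. As far as we know, base R has no documented user-level switch that turns the collector off, so this sketch assumes one approximation: start R with larger `--min-vsize` and `--min-nsize` thresholds so that collections are rarely triggered during load and join, then call `gc()` explicitly before analysis. The data frames, column names, and sizes are hypothetical stand-ins, not the benchmark's actual data.

```r
# Hypothetical sketch: data, column names, and sizes are illustrative only.
# Assumption: to make collections rare during load and join, R is started
# with larger heap thresholds, for example (illustrative values; see ?Memory):
#   R --min-vsize=4G --min-nsize=10M

set.seed(1)
n <- 100000

# Phase 1: load and join. These data frames stand in for two large files.
orders    <- data.frame(id = seq_len(n), amount = runif(n))
customers <- data.frame(id = seq_len(n),
                        region = sample(letters[1:4], n, replace = TRUE))

gc_before <- gc.time()               # cumulative time spent collecting so far
joined    <- merge(orders, customers, by = "id")
gc_load   <- gc.time() - gc_before   # collection time incurred by the join

# Phase 2: drop inputs that are no longer needed, then collect explicitly.
rm(orders, customers)
invisible(gc())                      # one full collection before analysis

# Analysis then runs with R's normal collection behaviour.
by_region <- aggregate(amount ~ region, data = joined, FUN = mean)
print(by_region)
```

Comparing `gc_load` across runs with different startup thresholds is one way to check whether collection is actually costing time during the load phase on a given workload.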
The results of this benchmark test demonstrate that TidalScale successfully traverses the Memory Cliff that applications typically encounter when they exceed the size of reasonably priced hardware systems. As R's memory requirements grow, TidalScale can provide a larger platform to run the application, without modification, at larger in-memory workload sizes. We also learned that:
- It's easy to deploy R on TidalScale systems,
- Data exploration is fast on TidalScale systems, and
- It's important to explicitly control garbage collection at large memory sizes.
Click on either button below to 1) watch a video of the useR! conference presentation on the results of this benchmark, or 2) read the R performance white paper.