Open Compute Rack & the Software-Defined Server

Let’s take a trip back in time. It’s 2009, and Facebook has just become the No. 1 social network in the United States.  In January of that year, Facebook reports it has 150 million users worldwide. Only eight months later, membership doubles to 300 million.

As Facebook careens toward ubiquity, its infrastructure team examines the cost of accommodating the company’s growth and sets out on a quest to build a better data center. Their efforts bear impressive fruit. After two years of work establishing a purpose-built data center, Facebook’s infrastructure team sees a 24 percent drop in operating costs.   

Meanwhile, other leading web sites and web services companies work to optimize their own data centers. But Facebook’s next move is something new entirely: The company shares its infrastructure designs with the public, bringing the open source movement to hardware.  In 2011 Facebook, Intel, Rackspace, Goldman Sachs and Andy Bechtolsheim launch the Open Compute Project (OCP) and incorporate the Open Compute Project Foundation to sustain this new open source hardware.  The organization’s goal is simply to enable mainstream delivery of the most efficient designs for scalable computing.  It will eventually grow to hundreds of members.

The Better Data Center

OCP starts with a blank canvas for datacenter design, which gives it a free hand to:   

  • Remove unneeded components and costs (e.g., faceplates, LEDs and buzzers, making a “vanity free” server).
  • Move components within designs to the optimal location.
  • Separate components that need to scale independently or have different lifecycles.
  • Centralize shared components for lowered costs and greater efficiency.

This leads to the Open Compute Rack and Open Compute Server designs, which optimize thermals and share a more efficient power system.  Sharing a power system eliminates the per-server duplication of power components, which is costlier and less efficient.  Moving and centralizing these components also makes it possible to purchase and upgrade servers without changing out the longer-lived infrastructure.

But why stop there?  What else can be removed, moved, scaled, and shared?

Fast forward to 2014, when Microsoft introduces the Open CloudServer (OCS) specification.  More components are removed, moved, scaled and shared. Per-server BMCs give way to a shared BMC for up to 20 servers, the fans move out of the server to the rack, network transceivers move to a separate tray backplane, and JBOD trays enable independent storage capacity scaling and upgrading.

The next year, Facebook releases the multi-node server platform Yosemite.  It shares many components, with the key addition of a shared networking adapter located separately from the micro-server cards.

In 2016, Facebook releases the NVMe JBOD Lightning, which removes and moves the PCIe flash storage from the server so it can be shared and scaled.

Now the design innovation of removing, moving, scaling and sharing takes on a new moniker: rack disaggregation.  Completely blurring the lines that once defined where one server stops and the next begins means tackling the two remaining building blocks of computers, CPUs and memory.  What is holding them back from being moved, scaled, and shared?

The Final Step in Disaggregation

The challenge of sharing memory and CPU resources so that they can be independently replaced and upgraded boils down to two issues: a cost-effective, high-performance interconnect between machines, and an operating environment that can provision and mobilize memory and CPU resources behind the scenes.  Both of these challenges can be overcome.

Interconnect

It is amazing how fast 10 Gigabit Ethernet went from a high-cost solution in 2011 to a standard for most servers in 2015, and then on to full commoditization today.  A solution that uses Ethernet as its interconnect can therefore meet the cost-effectiveness challenge.  To meet the high-performance requirement, two avenues are available: limit the traffic and increase the bandwidth.

Enter QCT Rackgo-X RSD servers

The Rackgo-X RSD from QCT (Quanta Cloud Technology) combines four OCP servers with a shared embedded network design.  A single shared networking chip connects to each server and provides eight PCIe Gen3 lanes per server.  The same chip also embeds a switch with three 100 GigE ports to the outside world.  Altogether, this provides 8 CPUs, 64 DIMMs, 256 Gbps of internal interconnect and 300 Gbps of external bandwidth out of the box.  The Rackgo-X RSD provides the hardware piece of the rack disaggregation puzzle.
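The bandwidth figures above can be sanity-checked with a little arithmetic. The sketch below is one plausible way to arrive at them, not QCT's published derivation: it assumes the 256 Gbps figure is the raw PCIe Gen3 signaling rate (8 GT/s per lane, per direction) summed over all four servers, and that the 300 Gbps figure is simply the three 100 GigE ports added together. All constant and function names are illustrative.

```python
# Back-of-the-envelope check of the Rackgo-X RSD figures quoted in the text.
# Assumption: 256 Gbps is the raw PCIe Gen3 rate summed over all servers.

SERVERS = 4                   # OCP servers per chassis (from the text)
LANES_PER_SERVER = 8          # PCIe Gen3 lanes per server (from the text)
GEN3_GT_PER_LANE = 8.0        # PCIe Gen3 raw rate per lane, per direction
GEN3_ENCODING = 128 / 130     # PCIe Gen3 uses 128b/130b line encoding
UPLINK_PORTS = 3              # 100 GigE ports on the embedded switch
UPLINK_GBPS = 100


def internal_raw_gbps() -> float:
    """Raw internal interconnect rate across all servers."""
    return SERVERS * LANES_PER_SERVER * GEN3_GT_PER_LANE


def internal_encoded_gbps() -> float:
    """Internal rate after 128b/130b encoding, before protocol overhead."""
    return internal_raw_gbps() * GEN3_ENCODING


def external_gbps() -> int:
    """Aggregate uplink bandwidth to the outside world."""
    return UPLINK_PORTS * UPLINK_GBPS


if __name__ == "__main__":
    print(f"internal (raw):     {internal_raw_gbps():.0f} Gbps")
    print(f"internal (encoded): {internal_encoded_gbps():.1f} Gbps")
    print(f"external:           {external_gbps()} Gbps")
```

Under these assumptions each server gets 64 Gbps of raw bandwidth to the shared chip, and the totals of 8 CPUs and 64 DIMMs imply two CPUs and 16 DIMMs per server.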

Enter the TidalScale HyperKernel

For an application to run on a disaggregated rack, it needs to be oblivious to how the components have been disaggregated. The key to making this happen is the Software-Defined Server.

The TidalScale HyperKernel enables organizations to create Software-Defined Servers on demand, which in turn makes it possible to scale up beyond the limits of a single physical server. TidalScale’s approach creates a Software-Defined Server that runs on existing commodity hardware and doesn’t require any changes to operating systems or application software.  This is accomplished by virtualizing the CPU, memory and I/O resources so they can be mobilized across the servers without the application being aware.  The HyperKernel works to migrate working sets of memory and CPU resources to the same physical machine to reduce interconnect traffic and latency.  This mobilization supplies the missing component needed to move, scale and share both CPUs and memory.
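To make the idea of co-locating a working set concrete, here is a deliberately simplified toy model. It is not TidalScale's algorithm or API; the `VCpu` class, the `preferred_node` and `place` functions, and the "move the virtual CPU to wherever most of its pages already live" policy are all hypothetical, chosen only to illustrate why bringing CPU and memory together on one machine cuts interconnect traffic.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class VCpu:
    """A virtual CPU and the physical server (node) it currently runs on."""
    name: str
    node: int


def preferred_node(page_locations: dict, working_set: list) -> int:
    """Return the node that already holds the most pages of the working set.

    Ties are broken in favor of the lowest node id so the result is deterministic.
    """
    counts = Counter(page_locations[page] for page in working_set)
    return min(counts, key=lambda node: (-counts[node], node))


def place(vcpu: VCpu, page_locations: dict, working_set: list) -> int:
    """Co-locate a vCPU with its working set (toy policy, an assumption).

    Moves the vCPU to the node holding most of its pages, then migrates the
    remaining pages there. Returns the number of pages that had to cross the
    interconnect.
    """
    if not working_set:
        return 0  # nothing to co-locate
    target = preferred_node(page_locations, working_set)
    vcpu.node = target
    moved = 0
    for page in working_set:
        if page_locations[page] != target:
            page_locations[page] = target
            moved += 1
    return moved
```

For example, with pages `{0: 0, 1: 1, 2: 1, 3: 1}` (page id mapped to node id) and a vCPU starting on node 0, calling `place` with working set `[0, 1, 2, 3]` moves the vCPU to node 1 and migrates only one page, instead of pulling three pages across the interconnect to node 0.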

A rack full of QCT Rackgo-X RSD servers and storage system provides a platform of cost-effective components with a high-performance interconnect.  With Software-Defined Servers provisioned on this rack-disaggregated hardware, applications can be deployed without even being aware that they are running across multiple servers. 

The ethos of the early days of innovative data center infrastructure design is with us still. Building a better data center has never been easier as both hardware and software innovations continue their march toward the vision for efficient scalable computing that OCP pioneered.

See TidalScale & QCT at OCP Summit

To find out more about TidalScale and OCP, come see us at the OCP event in Booth B23.

To see the QCT Rackgo-X RSD servers and storage system, go to Booth B5.

Topics: TidalScale, software-defined server, OCP