We recently came across a problem that illustrates how software might be reconsidered in this new, software-defined-server environment.
Customer Problem Statement:
- Consider two tables of historical data for each of, for example, 3,000 securities. One table is called “Left” and the other “Right”.
- Each table for each security has a column of timestamps, a column containing the name of the security it represents (e.g. “AAPL”), and additional columns of data. The Left table might have, for example, 150 additional columns of data, and the Right table might have, for example, 100.
- Each table, both Left and Right, might have, for example, 10,000 rows.
- The tables are initially either read from disk, read from a networked data repository, or, in the case of our sample application, artificially generated.
- The data is rarely updated.
- All the constants mentioned above are arbitrarily selected. They can be much larger, or smaller, depending on need.
- All rows of data need to be merged into a single master table, sorted by timestamp.
- The sorted database then must respond to many different queries, not all of which are predictable in advance.
If you do the math, this is a lot of data. If you increase the number of securities, or the number of rows, or the number of columns of data per row, the amount of data can greatly increase. The cost of a single computer to do this can be high. The amount of network traffic for a scale-out implementation of this can also be high.
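To make “do the math” concrete, here is a back-of-envelope sizing for the example constants above. The 8-bytes-per-value figure is an assumption (e.g. 64-bit floats or timestamps); real storage depends on the column types.

```python
# Rough sizing for the example constants; assumes 8 bytes per value.
securities = 3_000
rows_per_table = 10_000
left_cols = 152   # timestamp + security name + 150 data columns
right_cols = 102  # timestamp + security name + 100 data columns

values = securities * rows_per_table * (left_cols + right_cols)
bytes_total = values * 8
print(f"{values:,} values ≈ {bytes_total / 2**30:.0f} GiB")
# prints: 7,620,000,000 values ≈ 57 GiB
```

Tens of gigabytes at these modest constants; scaling any dimension up pushes the working set well past what a single commodity server holds in memory.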
How can this be implemented efficiently when no server can contain this amount of data?
Answer: combine a bunch of servers into a “software-defined server” which looks like a single server large enough to hold all the data, and which responds to queries very fast.
How to write an application to take full advantage of a Software-defined Server:
- Under control of a parent thread, create one sub-thread per security.
- Each sub-thread generates (or loads from a local or remote database) the data for its security into memory; that memory is shared with the parent thread. The sub-threads all operate in parallel, making ingestion potentially much faster.
- When a sub-thread finishes, it notifies the parent that it has completed its job, and then terminates, ultimately leaving only the parent thread. All the data remains in main memory.
- The parent then builds a table of pointers to all the rows of all the tables of all the securities (#securities * #rows * 2 entries in total).
- The parent then sorts the table of pointers by the timestamps of the rows to which they point.
- Start responding to queries.
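The steps above can be sketched in a few lines of Python. This is an illustrative miniature, not the customer's application: the names (`generate_security`, `SECURITIES`, `ROWS`) are hypothetical, the data is synthetic, and the constants are shrunk for readability.

```python
import random
import threading

SECURITIES = ["AAPL", "MSFT", "GOOG"]  # stand-in for ~3,000 names
ROWS = 5                               # stand-in for ~10,000 rows

tables = {}  # shared memory: security name -> (left_rows, right_rows)

def generate_security(name: str) -> None:
    """Sub-thread body: build both tables for one security in memory."""
    rng = random.Random(name)  # deterministic synthetic data per security
    left = [(rng.randint(0, 10**6), name, rng.random()) for _ in range(ROWS)]
    right = [(rng.randint(0, 10**6), name, rng.random()) for _ in range(ROWS)]
    tables[name] = (left, right)  # each thread writes a distinct key

# Parent: one sub-thread per security, all ingesting in parallel.
threads = [threading.Thread(target=generate_security, args=(s,)) for s in SECURITIES]
for t in threads:
    t.start()
for t in threads:
    t.join()  # a sub-thread "notifies" completion by terminating

# Parent: build one table of references to every row of every table
# (#securities * #rows * 2 entries); the rows themselves never move.
pointers = [row for left, right in tables.values() for row in left + right]
pointers.sort(key=lambda row: row[0])  # sort by timestamp only
# pointers is now the virtual merged master table, ready for queries
```

Only the small pointer table is sorted and scanned; the bulky per-security rows stay wherever they were ingested.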
- The merges are very simple. One can use a recursive merge sort, which is O(n log n).
- All the securities can be loaded in parallel by their own separate threads, which, taken together, are embarrassingly parallel.
- Once the tables for each security are read (or generated) they never move and are rarely or never updated.
- The TidalScale system will automatically spread them out over as many of the servers in the cluster as necessary or available. We call the cluster a Software-Defined Server. The tables themselves never need to migrate.
- No data is ever lost. In fact, one could, if one wanted to do so, write the resulting (virtually) merged table to a backing store as a physical table for archival purposes or for checkpointing.
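Because each per-security table can be kept sorted by timestamp at ingestion, the final merge reduces to a k-way merge of already-sorted streams. As one sketch of this, Python's standard-library `heapq.merge` does such a merge lazily; the tiny two-security dataset below is illustrative only.

```python
import heapq

# Two per-security tables, each already sorted by timestamp (column 0).
aapl = [(100, "AAPL", 1.0), (300, "AAPL", 1.1)]
msft = [(200, "MSFT", 2.0), (250, "MSFT", 2.1)]

# k-way merge of sorted streams; an alternative to a full recursive
# merge sort when the inputs are pre-sorted.
merged = list(heapq.merge(aapl, msft, key=lambda row: row[0]))
# merged rows are in global timestamp order: 100, 200, 250, 300
```

The same `merged` sequence could be streamed to a backing store for archival or checkpointing, as the text notes.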