While it’s true that we can never predict tomorrow’s weather with 100 percent reliability (at least not yet), at the same time it’s true that we can predict yesterday’s weather with 100 percent certainty.
What does this have to do with anything?
Well, it turns out that meteorologists aren’t the only people who use historical data in an attempt to predict reasonable futures.
We do it all the time as we attempt to solve an ever-expanding class of problems. Of course, we don’t always succeed. (See: U.S. Presidential Election, 2016)
And, there’s a lot of history. There’s weather history, baseball statistics, political polling, inflation rates, inflation rate correlation with other economic indicators, and of course stock market trading and performance data.
In an earlier article, I talked about stock trading data histories. (See the source code published at https://www.tidalscale.com/how_to_use_large_memory).
There are two things interesting about historical data:
- It doesn’t change (Skynet and time-travel aside). That’s what we mean by “historical.”
- There’s a lot of it.
When you have a lot of data that doesn’t change, the best way to access it frequently is to put as much of it as you can in memory.* With TidalScale’s Software-Defined Servers, which allow users to right-size their servers on the fly by aggregating the resources of multiple commodity servers, customers can match their server configuration to their memory needs. As their data requirements grow, they only need to add more servers. Or, if today’s problem requires less memory than yesterday’s, they simply remove servers from their Software-Defined Server. If you’re familiar at all with TidalScale, then none of this should be news.
However, I’d like to point out something that may be less obvious. With TidalScale, we frequently migrate data, in the form of pages of memory, to where it’s needed. Consider many queries streaming into a server. Each one could be implemented as a thread whose sole purpose in life is to access the historical records, respond, and then terminate. These threads might run on any node in a TidalPod. In fact, they could also jump around, because they may need pages residing on several different nodes to process the query. Fortunately, TidalScale accommodates this need: we copy pages at a furiously fast rate.
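The thread-per-query pattern above can be sketched in a few lines of Python. This is purely illustrative, not TidalScale code: `HISTORY` is a plain dict standing in for pages of in-memory historical records, and the thread pool stands in for queries arriving at a server.

```python
from concurrent.futures import ThreadPoolExecutor

# A stand-in for in-memory historical data. Because it never changes,
# any number of threads can read it concurrently with no locking.
HISTORY = {"2016-11-08": "rain", "2016-11-09": "clear"}

def handle_query(date):
    # Each worker's sole purpose: read the historical record, respond,
    # and terminate. Read-only access needs no synchronization.
    return HISTORY.get(date, "no record")

# Each incoming query becomes a short-lived worker thread.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(handle_query, ["2016-11-08", "2016-11-09"]))
```

Because the data is immutable, it doesn’t matter which node (or here, which thread) services any given query, which is exactly what makes the workload easy to spread around.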
Historical data can be processed efficiently because it’s just that – historical. It doesn’t need updating. And performance-wise, updating a page is far more expensive than copying it. For instance, updating a page requires a specific sequence of events. First, one node takes “ownership” of the page and becomes the “owner of record.” All other nodes with a copy of the page then have to forget that page. Only when we know for certain that we have ownership of the page, and that no other node still has a copy, can we make the update. After that, the page may be freely copied as before. This is exactly what a processor in a multiprocessor system has to do, so the algorithms for doing this are very well understood.
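The read-versus-write asymmetry described above can be sketched as a toy directory protocol. The names (`Directory`, `read`, `write`, `invalidate`) are illustrative, not TidalScale internals; the point is that a read just adds a copy, while a write must first invalidate every other node’s copy.

```python
class Directory:
    """Toy bookkeeping for which nodes hold a copy of each page."""

    def __init__(self):
        self.copies = {}  # page_id -> set of node ids holding a copy
        self.owner = {}   # page_id -> node id that owns the page

    def read(self, page_id, node):
        # Reading is cheap: just hand the requesting node a copy.
        self.copies.setdefault(page_id, set()).add(node)

    def write(self, page_id, node):
        # Writing is expensive: take ownership, then make every other
        # node forget its copy before the update may proceed.
        for other in self.copies.get(page_id, set()) - {node}:
            self.invalidate(page_id, other)
        self.copies[page_id] = {node}
        self.owner[page_id] = node

    def invalidate(self, page_id, node):
        # In a real system this is a cross-node message; in this sketch
        # the bookkeeping in write() is the whole effect.
        pass

# Three nodes share a page freely while it is read-only...
d = Directory()
for n in ("node-a", "node-b", "node-c"):
    d.read("p1", n)

# ...but a single write forces the other two copies to be dropped.
d.write("p1", "node-a")
```

This mirrors the cache-coherence dance a multiprocessor performs on every shared write, which is why the algorithms are so well understood.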
So, using historical data is far less expensive than updating data. And, there’s a lot of it. And, it can be accessed frequently, and in unpredictable ways. This makes TidalScale particularly well suited to scaling up on read-only or read-mostly data. Databases already do this by scaling out to read-only replicas. For instance, think of a retail website where roughly 96 to 97 percent of all traffic involves browsing versus the 3 to 4 percent of visits that end in a sale. Browsing traffic can be spread across read-only replicas of the catalog, with calls to the master database reserved for the roughly 1-in-30 customers who make a purchase. But configuring and managing scale-out replicas is complicated, so having a single logical server is a big advantage.
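The retail-site traffic split can be sketched as a simple router. The replica names, the `route` function, and the browse/purchase request types are hypothetical, invented for the example; they just show reads fanning out while writes funnel to one place.

```python
import random

# Hypothetical deployment for the example above: ~97% of requests are
# read-only browsing and can go to any replica; the few purchases must
# go to the single writable primary.
REPLICAS = ["replica-1", "replica-2", "replica-3"]
PRIMARY = "primary"

def route(request_type):
    """Spread browsing across read-only replicas; send writes to the primary."""
    if request_type == "browse":
        return random.choice(REPLICAS)
    return PRIMARY
```

The operational complexity the post alludes to lives outside this sketch: provisioning the replicas, keeping them in sync with the primary, and handling failover, which is exactly the work a single logical server avoids.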
Fortunately, there’s a large class of these problems in the industry: real-time analytics, decision support, and in-memory databases come immediately to mind. As we have described in previous blogs, TidalScale is well-positioned to address this class of problem.
I’ll discuss other classes of problems in future blogs. Meanwhile, I have to go check on tomorrow’s weather.
* Structured data in a database can be optimized for faster retrieval via indexes, query plans and parallelism. The right index and query plan can cut the work of evaluating a big data query by orders of magnitude. However, by accessing data directly through pointers, in-memory computing on a Software-Defined Server can deliver performance improvements of 300X without any optimizations or modifications to databases, applications or operating systems.