BIG DATA, SLOW DATA – DATA GRAVITY
Were you a fan of physics or math at school? If you work in IT and have got to grips with the various concepts involved in distributed systems, you will inevitably have come up against a number of highly formal axioms and theorems. And this will certainly be the case if you’ve dug any deeper into these solutions and technologies. So you’re familiar with gravitation, right? But can data also develop gravity and actually start to attract things?
Mass attracts mass. And more mass attracts even more of the same. Until we get to a black hole, which can swallow up whole star systems and even light itself. I’m not really sure whether that’s quite true of data, if we consider data in strictly physical terms. But this metaphor, in which data are compared with mass and its physical properties, does actually work rather well. And gravitation is an excellent concept for explaining one dilemma in digitalization.
The phenomenon of data gravity was first mentioned in a 2010 blog post by Dave McCrory, then Cloud Architect at Dell and most recently Vice President of Software Engineering at GE Digital. What we see today is actually a contradiction in terms: more and more data, more and more virtual storage, greater flexibility, and all of this with outstanding performance. Small volumes of data combined with few and simple arithmetic operations are certainly not the major challenge here. But once our data reach a certain ‘critical mass’, working with them becomes harder, less flexible, slower, and less efficient. And volumes are growing exponentially. Studies forecast around 175 zettabytes of data worldwide by 2025, up from 33 ZB in 2018 (1 zettabyte = 10^21 bytes). Autonomous vehicles offer us a good example here. Today, one driverless car can easily produce 40 TB of data a day, which is a truly impressive volume.
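To get a feel for these figures, here is a rough back-of-the-envelope sketch. It assumes smooth compound growth between the two forecast values and decimal units (1 TB = 10^12 bytes); the function names are made up for illustration and neither assumption comes from the studies themselves.

```python
SECONDS_PER_DAY = 86_400


def implied_annual_growth(start_zb: float, end_zb: float, years: int) -> float:
    """Yearly growth rate that turns start_zb into end_zb, assuming steady compounding."""
    return (end_zb / start_zb) ** (1 / years) - 1


def sustained_rate_mb_per_s(tb_per_day: float) -> float:
    """Average data rate in MB/s for a given daily volume (decimal units assumed)."""
    return tb_per_day * 1e12 / SECONDS_PER_DAY / 1e6


# 33 ZB in 2018 and 175 ZB in 2025 are the figures quoted in the text.
growth = implied_annual_growth(33, 175, 2025 - 2018)
# 40 TB per day is the driverless-car figure quoted in the text.
rate = sustained_rate_mb_per_s(40)

print(f"Implied growth: {growth:.1%} per year")
print(f"One car: about {rate:.0f} MB/s, around the clock")
```

Under these assumptions the forecast works out to roughly 27% growth per year, and a single car would have to move about 463 MB every second just to keep up with its own output.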
Massed Data Tends to Get Slower
A recent example that illustrates the underlying causes very well is offered by a December 2019 press release issued by the Euronext stock exchange. The press release introduces Euronext’s new hybrid cloud model. Euronext’s IT unit is going to store large volumes of data in the AWS Cloud, with the proviso that some information will not be stored at Amazon because this would involve unmanageable latencies. This ‘latency’ can be described as the time required for a command (the sending of a data packet) to cross a network from the sender to the recipient. Or, to put things much more simply: You can offload huge volumes of data into the cloud, but if you need frequent, quick access to these data, then it’s better to keep them in the same network or data center where the business application is also being hosted.
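Why frequent access makes latency so painful can be shown with a very simple model: if an application issues its requests one after another, every request pays the full round trip. The sketch below is only an illustration. The round-trip times (0.5 ms inside a data center, 30 ms to a remote cloud region) and the request count are hypothetical values, not figures from Euronext or AWS.

```python
def total_wait_seconds(round_trips: int, rtt_ms: float) -> float:
    """Time spent purely waiting on the network for sequential requests."""
    return round_trips * rtt_ms / 1000


# Hypothetical workload: 100,000 dependent lookups, issued one after another.
REQUESTS = 100_000

# Assumed round-trip times, for illustration only.
local_wait = total_wait_seconds(REQUESTS, rtt_ms=0.5)    # same data center
remote_wait = total_wait_seconds(REQUESTS, rtt_ms=30.0)  # remote cloud region

print(f"Same data center: {local_wait:.0f} s")
print(f"Remote cloud:     {remote_wait / 60:.0f} min")
```

With these assumed values, the same workload waits 50 seconds on the local network and 50 minutes when the data sit in a remote region. Batching or parallel requests would soften this, but the per-request penalty itself does not go away.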
So the problem of latency forces the Euronext exchange to manage its data close to the business application. We could say that the data and applications attract one another. One obvious solution would be to simply host the applications in the cloud and so avoid the problem of latency altogether. But who can say where this information is actually stored in this virtual environment? Where the physical storage or application is actually hosted will apparently always be the cloud provider’s little secret. Even the choice of ‘regions’ that is offered when configuring cloud services does not by any means pin this physical location down. Dependency on individual service providers would also increase to an extent where this monolithic technical infrastructure could only be switched out with the very greatest of difficulty. But following this related line of thought (that we could simply relocate the application into the cloud) does show that we have an intuitive grasp of who is attracting whom here. And our intuition is right. Large volumes of data tend to pull applications and services towards them, and not vice versa. This is because latency is not the only problem in this context: throughput is also a key issue.
Let’s assume that a company has stored large volumes of data with a cloud services provider. After a few successful years of this partnership, the service provider then simply decides to pull the plug. Generously, the service provider offers the customer a grace period for migration, perhaps with a 50% discount on the costs involved into the bargain. What’s the solution? Simply take our data back out of the cloud and store them somewhere else? But we’re talking about several hundred terabytes of data, to start with, and apart from latency, we also have a problem with simply handling data throughput.
If we’ve been using our connection to the cloud to the full while building up a massive store of data, it’s relatively unlikely that this kind of volume could be migrated through the same pipeline in just a few months. So as data acquire mass, they also slow down, just as with mass in physics. Legal compliance requirements, for example, make matters worse: if the integrity and completeness of the data also have to be guaranteed while these huge volumes are being transferred, a rapid migration can quickly develop into an expensive nightmare.
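The throughput problem is easy to estimate. The following sketch assumes a hypothetical 500 TB data store (one reading of ‘several hundred terabytes’) and a hypothetical 1 Gbit/s connection, and it ignores protocol overhead, retries, and verification passes, so real migrations would take longer.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(volume_tb: float, link_gbit_s: float, usable_fraction: float = 1.0) -> float:
    """Days needed to move volume_tb over a link, given the share of it we can actually use."""
    bits_to_move = volume_tb * 1e12 * 8                  # decimal terabytes to bits
    usable_bits_per_s = link_gbit_s * 1e9 * usable_fraction
    return bits_to_move / usable_bits_per_s / SECONDS_PER_DAY


# Hypothetical values for illustration: 500 TB over a 1 Gbit/s line.
best_case = transfer_days(500, 1.0)
# Same line, but half of it is still needed for day-to-day business traffic.
shared_line = transfer_days(500, 1.0, usable_fraction=0.5)

print(f"Dedicated line: {best_case:.0f} days")
print(f"Shared line:    {shared_line:.0f} days")
```

Even with the line fully dedicated to the migration, these assumed numbers give about 46 days of uninterrupted transfer. If half the bandwidth is still needed for normal operations, that becomes roughly 93 days, which already uses up a grace period of ‘a few months’.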
Conclusions
So large volumes of data attract applications and services, because these can usually be moved around more flexibly. The greater the volume of persistent data in one location, the slower it becomes. As volumes continue to rise, latency and limited throughput make it harder to work in a flexible, distributed manner or simply to move these volumes of data around. This phenomenon makes it tempting to entrust everything to a single cloud provider and also host our applications in the same provider’s infrastructure. However, this does not automatically solve all of our problems (latency remains an issue), and it also makes us completely dependent on this one cloud provider. Hybrid or multi-cloud models with storage classes that take geolocation into account therefore appear to be the only sensible answer to these kinds of challenges for high-volume systems at this time. So, if you’re drawing up your future cloud strategy today, do think about what you might need tomorrow.