This article is an introduction to big data and explains how it relates to data lakes.
Big data has been challenging humans since the beginning of civilization. In ancient times, humans counted on their fingers, so anything more than ten things posed difficulties. They also counted by carving marks into wood, but they could not make quick calculations that way. Some ancient languages had words for “one” and “two”; anything more than two was simply “many.”
Fast-forwarding to today, we see a similar theme playing out: we still can’t count. Of course, we can count anything we can see, but we have a much bigger appetite now; we also want to predict what we haven’t seen. Doing so is made possible by probability theory and, in particular, the Law of Large Numbers. According to this law, we can accurately estimate the typical characteristics of any random quantity by examining a sufficiently large number of its samples. But how large is “sufficiently large”? The law doesn’t say. This vagueness is the modern-day equivalent of the word “many” in ancient times.
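For reference, a standard statement of the (weak) Law of Large Numbers is sketched below; the notation is not from the original article. Note that the statement guarantees convergence but gives no rate, which is exactly the vagueness about “sufficiently large.”

```latex
% Weak Law of Large Numbers: the average of n independent, identically
% distributed samples X_1, ..., X_n with mean \mu converges in probability
% to \mu. The statement gives no universal answer to "how large is
% sufficiently large."
\[
  \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
  \qquad
  \lim_{n \to \infty} \Pr\left( \left| \bar{X}_n - \mu \right| > \varepsilon \right) = 0
  \quad \text{for every } \varepsilon > 0 .
\]
```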
As you can see, the big data challenge is a natural part of human evolution. It manifests itself in different ways at different times. Nowadays, the following V-words summarize the challenge.
- Volume. If there were ever a Moore’s Law for data growth, it would most likely proclaim that data volume doubles in less than a year. We are witnessing an explosion of data from every walk of life.
- Velocity. Even if you have not felt the pressure of a large amount of data, you may have seen the need to process data quickly. Processing 5 MB of data is not challenging, but having to do so within a few microseconds can suddenly become a real challenge (see the quick arithmetic sketch after this list).
- Veracity. Your ability to deliver results quickly can be hampered by impurities in your data due to human errors (or malicious attacks), structural inconsistencies, or a lack of structure. You may find yourself in a constant uphill struggle to maintain data quality.
- Variety. You may need to deal with data streams containing vastly different things. You want to make sense of messy data so that you can “massage” it into usable shapes for your users.
- Volatility. Time is the most common dimension in a big-data system. You may find yourself having to capture and process data with no identifiable breaks, or you risk data loss. Furthermore, both data and metadata may be subject to change over time.
- Value. Making data systems huge becomes a necessary evil because the more data we collect, the more likely we are to derive some good value from it (thanks to the Law of Large Numbers).
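To make the velocity bullet concrete, here is a quick back-of-the-envelope calculation; the 5-microsecond window is an assumed figure used only for illustration.

```python
# Quick arithmetic for the velocity bullet above: sustaining "5 MB in a few
# microseconds" implies throughput on the order of a terabyte per second.
# The 5-microsecond figure is an assumption used only for this sketch.
data_bytes = 5 * 10**6          # 5 MB
window_seconds = 5 * 10**-6     # a few (here: 5) microseconds

throughput = data_bytes / window_seconds
print(f"{throughput / 10**12:.1f} TB/s")   # -> 1.0 TB/s
```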
The data processing world has been experimenting with various strategies over the past three decades, moving from databases to data warehouses and gradually converging on data lakes.
Database systems fall short in coping with the challenges of data velocity, veracity, variety, and volatility. The ACID requirement makes it difficult to attain ultra-high speed. Furthermore, data must already be clean when entering a database system, and it is cumbersome for traditional database systems to adapt to change.
The essence of data warehousing is to precompute a massive number of joins and aggregations so that decision-making queries can be lightning fast. It also performs extensive data-quality processing to cope with issues of data veracity and data variety. More recently, data warehouses built with an immutable strategy (i.e., collected data are never updated) have also alleviated the pressure of data volatility. But the data warehousing idea completely misses the target on data velocity.
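The precomputation idea can be illustrated with a small sketch; the tables, columns, and the use of pandas here are assumptions for illustration, not the design of any particular warehouse product.

```python
# Hypothetical illustration of the data-warehouse idea: perform an expensive
# join + aggregation once, up front, and then answer decision queries from the
# small precomputed summary table. Table and column names are invented.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 30],
    "amount": [25.0, 40.0, 15.0, 60.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 30],
    "region": ["EU", "US", "US"],
})

# Precompute: join the fact table to the dimension table, then aggregate.
sales_by_region = (
    orders.merge(customers, on="customer_id")
          .groupby("region", as_index=False)["amount"].sum()
)

# Decision-making queries now read only the tiny precomputed table.
print(sales_by_region[sales_by_region["region"] == "US"])
```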
The data lake strategy helps us meet all of these challenges. A data lake captures real-world data with little processing, regardless of what the data are. Capture is immediately followed by a real-time data purification mechanism known as data pipelines. Data pipelines are free of joins across real-time data streams in order to avoid pipeline stalls.
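Below is a minimal sketch of such a join-free pipeline, assuming newline-delimited JSON input; the field names and validation rules are invented for illustration.

```python
# A minimal sketch of a join-free data pipeline: each stage is a stateless,
# per-record transformation, so records flow through without waiting on any
# other stream. Field names and quality rules are assumptions.
import json
from typing import Iterable, Iterator

def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Decode raw lines, dropping records that are not valid JSON."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # veracity: discard malformed input instead of stalling

def normalize(records: Iterable[dict]) -> Iterator[dict]:
    """Coerce fields into a consistent shape (variety)."""
    for rec in records:
        rec["user_id"] = str(rec.get("user_id", "")).strip()
        rec["value"] = float(rec.get("value", 0.0))
        yield rec

def keep_valid(records: Iterable[dict]) -> Iterator[dict]:
    """Filter out records that fail basic quality checks."""
    for rec in records:
        if rec["user_id"] and rec["value"] >= 0:
            yield rec

raw = ['{"user_id": 7, "value": "3.5"}', 'not json', '{"user_id": "", "value": 1}']
for clean in keep_valid(normalize(parse(raw))):
    print(clean)  # only the first record survives purification
```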
A data lake still needs data warehouses or other join-efficient mechanisms (such as Spark or Pulsar) to deliver final value. But it makes those downstream processes much more efficient, since much of the computation can be moved into the data pipelines. The real-time nature of data pipelines makes it possible to amortize the cost of data preparation. A data lake therefore corresponds to the foundational layers of a pyramid: it is more big-data-ish than the upper layers and, arguably, the most important part of any big-data system.
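As a sketch of what that downstream step might look like, here is a hypothetical PySpark job that reads already-purified data from the lake and performs the join-heavy work there; the paths, table names, and columns are made up for the example.

```python
# Hypothetical downstream processing with Spark: read already-purified Parquet
# files produced by the data pipelines, then do the join-heavy rollup here.
# Paths, dataset names, and columns are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-rollup").getOrCreate()

events = spark.read.parquet("s3://my-lake/clean/events/")   # pipeline output
users = spark.read.parquet("s3://my-lake/clean/users/")     # pipeline output

daily_value = (
    events.join(users, "user_id")    # joins happen here, not in the pipeline
          .groupBy("region", F.to_date("event_time").alias("day"))
          .agg(F.sum("value").alias("total_value"))
)

daily_value.write.mode("overwrite").parquet("s3://my-lake/marts/daily_value/")
```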
For a detailed discussion of data lakes, please read the article What is Data Lake.