Every day human beings eat, sleep, work, play, and produce data: lots and lots of data. According to IBM, the human race generates 2.5 quintillion (2.5 billion billion) bytes of data every day. That’s the equivalent of a stack of DVDs reaching to the moon and back, and encompasses everything from the texts we send and photos we upload to industrial sensor metrics and machine-to-machine communications.
That’s a big reason why “big data” has become such a common catchphrase. Simply put, when people talk about big data, they mean the ability to take large portions of this data, analyze it, and turn it into something useful.
Exactly what is big data?
But big data is much more than that. It’s about:
- taking vast quantities of data, often from multiple sources
- and not just lots of data but different kinds of data, often multiple kinds at the same time as well as data that changes over time, none of which first needs to be transformed into a specific format or made consistent
- and analyzing the data in a way that allows for ongoing analysis of the same data pools for different purposes
- and doing all of that quickly, even in real time.
In the early days, the industry came up with an acronym to describe three of these four facets: VVV, for volume (the vast quantities), variety (the different kinds of data and the fact that data changes over time), and velocity (speed).
Big data vs. the data warehouse
What the VVV acronym missed was the key notion that data did not need to be permanently changed (transformed) to be analyzed. That nondestructive analysis meant that organizations could both analyze the same pools of data for different purposes and analyze data from sources gathered for different purposes.
By contrast, the data warehouse was purpose-built to analyze specific data for specific purposes. In a process called extract, transform, and load (ETL), the data was structured and converted to specific formats for that one purpose and no other, and the original data was essentially destroyed along the way. Data warehousing’s ETL approach limited analysis to specific data for specific analyses. That was fine when all your data existed in your transaction systems, but not so much in today’s internet-connected world with data from everywhere.
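To make the contrast concrete, here is a minimal Python sketch. It is only an illustration, not how any particular product works: the event records, field names, and function names are all invented for the example. One function mimics an ETL step that keeps only what a sales report needs, while the others ask unrelated questions of the same untouched raw data.

```python
import json
from collections import Counter

# Hypothetical raw events of mixed kinds, stored exactly as they arrived.
RAW_EVENTS = [
    '{"type": "sale", "store": "north", "amount": 120.0, "note": "gift wrap"}',
    '{"type": "sensor", "device": "freezer-7", "temp_c": -17.5}',
    '{"type": "sale", "store": "south", "amount": 80.0}',
    '{"type": "text", "body": "Where is my order?"}',
    '{"type": "sale", "store": "north", "amount": 50.0}',
]


def etl_sales_by_store(raw_lines):
    """ETL-style step: keep only what one sales report needs.

    Non-sale records and extra fields such as "note" do not survive
    the transform, so the resulting table answers one question only.
    """
    totals = {}
    for line in raw_lines:
        record = json.loads(line)
        if record.get("type") != "sale":
            continue  # dropped during the transform
        store = record["store"]
        totals[store] = totals.get(store, 0.0) + record["amount"]
    return totals


def count_by_type(raw_lines):
    """A different question asked of the same raw data, read as-is."""
    return Counter(json.loads(line)["type"] for line in raw_lines)


def coldest_reading(raw_lines):
    """Yet another question; only possible because sensor records were kept."""
    readings = [
        record["temp_c"]
        for record in map(json.loads, raw_lines)
        if record.get("type") == "sensor"
    ]
    return min(readings) if readings else None


if __name__ == "__main__":
    print(etl_sales_by_store(RAW_EVENTS))  # {'north': 170.0, 'south': 80.0}
    print(count_by_type(RAW_EVENTS))
    print(coldest_reading(RAW_EVENTS))
```

If only the output of `etl_sales_by_store` had been stored, the sensor and text records would be gone and the last two questions could not be answered. Keeping `RAW_EVENTS` unchanged is what allows each new analysis to start from the original data.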
However, don’t think for a moment that big data makes the data warehouse obsolete. Big data systems let you work with unstructured data largely as it comes, but the query results you get are nowhere near as sophisticated as those from a data warehouse. After all, the data warehouse is designed to get deep into data, and it can do that precisely because it has transformed all the data into a consistent format that lets you do things like build cubes for deep drilldown. Data warehousing vendors have spent many years optimizing their query engines to answer the queries typical of a business environment.