Wednesday, June 05, 2013

Why we should probably retire "Big Data"

Data is hugely important. It long has been, of course. It's often argued that applications are the longest-lived IT asset. Arguably, the data that those long-lived applications create and access sticks around for at least as long. Data is only going to become more important.

Is there a lot of hype around data today? Sure. Raw data isn't information. And information doesn't necessarily lead to useful action if organizations (or their customers and users) aren't willing to change behavior based on new information. Just because pricing mechanisms can be used to reduce traffic congestion based on sensor data--just one of the ideas discussed under the "Smart Cities" term--doesn't mean that even basic congestion pricing is necessarily politically viable. 

Furthermore, I strongly suspect that the lots-of-data hammer will turn out to be a rather unsuitable tool for certain (perhaps many) classes of problems--the enthusiasms of the "End of Theory" crowd notwithstanding. For example, it's unclear to what degree more data will really help companies design better products or target their ads better. (To be clear, there's long been lots of market research in the consumer space. It's just that it hasn't been terribly effective in creating the killer ad or the killer product.)

But it would be a mistake to think that today's data trends are 1990s-style Data Warehousing with just a fresh coat of paint. Whatever the jokes about the local weatherman, weather forecasting has improved. Self-driving cars will happen, though they may take a while to come into mainstream use. DNA sequencing is now commonplace--although, in a common theme, we're still in the early days of figuring out what we can (and should) do with the information obtained. And we're well on the way to sensors of all sorts becoming pervasive. 

Which makes the "Big Data" term somewhat unfortunate in my view. I realize that may seem a bit of a contradiction given what I wrote above. Let me explain.

My first problem is that "Big Data" is too narrow. This is true even if we use the term in the broader sense of data that is atypical in some respect--not necessarily in its volume. The four Vs (volume, velocity, variety, and veracity) are a common shorthand. (For a less precise, but possibly more accurate description, I like the "Difficult Data" term that I heard from the University of Washington's Bill Howe.)

But an emerging culture of data doesn't have to be about big or even difficult. Discussions about data at the MIT Sloan CIO Symposium last month included big data examples, but they were also, in no small part, about cultures of data and the breaking down of silos. Just as with IT broadly and cloud computing, data and storage increasingly have to be based on a hybrid model in which data can be accessed when and where it is needed, not locked up in one department or even one organization. Governments and others are increasingly making open data available as well.

It's also worth remembering that Nate Silver made headlines for calling the last US presidential election correctly not because he did big data stuff, or even because he applied particularly innovative or sophisticated analysis to polling data, but mostly because he used data and not his gut.

The second issue I have with "Big Data" isn't really that term's fault. Rather, it's that "Big Data," today, is so frequently conflated with Hadoop.

Based on Google's MapReduce concept, Hadoop divides data into many small chunks, each of which may be processed or re-processed on any node in a cluster of servers. Per Wikipedia: "A MapReduce program comprises a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)."
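
To make that concrete, here's a toy sketch of the map/shuffle/reduce pattern in plain Python. The student records and function names are purely illustrative; a real Hadoop job would distribute each phase across the nodes of a cluster rather than run in a single process.

from collections import defaultdict

def map_phase(students):
    # Emit one (first_name, 1) pair per student record.
    for student in students:
        yield (student["first_name"], 1)

def shuffle(pairs):
    # Group values by key; in a real cluster, this grouping happens
    # as intermediate data moves between nodes.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Summarize each queue: count the students bearing each name.
    return {name: sum(counts) for name, counts in groups.items()}

students = [{"first_name": "Ada"}, {"first_name": "Grace"}, {"first_name": "Ada"}]
print(reduce_phase(shuffle(map_phase(students))))
# {'Ada': 2, 'Grace': 1}

The point of the pattern is that map and reduce each operate on independent chunks, which is what lets Hadoop spread the work across many machines.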

Hadoop also provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. (The standard file system is HDFS, but other filesystems, such as Gluster, can be substituted for higher scalability or other desirable characteristics.)

Hadoop is often a useful tool. If you can split up data sets and work on them with some degree of autonomy, Hadoop workloads can scale very large. It also allows data to be operated on in situ without first being loaded and transformed into a database, which can greatly decrease overhead for certain types of jobs. (This presentation by Sam Madden at MIT CSAIL offers some benchmarks as well as some pros and cons of Hadoop relative to RDBMSs.)

However, data can be processed and analyzed using a wide variety of tools, including NoSQL databases of various kinds, "NewSQL" databases, and even traditional RDBMSs like PostgreSQL (which can still scale sufficiently to handle a great many types of data analysis and transformation tasks). In fact, we even see something of a trend of new-style databases adding back traditional RDBMS features that had been assumed to be unnecessary.
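
For perspective, the toy name-counting question above is a one-line query in a traditional SQL database. The snippet below uses Python's built-in sqlite3 module purely to keep the sketch self-contained; the GROUP BY would look the same in PostgreSQL or any other relational database.

import sqlite3

# An in-memory database stands in for a real server here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (first_name TEXT)")
conn.executemany("INSERT INTO students VALUES (?)",
                 [("Ada",), ("Grace",), ("Ada",)])

# The whole map/shuffle/reduce pipeline collapses into one GROUP BY.
for name, count in conn.execute(
        "SELECT first_name, COUNT(*) FROM students GROUP BY first_name"):
    print(name, count)  # Ada 2, then Grace 1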

Even high volume data doesn't begin and end with Hadoop. As Dan Woods writes for CITO Research: "The Obama campaign did have Hadoop running in the background, doing the noble work of aggregating huge amounts of data, but the biggest win came from good old SQL on a Vertica data warehouse and from providing access to data to dozens of analytics staffers who could follow their own curiosity and distill and analyze data as they needed."

Hadoop is an important tool in the kit when the amount of data is large. But there are lots of other options for that kit bag too. And never forget that it's not just about the bigness of the data, but whether you share it with those who need it and whether you do anything with it.
