This post gives an overview of the Strata Conference 2013 in London and, hopefully, a short introduction to Big Data in general. In short: you need data to succeed, and you need to make your data work for you.
What is Big Data?
You can tell Big Data is BIG right now when it fills a conference like Strataconf with interesting keynotes and sessions (and speakers and participants). But what’s with all the buzz? Is Big Data like teenage sex?
In a way, Doug Cutting’s opening keynote at Strataconf gives part of the answer when he states that “Hadoop dominates Big Data”. Doug may be a bit biased, being the creator of Hadoop, but everybody seems to be using it. For a couple of years now you have not needed a supercomputer to analyze huge amounts of data. Apache Hadoop plus a bunch of servers gives you the poor man’s version of HPC and storage. An added advantage is that this setup scales well: to attack even larger sets of data, just add hardware nodes (or extend your cloud setup). Hadoop is an open source framework for storage and processing of large-scale data sets. Apache Hadoop actually consists of a number of modules, all open source. The setup is also available as packaged distributions, e.g. Cloudera and MapR, which ship their own versions of one or several Hadoop modules.
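To make the "processing" part a bit more concrete: Hadoop's processing model is MapReduce, where a job is split into a map step, a grouping step and a reduce step that can each run on many nodes. The snippet below is only a single-machine sketch of that idea in plain Python, not Hadoop code. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample lines are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (a framework does this between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read files from the cluster.
    print(word_count(["big data is big", "data needs hardware"]))
```

Because each line is mapped independently and each key is reduced independently, the work can be spread over more machines as the data grows, which is the scaling property described above.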
This leads us to one part of the Big Data buzz: the ability to handle large amounts of information. The information is also often unstructured, at least to start with, and may come from different sources: applications, log files, devices or even sensors. It tends to stay unstructured for longer periods of time, since we don’t know all the use cases in advance. Finally, the field of Big Data is characterized by an opportunistic or exploratory approach.
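One way to picture "keep it unstructured until you know the use case" is to store lines exactly as they arrive and only apply structure when a question comes up. The sketch below assumes a simple key=value log format. The sample lines, the PAIR pattern and the helper names (read_fields, select) are hypothetical and only there to illustrate the idea.

```python
import re

# Hypothetical raw lines from different sources, stored exactly as they arrived.
RAW_LINES = [
    "2013-01-01 10:01:07 app=web status=200 path=/start",
    "2013-01-01 10:01:09 app=web status=500 path=/article/42",
    "sensor-7 temperature=21.5",
]

# Assumed format: fields appear as key=value, separated by whitespace.
PAIR = re.compile(r"(\w+)=(\S+)")


def read_fields(line):
    """Extract whatever key=value pairs a raw line happens to contain."""
    return dict(PAIR.findall(line))


def select(lines, field):
    """Answer a question that was not known when the data was stored."""
    return [record[field] for record in map(read_fields, lines) if field in record]


if __name__ == "__main__":
    print(select(RAW_LINES, "status"))
    print(select(RAW_LINES, "temperature"))
```

Nothing is thrown away at storage time, so a new question (say, about temperature instead of status codes) only needs a new read, not a new import.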
The conference
One great thing about Strataconf was the blend of different problem areas and domains. The sessions and keynotes spanned areas like journalism, healthcare, the Internet of Things, social media and the more general open data concept. There was also a balance between practical use cases in these areas and more technical sessions around tools and frameworks.
There are a number of subjects and roles related to, or within, Big Data. All of them were mentioned during Strataconf.
Open Data
The movement of sharing an organization’s data with the rest of the world, most commonly via an open API. In this fashion, others outside the organization can explore new possibilities with the information, maybe in combination with other organizations’ data, and open up new opportunities. During Strataconf the NHS (UK National Health Service) outlined its ambitions, in which opening up its data is crucial at a time when budgets are shrinking. According to the NHS, this would be the key to improving overall customer satisfaction.
Data Scientist
The craftsmanship of data analysis and more. “Sexiest job of the 21st century”, according to Harvard Business Review in 2012. A paper handed out by O’Reilly identifies four different types of Data Scientists:
Data Businesspeople: Focus on the organization and how data projects generate value and profit.
Data Developers: Deal with the technical problem of managing data: how to get and store data and how to learn from it. This is the group focusing on developing with the Hadoop framework.
Data Researchers: Statistics experts
Data Creatives: The mixture of everything above, from data extraction to visualization.
Read the full report here
Data Journalism
Find the story in the data (where the data could be big). There are a lot of similarities with Big Data in general because of the exploratory approach, though the work is often done with simpler tools, e.g. spreadsheets. The simple tools approach was covered by Claire Miller from WalesOnline.
(Off topic: One of the more amusing keynotes actually dealt with spreadsheets in general. Have a look at the presentation)
Lean Analytics
If you’re a fan of Lean Startup, this is for you. The idea behind Lean Analytics is to use data to figure out what you should be working on in your startup or product development. As with Lean Startup, build -> measure -> learn is an important concept; Lean Analytics focuses on the measure part and gives guidelines on what to measure. This is described in the book written by Alistair Croll and Benjamin Yoskovitz. At the conference, Alistair gave some good advice on choosing the right metrics to follow. The presentation is available on this page.
Summary
There is a lot going on in the Big Data field everywhere, and of course also at Aftonbladet and Schibsted. The Strata conference gave us inspiring examples of how we can take advantage of concepts in the field of Big Data and how we can improve our use of tools and techniques. We will use these findings in product development, operational decision support and more.