This month’s T-SQL Tuesday falls on Valentine’s day so Steve Jones decided to talk about the Love affair that seems to be going on right now with the term “Big Data” and the DB industry’s rush to court Big Data. You can read more on his announcement post. Anyway here goes my T-SQL Tuesday contribution.
What Do They Mean by “Big Data” Anyway?
I don’t know (see I’m not afraid to say it 😉 ) – But seriously, I don’t know. It gets thrown about a lot lately. I am not sure if there is an official definition out there that all can agree on. I found the below quote as the first search result in my recent search for “what is big data?” It comes from Edd Dumbill at O’Reilly and I kind of like the definition (emphasis mine):
Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.
So for the rest of this post, that’s what I mean when I say big data. Well let me back up – the bolded sentence anyway. I think of Big Data as the data that comes in so fast, the data that humans don’t directly make – or don’t necessarily think about making. Financial transactions – even at Amazon – isn’t really what comes into my head as Big Data (I’m sorry but I think I’m going to have to say that phrase a few more times in this post, feels weird typing it over and over). I think of it as data that doesn’t necessarily need to have a schema defined first. Data that doesn’t necessarily need to conform to a certain order right away. Today you may want to do x,y and z with the data and tomorrow you may want to do a,b and c – so you can’t format away data elements, you can’t structure it in a way that prohibits a mid-stream change in approach to it. It isn’t just data that we need to be a bit looser around the structure of and hoarders of – it is also “big”.. So it probably doesn’t come from you typing something into a POS or client application. Think scientific data; sensors; traffic patterns; tweets from everyone – even when someone famous dies or an earthquake happens; web visits and all the associated data – lots of data. mmmmm data.
What Am I Doing With It?
I’m currently helping a well known travel industry website. They use Hadoop and Hive – a query engine that sits on top of Hadoop – turns queries that look a lot like SQL into map reduce jobs.. So folks who know SQL can more quickly ask Hadoop questions without having to be Java programmers writing their own map/reduce jobs. But they also use SQL Server Reporting Service and SQL Server Analysis Services. The model we are moving towards with them seems to be a great model for companies who want their unstructured data and want to query it fast too. The main data they care about are primarily log files. Two kinds – 1.) the log files from the web servers – IP of visitor, length of visit, referral, etc. and 2.) Internal xml logs that all of their servlets write to as a user does stuff on the site. This amounts to over 100 million new rows of raw log data per day from the millions of visitors per day which turns into billions of rows when joined to various data sources. We then query this data in a few ways based on what the goal is:
Brand New Question
Let’s say an analyst wants to ask some question of the raw data that we’ve never thought of asking and never really developed a cube or report to ask that question with. Well instead of sitting in a series of meetings, ramping up a BI project, etc. – the analyst just asks that question. They would write a Hive query in the Ad-Hoc Hadoop cluster that joins, aggregates, etc. the billions of rows they are looking at (typically the data is partitioned by date created in Hadoop with this client and the analyst may ask a sample question of a day of data or perhaps a month at a time). This will take some time depending on their query and the amount of data. That’s fine, the cluster can handle it. The jobs spin up on all of the nodes that the pieces of the data live on and the cluster spins through the map and reduce jobs and eventually spits out an answer. If this was really just a one time question, or one asked infrequently that’s the end of it. They ask it this way and they are fine with the delay because they understand how much data they are crushing through to get the answer.
Same Question More Often
So when the analysts realize there are questions that are good indicators of business and critical decisions can be made to optimize search engine traffic, optimize click throughs, make a more positive user experience, etc. they want to ask these questions more often. This sounds like a more traditional data warehouse experiment. There are key indicators of how things are going, there are reports that indicate how partners are doing, etc. These shouldn’t have to be run by submitting jobs to the Hadoop jive engine or through Hive queries. So now we can approach this in a few ways:
Create base aggregations in Hadoop and export to SQL – Rather than export the full set of raw data into SQL Server and have to scale up to be able to join and aggregate the data – we aggregate the data down into the set the business unit cares about (for instance the B2B teams don’t care about all of the vacation rental clicks, the consumer teams don’t care about business properties and neither team cares about all of the search engine crawler traffic). We aggregate the data down as much as possible – letting Hadoop and all of it’s nodes do the leg work for us overnight and we export the resulting aggregates to SQL. These aggregates can then be queried directly or with Reporting Services for some basic reports. If that isn’t enough we can
Create SQL Server Analysis Services Cubes – So if users need to slice and dice more, if they need to get access to further pre-aggregation and the benefits of analytical processing, the teams can create cubes. Either load them directly from the initial base aggregations in Hive (and other reference data stored in Hadoop and on some other processing systems – lookups for location names and other reference information like that) or they can create a dimensional warehouse in SQL which is populated from the aggs in Hadoop and then populates the cube.
So what we’ve done and are doing is identifying the best tool for the job. In our case we’ve determined that it makes more sense to buy commodity compute clusters (though it’s hard to call them commodity, they are some nice servers that this client chooses to buy) with local storage for the big data, and then go with smaller SQL and SSAS environments that only contain the data necessary to answer those questions repeatedly asked. This is constantly changing – as new insights are gleaned from the raw data, new cubes and new reports in SSRS are created but we never have to worry about throwing away data and we never have to worry about having billions of useless rows stored in the relational databases slowing things down (or pushing us to more expensive SANs and servers) just because we were too afraid to throw the data away or put it “offline” or on cheaper storage someplace. – All of the data is available and the business can be more agile to market conditions and changes.
So yeah, I guess I kind of would say I love the concept of Big Data – but I still love my relational databases and I don’t think either minds my love for the other. When it comes to data, I guess I’m polyamorous. I’m glad that Microsoft also made the decision to start to love Big Data – I wrote this Big Data blog post for the SQL PASS Summit where Microsoft talked about some of their strategy around Big data.