This week at the SQL PASS Summit, Ted Kummert, Corporate Vice President of the Business Platform Division at Microsoft (think SQL Server), made an announcement in Wednesday’s keynote presentation. It is an awesome announcement for companies who have “big data” and want newer and better ways to work with it. By big data, think large semi-structured or even unstructured data sets: data that is a bit too bulky, or requires too much formatting, to analyze effectively in what we usually think of as relational databases. Clicks, tweets, texts, credit card transactions, health care data streams, etc.
You’ve heard of Hadoop? No? Well, Jeremiah Peschka does a great job with a quick explanation in this post.
Well, Microsoft wants you to work with data in Hadoop. They announced that this is a large part of their data strategy, and there are two really neat ways they are going to implement it:
- Interoperability –> We’ve already seen one example of this with the recent RTM release of the SQL Server to Hadoop connector that works with Sqoop. There will be much more coming really soon, though: additional connectors, yes, but also an ODBC driver to connect directly to Hive. That right there is great news. Hive sits on top of Hadoop and gives you a query language, the Hive Query Language (HQL), that is largely modeled on SQL and looks incredibly close to it 80% of the time; Hive translates those queries into the Java MapReduce jobs that get the data out of Hadoop. When this driver is released, you can connect to Hive directly from the Microsoft toolset. This means that you can let Hadoop do the work it is great at (large transformations, parallel and scaled-out aggregations, computations across compute clusters, etc.) and easily use the tools that Microsoft is really proving themselves in* (data analysis, data visualization, self-service reporting). I am consulting with a live web commerce company right now, and they have a lot of data in Hadoop, mainly hundreds of millions of new clickstream and web access logs. They chose Hadoop for many reasons, but they have not been able to find a suite of analytics tools as good as, or with as good a TCO as, the Microsoft analytics tools (chiefly they are looking at SSAS, SSRS, and PowerPivot, but that may expand into new visualization tools like Project Crescent, now called Power View). The cost of creating their aggregations and working with the large raw data sets from their XML page view information made Hadoop a great choice. They can easily add nodes to their cluster as their business grows, and MapReduce makes shorter work of the job by splitting the load up across their compute nodes. Getting that data into the analytics tools has been a bit of a challenge.
They’ve had to create massive data warehouses to load with the data from Hive, effectively keeping duplicate copies of the same data (the results of Hive aggregation queries), and then build cubes and reports on top. Very soon they can cut out that step and begin doing much of their analysis directly against Hive.
- Become a Player Themselves –> This one had me scratching my head when I first heard it: “Wait, Microsoft wants to deploy Hadoop on Windows and Azure?! The Facebooks and dot-coms of the world won’t ever buy it. They love the open source community, and they hate paying for licensing.” But then it hit me… They are not aiming for the flashy and hip web companies who have already embraced Hadoop… They are actually offering something in the market that has a really compelling story and call to action. There are many companies in the data explosion situation. Credit card companies want to find fraud and stop it before the sale even happens. Phone companies want to trend their billing data. Political campaigns want to know the exact moment that all the tweets and Facebook comments turned negative, and they want to correlate it to data from lots of other sources. Insurance companies want to spend less on healthcare and want to find trends in the reams of claim data they process. A lot of these companies are not turning to the open source community for help. The data is more sensitive than web visits, saved tweets, or Facebook status updates. The usage is different, and there is a need for those “enterprise-y” words. Words like Management, Instrumentation, Security, Coordination, Modeling. These are the things that Microsoft already delivers for products like SQL Server. They have tools like SCOM/SCCM that do some of that management and instrumentation. They integrate their products with trusted Active Directory authentication. So… Microsoft will be releasing a distribution of Hadoop (based on the same community-driven architecture and standards that all the current flavors are released on) that sits on top of Windows Server and Windows Azure.
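To make the “let Hadoop do the aggregations” idea from the first point a little more concrete, here is a minimal, single-process sketch in Python of the map, shuffle, and reduce steps behind a page view count. It is only an illustration of the pattern, not how Hadoop or Hive is actually implemented: the record layout, the `page` field, and every function name are made up for this example, and the two small lists stand in for data that would really live on separate nodes. In Hive, the same question would look roughly like `SELECT page, COUNT(*) FROM page_views GROUP BY page`, where `page_views` is a hypothetical table.

```python
from collections import defaultdict
from itertools import chain


def map_page_views(record):
    """Map step: emit a (page, 1) pair for one page view record."""
    yield record["page"], 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def run_job(partitions):
    """Run map, shuffle, and reduce over every partition of the input."""
    mapped = chain.from_iterable(
        map_page_views(record)
        for partition in partitions
        for record in partition
    )
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical clickstream data, split the way it might be across two nodes.
partitions = [
    [{"page": "/home"}, {"page": "/cart"}],
    [{"page": "/home"}, {"page": "/checkout"}, {"page": "/home"}],
]

page_views = run_job(partitions)
print(page_views)
```

On a real cluster the map calls run in parallel on the nodes that hold each slice of the data, which is why adding nodes shortens the job.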
I’m not normally a kool-aid drinker, and maybe it is because I have been immersed in a world of Hive queries and Hadoop lately, but I actually think this is a huge shift for Microsoft. I think we are going to see endless possibilities and some great adoption stories really soon.
Why I’m Happy
Some would say that Microsoft normally sees a positive trend and tries to copy it. They try to make it their own, and sometimes they get it right, sometimes they don’t. It is a copy, though. They take some good and interesting ideas and “microsoft-ize” them. I’ve been working with (and loving) SQL Server for 12 years, so don’t get me wrong when I say this, but sometimes they miss the mark. With this? They aren’t copying, or borrowing, or trying to redo… They are embracing. They are looking at why people use a tool like Hadoop. They are asking good questions about it and saying, let’s embrace the open source community, their standards, and all their work, and let’s make a platform and integration for it. They are saying, “Hadoop, you do what you are great at, don’t go changing. Here, let us help you reach other customers, and we’ll extend this great tool set with what we really know and are good at: enterprise support, manageability, instrumentation, reliability.” That is cool. That is big.
We live in a world of data. It is getting tougher and tougher for each of us to spend a day without leaving behind hundreds of rows of data in various systems and touch points. Just today I’ve had a quiet day and haven’t done much, but I’ve left hundreds of data sets in my wake by all of my actions. This data tells a story. Sure, that story will be aggregated, analyzed, stored, and charted to help turn more profit for companies, but that data will also be used to change our world. I know Big Data is a buzzword that folks throw around a lot these days, like cloud, but we live in a big data world and, well, to paraphrase Karen Lopez, love your data. Pick up a copy of Hadoop: The Definitive Guide (get the second edition – linked here) to understand more about the concepts around Hadoop and all of its components. It is an exciting read, and this is an exciting time to be a data geek.
* – This doesn’t mean I don’t think SQL Server proves itself. It really has and does. It is a great enterprise platform for your data. It just has a different use case than a toolset like Hadoop does. They will complement each other well. Keep your fast-paced OLTP databases in SQL Server, and put some traditional dimensional databases there too for the reporting you do against them. Look to these new offerings and interactions between Hadoop and SQL Server to be another piece in the puzzle.
I’ll keep my eye out for other posts about this announcement and may write up some follow-up posts as I learn more about these changes. I’ll post links here as I find them.
Hortonworks –> This is a company that Microsoft has been working with on some of the Hadoop changes. Ted talked about that in today’s keynote as well.