Wednesday, June 12, 2013

Unstructured Data can create chaos


It seems that no matter where you go these days, whether it's Twitter, your favorite tech blog or an email newsletter, everyone is talking about big data. Let's face it, big data is a buzzword, or buzz term, that many technology professionals are being forced to address. Most database guys like myself have been dealing with large amounts of data for years. What was once kilobytes of data turned into megabytes, then gigabytes, then terabytes and now even beyond that. We've dealt with this sizable data in a lot of different ways, including table partitioning, regular archival and purging, and the creation of data warehouses that are kept separate from our regular transactional databases. We've had the time to analyze what is coming into our databases so we can transform it into something useful. This latest wave of "big data" is taking some of these approaches away from us for a couple of different reasons: velocity and volume.

At some point the data becomes too big, and arrives too quickly, for our systems to handle. Now, fast forward to the wonderful world of unstructured data. This world states that we really don't care what the incoming data looks like; we'll just store it. Then after a while we'll be able to do something useful with it. But just how realistic is this approach? As a database professional, I like to ensure data quality. By introducing unstructured data into my world you've thrown a lot of my ability to ensure data quality out the window. I can store it for you. I might even be able to query a lot of it and produce useful insight from it, but over time the data just becomes more and more difficult to manage.

For example, once I've traversed the last 2 years of web logs and created a dashboard of how often our customers go to each of our web pages, do I keep the detail information just in case I come up with a new way to traverse it and create new business knowledge? If I do keep it, do I tie my new business knowledge back to the rows of unstructured data for purposes of drill down? In some shops this may be impossible. My only real option may be to archive it, because while I'm analyzing the bulk unstructured data that is stored, let's not forget that all my current customers are quickly producing mounds of new data that I'll have to do something with sooner or later.
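To make the web log example a bit more concrete, here is a minimal Python sketch of the "traverse the logs and count page visits" step. It assumes the logs are in Common Log Format, which may not match your web server, and the sample lines and the page_hits function are made up for illustration. Notice that it also counts the lines it could not parse, because with loosely structured data you want to know how much you are silently dropping.

```python
import re
from collections import Counter

# Assumed layout: Common Log Format. Real web logs may need a different pattern.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def page_hits(lines):
    """Count successful (2xx) requests per path.

    Returns (hits, skipped), where skipped is the number of lines
    that did not match the assumed log layout.
    """
    hits = Counter()
    skipped = 0
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            skipped += 1
            continue
        if match.group("status").startswith("2"):
            hits[match.group("path")] += 1
    return hits, skipped

# Hypothetical sample data, including one line that does not parse.
SAMPLE = [
    '10.0.0.1 - - [12/Jun/2013:10:00:00 -0400] "GET /products HTTP/1.1" 200 512',
    '10.0.0.2 - - [12/Jun/2013:10:00:05 -0400] "GET /products HTTP/1.1" 200 512',
    '10.0.0.1 - - [12/Jun/2013:10:01:00 -0400] "GET /support HTTP/1.1" 200 128',
    '10.0.0.3 - - [12/Jun/2013:10:02:00 -0400] "GET /missing HTTP/1.1" 404 0',
    'garbage line',
]

hits, skipped = page_hits(SAMPLE)
for path, count in hits.most_common():
    print(path, count)
print("unparsed lines:", skipped)
```

The summary is tiny compared to the raw lines, which is exactly why the question of whether to keep the detail behind it is a hard one.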

To be fair, vendors are giving us ways to deal with this data. Newer, open source NoSQL database technologies such as CouchDB are document-based solutions. The Hadoop Distributed File System (HDFS) provides file-based storage that is, in theory, easy to get to and designed to store bulk data. Developers are putting SQL-like interfaces such as Hive on top of HDFS so that those of us with SQL skills can access the data in these new systems. But wait, if it is in fact truly unstructured, how do I know what I need? If data is coming in from multiple sources and just being dumped into an open file system, how do I make sense of it?

Well, this is where the database guys come back into the picture. This is also why, in my opinion, the relational database management system is not going anywhere soon. Extraction, transformation and loading (ETL) routines will still need to be written against these large unstructured data sources in order to turn the data into usable, business-ready forms. The data will need to be tied to valid business entities such as users, clients or customers, or to real assets such as servers and data centers. Without knowing what a piece of unstructured data is directly tied to, it will be difficult, if not impossible, to derive any real value from it.
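Here is a rough Python sketch of what that tie-back step can look like. Everything in it is hypothetical: the CUSTOMERS dictionary stands in for a customer table in your relational database, and the raw events are assumed to arrive as one JSON document per line. The point is only that records which cannot be tied to a known business entity get set aside instead of loaded.

```python
import json

# Hypothetical customer master data, standing in for a relational table.
CUSTOMERS = {"C001": "Acme Corp", "C002": "Globex"}

def transform(raw_lines, customers):
    """Split raw JSON lines into loadable rows and rejects.

    A row is loadable only if it parses as a JSON object and its
    customer_id matches a known customer.
    """
    loaded, rejected = [], []
    for line in raw_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        cust_id = event.get("customer_id") if isinstance(event, dict) else None
        if isinstance(cust_id, str) and cust_id in customers:
            loaded.append({
                "customer_id": cust_id,
                "customer_name": customers[cust_id],
                "action": event.get("action", "unknown"),
            })
        else:
            rejected.append(line)
    return loaded, rejected

# Hypothetical raw input: one good record and three that cannot be tied back.
RAW = [
    '{"customer_id": "C001", "action": "login"}',
    '{"customer_id": "C999", "action": "login"}',
    'not json at all',
    '{"action": "view"}',
]

loaded, rejected = transform(RAW, CUSTOMERS)
print("loaded:", loaded)
print("rejected:", len(rejected))
```

In a real shop the rejects would go to a holding area for review, since a high reject rate usually means an upstream source changed its format.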

Let's also not forget that the big RDBMS guys like Oracle and Microsoft are adding big data features into their tool sets. Either that or they are buying companies that already have tools and creating hooks back into their flagship products. Microsoft introduced PolyBase with the 2012 iteration of SQL Server (in the Parallel Data Warehouse appliance), and it promises to tie unstructured big data to its relational counterparts. Oracle has its own NoSQL database and a fully configured Big Data Appliance that is ready to capture your organization's data. These are all relevant and good approaches to the problem, but without data order there is data chaos.

You and your team must take a systematic approach to what data you are capturing and why. Then you must consider its value to the business. Each piece of data has a relationship to a business unit or units within your organization. Once the data is categorized, governance rules must be created. You can't simply say we'll keep all data for all time. That is not realistic and will eventually create chaos. Data retention rules apply here. They may be self-imposed or, in the case of financial institutions, government-imposed, but clear rules should be defined.
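A retention rule does not have to be complicated to be useful. The Python sketch below shows one way to write the rules down so they can be applied consistently. The categories and the number of days are invented for illustration, so substitute whatever your business or your regulators actually require.

```python
from datetime import date

# Hypothetical retention periods, in days, per data category.
RETENTION_DAYS = {
    "web_logs": 730,
    "financial": 2555,
    "debug_traces": 30,
}

def retention_action(category, created, today):
    """Return 'keep', 'purge', or 'review' for one piece of data.

    Data in a category with no defined rule is flagged for review
    so that someone has to make a governance decision about it.
    """
    limit = RETENTION_DAYS.get(category)
    if limit is None:
        return "review"
    age_days = (today - created).days
    return "purge" if age_days > limit else "keep"

today = date(2013, 6, 12)
print(retention_action("web_logs", date(2011, 1, 1), today))
print(retention_action("debug_traces", date(2013, 6, 1), today))
print(retention_action("sensor_dump", date(2013, 1, 1), today))
```

The "review" result is the important part. Data that nobody has categorized is where the chaos starts.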

Unstructured data cannot always be handled in real time either. Clear rules should be defined around which data needs to be handled in which order. If the data is directly relevant to revenue, then obviously it is more important and should be handled with the fastest available applications and hardware. This is where defined disk tiers can come in handy. If data does not need to be instantly accessible, it can reside on older, slower, commodity disks. But if data is needed in real time or near real time, then perhaps solid state devices are needed.
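The ordering and tiering rules above can be sketched in a few lines of Python. The data set names and the five second cutoff for "near real time" are assumptions for the sake of the example, not recommendations.

```python
from dataclasses import dataclass

@dataclass
class DataSet:
    name: str
    revenue_relevant: bool
    max_latency_seconds: float  # how quickly the business needs access

# Assumed cutoff for "real time or near real time".
NEAR_REAL_TIME_SECONDS = 5.0

def storage_tier(ds):
    """Pick solid state storage only when fast access is required."""
    if ds.max_latency_seconds <= NEAR_REAL_TIME_SECONDS:
        return "ssd"
    return "commodity"

def processing_order(datasets):
    """Revenue-relevant data first, then by how quickly it is needed."""
    return sorted(
        datasets,
        key=lambda ds: (not ds.revenue_relevant, ds.max_latency_seconds),
    )

# Hypothetical data sets.
DATASETS = [
    DataSet("archived_web_logs", False, 86400.0),
    DataSet("order_events", True, 1.0),
    DataSet("clickstream", False, 2.0),
]

for ds in processing_order(DATASETS):
    print(ds.name, storage_tier(ds))
```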

If you haven't already been approached to deal with unstructured data, you soon will be. I hope that this has helped you think about some of the pitfalls of dealing with it. Do yourself a favor and start familiarizing yourself with some of the free tools that are out there. Cloudera, as an example, has a free virtual machine you can download that has a full install of Hadoop with Pig and Hive on it for practice sessions. With a reasonably powered laptop you can get this up and running in a half hour or so. If you don't, at least keep up with industry blogs and whitepapers to stay in touch with what is out there.

Here are links to some of the stuff I talked about in this blog:

CouchDB
Hadoop
SQL Server 2012 and Polybase
Oracle Big Data Appliance
Oracle NoSQL Database
Cloudera QuickStart VM

Bill Schoonmaker
Data Architect, EMC Corp