In my last post, titled “Unstructured Data is the Killer App of Big Data”, I wrote about the difficulties I’ve had describing big data to some colleagues and friends. When I run into this problem, I use the following sentence to help kick-start conversations about big data:
“Data warehouses work for structured data but what do you do with unstructured data?”
When people think about ‘data’ they normally think about a database where data resides in a well-defined structure. That’s not the case for unstructured data. By definition, it has no predefined structure, data model or even organization. It lives in many locations (e.g., email, files and websites) and is normally dispersed widely across an organization.
One of the biggest companies today owes its very existence to unstructured data. Google got its start by building systems and approaches for finding information in stacks of unstructured data. The ability to point a search engine at a large number of content containers and find exactly the words or phrases you are looking for brings a great deal of value to users. Even more value can be found in this content if it is approached with an analytic mindset and the right tools.
Structured data lends itself well to analysis and analytical tools. It is fairly easy to point a tool at a database and say ‘show me X’ or to visualize the results. Unstructured data is completely different because there is no data model to help those analytical tools ‘understand’ the data and return information to the user. Because of this, unstructured data requires new approaches, new tools and new skills.
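To make that contrast a little more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `orders` table, the sample emails and the keyword lists are assumptions, and matching on keywords is only the simplest possible stand-in for real content analysis. The point is just that the structured half can answer ‘show me X’ directly, while the unstructured half has to derive categories before anything can be counted.

```python
import sqlite3
from collections import Counter

# Structured data: the schema tells the tool what each field means,
# so 'show me totals by region' is a single query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 40.0)],
)
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)

# Unstructured data: free text has no fields, so categories have to be
# derived from the content first. These keyword lists are hypothetical.
CATEGORIES = {
    "billing": {"invoice", "refund", "charge"},
    "shipping": {"delivery", "package", "late"},
}


def categorize(text):
    """Return every category whose keywords appear in the text."""
    words = set(text.lower().split())
    matches = [name for name, keywords in CATEGORIES.items() if words & keywords]
    return matches or ["uncategorized"]


emails = [
    "Please send a refund for the duplicate charge",
    "My package is late again",
    "Thanks for the great service",
]

# Only after categorizing can the text be counted like structured data.
counts = Counter(label for email in emails for label in categorize(email))

print(totals)
print(counts)
```

Once the text has been given labels, it can be counted, charted and queried like any other structured data. Getting good labels is the hard part.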
No longer is it enough to be able to run queries against a database; data needs to be analyzed first for content and then categorized in some way to allow for analysis and visualization. It takes a different mindset and approach to find insight in unstructured data. In my next post on the subject, I’ll give some examples and insights into how organizations approach analyzing and categorizing their vast stores of unstructured data.