Data Catalyst: Integrating Unstructured Data in a Data Warehouse

I've been meaning to write about this topic for a while. Here's a succinct excerpt of my thoughts.

Traditional data warehouses are generally relational and fed by back-end systems that contain structured data. More often than not, this data is generated by internal source systems or arrives as external data from partners. But what about the web? There's tons of data out there, possibly about your company, a competitor, or a business trend in your industry, and the list goes on. As we share more data on the web, this list expands every day. The traditional warehouse is not designed to handle such unstructured data, which limits the reach of your decision support system.

Search engines are good at handling both structured and unstructured information in various formats (e.g. database tables, XML, PDF, DOC, etc.). Case in point: we helped a client index almost 75GB of unstructured data stored in .pdf, .doc, and text files going back as far as 70 years. The oldest files were scanned PDFs that were later OCRed. This is massively helpful not just from a pure enterprise search standpoint but also in terms of opening up the data to other parts of the organization in an easily accessible fashion. So how does this tie to a warehouse? Well, the search index is itself a warehouse.

So how do you access it? With your existing BI systems. If you tweak your BI tool, you can make REST-based HTTP calls to the search engine's web server. In goes your query, and within a second out comes your data. Search engines are inherently fast! You can use this data to discover relationships you never thought existed. Obviously this works better with certain types of data than others. I foresee a large interest in this area as enterprises explore more publicly available data sources that can be crawled.
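
To make the "in goes your query, out comes your data" idea concrete, here is a minimal Python sketch of what such a call might look like. Everything specific in it is an assumption for illustration: the endpoint URL, the `q`/`rows`/`format` parameter names, and the `{"hits": [...]}` response shape are hypothetical and will differ depending on which search engine you run.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, query, rows=10):
    """Build a GET URL for a search endpoint (parameter names are assumed)."""
    params = urlencode({"q": query, "rows": rows, "format": "json"})
    return f"{base_url}?{params}"


def parse_hits(payload):
    """Flatten a JSON response into rows a BI tool could load as a table.

    Assumed response shape: {"hits": [{"id": ..., "title": ..., "score": ...}]}
    """
    data = json.loads(payload)
    return [
        {"id": h.get("id"), "title": h.get("title"), "score": h.get("score")}
        for h in data.get("hits", [])
    ]


def search(base_url, query, rows=10, timeout=5):
    """Send the query over HTTP and return the flattened rows."""
    url = build_search_url(base_url, query, rows)
    with urlopen(url, timeout=timeout) as resp:
        return parse_hits(resp.read().decode("utf-8"))
```

A BI tool, or a small script feeding one, could then call something like `search("http://search.example.com/select", "competitor pricing")` and treat the returned list of rows like any other result set. Keeping the URL building and response parsing separate from the HTTP call makes it easier to adapt the sketch to whatever your search server actually expects.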

Wednesday, June 8, 2011
