Tuesday, August 22, 2006

Unstructured Data and XML

As I mentioned when I started this blog, I am sort of a data geek. I like data: the storage of it, the retrieval of it, and watching people use it and manipulate it for their needs. What I don't like about it is the strict rules we have to abide by to store and retrieve it. In the company I currently work for, we have data spread all over the place. We have portals, wikis, databases, and just random file storage. In order to find what you need, you have to be in the right place and pretty much know exactly what you are looking for. I think this is a common theme among most companies.

So how do we solve the problem? First, let's use what has become the standard name for our situation, and that is unstructured data. It is just what it sounds like: data that doesn't follow any sort of identifiable structure. Experience tells me that most of the useful information comes in this form. I say that for a couple of reasons. First, unstructured data is more prevalent than structured data. This includes everything from formal documents and memorandums to two co-workers emailing each other. Second, it is simply easier to produce. Structured data requires, are you ready for this, structure. You have to follow rules and you can only put in what it is expecting. Unfortunately nobody thinks that way, and because of that our ideas, thoughts, questions, conversations and other musings are often condensed, truncated or just not documented. This, for me, is the real problem we face when talking about business intelligence (BI) or any of the related topics. You can get what you want, so long as it fits in this nice little box (or cube, for you data warehouse types) and so long as it was captured or input. Now again, this works really well for reports that tell you what you did. It doesn't work so well for telling you how you did it or why you did it or anything about the process.

Now suppose we have the ability to search the emails, notes, documents and any other form of media related to the process. How much better would we understand why we did what we did, how we got the numbers in the report, why we got them, what we liked about them and so on? Not to mention the ability to learn from ourselves and each other by looking back into the past and retrieving data that at the time seemed trivial or inconsequential. So my question becomes: is XML the answer? Is it the way to "structure" unstructured data in a meaningful way that allows it to be indexed, searched and retrieved?

XML has value in this arena; there is no doubt about that. Most apps already offer a "save as XML" option, and those that don't will in all likelihood add one in the very near future. The DBMS designers see the power in it. You can store XML in the db, you can have your output automatically formatted in XML and you can even query the XML with XQuery. The problem that companies have faced in going this route is how to XML-ify (for those of you who like to turn nouns into verbs) old documents. How do we turn information stored in older formats into an XML format other than by manually going through them one by one and converting them? Even if done programmatically, it is still a huge job. Maybe one worth thinking about, though. The thing that turns me on about the XML idea is that it allows for the sharing of information across all types of traditional boundaries. Which of course is what it is intended to do, but the idea had more to do with messaging and documents than with data or data retrieval. Imagine you allowing me access to your data by dumping what you want to give me in XML format, which I can then load into my db as XML and query, compare or analyze as if the data were my own. Talk about systems integration.
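To make the XML-ify idea a little more concrete, here is a minimal sketch in Python. It assumes the old documents are simple memos with "Header: value" lines followed by a blank line and a body, which is my own invention for the sake of illustration; real legacy formats will be messier. The memo contents, the element names (archive, memo, body) and the function names are all hypothetical. Also note that Python's standard library does not do XQuery, so the query here uses the limited XPath subset that ElementTree supports instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical legacy memos: "Header: value" lines, a blank line, then the body.
LEGACY_MEMOS = [
    "From: Dana\nProject: Apollo\n\nBudget approved for phase two.",
    "From: Lee\nProject: Borealis\n\nVendor demo moved to Friday.",
    "From: Dana\nProject: Borealis\n\nNeed the Q3 numbers before the demo.",
]


def memo_to_xml(text):
    """Turn one plain-text memo into a <memo> element (assumed layout, see above)."""
    head, _, body = text.strip().partition("\n\n")
    memo = ET.Element("memo")
    for line in head.splitlines():
        key, _, value = line.partition(":")
        # Each header becomes an attribute, e.g. project="Apollo".
        memo.set(key.strip().lower(), value.strip())
    ET.SubElement(memo, "body").text = body.strip()
    return memo


def build_archive(memos):
    """Collect the converted memos under a single <archive> root."""
    root = ET.Element("archive")
    for text in memos:
        root.append(memo_to_xml(text))
    return root


def bodies_for_project(root, project):
    """Query the archive with ElementTree's XPath subset (not XQuery)."""
    return [m.findtext("body") for m in root.findall(f"./memo[@project='{project}']")]


archive = build_archive(LEGACY_MEMOS)
print(ET.tostring(archive, encoding="unicode"))
print(bodies_for_project(archive, "Borealis"))
```

The string printed by ET.tostring is the kind of dump you could hand to someone else to load and query on their side. The hard part in real life is the memo_to_xml step, since every old format needs its own version of it.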

Maybe, though, the idea of XML is still too structured. Maybe we need an algorithm that allows us to index and search anything. Maybe the ideas about the future of data are focusing on the wrong parts. What if we could just search and get what we need? I could search for, say, a project name from the past and get all of the relevant documents, email threads, video captures, recorded discussions, consultants' reports and process information from wherever they currently live. No one has to do anything except capture the info in some form. This is where the future of data lives. This is where the big boys are headed. This is where the Googles and Microsofts are playing. The idea that your entire infrastructure is one giant database of sorts that can be indexed and retrieved is upon us. For the most part that technology exists. The trick now is to allow you to combine it, format it, edit it and use it in any way you need to.
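The "index and search anything" idea can be sketched with a toy inverted index, which maps each word to the set of items that contain it. This is only an illustration of the general technique, not a claim about how Google or Microsoft actually do it. The item ids, the sample text and the function names are all made up, and it assumes the text has already been pulled out of each email, wiki page or file.

```python
import re
from collections import defaultdict

# Hypothetical captured items from different places; ids and text are invented.
DOCUMENTS = {
    "email/1042": "Apollo kickoff moved to Monday, see attached budget",
    "wiki/ApolloPlan": "Project Apollo plan: milestones, owners and budget notes",
    "share/borealis_report.txt": "Consultant report for Borealis vendor selection",
}


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build an inverted index: term -> set of ids of items containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of items that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


index = build_index(DOCUMENTS)
print(search(index, "apollo budget"))
print(search(index, "borealis"))
```

Nothing here cares whether an item started life as an email, a wiki page or a file on a share, which is the whole point. The only requirement is that someone captured the text in some form.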

This is the role we can play. What if we start thinking that way? What if we can get out of the box, and even the cube in some instances, and start thinking about data in more of a free form and about how we could manipulate the data once we have it? Do we use Google, or some open source solution, for data indexing? Do we think what Google and Microsoft are doing is the right thing, or do we need to think of an even better way? These are the thoughts of the future and these are the things we need to think about. The future of data, how we view it, how it is stored and how it is manipulated, is right around the corner, and we can either lead or follow closely. Either way is fine. What we can't do is stand still and think we have the questions answered. The one constant that I learn each and every day is that I don't know anything. Everything I thought I knew is either wrong or in a state of change that will make it wrong at some point in the future.
