To structure or not to structure: that is the question: whether it is nobler in the mind to suffer the slings and arrows of metadata, ontology, and sixth canonical normal form, or to take up arms against 30 years of data-structure dogma and piety and, by opposing, convert to Web 2.0 search technology (and potentially ruin my career, the remaining shards anyway). Lately, I have felt as torn as Hamlet: stay with my data heritage, or end it all with radical Web 2.0 abandon.
I came out of the late 1970s as a DBA (database administrator), religiously putting flat files and hierarchical DBMSs (database management systems, IMS specifically) to the sword while evangelizing the purity of the CODASYL model and teleprocessing systems. Naturally, I was put to the sword in turn by Codd, Date, and relational DBMSs. Later, we fought back with object-oriented databases, but being older and wiser, détente reigned. The only good data was analyzed and structured data, fourth normal form (sixth is extreme) at a minimum, all carefully placed in some DBMS so it could be transacted, searched, and reported. Ultimately, this drive to structured data has led to Business Intelligence (BI, an oxymoron, like military intelligence), Corporate Performance Management (CPM), and Executive Dashboards (picture the Elmo dashboard toys you strap to the baby's crib: spin it, ring it, beep it, ha ha ha).
Like Galileo, unfortunately, I tasted some forbidden fruit, and it has haunted me for years. I was first tempted by the BLOB construct, which allowed unstructured data to be put in a database container without the DBMS caring what it was; properly tagged, the unsearchable could be found, but it was still labor-intensive. My second taste came from being one of the sorry set of individuals to develop on Apple's Newton platform (great haiku, bad handwriting recognition). The development platform and runtime were a rich object soup, giving incredible flexibility as to what constituted data and instruction. Now, I am severely tempted by HTML and Search in the guise of Web 2.0 (tie me to the stake and light me up, I confess).
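The BLOB idea can be sketched in a few lines with SQLite (a modern stand-in for illustration only, not the DBMS of that era): the database stores opaque bytes it knows nothing about, and only the tag we attach alongside makes the payload findable again.

```python
import sqlite3

# The DBMS stores the payload as opaque bytes; it neither parses nor indexes them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (tag TEXT, payload BLOB)")

# Arbitrary unstructured bytes -- could be an image, a Newton note, anything.
raw = "free-form haiku, no schema required".encode("utf-8")
conn.execute("INSERT INTO docs VALUES (?, ?)", ("haiku", sqlite3.Binary(raw)))

# The only way back in is through the tag; the blob itself is unsearchable.
row = conn.execute("SELECT payload FROM docs WHERE tag = ?", ("haiku",)).fetchone()
print(row[0].decode("utf-8"))
```

Note that all the retrieval power lives in the hand-maintained `tag` column, which is exactly the labor-intensive part the paragraph complains about.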
Building out Enterprise Data Models for the average corporation or, even more difficult, Biomedical Data Stores for life sciences is an extremely labor-intensive, frustrating, and often futile endeavor. The difficulty (cost, time) is directly correlated to the need for precise metadata and ontology. Deriving, documenting, and retrofitting are massive efforts (and definitely not for the ADHD among us, who me?). All of the investment is up front, before the first benefit can be realized (real scary, career-wise). However, this is the "right," dogmatic, safe way to handle data.
This is why our data stores are embarrassing data dumps (landfills, complete with dozers and seagulls). It is the difficulty and cost of proper classification and maintenance of data in a structured environment that feeds this end. Think of it as data entropy, devolving to the most basic, disorganized state. If this basic unstructured state is where data is going, why not just leave it in its "natural" state? Use human cognitive effort and Web 2.0 tools to promote the best and most useful data to the top of the heap, and let the stuff of dubious integrity drop and disappear into the gravel at the bottom of the big data (fish) tank. Rather than demanding all that investment up front before the first benefit, the process would be one of steady refinement over time.
The raw data permeating the Web is greater than any structured data store and seems infinite in type and variety. Like the ocean, people dip out what they require and what interests them, with ever-increasing success. The rate of evolution of the supporting technology is astronomical. If we could put half the effort into molecule discovery that we put into Britney Spears' antics, the world would be a much better place.