Data Profiling: The BI Grail

In Healthcare analytics, as in analytics for virtually every other business, the Operations, Finance, Clinical, and other organizations within the enterprise face a landscape populated by a rich variety of systems, all of them prospective sources for decision support analysis. I propose we add to that discussion two ideas whose value is hard to argue with: first, data profiling, and second, a proactive data quality effort as part of any such undertaking.

Whether a project is built from the ground up, or the scope of an already successful initial effort is set to expand significantly, every data integration, warehousing, and business intelligence effort benefits from the proper application of these disciplines, and from acting on their findings early, often, and as aggressively as possible.

I sometimes like to say that in data-centric applications, the framework and mechanisms that comprise a solution are even more abstract in some respects than in traditional OLTP applications, because up to the point at which a dashboard or report is consumed by a user, the application virtually IS the data, sans the bells, whistles, and widgets that are the more “material” aspects of GUI/OLTP development efforts:

  • Data entry applications, forms, websites, and other sources generally exist outside the reach of the project being undertaken.
  • Many assertions and assumptions are usually made about the quality of that data.
  • Many, if not most, of those turn out not to be true, or at least not entirely accurate, despite the very earnest efforts of all involved.

What this means in terms of risk to the project cannot be overstated. Because that risk is largely unknown in most instances, it can be neither qualified nor quantified. It often turns what seems, on the face of it, to be a relatively simple “build machine X” project with gear A, chain B, and axle C into “build machine X” with gear A (missing teeth), chain B (not missing any links, but definitely rusty and in need of polishing), and axle C (which turns out not to exist at all, though it is much discussed, maligned, or even praised depending upon who is in the room and how big the company is).

Enter The Grail. If there is a Grail in data integration and business intelligence, it may well be data profiling and quality management, on its own or as a precursor to true Master Data Management (if that has not already become a forbidden term in your organization due to past failed attempts at it).

Data profiling is a pre-emptive strike against our preconceived notions about the quality and content of our data. It gives us not only quantifiable metrics by which to measure, and adjust, our judgement of the task before us; it also frequently sends business units scrambling to improve data they honestly did not realize was so flawed.
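As a concrete illustration of the kind of quantifiable metrics a profiling pass produces, here is a minimal sketch using pandas to compute per-column fill rates, cardinality, and value ranges, plus a couple of simple domain checks. The table, column names, and rules (an eight-digit MRN, no future birth dates) are hypothetical assumptions for illustration, not anything from a specific source system.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Basic per-column profile: fill rate, cardinality, and value range."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "fill_rate": round(s.notna().mean(), 3),   # share of non-null values
            "distinct": s.nunique(dropna=True),
            "min": s.min() if s.notna().any() else None,
            "max": s.max() if s.notna().any() else None,
        })
    return pd.DataFrame(rows)

# Hypothetical extract from a source system; file and column names are illustrative.
patients = pd.read_csv("patient_extract.csv", parse_dates=["birth_date"])
print(profile(patients))

# Simple domain checks that turn vague suspicions into counts.
bad_mrn = (~patients["mrn"].astype(str).str.fullmatch(r"\d{8}")).sum()
future_birth = (patients["birth_date"] > pd.Timestamp.today()).sum()
print(f"MRNs not matching the assumed 8-digit pattern: {bad_mrn}")
print(f"Birth dates in the future: {future_birth}")
```

Even a quick pass like this produces the numbers that turn “the data is probably fine” into an informed conversation.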

Data quality efforts, following comprehensive profiling and whatever proactive correction is possible, give a project the chance to fix problems without changing the source systems per se, and to do so before the business intelligence solution becomes either a burned-out husk on the side of the EPM highway (failed because of poor data) or, at best, a de facto data profiling tool in its own right, coughing out whatever data does not work instead of serving its intended purpose: delivering key business performance information on a solid data foundation in which all have confidence.
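To make “fixing problems without changing source systems” concrete, here is a minimal sketch of a cleansing step applied in the integration layer, assuming a hypothetical patient extract: the raw source data is left untouched, and standardization and quarantining happen on the way into the warehouse. Every mapping and column name here is an illustrative assumption.

```python
import pandas as pd

# Hypothetical code standardization applied in the ETL layer, not in the source system.
GENDER_MAP = {"M": "Male", "MALE": "Male", "F": "Female", "FEMALE": "Female"}

def cleanse(extract: pd.DataFrame) -> pd.DataFrame:
    df = extract.copy()  # never mutate the raw extract
    df["first_name"] = df["first_name"].str.strip().str.title()
    df["gender"] = df["gender"].str.strip().str.upper().map(GENDER_MAP).fillna("Unknown")
    # Quarantine rows that break hard rules instead of loading them silently.
    missing_mrn = df["mrn"].isna() | (df["mrn"].str.strip() == "")
    df.loc[missing_mrn].to_csv("quarantine_missing_mrn.csv", index=False)
    return df.loc[~missing_mrn]

raw = pd.read_csv("patient_extract.csv", dtype=str)   # source extract, read-only
clean = cleanse(raw)
```

The quarantine file serves the same purpose as the profiling report: problems are counted and made visible to the business instead of being silently loaded or silently dropped.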

The return on investment for such an effort is measurable, sustainable, and so compelling an argument that no serious BI undertaking, large or small, should go forward without it. Whether in Healthcare, Financial Services, Manufacturing, or another vertical, its value is, I submit, inarguable.

If Master Data Management is on your agenda, start with Data Quality

Many organizations are currently working on Master Data Management (MDM) strategies as a core IT initiative. One of the fastest paths to failure for these large, multiyear initiatives is to ignore the quality of the data.

Master Data Management is defined as the centralization, or single view, of X (Customer, Product, or other reference data) in an enterprise. Wikipedia says: “master data management (MDM) comprises a set of processes and tools that consistently defines and manages non-transactional data entities of an organization (also called reference data).” MDM is typically a large, multiyear initiative with significant investment in tools, plus two to five times that investment in labor or services to integrate the subscribing and consuming systems. For many companies, you are talking millions of dollars over the course of the implementation. According to Forrester, cross-enterprise implementations range anywhere from $500K to $2 million on average, and professional services costs are usually two dollars for every dollar of software license cost. When you consider integrating all of your systems for bi-directional synchronization of customer or product information, the services investment over time can run up to five times the license cost.

At its simplest level, MDM is a centralized data pump, the heart of your customer or product data (the two most popular implementations). But once you hook this pump up, if you have not taken care of the quality of the data first, what have you done? You have just spent millions of dollars in tools and effort to pump bad data across the entire organization.

Unless you profile the systems to be integrated, the quality of the data is impossible to quantify. The analysts who work with the data in a particular system have an idea of which areas are suspect (e.g., “we don’t put much weight on the forecast of X because we know the data is sourced from our legacy distribution system, which has data ‘problems’ or ‘inconsistencies’”). The problem is that the issues are known only at the subconscious level and are never quantified, which means a business case to fix them never materializes or gets funded. In many cases, the business is not aware there is a problem until they try to mine a data source for business intelligence.
According to a study by the Standish Group, 83% of data integration/migration projects fail or overrun substantially due to a lack of understanding of the data and its quality. Anyone ever work on a data integration project or data mart or data warehouse that ran long? I have, and I’m sure most of the people reading this have too.
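One way to move those suspicions from the subconscious level to a funded business case is to run simple rule checks against the suspect source and count the violations. The sketch below assumes hypothetical order and product extracts from a legacy distribution system; every table, column, and rule is illustrative only.

```python
import pandas as pd

# Hypothetical extracts from the legacy distribution system (names are illustrative).
orders = pd.read_csv("legacy_orders.csv", parse_dates=["ship_date"])
products = pd.read_csv("legacy_products.csv")

issues = {
    # Orphaned foreign keys: orders that point at products which no longer exist.
    "orders with unknown product_id": (~orders["product_id"].isin(products["product_id"])).sum(),
    # Impossible values that quietly skew any forecast built on this data.
    "orders with negative quantity": (orders["quantity"] < 0).sum(),
    "orders missing a ship_date": orders["ship_date"].isna().sum(),
}

total = len(orders)
for name, count in issues.items():
    print(f"{name}: {count} ({count / total:.1%} of {total} rows)")
```

Numbers like these are what turn “we don’t put much weight on that forecast” into a line item someone can actually fund.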

The good news is that data profiling and analysis is a small step you can undertake now to prepare and position yourself for the larger MDM effort. With the right tools, you can assess the quality of your most important data sources in as little as three weeks, depending upon the number of tables and attributes. Further, it is an inexpensive way to ensure you are laying the foundation for your MDM or Business Intelligence initiatives. Uncovering your data quality problems in user acceptance testing is far more expensive, and many times it is fatal.

The success of your MDM initiative depends on the quality of the data. You can profile and quantify your data quality issues now, proactively heading off problems down the road and building a business case to improve your existing data assets (marts, warehouses, and transactional systems). The byproduct of this analysis is better business intelligence derived from those systems, helping the business make better decisions with accurate information.