What I Learned at Health Connect Partners Surgery Conference 2012: Most Hospitals Still Can’t Tell What Surgeries Turn a Profit

As I strolled around the Hyatt Regency at the Arch in downtown St. Louis among many of my colleagues in surgery and hospital administration, I realized I was experiencing déjà vu. Not the kind where you know you’ve been somewhere before. The kind where you know you’ve said the same thing before. Except it wasn’t déjà vu. I really was having many of the same conversations I had a year ago at the same conference, except this time there was a bit more urgency in the voices of the attendees. It’s discouraging to hear that most large hospitals STILL can’t tell you which surgeries make or lose money! Which surgeons have high utilization linked to high quality? What impact SSIs (surgical site infections) have on ALOS (average length of stay)? Why there are eight orthopedic surgeons, nine different implant vendors, and ten different total hip implant options on the shelves? It’s encouraging, though, to hear people FINALLY admit that their current information systems DO NOT provide the integrated data they need to analyze these problems and address them with consistency, confidence, and in real time.

Let’s start with the discouraging part. When asked if their current reporting and analytic needs were being met, I got a lot of the same uninformed, disconnected responses: “Yeah, we have a decision support department”; “Yeah, we have Epic, so we’re using Clarity”; “Oh, we just use <insert limited, niche data reporting tool here>”. I don’t get too upset, because I understand that in the world of surgery there are very few organizations that have truly integrated data. Therefore, they don’t know what they don’t know. They’ve never seen materials, reimbursement, billing, staffing, quality, and operational data all in one place. They’ve never been given consistent answers to their data questions. Let’s be honest, though – the priorities are utilization, turnover, and volume. Very little time is left to consider the opportunities to drastically lower costs, improve quality, and increase growth by integrating data. It’s just not in their vernacular. I’m confident, though, that these same people are now, more than ever, being tasked with finding ways to lower costs and improve quality – not just because of healthcare reform, but because of tightening budgets, stringent payers, stressed staff, and more demanding patients. Sooner or later they’ll start asking for the data needed to make these decisions – and when they don’t get the answers they want, the light will quickly flip on.

Now for the encouraging part – some people have already started asking for the data. These folks can finally admit they don’t have the information systems needed to bring operational, financial, clinical, and quality data together. They have siloed systems – they know it, I know it, and they’re starting to learn that there isn’t some panacea of an off-the-shelf product they can buy that will give this to them. They know that they spend way too much time and money on people who simply run around collecting data and doing very little in the way of analyzing or acting on it.

So – what now?! For most of the attendees, it’s back to the same ol’ manual reporting, paper chasing, data crunching, spreadsheet hell. Stale data, static reports, yawn, boring, seen this movie a thousand times. Others are just starting to crack the door open on the possibility of getting help with their disconnected data. And a very few are out ahead of everyone else because they are already building integrated data solutions that provide significant ROIs. For these folks, gone are the days of asking for static, snapshot-in-time reports – they have a self-service approach to data consumption in real time and are “data driven” in all facets of their organization. These are the providers that have everyone from the CEO down screaming, “SHOW ME THE DATA!”, and they are the ones I want to partner with in the journey to lower cost, higher quality healthcare. I just hope the others find a way to catch up, and soon!

Epic Clarity Is Not a Data Warehouse

It’s not even the reporting tool for which your clinicians have been asking!

I have attended between four and eight patient safety and quality healthcare conferences a year for the past five years. Personally, I enjoy the opportunities to learn from what others are doing in the space. My expertise lies at the intersection of quality and technology; therefore, it’s what I’m eager to discuss at these events. I am most interested in understanding how health systems are addressing the burgeoning financial burden of reporting more (both internal and external compliance and regulatory mandates) with less (from tightening budgets and, quite honestly, allocating resources to the wrong places for the wrong reasons).

Let me be frank: there is job security in being a health care analyst, “report writer,” or decision support staffer. They continue to plug away at reports, churn out dated spreadsheets, and present static, stale data without context or much value to the decision makers they serve. In my opinion, patient safety and quality departments are the worst culprits of this waste and inefficiency.

When I walk around these conferences and ask people, “How are you reporting your quality measures across the litany of applications, vendors, and care settings at your institution?”, you want to know the most frequent answer I get? “Oh, we have Epic (Clarity).” “Oh, we have McKesson (HBI).” Or, “Oh, we have a decision support staff that does that.” I literally have to hold back a combination of emotions – amusement (because all I can do is laugh) and frustration (because I’m so frustrated). I’ll poke holes in just one example: if you have Epic and use Clarity to report, here is what you have to look forward to, straight from the mouth of a former Epic technical consultant:

“It is impossible to use Epic “out of the box” because the tables in Clarity must be joined together to present meaningful data. That may mean (probably will mean) a significant runtime burden because of the processing required. Unless you defer this burden to an overnight process (ETL), the end users will experience significant wait times as their report proceeds to execute these joins. Further, they will wait every time the report runs. Bear in mind that this applies to all of the reports that Epic provides. All of them are based directly on Clarity. Clarity is not a data warehouse. It is merely a relational version of the Chronicles data structures, and as such, is tied closely to the Chronicles architecture rather than a reporting structure. Report customers require de-normalized data marts for simplicity, and you need star schemas behind them for performance and code re-use.”

You can’t pretend something is what it isn’t.

Translation that healthcare people will understand: Clarity only reports data in Epic. Clarity is not the best solution for providing users with fast query and report responses. There are better solutions (data marts) that provide faster reporting and allow for integration across systems. Patient safety and quality people know that you need to get data out of more than just your EMR to report quality measures. So why do so many of you think an EMR reporting tool is your answer?
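To make the Clarity-versus-data-mart distinction concrete, here is a minimal sketch of the “join once overnight, report all day” pattern a de-normalized data mart enables. It uses Python and pandas with made-up table and column names – nothing here comes from the actual Clarity or Chronicles schema:

```python
import pandas as pd

# Source tables as they might come out of a transactional reporting database
# (names are hypothetical, not Clarity's).
encounters = pd.DataFrame({
    "encounter_id": [1, 2],
    "patient_id": [10, 11],
    "department_id": [100, 101],
})
patients = pd.DataFrame({"patient_id": [10, 11], "patient_name": ["Doe, Jane", "Roe, Rob"]})
departments = pd.DataFrame({"department_id": [100, 101], "department_name": ["Ortho", "Cardio"]})

def nightly_etl() -> pd.DataFrame:
    """Do the expensive joins once, off-hours, and persist a flat, report-ready table."""
    fact = (encounters
            .merge(patients, on="patient_id")
            .merge(departments, on="department_id"))
    fact.to_csv("encounter_fact.csv", index=False)  # the de-normalized "data mart"
    return fact

# Report queries read the flat table directly -- no joins at report runtime.
fact = nightly_etl()
print(fact[fact["department_name"] == "Ortho"])
```

The point isn’t the tooling; it’s that the join cost is paid once in ETL instead of every time a clinician runs a report.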

There is a growing sense of urgency at the highest levels in large health systems to start holding quality departments accountable for the operational dollars they continue to waste on non-value-added data crunching, report creation, and spreadsheets. Don’t believe me? Ask yourself, “Does my quality team spend more time collecting data and creating reports/spreadsheets, or interacting with the organization to improve quality and, consequently, the data?”

Be honest with yourself. At best, the ratio is 70% of an FTE spent on collection and 30% on analysis and action. So – get your people out of the basement, out from behind their computer screens, and put them to work. And by work, I mean acting on data and improving quality, not just reporting it.

Are You Paralyzed by a Hoard of Big Data?

Lured by the promise of big data benefits, many organizations are leveraging cheap storage to hoard vast amounts of structured and unstructured data. Without a clear framework for big data governance and use, businesses run the risk of becoming paralyzed under an unorganized jumble of data, much of which has become stale and past its expiration date. Stale data is toxic to your business – it could lead you into taking the wrong action based on data that is no longer relevant.

You know there’s valuable stuff in there, but the thought of wading through all THAT to find it stops you dead in your tracks. There goes your goal of business process improvement, which, according to a recent Informatica survey, most businesses cite as their number one big data initiative goal.

Just as the individual hoarder often requires a professional organizer to help them pare the hoard and institute acquisition and retention rules for preventing hoard-induced paralysis in the future, organizations should seek outside help when they find themselves unable to turn their data hoard into actionable information.

An effective big data strategy needs to include the following components:

  1. An appropriate toolset for analyzing big data and making it actionable by the right people. Avoid building an ivory tower big data bureaucracy, and remember, insight has to turn into action.
  2. A clear and flexible framework, such as social master data management, for integrating big data with enterprise applications, one that can quickly leverage new sources of information about your customers and your market.
  3. Information lifecycle management rules and practices, so that insight and action will be taken based on relevant, as opposed to stale, information.
  4. Consideration of how the enterprise application portfolio might need to be refined to maximize the availability and relevance of big data. In today’s world, that will involve grappling with the flow of information between cloud and internally hosted applications as well.
  5. A comprehensive data security framework that defines who is entitled to use, change, and delete the data, as well as encryption requirements and any required upgrades in network security.

Get the picture? Your big data strategy isn’t just a data strategy. It has to be a comprehensive technology-process-people strategy.

All of these elements should, of course, be considered when building your big data business case and estimating return on investment.

Why EMRs Are Not Panaceas for Healthcare’s Data Problems

So, you’ve decided to go with Epic or Centricity or Cerner for your organization’s EMR.

Think your EMR is Hamlin’s Wizard Oil?

Good, the first tough decision is out of the way. If you’re a medium-to-large healthcare organization, you likely allocated a few million to a few hundred million dollars to your implementation over five to ten years. I will acknowledge that this is a significant investment, probably one of the largest in your organization’s history (aside from a new expansion, though these implementations can easily surpass the cost of building a new hospital). But I will argue: “Does that really mean the other initiatives you’ve been working on should suddenly be put on hold, take a back seat, or even cease to exist?” Absolutely not. The significant majority of healthcare organizations (save a few top performers) are already years, almost a decade, behind the rest of the world in adopting technology to improve the way healthcare is delivered. How do I know this? Well, you tell me: “What other industry continues to publicly have 100,000 mistakes a year?” Okay, glad we now agree. So, are you really going to argue with me that being single-threaded, with a narrow focus on a new system implementation, is the only thing your organization can be committed to? If your answer is yes, I have some Cher cassette tapes, a transistor radio, a mullet, and some knee highs that should suit you well in your outdated mentality.

An EMR implementation is a game-changer. Every single one of your clinical workflows will be adjusted, electronic documentation will become the standard, and clinicians will be held accountable like never before for their interaction with the new system. Yes, it depends on what modules you buy – Surgery, IP, OP, scheduling, billing, and the list goes on. But for those of us in the data integration world, trying every day to convince healthcare leaders that turning data into information should be top of mind, this boils down to one basic principle – you have added yet another source of data to your already complex, disparate application landscape. Is it a larger data source than most? Yes. But does this mean you treat it any differently when considering its impact on the larger need for real time, accurate integrated enterprise data analysis? No. Very much no. Does it also mean that your people are suddenly ready to embrace this new technology and leverage all of its benefits? Probably not. Why? Because an EMR, contrary to popular belief, is not a panacea for the personal accountability and data problems in healthcare:

  • If you want to analyze any of the data from your EMR, you still need to pull it into an enterprise data model with a solid master data foundation and a structure that accommodates a lot more data than will come from that system alone (how about materials management, imaging, research, quality, risk?).
    • And please don’t tell me your EMR is also your data warehouse, because then you’re in much worse shape than I thought…
  • You’re not all of a sudden reporting in real time. It will still take you way too long to produce those quality reports, service line dashboards, or <insert report name here>. Yes, there is a real-time feed available from the EMR back-end database, but that doesn’t change the fact that there are still manual processes required for transforming some of this information, so a sound data quality and data governance strategy is critical BEFORE deploying such a huge new system.

The list goes on. If you want to hear more, I’m armed to the teeth with examples of why an EMR implementation should be just that: a focused implementation. Yes, it will require more resources, time, and commitment, but don’t lose sight of the fact that there were plenty of things you needed to do with your data before the EMR came along, and the same will be the case once your frenzied EMR-centric mentality is gone.

Keeping the Black Swan at Bay

A recent article in the Harvard Business Review highlighted some alarming statistics on project failures. IT projects were overrunning their budgets by an average of 27%, but the real shocker was that one in six of these projects was over by 200% on average. They dubbed these epic failures the “black swans” of the project portfolio.

The article ends with some excellent advice on avoiding the black swan phenomenon, but the recommendations focus on two areas:

  • Assessments of the ability of the business to take a big hit
  • Sound project management practices such as breaking big projects down into smaller chunks, developing contingency plans, and embracing reference class forecasting.

We would like to add to this list a set of “big project readiness” tasks that offer additional prevention of your next big IT project becoming a black swan.

Project Management Readiness: If you don’t have seasoned PMs with successful big project experience on your team, you need to fill that staffing gap either permanently or with contract help for the big project. Yes, you need an internal PM even if the software vendor has their own PM.

Data Readiness:  Address your data quality issues now, and establish data ownership and data governance before you undertake the big project.

Process/Organization/Change Management Readiness: Are your current business processes well documented? Is the process scope of the big project defined correctly? Are process owners clearly identified? Do you have the skills and framework for defining how the software may change your business processes, organization structure and headcounts? If not, you run a significant risk of failing to achieve anticipated ROI for this project. Do you have a robust corporate communication framework? Do you have the resources, skills and experience to develop and run training programs in house?

Let’s face it: experience matters. If you’re already struggling to recover from a technology black swan, you are at considerable risk for reproducing the same level of failure if you don’t undertake a radical overhaul of your approach by identifying and addressing every significant weakness in the areas noted above.

We have developed a project readiness assessment model that can help you understand your risks and develop an action plan for addressing them before you undertake anything as mission critical as an ERP replacement, CRM implementation, legacy modernization, or other major technology project. If you have a big project on your radar (or already underway), contact makewaves@edgewater.com to schedule a pre-implementation readiness assessment.

Thoughts on 2011 AHA Health Forum Leadership Summit: Coach K’s Five Challenges

The opening keynote address by Tom Brokaw was a motivational, inspiring start to the AHA Leadership Summit. Coach Mike Krzyzewski (Coach K) reluctantly spoke after Mr. Brokaw, his long-time friend and, admittedly, “a tough act to follow.” Coach K spoke about what good leadership is and how it relates to those of us in healthcare. One of the most impactful lessons his mother ever taught him was told as a simple metaphor: “On the bus you drive through life, be sure to only let good people on… and if you’re trying to get on another bus, make sure there are only good people on that bus too.” It’s pretty straightforward – recruiting and scouting is everything. Just kidding… as he was, but it means a lot. The way you lead reflects the type of company you keep and the way people feel about your company and leadership.

In addition, Coach K emphasized the importance of a cohesive, collaborative healthcare environment. He leveraged a story Tom Brokaw told. Tom mentioned how, during the Nixon Watergate scandal, the political environment was so divided that before a Republican and a Democrat came on his news show one day, each called ahead to make sure the other would not be in the Green Room at the same time. How were these political leaders supposed to achieve anything if they literally couldn’t even stand in the same room as one another?! Coach K spoke about his emphasis on team-building exercises because every year he had new players to incorporate into his offensive and defensive schemes. The challenges, though, were similar. Players would come from backgrounds in different systems with unique styles, and the coaching staff had to find the right ways to make the collective team mesh. More importantly, he had to help his team win. Most importantly, he had to turn boys into men and prepare them for challenges bigger than a basketball court.

The challenges he posed to the audience were these:

  1. Communicate – “When you communicate, do you look your patient in the eye? Do you address them by their name and remember their kid’s sport? Their husband’s name?”
  2. Trust – “Are the principles and practices of your office/hospital/clinic trustworthy? Are you honest and straightforward with your patients about your level of care? Compared to others? Is there full transparency to all the things you do?”
  3. Collective Responsibility – “When was the last time you/your people got hit? Something that knocked you back, knocked you down…and you really felt it? When something bad happens does everyone get together and help solve the problem? Or does a blame game start? You’re all in this together; you got into healthcare to help people. Make sure they know you’re a team.”
  4. Care for One Another – “Anger is a good emotion if it destroys something bad. Cancer is bad, diabetes is bad, and Alzheimer’s is bad. You should be angry at these diseases and, at the same time, empathetic with those struggling to survive with them. Always put yourself in the patients’ shoes before saying or doing anything” – healthcare must become more patient-centric, or as one comprehensive cancer center has tagged it, “personalized medicine”.
  5. Pride (in something bigger than yourself) – “You have to feel it (visualize it, hear it) in order to effectively address, resolve, and improve it.”

“You can want to win, but you must prepare to win.” Preparation starts with an understanding that healthcare has become a team sport – specialists and clinicians must leverage each other’s experiences and expertise to provide patients the best possible outcomes. And since this is my area of expertise, I can add: “it starts with sharing data!”

Paying Too Much for Custom Application Implementation

Face it. Even if you have a team of entry-level coders implementing custom application software, you’re probably still paying too much.

Here’s what I mean:

You already pay upfront for foolproof design and detailed requirements. If you leverage more technology to implement your application, rather than spending more on coders, your ROI can go up significantly.

In order for entry-level coders to implement software, they need extra-detailed designs. Such designs typically must be detailed enough that a coder can simply repeat patterns and fill in blanks from reasonably structured requirements. Coders make mistakes, have misunderstandings and other costly failures, and take months to complete the work (and that’s if nothing changes in the requirements during that time).

But, again… if you have requirements and designs that are already sufficiently structured and detailed, how much more effort is it to get a computer to repeat the patterns and fill in the blanks instead? Leveraging technology through code generation can help a lot.

Code generation becomes a much less expensive option in cases like that because:

  • There’s dramatically less human error and misunderstanding.
  • Generators can do the work of a team of offshored implementers in moments… and repeat the performance over and over again at the whim of business analysts.
  • Quality Assurance gets much easier… it’s just a matter of testing each pattern, rather than each detail. (And while you’re at it, you can generate unit tests as well.)

Code generation is not perfect: it requires very experienced developers to architect and implement an intelligent code generation solution. Naturally, such solutions tend to require experienced people to maintain (because in sufficiently dynamic systems, there will always be implementation pattern changes). There’s also the one-off stuff that just doesn’t make sense to generate… (but that all has to be done anyway.)

Actual savings will vary (and in some cases may not be realized until a later iteration of the application), but they typically depend on how large your metadata (data dictionary) is, how well it is structured, and how well your designs lend themselves to code generation. If you plan for code generation early on, you’ll probably get more out of the experience. Trying to retrofit generation can definitely be done (been there, done that, too), but it can be painful.

Projects I’ve worked on that used code generation happened to focus generation techniques mostly on database and data access layer components and/or UI.  Within those components, we were able to achieve 75-80% generated code in the target assemblies.  This meant that from a data dictionary, we were able to generate, for example, all of our database schema and most of our stored procedures, in one case.  In that case, for every item in our data dictionary, we estimated that we were generating about 250 lines of compilable, tested code.  In our data dictionary of about 170 items, that translated into over 400,000 lines of  code.
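To give a flavor of what data-dictionary-driven generation looks like, here is a heavily simplified sketch in Python (not the generator from those projects; entity and field names are invented). Real generators emit data access layers, UI scaffolding, and unit tests from the same metadata in exactly this repeat-the-pattern, fill-in-the-blanks way:

```python
# Hypothetical data dictionary: entity name -> list of (field name, SQL type).
data_dictionary = {
    "Patient": [("PatientId", "INT"), ("LastName", "VARCHAR(100)"), ("BirthDate", "DATE")],
    "Encounter": [("EncounterId", "INT"), ("PatientId", "INT"), ("AdmitDate", "DATETIME")],
}

def generate_table_ddl(entity, fields):
    # One column definition per data dictionary field.
    cols = ",\n    ".join(f"{name} {sqltype}" for name, sqltype in fields)
    return f"CREATE TABLE {entity} (\n    {cols}\n);"

def generate_get_by_id_proc(entity, fields):
    # Convention assumed here: the first field is the primary key.
    key = fields[0][0]
    return (f"CREATE PROCEDURE Get{entity}ById @{key} INT AS\n"
            f"    SELECT * FROM {entity} WHERE {key} = @{key};")

for entity, fields in data_dictionary.items():
    print(generate_table_ddl(entity, fields))
    print(generate_get_by_id_proc(entity, fields))
```

Change the data dictionary, re-run the generator, and every table and procedure is regenerated consistently – that is where the savings come from when requirements shift midstream.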

By contrast, projects where code generation was not used generally took longer to build, especially in cases where the data dictionaries changed during the development process. There’s no solid apples-to-apples comparison, but consider hand-writing about 300,000 lines of UI code while the requirements are changing. Trying to nail down every detail (and change) by hand was a painstaking process, and the changes forced us to adjust the QA cycle accordingly as well.

Code generation is not a new concept. There are TONS of tools out there, as demonstrated by this comparison of a number of them on Wikipedia. Interestingly, some of the best tools for code generation can be as simple as XSL transforms (which opens the tool set up even more). Code generation may also already be built into your favorite dev tools. For example, Microsoft’s Visual Studio has had a code generation utility known as T4 built into it for the past few versions now. That’s just scratching the surface.

So it’s true… code generation is not for every project, but any project that has a large data dictionary (one that might need to change midstream) is an immediate candidate in my mind. It’s especially great for user interfaces, database schemas and access layers, and even a lot of transform code, among other things.

It’s definitely a thought worth considering.

Workbenches, Lenses and Decisions, Oh My! A Data Quality Software Assessment

Introduction

In a recent survey conducted by The Information Difference, the top three domains requiring data quality initiatives were the product domain, the financial domain, and the name/address domain. This was surprising, since most data quality vendors offer name and address matching features; however, few offer product-specific features and even fewer offer a finance-oriented feature set. The survey included twenty-seven questions that ranged from a ranking of organizational data quality estimates to data quality implementation specifics, and its analysis spans the data quality landscape. One of the more telling questions in the survey was in reference to the vendor/tool selected by the organizations implementing data quality solutions.

After reading the response summary, it was clear that there was not a predominant choice. As the survey points out, this could be a consequence of the rather large number of data quality tools available on the market. With so many options, could it be that the data quality market has become so saturated that the differences between offerings have become obscured?

With this in mind, I have put together an assessment that analyzes how the features of two leading vendor offerings, Informatica and Oracle, address data quality issues in the enterprise. The specific products involved are Informatica’s Data Quality Workbench and Oracle’s DataLens®. While this assessment is limited in scope, it does correspond to two of the most popular data domains: product and name/address.

The Informatica Data Quality Workbench

The Informatica data quality product offering includes two products, Data Explorer and Data Quality Workbench; however, for the purposes of this assessment, only the Data Quality Workbench will be reviewed.  

The reason for this is that Data Explorer is primarily a profiling tool, which provides insight into what data requires attention, whereas Data Quality Workbench is the tool that performs many of the quality enhancements.

The Data Quality Workbench contains many features that enable the data quality analyst to enrich data; however, chief among these are the address validation and matching components.  

The address validation component utilizes a service provided by AddressDoctor®, a leader in global address validation. This service validates addresses fed into the component in multiple ways, such as street-level, geocoding, and delivery-point validation, via a reference database that currently covers 240 countries and territories. As a result, non-deliverable addresses are verified or corrected, increasing the success of operational initiatives such as sales, marketing, and customer service.

In addition to the address components, there are also match components designed to compare various types of strings, such as numeric, character, and variable-character-based strings.

The tool generates a score representing the degree to which two strings are similar. The higher the match score, the greater the likelihood that the two strings are a match. Potential matches are grouped, enabling manual or automated evaluation in the nomination of a master transaction.

Oracle DataLens®

Formerly from Silver Creek Systems, DataLens® is a data quality engine built specifically for product integration and master data management.  Using semantic technology, DataLens is able to identify and correct errant product descriptions regardless of how the information is presented. This distinguishes it from most data cleansing products.

Based on specific contexts, such as manufacturing or pharmaceutical, DataLens® can recognize the meaning of values regardless of word order, spelling deviations, or punctuation. DataLens® also enables on-the-fly classifications, such as Federal Supply Class, and conversion from any language to any other language.

Oracle’s long-term vision for DataLens® is seamless integration with Oracle’s Product Management Hub, which will allow organizations to centralize the management of product information from various sources. This collaborative relationship will allow organizations to evaluate and, if necessary, standardize product descriptions as part of an enterprise data management and migration effort.

The Assessment

Now that we’ve covered the basics of these products, what conclusions can we draw?  Considering the native technologies built into each of these products, it is reasonable to conclude that there is little overlap between the two.  While both these products are excellent data quality tools, they are meant to address two distinct data quality domains.

With its address validation technology, Data Quality Workbench is primed for customer data integration (CDI), while DataLens’ imminent integration with Oracle’s Product Management Hub makes it a compelling choice for product information management (PIM).

Customer Data Integration (CDI)

CDI benefits organizations both large and small by enabling a “single view of the customer” and typically relies on name and address coupling to identify potentially duplicate customer data. CDI is often associated with direct marketing campaigns, but it also provides benefits in billing and customer service operations.

Informatica’s Data Quality Workbench is an appropriate selection for an organization looking to achieve any of the following objectives:

  1. Eliminate direct marketing mailings to undeliverable addresses
  2. Eliminate multiple direct marketing mailings to the same customer
  3. Eliminate multiple direct marketing mailings to the same household
  4. Eliminate erroneous billing activities due to customer/client duplication
  5. Eliminate erroneous billing activities due to undeliverable addresses
  6. Increase customer satisfaction by eliminating confusion caused by duplicate customer data
  7. Decrease resolution time for customer service incidents by eliminating duplicate customer data

Product Information Management (PIM)

PIM initiatives benefit organizations with multiple product lines and distributed order fulfillment operations. They are frequently associated with supply chain operations, in an effort to reduce product data variability and streamline product order fulfillment. PIM projects are rooted in data governance and rely on external reference data and business process rigor to implement.

Oracle’s DataLens is an appropriate selection for an organization looking to achieve any of the following objectives:

  1. Eliminate erroneous order fulfillment activities caused by stale or variant product information
  2. Eliminate incorrect billing due to discrepancies in product data
  3. Eliminate underutilization of warehouse inventory due to confusion on availability of product
  4. Eliminate confusion and delays at customs due to discrepancies in product weights and descriptions
  5. Eliminate reconciliation exercises associated with the remediation of product data
  6. Increase cross-sell for customers via aligned data on product usage
  7. Decrease errors resulting from poor data entry accuracy

Just as no two data quality projects are the same, neither are data quality software products. So while Oracle’s DataLens and Informatica’s Data Quality Workbench are both classified under the data quality software umbrella, they are so different in design and implementation that they cannot be thought of as interchangeable. Each tool enables the execution of information quality in data domains so distinct that it is important to understand this context prior to the investment of purchasing such a tool.

This further supports the need to assess tool features against the business need during the project planning phase, in order to ensure full capitalization on the investment in the data quality initiative.

Informatica Data Quality Matching Algorithms: Eliminate duplicates and reduce costs

Why are matching algorithms important?

Are you looking for a way to cut costs from operations? Matching algorithms can help you do it! Consolidating duplicate data can deliver direct cost savings to an organization’s operations through the elimination of redundant, costly data.

On a recent engagement, I worked with a marketing group and discovered close to one million duplicate customer records. At an average cost of $0.45 per mailer, the estimated marketing operations cost reduction was $450,000. What’s really exciting is that this cost reduction was for the first year alone. When you take into consideration that each customer remained in marketing campaigns for a minimum of three years, the total cost reduction to marketing operations was in excess of one million dollars.
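For the arithmetic-minded, the back-of-the-envelope math behind those figures looks like this (using only the numbers quoted above):

```python
duplicates = 1_000_000        # duplicate customer records found
cost_per_mailer = 0.45        # dollars per mailer
campaign_years = 3            # minimum years a customer stays in campaigns

first_year_savings = duplicates * cost_per_mailer      # $450,000
total_savings = first_year_savings * campaign_years    # $1,350,000
print(f"Year one: ${first_year_savings:,.0f}; over {campaign_years} years: ${total_savings:,.0f}")
```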

The American Health Information Management Association (AHIMA) has published estimates that put the cost of reconciling duplicate data at between $10 and $20 per pair.

Studies in the healthcare industry indicate that the cost of duplicate records ranges from twenty dollars to layered costs of several hundred dollars per duplicate. When you consider the fact that a conservative estimate for data duplication in enterprise data is approximately 10%, the total cost of data duplication can be expressed in millions of dollars.

When examining the potential return on investment (ROI) of data de-duplication efforts, there is evidence that the value proposition can be significant and can indeed add value to data-driven organizations. Using the low end of the healthcare industry’s cost of duplication ($20 per duplicate) and the median of AHIMA’s estimate ($15 per duplicate), the potential return on investment is estimated at 33%. At a cost reduction of $500,000 per million records, a compelling case can be made for such an effort.

Furthermore, projects like data quality assessments can be performed at relatively low cost to determine whether such an ROI is available to an organization. A data quality assessment is a short diagnostic effort that gauges the quality of the data and yields insight into, among other aspects, potential duplication levels in the data.

Matching algorithms just got a whole lot more interesting, didn’t they?

Informatica Data Quality Workbench Matching Algorithms

Informatica offers several implementations of matching algorithms that can be used to identify possible duplicate records. Each implementation is based on determining the similarity between two strings, such as a name or an address. Some implementations are better suited to date strings, while others are ideal for numeric strings. In the coming weeks, we’ll go through an overview of each of these implementations and how to use them to your advantage!

Hamming Distance Algorithm

Let’s begin the series with Informatica’s implementation of the Hamming distance algorithm.

The Hamming distance algorithm is particularly useful when the position of the characters in the string is important. Examples of such strings are telephone numbers, dates, and postal codes. The Hamming distance measures the minimum number of substitutions required to change one string into the other – in other words, the number of errors that transformed one string into the other.

The Hamming distance is named after Richard Hamming, an American mathematician whose accomplishments include many advances in information science. Perhaps as a result of Hamming’s time at Bell Laboratories, the Hamming distance algorithm is most often associated with the analysis of telephone numbers. However, the advantages of the algorithm apply to various types of strings and are not limited to numeric strings.

Worth noting is one condition that must be adhered to when using this algorithm: the strings being analyzed need to be of the same length. Since the Hamming distance algorithm is based on the “cost” of transforming one string into the other, strings of unequal length will incur high penalties due to the substitutions involving null character values.
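For the curious, here is what the measure itself boils down to – a generic Python illustration, not Informatica’s implementation:

```python
def hamming_distance(a: str, b: str) -> int:
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    # Count positions where the characters differ.
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

def hamming_similarity(a: str, b: str) -> float:
    """Turn the distance into a 0.0-1.0 score, where 1.0 means an exact match."""
    return 1.0 - hamming_distance(a, b) / len(a)

print(hamming_distance("555-1234", "555-1235"))   # 1 -- a single substitution
print(hamming_similarity("02139", "02139"))       # 1.0 -- identical postal codes

# Unequal lengths violate the constraint described above:
try:
    hamming_similarity("5551234", "15555551234")
except ValueError as err:
    print(err)
```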

In Practice

Due to this constraint, it is important to cleanse and standardize your data prior to using the Hamming distance component in IDQ. For instance, failing to parse area codes from telephone numbers could cause penalties when matching what would otherwise be similar numbers.

Let’s take the data in Figure 1 as an example. As you can see, the data has not been cleansed: we have a record containing an area code and a plus-4 extension on the postal code, and another record containing neither an area code nor a plus-4 extension.


Figure 1 Telephone / Postal Code Sample Data

Before running a match plan on these records, we’ll need to group them into logical groups. These groups are usually defined by the business objectives that lead to the need to match the data. For instance, if we are trying to identify and consolidate duplicate phone numbers, then grouping the data by area code would be a logical grouping factor, because seven-digit telephone numbers are often repeated across area codes.

After reviewing the data, it is evident that we do not have an area code in both records. We do, however, have postal codes in both, and although it is not a one-to-one relationship, the first three digits of the postal code should produce similar logical groupings.
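Conceptually, the grouping step looks something like the following sketch (plain Python with made-up records, standing in for IDQ’s grouping):

```python
from collections import defaultdict

# Made-up records standing in for the sample data in Figure 1.
records = [
    {"id": 1, "phone": "555-1234", "postal": "02139"},
    {"id": 2, "phone": "555-1234", "postal": "02141"},
    {"id": 3, "phone": "555-9876", "postal": "90210"},
]

groups = defaultdict(list)
for rec in records:
    groups[rec["postal"][:3]].append(rec)   # group key = first three postal digits

# Matching then runs only within each group, not across the whole data set.
for key, members in groups.items():
    print(key, [m["id"] for m in members])
```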

Once the data has been grouped, we can build a simple IDQ plan to match the data using the Hamming Distance component (Figure 2).


Figure 2 IDQ Match Plan using the Hamming Distance Component *Figure 2 appears with the permission of Informatica Corporation.


Since we are focused on identifying duplicate phone numbers, I’ve chosen to weight the telephone element slightly higher than the postal code. This can be done via the Weight Based Analyzer, illustrated in Figure 2 above. As the figure shows, adjusting the weighting of a data element is as simple as entering a number into the text box provided.

The higher the number, the greater the weighting or emphasis placed on that value matching. Acceptable weightings range between 0.0 and 1000000.0, and the default value for each element in a match plan is 0.5. In my experience, as long as the values entered are proportionate to the importance of the data in identifying a true match, the actual values chosen are somewhat subjective.

Depending on the business problem, my experience indicates that keeping these values between 0.4 and 0.8 has produced valid positive identifications. When the data element is essential to indicating a match, the higher end of this weighting scale allows that element to drive the matching appropriately.  When the data element is useful in indicating a key difference between the records, the lower end of this scale is appropriate.  An example of a field that establishes a difference between two records is the name suffix.  This value can differentiate a father from his son when they both share a common first and last name.
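Conceptually, the weighting works along these lines. The sketch below uses a simple weighted average of per-field similarity scores; the exact formula IDQ applies internally isn’t documented here, so treat this as an illustration of the idea rather than the product’s math:

```python
def weighted_match_score(field_scores, weights):
    # Weighted average of the per-field similarity scores.
    total_weight = sum(weights.values())
    return sum(field_scores[f] * weights[f] for f in field_scores) / total_weight

field_scores = {"telephone": 1.0, "postal_code": 0.8}   # per-field similarity, 0.0-1.0
weights = {"telephone": 0.8, "postal_code": 0.5}        # telephone weighted higher

print(round(weighted_match_score(field_scores, weights), 3))  # 0.923
```

Because the telephone carries more weight, a strong telephone match pulls the overall score up even when the postal code is a weaker match.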

If you enter a number outside the accepted range (0.0 to 1000000.0), an error message will appear.

*Screen shots appear with the permission of Informatica Corporation.

Without cleansing this data, IDQ is unable to detect a match for either record. The match summary report in Figure 3 below shows the default HTML report output. As you can see, the plan made two comparisons but was unable to find a match in either of them.


Figure 3 match summary report *Figure 3 appears with the permission of Informatica Corporation.


Now let’s take a look at the same data once it has been cleansed by parsing out the area code and the plus-4 postal code extension. This type of parsing can be achieved by using the token parser with the dash (-) symbol as the parsing token.
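In plain Python, that dash-token parsing amounts to something like this (sample values are made up; this simply stands in for the IDQ token parser):

```python
def parse_phone(raw):
    parts = raw.split("-")
    # Convention assumed here: a three-part value carries an area code up front.
    if len(parts) == 3:
        return parts[0], "-".join(parts[1:])   # ("555", "555-1234")
    return None, raw                           # no area code present

def parse_postal(raw):
    base, _, plus4 = raw.partition("-")
    return base, plus4 or None                 # ("02139", "1234") or ("02139", None)

print(parse_phone("555-555-1234"), parse_phone("555-1234"))
print(parse_postal("02139-1234"), parse_postal("02139"))
```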


Figure 4 Parsed Telephone / Postal Code Sample Data

From manual review of the data, it is easy to see what a difference the data cleansing can make, but let’s take a look at its effect on matching in IDQ.

Using the same grouping and matching criteria, IDQ is now able to identify a match. The results summary shows us that the same number of comparisons has been made; however, a match has been identified, and the match score falls between 0.9 and 1.0.

Upon further examination of the clusters, or groups of potential matches, we can see that the Hamming distance score (highlighted in red) for both the telephone and postal code inputs was equal to 1.


Figure 6 Detailed Match Report *Figure 6 appears with the permission of Informatica Corporation.


This indicates that these values were an exact match. The report also includes an output for the record which best matches a given row (highlighted in blue). If there were more than two matches in the cluster, this field would help identify the records which most closely resemble each other.

Summary

In this post we’ve introduced the various algorithms available in Informatica’s Data Quality (IDQ) workbench and described the implementation and advantages of the Hamming Distance component in IDQ.  We’ve discussed how data cleansing and standardization can positively influence the results of matching processes.  We’ve even discussed certain business problems and how they can be solved with data matching.

But perhaps most importantly, we’ve illustrated why matching algorithms are an important tool in reducing costs and that a de-duplication effort is worth the investment.