Data Darwinism – Evolving your data environment

In my previous posts, the concept of Data Darwinism was introduced, as well as the types of capabilities that allow a company to set itself apart from its competition.   Data Darwinism is the practice of using an organization’s data to survive, adapt, compete and innovate in a constantly changing and increasingly competitive business environment.   If you take an honest and objective look at how and why you are using data, you might find out that you are on the wrong side of the equation.  So the question is “how do I move up the food chain?”

The goal of evolving your data environment is to change from using your data in a reactionary manner and just trying to survive, to proactively using your data as a foundational component to constantly innovate to create a competitive advantage.

The plan is simple on the surface, but not always so easy in execution.   It requires an objective assessment of where you are compared to where you need to be, a plan/blueprint/roadmap to get from here to there, and flexible, iterative execution.


As mentioned before, taking an objective look at where you are compared to where you need to be is the first critical step.  This is often an interesting conversation among different parts of the organization that have competing interests and objectives. Many organizations can’t get past this first step. People get caught up in politics and self-interest and lose sight of the goal: moving the organization forward into a position of competitive advantage. Other organizations don’t have the in-house expertise or discipline to conduct the assessment. However, until this can be done, you remain vulnerable to other organizations that have moved past this step.


Great, now you’ve done the assessment, you know what your situation is and what your strengths and weaknesses are.  Without a roadmap of how to get to your data utopia, you’re going nowhere.   The roadmap is really a blueprint of inter-related capabilities that need to be implemented incrementally over time to constantly move the organization forward.   Now, I’ve seen this step end very badly for organizations that make some fundamental mistakes.  They try to do too much at once.  They make the roadmap too rigid to adapt to changing business needs.   They take a form-over-substance approach.  All these can be fatal to an organization.   The key to the roadmap is three-fold:

  • Flexible – This is not a sprint.   Evolving your data environment takes time.   Your business priorities will change, the external environment in which you operate will change, etc.   The roadmap needs to be flexible enough to enable it to adapt to these types of challenges.
  • Prioritized – There will be an impulse to move quickly and do everything at once.   That almost never works.   It is important to align the priorities with the overall priorities of the organization.
  • Realistic – Just as you had to take an objective, and possibly painful, look at where you were with respect to your data, you have to take a similar look at what can be done given any number of constraints all organizations face.   Funding, people, discipline, etc. are all factors that need to be considered when developing the roadmap.   In some cases, you might not have the internal skill sets necessary and have to leverage outside talent.   In other cases, you will have to implement new processes, organizational constructs and enabling technologies to enable the movement to a new level.  

Execute Iteratively

The capabilities you need to implement will build upon each other and it will take time for the organization to adapt to the changes.   Taking an iterative approach that focuses on building capabilities based on the organization’s business priorities will greatly increase your chance of success.  It also gives you a chance to evaluate the capabilities to see if they are working as anticipated and generating the expected returns.   Since you are taking an iterative approach, you have the opportunity to make the necessary changes to continue moving forward.

The path to innovation is not always an easy one.   It requires a solid, yet flexible, plan to get there and persistence to overcome the obstacles that you will encounter.   However, in the end, it’s a journey well worth the effort.

Data Darwinism – Capabilities that provide a competitive advantage

In my previous post, I introduced the concept of Data Darwinism, which states that for a company to be the ‘king of the jungle’ (and remain so), they need to have the ability to continually innovate.   Let’s be clear, though.   Innovation must be aligned with the strategic goals and objectives of the company.   The landscape is littered with examples of innovative ideas that didn’t have a market.  

So that begs the question “What are the behaviors and characteristics of companies that are at the top of the food chain?”    The answer to that question can go in many different directions.   With respect to Data Darwinism, the following hierarchy illustrates the categories of capabilities that an organization needs to demonstrate to truly become a dominant force.


The impulse will be for an organization to want to immediately jump to implementing capabilities that they think will allow them to be at the top of the pyramid.   And while this is possible to a certain extent, you must put in place certain foundational capabilities to have a sustainable model.     Examples of capabilities at this level include data integration, data standardization, data quality, and basic reporting.

Without clean, integrated, accurate data that is aligned with the intended business goals, the ability to implement the more advanced capabilities is severely limited.    This does not mean that all foundational capabilities must be implemented before moving on to the next level.  Quite the opposite actually.   You must balance the need for the foundational components with the return that the more advanced capabilities will enable.


Transitional capabilities are those that allow an organization to move from silo’d, isolated, often duplicative efforts to a more ‘centralized’ platform in which to leverage their data.    Capabilities at this level of the hierarchy start to migrate towards an enterprise view of data and include such things as a more complete, integrated data set, increased collaboration, basic analytics and ‘coordinated governance’.

Again, you don’t need to fully instantiate the capabilities at this level before building capabilities at the next level.   It continues to be a balancing act.


Transformational capabilities are those that allow the company to start to truly differentiate itself from its competition.   They don’t fully deliver the innovative capabilities that set it head and shoulders above other companies, but rather set the stage for such.   This stage can be challenging for organizations as it can require a significant change in mind-set compared to the current way they conduct their operations.   Capabilities at this level of the hierarchy include more advanced analytical capabilities (such as true data mining), targeted access to data by users, and ‘managed governance’.


Innovative capabilities are those that truly set a company apart from its competitors.   They allow for innovative product offerings, unique methods of handling the customer experience and new ways in which to conduct business operations.   Amazon is a great example of this.   Their ability to customize the user experience and offer ‘recommendations’ based on a wealth of user buying  trend data has set them apart from most other online retailers.    Capabilities at this level of the hierarchy include predictive analytics, enterprise governance and user self-service access to data.

The bottom line is that moving up the hierarchy requires vision, discipline and a pragmatic approach.   The journey is not always an easy one, but the rewards more than justify the effort.

Check back for the next installment of this series “Data Darwinism – Evolving Your Data Environment.”

Data Darwinism – Are you on the path to extinction?

Most people are familiar with Darwinism.  We’ve all heard the term survival of the fittest.   There is even a humorous take on the subject with the annual Darwin Awards, given to those individuals who have removed themselves from the gene pool through, shall we say, less than intelligent choices.

Businesses go through ups and downs, transformations, up-sizing/down-sizing, centralization/ decentralization, etc.   In other words, they are trying to adapt to the current and future events in order to grow.   Just as in the animal kingdom, some will survive and dominate, some will not fare as well.   In today’s challenging business environment, while many are trying to merely survive, others are prospering, growing and dominating.  

So what makes the difference between being the king of the jungle or being prey?   The ability to make the right decisions in the face of uncertainty.     This is often easier said than done.   However, at the core of making the best decisions is making sure you have the right data.   That brings us back to the topic at hand:  Data Darwinism.   Data Darwinism can be defined as:

“The practice of using an organization’s data to survive, adapt, compete and innovate in a constantly changing and increasingly competitive business environment.”

When asked to assess where they are on the Data Darwinism continuum, many companies will say that they are at the top of the food chain, that they are very fast at getting data to make decisions, that they don’t see data as a problem, etc.   However, when truly asked to objectively evaluate their situation, they often come up with a very different, and often frightening, picture. 

  It’s as simple as looking at your behavior when dealing with data:

If you find yourself exhibiting more of the behaviors on the left side of the picture above, you might be a candidate for the next Data Darwin Awards.

Check back for the next installment of this series “Data Darwinism – Capabilities that Provide a Competitive Advantage.”

Informatica Data Quality Matching Algorithms: Eliminate duplicates and reduce costs

Why are matching algorithms important?

Are you looking for a way to cut costs from operations? Matching algorithms can help you do it! Consolidating duplicate data can deliver direct cost savings to an organization’s operations through the elimination of redundant, costly data.

On a recent engagement, I worked with a marketing group and discovered close to one million duplicate customer records. At an average cost of $0.45 per mailer, the estimated marketing operations cost reduction was $450,000. What’s really exciting is that this cost reduction was for the first year alone. When you take into consideration that each customer remained in marketing campaigns for a minimum of three years, the total cost reduction to marketing operations was in excess of one million dollars.
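
The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration: the record count is rounded to an even one million (the engagement found "close to" that number), one mailer per duplicate record per year is assumed, and the function name is hypothetical.

```python
from decimal import Decimal


def mailer_savings(duplicate_records: int, cost_per_mailer: Decimal, years: int = 1) -> Decimal:
    """Mailing cost avoided by no longer sending to duplicate records.

    Assumes one mailer per duplicate record per year (an assumption,
    not a figure from the engagement).
    """
    return duplicate_records * cost_per_mailer * years


first_year = mailer_savings(1_000_000, Decimal("0.45"))
three_years = mailer_savings(1_000_000, Decimal("0.45"), years=3)

print(f"First year:  ${first_year:,.0f}")
print(f"Three years: ${three_years:,.0f}")
```

Using `Decimal` rather than floating point keeps the currency arithmetic exact.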

The American Health Information Management Association (AHIMA) has published estimates that project the cost of reconciling duplicate data between $10 and $20 per pair.

Studies in the healthcare industry indicate that the cost of duplicate records ranges from twenty dollars to layered costs of several hundred dollars per duplicate. When you consider the fact that a conservative estimate for data duplication in enterprise data is approximately 10%, the total cost of data duplication can be expressed in millions of dollars.

When examining the potential return on investment (ROI) of data de-duplication efforts, there is evidence that the value proposition can be significant for data-driven organizations.  Using the low end of the cost of duplication ($20 per duplicate) and the median cost of reconciliation ($15 per pair), the potential return on investment is estimated at 33%.  At a cost reduction of $500,000 per million records, a compelling case can be made for such an effort.
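
One way to arrive at the 33% and $500,000 figures is sketched below. It is a reading of the numbers rather than a stated formula: it assumes ROI is the net saving per duplicate divided by the cost of reconciling it, and it applies the conservative 10% duplication rate mentioned earlier to one million records. The function names are illustrative.

```python
def dedup_roi(cost_per_duplicate: float, reconcile_cost: float) -> float:
    """ROI as net saving per duplicate divided by the cost to reconcile it."""
    return (cost_per_duplicate - reconcile_cost) / reconcile_cost


def net_savings(records: int, duplicate_rate: float,
                cost_per_duplicate: float, reconcile_cost: float) -> float:
    """Net saving across a data set, given an assumed duplication rate."""
    duplicates = records * duplicate_rate
    return duplicates * (cost_per_duplicate - reconcile_cost)


roi = dedup_roi(cost_per_duplicate=20, reconcile_cost=15)
savings = net_savings(records=1_000_000, duplicate_rate=0.10,
                      cost_per_duplicate=20, reconcile_cost=15)

print(f"ROI: {roi:.0%}")
print(f"Net savings per million records: ${savings:,.0f}")
```

Plugging in a different duplication rate from your own data quality assessment shows quickly whether the business case still holds.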

Furthermore, projects like data quality assessments can be performed at lower costs to determine if such an ROI is available to an organization.  A data quality assessment is a short, diagnostic effort that gauges the quality of the data and yields insight into, among other aspects, potential duplication levels in the data.

Matching algorithms just got a whole lot more interesting, didn’t they?

Informatica Data Quality Workbench Matching Algorithms

Informatica offers several implementations of matching algorithms that can be used to identify possible duplicate records. Each implementation is based on determining the similarity between two strings, such as name and address. Some implementations are better suited to date strings and others are ideal for numeric strings. In the coming weeks, we’ll go through an overview of each of these implementations and how to use them to your advantage!

Hamming Distance Algorithm

Let’s begin the series with Informatica’s implementation of the Hamming distance algorithm.

The Hamming distance algorithm is particularly useful when the position of the characters in the string is important.  Examples of such strings are telephone numbers, dates and postal codes.  The Hamming Distance algorithm measures the minimum number of substitutions required to change one string into the other, or the number of errors that transformed one string into the other.

The Hamming distance is named after Richard Hamming.  Hamming was an American mathematician whose accomplishments include many advances in Information Science.  Perhaps as a result of Hamming’s time at Bell Laboratories, the Hamming distance algorithm is most often associated with the analysis of telephone numbers.  However, the advantages of the algorithm are applicable to various types of strings and are not limited to numeric strings.

Worth noting is one condition that needs to be adhered to when using this algorithm: the strings being analyzed need to be of the same length.  Since the Hamming distance algorithm is based on the “cost” of transforming one string into another through substitutions, strings of unequal length will result in high penalties due to the substitutions involving null character values.
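
To make the idea concrete, here is a minimal Python sketch of the textbook Hamming distance together with a normalized similarity score. It is not Informatica’s implementation: the padding with a null character and the `1 - distance / length` scoring are assumptions chosen to mirror the behaviour described above, and the telephone numbers are made up.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def hamming_similarity(a: str, b: str, pad: str = "\0") -> float:
    """Score from 0.0 to 1.0, where 1.0 means an exact match.

    The shorter string is padded with a null character (an assumption),
    so every missing position counts as a substitution.
    """
    width = max(len(a), len(b))
    if width == 0:
        return 1.0
    distance = hamming_distance(a.ljust(width, pad), b.ljust(width, pad))
    return 1.0 - distance / width


# Hypothetical values: the same local number, with and without an area code.
print(hamming_distance("5551234", "5551235"))
print(hamming_similarity("5551234", "5551234"))
print(hamming_similarity("312-555-1234", "555-1234"))
```

The last comparison scores poorly even though the local numbers are identical, because the area code shifts every character out of position. That is the penalty that cleansing and standardizing the data is meant to avoid.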

In Practice

Due to this constraint, it is important to cleanse and standardize your data prior to using the Hamming distance component in IDQ. For instance, failing to parse area codes from telephone numbers could cause penalties when matching what would otherwise be similar numbers.

Let’s take the data in Figure 1 as an example. As you can see the data has not been cleansed and we have a record containing an area code and a plus-4 extension on the postal code. We also have a record which does not contain either an area code or a plus-4 extension.

Figure 1 Telephone / Postal Code Sample Data

Before running a match plan on these records we’ll need to group them into logical groups. These groups are usually defined by the business objectives which lead to the need to match the data. For instance, if we are trying to identify and consolidate duplicate phone numbers then grouping the data by area code would be a logical grouping factor. This is due to the fact that seven digit telephone numbers are often repeated across area codes.

After reviewing the data it is evident that only one of the records has an area code element. We do, however, have postal codes and, although it is not a one-to-one relationship, grouping on the first three digits of the postal code should produce groupings similar to those an area code would give us.
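
The grouping step can be sketched outside of IDQ as follows. This is an illustration of the idea only, not how IDQ builds its groups; the record layout, field names and sample values are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def group_by_postal_prefix(records, prefix_len=3):
    """Bucket records by the first few characters of the postal code."""
    groups = defaultdict(list)
    for record in records:
        groups[record["postal_code"][:prefix_len]].append(record)
    return dict(groups)


def candidate_pairs(groups):
    """Yield only the record pairs that share a group."""
    for members in groups.values():
        yield from combinations(members, 2)


# Hypothetical records, not the data shown in Figure 1.
records = [
    {"id": 1, "telephone": "312-555-1234", "postal_code": "60601-1234"},
    {"id": 2, "telephone": "555-1234", "postal_code": "60601"},
    {"id": 3, "telephone": "555-1234", "postal_code": "10001"},
]

groups = group_by_postal_prefix(records)
pairs = list(candidate_pairs(groups))
print(sorted(groups))
print([(a["id"], b["id"]) for a, b in pairs])
```

Records 2 and 3 share a seven-digit number but fall into different groups, so they are never compared. Grouping also keeps the number of comparisons manageable on large data sets.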

Once the data has been grouped, we can build a simple IDQ plan to match the data using the Hamming Distance component (Figure 2).

Figure 2 IDQ Match Plan using the Hamming Distance Component *Figure 2 appears with the permission of Informatica Corporation.

Since we are focused on identifying duplicate phone numbers, I’ve chosen to weight the telephone element slightly higher than the postal code. This can be done via the use of the Weight Based Analyzer, illustrated in Figure 2 above. As you can see in the figure to the left, adjusting the weighting of a data element is as simple as entering a number into the text box provided.

The higher the number, the greater the weighting or emphasis placed on that value matching. Acceptable ranges for weightings are between 0.0 and 1000000.0. The default value for each element in a match plan is 0.5.  In my experience, as long as the values entered are proportionate to the importance of the data in identifying a true match, then the actual values entered are subjective.

Depending on the business problem, my experience indicates that keeping these values between 0.4 and 0.8 has produced valid positive identifications. When the data element is essential to indicating a match, the higher end of this weighting scale allows that element to drive the matching appropriately.  When the data element is useful in indicating a key difference between the records, the lower end of this scale is appropriate.  An example of a field that establishes a difference between two records is the name suffix.  This value can differentiate a father from his son when they both share a common first and last name.
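
As a rough illustration of how weights shift the overall result, the sketch below combines per-field scores as a weighted average. How the Weight Based Analyzer actually combines scores is not documented here, so the weighted average is an assumption, as are the function names and sample values. Only the acceptable range of 0.0 to 1000000.0 comes from the description above.

```python
MIN_WEIGHT = 0.0
MAX_WEIGHT = 1_000_000.0  # acceptable range quoted above


def validate_weight(weight: float) -> float:
    """Reject weights outside the acceptable range."""
    if not MIN_WEIGHT <= weight <= MAX_WEIGHT:
        raise ValueError(f"weight {weight} is outside {MIN_WEIGHT} to {MAX_WEIGHT}")
    return weight


def weighted_match_score(field_scores: dict, weights: dict) -> float:
    """Weighted average of per-field scores (an assumed combination rule)."""
    total_weight = sum(validate_weight(weights[name]) for name in field_scores)
    if total_weight == 0:
        raise ValueError("at least one weight must be greater than zero")
    weighted_sum = sum(score * weights[name] for name, score in field_scores.items())
    return weighted_sum / total_weight


# Hypothetical field scores: telephone matches exactly, postal code only partly.
scores = {"telephone": 1.0, "postal_code": 0.5}
weights = {"telephone": 0.8, "postal_code": 0.5}
overall = weighted_match_score(scores, weights)
print(round(overall, 3))
```

Because only the proportions matter in a weighted average, weights of 8 and 5 would give the same overall score as 0.8 and 0.5, which fits the observation that the actual values entered are subjective.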

If you enter a number outside the acceptable range of 0.0 to 1000000.0, an error message like the one below will appear.


*Screen shots appear with the permission of Informatica Corporation.

Without cleansing this data, IDQ is unable to detect a match for either record. The match summary report in Figure 3 below shows the default HTML report output. As you can see, the plan made two comparisons but was unable to find a match in either of those comparisons.

Figure 3 match summary report *Figure 3 appears with the permission of Informatica Corporation.

Now let’s take a look at the same data once it has been cleansed by parsing out the area code and plus 4 postal code extension. This type of parsing can be achieved by using the token parser and using the dash (-) symbol as the parsing token.
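
The effect of that parsing step can be approximated in plain Python. This is a sketch under assumed formats (telephone numbers written as `AAA-NNN-NNNN` or `NNN-NNNN`, postal codes as `NNNNN` or `NNNNN-NNNN`), not the token parser itself; the function names and sample values are hypothetical.

```python
def parse_postal(postal: str) -> tuple:
    """Split a postal code into its base and optional plus-4 extension."""
    base, _, plus4 = postal.partition("-")
    return base, plus4


def parse_phone(phone: str) -> tuple:
    """Split a dash-delimited telephone number into area code and local number.

    Assumes three dash-separated parts mean the first part is the area code.
    """
    parts = phone.split("-")
    if len(parts) == 3:
        return parts[0], "".join(parts[1:])
    return "", "".join(parts)


print(parse_postal("60601-1234"))
print(parse_postal("60601"))
print(parse_phone("312-555-1234"))
print(parse_phone("555-1234"))
```

Once the area code and plus-4 extension sit in their own fields, the remaining local number and base postal code have the same length in every record, which is what the Hamming distance comparison needs.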

Figure 4 Parsed Telephone / Postal Code Sample Data

From manual review of the data, it is easy to see what a difference the data cleansing can make, but let’s take a look at its effect on matching in IDQ.

Using the same grouping and matching criteria, IDQ is now able to identify a match. The results summary shows us that the same number of comparisons has been made; however, a match has been identified and the match score is between 0.9 and 1.0.

Upon further examination of the clusters, or groups of potential matches, we can see that the Hamming distance score (highlighted in red) for both the telephone and postal code inputs was equal to 1.

Figure 6 Detailed Match Report *Figure 6 appears with the permission of Informatica Corporation.

This indicates that these values were an exact match. The report also includes an output for the record which best matches a given row (highlighted in blue). If there were more than two matches in the cluster, this field would help identify the records which most closely resemble each other.


In this post we’ve introduced the various algorithms available in Informatica’s Data Quality (IDQ) workbench and described the implementation and advantages of the Hamming Distance component in IDQ.  We’ve discussed how data cleansing and standardization can positively influence the results of matching processes.  We’ve even discussed certain business problems and how they can be solved with data matching.

But perhaps most importantly, we’ve illustrated why matching algorithms are an important tool in reducing costs and that a de-duplication effort is worth the investment.

Clinical Alerts – Why Good Intentions Must Start as Good Ideas

As the heated debate continues about ways to decrease the costs of our healthcare system while simultaneously improving its quality, it is critical to consider the most appropriate place to start, which depends on who you are. Much has been made about the advantages of clinical alerts, especially with their use in areas high on the national radar like quality of care, medication use and allergic reactions, and adverse events.   Common sense, though, says walk before you run; in this case it’s crawl before you run.

Clinical alerts are most often electronic messages sent via email, text, page, and even automated voice to notify a clinician or group of clinicians to conduct a course of action related to their patient care based on data retrieved in a Clinical Decision Support System (CDSS) designed for optimal outcomes. The rules engine that generates alerts is created specifically for various areas of patient safety and quality like administering vaccines to children, core measure compliance, and preventing complications like venous thromboembolism (VTE) (also a core measure). The benefits of using clinical alerts in various care settings are obvious if the right people, processes, and systems are in place to consume and manage the alerts appropriately. Numerous studies have been done highlighting the right and wrong ways of implementing and utilizing alerts. The best criteria I’ve seen used consider 5 major themes when designing alerts: Efficiency, Usefulness, Information Content, User Interface, and Workflow (I’ve personally confirmed each of these from numerous discussions with clinicians ranging from ED nurses to Anesthesiologists in the OR to hospitalists on the floors). And don’t forget one huge piece of the alerting discussion that often gets overlooked…….the patient! While some of these may be obvious, all must be considered as the design and implementation phases of the alerts progress.

OK, Now Back to Reality

A discussion about how clinical alerting can improve the quality of care is one limited to the very few provider organizations that already have the infrastructure set up and the resources to implement such an initiative. This means that if you are seriously considering such a task, you should already have:

  • an Enterprise Data Strategy and Roadmap that tells you how alerts tie into the broader mission;
  • Data Governance  to assign ownership and accountability for the quality of your data and implement standards (especially when it comes to clinical documentation and data entry);
  • standardized process flows that identify points for consistent, discrete data collection;
  • surgeon, physician, anesthesiology, nursing, researcher, and hospitalist champions to gather support from various constituencies and facilitate education and buy-in; and
  •  oh yeah, the technology and skilled staff to support a multi-system, highly integrated, complex rules-based environment that will likely change over time and be more scrutinized………

Or a strong relationship with an experienced consulting partner capable of handling all of these requirements and transferring the necessary knowledge along the way.

I must emphasize the second bullet for just a moment: data governance is critical to ensure that the quality of the data being collected passes the highest level of scrutiny, from doctors to administrators. This is of the utmost importance because the data is what forms the basis of the information that decision makers act on. The quickest way to lose momentum and buy-in on any project is by putting bad data in front of a group of doctors and clinicians; trust me when I say it is infinitely more difficult to win their trust back once you’ve made that mistake. On the other hand, if they trust the data and understand the value of it in near real time across their spectrum of care, you turn them quickly into leaders willing to champion your efforts. And now you have a solid foundation for any healthcare analytics program.

If you are like the majority of healthcare organizations in this country, you may have some pieces to this puzzle in various stages of design, development, deployment or implementation. In all likelihood, though, you are at the early stages of the Clinical Alerts Maturity Model and, all things considered, should have alerting functionality in the later years of your strategic roadmap. There are, though, many projects with low cost, fast implementations, quick ROIs, and ample lessons learned, such as Computerized Physician Order Entry (CPOE), electronic nursing and physician documentation, a Picture Archiving and Communication System (PACS), and a clinical data repository (CDR), in which alerting can be used as a prototype or proof of concept to demonstrate the broader value proposition. Clinical alerting, to start, should be incorporated alongside projects that have proven impact across the Clinical Alerts Maturity Model before it is rolled out as a stand-alone initiative.

From Free Text Clinical Documentation to Data-rich Actionable Information

Hey healthcare providers! Yeah, you, the “little guy”, the rural community hospital; or you, the “average Joe”, the few-hundred-bed hub hospital with outpatient clinics, an ED, and some sub-specialties; or you, the “behemoth”, the multi-discipline, multi-care-setting institution with the health plan, physician group, outpatient and inpatient services. Is your EMR really just an electronic filing cabinet? Do nursing and physician notes, standard lab and imaging orders, registration and other critical documents just get scanned into a central system that can’t be referenced later on to meet your analytic needs? Don’t worry, you’re not alone…

Recently, I blogged about some of the advantages of Microsoft’s new Amalga platform; I want to emphasize a capability of Amalga Life Sciences that I hope finds its way into the range of healthcare provider organizations mentioned above, and quick! That is, the ability to create a standard ontology for displaying and navigating the unstructured information collected by providers across care settings and patient visits (see my response to a comment about Amalga Life Sciences’ utilization of UMLS for a model of standardized terminology). I don’t have to make this case to the huge group of clinicians already too familiar with this process in hospitals across the country; but the argument (and likely ROI) clearly needs to be articulated for those individuals responsible for transitioning from paper to digital records at the organizations that are dragging their feet (>90%). The question I have for these individuals is, “Why is this taking so long? Why haven’t you been able to identify the clear-cut benefits of moving from paper-laden manual processes to automated, digital interfaces and streamlined workflows?” These folks should ask the corporate executives at hospitals in New Orleans after Hurricane Katrina whether they wish they had had this debate long before their entire patient population’s medical records drowned; just one reason why “all paper” is a strategy of the past.

Let’s take one example most provider organizations can conceptualize: a pneumonia patient flow through the Emergency Department. There are numerous points throughout this process that could be considered “data collection points”. These, collectively and over time, paint a vivid picture of the patient experience from registration to triage to physical exam and diagnostic testing to possible admission or discharge. With this data you can do things like real or near-real time clinical alerting that would improve patient outcomes and compliance with regulations like CMS Core Measures; you can identify weak points or bottlenecks in the process to allocate additional resources; you can model best practices identified over time to improve clinical and operational efficiencies. Individually, though, with this data written on a piece of paper (and remember, one piece of paper for registration, a separate piece for the “Core Measure Checklist”, another for the physician exam, another for the lab/X-ray report, etc.) and maybe scanned into a central system, this information tells you very little. You are also, then, at the mercy of your ability to actually read a physician’s handwriting and analyze scanned documents of information vs. delineated data fields that can be trended over time, summarized, visualized, drilled down to, and so on.

Vulnerabilities and Liabilities from Poor Documentation

Relying on poor documentation like illegible penmanship, incomplete charting and unapproved abbreviations burdens nurses and creates a huge liability. With all of the requirements and suggestions for the proper way to document, it’s no wonder this area is so prone to errors. There are a variety of consequences from performing patient care based on “best guesses” when reading clinical documentation. Fortunately, improving documentation directly correlates with reduced medical errors. The value proposition for improved data collection and standardized terminology for that data makes sense operationally, financially, and clinically.

So Let’s Get On With It, Shall We?

Advancing clinical care through the use of technology is seemingly one component of the larger healthcare debate in this country centered on “how do we improve the system?” Unfortunately, too many providers want to sprint before they can crawl. Moving off of paper helps you crawl first; it is a valuable, achievable goal across the majority of organizations burdened with manual processes and their costs, and if done properly, the ROI can be realized in a short amount of time with manageable effort. Having said this, the question quickly then becomes, “are we prepared to do what it takes to actually make the system improve?” Are you?