Why are matching algorithms important?
Are you looking for a way to cut costs from operations? Matching algorithms can help you do it! Duplication data consolidation can deliver a direct cost savings to an organization’s operations through the elimination of redundant, costly data.
On a recent engagement, I worked with a marketing group and discovered close to one million duplicate customer records. At an average cost of $0.45 per mailer, the estimated marketing operations cost reduction was $450,000. What’s really exciting is that this cost reduction was for the first year. When you take into consideration that each customer remained in marketing campaigns for a minimum of three years, the total cost reduction to marketing operations was in the excess of one million dollars.
The American Health Information Management Association (AHIMA) has published estimates that project the cost of reconciling duplicate data between $10 and $20 per pair.
Studies in the healthcare industry indicate that the cost of duplicate records ranges from twenty dollars to layered costs of several hundred dollars per duplicate. When you consider the fact that a conservative estimate for data duplication in enterprise data is approximately 10%, the total cost of data duplication can be expressed in millions of dollars.
When examining the potential return on investment (ROI) of data de-duplication efforts, there is evidence that the value proposition can be significant and can indeed add value to data-driven organizations. Using the low-end of the cost of duplication ($20 per duplicate), and the median ($15 per duplicate), the potential return on investment is estimated at 33%. At a cost reduction of $500,000 per million records, a compelling case can be made for such an effort.
Furthermore, projects like data quality assessments can be performed at lower costs to determine if such an ROI is available to an organization. A data quality assessment is a short, diagnostic effort that gauges the quality of the data that yields insight into, among other aspects, potential duplication levels in the data.
Matching algorithms just got a whole lot more interesting, didn’t they?
Informatica Data Quality Workbench Matching Algorithms
Informatica offers several implementations of matching algorithms that can be used to identify possible duplicate records. Each implementation is based on determining the similarity between two strings, such as name and address. There are implementations that are more well-suited to use with date strings and others that are ideal for numeric strings. In the coming weeks, we’ll go through an overview of each of these implementations and how to use them to your advantage!
Hamming Distance Algorithm
Let’s begin the series with the Informatica’s implementation of the Hamming distance algorithm.
The Hamming distance algorithm is particularly useful when the position of the characters in the string is important. Examples of such strings are telephone numbers, dates and postal codes. The Hamming Distance algorithm measures the minimum number of substitutions required to change one string into the other, or the number of errors that transformed one string into the other.
The Hamming distance is named after Richard Hamming. Hamming was an American mathematician whose accomplishments include many advances in Information Science. Perhaps as a result of Hamming’s time at Bell Laboratories, the Hamming distance algorithm is most often associated with the analysis of telephone numbers. However the advantages of the algorithm are applicable to various types of strings and are not limited to numeric strings.
Worth noting is one condition that needs to be adhered to when using this algorithm; the strings being analyzed need to be of the same length. Since the Hamming distance algorithm is based on the “cost” of transposing one string into another, strings of unequal length will result in high penalties due to the transpositions involving null character values.
In Practice
Due to this constraint, it is important to cleanse and standardize your data prior to using the Hamming distance component in IDQ. For instance, failing to parse area codes from telephone numbers could cause penalties when matching what would otherwise be similar numbers.
Let’s take the data in Figure 1 as an example. As you can see the data has not been cleansed and we have a record containing an area code and a plus-4 extension on the postal code. We also have a record which does not contain either an area code or a plus-4 extension.
Figure 1 Telephone / Postal Code Sample Data
Before running a match plan on these records we’ll need to group them into logical groups. These groups are usually defined by the business objectives which lead to the need to match the data. For instance, if we are trying to identify and consolidate duplicate phone numbers then grouping the data by area code would be a logical grouping factor. This is due to the fact that seven digit telephone numbers are often repeated across area codes.
After reviewing the data it is evident that we do not have an area code element in both records. We do, however, have postal codes and although it is not a one-to-one relationship, the first three digits of the postal code should build logical groupings that will produce similar groupings.
Once the data has been grouped, we can build a simple IDQ plan to match the data using the Hamming Distance component (Figure 2).
Figure 2 IDQ Match Plan using the Hamming Distance Component *Figure 2 appears with the permission of Informatica Corporation.
Since we are focused on identifying duplicate phone numbers, I’ve chosen to weight the telephone element slightly higher than the postal code. This is can be done via the use of the Weight Based Analyzer, illustrated in Figure 2 above. As you can see in the figure to the left, adjusting the weighting of a data element is as simple as entering a number into the text box provided.
The higher the number, the greater the weighting or emphasis placed on that value matching. Acceptable ranges for weightings are between 0.0 and 1000000.0. The default value for each element in a match plan is 0.5. In my experience, as long as the values entered are proportionate to the importance of the data in identifying a true match, then the actual values entered are subjective.
Depending on the business problem, my experience indicates that keeping these values between 0.4 and 0.8 has produced valid positive identifications. When the data element is essential to indicating a match, the higher end of this weighting scale allows that element to drive the matching appropriately. When the data element is useful in indicating a key difference between the records, the lower end of this scale is appropriate. An example of a field that establishes a difference between two records is the name suffix. This value can differentiate a father from his son when they both share a common first and last name.
If you enter a number outside that range, an error message like the one below will appear.
*Screen shots appear with the permission of Informatica Corporation.
Without cleansing this data IDQ is unable to detect a match for either record. The match summary report in figure 4 below shows the default html report output. As you can see, the plan made two comparisons but was unable to find a match in either of those comparisons.
Figure 3 match summary report *Figure 3 appears with the permission of Informatica Corporation.
Now let’s take a look at the same data once it has been cleansed by parsing out the area code and plus 4 postal code extension. This type of parsing can be achieved by using the token parser and using the dash (-) symbol as the parsing token.
Figure 4 Parsed Telephone / Postal Code Sample Data
From manual review of the data, it is easy to see what a difference the data cleansing can make, but let’s take a look at its effect on matching in IDQ.
Using the same grouping and matching criteria, IDQ is now able to identify a match. The results summary shows us that the same number of comparisons has been made, however, a match has been identified and the match score range is between 0.9-1.0.
Upon further examination of the clusters, or group of potential matches, we can see that the hamming distance score (highlighted in red) for both the telephone and postal code inputs was equal to 1.
Figure 6 Detailed Match Report *Figure 6 appears with the permission of Informatica Corporation.
This indicates that these values were an exact match. The report also includes an output for the record which best matches a given row (highlighted in blue). If there were more than two matches in the cluster, this field would help identify the records which most closely resemble each other.
Summary
In this post we’ve introduced the various algorithms available in Informatica’s Data Quality (IDQ) workbench and described the implementation and advantages of the Hamming Distance component in IDQ. We’ve discussed how data cleansing and standardization can positively influence the results of matching processes. We’ve even discussed certain business problems and how they can be solved with data matching.
But perhaps most importantly, we’ve illustrated why matching algorithms are an important tool in reducing costs and that a de-duplication effort is worth the investment.