Wednesday, May 5, 2010

Suspect Duplicate Processing in IBM MDM Server

Evaluating data, finding suspects in it, and collapsing them based on rules is a labor-intensive process. If the suspects are not a high-probability match, what action should be taken? How can automated merging be used to minimize the manual work of collapsing data?

There are a lot of questions the business wants answered before it can make an informed decision. Let's go over the basic terminology the business should know when talking about Suspect Duplicate Processing (SDP), also called Duplicate Suspect Processing (DSP).

What is SDP?
IBM MDM can identify duplicate parties in real time, as part of adding or updating party data, or offline as part of Evergreening. The Suspect Duplicate Processing (SDP) feature provides the mechanism to identify these duplicate parties.
Terminology business users should know:
- Critical Data
- Match/Non-Match Score
- Match Category
- Match Matrix

What is Critical Data?
The term ‘Critical Data’ refers to the data elements that the business selects for comparison in SDP. If all Critical Data fields match between two records, the records are considered an exact match. For example: Last Name, SSN, Address Line One.
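
As a rough illustration, the exact-match rule can be sketched in Python. The field names (last_name, ssn, address_line_one) and the normalize helper are assumptions made for this example; they are not the MDM Server API, and the product applies its own standardization before comparing.

```python
# Hypothetical critical data fields chosen by the business (illustrative names).
CRITICAL_FIELDS = ("last_name", "ssn", "address_line_one")


def normalize(value):
    """Trim and upper-case a value; treat empty or missing values as None."""
    cleaned = (value or "").strip().upper()
    return cleaned or None


def is_exact_match(new_party, existing_party):
    """Return True only if every critical field is present in both records and equal."""
    for field in CRITICAL_FIELDS:
        a = normalize(new_party.get(field))
        b = normalize(existing_party.get(field))
        if a is None or b is None or a != b:
            return False
    return True
```

In this sketch a field that is missing on either side prevents an exact match, which is a simplifying assumption.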

What is Match/Non-Match Score?
Each critical data element is given a score. For example:

Critical Data       Match Relevancy Score   Non-Match Relevancy Score
Last Name           1                       1
SSN                 2                       2
Address Line One    8                       8


What is Match Category?
The Match Category is based on the Match/Non-Match Scores. Out of the box (OOTB) there are four Match Categories:
A1 - Match/Non-Match scores indicate that a definite duplicate party has been found.
A2 - Match/Non-Match scores indicate a high probability that a duplicate party has been found.
B - Match/Non-Match scores indicate that it is fairly unlikely that a duplicate party has been found.
C - Match/Non-Match scores indicate that the suspect party is not a duplicate.
These categories can be customized.

What is the Match Matrix?
The Match Matrix brings together Match/Non-Match Scores and Match Categories.
- 0 means the data element is missing from the new record, the existing record, or both
- A negative value means the data element is present in both the new and existing records and does not match
- A positive value means the data element is present in both the new and existing records and matches

Last Name   SSN   Address Line One   Match Score   Non-Match Score   Category
1           2     8                  11            0                 A1
-1          -2    -8                 0             11                C
1           2     0                  3             0                 A2
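
How one row of the matrix is derived can be sketched in Python. The relevancy scores and the three category rows come from the tables in this post; the function names, field names, and the dictionary lookup keyed by score totals are assumptions for illustration, not how MDM Server stores or evaluates its matrix.

```python
# Relevancy scores from the example table: field -> (match, non-match).
SCORES = {
    "last_name": (1, 1),
    "ssn": (2, 2),
    "address_line_one": (8, 8),
}

# Categories from the example matrix, keyed by (match score, non-match score).
# A real matrix would cover every combination the business cares about.
CATEGORY_BY_TOTALS = {
    (11, 0): "A1",
    (0, 11): "C",
    (3, 0): "A2",
}


def signed_values(new_party, existing_party):
    """Per-field matrix value: 0 if missing on either side, else +match or -non-match."""
    result = {}
    for field, (match, non_match) in SCORES.items():
        a, b = new_party.get(field), existing_party.get(field)
        if not a or not b:
            result[field] = 0
        elif a == b:
            result[field] = match
        else:
            result[field] = -non_match
    return result


def categorize(new_party, existing_party):
    """Return (match score, non-match score, category); category is None if undefined."""
    values = list(signed_values(new_party, existing_party).values())
    match_score = sum(v for v in values if v > 0)
    non_match_score = -sum(v for v in values if v < 0)
    category = CATEGORY_BY_TOTALS.get((match_score, non_match_score))
    return match_score, non_match_score, category


new = {"last_name": "SMITH", "ssn": "123-45-6789"}
existing = {"last_name": "SMITH", "ssn": "123-45-6789", "address_line_one": "1 MAIN ST"}
print(categorize(new, existing))  # (3, 0, 'A2')
```

The usage lines reproduce the third row of the matrix: Last Name and SSN match, Address Line One is missing from the new record, so the match score is 3 and the non-match score is 0.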


The categories in the match matrix are decided by the business. Hopefully this provides a basic overview of the Suspect Duplicate Processing concept in IBM MDM. Feel free to leave comments or ask questions.
