Identifying duplicates in a data warehouse environment is a crucial process. The quality of de-duplication depends on selecting the correct token for de-duplication. In a traditional data warehouse, the token is selected by domain experts. This traditional approach to token formation is not suitable for de-duplication in a real-time data warehouse environment, where token selection must be automated. This paper proposes a method, Automated Token Formation (ATF), for record de-duplication and linkage. The ATF approach is based on the distinctness of each attribute and on its count of missing or unknown values. Experiments are carried out on the Restaurant and Cora datasets. The approach yields good-quality tokens for the Restaurant dataset, but its performance on the Cora dataset is very poor. This shortcoming of ATF is addressed by a rough set based classification approach termed Rough Set based Semi-Automated Token Formation (RS-SATF). RS-SATF is a structured, supervised classification approach, and it improves de-duplication accuracy over ATF on both datasets. RS-SATF classification is applied only to the tokens produced by ATF, which reduces the complexity of classification. On the Cora dataset, RS-SATF improves de-duplication results by 29% over tokens set manually by domain experts and by 51% over ATF tokens. On the Restaurant dataset, the de-duplication accuracy of RS-SATF is 98%, an improvement of 14% over manual tokens.
Volume 11 | 04-Special Issue
Pages: 380-390
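
The abstract states that ATF selects tokens from the distinctness of each attribute and its count of missing or unknown values, but gives no formula. The following Python sketch illustrates one way such attribute profiling could look. It is not the paper's algorithm: the function names, the set of missing-value markers, and the combined score (distinctness multiplied by the fraction of non-missing values) are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Strings treated as missing/unknown. This set is an assumption,
# not something specified in the abstract.
MISSING_MARKERS = {"", "?", "unknown", "n/a", "null"}


def is_missing(value: Optional[str]) -> bool:
    """Return True if the value is absent or a known placeholder."""
    return value is None or value.strip().lower() in MISSING_MARKERS


def attribute_profile(
    records: List[Dict[str, Optional[str]]], attribute: str
) -> Dict[str, float]:
    """Compute distinctness and missing ratio for one attribute.

    distinctness  = distinct non-missing values / total records
    missing_ratio = missing or unknown values / total records
    """
    total = len(records)
    if total == 0:
        return {"distinctness": 0.0, "missing_ratio": 0.0}
    present = [
        v.strip().lower()
        for v in (r.get(attribute) for r in records)
        if not is_missing(v)
    ]
    return {
        "distinctness": len(set(present)) / total,
        "missing_ratio": (total - len(present)) / total,
    }


def rank_attributes(
    records: List[Dict[str, Optional[str]]], attributes: List[str]
) -> List[str]:
    """Order attributes from most to least promising as token sources.

    The score below is a hypothetical combination chosen for this
    sketch: it rewards high distinctness and penalizes missing values.
    """
    def score(attribute: str) -> float:
        profile = attribute_profile(records, attribute)
        return profile["distinctness"] * (1.0 - profile["missing_ratio"])

    return sorted(attributes, key=score, reverse=True)
```

Under these assumptions, an attribute whose values are nearly unique and rarely missing ranks first, while an attribute with few distinct values or many unknowns ranks last. A real implementation would also need thresholds and a rule for building the token itself from the chosen attributes, neither of which the abstract describes.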