The Talend Fuzzy Matching or tFuzzyMatch component compares the source data (main table) column value with the reference table (lookup table). Talend Fuzzy returns the matching value in the lookup table along with the distance in an integer value.
The Talend Fuzzy matching is very useful in correcting the typing mistakes that happened while entering the data. By using this Talend Fuzzy Matching, we can simply compare the incoming data with the original table that we stored in the database, and correct the mistakes.
In this Talend Fuzzy Matching example, we use the source or main table data from a text file and the lookup table from the SQL database. The following screenshot shows the data inside the original table.
and the Source text file is
Talend Fuzzy Matching Example
First, drag and drop the tDBConnection and tDBCommit to establish SQL connection and to close the same. Next, drag the tFileInputDelimited from the palette to job design. As you can see from the below, we are selecting the text file from our source, changing the field separator to a comma, and skipping the header row by placing one.
Please click on the Edit Schema to add columns, data type, and length. If you use the Metadata, then you can skip this step.
Next, drag the Talend tFuzzyMatch to the job design and connect the tFileInputDelimited main row to it.
Within the Talend Fuzzy Match components tab, by default, the Levenshtein type is selected. However, you can change them from the available list, and they are
- Levenshtein: This algorithm calculates the total insertions, updates, or deletes required for the given word to match with the lookup value. Based on the integer number, Talend tFuzzyMatch will return the matching word from the lookup table.
- Metaphone uses the Phonetic algorithm, and the algorithm works based on the pronunciation of the word. It first loads all the lookup table records, then checks the pronunciation of the main table word against the lookup.
- Double Metaphone: An upgraded new version of the original. As per our understanding, it returns better results compared to the original.
Min and Max distance: Please specify the number of changes allowed to match with the lookup word. 0 means a perfect match.
Unique Matching: If there are multiple matches, click this option to get the best match.
The Talend tFuzzyMatch requires a lookup table. So, we are selecting the FuzzyLookup table from our SQL Database.
Please select the lookup column, the column that has the common data. Here, we want to compare the Country names in the text file with the lookup table for spelling mistakes. So, let me select row2.Country as the lookup column.
Next, change the matching column to Country Name, and change the mat distance to 1.
Let me use tDBOutput to save the Fuzzy matching records in the Talend FuzzyMatch table.
Let us run the Talend tFuzzyMatch Job.
Within the Management Studio, you can see that it is showing the correct matching records. It means Value 1 will perform strict comparisons.
Let me change the Max distance to 3 and run the Talend Fuzzy matching job.
Now, you can see all the records with proper spellings in SQL.
This time, we are changing the Matching type to Metaphone.
And the result of the Talend fuzzy matching is
Let us change the Talend Fuzzy Matching type to Double Metaphone.
It shows the country or countries that pronounce the same while comparing the text file country with the lookup table country name.
If you select the Unique option along with the Double Metaphone, then the Talend Fuzzy Matching output is.
NOTE: The result of the Talend tFuzzyMatch might be different if the records are more.