Talend Fuzzy Matching

The Talend Fuzzy Matching or tFuzzyMatch component compares the source data (main table) column value with the reference table (lookup table). Fuzzy returns the matching value in the lookup table along with the distance in an integer value. 

The Talend Fuzzy matching is very useful in correcting the typing mistakes that happened while entering the data. By using this Fuzzy Matching, we can simply compare the incoming data with the original table that we stored in the database, and correct the mistakes.

In this Talend Fuzzy Matching example, we use the source or main table data from a text file and the lookup table from the SQL database. The following screenshot shows the data inside the original table.

Reference Table

and the Source text file is

Source Table

Talend Fuzzy Matching Example

First, drag and drop the tDBConnection and tDBCommit to establish the SQL connection and to close the same. Next, drag the tFileInputDelimited from the palette to the job design. As you can see from the below, we are selecting the text file from our source, changing the field separator to a comma, and skipping the header row by placing one.

Load Source txt into File input source

Please click on the Edit Schema to add columns, data type, and length. If you use the Metadata, then you can skip this step.

Edit Shema

Next, drag the Talend tFuzzyMatch to the job design and connect the tFileInputDelimited main row to it.

Within the Talend Fuzzy Match components tab, by default, the Levenshtein type is selected. However, you can change them from the available list, and they are

Talend Fuzzy Match 5
  • Levenshtein: This algorithm calculates the total insertions, updates, or deletes required for the given word to match with the lookup value. Based on the integer number, Talend tFuzzyMatch will return the matching word from the lookup table. 
  • Metaphone uses the Phonetic algorithm, and the algorithm works based on the pronunciation of the word. It first loads all the lookup table records, and then checks the pronunciation of the main table word against the lookup.
  • Double Metaphone: An upgraded new version of the original. As per our understanding, it returns better results compared to the original.

Min and Max distance: Please specify the number of changes allowed to match with the lookup word. 0 means a perfect match.

Unique Matching: If there are multiple matches, click this option to get the best match. 

The Talend tFuzzyMatch requires a lookup table. So, we are selecting the FuzzyLookup table from our SQL Database.

Talend Fuzzy Match 6

Please select the lookup column, the column that has the common data. Here, we want to compare the Country names in the text file with the lookup table for spelling mistakes. So, let me select row2.Country as the lookup column.

Talend Fuzzy Match 7

Next, change the matching column to Country Name, and change the mat distance to 1.

Talend Fuzzy Match 8

Let me use tDBOutput to save the Fuzzy matching records in the Talend FuzzyMatch table.

Let us run the Talend tFuzzyMatch Job.

Talend Fuzzy Match 10

Within the Management Studio, you can see that it is showing the correct matching records. It means Value 1 will perform strict comparisons.

Result Table

Let me change the Max distance to 3 and run the Talend Fuzzy matching job.

Talend Fuzzy Match 12

Now, you can see all the records with proper spellings in SQL.

Result set

This time, we are changing the Matching type to Metaphone.

Talend Fuzzy Match 14

The result of the fuzzy matching is

Table output

Let us change the Talend Fuzzy Matching type to Double Metaphone.

Talend Fuzzy Match 16

It shows the country or countries that pronounce the same while comparing the text file country with the lookup table country name. 

Table

If you select the Unique option along with the Double Metaphone, then the Fuzzy Matching output is.

Results

NOTE: The result of the Talend tFuzzyMatch might be different if the records are more.