Talend Fuzzy Matching

The Talend Fuzzy Matching or tFuzzyMatch component compares the source data (main table) column value with the reference table (lookup table). Talend Fuzzy Match returns the matching value in the lookup table along with the distance in an integer value. 

The Talend Fuzzy matching is very useful in correcting the typing mistakes that happened while entering the data. By using this Talend Fuzzy Matching, we can simply compare the incoming data with the original table that we stored in the database, and correct the mistakes.

In this Talend Fuzzy Matching example, we use the source or main table data from a text file and the lookup table from the SQL database. The following screenshot shows the data inside the original table.

Talend Fuzzy Match 1

and the Source text file is

Talend Fuzzy Match 2

Talend Fuzzy Matching Example

First, drag and drop the tDBConnection and tDBCommit to establish SQL connection and to close the same. Next, drag the tFileInputDelimited from palette to job design. As you can see from the below, we are selecting the text file from our source, changed the field separator to a comma, and skipping the header row by placing one.

Talend Fuzzy Match 3

Please click on the Edit Schema to add columns, data type, and length. If you use the Metadata, then you can skip this step.

Talend Fuzzy Match 4

Next, drag the Talend tFuzzyMatch to the job design and connect the tFileInputDelimited main row to it.

Within the Talend Fuzzy Match components tab, by default, the Levenshtein Matching type is selected. However, you can change them from the available list, and they are

Talend Fuzzy Match 5
  • Levenshtein: This algorithm calculates the total insertions, updates, or deletes required for the given word to match with the lookup value. Based on the integer number, Talend tFuzzyMatch will return the matching word from the lookup table. 
  • Metaphone uses the Phonetic algorithm, and the algorithm works based on the pronunciation of the word. It first loads all the lookup table records, then checks the pronunciation of the main table word against the lookup.
  • Double Metaphone: An upgraded new version of the original. As per our understanding, it returns better results compared to the original.

Min and Max distance: Please specify the number of changes allowed to match with lookup word. 0 means a perfect match.

Unique Matching: If there are multiple matches, click this option to get the best match. 

The Talend tFuzzyMatch requires a lookup table. So, we are selecting the FuzzyLookup table from our SQL Database.

Talend Fuzzy Match 6

Please select the lookup column, the column that has the common data. Here, we want to match the Country names in the text file with the lookup table for spelling mistakes. So, let me select the row2.Country as the lookup column.

Talend Fuzzy Match 7

Next, change the matching column to Country Name, and change the mat distance to 1.

Talend Fuzzy Match 8

Let me use tDBOutput to save the Fuzzy matching records in the Talend FuzzyMatch table.

Talend Fuzzy Match 9

Let us run the Talend tFuzzyMatch Job.

Talend Fuzzy Match 10

Within the Management Studio, you can see that it is showing the correct matching records. It means Value 1 will perform strict matching.

Talend Fuzzy Match 11

Let me change the Max distance to 3 and run the Talend Fuzzy matching job.

Talend Fuzzy Match 12

Now, you can see all the records with proper spellings in SQL.

Talend Fuzzy Match 13

This time, we are changing the Matching type to Metaphone.

Talend Fuzzy Match 14

And the result of the Talend fuzzy matching is

Talend Fuzzy Match 15

Let us change the Talend Fuzzy Matching Matching type to Double Metaphone.

Talend Fuzzy Match 16

It shows the country or countries that pronounce the same while comparing the text file country with lookup table country name. 

Talend Fuzzy Match 17

If you select the Unique Matching option along with the Double Metaphone, then the Talend Fuzzy Matching output is.

Talend Fuzzy Match 18

NOTE: The result of the Talend tFuzzyMatch might be different if the records are more.