The Term Extraction transformation in SSIS extracts terms from the text in the source data and writes the extracted terms to a transformation output column.
For example, suppose people are writing reviews of your products and you want to know what they talk about most. In such situations, you can use the Term Extraction transformation to extract the nouns and noun phrases (for example, product features) that occur most often in the reviews.
NOTE: This transformation works only with English text. It uses its own English dictionary and linguistic information to extract terms from the source data.
In SSIS, we can perform Term Extraction only on columns with the DT_WSTR or DT_NTEXT data type. If your input column has a different data type, use the SSIS Data Conversion transformation to convert it to DT_WSTR or DT_NTEXT.
TIP: Please refer to the Term lookup Transformation in SSIS article to understand the term lookup techniques.
Steps involved in configuring Term Extraction in SSIS
When you double-click this transformation, the Term Extraction Transformation Editor window opens so that you can configure it. It contains three tabs: Term Extraction, Exclusion, and Advanced.
Term Extraction Tab
Within the Term Extraction tab, we have to select the column of the source data from the Available Input Columns list, as shown in the screenshot below.
The Term Extraction transformation produces only two output columns. Their default names are Term and Score, but you can rename them as per your requirement.
- Term: This column contains the terms extracted from the text. For example, if we are extracting nouns, then all the nouns will be stored in this column.
- Score: This column contains the score of each term. When the Frequency score type is selected on the Advanced tab, the score is the number of times the term is repeated in the input column. For example, if India is the first term extracted from the text, the Term Extraction Transformation checks all the rows and counts the number of times the term India is repeated across the rows of that input column. When TFIDF is selected, the column holds the TFIDF value of the term instead.
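To make the Term and Score columns more concrete, here is a minimal Python sketch of frequency scoring. It is not how SSIS is implemented: the sample rows, the KNOWN_NOUNS set, and the extract_terms helper are hypothetical stand-ins for the dictionary-based noun detection that the real transformation performs.

```python
from collections import Counter

# Hypothetical input rows standing in for a DT_WSTR column of text.
rows = [
    "India won the match",
    "The match was played in India",
    "Fans in India celebrated",
]

# Hypothetical stand-in for noun detection. The real transformation
# relies on its own English dictionary and linguistic information.
KNOWN_NOUNS = {"india", "match", "fans"}


def extract_terms(text):
    """Return the known nouns found in one row, lowercased."""
    return [word for word in text.lower().split() if word in KNOWN_NOUNS]


def term_scores(rows):
    """Count how often each term occurs across all rows (frequency score)."""
    counts = Counter()
    for row in rows:
        counts.update(extract_terms(row))
    return counts


scores = term_scores(rows)
for term, score in scores.most_common():
    print(term, score)  # one Term / Score pair per output row
```

With these sample rows, india is counted three times, match twice, and fans once, which mirrors the Term and Score pairs the transformation would write to its output.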
Exclusion Tab in Term Extraction Transformation
This tab is used to exclude unwanted terms from the extraction. For example, when we are extracting terms from source data that contains reviews of all our company's products, we don't need to extract our own product names from the input text. To add exclusion terms to the Term Extraction Transformation, check the Use Exclusion Terms option shown in the screenshot below.
TIP: Please refer to the Exclusion Tab in Term article to understand the configuration of the Exclusion tab.
The options available in the Exclusion tab to configure the exclusion list are:
- OLE DB connection manager: The Term Extraction Transformation supports only the OLE DB connection manager for connecting to the server that holds the exclusion list. Select an existing connection manager from the drop-down list if you have already created one, or click the New button to create a new connection.
- New: Create a new connection to a database using the OLE DB Connection Manager dialog box.
- Table or view: From the drop-down list, select the table or view that contains the exclusion terms.
- Column: Select the column of the table or view that contains the exclusion terms.
- Configure Error Output: Click this button to specify how the transformation should handle rows that cause errors.
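The effect of an exclusion list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the term_scores dictionary, the exclusion_terms list, and the apply_exclusions helper are hypothetical, and the case-insensitive comparison is a choice made for this sketch rather than documented SSIS behaviour.

```python
def apply_exclusions(term_scores, exclusion_terms):
    """Drop every term that appears in the exclusion list.

    Comparison is case-insensitive here; that is an assumption of
    this sketch, not a statement about how SSIS matches terms.
    """
    excluded = {term.lower() for term in exclusion_terms}
    return {
        term: score
        for term, score in term_scores.items()
        if term.lower() not in excluded
    }


# Hypothetical Term/Score pairs produced by an extraction step.
term_scores = {"battery": 5, "Contoso Phone": 9, "screen": 4}

# Hypothetical values read from the exclusion table's column,
# for example our own product names.
exclusion_terms = ["contoso phone"]

filtered = apply_exclusions(term_scores, exclusion_terms)
print(filtered)
```

Here the product name is removed and only battery and screen remain, which is the kind of result the exclusion table is meant to produce.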
Use the Advanced tab in the Term Extraction Transformation Editor to configure the extraction properties.
As you can observe from the screenshot above, the following options are available in the Advanced tab:
- Noun: If you select this option, the Term Extraction Transformation will extract only nouns from the input text. Please refer to the Term Extraction Transformation in SSIS article for further reference.
- Noun phrase: If you select this option, the transformation will extract only noun phrases from the input text. Please refer to the Extract Noun Phrases using Term Extraction Transformation in SSIS article for further reference.
- Noun and noun phrase: If you select this option, the transformation will extract both nouns and noun phrases from the input text. Please refer to the Extract Nouns and Noun Phrases using Term Extraction Transformation in SSIS article for further reference.
- Frequency: If you select this option, the Score column will store the frequency of the term, that is, the number of times it is repeated in the input column.
- TFIDF: If you select this option, the Score column will store the TFIDF (term frequency, inverse document frequency) value of the term.
- Frequency threshold: If we specify 3, the transformation will extract only the terms that are repeated at least 3 times in the column and will ignore the terms repeated fewer than 3 times.
- Maximum length of term: Specify the maximum number of words allowed in a term. This option affects noun phrases only, so it applies when the Noun phrase or Noun and noun phrase option is selected.
- Use case-sensitive term extraction: Check this option if you want to perform case-sensitive extraction.
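The two score types and the frequency threshold can be sketched as follows. The TFIDF expression follows the formula given in the Microsoft documentation, (frequency of T) * log(number of rows / number of rows containing T). Everything else is an assumption of this sketch: the score_terms function, its parameter names, the sample data, the use of the natural logarithm, and the choice to apply the threshold to the raw frequency for both score types.

```python
import math
from collections import Counter


def score_terms(rows_terms, score_type="frequency", frequency_threshold=2):
    """Score terms that were already extracted from each row.

    rows_terms: list of lists, one list of terms per input row.
    score_type: "frequency" or "tfidf".
    """
    frequency = Counter()       # total occurrences of each term
    rows_with_term = Counter()  # number of rows that contain each term
    for terms in rows_terms:
        frequency.update(terms)
        rows_with_term.update(set(terms))

    total_rows = len(rows_terms)
    result = {}
    for term, freq in frequency.items():
        if freq < frequency_threshold:
            continue  # below the threshold: the term is not emitted
        if score_type == "tfidf":
            # Natural logarithm is an assumption of this sketch.
            result[term] = freq * math.log(total_rows / rows_with_term[term])
        else:
            result[term] = freq
    return result


# Hypothetical terms extracted from four input rows.
rows_terms = [
    ["india", "match"],
    ["match", "india"],
    ["fans", "india"],
    ["match", "ticket"],
]

print(score_terms(rows_terms, "frequency", frequency_threshold=3))
print(score_terms(rows_terms, "tfidf", frequency_threshold=3))
```

With a threshold of 3, only india and match survive because each occurs three times, while fans and ticket are dropped. Switching the score type changes the value stored in Score but not which terms pass the threshold in this sketch.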