The Term Extraction transformation in SSIS first extracts terms from the text present in the source data and then writes the extracted terms to a Transformation output column.
For example, people are writing reviews on your products, and you want to contact them for further guidance. In these situations, you can use SSIS Term Extraction transformation to extract the email address and name of the user from the reviews.
NOTE: The SSIS Term Extraction transformation uses its own English dictionary and linguistic setting to extract the Term from the source data.
In SSIS, We can perform Term Extraction Transformations only on the column with the DT_WSTR and DT_NTEXT data types. If your input column is different from these two, then Please use the Data Conversion to convert other Data Types to DT_WSTR and DT_NTEXT data type
TIP: Please refer to the Term lookup Transformation article to understand the term lookup techniques.
Steps involved in configuring Term Extraction in SSIS
When you double-click on this transformation, an SSIS Term Extraction Transformation Editor window will open to configure it. It contains three Tabs, such as Term Extraction, Exclusion, and Advanced Tab.
Term Extraction Tab
Within the SSIS Term Extraction tab, We have to select the column name of the Source data from the Available Input Columns option.
The SSIS Term Extraction transformation produces only two output columns. The default names of the columns are Term and Score, but you can change them as per your requirement.
- Term: This column contains the extracted terms from the text. For example, if we are extracting the Nouns, then all the nouns will be stored in this column.
- Score: This column contains the number of times a term is repeated in the input column. For example, India is the first term extracted from the text. Term Extraction Transformation will check all the rows and counts the number of times Term India is repeated in all the rows available in that input column.
Exclusion Tab in SSIS Term Extraction Transformation
This tab is used to exclude unwanted terms from the extraction. For example, when we are extracting terms from source data that contains product reviews about all our company products, we don’t need to extract the Product name from the input text. To add the Exclusion Terms to the Term Extraction Transformation, please checkmark the Use Exclusion Terms option
The list of options available in the Exclusion Tab to configure the exclusion list is as follows:
- OLE DB connection manager: SSIS Term Extraction Transformation only supports the OLE DB connection manager to connect with the server holding the exclusion list. So, select an existing one from the drop-down list if you already created it. Or, if you want to create a new connection, then click on the New button.
- New: Create a New connection to a database using the OLE DB Connection Manager dialog box.
- Table or view: Select the table or view from the drop-down list which contains the exclusion terms.
- Column: Select the column name from the table or view which contains the exclusion terms.
- Configure Error Output: Click on this button to configure the errors.
Use the Advanced tab in the SSIS Term Extraction Transformation Editor to configure the extraction properties.
From the above screenshot, you can observe that the following is the list of options available in the Advanced Tab
- Noun: If you select this option, SSIS Term Extraction Transformation will extract only Nouns from the input text. Please refer to Term Extraction Transformation for further reference.
- Noun phrase: If you select this option, the transformation will extract only Noun Phrases from the input text. Please refer to Extract Noun Phrases using Term Extraction Transformation for further reference.
- Noun and noun phrase: The SSIS Term Extraction transformation will extract both Nouns and Noun Phrases from the input text. Please refer to Extract Nouns and Noun Phrases using Term Extraction Transformation for further reference.
- Frequency: If you select this option, the Score column stores the information of the Frequency of the Term repeated in the input column.
- TFIDF: The Score column will store the information on the TFIDF value of the Term.
- Frequency threshold: If we specify 3, the transformation extracts the Terms if they are repeated at least three times in the column, and it ignores the terms repeated less than three times.
- Maximum length of term: Please provide the maximum length of a word or phrase. This option is available if we select the Noun Phrases only option.
- Use case-sensitive term extraction: Please check mark this option if you want to perform the Case-Sensitive extraction.