To work with the tool in Foresight Analytics Platform 10, use the new interface.
The Delete Duplicates transformer is an object that deletes records with duplicate field values. The object has one provider at its input and two consumers at its output. To configure the transformer, create a list of fields on the Checked Fields page; these fields are used to build unique combinations of values and to search for duplicate combinations. Records are sorted by the selected fields beforehand to speed up the transformer. On the Deduplicator page, set the expression that is calculated for duplicate records and the rule for selecting the single record that is sent to the data consumer.
When the Delete Duplicates transformer is applied to the table below:
Key | Date | Value |
4 | Summer | 1111 |
1 | Winter | 2222 |
5 | Summer | 3333 |
2 | Winter | 4444 |
4 | Summer | 1111 |
6 | Summer | 5555 |
5 | Summer | 3333 |
3 | Winter | 6666 |
it can be converted into a table without duplicates:
Key | Date | Value |
4 | Summer | 1111 |
1 | Winter | 2222 |
5 | Summer | 3333 |
2 | Winter | 4444 |
6 | Summer | 5555 |
3 | Winter | 6666 |
and a table that contains deleted duplicates:
Key | Date | Value |
4 | Summer | 1111 |
5 | Summer | 3333 |
Thus, a record is deleted as a duplicate only if the values of all its fields match those of another record.
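The example above can be sketched in Python. This is only an illustration of the behavior shown in the tables, not the platform's implementation: the function name split_duplicates and the tuple representation of records are assumptions made for the sketch. The first occurrence of each record is kept, and every later occurrence goes to the second output.

```python
from typing import Hashable, Iterable, List, Tuple

Record = Tuple[Hashable, ...]


def split_duplicates(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Hypothetical helper: return (unique_records, removed_duplicates).

    Two records are duplicates only if all their fields are equal.
    Input order is preserved in both outputs.
    """
    seen = set()
    unique: List[Record] = []
    removed: List[Record] = []
    for record in records:
        if record in seen:
            # An identical record was already passed on: treat as duplicate.
            removed.append(record)
        else:
            seen.add(record)
            unique.append(record)
    return unique, removed


# Rows from the example table: (Key, Date, Value).
rows = [
    (4, "Summer", 1111),
    (1, "Winter", 2222),
    (5, "Summer", 3333),
    (2, "Winter", 4444),
    (4, "Summer", 1111),
    (6, "Summer", 5555),
    (5, "Summer", 3333),
    (3, "Winter", 6666),
]

unique, removed = split_duplicates(rows)
```

Here unique corresponds to the table without duplicates and removed to the table of deleted duplicates, matching the two consumers at the transformer's output.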
To set the list of fields and the link to the input, use the Edit Input page.
The following parameters are available on the page:
NOTE. The screenshot shows the editing wizard for the Repository data consumer.
The Edit Output page is used to set links to the consumer object, to which data is loaded when the ETL task is executed.
NOTE. The page is common to all data connectors and transformers, except for the Split and User Algorithm transformers. Consider setting up a list of fields and output links using the example of the Repository data source editing wizard.
The following settings are available on the page:
The Edit Output with Error Records page is used to set links to the consumer object, to which information about error records skipped by the transformer is exported.
NOTE. The page is common to all data transformers, except for the Union and User Algorithm transformers. Consider setting up a list of fields and output links using the example of the Split data transformer editing wizard.
The following settings are available on the page:
On the Checked Fields page, set the input fields whose values should be checked for duplicates.
To create a list of checked fields:
Drag the selected field from the Source Fields list to the Selected Fields list.
Select a field in the Source Fields list and an input in the Selected Fields list. Click the Add button.
Click the Delete button to delete the selected field from the list of checked fields.
The selected fields are used to create an index, which works similarly to a relational table index. Records are also sorted by the selected fields to speed up processing. After duplicates are deleted, the records are sent to the data consumer in sorted order.
If no checked fields are set, duplicates are determined by combinations of values of all data provider fields. Records are not sorted in this case. Work in this mode may take a long time.
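The two modes described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the transformer's actual algorithm: the function name group_duplicates and the dictionary representation of records are hypothetical, and sorting plus grouping stands in for the index mentioned in the text. With checked fields, records are sorted by those fields and grouped; without them, whole records are compared and the input order is kept.

```python
from itertools import groupby
from typing import Dict, List, Sequence

Record = Dict[str, object]


def group_duplicates(records: Sequence[Record],
                     checked_fields: Sequence[str] = ()) -> List[List[Record]]:
    """Hypothetical helper: group records that count as duplicates.

    Assumes every record has the same set of fields.
    """
    records = list(records)
    if not records:
        return []
    if checked_fields:
        # Sort by the checked fields, then group equal key combinations.
        # sorted() is stable, so input order is kept inside each group.
        def key(record: Record):
            return tuple(record[f] for f in checked_fields)

        ordered = sorted(records, key=key)
        return [list(group) for _, group in groupby(ordered, key=key)]
    # No checked fields: compare all fields and do not sort.
    all_fields = list(records[0])
    groups: Dict[tuple, List[Record]] = {}
    for record in records:
        groups.setdefault(tuple(record[f] for f in all_fields), []).append(record)
    return list(groups.values())


rows = [
    {"Key": 4, "Date": "Summer", "Value": 1111},
    {"Key": 1, "Date": "Winter", "Value": 2222},
    {"Key": 5, "Date": "Summer", "Value": 3333},
    {"Key": 2, "Date": "Winter", "Value": 4444},
    {"Key": 4, "Date": "Summer", "Value": 1111},
    {"Key": 6, "Date": "Summer", "Value": 5555},
    {"Key": 5, "Date": "Summer", "Value": 3333},
    {"Key": 3, "Date": "Winter", "Value": 6666},
]

by_date = group_duplicates(rows, ["Date"])   # two groups, sorted by Date
by_all_fields = group_duplicates(rows)       # six groups, input order kept
```

Each returned group holds the records that share one combination of checked values; the Deduplicator settings then decide which record of a group is passed on.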
On the Deduplicator page, set the condition that determines which duplicate records are deleted.
The condition is created in the editor window, which opens when the Setup button is clicked.
The rule for selecting a duplicate is determined by the radio button in the Selection Rules group:
Record Satisfies Condition. The first duplicate record satisfying the specified condition is passed to the consumer.
Record does not Satisfy Condition. The first duplicate record not satisfying the specified condition is passed to the consumer.
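The two selection rules can be sketched in Python as follows. The function name select_record, the satisfies flag, and the use of a Python callable as the condition are assumptions made for illustration; in the product the condition is an expression built in the editor. The source does not say what happens when no record in a group matches the rule, so this sketch simply returns None in that case.

```python
from typing import Callable, Dict, Optional, Sequence

Record = Dict[str, object]


def select_record(group: Sequence[Record],
                  condition: Callable[[Record], bool],
                  satisfies: bool = True) -> Optional[Record]:
    """Hypothetical helper: pick the record passed to the consumer.

    satisfies=True models "Record Satisfies Condition";
    satisfies=False models "Record does not Satisfy Condition".
    """
    for record in group:
        if bool(condition(record)) == satisfies:
            # The first record matching the rule is passed on.
            return record
    # Assumption: behavior for a group with no matching record is not
    # described in the source, so nothing is selected here.
    return None


# A group of records that share the same checked field value (Key).
group = [
    {"Key": 4, "Value": 1111},
    {"Key": 4, "Value": 2000},
    {"Key": 4, "Value": 3000},
]


def is_large(record: Record) -> bool:
    return record["Value"] > 1500


first_matching = select_record(group, is_large, satisfies=True)
first_not_matching = select_record(group, is_large, satisfies=False)
```

With this sample group, the first rule selects the record with Value 2000 and the second selects the record with Value 1111.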
See also:
Getting Started with the ETL Task Tool in the Web Application | Data Transformers