The Delete Duplicates transformer is an object that deletes duplicate data. The object has one provider at its input and two consumers at its output. Duplicate values are deleted based on a specified index. A condition is set to select the records to be deleted.
To ensure efficient duplicate deletion, provider data should be ordered by index. After the operation is executed, the data remains ordered.
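One common reason ordering helps is that records with equal index values become adjacent, so each record only needs to be compared with the previous one. The Python sketch below illustrates that idea only; it is not the transformer's actual implementation, and the names `dedupe_sorted` and `key_index` are hypothetical.

```python
from typing import Iterable, Iterator, Sequence


def dedupe_sorted(records: Iterable[Sequence], key_index: int = 0) -> Iterator[Sequence]:
    """Yield the first record of each run of equal keys.

    Assumes the input is already ordered by the field at key_index.
    """
    previous_key = object()  # sentinel that never equals a real key
    for record in records:
        key = record[key_index]
        if key != previous_key:
            yield record
            previous_key = key


# Sample rows ordered by the first field (the hypothetical index).
ordered = sorted([
    (4, "Summer", 1111),
    (1, "Winter", 2222),
    (4, "Summer", 1111),
    (5, "Summer", 3333),
])
result = list(dedupe_sorted(ordered))
```

Because records are emitted in the order they arrive, the output of this sketch stays ordered by the same field.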
Using the Delete Duplicates transformer, the table below:
Key | Date | Value |
4 | Summer | 1111 |
1 | Winter | 2222 |
5 | Summer | 3333 |
2 | Winter | 4444 |
4 | Summer | 1111 |
6 | Summer | 5555 |
5 | Summer | 3333 |
3 | Winter | 6666 |
can be converted into a table without duplicates:
Key | Date | Value |
4 | Summer | 1111 |
1 | Winter | 2222 |
5 | Summer | 3333 |
2 | Winter | 4444 |
6 | Summer | 5555 |
3 | Winter | 6666 |
and a table that contains deleted duplicates:
Key | Date | Value |
4 | Summer | 1111 |
5 | Summer | 3333 |
Thus, a record is deleted as a duplicate only if the values of all its fields match those of another record.
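The split shown in the tables above can be reproduced with a short Python sketch. It is an illustration only, not the transformer's implementation; the function name `split_duplicates` is hypothetical, and the sketch assumes that the first occurrence of each record is kept while later occurrences go to the second output.

```python
from typing import Iterable, List, Tuple


def split_duplicates(records: Iterable[tuple]) -> Tuple[List[tuple], List[tuple]]:
    """Return (unique, duplicates), comparing all fields of each record."""
    seen = set()
    unique: List[tuple] = []
    duplicates: List[tuple] = []
    for record in records:
        if record in seen:
            duplicates.append(record)  # repeat of an earlier record
        else:
            seen.add(record)
            unique.append(record)      # first occurrence is kept
    return unique, duplicates


# Rows of the source table: (Key, Date, Value).
rows = [
    (4, "Summer", 1111),
    (1, "Winter", 2222),
    (5, "Summer", 3333),
    (2, "Winter", 4444),
    (4, "Summer", 1111),
    (6, "Summer", 5555),
    (5, "Summer", 3333),
    (3, "Winter", 6666),
]
unique, duplicates = split_duplicates(rows)
```

Here `unique` holds the six rows of the table without duplicates, and `duplicates` holds the two deleted rows with keys 4 and 5.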
The object name, identifier and comment are set in basic properties.
To set a list of fields and link to input, use the Edit Input page.
The following parameters are available on the page:
NOTE. The screenshot shows the edit wizard for the Repository data consumer.
To set a list of fields and output link, use the Output Edit page.
The following settings are available on the page:
NOTE. The screenshot shows the edit wizard for the Repository data provider.
The Edit Outputs with Error Records page is used to set a link to the consumer that receives information about error records skipped by the transformer:
Specify the identifier of the error output and select an available link to the consumer object that receives information about error records.
Click the Settings button to specify advanced settings that apply when error records occur:
Specify the maximum number of error records about which information is exported. The default value is -1, in which case information about all error records is exported.
NOTE. If there is a large number of error records, information export may slow down ETL task runtime.
If the maximum number of output records is set, select the action to execute if this number is exceeded. By default, record output is not stopped.
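The limit logic described above can be sketched in Python as follows. This is a rough illustration under stated assumptions: the available actions are not listed on this page, so the two behaviors below (silently skip further error records, or stop with an exception) are hypothetical, as are the names `export_error_records`, `max_records` and `stop_on_exceed`.

```python
from typing import Iterable, List


def export_error_records(
    errors: Iterable[dict],
    max_records: int = -1,          # -1 means no limit, as in the default setting
    stop_on_exceed: bool = False,   # hypothetical action flag; default does not stop
) -> List[dict]:
    """Collect information about error records, honoring an optional limit."""
    exported: List[dict] = []
    for error in errors:
        limit_reached = max_records != -1 and len(exported) >= max_records
        if limit_reached:
            if stop_on_exceed:
                raise RuntimeError("maximum number of error records exceeded")
            continue  # assumed default: keep running, skip export of this record
        exported.append(error)
    return exported


# Illustrative error records.
errors = [{"row": 1}, {"row": 2}, {"row": 3}]
all_errors = export_error_records(errors)                 # no limit
first_two = export_error_records(errors, max_records=2)   # limited to two
```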
NOTE. The screenshot shows the edit wizard for the Split data transformer.
On the Checked Fields page, set the input fields whose values should be checked for duplicates.
To create a list of checked fields:
Drag the selected field from the Source Fields list to the Selected Fields list.
Select a field in the Source Fields list, and an input in the Selected Fields list. Click the Add button.
Click the Delete button to delete the selected field from the list of checked fields.
If no checked field is defined, an attempt to go to the next page brings up a confirmation dialog box.
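The effect of choosing checked fields can be illustrated with a small Python sketch in which records count as duplicates when their values match in the checked fields only. The function name `split_by_checked_fields` is hypothetical, the sample rows are illustrative, and rejecting an empty field list is a choice of this sketch, not documented transformer behavior.

```python
from typing import Dict, List, Sequence, Tuple


def split_by_checked_fields(
    records: Sequence[Dict[str, object]],
    checked_fields: Sequence[str],
) -> Tuple[List[dict], List[dict]]:
    """Return (unique, duplicates), comparing only the checked fields."""
    if not checked_fields:
        # Sketch-only choice: require at least one checked field.
        raise ValueError("at least one checked field is required")
    seen = set()
    unique: List[dict] = []
    duplicates: List[dict] = []
    for record in records:
        key = tuple(record[field] for field in checked_fields)
        if key in seen:
            duplicates.append(record)
        else:
            seen.add(key)
            unique.append(record)
    return unique, duplicates


records = [
    {"Key": 4, "Date": "Summer", "Value": 1111},
    {"Key": 1, "Date": "Winter", "Value": 2222},
    {"Key": 5, "Date": "Summer", "Value": 3333},
]
# Only the Date field is checked, so the second "Summer" row is a duplicate.
unique_by_date, duplicates_by_date = split_by_checked_fields(records, ["Date"])
```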
On the Deduplicator page, set the condition used to select the records to be deleted.
The condition is created in the editor dialog box, which opens on clicking the button.
The duplicate selection rule is determined by the radio button selected in the Selection Rules group:
Record Satisfies Condition. The first duplicate record meeting the specified condition is passed to the consumer.
Record does not Satisfy Condition. The first duplicate record not meeting the specified condition is passed to the consumer.
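The two selection rules can be sketched in Python for a single group of duplicate records. This is an illustration under assumptions: the function name `pick_from_duplicates`, the `satisfies` flag and the sample condition are hypothetical, and the sketch returns `None` when no record in the group qualifies because this page does not state what the transformer does in that case.

```python
from typing import Callable, Optional, Sequence


def pick_from_duplicates(
    group: Sequence[dict],
    condition: Callable[[dict], bool],
    satisfies: bool = True,
) -> Optional[dict]:
    """Return the first record whose condition result equals `satisfies`.

    satisfies=True  models "Record Satisfies Condition".
    satisfies=False models "Record does not Satisfy Condition".
    """
    for record in group:
        if bool(condition(record)) == satisfies:
            return record
    return None  # no record in the group qualifies


# Illustrative group of records that share the same checked-field values.
group = [
    {"Key": 4, "Date": "Summer", "Value": 1111},
    {"Key": 4, "Date": "Summer", "Value": 9999},
]


def is_large(record: dict) -> bool:
    """Hypothetical condition: the Value field exceeds 5000."""
    return record["Value"] > 5000


kept_if_satisfies = pick_from_duplicates(group, is_large, satisfies=True)
kept_if_not = pick_from_duplicates(group, is_large, satisfies=False)
```

With this sample data, the first rule passes the record with Value 9999 to the consumer and the second rule passes the record with Value 1111.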