CleanData.drop_duplicate_rows
Drops duplicate rows from the input dataframe. The method performs the following steps:
- Exclude variables of the “Index” category from the duplicate search
- Make a working copy of the input dataframe
- Get the index of duplicate rows
- Drop the duplicate rows from the dataframe
- Get the dropped rows from the original dataframe and save them as an Excel file (CleanData.dropped_duplicated_rows_filename)
- Update the new filename and the new input dataframe (suffix used: CleanData.suffix_dropped_duplicated_rows)
CleanData.drop_duplicate_rows()
Parameters None.
Returns None.
Notes
- The updated dataframe can be found as CleanData.clean_df.
- A copy of the cleaned data can be found in the folder CleanData.train_data_path, with the suffix CleanData.suffix_dropped_duplicated_rows.
- A record of the dropped rows can be found in the folder CleanData.train_data_path, with the filename CleanData.dropped_duplicated_rows_filename.
Relevant Definitions Settings
- SUFFIX_DROPPED_DUPLICATED_ROWS: suffix to append to the end of the output filename of the input data, e.g. “DD”.
- OUTPUT_DROPPED_DUPLICATED_ROWS_FILENAME: name of the output file that stores the duplicated rows which have been dropped, e.g. “rowsRemoved.xlsx”.
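As a rough illustration of how these two settings could translate into output paths, here is a minimal sketch using only the standard library. The helper `add_filename_suffix`, the folder `data` and the input name `train.xlsx` are hypothetical, and placing the suffix directly before the file extension with no separator is an assumption; the library may build the name differently.

```python
from pathlib import Path


def add_filename_suffix(path, suffix):
    """Insert `suffix` between the file stem and its extension (assumed convention)."""
    p = Path(path)
    return p.with_name(f"{p.stem}{suffix}{p.suffix}")


# Example values taken from the settings described above.
SUFFIX_DROPPED_DUPLICATED_ROWS = "DD"
OUTPUT_DROPPED_DUPLICATED_ROWS_FILENAME = "rowsRemoved.xlsx"

train_data_path = Path("data")   # hypothetical folder
input_filename = "train.xlsx"    # hypothetical input file name

# Copy of the cleaned data: input filename plus the suffix.
cleaned_path = train_data_path / add_filename_suffix(
    input_filename, SUFFIX_DROPPED_DUPLICATED_ROWS
)
# Record of the dropped rows: fixed filename in the same folder.
dropped_path = train_data_path / OUTPUT_DROPPED_DUPLICATED_ROWS_FILENAME

print(cleaned_path.name)
print(dropped_path.name)
```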
Examples
See Example cleanData for detailed setup and outputs.
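For readers who want a feel for the listed steps without the full setup, the following is a minimal pandas sketch, not the library's implementation. The function `drop_duplicate_rows_sketch` and its `index_columns` argument are hypothetical stand-ins for the method and for the variables of the “Index” category, and keeping the first occurrence of each duplicate is an assumption. Writing the files is left out so the sketch runs without an Excel writer installed.

```python
import pandas as pd


def drop_duplicate_rows_sketch(df, index_columns):
    """Return (clean_df, dropped_rows) for the given dataframe.

    `index_columns` is a hypothetical stand-in for the variables of the
    "Index" category; they are ignored when searching for duplicates.
    """
    # Columns that take part in the duplicate search.
    search_columns = [c for c in df.columns if c not in index_columns]

    # Work on a copy so the caller's dataframe is left untouched.
    working_df = df.copy()

    # Rows that repeat an earlier row (assumption: the first occurrence is kept).
    is_duplicate = working_df.duplicated(subset=search_columns, keep="first")
    duplicate_index = working_df.index[is_duplicate]

    # Drop the duplicates and keep a record of what was removed.
    clean_df = working_df.drop(index=duplicate_index)
    dropped_rows = df.loc[duplicate_index]
    return clean_df, dropped_rows


raw = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],          # index-like column, excluded from the search
        "age": [30, 41, 30, 52],
        "score": [0.5, 0.7, 0.5, 0.9],
    }
)
clean, dropped = drop_duplicate_rows_sketch(raw, index_columns=["ID"])
print(clean)
print(dropped)
```

In this sketch the row with ID 3 matches the row with ID 1 on every non-index column, so it ends up in `dropped` while `raw` itself is unchanged. The dropped rows could then be saved with `DataFrame.to_excel`, which requires an Excel writer engine such as openpyxl.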