CleanData.drop_duplicate_rows

Drops duplicate rows from the input dataframe. The method performs the following steps:

  1. Exclude variables of the “Index” category from the duplicate search
  2. Make a working copy of the input dataframe
  3. Get the index of duplicate rows
  4. Drop the duplicate rows from the dataframe
  5. Get the dropped rows from the original dataframe and save them as an Excel file (CleanData.dropped_duplicated_rows_filename)
  6. Update the stored filename and the input dataframe so that they refer to the cleaned data (suffix used: CleanData.suffix_dropped_duplicated_rows)
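
The steps above can be sketched in plain pandas. This is a minimal illustration under stated assumptions, not the library's actual implementation: the function name drop_duplicate_rows_sketch, the index_columns argument, and the example column names are hypothetical, and keeping the first occurrence of each duplicate group is an assumption. Writing the Excel file is left as a comment so the sketch runs without an Excel writer installed.

```python
import pandas as pd


def drop_duplicate_rows_sketch(df, index_columns):
    """Hypothetical helper mirroring steps 1-5; not part of CleanData."""
    # 1. Exclude "Index" category variables from the duplicate search.
    search_columns = [c for c in df.columns if c not in index_columns]

    # 2. Work on a copy so the original dataframe is left untouched.
    working = df.copy()

    # 3. Index labels of rows that repeat an earlier row
    #    (assumption: the first occurrence is kept).
    is_duplicate = working.duplicated(subset=search_columns, keep="first")
    duplicate_index = working.index[is_duplicate]

    # 4. Drop the duplicate rows from the working copy.
    cleaned = working.drop(index=duplicate_index)

    # 5. Recover the dropped rows from the original dataframe.
    dropped = df.loc[duplicate_index]
    # dropped.to_excel("rowsRemoved.xlsx")  # optional; needs an Excel writer

    return cleaned, dropped


# Illustrative data: patient_id plays the role of an "Index" variable.
df = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 4],
        "age": [30, 30, 45, 30],
        "sex": ["F", "F", "M", "F"],
    }
)
cleaned, dropped = drop_duplicate_rows_sketch(df, index_columns=["patient_id"])
```

Because patient_id is excluded from the search, the rows for patients 2 and 4 count as duplicates of patient 1 even though their identifiers differ.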

CleanData.drop_duplicate_rows()

Parameters None.

Returns None.

Notes

  • The updated dataframe is available as CleanData.clean_df.
  • A copy of the cleaned data is saved in the folder CleanData.train_data_path, with the suffix CleanData.suffix_dropped_duplicated_rows appended to the filename.
  • A record of the dropped rows is saved in the folder CleanData.train_data_path, under the filename CleanData.dropped_duplicated_rows_filename.

Relevant Definitions Settings

  • SUFFIX_DROPPED_DUPLICATED_ROWS: suffix to append to the end of the output filename of the input data. E.g. “DD
  • OUTPUT_DROPPED_DUPLICATED_ROWS_FILENAME: output file name to store the duplicated rows which have been dropped. E.g. “rowsRemoved.xlsx”.
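
As a rough illustration of how a suffix setting could be applied to an output filename, the sketch below inserts the suffix between the file stem and its extension. The helper add_suffix, the filename train_data.xlsx, and the suffix value "_dedup" are all hypothetical placeholders; the library may build its filenames differently.

```python
from pathlib import Path


def add_suffix(filename, suffix):
    """Hypothetical helper: append `suffix` to the stem, keep the extension."""
    path = Path(filename)
    return path.with_name(path.stem + suffix + path.suffix)


# Placeholder values, not taken from the library's settings.
new_name = add_suffix("train_data.xlsx", "_dedup")
print(new_name)
```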

Examples

See Example cleanData for detailed setup and outputs.


Copyright © 2023 BiomedDAR. Distributed under an MIT license.