CleanData
class CleanData(definitions=None, debug=True)
Module for data cleaning. Designed to work with a data dictionary (metadata). See data dictionary template and data dictionary guide for more details.
Parameters
definitions: file (.py) containing global variables
Global variables for raw data files
PREFIX_PATH: define the root directoryRAW_PATH: set the folder name for all raw data filesTRAIN_PATH: set the folder name to store all the cleaned data files. If not specified, default is “trainData”READ_NA: option for loading csvs. IfFalse, csv entries that can be found in nanList (see detailed nanList under Attributes) will be converted toNaN. ifTrue, above entries will be preserved as they are. Default isFalse.RAWXLSX: filename containing the raw dataRAWXLSX_SHEETNAME: if RAWXLSX is an excel file, assign the sheetname from which to load the data. If not specified, will read the first sheet.RAWDICTXLSX: filename containing the data dictionaryRAWDICTXLSX_SHEETNAME: if RAWDICTXLSX is an excel file, assign the sheetname from which to load the dictionary. If not specified, will read the first sheet.
Settings for Log Files
LOGGING: A boolean. Whether to output logfile or not. If not specified, default isTrue.LOG_FILENAME: A string. Filename of log file. If not defined, default is “logfile.txt”.
Settings for Longitudinal data
LONG_VAR_MARKER: (in preparation for longitudinal data) the variable name that indicates which longitudinal group that row belongs to. If not specified, default isNone.
Settings for Data Dictionary
DICT_VAR_VARNAME: column in data dictionary containing variable names in input data. If not specified, set as “NAME”.DICT_VAR_VARCATEGORY: column in data dictionary setting the category of the variable name. If not specified, set as “CATEGORY”DICT_VAR_TYPE: column in data dictionary setting the type of variable (string, numeric, date). If not specified, set as “TYPE”DICT_VAR_CODINGS: column in data dictionary setting the codings of variable (dateformat, categories). If not specified, set as “CODINGS”
Settings for Data Cleaning
VAR_NAME_STRIPEMPTYSPACES: boolean option. IfTrue, empty spaces will be stripped from variable names in input data, and from variables names listed in data dictionary. If not specified, default isFalse.OUTPUT_TYPE_DATA: the output file type for the clean data files. Available options: ‘csv’, ‘xlsx’. If not specified, default is ‘csv’OUTPUT_TYPE_DICT: the output file type fot the amended dictionary. Available options: ‘csv’, ‘xlsx’. If not specified, default is ‘xlsx’
Settings for Drop Duplicates
SUFFIX_DROPPED_DUPLICATED_ROWS: The filename suffix to use for intermediate outputs of cleaned data. If not specified, default isDD.OUTPUT_DROPPED_DUPLICATED_ROWS_FILENAME: output file name to store the duplicated rows which have been dropped. Default isrowsRemoved.xlsx
Settings for Constraints
SUFFIX_CONSTRAINTS: The filename suffix to use for intermediate outputs of cleaned data. If not specified, default isCON.
Settings for Standardise Text
SUFFIX_STANDARDISE_TEXT: The filename suffix to use for intermediate outputs of cleaned data. If not specified, default isST.OPTIONS_STANDARDISE_TEXT_CASE_TYPE: default case type to convert strings into: “uppercase”, “lowercase”, “capitalise”. If not specified, default isuppercase.OPTIONS_STANDARDISE_TEXT_EXCLUDE_LIST: variables to exclude from the conversion. For example:["Gender", "Work"]OPTIONS_STANDARDISE_TEXT_CASE_TYPE_DICT: dictionary to customise case_type for specific variables, overwriting default. For example:{"Race1": "capitalise"}.
Settings for Date Standardisation
SUFFIX_STANDARDISE_DATE: The filename suffix to use for intermediate outputs of cleaned data. If not specified, default isDATE.OPTIONS_STANDARDISE_DATE_FORMAT: the standard date format to use for all dates (if not specified, default isyyyy-mm-dd). Follows format used in ms-excel, see ref. Example:ddd, dd mmmm yy.OPTIONS_FAILEDDATE_CONVERSIONS_FILENAME: filename for storing list of failed date conversions (only csv). Default isfailed_date_conversions.csv
Settings for ASCII Conversion
SUFFIX_CONVERT_ASCII: The filename suffix to use for intermediate outputs of cleaned data. If not specified, default isASCII.OPTIONS_CONVERT_ASCII_EXCLUSION_LIST: List of characters to exclude from conversion. Eg.['€','$','Ò']
debug: boolean, default None.
If True, print intermediate outputs for debugging purposes
Notes
Examples
Please refer to the below pages for detailed examples:
| Example | Description |
|---|---|
| cleanData 1 | Demonstrates use of definitions.py, dropping duplicate rows, and standardising text variables (capital/small letters) |
| CleanData 2 | Demonstrates use of definitions.py, standardising date formats, conversion of characters from international accents to ASCII-compatible symbols |
| CleanData 3 | Demonstrates use of customised constraints |
Attributes
| Attribute | Description |
|---|---|
| debug | (boolean) whether to debug or not |
| definitions | (obj) definitions in corresponding input defintions.py |
| nanList | (list) list of values interpreted as NaN. Values: ["","#N/A","#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null "]. |
| dict_df | (dataframe) data dictionary |
| raw_df | (dataframe) raw data |
| clean_df | (dataframe) cleaned data |
| clean_dict_df | (dataframe) cleaned data dictionary |
| clean_split_df | (dict of dataframes) dictionary of dataframes for split-data |
| report_df | (dataframe) report |
| cat_var_dict | (dict) with categories as keys and the respective variables as values, {cat: [list of variables]} |
| type_var_dict | (dict) with the variable type as keys and the respective variables as values |
| longitudinal_marker_list | (list) list of longitudinal markers |
| longitudinal_variableMarker | (str) column header which contains the list of categories stipulating a list of longitudinal markers |
| longitudinal | (bool) boolean indicating if the data is longitudinal |
| longitudinal_freq_set_dict | (dict) list of variables for each longitudinal frequency |
| dict_var_codings | (str) column name in data dict. setting the codings of the variable e.g. "CODINGS" |
| dict_var_type | (str) column name in data dict. setting the type of variable e.g. "TYPE" |
| dict_var_varcategory | (str) column name in data dict. setting the category of the variable name e.g. "CATEGORY" |
| dict_var_varsecondary | (str) column name in data dict. setting if the variable is a secondary variable e.g. "SECONDARY" |
| dict_var_varname | (str) column name in data dict containing variable names in input data e.g. "NAME" |
| raw_data_dict_filename | (str) full path of original data dictionary |
| raw_data_dict_sheetname | (str) sheet name of original data dictionary |
| raw_data_filename | (str) full path of raw data |
| raw_data_sheetname | (str) sheet name of raw data |
| raw_data_path | (str) full path of folder storing raw data |
| train_data_path | (str) full path of folder storing training data |
| folder_rawData | (str) folder storing raw data, e.g. "rawData" |
| folder_trainData | (str) folder storing training data (outputs of cleaning modules), e.g. "trainData" |
| data_latest_filename | (str) latest filename for cleaned data |
| dict_latest_filename | (str) latest filename for cleaned data dictionary |
| latest_filename_split_dict | (dict) dictionary of filenames for split-datasets |
| dropped_duplicated_rows_filename | (str) filename for storing dropped duplicate rows |
| output_removed_secondary_filename | (str) filename for storing dropped secondary columns |
| initial_report_filename | (str) filename for storing report |
| log_filename | (str) filename for storing logs |
| log_filepath | (str) full path of log_filename |
| logger | (obj) logger to pass on for logging |
| logging | (boolean) whether to do logging or not |
| options_convert_ascii_exclusion_list | (list) list of symbols to exclude from ASCII conversion |
| options_secondary_removal_exclude_list | (list) list of secondary variables to exclude from removal process |
| options_faileddate_conversions_filename | (str) filename for storing list of failed date conversions |
| options_standardise_date_format | (str) date standardisation format e.g. "yyyy-mm-dd" |
| options_standardise_text_case_type | (str) option for standardising case format, e.g. "uppercase" |
| options_standardise_text_case_type_dict | (dict) dictionary to customise case_type for specific variables. |
| options_standardise_text_exclude_list | (list) variables to exclude from the conversion |
| output_type_data | (str) the output file type for the clean data files |
| output_type_dict | (str) the output file type fot the amended dictionary |
| prefix_path | (str) prefix path as stipulated in dictionary |
| suffix_constraints | (str) suffix to append to data_latest_filename after applying constraints |
| suffix_convert_ascii | (str) suffix to append to data_latest_filename after converting ASCII symbols |
| suffix_dropped_duplicated_rows | (str) suffix to append to data_latest_filename after dropping duplicate rows |
| suffix_standardise_date | (str) suffix to append to data_latest_filename after standardising dates |
| suffix_standardise_text | (str) suffix to append to data_latest_filename after standardising text case |
| suffix_remove_secondary | (str) suffix to append to data_latest_filename after removing secondary variables |
| var_diff_list | (list) list of mismatched variable names between input data and data dictionary |
| var_list | (list) list of all variables (column headers) found in input data |
| var_secondary_list | (list) list of all secondary variables (column headers) found in input data |
| var_name_stripemptyspaces | (boolean) if True, empty spaces will be stripped from variable names in input data, and from variables names listed in data dictionary. |
Methods
| Method | Description |
|---|---|
| convert_2_dtypes(data) | Convert data (df) into best possible dtypes. |
| gen_data_report(data, dict, [report_filename]) | Generates a report of data |
| drop_duplicate_rows() | Use to drop duplicate rows from the input dataframe. |
| standardise_text_case_conversion(data, case_type) | Takes dataframe (cols) and one case type parameter as input and returns the text data converted as per the case type specified. |
| standardise_text() | Standardises text case in input data. |
| converting_ascii([ascii_exclusion_list, ]) | Converts all characters in input data to ASCII-compatible format. |
| standardise_date([def_date_format, faileddate_conversions_filename]) | Standardises the date/time in input data. |
| remove_secondary_variables() | Remove variables tagged as secondary variables from data. |