Definitions (List of Global Variables)

Attribute	Description	Type
PREFIX_PATH	Defines the root directory path	Directory
RAW_PATH	Defines the folder name for all raw data files	Directory
TRAIN_PATH	Defines the folder name for all cleaned data files. If not specified, default is `"trainData"`	Directory
READ_NA	Option for loading CSVs. If `False`, data entries found in the default nanList are converted to `NaN`. If `True`, those entries are preserved as they are. If not specified, default is `False`	Data loading
RAWXLSX	Defines filename containing the raw data. E.g. “xx.xlsx”, “xx.csv”	Data loading
RAWXLSX_SHEETNAME	If RAWXLSX is an Excel file, the name of the sheet from which to load the data. If not specified, the first sheet is read. E.g. “Sheet1”	Data loading
RAWDICTXLSX	Defines filename containing the data dictionary. E.g. “xx.xlsx”, “xx.csv”	Data loading
RAWDICTXLSX_SHEETNAME	If RAWDICTXLSX is an Excel file, the name of the sheet from which to load the dictionary. If not specified, the first sheet is read. E.g. “Sheet1”	Data loading
LOGGING	Option to output logfile. If `True`, logfile will be built. If not specified, default is `True`.	Log
LOG_FILENAME	Defines filename of logfile. If not defined, default is `logfile.txt`.	Log
CREATE_UNIQUE_INDEX	Option to create unique row index from existing columns. If `True`, new index will be created. If not specified, default is `False`.	Indexing
UNIQUE_INDEX_COMPOSITION_LIST	List of column names to create new index from. If not specified, default is `[]`. E.g.: `["subject_id", "visit"]`	Indexing
UNIQUE_INDEX_DELIMITER	Delimiter to separate values from composition list. If not specified, default is `_`	Indexing
LONG_VAR_MARKER	Defines the variable name that indicates which longitudinal group each row belongs to. If not specified, default is `None`.	Longitudinal data
DICT_VAR_VARNAME	Column in data dictionary containing variable names in input data. If not specified, set as “`NAME`”.	Data Dictionary settings
DICT_VAR_VARCATEGORY	Column in data dictionary setting the category of the variable name. If not specified, set as “`CATEGORY`”	Data Dictionary settings
DICT_VAR_VARSECONDARY	Column in data dictionary setting if the variable is a secondary variable. If not specified, set as “`SECONDARY`”	Data Dictionary settings
DICT_VAR_VARFREQUENCY	Column in data dictionary setting frequency of the variable (for longitudinal datasets). If not specified, set as “`FREQUENCY`”.	Data Dictionary settings
DICT_VAR_TYPE	Column in data dictionary setting the type of variable (string, numeric, date). If not specified, set as “`TYPE`”	Data Dictionary settings
DICT_VAR_CODINGS	Column in data dictionary setting the codings of variable (dateformat, categories). If not specified, set as “`CODINGS`”	Data Dictionary settings
VAR_NAME_STRIPEMPTYSPACES	Boolean option. If `True`, empty spaces will be stripped from variable names in input data, and from variable names listed in data dictionary. If not specified, default is `False`.	Data cleaning settings
OUTPUT_TYPE_DATA	The output file type for the clean data files. Available options: ‘`csv`’, ‘`xlsx`’. If not specified, default is ‘`csv`’	Data cleaning settings
OUTPUT_TYPE_DICT	The output file type for the amended dictionary. Available options: ‘`csv`’, ‘`xlsx`’. If not specified, default is ‘`xlsx`’	Data cleaning settings
INITIAL_REPORT_FILENAME	The output filename to store the initial report prior to optional cleaning steps. If not specified, default is ‘`initial_report.xlsx`’	Report generation settings
SUFFIX_DROPPED_DUPLICATED_ROWS	The filename suffix to use for intermediate outputs of cleaned data. If not specified, default is `DD`.	Drop Duplicates settings
OUTPUT_DROPPED_DUPLICATED_ROWS_FILENAME	The output filename to store the duplicated rows which have been dropped. Default is `rowsRemoved.xlsx`	Drop Duplicates settings
SUFFIX_CONSTRAINTS	The filename suffix to use for intermediate outputs of cleaned data. If not specified, default is `CON`.	Constraints settings
SUFFIX_STANDARDISE_TEXT	The filename suffix to use for intermediate outputs of cleaned data. If not specified, default is `ST`.	Standardise Text settings
OPTIONS_STANDARDISE_TEXT_CASE_TYPE	The default case type to convert strings into: “uppercase”, “lowercase”, “capitalise”. If not specified, default is `uppercase`.	Standardise Text settings
OPTIONS_STANDARDISE_TEXT_EXCLUDE_LIST	The variables to exclude from the conversion. For example: `["Gender", "Work"]`.	Standardise Text settings
OPTIONS_STANDARDISE_TEXT_CASE_TYPE_DICT	The dictionary to customise case_type for specific variables, overwriting default. For example: `{"Race1": "capitalise"}`.	Standardise Text settings
SUFFIX_STANDARDISE_DATE	The filename suffix to use for intermediate outputs of cleaned data. If not specified, default is `DATE`.	Date standardisation settings
OPTIONS_STANDARDISE_DATE_FORMAT	The standard date format to use for all dates. If not specified, default is `yyyy-mm-dd`. Follows the format used in MS Excel, see ref. Example: `ddd, dd mmmm yy`.	Date standardisation settings
OPTIONS_FAILEDDATE_CONVERSIONS_FILENAME	The filename for storing list of failed date conversions (only csv). Default is `failed_date_conversions.csv`.	Date standardisation settings
SUFFIX_CONVERT_ASCII	The filename suffix to use for intermediate outputs of cleaned data. If not specified, default is `ASCII`.	ASCII Conversion settings
OPTIONS_CONVERT_ASCII_EXCLUSION_LIST	List of characters to exclude from conversion. E.g. `['€','$','Ò']`.	ASCII Conversion settings
SUFFIX_REMOVE_SECONDARY	The filename suffix to use for intermediate outputs of cleaned data. If not specified, default is `NOSEC`.	Remove secondary variable settings
OUTPUT_REMOVED_SECONDARY_FILENAME	The output filename to store removed variables. If not specified, default is `removed_secondary_variables.xlsx`.	Remove secondary variable settings
OPTIONS_SECONDARY_REMOVAL_EXCLUDE_LIST	Secondary variables to exclude from the removal process. If not specified, default is `[]`.	Remove secondary variable settings
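
As an illustration of how the three indexing settings fit together, here is a minimal sketch in plain Python. The helper `build_unique_index` and the sample rows are hypothetical and not part of the package. The sketch only assumes that the new index is formed by joining the values of the columns named in `UNIQUE_INDEX_COMPOSITION_LIST` with `UNIQUE_INDEX_DELIMITER`; the actual implementation may differ.

```python
# Indexing settings, using the example values and defaults from the table above.
CREATE_UNIQUE_INDEX = True
UNIQUE_INDEX_COMPOSITION_LIST = ["subject_id", "visit"]
UNIQUE_INDEX_DELIMITER = "_"


def build_unique_index(row, columns, delimiter):
    """Hypothetical helper: join the values of `columns` in `row` with `delimiter`."""
    return delimiter.join(str(row[col]) for col in columns)


# Hypothetical sample rows standing in for the raw data.
rows = [
    {"subject_id": "S001", "visit": 1, "weight": 70.2},
    {"subject_id": "S001", "visit": 2, "weight": 69.8},
]

if CREATE_UNIQUE_INDEX:
    # One index value per row, composed from the listed columns.
    index = [
        build_unique_index(r, UNIQUE_INDEX_COMPOSITION_LIST, UNIQUE_INDEX_DELIMITER)
        for r in rows
    ]
else:
    # Assumption: fall back to a plain positional index when the option is off.
    index = list(range(len(rows)))

print(index)  # ['S001_1', 'S001_2']
```

With this composition, the index is only unique if the chosen columns jointly identify a row, so the composition list would normally combine a subject identifier with a visit or time marker.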

Non-Canonical Definitions (Used by `utils_exec.py`)

Attribute	Description	Type
EXECUTE_STEPS	Variable (obj) containing the data cleaning operations to perform, in order of execution. Step numbering starts at 1.	Data cleaning steps
FINAL_REPORT_FILENAME	Variable (str) that determines the final report filename. If not specified, default is `final_report_sample.xlsx`.	Data cleaning settings
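
The exact structure of `EXECUTE_STEPS` is not spelled out here. As a rough sketch only, assuming it is a dictionary that maps step numbers (starting at 1) to operation names, the execution order could be derived as follows. The operation names and the helper `ordered_steps` are hypothetical placeholders, not identifiers confirmed by `utils_exec.py`.

```python
# Assumed shape: step number -> operation name. The names are placeholders.
EXECUTE_STEPS = {
    2: "standardise_text",
    1: "drop_duplicates",
    3: "standardise_date",
}
FINAL_REPORT_FILENAME = "final_report_sample.xlsx"  # documented default


def ordered_steps(steps):
    """Return operation names sorted by step number; numbering must start at 1."""
    numbers = sorted(steps)
    if numbers and numbers[0] != 1:
        raise ValueError("step numbering is expected to start at 1")
    return [steps[n] for n in numbers]


print(ordered_steps(EXECUTE_STEPS))
# ['drop_duplicates', 'standardise_text', 'standardise_date']
```

Sorting by key means the order in which entries are written does not matter; only the step numbers decide the execution order.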
