Skip to main content Link Menu Expand (external link) Document Search Copy Copied

Example of CleanData with Constraints Class

This example demonstrates the use of the CleanData class together with the Constraints class. It continues from the previous example.

Before we exert the constraints, we can take a look at the data report, which is automatically generated during initialisation.

cd = CleanData(definitions=defi)
print(cd.report_df)

Sample Output

                data_type data_type_in_dict data_type_mismatch  count_missing_values  percentage_missing_values numeric_range                     unique_categories
TESTID              int64           numeric            Matched                     0                     0.0000       1:10000                                  N.A.
ID                  int64           numeric            Matched                     0                     0.0000   51624:71915                                  N.A.
SurveyYr           object            string            Matched                     0                     0.0000           NaN                       2009_10,2011_12
Gender             object            string            Matched                     0                     0.0000           NaN                           male,female
Age                 int64           numeric            Matched                     0                     0.0000          0:80                                  N.A.
...                   ...               ...                ...                   ...                        ...           ...                                   ...
SexNumPartnLife   float64           numeric            Matched                  4275                     0.4275    0.0:2000.0                                  N.A.
SexNumPartYear    float64           numeric            Matched                  5072                     0.5072      0.0:69.0                                  N.A.
SameSex            object            string            Matched                  4232                     0.4232           NaN                            No,nan,Yes
SexOrientation     object            string            Matched                  5158                     0.5158           NaN  Heterosexual,nan,Bisexual,Homosexual
PregnantNow        object            string            Matched                  8304                     0.8304           NaN                    nan,No,Unknown,Yes

Import Libraries

# LOAD DEPENDENCIES
from bdarpack.Constraints import Constraints

eg_nhanes_constraints is a script where the constraints specific to the nhanes have been stored. They consists of functions which take in a dataframe from the CleanData class and an object of the Constraints class, and returns a constrained dataframe and an updated Constraints class that captured the details of the transformation.

import eg_nhanes_constraints as n_con

Use Constraints

Load the dataframe from the CleanData object. Initialise a new Constraints object.

df = cd.clean_df
con = Constraints(debug=True)

The following are examples of constraints used on the variables of the NHANES dataset. The dataset undergoes a series of constraints, as stipulated by the metadata. Please refer to the Constraints example for further details on their usage/construction.

df, con = n_con.con_ageDecade(df, con)
df, con = n_con.con_ageMonths(df, con)
df, con = n_con.con_Race3(df, con)
df, con = n_con.con_GenderEducationMaritalStatus(df, con)
df, con = n_con.con_HHIncome(df, con)
df, con = n_con.con_HomeWork(df, con)
df, con = n_con.con_WeightHeight(df, con)
df, con = n_con.con_BMI(df, con, bmiChartPerc_filename=f"{cd.raw_data_path}bmiagerev.xls")
df, con = n_con.con_Testosterone(df, con)
df, con = n_con.con_cholUrine(df, con)
df, con = n_con.con_diabetes(df, con)
df, con = n_con.con_HealthGen(df, con)
df, con = n_con.con_Depression(df, con)
df, con = n_con.con_Pregnancies(df, con)
df, con = n_con.con_Activeness(df, con)
df, con = n_con.con_Activity(df, con)
df, con = n_con.con_Alcohol(df, con)
df, con = n_con.con_Smoking(df, con)
df, con = n_con.con_Drugs(df, con)
df, con = n_con.con_Sex(df, con)
df, con = n_con.con_Pregnancies_2(df, con)

Sample output

For variable: AgeDecade: Replaced AgeDecade using conditions and values given in dict_conditions_values.
For variable: AgeDecade: Converted 0 missing values to UNK.
For variable: AgeMonths: Replaced AgeMonths using conditions and values given in dict_conditions_values.
For variable: Race1: Converted 0 missing values to UNK.
For variable: Race3: Converted 3568 missing values to UNK.
For variable: Race3: Replaced Race3 using conditions and values given in dict_conditions_values.
For variable: Gender: Converted 0 missing values to UNK.
For variable: Education: Converted 2363 missing values to UNK.
For variable: MaritalStatus: Converted 2358 missing values to UNK.
For variable: Education: Replaced Education using conditions and values given in dict_conditions_values.
For variable: MaritalStatus: Replaced MaritalStatus using conditions and values given in dict_conditions_values.
For variable: HHIncome: Converted 653 missing values to UNK.
For variable: HHIncomeMid: Replaced HHIncomeMid using conditions and values given in dict_conditions_values.
For variable: Poverty: Replaced Poverty using conditions and values given in dict_conditions_values.
For variable: Poverty: Converted 665 missing values to nan.
For variable: HomeOwn: Converted 50 missing values to UNK.
For variable: Work: Converted 1922 missing values to UNK.
For variable: Length: Replaced Length using conditions and values given in dict_conditions_values.
For variable: HeadCirc: Replaced HeadCirc using conditions and values given in dict_conditions_values.
For variable: Height: Replaced Height using conditions and values given in dict_conditions_values.
For variable: BMI: Replaced BMI using conditions and values given in dict_conditions_values.
For variable: BMICatUnder20yrs: Replaced BMICatUnder20yrs using conditions and values given in dict_conditions_values.
For variable: BMI_WHO: Replaced BMI_WHO using conditions and values given in dict_conditions_values.
For variable: Testosterone: Replaced Testosterone using conditions and values given in dict_conditions_values.
For variable: DirectChol: Replaced DirectChol using conditions and values given in dict_conditions_values.
For variable: TotChol: Replaced TotChol using conditions and values given in dict_conditions_values.
For variable: UrineVol1: Replaced UrineVol1 using conditions and values given in dict_conditions_values.
For variable: UrineFlow1: Replaced UrineFlow1 using conditions and values given in dict_conditions_values.
For variable: UrineVol2: Replaced UrineVol2 using conditions and values given in dict_conditions_values.
For variable: UrineFlow2: Replaced UrineFlow2 using conditions and values given in dict_conditions_values.
For variable: Diabetes: Converted 133 missing values to UNK.
For variable: Diabetes: Replaced Diabetes using conditions and values given in dict_conditions_values.
For variable: DiabetesAge: Replaced DiabetesAge using conditions and values given in dict_conditions_values.
For variable: DiabetesAge: Converted 0 values in DiabetesAge to Age.
For variable: HealthGen: Converted 2067 missing values to UNK.
For variable: HealthGen: Replaced HealthGen using conditions and values given in dict_conditions_values.
For variable: DaysPhysHlthBad: Replaced DaysPhysHlthBad using conditions and values given in dict_conditions_values.
For variable: DaysMentHlthBad: Replaced DaysMentHlthBad using conditions and values given in dict_conditions_values.
For variable: LittleInterest: Converted 2783 missing values to UNK.
For variable: Depressed: Converted 2781 missing values to UNK.
For variable: LittleInterest: Replaced LittleInterest using conditions and values given in dict_conditions_values.
For variable: Depressed: Replaced Depressed using conditions and values given in dict_conditions_values.
For variable: nPregnancies: Replaced nPregnancies using conditions and values given in dict_conditions_values.
For variable: nBabies: Replaced nBabies using conditions and values given in dict_conditions_values.
For variable: Age1stBaby: Replaced Age1stBaby using conditions and values given in dict_conditions_values.
For variable: nBabies: Converted 0 values in nBabies to nPregnancies.
For variable: Age1stBaby: Converted 1 values in Age1stBaby to Age.
For variable: SleepHrsNight: Replaced SleepHrsNight using conditions and values given in dict_conditions_values.
For variable: SleepTrouble: Converted 1921 missing values to UNK.
For variable: SleepTrouble: Replaced SleepTrouble using conditions and values given in dict_conditions_values.
For variable: PhysActive: Converted 1458 missing values to UNK.
For variable: PhysActive: Replaced PhysActive using conditions and values given in dict_conditions_values.
For variable: PhysActiveDays: Replaced PhysActiveDays using conditions and values given in dict_conditions_values.
For variable: TVHrsDay: Converted 3704 missing values to UNK.
For variable: CompHrsDay: Converted 3700 missing values to UNK.
For variable: TVHrsDay: Replaced TVHrsDay using conditions and values given in dict_conditions_values.
For variable: CompHrsDay: Replaced CompHrsDay using conditions and values given in dict_conditions_values.
For variable: TVHrsDayChild: Replaced TVHrsDayChild using conditions and values given in dict_conditions_values.
For variable: CompHrsDayChild: Replaced CompHrsDayChild using conditions and values given in dict_conditions_values.
For variable: Alcohol12PlusYr: Converted 2853 missing values to UNK.
For variable: Alcohol12PlusYr: Replaced Alcohol12PlusYr using conditions and values given in dict_conditions_values.
For variable: AlcoholDay: Replaced AlcoholDay using conditions and values given in dict_conditions_values.
For variable: AlcoholYear: Replaced AlcoholYear using conditions and values given in dict_conditions_values.
For variable: Smoke100: Converted 2354 missing values to UNK.
For variable: Smoke100: Replaced Smoke100 using conditions and values given in dict_conditions_values.
For variable: SmokeNow: Converted 5409 missing values to UNK.
For variable: SmokeNow: Replaced SmokeNow using conditions and values given in dict_conditions_values.
For variable: Smoke100n: Replaced Smoke100n using conditions and values given in dict_conditions_values.
For variable: SmokeAge: Replaced SmokeAge using conditions and values given in dict_conditions_values.
For variable: SmokeAge: Converted 0 values in SmokeAge to Age.
For variable: Marijuana: Converted 4163 missing values to UNK.
For variable: RegularMarij: Converted 4163 missing values to UNK.
For variable: Marijuana: Replaced Marijuana using conditions and values given in dict_conditions_values.
For variable: RegularMarij: Replaced RegularMarij using conditions and values given in dict_conditions_values.
For variable: AgeFirstMarij: Replaced AgeFirstMarij using conditions and values given in dict_conditions_values.
For variable: AgeRegMarij: Replaced AgeRegMarij using conditions and values given in dict_conditions_values.
For variable: AgeFirstMarij: Replaced AgeFirstMarij using conditions and values given in dict_conditions_values.
For variable: AgeRegMarij: Replaced AgeRegMarij using conditions and values given in dict_conditions_values.
For variable: AgeFirstMarij: Converted 0 values in AgeFirstMarij to Age.
For variable: AgeRegMarij: Converted 0 values in AgeRegMarij to Age.
For variable: HardDrugs: Converted 3521 missing values to UNK.
For variable: SexEver: Converted 3520 missing values to UNK.
For variable: SameSex: Converted 3520 missing values to UNK.
For variable: HardDrugs: Replaced HardDrugs using conditions and values given in dict_conditions_values.
For variable: SexEver: Replaced SexEver using conditions and values given in dict_conditions_values.
For variable: SameSex: Replaced SameSex using conditions and values given in dict_conditions_values.
For variable: SameSex: Replaced SameSex using conditions and values given in dict_conditions_values.
For variable: SexAge: Replaced SexAge using conditions and values given in dict_conditions_values.
For variable: SexNumPartnLife: Replaced SexNumPartnLife using conditions and values given in dict_conditions_values.
For variable: SexAge: Converted 0 values in SexAge to Age.
For variable: SexNumPartYear: Replaced SexNumPartYear using conditions and values given in dict_conditions_values.
For variable: SexOrientation: Converted 4245 missing values to UNK.
For variable: SexOrientation: Replaced SexOrientation using conditions and values given in dict_conditions_values.
For variable: PregnantNow: Converted 6540 missing values to UNK.
For variable: PregnantNow: Replaced PregnantNow using conditions and values given in dict_conditions_values.

Check Output

Check details of the constrained dataset

pprint.pprint(con.log)

Sample Output

{'Age1stBaby': {'compare_columns_A_B': 'Converted 1 values in Age1stBaby to '
                                       'Age.',
                'evaluate_df_column': 'Replaced Age1stBaby using conditions '
                                      'and values given in '
                                      'dict_conditions_values.'},
 'AgeDecade': {'convertBlankstoValue': 'Converted 0 missing values to UNK.',
               'evaluate_df_column': 'Replaced AgeDecade using conditions and '
                                     'values given in dict_conditions_values.'},
 'AgeFirstMarij': {'compare_columns_A_B': 'Converted 0 values in AgeFirstMarij '
                                          'to Age.',
                   'evaluate_df_column': 'Replaced AgeFirstMarij using '
                                         'conditions and values given in '
                                         'dict_conditions_values.'},
 'AgeMonths': {'evaluate_df_column': 'Replaced AgeMonths using conditions and '
                                     'values given in dict_conditions_values.'},
 'AgeRegMarij': {'compare_columns_A_B': 'Converted 0 values in AgeRegMarij to '
                                        'Age.',
                 'evaluate_df_column': 'Replaced AgeRegMarij using conditions '
                                       'and values given in '
                                       'dict_conditions_values.'},
 'Alcohol12PlusYr': {'convertBlankstoValue': 'Converted 2853 missing values to '
                                             'UNK.',
                     'evaluate_df_column': 'Replaced Alcohol12PlusYr using '
                                           'conditions and values given in '
                                           'dict_conditions_values.'},
 'AlcoholDay': {'evaluate_df_column': 'Replaced AlcoholDay using conditions '
                                      'and values given in '
                                      'dict_conditions_values.'},
 'AlcoholYear': {'evaluate_df_column': 'Replaced AlcoholYear using conditions '
                                       'and values given in '
                                       'dict_conditions_values.'},
 'BMI': {'evaluate_df_column': 'Replaced BMI using conditions and values given '
                               'in dict_conditions_values.'},
 'BMICatUnder20yrs': {'evaluate_df_column': 'Replaced BMICatUnder20yrs using '
                                            'conditions and values given in '
                                            'dict_conditions_values.'},
 'BMI_WHO': {'evaluate_df_column': 'Replaced BMI_WHO using conditions and '
                                   'values given in dict_conditions_values.'},
 'CompHrsDay': {'convertBlankstoValue': 'Converted 3700 missing values to UNK.',
                'evaluate_df_column': 'Replaced CompHrsDay using conditions '
                                      'and values given in '
                                      'dict_conditions_values.'},
 'CompHrsDayChild': {'evaluate_df_column': 'Replaced CompHrsDayChild using '
                                           'conditions and values given in '
                                           'dict_conditions_values.'},
 'DaysMentHlthBad': {'evaluate_df_column': 'Replaced DaysMentHlthBad using '
                                           'conditions and values given in '
                                           'dict_conditions_values.'},
 'DaysPhysHlthBad': {'evaluate_df_column': 'Replaced DaysPhysHlthBad using '
                                           'conditions and values given in '
                                           'dict_conditions_values.'},
 'Depressed': {'convertBlankstoValue': 'Converted 2781 missing values to UNK.',
               'evaluate_df_column': 'Replaced Depressed using conditions and '
                                     'values given in dict_conditions_values.'},
 'Diabetes': {'convertBlankstoValue': 'Converted 133 missing values to UNK.',
              'evaluate_df_column': 'Replaced Diabetes using conditions and '
                                    'values given in dict_conditions_values.'},
 'DiabetesAge': {'compare_columns_A_B': 'Converted 0 values in DiabetesAge to '
                                        'Age.',
                 'evaluate_df_column': 'Replaced DiabetesAge using conditions '
                                       'and values given in '
                                       'dict_conditions_values.'},
 'DirectChol': {'evaluate_df_column': 'Replaced DirectChol using conditions '
                                      'and values given in '
                                      'dict_conditions_values.'},
 'Education': {'convertBlankstoValue': 'Converted 2363 missing values to UNK.',
               'evaluate_df_column': 'Replaced Education using conditions and '
                                     'values given in dict_conditions_values.'},
 'Gender': {'convertBlankstoValue': 'Converted 0 missing values to UNK.'},
 'HHIncome': {'convertBlankstoValue': 'Converted 653 missing values to UNK.'},
 'HHIncomeMid': {'evaluate_df_column': 'Replaced HHIncomeMid using conditions '
                                       'and values given in '
                                       'dict_conditions_values.'},
 'HardDrugs': {'convertBlankstoValue': 'Converted 3521 missing values to UNK.',
               'evaluate_df_column': 'Replaced HardDrugs using conditions and '
                                     'values given in dict_conditions_values.'},
 'HeadCirc': {'evaluate_df_column': 'Replaced HeadCirc using conditions and '
                                    'values given in dict_conditions_values.'},
 'HealthGen': {'convertBlankstoValue': 'Converted 2067 missing values to UNK.',
               'evaluate_df_column': 'Replaced HealthGen using conditions and '
                                     'values given in dict_conditions_values.'},
 'Height': {'evaluate_df_column': 'Replaced Height using conditions and values '
                                  'given in dict_conditions_values.'},
 'HomeOwn': {'convertBlankstoValue': 'Converted 50 missing values to UNK.'},
 'Length': {'evaluate_df_column': 'Replaced Length using conditions and values '
                                  'given in dict_conditions_values.'},
 'LittleInterest': {'convertBlankstoValue': 'Converted 2783 missing values to '
                                            'UNK.',
                    'evaluate_df_column': 'Replaced LittleInterest using '
                                          'conditions and values given in '
                                          'dict_conditions_values.'},
 'Marijuana': {'convertBlankstoValue': 'Converted 4163 missing values to UNK.',
               'evaluate_df_column': 'Replaced Marijuana using conditions and '
                                     'values given in dict_conditions_values.'},
 'MaritalStatus': {'convertBlankstoValue': 'Converted 2358 missing values to '
                                           'UNK.',
                   'evaluate_df_column': 'Replaced MaritalStatus using '
                                         'conditions and values given in '
                                         'dict_conditions_values.'},
 'PhysActive': {'convertBlankstoValue': 'Converted 1458 missing values to UNK.',
                'evaluate_df_column': 'Replaced PhysActive using conditions '
                                      'and values given in '
                                      'dict_conditions_values.'},
 'PhysActiveDays': {'evaluate_df_column': 'Replaced PhysActiveDays using '
                                          'conditions and values given in '
                                          'dict_conditions_values.'},
 'Poverty': {'convertBlankstoValue': 'Converted 665 missing values to nan.',
             'evaluate_df_column': 'Replaced Poverty using conditions and '
                                   'values given in dict_conditions_values.'},
 'PregnantNow': {'convertBlankstoValue': 'Converted 6540 missing values to '
                                         'UNK.',
                 'evaluate_df_column': 'Replaced PregnantNow using conditions '
                                       'and values given in '
                                       'dict_conditions_values.'},
 'Race1': {'convertBlankstoValue': 'Converted 0 missing values to UNK.'},
 'Race3': {'convertBlankstoValue': 'Converted 3568 missing values to UNK.',
           'evaluate_df_column': 'Replaced Race3 using conditions and values '
                                 'given in dict_conditions_values.'},
 'RegularMarij': {'convertBlankstoValue': 'Converted 4163 missing values to '
                                          'UNK.',
                  'evaluate_df_column': 'Replaced RegularMarij using '
                                        'conditions and values given in '
                                        'dict_conditions_values.'},
 'SameSex': {'convertBlankstoValue': 'Converted 3520 missing values to UNK.',
             'evaluate_df_column': 'Replaced SameSex using conditions and '
                                   'values given in dict_conditions_values.'},
 'SexAge': {'compare_columns_A_B': 'Converted 0 values in SexAge to Age.',
            'evaluate_df_column': 'Replaced SexAge using conditions and values '
                                  'given in dict_conditions_values.'},
 'SexEver': {'convertBlankstoValue': 'Converted 3520 missing values to UNK.',
             'evaluate_df_column': 'Replaced SexEver using conditions and '
                                   'values given in dict_conditions_values.'},
 'SexNumPartYear': {'evaluate_df_column': 'Replaced SexNumPartYear using '
                                          'conditions and values given in '
                                          'dict_conditions_values.'},
 'SexNumPartnLife': {'evaluate_df_column': 'Replaced SexNumPartnLife using '
                                           'conditions and values given in '
                                           'dict_conditions_values.'},
 'SexOrientation': {'convertBlankstoValue': 'Converted 4245 missing values to '
                                            'UNK.',
                    'evaluate_df_column': 'Replaced SexOrientation using '
                                          'conditions and values given in '
                                          'dict_conditions_values.'},
 'SleepHrsNight': {'evaluate_df_column': 'Replaced SleepHrsNight using '
                                         'conditions and values given in '
                                         'dict_conditions_values.'},
 'SleepTrouble': {'convertBlankstoValue': 'Converted 1921 missing values to '
                                          'UNK.',
                  'evaluate_df_column': 'Replaced SleepTrouble using '
                                        'conditions and values given in '
                                        'dict_conditions_values.'},
 'Smoke100': {'convertBlankstoValue': 'Converted 2354 missing values to UNK.',
              'evaluate_df_column': 'Replaced Smoke100 using conditions and '
                                    'values given in dict_conditions_values.'},
 'Smoke100n': {'evaluate_df_column': 'Replaced Smoke100n using conditions and '
                                     'values given in dict_conditions_values.'},
 'SmokeAge': {'compare_columns_A_B': 'Converted 0 values in SmokeAge to Age.',
              'evaluate_df_column': 'Replaced SmokeAge using conditions and '
                                    'values given in dict_conditions_values.'},
 'SmokeNow': {'convertBlankstoValue': 'Converted 5409 missing values to UNK.',
              'evaluate_df_column': 'Replaced SmokeNow using conditions and '
                                    'values given in dict_conditions_values.'},
 'TVHrsDay': {'convertBlankstoValue': 'Converted 3704 missing values to UNK.',
              'evaluate_df_column': 'Replaced TVHrsDay using conditions and '
                                    'values given in dict_conditions_values.'},
 'TVHrsDayChild': {'evaluate_df_column': 'Replaced TVHrsDayChild using '
                                         'conditions and values given in '
                                         'dict_conditions_values.'},
 'Testosterone': {'evaluate_df_column': 'Replaced Testosterone using '
                                        'conditions and values given in '
                                        'dict_conditions_values.'},
 'TotChol': {'evaluate_df_column': 'Replaced TotChol using conditions and '
                                   'values given in dict_conditions_values.'},
 'UrineFlow1': {'evaluate_df_column': 'Replaced UrineFlow1 using conditions '
                                      'and values given in '
                                      'dict_conditions_values.'},
 'UrineFlow2': {'evaluate_df_column': 'Replaced UrineFlow2 using conditions '
                                      'and values given in '
                                      'dict_conditions_values.'},
 'UrineVol1': {'evaluate_df_column': 'Replaced UrineVol1 using conditions and '
                                     'values given in dict_conditions_values.'},
 'UrineVol2': {'evaluate_df_column': 'Replaced UrineVol2 using conditions and '
                                     'values given in dict_conditions_values.'},
 'Work': {'convertBlankstoValue': 'Converted 1922 missing values to UNK.'},
 'nBabies': {'compare_columns_A_B': 'Converted 0 values in nBabies to '
                                    'nPregnancies.',
             'evaluate_df_column': 'Replaced nBabies using conditions and '
                                   'values given in dict_conditions_values.'},
 'nPregnancies': {'evaluate_df_column': 'Replaced nPregnancies using '
                                        'conditions and values given in '
                                        'dict_conditions_values.'}}

Update CleanData class

Update the CleanData class with constrained data.

cd.update_data(new_df = df, filename_suffix = cd.suffix_constraints)

Sample Output

Replacing the input data...
Replacing the input data complete. new filename: *\trainData\nhanes_raw-DD-CON.csv

Copyright © 2023 BiomedDAR. Distributed by an MIT license.