Constraints.evaluate_df_column

This function takes a dataframe and column name(s) and evaluates the column based on the given conditions and values, creating a new column in the dataframe with the evaluated values. Optionally, a function can be passed in to evaluate the column.

Constraints.evaluate_df_column(df, column_names, [dict_conditions_values, func, output_column_name])

Parameters

df: (dataframe)
- dataframe to be evaluated
column_names: (str, list)
- a string or list of strings containing the name(s) of the columns to be evaluated
dict_conditions_values: (dict)
- A dictionary containing the conditions and values to be evaluated. E.g. dict_conditions_values = {i: {condition: lambda x: x> 5 OR “x>5”, value: “‘5+’”} }
func: (function)
- A function to be applied on the columns
- Default is None
output_column_name: (str)
- A string containing the name of the output column. If not provided, the default is the name of the column plus ‘_evaluated’.

Returns

pandas.DataFrame
- the dataframe with the evaluated values in the new column

Notes

Snippet of an example dict_conditions_values

df = evaluate_df_column(df, 'item', dict_conditions_values=
    {
        'condition_1': {'condition': 'x == "apples"', 'value': '"fruit"'},
        'condition_2': {'condition': 'x == "oranges"', 'value': '"fruit"'},
        'condition_3': {'condition': 'x == "carrots"', 'value': '"vegetable"'},
        'condition_4': {'condition': 'x == "potatoes"', 'value': '"vegetable"'}
    },
    output_column_name='item_type'
)

Examples

Creating AgeDecade as a secondary variable from Age

dict_conditions_values = {}
for i in range(1,11):
    if (i-1)*10 >= 70:
        value = ' 70+'
    else:
        value = f" {(i-1)*10}-{i*10-1}"  #' 0-9'
    dict_conditions_values.update({str(i): {
        'condition': f"{(i-1)*10} <= x and x < {i*10}", #"0 < x and x < 10",
        'value': f"'{value}'"
    }})
df = con.evaluate_df_column(df, 'Age', dict_conditions_values, output_column_name="AgeDecade")

Generating Smoke100n from Smoke100 (convert ‘Yes’ to ‘Smoker’, ‘No’ to ‘Non-Smoker’) using customised function

<!-- Build customised function -->
def compute_smoke100n(df_row):
    w = df_row['Smoke100']
    smoke100n = "N.A."
    if (w=='Yes'):
        smoke100n = 'Smoker'
    elif (w=='No'):
        smoke100n = 'Non-Smoker'
    else:
        smoke100n = w

    return smoke100n

df = con.evaluate_df_column(df, ['Smoke100'], func=compute_smoke100n, output_column_name='Smoke100n')

Derive secondary variable HHIncomeMid from HHIncome

def fn_1(x):
    if "-" in str(x):
        v = x.split("-")
        return ( int(v[1]) + 1 + int(v[0]) )/ 2
    elif "UNK" in str(x):
        return np.nan
    elif "more" in str(x):
        return 100000
    else:
        return np.nan
        
con.evaluate_df_column(df, 'HHIncome', func=fn_1, output_column_name='HHIncomeMid')