TabulaCopula
class TabulaCopula(definitions=None, output_general_prefix=None, conditionalSettings_dict=None, metaData_transformer=None, var_list_filter=None, removeNull=False, sampling=None, debug=False)
Module for fitting and sampling Gaussian copula and conditional-copula models on tabular data.
Parameters
definitions: file (.py), optional, default None. Contains global variables.
output_general_prefix: str, optional, default None. Prefix used for all output files, e.g. "EXPT_1". If not None, replaces settings in definitions.
conditionalSettings_dict: dict, optional, default None. Dictionary of conditional inputs for using conditional-copula.
metaData_transformer: dict, optional, default None. Dictionary of inputs for the Transformer class initialisation (ref Transformer metaData).
var_list_filter: list, optional, default None. List of variables to transform. If None, all will be transformed.
removeNull: boolean, optional, default False. Whether to remove all null values prior to transformation.
sampling: float, optional, default None. Percentage of sample points drawn from the transformed dataframe, leaving the rest as control. If None, all training points will be used. (Note that the sampling process is done after the transformation.)
debug: boolean, optional, default False. Whether to print debug-related outputs to console.
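The effect of `sampling` can be pictured with a short standalone sketch. It is not the library's implementation: the helper name `split_train_control`, the `seed` argument, and the reading of `sampling` as a value on a 0 to 100 scale are all assumptions made for illustration.

```python
import random


def split_train_control(rows, sampling, seed=0):
    """Hypothetical helper: keep `sampling` percent of rows for training.

    Assumes `sampling` is on a 0-100 scale; the remaining rows form the
    control set. With sampling=None every row is used for training.
    """
    if sampling is None:
        return list(rows), []
    if not 0 <= sampling <= 100:
        raise ValueError("sampling must be a percentage between 0 and 100")

    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)

    n_train = round(len(rows) * sampling / 100)
    train_idx = set(indices[:n_train])

    # Preserve the original row order inside each subset.
    train = [row for i, row in enumerate(rows) if i in train_idx]
    control = [row for i, row in enumerate(rows) if i not in train_idx]
    return train, control


rows = [{"id": i} for i in range(10)]
train, control = split_train_control(rows, sampling=70)
print(len(train), len(control))  # 7 3
```

The two subsets are disjoint, which matches the description of `transformed_df` and `control_df` in the attribute table.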
Notes
Description of conditionalSettings_dict
The conditionalSettings_dict variable specifies the structure of the conditional setup. It takes the following form:
    conditionalSettings_dict = {
        "set_1": {
            "bool": True,
            "parent_conditions": {  # parents: the `Y` in `P(X | Y)` when learning `P(X | Y)`
                "SurveyYr": {  # split variable into 2 sets
                    "condition": "set",  # available options: "set", "range"
                    "condition_value": {
                        1: ["2009_10"],
                        2: ["2011_12"]
                    }
                },
                "Age": {  # split variable into 3 sets based on range
                    "condition": "range",
                    "condition_value": {
                        1: [">=3", "<79"],
                        2: ["<3"],
                        3: [">=79"]
                    }
                }
            },
            "conditions_var": ["Age"],  # the `Y` to keep constant while generating values of `X` in `P(X | Y)`. Can be a float, in which case it is a threshold: all variables whose pairwise correlation with `X` exceeds the threshold are fixed.
            "children": ["AgeMonths"]  # variables whose joint conditional distribution is learned, the `X` in `P(X | Y)`. Can be the string "allOthers".
        }
    }
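To illustrate how the `"set"` and `"range"` options in the structure above partition a variable, here is a minimal standalone sketch. The helpers `matches` and `assign_group` are hypothetical names, not part of TabulaCopula, and the sketch assumes that several expressions in one `"range"` list must all hold (so `[">=3", "<79"]` means 3 <= Age < 79).

```python
import operator
import re

# Comparison operators accepted in "range" expressions such as ">=3" or "<79".
_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}
_EXPR = re.compile(r"^\s*(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def matches(value, condition, spec):
    """Hypothetical helper: does `value` satisfy one condition_value entry?"""
    if condition == "set":
        return value in spec
    if condition == "range":
        for expr in spec:
            parsed = _EXPR.match(expr)
            if parsed is None:
                raise ValueError(f"cannot parse range expression: {expr!r}")
            op, bound = parsed.groups()
            # Assumption: every expression in the list must hold (logical AND).
            if not _OPS[op](value, float(bound)):
                return False
        return True
    raise ValueError(f"unknown condition type: {condition!r}")


def assign_group(value, variable_settings):
    """Return the first group id whose condition matches, or None."""
    for group_id, spec in variable_settings["condition_value"].items():
        if matches(value, variable_settings["condition"], spec):
            return group_id
    return None


parent_conditions = {
    "SurveyYr": {
        "condition": "set",
        "condition_value": {1: ["2009_10"], 2: ["2011_12"]},
    },
    "Age": {
        "condition": "range",
        "condition_value": {1: [">=3", "<79"], 2: ["<3"], 3: [">=79"]},
    },
}

record = {"SurveyYr": "2011_12", "Age": 45}
groups = {var: assign_group(record[var], cfg) for var, cfg in parent_conditions.items()}
print(groups)  # {'SurveyYr': 2, 'Age': 1}
```

Under these assumptions, each record falls into one group per parent variable, and a conditional copula would be learned separately for each combination of groups.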
Examples
Refer to the following pages for detailed examples:
| Example | Description |
|---|---|
| Multivariate Synthetic Data (multi-Linear) | Demonstrates the use of the TabulaCopula class to generate synthetic data for a multivariate dataset. |
| Multivariate Synthetic Data (multi-Linear, with missing values) | Demonstrates the use of the TabulaCopula class to generate synthetic data for a real multivariate dataset (NHANES) whose variables have known linear relationships. |
| Multivariate Synthetic Data (multi-Linear, Privacy Leakage Assessment) | Demonstrates the use of the TabulaCopula class to quantitatively assess the privacy leakage risks of generated synthetic data. |
| Multivariate Synthetic Data (multi, non-linear, non-monotonic) | Demonstrates the use of the TabulaCopula class to generate synthetic data for a simulated multivariate dataset (socialdata) whose variables have known non-linear, non-monotonic relationships. |
Attributes
| Attribute | Description |
|---|---|
| debug | (boolean) whether to print debug-related outputs to console |
| folder_trainData | (str) training data folder |
| folder_synData | (str) synthetic data folder |
| folder_privacyMetrics | (str) privacy metrics folder |
| output_general_prefix | (str) prefix used for all output files |
| sampling | (float) percentage of sample points drawn from the transformed dataframe, leaving the rest as control |
| privacy_batch_n | (int) number of repetitions of privacy test |
| output_type_data | (str) output file type for the clean data files. |
| output_type_dict | (str) output file type for the amended dictionary. |
| output_type_obj | (str) output file type for saved class instance |
| dict_var_varname | (str) column in data dictionary containing variable names in input data |
| dict_var_varcategory | (str) column in data dictionary setting the category of the variable name |
| dict_var_vartype | (str) column in data dictionary containing variable types in input data |
| conditional_set_bool | (bool) flag set to true when filenames initialised for conditional setup |
| metaData_transformer | (dict) dictionary of inputs for the Transformer class initialisation |
| var_list_filter | (list) list of variables to transform (subset of all input variables) |
| removeNull | (bool) Whether to remove all null values prior to transformation. |
| conditionalSettings_dict | (dict) dictionary of conditional inputs for using conditional-copula. |
| prefix_path | (str) PREFIX_PATH from definitions |
| trainxlsx | (str) TRAINXLSX from definitions |
| traindictxlsx | (str) TRAINDICTXLSX from definitions |
| train_data_path | (str) folder path to put training data |
| train_data_filename | (str) filename of training data |
| train_data_dict_filename | (str) filename of dictionary of training data |
| syn_data_path | (str) folder path to put synthetic data |
| privacyMetrics_path | (str) folder path to put privacy metrics |
| train_df | (dataframe) training data |
| dict_df | (dataframe) data dictionary |
| var_list | (list) list of all variables (column headers) found in input data |
| processed_var_list | (list) list of all variables (column headers) that have been transformed |
| transformed_df | (dataframe) transformed training data; after sampling, replaced with the sampled subset (disjoint from self.control_df) |
| control_df | (dataframe) rows not sampled for training (kept as control for privacy leakage testing) |
| curated_train_df | (dataframe) curated data prior to transformation |
| syn_samples_df | (dataframe) synthetic samples |
| syn_samples_conditional_df | (dataframe) conditional synthetic samples |
| reversed_df | (dataframe) reverse-transformed synthetic samples |
| reversed_conditional_df | (dataframe) reverse-transformed conditional synthetic samples |
| reversed_control_df | (dataframe) reverse-transformed control_df |
| reversed_transformed_df | (dataframe) reverse-transformed transformed_df |
| privacyMetricEval | (obj) privacy metric evaluator |
| privacyMetricEval_cond | (obj) privacy metric evaluator for conditional copula |
| privacyMetricResults | (dict) privacy metric results dictionary |
| definitions | (obj) definitions in corresponding input definitions.py |
Methods
| Method | Description |
|---|---|
| transform([metaData, var_list]) | transform data into numerical equivalent |
| transform_conditional([metaData, ]) | transform data into numerical equivalent (for conditional) |
| reverse_transform([transformed_df, conditional_transformed_df, control_transformed_df]) | reverse transformation on generated synthetic data |
| print_details_copula() | print copula details |
| fit_gaussian_copula([correlation_method, marginal_dist_dict]) | build copula for given training data |
| fit_gaussian_copula_conditional([correlation_method, marginal_dist_dict]) | build conditional-copula for given conditional_dict |
| sample_gaussian_copula([sample_size, conditions]) | sample datapoints from learned joint distribution |
| sample_gaussian_copula_conditional() | sample datapoints from learned conditional joint distribution |
| syn_generate([sample_size, cond_bool, conditions]) | wrapper for synthetic data generation |
| build_privacyMetric() | build privacyMetric, privacyMetric_conditional evaluator |
| privacyMetric_singlingOut_Batch([n, mode, n_attacks, print_results]) | wrapper fn to run privacy metric evaluation for singling out attack (standard) |
| privacyMetric_singlingOut_cond_Batch([n, mode, n_attacks, print_results]) | wrapper fn to run privacy metric evaluation for singling out attack (conditional) |
| privacyMetric_Linkability_Batch(aux_cols, [n, n_neighbors, n_attacks, print_results]) | wrapper fn to run privacy metric evaluation for linkability attack (standard) |
| privacyBatch_Linkability_cond_Batch(aux_cols, [n, n_neighbors, n_attacks, print_results]) | wrapper fn to run privacy metric evaluation for linkability attack (conditional) |
| privacyMetric_Inference_Batch([n, n_attacks, print_results]) | wrapper fn to run privacy metric evaluation for inference attack (standard) |
| privacyMetric_Inference_cond_Batch([n, n_attacks, print_results]) | wrapper fn to run privacy metric evaluation for inference attack (conditional) |
| save() | wrapper fn to save class instance and output filenames |
| save_outputFilenames() | Saves the output filenames dictionary to a csv file, suffix="CL-OF" |
| save_instance() | Saves the current class instance to a pickle file |
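For readers unfamiliar with what the `fit_gaussian_copula` and `sample_gaussian_copula` steps do conceptually, the following two-variable sketch walks through the general Gaussian copula technique using only the standard library. It is not TabulaCopula's code: the functions `fit_copula` and `sample_copula`, the use of empirical marginals, and the toy Age/AgeMonths data are assumptions for illustration only.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(values):
    """Map values to standard-normal scores through their empirical ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) keeps the uniform value strictly inside (0, 1).
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def fit_copula(x, y):
    """Learn the copula correlation and keep sorted marginals for inversion."""
    rho = pearson(normal_scores(x), normal_scores(y))
    return {"rho": rho, "x_sorted": sorted(x), "y_sorted": sorted(y)}


def empirical_quantile(sorted_values, u):
    """Nearest-rank inverse of the empirical CDF."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def sample_copula(model, n, seed=0):
    """Draw correlated normals, map to uniforms, then back to the marginals."""
    rng = random.Random(seed)
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    samples = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + residual * rng.gauss(0, 1)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        samples.append((empirical_quantile(model["x_sorted"], u1),
                        empirical_quantile(model["y_sorted"], u2)))
    return samples


# Toy data: age in years and a noisy age in months (illustrative only).
rng = random.Random(42)
ages = [rng.uniform(3, 79) for _ in range(500)]
age_months = [12 * a + rng.gauss(0, 6) for a in ages]

model = fit_copula(ages, age_months)
synthetic = sample_copula(model, 1000)
print(round(model["rho"], 3), len(synthetic))
```

Because this sketch inverts the empirical marginals, every synthetic value is an observed training value. Fitting parametric marginals instead avoids that, which is presumably the role of the `marginal_dist_dict` argument, though the details depend on the library.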