TabulaCopula

class TabulaCopula(definitions=None, output_general_prefix=None, conditionalSettings_dict=None, metaData_transformer=None, var_list_filter=None, removeNull=False, sampling=None, debug=False)

Module for Gaussian copula and conditional-copula modelling of tabular data.

Parameters

definitions: file (.py), optional, default None. Contains global variables.

output_general_prefix: str, optional, default None. Prefix used for all output files, e.g. "EXPT_1". If not None, replaces settings in definitions.

conditionalSettings_dict: dict, optional, default None. Dictionary of conditional inputs for using conditional-copula.

metaData_transformer: dict, optional, default None. Dictionary of inputs for the Transformer class initialisation (ref Transformer metaData).

var_list_filter: list, optional, default None. List of variables to transform. If None, all will be transformed.

removeNull: boolean, optional, default False. Whether to remove all null values prior to transformation.

sampling: float, optional, default None. Percentage of sample points drawn from the transformed dataframe for training, leaving the rest as control. If None, all training points are used. (Note that sampling is done after the transformation.)

debug: boolean, optional, default False. Whether to print debug-related outputs to console.
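
To make the effect of `sampling` concrete, here is a minimal sketch of a train/control split. It is not the library's implementation: the function name `split_train_control`, the `seed` argument, and the reading of `sampling` as a percentage in (0, 100] are all assumptions made for illustration.

```python
import random

def split_train_control(rows, sampling=None, seed=0):
    """Split rows into a training sample and a disjoint control set.

    Assumption: `sampling` is a percentage in (0, 100]; None keeps every row.
    """
    if sampling is None:
        return list(rows), []
    if not 0 < sampling <= 100:
        raise ValueError("sampling must be in (0, 100]")
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    # Number of rows kept for training; the remainder becomes the control set.
    n_train = round(len(rows) * sampling / 100)
    train_idx = set(indices[:n_train])
    train = [row for i, row in enumerate(rows) if i in train_idx]
    control = [row for i, row in enumerate(rows) if i not in train_idx]
    return train, control

rows = [{"id": i} for i in range(10)]
train, control = split_train_control(rows, sampling=70)
```

The two returned lists play the roles that the attributes `transformed_df` (after sampling) and `control_df` are described as playing: the first is used for training and the second is held back for privacy leakage testing.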

Notes

Description of conditionalSettings_dict

The conditionalSettings_dict variable specifies the structure of the conditional setup. It takes the following form:

conditionalSettings_dict = {
    "set_1": {
        "bool": True,
        "parent_conditions": { # parents: the `Y` when learning `P(X | Y)`.
            "SurveyYr": { # split variable into 2 sets
                "condition": "set", # available options: "set", "range"
                "condition_value": {
                    1: ["2009_10"],
                    2: ["2011_12"]
                }
            },
            "Age": { # split variable into 3 sets based on range
                "condition": "range",
                "condition_value": {
                    1: [">=3", "<79"],
                    2: ["<3"],
                    3: [">=79"]
                }
            }
        },
        "conditions_var": ["Age"], # the `Y` to keep constant while generating values of `X` in `P(X | Y)`. Can be a float, in which case it is a threshold: all variables whose pairwise correlation with X is above the threshold are fixed.
        "children": ['AgeMonths'] # variables whose joint conditional distribution is learned, the `X` in `P(X | Y)`. Can be a string: "allOthers".
    }
}
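
The following sketch illustrates one way the "set" and "range" entries of `parent_conditions` could be interpreted when assigning a value to a group. It is an illustration under stated assumptions, not the library's parser: the helper names `parse_bound` and `assign_group` are hypothetical, and it assumes that all bounds listed for a "range" group must hold at the same time.

```python
import operator

# Comparison operators, two-character symbols first so ">=" is matched before ">".
_OPS = [(">=", operator.ge), ("<=", operator.le), (">", operator.gt), ("<", operator.lt)]

def parse_bound(text):
    """Turn a string such as ">=3" into a predicate on a number."""
    for symbol, func in _OPS:
        if text.startswith(symbol):
            limit = float(text[len(symbol):])
            return lambda value, func=func, limit=limit: func(value, limit)
    raise ValueError(f"unrecognised bound: {text!r}")

def assign_group(value, spec):
    """Return the key of the first group in `spec` that `value` satisfies, else None."""
    for key, members in spec["condition_value"].items():
        if spec["condition"] == "set":
            if value in members:
                return key
        elif spec["condition"] == "range":
            # Assumption: every bound listed for a group must hold together.
            if all(parse_bound(bound)(value) for bound in members):
                return key
        else:
            raise ValueError(f"unknown condition type: {spec['condition']!r}")
    return None

age_spec = {
    "condition": "range",
    "condition_value": {1: [">=3", "<79"], 2: ["<3"], 3: [">=79"]},
}
survey_spec = {
    "condition": "set",
    "condition_value": {1: ["2009_10"], 2: ["2011_12"]},
}

age_groups = [assign_group(age, age_spec) for age in (2, 3, 40, 79)]
survey_group = assign_group("2011_12", survey_spec)
```

Under this reading, the three `Age` groups cover every age without overlapping, so each record falls into exactly one group.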

Examples

Please refer to the below pages for detailed examples:

Example	Description
Multivariate Synthetic Data (multi-Linear)	Demonstrates the use of the TabulaCopula class to generate synthetic data for a multivariate dataset.
Multivariate Synthetic Data (multi-Linear, with missing values)	Demonstrates use of the TabulaCopula class to generate synthetic data for a multivariate real dataset (NHANES), between variables of known linear relationship.
Multivariate Synthetic Data (multi-Linear, Privacy Leakage Assessment)	Demonstrates use of the TabulaCopula class to quantitatively assess the privacy leakage risks of generated synthetic data.
Multivariate Synthetic Data (multi, non-linear, non-monotonic)	Demonstrates use of the TabulaCopula class to generate synthetic data for a multivariate simulated dataset (socialdata), between variables of known non-linear, non-monotonic relationships.

Attributes

Attribute	Description
debug	(boolean) whether to print debug-related outputs to console
folder_trainData	(str) training data folder
folder_synData	(str) synthetic data folder
folder_privacyMetrics	(str) privacy metrics folder
output_general_prefix	(str) prefix used for all output files
sampling	(int) percentage of sample points drawn from the transformed dataframe, leaving the rest as control
privacy_batch_n	(int) number of repetitions of privacy test
output_type_data	(str) output file type for the clean data files.
output_type_dict	(str) output file type for the amended dictionary.
output_type_obj	(str) output file type for saved class instance
dict_var_varname	(str) column in data dictionary containing variable names in input data
dict_var_varcategory	(str) column in data dictionary setting the category of the variable name
dict_var_vartype	(str) column in data dictionary containing variable types in input data
conditional_set_bool	(bool) flag set to true when filenames initialised for conditional setup
metaData_transformer	(dict) dictionary of inputs for the Transformer class initialisation
var_list_filter	(list) list of variables to transform (subset of all input variables)
removeNull	(bool) Whether to remove all null values prior to transformation.
conditionalSettings_dict	(dict) dictionary of conditional inputs for using conditional-copula.
prefix_path	(str) PREFIX_PATH from definitions
trainxlsx	(str) TRAINXLSX from definitions
traindictxlsx	(str) TRAINDICTXLSX from definitions
train_data_path	(str) folder path to put training data
train_data_filename	(str) filename of training data
train_data_dict_filename	(str) filename of dictionary of training data
syn_data_path	(str) folder path to put synthetic data
privacyMetrics_path	(str) folder path to put privacy metrics
train_df	(dataframe) training data (dataframe)
dict_df	(dataframe) data dictionary (dataframe)
var_list	(list) list of all variables (column headers) found in input data
processed_var_list	(list) list of all variables (column headers) that have been transformed
transformed_df	(dataframe) transformed training data; after sampling, replaced with the sampled subset (disjoint from self.control_df)
control_df	(dataframe) dataframe that is not sampled for training (left as control for privacy leakage testing)
curated_train_df	(dataframe) curated data prior to transformation (dataframe)
syn_samples_df	(dataframe) synthetic samples (dataframe)
syn_samples_conditional_df	(dataframe) conditional synthetic samples (dataframe)
reversed_df	(dataframe) reversed synthetic samples (dataframe)
reversed_conditional_df	(dataframe) conditional reversed synthetic samples (dataframe)
reversed_control_df	(dataframe) reversed control_df (dataframe)
reversed_transformed_df	(dataframe) reversed transformed_df (dataframe)
privacyMetricEval	(obj) privacy metric evaluator
privacyMetricEval_cond	(obj) privacy metric evaluator for conditional copula
privacyMetricResults	(dict) privacy metric results dictionary
definitions	(obj) definitions in corresponding input `definitions.py`

Methods

Method	Description
transform([metaData, var_list])	transform data into numerical equivalent
transform_conditional([metaData, ])	transform data into numerical equivalent (for conditional)
reverse_transform([transformed_df, conditional_transformed_df, control_transformed_df])	reverse transformation on generated synthetic data
print_details_copula()	print copula details
fit_gaussian_copula([correlation_method, marginal_dist_dict])	build copula for given training data
fit_gaussian_copula_conditional([correlation_method, marginal_dist_dict])	build conditional-copula for given conditional_dict
sample_gaussian_copula([sample_size, conditions])	sample datapoints from learned joint distribution
sample_gaussian_copula_conditional()	sample datapoints from learned conditional joint distribution
syn_generate([sample_size, cond_bool, conditions])	wrapper for synthetic data generation
build_privacyMetric()	build privacyMetric, privacyMetric_conditional evaluator
privacyMetric_singlingOut_Batch([n, mode, n_attacks, print_results])	wrapper fn to run privacy metric evaluation for singling out attack (standard)
privacyMetric_singlingOut_cond_Batch([n, mode, n_attacks, print_results])	wrapper fn to run privacy metric evaluation for singling out attack (conditional)
privacyMetric_Linkability_Batch(aux_cols, [n, n_neighbors, n_attacks, print_results])	wrapper fn to run privacy metric evaluation for linkability attack (standard)
privacyBatch_Linkability_cond_Batch(aux_cols, [n, n_neighbors, n_attacks, print_results])	wrapper fn to run privacy metric evaluation for linkability attack (conditional)
privacyMetric_Inference_Batch([n, n_attacks, print_results])	wrapper fn to run privacy metric evaluation for inference attack (standard)
privacyMetric_Inference_cond_Batch([n, n_attacks, print_results])	wrapper fn to run privacy metric evaluation for inference attack (conditional)
save()	wrapper fn to save class instance and output filenames
save_outputFilenames()	Saves the output filenames dictionary to a csv file, suffix="CL-OF"
save_instance()	Saves the current class instance to a pickle file
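
For readers unfamiliar with what fitting and sampling a Gaussian copula involves, the sketch below shows the generic technique for two variables using only the Python standard library. It is not the code behind `fit_gaussian_copula` or `sample_gaussian_copula`: every function name here is hypothetical, the marginals are assumed to be empirical (rank-based), and the correlation is assumed to be Pearson's on the normal scores.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_scores(values):
    """Map values to standard-normal scores through their ranks (ties broken by order)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def empirical_quantile(sorted_values, u):
    """Return the observed value at quantile u (nearest rank, no interpolation)."""
    index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]

def fit_and_sample(x, y, sample_size, seed=0):
    """Fit a bivariate Gaussian copula to (x, y) and draw synthetic pairs."""
    # Fit: correlation between the normal scores of the two variables.
    rho = pearson(normal_scores(x), normal_scores(y))
    rng = random.Random(seed)
    sorted_x, sorted_y = sorted(x), sorted(y)
    samples = []
    for _ in range(sample_size):
        # Sample: correlated standard normals, then back to the data scale.
        g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z1 = g1
        z2 = rho * g1 + math.sqrt(max(0.0, 1 - rho ** 2)) * g2
        samples.append((empirical_quantile(sorted_x, STD_NORMAL.cdf(z1)),
                        empirical_quantile(sorted_y, STD_NORMAL.cdf(z2))))
    return rho, samples

data_rng = random.Random(1)
x = [data_rng.uniform(0, 10) for _ in range(200)]
y = [2 * v + data_rng.gauss(0, 1) for v in x]
rho, samples = fit_and_sample(x, y, sample_size=500)
```

Because the sketch maps samples back through empirical quantiles, every synthetic value is one that already occurs in the training column. That is a simplification for illustration; a fitted parametric marginal would produce values outside the observed set.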