rockfish.labs.metrics

`metrics`

Functions

`pearsonr(dataset: LocalDataset, x: str, y: str) -> LocalDataset`

Returns a new table containing two fields: `statistic` and `pvalue`. The `statistic` field holds a single value, the Pearson product-moment correlation coefficient between `x` and `y`. The `pvalue` field holds the corresponding p-value.

See `scipy.stats.pearsonr`.

Parameters:

Name	Type	Description	Default
`dataset`	`LocalDataset`	Input dataset.	required
`x`	`str`	Field containing x data.	required
`y`	`str`	Field containing y data.	required
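
As an illustration of the `statistic` value only, the coefficient can be computed from two plain Python sequences as in the sketch below. The helper name `pearson_statistic` is hypothetical and not part of `rockfish.labs.metrics`. The sketch does not compute the p-value and does not operate on a `LocalDataset`.

```python
import math


def pearson_statistic(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences.

    Hypothetical helper for illustration; assumes neither input is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# A perfectly linear increasing relationship gives a coefficient of 1.
print(pearson_statistic([1, 2, 3, 4], [2, 4, 6, 8]))
```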

`cramer_v(dataset: LocalDataset, x: str, y: str, correction: bool = True) -> float`

Computes Cramér's V, a measure of association between two categorical columns. The value ranges from 0 (no association) to 1 (perfect association).

Parameters:

Name	Type	Description	Default
`dataset`	`LocalDataset`	Input dataset.	required
`x`	`str`	Field containing x data.	required
`y`	`str`	Field containing y data.	required
`correction`	`bool`	Apply bias correction to Cramer's V. Default is True.	`True`
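
The sketch below shows the textbook, uncorrected form of Cramér's V, computed from the chi-squared statistic of the contingency table. The helper name `cramers_v_uncorrected` is hypothetical. It does not reproduce the bias correction controlled by `correction`, and it takes plain sequences rather than a `LocalDataset`.

```python
import math
from collections import Counter


def cramers_v_uncorrected(xs, ys):
    """Uncorrected Cramér's V for two equal-length categorical sequences.

    Hypothetical helper for illustration only.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed cell counts
    row = Counter(xs)             # marginal counts for x
    col = Counter(ys)             # marginal counts for y

    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row), len(col)) - 1
    if k == 0:
        # One of the columns is constant, so no association can be measured.
        return 0.0
    return math.sqrt(chi2 / (n * k))
```

With `xs = ["a", "a", "b", "b"]`, passing `ys = ["u", "u", "v", "v"]` gives 1 (each `x` value determines `y`), while `ys = ["u", "v", "u", "v"]` gives 0.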

`ks_distance(ds1: LocalDataset, ds2: LocalDataset, field: str) -> float`

Computes the Kolmogorov-Smirnov distance between the distributions of a numerical field in two datasets.

See `scipy.stats.ks_2samp`.

Parameters:

Name	Type	Description	Default
`ds1`	`LocalDataset`	Input dataset 1.	required
`ds2`	`LocalDataset`	Input dataset 2.	required
`field`	`str`	Field to compute the distance for.	required
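
For intuition, the two-sample Kolmogorov-Smirnov statistic is the largest absolute gap between the two empirical cumulative distribution functions. The sketch below computes it with the standard library. The helper name `ks_statistic` is hypothetical, and the code works on plain sequences rather than a `LocalDataset`.

```python
from bisect import bisect_right


def ks_statistic(sample1, sample2):
    """Largest absolute difference between two empirical CDFs.

    Hypothetical helper for illustration only.
    """
    s1, s2 = sorted(sample1), sorted(sample2)
    # The gap between the two step functions can only change at observed values,
    # so it is enough to evaluate both CDFs at every observed point.
    return max(
        abs(bisect_right(s1, v) / len(s1) - bisect_right(s2, v) / len(s2))
        for v in s1 + s2
    )
```

Samples with no overlap, such as `[1, 2, 3]` and `[4, 5, 6]`, give a distance of 1. Identical samples give 0.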

`tv_distance(ds1: LocalDataset, ds2: LocalDataset, field: str) -> float`

Computes the total variation distance between the distributions of a categorical field in two datasets.

Parameters:

Name	Type	Description	Default
`ds1`	`LocalDataset`	Input dataset 1.	required
`ds2`	`LocalDataset`	Input dataset 2.	required
`field`	`str`	Field to compute the distance for.	required
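
The total variation distance between two categorical distributions is half the sum of the absolute differences between their category frequencies. A minimal sketch on plain sequences, using the hypothetical helper name `total_variation`:

```python
from collections import Counter


def total_variation(values1, values2):
    """Total variation distance between two empirical categorical distributions.

    Hypothetical helper for illustration only.
    """
    p, q = Counter(values1), Counter(values2)
    n1, n2 = len(values1), len(values2)
    # Counter returns 0 for categories that are missing on one side.
    return 0.5 * sum(abs(p[c] / n1 - q[c] / n2) for c in set(p) | set(q))
```

The result is 0 for identical frequency tables and 1 when the two inputs share no categories.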

`category_coverage(ds1: LocalDataset, ds2: LocalDataset, field: str) -> float`

Compares the category coverage of a categorical field across two datasets.

Returns: score value

Score range: 0-1 (best = 1, worst = 0)

Parameters:

Name	Type	Description	Default
`ds1`	`LocalDataset`	Input dataset 1.	required
`ds2`	`LocalDataset`	Input dataset 2.	required
`field`	`str`	Field to compute.	required
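
The exact formula is not stated here. One common definition of category coverage, assumed in the sketch below, is the fraction of categories in a reference column that also appear in the compared column. The helper name `category_coverage_sketch` and the choice of the first argument as the reference are assumptions, not the library's documented behavior.

```python
def category_coverage_sketch(reference, candidate):
    """Fraction of the reference categories that also occur in the candidate.

    Hypothetical helper; the definition used here is an assumption.
    """
    reference_categories = set(reference)
    covered = reference_categories & set(candidate)
    return len(covered) / len(reference_categories)


# The reference has four categories and the candidate reproduces two of them.
print(category_coverage_sketch(["a", "b", "c", "d"], ["a", "b", "b"]))  # 0.5
```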

`range_coverage(ds1: LocalDataset, ds2: LocalDataset, field: str) -> float`

Compares the range coverage of a numerical field across two datasets.

Returns: score value

Score range: 0-1 (best = 1, worst = 0)

Parameters:

Name	Type	Description	Default
`ds1`	`LocalDataset`	Input dataset 1.	required
`ds2`	`LocalDataset`	Input dataset 2.	required
`field`	`str`	Field to compute.	required
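
The exact formula is not stated here. As a sketch under an assumed definition, range coverage can be read as the share of the reference column's `[min, max]` interval that the compared column's interval overlaps. The helper name `range_coverage_sketch` and the choice of the first argument as the reference are assumptions.

```python
def range_coverage_sketch(reference, candidate):
    """Share of the reference [min, max] interval overlapped by the candidate.

    Hypothetical helper; the definition used here is an assumption.
    """
    ref_lo, ref_hi = min(reference), max(reference)
    cand_lo, cand_hi = min(candidate), max(candidate)
    if ref_hi == ref_lo:
        # Degenerate reference range: covered only if the single value is included.
        return 1.0 if cand_lo <= ref_lo <= cand_hi else 0.0
    overlap = min(ref_hi, cand_hi) - max(ref_lo, cand_lo)
    # A negative overlap means the intervals are disjoint.
    return max(0.0, overlap / (ref_hi - ref_lo))
```

For a reference spanning 0 to 10, a candidate spanning 2 to 7 scores 0.5, and a candidate spanning 20 to 30 scores 0.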

`jsd(ds1: LocalDataset, ds2: LocalDataset, fields: list[str]) -> float`

Computes the Jensen-Shannon distance (metric) between two categorical distributions. This is the square root of the Jensen-Shannon divergence.

Returns: score value

Score range: 0-1 (best = 0, worst = 1)

Parameters:

Name	Type	Description	Default
`ds1`	`LocalDataset`	Input dataset 1.	required
`ds2`	`LocalDataset`	Input dataset 2.	required
`fields`	`list[str]`	A list of one or more fields.	required
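
A minimal sketch of the Jensen-Shannon distance on plain sequences follows. The helper name `js_distance` is hypothetical. The base-2 logarithm is an assumption, chosen because it makes the distance fall in the documented 0-1 range. To mimic several fields at once, each element can be a tuple of values.

```python
import math
from collections import Counter


def js_distance(values1, values2):
    """Square root of the base-2 Jensen-Shannon divergence.

    Hypothetical helper for illustration only.
    """
    p_counts, q_counts = Counter(values1), Counter(values2)
    n1, n2 = len(values1), len(values2)

    divergence = 0.0
    for category in set(p_counts) | set(q_counts):
        p = p_counts[category] / n1
        q = q_counts[category] / n2
        m = (p + q) / 2  # mixture distribution
        # Terms with zero probability contribute nothing to the KL sums.
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(divergence, 0.0))
```

Identical inputs give 0, and inputs with no shared categories give 1.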

`emd(ds1: LocalDataset, ds2: LocalDataset, fields: list[str]) -> float`

Computes the Wasserstein distance (also known as the Earth Mover's distance) between two numerical distributions.

Returns: score value

Score range: 0 to positive infinity (best = 0, worst = positive infinity)

One-dimensional inputs use `scipy.stats.wasserstein_distance()`. Inputs with two or more dimensions use the GeomLoss library to compute an approximation.

Parameters:

Name	Type	Description	Default
`ds1`	`LocalDataset`	Input dataset 1.	required
`ds2`	`LocalDataset`	Input dataset 2.	required
`fields`	`list[str]`	A list of one or more fields.	required
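
For intuition about the one-dimensional case, when both samples have the same size the Wasserstein distance reduces to the mean absolute difference between the sorted values. The sketch below covers only that special case. The helper name `wasserstein_1d_equal_size` is hypothetical, and the code is not a substitute for the SciPy or GeomLoss routines mentioned above.

```python
def wasserstein_1d_equal_size(sample1, sample2):
    """1D Wasserstein distance for two samples of equal size.

    Hypothetical helper for illustration only.
    """
    if len(sample1) != len(sample2):
        raise ValueError("this sketch only handles samples of equal size")
    # Optimal transport in 1D pairs the i-th smallest values of each sample.
    pairs = zip(sorted(sample1), sorted(sample2))
    return sum(abs(a - b) for a, b in pairs) / len(sample1)


# Every value has to move by 5 units.
print(wasserstein_1d_equal_size([0, 1, 3], [5, 6, 8]))  # 5.0
```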

`marginal_dist_score(dataset: LocalDataset, syn: LocalDataset, metadata: list[str] = [], other_categorical: list[str] = [], weights: dict[str, float] = {}) -> float`

Return the weighted fidelity score of the synthetic dataset compared to the real dataset. Range: 0-1 (best = 1, worst = 0)

Parameters:

Name	Type	Description	Default
`dataset`	`LocalDataset`	Real dataset	required
`syn`	`LocalDataset`	Synthetic dataset	required
`metadata`	`list[str]`	(optional) List of metadata fields for a timeseries dataset.	`[]`
`other_categorical`	`list[str]`	(optional) List of other categorical fields.	`[]`
`weights`	`dict[str, float]`	(optional) Weights for fields. If not provided, all weights default to 1.	`{}`

`distance_to_closest_record_score(train_dataset: rf.dataset.LocalDataset, test_dataset: rf.dataset.LocalDataset, syn: rf.dataset.LocalDataset, subset_length: Optional[int] = None, subset_seed: Optional[int] = None, transform: Optional[Literal['sigmoid']] = None) -> float`

Returns a Distance to Closest Record (DCR) score.

A DCR score measures the similarity between the DCR distributions of two dataset pairs: (train, synthetic) and (train, test). The more similar the two DCR distributions are, the more "private" the synthetic data.

A DCR distribution is obtained as follows: for each record in the synthetic (respectively, test) dataset, store the Gower distance to the closest record in the train dataset.

The DCR score is the ratio between the medians of the two DCR distributions.

Range for score: 0 - positive infinity (worst = 0, best >= 1)

Interpreting the score using a Likert scale for quality (also helpful for visualization):

1. Low: [0, 0.75)
2. Medium: [0.75, 1.0)
3. High: [1.0, positive infinity)

If the modified sigmoid function is used, the above ranges are transformed to:

1. Low: [0, 0.36)
2. Medium: [0.36, 0.46)
3. High: [0.46, 1.0)

Parameters:

Name	Type	Description	Default
`train_dataset`	`LocalDataset`	Real dataset used to train the synthetic data generator.	required
`test_dataset`	`LocalDataset`	Dataset from the same distribution as the train_dataset.	required
`syn`	`LocalDataset`	Synthetic dataset.	required
`subset_length`	`Optional[int]`	(optional) Number of rows that should be randomly sampled from the three datasets when the DCR score is to be computed on subsets instead of the full datasets.	`None`
`subset_seed`	`Optional[int]`	(optional) Seed for taking random samples from datasets.	`None`
`transform`	`Optional[Literal['sigmoid']]`	(optional) Transform the DCR score using a modified sigmoid function.	`None`
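
The procedure described above can be sketched for purely numerical records as follows. Several details are assumptions rather than documented behavior: the ratio is taken as median(synthetic DCR) / median(test DCR), which matches "worst = 0, best >= 1"; the Gower distance is reduced to its numerical form, with per-field ranges taken over all three inputs; and the "modified sigmoid" is modeled as `tanh(score / 2)`, one function that reproduces the documented thresholds (0.75 maps to about 0.36, 1.0 to about 0.46). All helper names are hypothetical.

```python
import math
from statistics import median


def gower_numeric(a, b, ranges):
    """Gower distance for numerical records: mean range-normalized difference."""
    return sum(abs(x - y) / r for x, y, r in zip(a, b, ranges)) / len(ranges)


def dcr_distribution(records, train, ranges):
    """Distance from each record to its closest record in the train data."""
    return [min(gower_numeric(rec, t, ranges) for t in train) for rec in records]


def dcr_score_sketch(train, test, syn, transform=None):
    """Ratio of median DCRs; each input is a list of equal-length numeric tuples."""
    columns = list(zip(*(train + test + syn)))
    # Avoid dividing by zero for constant fields.
    ranges = [(max(col) - min(col)) or 1.0 for col in columns]
    syn_median = median(dcr_distribution(syn, train, ranges))
    test_median = median(dcr_distribution(test, train, ranges))
    # Assumed direction of the ratio; fails if test_median is 0.
    score = syn_median / test_median
    if transform == "sigmoid":
        # Assumed form of the modified sigmoid, equal to 2 / (1 + exp(-x)) - 1.
        score = math.tanh(score / 2)
    return score


train = [(0.0,), (10.0,)]
test = [(2.0,), (8.0,)]
syn = [(1.0,), (9.0,)]
print(dcr_score_sketch(train, test, syn))
```

In this toy input the synthetic records sit closer to the train records than the test records do, so the ratio is below 0.75 and falls in the "Low" band.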

`correlation_score(dataset: rf.dataset.LocalDataset, syn: rf.dataset.LocalDataset, fields: list[str]) -> float`

Returns the correlation score between the real and synthetic datasets on the selected fields. For each pair of fields, it compares the Pearson correlation in the real data with the Pearson correlation in the synthetic data.

Score range between 0 and 1 (best = 1, worst = 0)

Parameters:

Name	Type	Description	Default
`dataset`	`LocalDataset`	Real dataset.	required
`syn`	`LocalDataset`	Synthetic dataset.	required
`fields`	`list[str]`	List of continuous numerical fields to compute the correlation score for.	required
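
The aggregation formula is not given here. One way to turn pairwise correlation differences into a 0-1 score, assumed in the sketch below, is to average the absolute differences and rescale by 2, the largest possible gap between two correlation coefficients. The helper names and the normalization are assumptions, and the inputs are plain dictionaries of non-constant columns rather than `LocalDataset` objects.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_score_sketch(real, syn, fields):
    """1 minus half the mean absolute gap between real and synthetic correlations.

    Hypothetical helper; the normalization is an assumption.
    """
    gaps = [
        abs(pearson(real[a], real[b]) - pearson(syn[a], syn[b]))
        for a, b in combinations(fields, 2)
    ]
    return 1 - (sum(gaps) / len(gaps)) / 2


real = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]}
flipped = {"x": [1, 2, 3, 4], "y": [8, 6, 4, 2]}
print(correlation_score_sketch(real, real, ["x", "y"]))
print(correlation_score_sketch(real, flipped, ["x", "y"]))
```

The first call compares the data with itself and gives the best score. The second reverses the sign of the correlation and gives the worst.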

`association_score(dataset: rf.dataset.LocalDataset, syn: rf.dataset.LocalDataset, fields: list[str], correction: bool = False) -> float`

Returns the association score between the real and synthetic datasets on the selected fields. For each pair of fields, it compares the Cramér's V association in the real data with the Cramér's V association in the synthetic data.

Score range between 0 and 1 (best = 1, worst = 0)

Parameters:

Name	Type	Description	Default
`dataset`	`LocalDataset`	Real dataset.	required
`syn`	`LocalDataset`	Synthetic dataset.	required
`fields`	`list[str]`	List of categorical fields to compute the association score for.	required
`correction`	`bool`	Apply bias correction to Cramer's V. Default is False.	`False`

`memorization_rate(dataset: LocalDataset, syn: LocalDataset) -> float`

Returns the memorization rate for synthetic data.

It is evaluated on tabular datasets and measures how much of the real dataset the synthetic dataset has memorized.

Rate range: 0 - 1 (0 means no memorization, 1 means full memorization)

Parameters:

Name	Type	Description	Default
`dataset`	`LocalDataset`	Real dataset.	required
`syn`	`LocalDataset`	Synthetic dataset.	required
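
How memorization is detected is not specified here. A simple reading, assumed in the sketch below, is the fraction of synthetic rows that exactly duplicate some real row. The helper name `memorization_rate_sketch` and the exact-match criterion are assumptions. Rows are represented as hashable tuples.

```python
def memorization_rate_sketch(real_rows, syn_rows):
    """Fraction of synthetic rows that appear verbatim in the real rows.

    Hypothetical helper; exact matching is an assumption.
    """
    real_set = set(real_rows)
    copied = sum(1 for row in syn_rows if row in real_set)
    return copied / len(syn_rows)


real_rows = [(1, "a"), (2, "b")]
syn_rows = [(1, "a"), (3, "c"), (2, "b"), (4, "d")]
# Two of the four synthetic rows are exact copies of real rows.
print(memorization_rate_sketch(real_rows, syn_rows))  # 0.5
```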

`range_adherence_score(dataset: LocalDataset, syn: LocalDataset, fields: list[str])`

Returns the range adherence score for synthetic data.

It supports the evaluation of numerical fields as well as temporal fields and measures the average adherence level of synthetic fields to the range of the real fields.

Score range: 0-1 (worst = 0, best = 1). A score of 0 means no adherence and a score of 1 means full adherence.

Parameters:

Name	Type	Description	Default
`dataset`	`LocalDataset`	Real dataset.	required
`syn`	`LocalDataset`	Synthetic dataset.	required
`fields`	`list[str]`	List of numerical or temporal fields.	required
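
One plausible reading of "average adherence level", assumed in the sketch below, is the per-field fraction of synthetic values that fall inside the real field's `[min, max]` interval, averaged over the fields. The helper name `range_adherence_sketch` and this definition are assumptions. Inputs are plain dictionaries of columns. Temporal values such as `datetime` objects would work the same way because they support comparison.

```python
def range_adherence_sketch(real, syn, fields):
    """Mean fraction of synthetic values lying within the real [min, max] range.

    Hypothetical helper; the definition used here is an assumption.
    """
    per_field = []
    for field in fields:
        lo, hi = min(real[field]), max(real[field])
        inside = sum(1 for value in syn[field] if lo <= value <= hi)
        per_field.append(inside / len(syn[field]))
    return sum(per_field) / len(per_field)


real_cols = {"a": [0, 10], "b": [5, 6]}
syn_cols = {"a": [1, 11, 5, -1], "b": [5, 5.5, 6, 6]}
# Field "a": 2 of 4 values in range. Field "b": 4 of 4 values in range.
print(range_adherence_sketch(real_cols, syn_cols, ["a", "b"]))  # 0.75
```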

`txtr_score(real: float, syn: float, lower: float = 0.5) -> float`

Calculates the TXTR score.

Parameters:

Name	Type	Description	Default
`real`	`float`	The AUC value of the classifier trained on the real data.	required
`syn`	`float`	The AUC value of the classifier trained on the synthetic data.	required
`lower`	`float`	The lower reference value of the score.	`0.5`