rockfish.labs.metrics
metrics
Classes
Functions
pearsonr(dataset: LocalDataset, x: str, y: str) -> LocalDataset
Returns a new table containing two fields: statistic and pvalue. The statistic field will have one value with the Pearson product-moment correlation coefficient and the pvalue field will have the p-value between the fields.
See :func:scipy.stats.pearsonr
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset
|
LocalDataset
|
Input dataset. |
required |
x
|
str
|
Field containing x data. |
required |
y
|
str
|
Field containing y data. |
required |
cramer_v(dataset: LocalDataset, x: str, y: str, correction: bool = True) -> float
Compute the Cramer's V for two categorical columns. Best at 1, worst at 0.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset
|
LocalDataset
|
Input dataset. |
required |
x
|
str
|
Field containing x data. |
required |
y
|
str
|
Field containing y data. |
required |
correction
|
bool
|
Apply bias correction to Cramer's V. Default is True. |
True
|
ks_distance(ds1: LocalDataset, ds2: LocalDataset, field: str) -> float
Computes the Kolmogorov-Smirnov distance between two numerical fields.
See :func:scipy.stats.ks_2samp
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ds1
|
LocalDataset
|
Input dataset 1. |
required |
ds2
|
LocalDataset
|
Input dataset 2. |
required |
field
|
str
|
Field to compute the distance for. |
required |
tv_distance(ds1: LocalDataset, ds2: LocalDataset, field: str) -> float
Computes the total variation distance between two categorical fields.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ds1
|
LocalDataset
|
Input dataset 1. |
required |
ds2
|
LocalDataset
|
Input dataset 2. |
required |
field
|
str
|
Field to compute the distance for. |
required |
category_coverage(ds1: LocalDataset, ds2: LocalDataset, field: str) -> float
Compare the category coverage of two categorical columns. Returns: score value Score range: 0-1 (best = 1, worst = 0)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ds1
|
LocalDataset
|
Input dataset 1. |
required |
ds2
|
LocalDataset
|
Input dataset 2. |
required |
field
|
str
|
Field to compute. |
required |
range_coverage(ds1: LocalDataset, ds2: LocalDataset, field: str) -> float
Compare the range coverage of two numerical columns. Returns: score value Score range: 0-1 (best = 1, worst = 0)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ds1
|
LocalDataset
|
Input dataset 1. |
required |
ds2
|
LocalDataset
|
Input dataset 2. |
required |
field
|
str
|
Field to compute. |
required |
jsd(ds1: LocalDataset, ds2: LocalDataset, fields: list[str]) -> float
Compute the Jensen-Shannon distance (metric) between two categorical distributions. This is the square root of the Jensen-Shannon divergence. Returns: score value Score range: 0 - 1 (best = 0, worst = 1)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ds1
|
LocalDataset
|
Input dataset 1. |
required |
ds2
|
LocalDataset
|
Input dataset 2. |
required |
fields
|
list[str]
|
A list of field(s). (can be a list 1 field or multiple fields) |
required |
emd(ds1: LocalDataset, ds2: LocalDataset, fields: list[str]) -> float
Compute the Wasserstein distance(aka Earth Mover's distance) between two numerical distributions. Returns: score value Score range: 0 - positive infinity (best = 0, worst = positive infinity)
For 1D arrays, use scipy.stats.wasserstein_distance(). For higher dimensional (>=2) arrays, use GeomLoss library for approximation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ds1
|
LocalDataset
|
Input dataset 1. |
required |
ds2
|
LocalDataset
|
Input dataset 2. |
required |
fields
|
list[str]
|
A list of field(s). (can be a list 1 field or multiple fields) |
required |
marginal_dist_score(dataset: LocalDataset, syn: LocalDataset, metadata: list[str] = [], other_categorical: list[str] = [], weights: dict[str, float] = {}) -> float
Return the weighted fidelity score of the synthetic dataset compared to the real dataset. Range: 0-1 (best = 1, worst = 0)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset
|
LocalDataset
|
Real dataset |
required |
syn
|
LocalDataset
|
Synthetic dataset |
required |
metadata
|
list[str]
|
[Opitional]List of metadata fields for timeseries dataset |
[]
|
other_categorical
|
list[str]
|
[Opitional]List of other categorical fields |
[]
|
weights
|
dict[str, float]
|
[Opitional]Define weights for fields. If not provided, by default all weights are 1. |
{}
|
distance_to_closest_record_score(train_dataset: rf.dataset.LocalDataset, test_dataset: rf.dataset.LocalDataset, syn: rf.dataset.LocalDataset, subset_length: Optional[int] = None, subset_seed: Optional[int] = None, transform: Optional[Literal['sigmoid']] = None) -> float
Returns a Distance to Closest Record (DCR) score.
A DCR score measures the similarity between the DCR distributions b/w the two dataset pairs - (train, synthetic) and (train, test). The more similar DCR distributions are, the more "private" the synthetic data.
A DCR distribution is obtained as follows: For each record in the synthetic (resp, test) dataset, store the Gower distance to the closest record in the train dataset.
The DCR score is the ratio between the medians of the two DCR distributions.
Range for score: 0 - positive infinity (worst = 0, best >= 1)
Interpreting the score using a Likert scale for quality (also helpful for visualization): 1. Low: [0 - 0.75) 2. Medium: [0.75 - 1.0) 3. High: [1.0, positive infinity) If the modified sigmoid function is used, the above ranges are transformed to: 1. Low: [0 - 0.36) 2. Medium: [0.36 - 0.46) 3. High: [0.46, 1.0)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
train_dataset
|
LocalDataset
|
Real dataset used to train the synthetic data generator. |
required |
test_dataset
|
LocalDataset
|
Dataset from the same distribution as the train_dataset. |
required |
syn
|
LocalDataset
|
Synthetic dataset. |
required |
subset_length
|
Optional[int]
|
(optional) Number of rows that should be randomly sampled from the three datasets when the DCR score is to be computed on subsets instead of the full datasets. |
None
|
subset_seed
|
Optional[int]
|
(optional) Seed for taking random samples from datasets. |
None
|
transform
|
Optional[Literal['sigmoid']]
|
(optional) Transform the DCR score using a modified sigmoid function. |
None
|
correlation_score(dataset: rf.dataset.LocalDataset, syn: rf.dataset.LocalDataset, fields: list[str]) -> float
Returns correlation score between the real and synthetic on the selected fields. It considers the difference from real and synthetic values for pearson correlations on the pairwise fields.
Score range between 0 and 1 (best = 1, worst = 0)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset
|
LocalDataset
|
Real dataset. |
required |
syn
|
LocalDataset
|
Synthetic dataset. |
required |
fields
|
list[str]
|
List of continuous numerical fields to compute the correlation score for. |
required |
association_score(dataset: rf.dataset.LocalDataset, syn: rf.dataset.LocalDataset, fields: list[str], correction: bool = False) -> float
Returns association score between the real and synthetic datasets on the selected fields. It considers the difference from real and synthetic values for the Cramer's V associations on the pairwise fields.
Score range between 0 and 1 (best = 1, worst = 0)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset
|
LocalDataset
|
Real dataset. |
required |
syn
|
LocalDataset
|
Synthetic dataset. |
required |
fields
|
list[str]
|
List of categorical fields to compute the association score for. |
required |
correction
|
bool
|
Apply bias correction to Cramer's V. Default is False. |
False
|
memorization_rate(dataset: LocalDataset, syn: LocalDataset) -> float
Returns the memorization rate for synthetic data.
It evaluates on the tabular datasets and measure how much the synthetic dataset has memorized from the real dataset.
Rate range: 0 - 1 (0 means no memorization, 1 means full memorization)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset
|
LocalDataset
|
Real dataset. |
required |
syn
|
LocalDataset
|
Synthetic dataset. |
required |
range_adherence_score(dataset: LocalDataset, syn: LocalDataset, fields: list[str])
Returns the range adherence score for synthetic data.
It supports the evaluation of numerical fields as well as temporal fields and measures the average adherence level of synthetic fields to the range of the real fields.
Score range: 0 - 1 (worst = 0, best = 1) - 0 means no adherence - 1 means full adherence
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset
|
LocalDataset
|
Real dataset. |
required |
syn
|
LocalDataset
|
Synthetic dataset. |
required |
fields
|
list[str]
|
List of numerical fields. |
required |