Using Databricks to Generate Synthetic Data with Rockfish
With the Rockfish Synthetic Data Platform, you can seamlessly integrate with Databricks to use it as a data source or destination for synthetic data.
This guide covers:
- How to configure Rockfish to connect with Databricks
- Uploading source data
- Training models and generating synthetic datasets with Rockfish
- Storing generated synthetic data in Databricks Unity Catalog
Prerequisites
To integrate Rockfish with Databricks, ensure you have:
- Databricks Account: access to a Databricks workspace.
- Rockfish Account: a Rockfish API key and API URL.
- Python Libraries: pandas and the rockfish SDK installed.
Step 1: Source Data
To generate synthetic data, you will need sample source data to train a generative AI model.
Options for source data:
- Upload data to the Databricks workspace, or
- Use existing data from the Databricks Unity Catalog
Step 2: Onboard, Train, and Generate Synthetic Data using Rockfish
Install Rockfish SDK:
```shell
%pip install -U 'rockfish[labs]' -f 'https://packages.rockfish.ai'
```
Onboard Data:
- Load the source dataset into a pandas DataFrame.

From a CSV file:

```python
import pandas as pd

df = pd.read_csv("/dbfs/mnt/data/source_data.csv")
```

From a Unity Catalog table:

```python
# Define the fully qualified table name, including catalog and schema
table_name = "`rockfish-customers-demo`.`databricks-demo`.`netflow_ton_source`"

# Read the table into a Spark DataFrame
df = spark.read.table(table_name)

# Convert the Spark DataFrame to a pandas DataFrame
pandas_df = df.toPandas()
```
- Save the source dataset to Rockfish with an appropriate tag.
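Before onboarding, it can help to sanity-check the pandas DataFrame, since heavily null or degenerate source data hurts model quality. This is a minimal illustrative sketch, not part of the Rockfish API; the column names and thresholds are assumptions:

```python
import pandas as pd

def check_source_data(df: pd.DataFrame) -> list[str]:
    """Return a list of warnings about issues that can hurt training quality."""
    warnings = []
    if df.empty:
        warnings.append("DataFrame is empty")
    # Flag columns where more than half the values are missing
    for col, frac in df.isna().mean().items():
        if frac > 0.5:
            warnings.append(f"column {col!r} is {frac:.0%} null")
    # Flag datasets that are almost entirely duplicate rows
    if not df.empty and df.duplicated().mean() > 0.9:
        warnings.append("dataset is almost entirely duplicate rows")
    return warnings

# Hypothetical netflow-like sample; real column names will differ
sample = pd.DataFrame({"bytes": [100, 200, None, 400], "proto": ["tcp"] * 4})
print(check_source_data(sample))  # no warnings for this small sample
```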
Train Model:

Select an appropriate Rockfish model and number of training epochs:

```python
tab_gan_train_config = {
    "encoder": data_config,
    "tabular-gan": {
        "epochs": 100,
    },
}
```

Train the model:

```python
workflow = await builder.start(conn)
print(f"Workflow: {workflow.id()}")
```
Generate Synthetic Data:

Use the trained model to generate synthetic records:

```python
tab_gan_generate_config = {
    "tabular-gan": {
        "records": 2582,
    },
}
generate = ra.GenerateTabGAN(tab_gan_generate_config)
```
Assess Synthetic Data (SDA):

Use the Rockfish SDK to generate quality metrics that compare the synthetic data against the source data:

```python
for col in ["Column1", "Column2"]:
    source_agg = rf.metrics.count_all(dataset, col, nlargest=10)
    syn_agg = rf.metrics.count_all(syn, col, nlargest=10)
    rl.vis.plot_bar([source_agg, syn_agg], col, f"{col}_count")
```
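The intuition behind these count metrics can be sketched in plain pandas: tally the most frequent values in a source column and the corresponding synthetic column, then compare them side by side. This is an illustrative stand-in, not the Rockfish `count_all` implementation, and the protocol values are made up:

```python
import pandas as pd

def top_counts(series: pd.Series, nlargest: int = 10) -> pd.Series:
    """Frequency of the most common values, analogous to a top-N count metric."""
    return series.value_counts().nlargest(nlargest)

# Hypothetical source and synthetic values for one categorical column
source_col = pd.Series(["tcp", "tcp", "udp", "icmp", "tcp"])
syn_col = pd.Series(["tcp", "udp", "udp", "tcp", "tcp"])

# Align the two tallies; categories missing from one side count as zero
comparison = pd.DataFrame({
    "source": top_counts(source_col),
    "synthetic": top_counts(syn_col),
}).fillna(0)
print(comparison)
```

Close agreement between the two columns suggests the synthetic data preserves the source's categorical distribution.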
Step 3: Save the generated synthetic data
You can now store the generated synthetic data in a Databricks Unity Catalog managed table.
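A minimal sketch of that write, assuming the synthetic records are in a pandas DataFrame and `spark` is the active session in a Databricks notebook; the catalog, schema, and table names below are placeholders:

```python
def qualified_table_name(catalog: str, schema: str, table: str) -> str:
    """Build a backtick-quoted, fully qualified Unity Catalog table name."""
    return f"`{catalog}`.`{schema}`.`{table}`"

def save_to_unity_catalog(spark, pandas_df, catalog, schema, table):
    """Write a pandas DataFrame as a Unity Catalog managed table."""
    name = qualified_table_name(catalog, schema, table)
    sdf = spark.createDataFrame(pandas_df)
    # mode("overwrite") replaces an existing table; use "append" to add rows
    sdf.write.mode("overwrite").saveAsTable(name)
    return name

# Inside a Databricks notebook this might be called as:
# save_to_unity_catalog(spark, syn_df, "rockfish-customers-demo",
#                       "databricks-demo", "netflow_ton_synthetic")
```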
NOTE: Use the sample Rockfish Notebook to get started.