
Using Databricks for Generating Synthetic Data with Rockfish

The Rockfish Synthetic Data Platform integrates with Databricks, so you can use Databricks both as a source of training data and as a destination for generated synthetic data.

This guide covers:

* How to configure Rockfish to connect with Databricks
* Uploading source data
* Training models and generating synthetic datasets using Rockfish
* Storing generated synthetic data in Databricks Unity Catalog

Prerequisites

To integrate Rockfish with Databricks, ensure you have:

- Databricks account: access to a Databricks workspace.
- Rockfish account: your Rockfish API key and API URL.
- Python libraries: pandas and the rockfish SDK installed.
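
If you run Rockfish from a Databricks notebook, keep the API credentials out of the notebook source. The snippet below is a minimal sketch using the standard dbutils.secrets.get utility; the secret scope name "rockfish" and the key names are placeholders for whatever you configure.

# Read the Rockfish credentials from a Databricks secret scope.
# The scope and key names here are placeholders; use your own.
api_key = dbutils.secrets.get(scope="rockfish", key="api_key")
api_url = dbutils.secrets.get(scope="rockfish", key="api_url")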

Step 1: Source Data

To generate synthetic data, you need sample source data to train a generative AI model.

Options for source data:

* Upload data to your Databricks workspace, or
* Use existing data from Databricks Unity Catalog
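
If the source data already lives in Unity Catalog, you can sanity-check it from a notebook before onboarding it to Rockfish. A quick preview of the example table used later in this guide:

# Preview the first few rows of the source table in Unity Catalog.
source_table = "`rockfish-customers-demo`.`databricks-demo`.`netflow_ton_source`"
display(spark.read.table(source_table).limit(5))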


Step 2: Onboard, Train, and Generate Synthetic Data Using Rockfish

Install Rockfish SDK:

%pip install -U 'rockfish[labs]' -f 'https://packages.rockfish.ai'
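
After installing the SDK, import it and open a connection to the Rockfish API. This is a minimal sketch, assuming the rf / ra / rl aliases used throughout this guide and the rf.Connection.remote helper from Rockfish's examples; check the exact call against your SDK version, and reuse the api_key / api_url values from the prerequisites.

import rockfish as rf
import rockfish.actions as ra
import rockfish.labs as rl

# Connect to the Rockfish API with the key and URL from your account
# (here, the values read from Databricks secrets in the prerequisites step).
conn = rf.Connection.remote(api_url, api_key)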

Onboard Data:

  1. Load the source dataset into a Pandas DataFrame

From a CSV file:

import pandas as pd

# Read the source CSV from DBFS into a pandas DataFrame
df = pd.read_csv("/dbfs/mnt/data/source_data.csv")

From Unity Catalog:

# Define the fully qualified table name, including catalog and schema
table_name = "`rockfish-customers-demo`.`databricks-demo`.`netflow_ton_source`"

# Read the table into a Spark DataFrame
df = spark.read.table(table_name)

# Convert the Spark DataFrame to a pandas DataFrame
pandas_df = df.toPandas()
  2. Perform data processing, encoding, and hyperparameter tuning as needed.

  3. Save the source dataset to Rockfish with an appropriate tag (see the sketch after this list).
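
A minimal onboarding sketch, assuming rf.Dataset.from_pandas as the local-dataset constructor (as in Rockfish's example notebooks) and a placeholder dataset name; adjust the preprocessing to your data:

# Optional light cleanup before onboarding (illustrative only).
pandas_df = pandas_df.dropna(how="all")

# Wrap the DataFrame as a Rockfish dataset. The dataset name is a placeholder;
# verify rf.Dataset.from_pandas against your SDK version.
dataset = rf.Dataset.from_pandas("netflow_ton_source", pandas_df)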

Train Model:

Select an appropriate Rockfish model and number of training epochs:

tab_gan_train_config = {
    "encoder": data_config,
    "tabular-gan": {
        "epochs": 100,
    }
}
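
The training step uses a workflow builder that chains the onboarded dataset into a TabGAN training action. This is a minimal sketch, assuming ra.TrainTabGAN, rf.WorkflowBuilder, and add_path as in Rockfish's example notebooks, and that data_config above is the column-encoding configuration prepared during onboarding; verify the names against your SDK version.

# Assemble a training workflow: onboarded dataset -> TabGAN training action.
train = ra.TrainTabGAN(tab_gan_train_config)

builder = rf.WorkflowBuilder()
builder.add_path(dataset, train)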

Train the model:

workflow = await builder.start(conn)
print(f"Workflow: {workflow.id()}")

Generate Synthetic Data:

Use the trained model to generate synthetic records.

tab_gan_generate_config = {
    "tabular-gan": {
        "records": 2582
    }
}
generate = ra.GenerateTabGAN(tab_gan_generate_config)
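
Generation runs as its own workflow against the trained model. A minimal sketch, assuming the same builder pattern plus ra.DatasetSave and workflow.datasets().concat from Rockfish's examples (verify the exact names for your SDK version); the saved-dataset name is a placeholder:

# Run generation: trained model -> GenerateTabGAN -> saved synthetic dataset.
builder = rf.WorkflowBuilder()
builder.add_path(model, generate, ra.DatasetSave(name="netflow_ton_synthetic"))
workflow = await builder.start(conn)

# Pull the generated records back into a local dataset for evaluation.
syn = await workflow.datasets().concat(conn)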

Assess Synthetic Data (SDA):

Use the Rockfish SDK to generate quality metrics that compare the generated synthetic data against the source data.

# Compare the top-10 value counts of each column in the source and synthetic
# datasets, and plot them side by side.
for col in ["Column1", "Column2"]:
    source_agg = rf.metrics.count_all(dataset, col, nlargest=10)
    syn_agg = rf.metrics.count_all(syn, col, nlargest=10)
    rl.vis.plot_bar([source_agg, syn_agg], col, f"{col}_count")

Step 3: Save the Generated Synthetic Data

You can now store the generated synthetic data in a Databricks Unity Catalog managed table.
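
A minimal sketch of writing the synthetic data back to Unity Catalog, assuming the generated dataset syn can be converted with to_pandas() (verify the method for your SDK version); the target table name is a placeholder:

# Convert the synthetic dataset to pandas, then to a Spark DataFrame, and save
# it as a Unity Catalog managed table (the target table name is a placeholder).
syn_pdf = syn.to_pandas()
spark.createDataFrame(syn_pdf).write.mode("overwrite").saveAsTable(
    "`rockfish-customers-demo`.`databricks-demo`.`netflow_ton_synthetic`"
)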


NOTE: Use the sample Rockfish Notebook to get started.