Basic Generation
Synthetic generation refers to the process of creating artificial data that mimics real-world datasets while preserving privacy and meeting specific requirements. Synthetic data is useful for applications such as training machine learning models, testing, privacy-preserving analytics, and more.
Synthetic conditional generation involves generating data based on specified conditions or constraints, enabling the creation of realistic datasets that follow particular patterns, distributions, or dependencies. This approach allows more control over the synthetic data, making it especially useful for testing, scenario simulations, and analytics while ensuring that privacy requirements are met.
Generate Module
Rockfish's Generate Module gives users the flexibility to use any trained model to generate high-quality synthetic data for specific use cases.
With the Generate Module, you can:
- Use the trained model to generate synthetic data based on your specified configurations.
- Customize the generation process to meet specific requirements, such as data volume or target features.
Generation Process
1. Fetch the trained model

After training is complete, the model can be fetched using:

```python
model = await workflow.models().last()
```

2. Create a Generate action

The generate action is created specifically for the trained Rockfish model, together with its generation configuration. For details, please check out Model Generation Configuration.
3. Create a SessionTarget action

The session target action assigns a target generation value to the synthetic output.

```python
target = ra.SessionTarget(target=<target generation value>)
```
Default Generation

For default generation, you do not need to specify a target value:

```python
target = ra.SessionTarget()
```

- For time series models, it generates the same number of sessions as in the training data.
- For tabular models, it generates the same number of records as in the training data.
4. Create a Save action

The save action is used to store the generated dataset.

```python
# choose a name for the synthetic dataset
save = ra.DatasetSave(name="<synthetic data name>")
```
5. Build the generation workflow

You can build and start the generation workflow as follows:

```python
builder = rf.WorkflowBuilder()
builder.add_model(model)
builder.add_action(generate, parents=[model, target])
builder.add_action(target, parents=[generate])
builder.add_action(save, parents=[generate])
workflow = await builder.start(conn)
print(f"Workflow: {workflow.id()}")
```
6. Generate synthetic data

```python
syn = None
async for sds in workflow.datasets():
    syn = await sds.to_local(conn)
syn.to_pandas()
```
Use Cases for Generation

Now that you know how to generate data using the Generate Module, let's explore the different ways to generate data specific to your use case. Let's take a look at a few example use cases.
Use Case 1: Regulatory Compliance and Data Masking
Scenario | Problem | Solution |
---|---|---|
A healthcare provider needs to share patient data with a third-party analytics firm for research purposes. | Sensitive patient data cannot be shared. | Since the actual patient data is sensitive, you can generate synthetic health records to share safely. |
Solution: Generate a specific amount of data

To use this solution, follow the steps described in the General Generation Process.
Use Case 2: Stress Testing Systems or Applications
Scenario | Problem | Solution |
---|---|---|
A telecom company is launching a new billing system and wants to ensure it can handle millions of transactions per minute. | They do not have enough data to conduct the test efficiently. | By generating a massive amount of synthetic transaction data, they can stress test the system under load. |
Solution: Continuous Generation

To use this solution, assign a large target generation value to the session target action for large-scale generation, as described in Step 3 of the General Generation Process.
Use Case 3: Tracking Customer Transaction Behavior Based on Session Metadata
Scenario | Problem | Solution |
---|---|---|
A platform aims to analyze transaction behavior for various customer groups to promote relevant categories effectively. | The platform possesses historical session data for customer transactions, but specific combinations of metadata (e.g., age and gender) are scarce or nonexistent, making it challenging or impossible to determine these groups' transaction behavior. | By defining "given_metadata" in the generation config, the model generates customer data with specific demographic characteristics based on patterns learned from the training data, thereby providing the platform with sufficient data for analysis. |
Solution: Generation with Conditions on Session Metadata

To use this solution, update the generate configuration, as described in Step 2 of the General Generation Process.

Note: This feature is only supported with the RF-Time-GAN model.
Use Case 4: Simulating Rare Events
Scenario | Problem | Solution |
---|---|---|
An e-commerce platform wants to improve its fraud detection system. | Fraudulent transactions make up only a tiny fraction of their overall transactions. | Generate a large number of synthetic fraudulent transactions to address the imbalance between normal and fraudulent transactions, potentially improving the fraud detection system's performance. |
Solution: Generate a specific amount of data with conditions

To use this solution, update Step 4 of the General Generation Process.

Note: Applicable to all 4 models. If you have multiple conditions with different desired amounts, follow the steps below for each condition and then concatenate all the results together.

For example, users may want to generate fraud events with 1000 records or sessions.
- Set Conditions: You can define conditions either with the PostAmplify action:

```python
condition_filter = ra.PostAmplify({
    "query_ast": {
        "eq": {"fraud": 1}
    },
})
```

Alternatively, you can use the SQL action:

```python
condition_filter = ra.SQL(query="SELECT * FROM my_table WHERE fraud=1")
```
- Set Target Value: this controls the number of generated conditional records (for tabular data) or sessions (for time-series data). To match the example of 1000 fraud events:

```python
target = ra.SessionTarget(target=1000)
```
- Build the Workflow:

```python
builder = rf.WorkflowBuilder()
builder.add_model(model)
builder.add_action(generate, parents=[model, target])
builder.add_action(condition_filter, parents=[generate])
builder.add_action(target, parents=[condition_filter])
builder.add_action(save, parents=[condition_filter])
workflow = await builder.start(conn)
print(f"Workflow: {workflow.id()}")
```
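Once each conditional workflow has produced its dataset, the per-condition results can be concatenated into a single synthetic dataset with pandas. This is a minimal sketch: the two DataFrames below are toy stand-ins for the `syn.to_pandas()` output of each run, and the column names are illustrative.

```python
import pandas as pd

# Hypothetical outputs of two conditional generation runs,
# standing in for syn.to_pandas() from each workflow.
fraud_df = pd.DataFrame({"fraud": [1, 1, 1], "amount": [120.0, 75.5, 310.0]})
normal_df = pd.DataFrame({"fraud": [0, 0, 0, 0, 0], "amount": [20.0, 35.0, 12.5, 80.0, 44.0]})

# Concatenate all conditional results into one synthetic dataset.
combined = pd.concat([fraud_df, normal_df], ignore_index=True)
print(combined["fraud"].value_counts().to_dict())  # {0: 5, 1: 3}
```

`ignore_index=True` gives the combined dataset a fresh row index, so rows from different runs do not carry duplicate labels.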
Use Case 5: Ensuring Equal Representation to Meet Specific Synthetic Data Requirements
Scenario | Problem | Solution |
---|---|---|
A snack manufacturer wants equal representation of flavors in their snack packages, but when generating synthetic stock data for this process, the distribution of flavors can vary, with some flavors being over- or under-represented | When synthesizing stock data for the manufacturer, the model learns most of the distribution but cannot guarantee an exact equal distribution. Some flavors may be generated more frequently than others, leading to an imbalanced dataset. Hence, the stock data may not meet the marketing requirement for equal representation of each flavor. | With Equal Data distribution, the synthetic data generation model can be modified to enforce equal representation of values in the "flavors" field. This constraint ensures that all flavors are equally distributed, satisfying the manufacturer's marketing requirement for stock data and promotional standards. This guarantees the generated synthetic dataset mirrors the ideal flavor distribution for the snack packages. |
Solution: Equal Data Distribution

To use this solution, update Step 5 of the General Generation Process.
```python
replacement = ra.Replace(
    field="flavors",
    condition=ra.EqualizeCondition(equalization=True)
)

builder = rf.WorkflowBuilder()
builder.add_model(model)
builder.add_action(generate, parents=[model])
builder.add_action(replacement, parents=[generate])
builder.add_action(save, parents=[replacement])
workflow = await builder.start(conn)
print(f"Workflow: {workflow.id()}")
```
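After generation, the effect of the equalization constraint can be checked on the resulting DataFrame. A minimal sketch, using a toy DataFrame in place of the generated stock data (flavor names are made up):

```python
import pandas as pd

# Toy stand-in for the generated snack stock data.
df = pd.DataFrame({"flavors": ["salt", "bbq", "sour", "salt", "bbq", "sour"]})

counts = df["flavors"].value_counts()
# With equalization enforced, every flavor should appear equally often.
assert counts.min() == counts.max()
print(sorted(counts.to_dict().items()))
```

The same check can be run on real output by replacing `df` with `syn.to_pandas()`.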
Use Case 6: Validating Product Combinations in Retail Inventory Data Generation
Scenario | Problem | Solution |
---|---|---|
In retail, certain products can only be sold together under specific bundles or promotions. For example, a "Back-to-School" promotion bundles notebooks with pens, but never with unrelated items like kitchen utensils. The inventory system must ensure that products adhere to these combination rules when generating synthetic stock or sales data. | When generating synthetic retail data, the model might create product combinations that are not allowed (e.g., notebooks paired with kitchen utensils) because it doesn't understand the bundling constraints. This leads to invalid datasets that don’t align with business rules. | By applying Inclusion & Exclusion Constraints, you can ensure that the generated data respects the valid product combinations. Inclusion constraints enforce that certain products (e.g., notebooks) are only paired with allowed items (e.g., pens), while exclusion constraints prevent disallowed combinations (e.g., notebooks and kitchen utensils). This helps maintain the integrity of the synthetic data while preserving other important data characteristics. |
Solution: Inclusion & Exclusion Constraint on Generation

To use this solution, update Step 5 of the General Generation Process.

For example, if the front doors are fixed as black, the rear doors must also be black.
```python
replacement = ra.Replace(
    field="color",
    condition=ra.SQLCondition(query="SELECT door='rear' AND color!='black' AS mask FROM my_table"),
    resample=ra.ValuesResample(replace_values=["black"])
)

builder = rf.WorkflowBuilder()
builder.add_model(model)
builder.add_action(generate, parents=[model])
builder.add_action(replacement, parents=[generate])
builder.add_action(save, parents=[replacement])
workflow = await builder.start(conn)
print(f"Workflow: {workflow.id()}")
```
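The repair that the Replace action performs here can be illustrated with plain pandas: build a boolean mask of rows that violate the rule, then overwrite the constrained field for just those rows. The DataFrame and column names below are toy stand-ins for the generated output, not part of the Rockfish API.

```python
import pandas as pd

# Toy generated inventory rows; column names are illustrative.
df = pd.DataFrame({
    "door": ["front", "rear", "rear", "front"],
    "color": ["black", "red", "black", "black"],
})

# Mask mirrors the SQLCondition: rear doors whose color is not black.
mask = (df["door"] == "rear") & (df["color"] != "black")
df.loc[mask, "color"] = "black"  # resample the violating rows to the allowed value

print(df["color"].tolist())  # ['black', 'black', 'black', 'black']
```

Only the violating rows are touched, so all other characteristics of the synthetic data are preserved.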