Synthetic Data Generator
Generate high-quality synthetic data without making external API calls to third-party services. Our Synthetic Data Generator native application is available today on the Snowflake Marketplace and runs entirely within the safety of your Snowflake account.
Want to see it in action? Watch our new Demo Video!
Our Synthetic Data
Our native app makes it easy to create accurate, high-quality synthetic data in as little as a few minutes! Use your new synthetic data for:
- Machine learning - our high-quality synthetic data retains more of the correlations and trends of the original to support better models!
- Avoiding use of sensitive data - anonymize all or part of your data to maximize safety.
- Decreasing bias in unbalanced datasets by increasing representation for minority cohorts
- Minimizing risk of data proliferation when sharing data externally
Iteration and Trust are Key
With a workflow designed around ease of use and iteration, you'll be producing highly accurate synthetic data in as little as a few minutes, with confidence that it will support your use case.
How do you know you can trust your new synthetic data? The key to getting it right is inspection and comparison to the original data. Do all of your synthetic fields look like their real counterparts? How about the value distributions? Did all of the expected correlations come through? Answer all of those questions with our Data Explorer View (see below).
Organization
Once you start using synthetic data, you may end up creating a LOT of it. The app intuitively organizes each dataset into its own project so you can easily pick up where you left off, adjust your models, retrain and re-generate.
Each project can contain one or more tables within a schema that you want to profile to generate similar synthetic data.
Data and Learning Models
The first step in any synthetic data project is to select your data (previewing it if necessary) and your synthetic data model. We have several common models available and will be adding more!
- GaussianCopula - Purely statistical. Great for starting off or when accuracy isn't as important. Quick and performant.
- CopulaGAN - Combines statistical methods and deep learning to model and generate synthetic data.
- TVAE - Uses a variational auto-encoder with neural-net-based learning. High accuracy and good for some datasets.
- CTGAN - Our highest accuracy neural-net-based deep learning model. Offers flexible parameters, but training may take longer for large and complex datasets.
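To give a feel for what "purely statistical" means, here is a minimal sketch of the general idea behind a Gaussian copula for two numeric columns. It is an illustration under our own assumptions, not the app's implementation: the function names are hypothetical, it handles only two columns, and it resamples observed values instead of fitting smooth distributions.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for CDF and inverse CDF


def to_normal_scores(values):
    """Map each value to a standard-normal score via its rank (empirical CDF)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1).
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def empirical_quantile(sorted_values, p):
    """Return the observed value at probability p (nearest rank)."""
    idx = min(int(p * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def sample_gaussian_copula(col_a, col_b, n_rows, seed=0):
    """Generate (a, b) pairs that mimic the dependence between two columns."""
    rng = random.Random(seed)
    # 1. Learn the correlation in normal-score space.
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(n_rows):
        # 2. Draw a pair of correlated standard normals.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(max(0.0, 1 - rho * rho)) * rng.gauss(0, 1)
        # 3. Map each normal back to the column's own value distribution.
        a = empirical_quantile(sorted_a, STD_NORMAL.cdf(z1))
        b = empirical_quantile(sorted_b, STD_NORMAL.cdf(z2))
        rows.append((a, b))
    return rows
```

Calling `sample_gaussian_copula(ages, incomes, 500)` on two related columns returns 500 new pairs whose correlation tracks the original. There is no training loop at all, which is why this family of models tends to be quick.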
Column Configuration
The app gives you a few simple options for the output of each field or column. Notably, you can adjust the primary key for the table, the output data type, and whether to anonymize the field values.
Column Anonymization: When choosing to anonymize a field, you are telling the model not to use the real field values for training. You might want to do this for your most sensitive columns, such as names and addresses. You then have the option to select alternative artificial values to be used instead. Choose from several options, including address, geography, name or even library ISBN. More types can be made available upon request.
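As a rough illustration of the idea, the sketch below swaps a sensitive column for artificial values before anything else sees it. The `anonymize_column` function and the `FAKE_NAMES` pool are hypothetical names made up for this example; the app's own options and value lists may work differently.

```python
import random

# Hypothetical pool of artificial values, invented for this example.
FAKE_NAMES = ["Alex Morgan", "Sam Rivera", "Jordan Lee", "Casey Patel", "Riley Chen"]


def anonymize_column(rows, column, fake_values, seed=0):
    """Return copies of rows with `column` replaced by artificial values.

    The real values in `column` are dropped; every other field is kept as-is,
    and the input rows are not modified.
    """
    rng = random.Random(seed)
    return [{**row, column: rng.choice(fake_values)} for row in rows]
```

For example, `anonymize_column(customers, "name", FAKE_NAMES)` keeps every other field intact while the real names never leave the original table.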
Learning Parameters
Each learning model has its own set of parameters that can be adjusted for better accuracy or performance. The defaults are usually good enough for your first run-through, but you may want to adjust these if your results aren't quite right. Parameters include things like number of training epochs for our neural-net-based models, or options on how to treat decimal precision.
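One way to picture "defaults first, adjust later" is a small parameter object like the sketch below. The `LearningParams` class, its field names, and its default values are assumptions made for illustration only; they are not the app's actual parameter names or defaults.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LearningParams:
    """Hypothetical parameter set; names and defaults are illustrative only."""
    epochs: int = 300          # training passes for a neural-net-based model
    decimal_places: int = 2    # precision to keep in generated decimal values

    def __post_init__(self):
        if self.epochs < 1:
            raise ValueError("epochs must be at least 1")
        if self.decimal_places < 0:
            raise ValueError("decimal_places must be non-negative")


def apply_precision(values, params):
    """Round generated values to the configured number of decimal places."""
    return [round(v, params.decimal_places) for v in values]


# Start from the defaults, then change only what the results call for.
defaults = LearningParams()
tuned = replace(defaults, epochs=600)
```

Keeping the parameters immutable and deriving `tuned` from `defaults` makes it easy to see exactly what changed between two runs.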
Data Generation
Once all of your parameters are set for your chosen model and original data, it's time to hit the 'Generate Data' button! Your new synthetic data will be written to a new schema which you can keep to yourself or share with your team. We hope you share!
Explore Your New Synthetic Data
While becoming a master of synthetic data generation, you may wish to iterate over the process a few times and change some of the learning parameters and column options. We've included a quick data explorer that allows you to see, filter, and explore your new data without ever leaving the application.
Compare Your Synthetic and Real Data
Even if your data looks right, it's important to validate how close it is to the original data. As a part of the data explorer, we've included a histogram comparison of your synthetic data against the original data. The lines show the distributions: green for real, blue for synthetic, with variance % shown as bars. Some variation is good; it actually acts as an added security measure. If you see too large a variance, you likely need to adjust your parameters.
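If you'd like to run a similar check yourself, the sketch below bins a real and a synthetic column over the same range and reports the gap per bin. It is our own simplified take: the function names are hypothetical, and "variance %" is assumed here to mean the absolute difference between the two shares in percentage points, which may not match how the app calculates it.

```python
def histogram_shares(values, low, high, bins):
    """Return the share of values falling in each of `bins` equal-width bins."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - low) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def compare_distributions(real, synthetic, bins=10):
    """Return (real_share, synthetic_share, difference_pct) for each bin."""
    low = min(min(real), min(synthetic))
    high = max(max(real), max(synthetic))
    if high == low:
        high = low + 1  # all values identical: avoid a zero-width bin
    real_shares = histogram_shares(real, low, high, bins)
    synth_shares = histogram_shares(synthetic, low, high, bins)
    return [
        (r, s, abs(r - s) * 100)
        for r, s in zip(real_shares, synth_shares)
    ]
```

Using the same bin edges for both columns is the important design choice: it makes the shares directly comparable bin by bin, so a large gap points at a specific range of values rather than at the column as a whole.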