The history of synthetic data extends back to John von Neumann and his 1946 work on pseudo-random number generators (for background on von Neumann, see https://en.wikipedia.org/wiki/John_von_Neumann). “Synthetic data” is also the latest buzzword for data that has gone by a variety of other names: test data, toy data, sample data, dummy data, mock data, and simulated data.
To better understand why a distinction is made between “synthetic” data and “real” data, consider the role of the data model in the workflow of any data created by digital computers.
All computer systems that process data have an underlying data model (conceptual, logical, and/or physical). This model determines how data flows through the system. If the system is an application deployed in a production environment, then its data is often referred to as REAL data. The data in such a system captures the behavior of external actors (humans, other systems, sensors, collectors, analog switches, digital switches, etc.) and stores it for later analysis and reporting.
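As a quick illustration, the sketch below (plain Python with hypothetical field names, not Synthetic IO code) shows what a tiny logical data model might look like: each field carries a name, a type, and the domain of values the system will accept.

```python
from dataclasses import dataclass

@dataclass
class Field:
    """One field in a hypothetical logical data model."""
    name: str
    dtype: type
    allowed_values: tuple  # the domain the model constrains this field to

# A made-up "payment" record model with three constrained fields.
payment_model = [
    Field("currency", str, ("USD", "EUR", "GBP")),
    Field("channel",  str, ("web", "mobile", "pos")),
    Field("status",   str, ("approved", "declined", "error")),
]
```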
The key takeaway here is that REAL data is never guaranteed to represent all possible permutations and combinations of values. REAL data is a snapshot of how the system operated in the past, even if that past is only a few seconds old.
SYNTHETIC data is likewise defined by a data model. The data model, in combination with a library of algorithms, drives a data generator that produces data seemingly “out of thin air”. When the data generator is designed correctly, the algorithms can theoretically generate all possible permutations and combinations of values, representing all possible future interactions with external actors. The challenge for any team of developers is first to create a shared mental model (SMM): an agreed-upon set of constraints that, in turn, is implemented in a way that balances the cost of generating data against the cost of covering the improbable cases that would pose unacceptable risks to the organization.
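To make this concrete, here is a minimal sketch, again in plain Python with hypothetical field domains rather than Synthetic IO's actual API, of a generator that exhaustively enumerates every combination of values a small model permits. A production generator would sample, weight, and constrain this space instead of walking all of it; deciding where to stop is precisely the trade-off the shared mental model has to settle.

```python
from itertools import product

# Hypothetical field domains matching the sketch above; a real model
# would also carry types, nullability, and cross-field constraints.
domains = {
    "currency": ("USD", "EUR", "GBP"),
    "channel":  ("web", "mobile", "pos"),
    "status":   ("approved", "declined", "error"),
}

def generate_all(domains):
    """Yield every combination of values the model permits.

    For these three fields that is 3 * 3 * 3 = 27 records, including
    combinations (say, a declined point-of-sale EUR payment) that may
    never have occurred in the REAL data.
    """
    names = list(domains)
    for values in product(*domains.values()):
        yield dict(zip(names, values))

for record in generate_all(domains):
    print(record)
```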
The Synthetic IO technology was designed to make it easy for a development team to design, model, and generate data under these constraints, at scale and at the lowest possible cost.
The development team makes the call: the same Synthetic IO code compiles and runs on a Raspberry Pi, an NVIDIA GPU, an HPC cluster, or the Amazon AWS cloud. The Synthetic IO customer determines the trade-offs that best suit the organization and its requirements.