What is synthetic data? I get asked this often in my line of work (data science). To answer it, we first need to unpack the assumptions we make about data.
We humans, like all other living beings, are analog creatures: we take in a continuous stream of sensory experience in the form of sight, sound, taste, touch, and smell. There are no doubt other senses, but we will stick to the five basic ones.
For more than 40,000 years we have created tools to capture our analog (i.e., continuous) experiences and preserve them for analysis. The cave paintings of France are an example. There are thousands of these paintings, discovered only in the last century.
Fast forward to the Age of the Digital Computer. The analog world can now be transformed into a digital form of data (i.e., a discrete sequence of zeros and ones). Many encoding schemes (both public and proprietary) make this transformation of analog signals into digital signals possible. Whatever the encoding scheme, the digital computer ultimately stores and processes the data in a low-level representation of bits (most commonly referred to as zeros and ones).
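To make the analog-to-digital step concrete, here is a minimal sketch of one possible encoding: sample a continuous sine wave at fixed intervals and round each sample to one of 256 levels. The function names, the sample rate, and the 8-bit depth are my own illustrative assumptions, not a description of any particular real-world scheme.

```python
import math

def sample_and_quantize(freq_hz=1.0, sample_rate_hz=8, duration_s=1.0, bits=8):
    """Sample a sine wave (the 'analog' signal) and quantize each sample
    to an unsigned integer code. All parameters are illustrative."""
    levels = 2 ** bits
    n_samples = int(sample_rate_hz * duration_s)
    codes = []
    for n in range(n_samples):
        t = n / sample_rate_hz                           # time of this sample
        amplitude = math.sin(2 * math.pi * freq_hz * t)  # continuous value in [-1, 1]
        # Map [-1, 1] onto the integer range [0, levels - 1].
        codes.append(round((amplitude + 1) / 2 * (levels - 1)))
    return codes

def to_bits(codes, bits=8):
    """Write each integer code as a fixed-width string of zeros and ones."""
    return "".join(format(code, f"0{bits}b") for code in codes)

codes = sample_and_quantize()
bitstream = to_bits(codes)
print(codes)
print(bitstream)
```

The continuous wave has become eight integers and then 64 zeros and ones. Everything between the sample points, and everything finer than 1/255 of the amplitude range, has been thrown away.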
Data stored as zeros and ones is not human-readable. We create high-level languages to process the data and transform the zeros and ones back into recognizable images, text, and sound. The excitement in the data science community for a decade now has been the increasing speed with which we can transform data between its original analog source and a digital format.
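As a small illustration of that round trip, the sketch below turns a piece of text into zeros and ones and back again. It assumes UTF-8 as the encoding scheme, and the helper names are hypothetical.

```python
def text_to_bits(text):
    """Encode text as UTF-8 and write each byte as eight zeros and ones."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))

def bits_to_text(bits):
    """Reverse the process: parse each 8-bit group and decode as UTF-8."""
    data = bytes(int(chunk, 2) for chunk in bits.split())
    return data.decode("utf-8")

bits = text_to_bits("Hi")
print(bits)                # 01001000 01101001
print(bits_to_text(bits))  # Hi
```

The bit string means nothing to a human reader, yet a few lines of a high-level language are enough to recover the original text.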
Our human sensory experiences can be replicated using analog (continuous) or digital (discrete) technologies.
The definition I prefer to use when talking about synthetic data is “to imitate”.
We have used analog technologies to imitate what happens in the real world – think old Hollywood movies and 16mm movie reels.
With the introduction of faster, cheaper, and higher-quality encoding and decoding technologies, it is now more cost-effective to imitate real life using synthetic data (discrete data generated by digital computers).
Technically speaking, all digital data is synthetic.
So why all the fuss about synthetic data?
Let’s continue to unpack the terminology.
We use technology to capture, store, and process our experiences of living in a modern society. Think of all the apps (applications) that you have on your mobile phone. The underlying data on a mobile phone is represented as images, sound, text, and video (images in sequences), but stored as zeros and ones.
Somewhere between you and the zeros and ones is a data model. What is a data model? Think of a model as a blueprint. The model/blueprint is used as a guide by the software developer to map analog signals to digital signals. The goal is to create a truthful representation of the original image, sound, or text.
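Here is a minimal sketch of what such a blueprint might look like for a tiny grayscale image. The `GrayImage` class and its byte layout (two 16-bit integers for width and height, followed by one byte per pixel) are invented for illustration and do not correspond to any real image format.

```python
from dataclasses import dataclass
import struct

@dataclass
class GrayImage:
    """Hypothetical data model: a blueprint for a grayscale image."""
    width: int
    height: int
    pixels: bytes  # row-major, one byte per pixel (0 = black, 255 = white)

    def to_bytes(self) -> bytes:
        """Map the model to raw bytes: a 4-byte header, then the pixels."""
        if len(self.pixels) != self.width * self.height:
            raise ValueError("pixel count does not match width * height")
        return struct.pack(">HH", self.width, self.height) + self.pixels

    @classmethod
    def from_bytes(cls, raw: bytes) -> "GrayImage":
        """Use the same blueprint to turn raw bytes back into an image."""
        width, height = struct.unpack(">HH", raw[:4])
        return cls(width, height, raw[4:4 + width * height])

# A 2x2 checkerboard: black, white / white, black.
image = GrayImage(width=2, height=2, pixels=bytes([0, 255, 255, 0]))
raw = image.to_bytes()
restored = GrayImage.from_bytes(raw)
print(raw)
print(restored == image)
```

Without the model, the eight bytes in `raw` are just numbers. The model is what tells the software which bytes are dimensions and which are pixels.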
The complexity of the pipeline of transformations and of the data model, combined with business rules, makes it difficult to fake images, sound, or text.
Advances in machine learning and the use of generative adversarial networks (GANs) now make it possible to generate fake but realistic-looking images, sound, text, and video.
The image below is from the Wikipedia entry describing examples of deep fake technology.
We should be asking a different question:
How easy is it to generate synthetic data intended to create fake images, sound, text, and videos?
We will answer that question in the next post.