Technology revolutions have historically arrived one at a time – relational databases in the 1980s, client/server networks in the 1990s, and the Internet in the 2000s, for example. AI is a different kind of revolution: it touches every layer of technology – hardware, software, networks, telecommunications, and mobile devices.
The language of AI is used in many different contexts. The following paragraphs frame the concept of AI as it relates to the use of synthetic data. A common vocabulary goes a long way toward getting executives, investors, customers, employees, and partners on the same page.
What do we mean by AI?
The concept of intelligence originated with the Greek philosophers and refers to the decision-making and problem-solving skills that make us human. Intelligence becomes artificial when we apply machine learning algorithms to automate these traditionally human skills. Artificial intelligence aims to improve search, pattern matching, and inference. These innovations are showing up in areas such as computer vision, natural language processing, and voice recognition, and they can be applied across hundreds of industry verticals.
Why is training data important to AI?
We learn through our perceptions, senses, and experiences in the real world. Machines learn how to solve problems using algorithms and training data. AI applications require massive amounts of training data to improve on past performance. The better the training data, the quicker the machine learns. Training data enables the machine to do more with less, faster, and at greatly reduced costs.
What has been the source of training data in the past?
Training data over the last several decades has originated from production systems in the organization. This is customer data, often referred to as “real” data. Using customer data to train AI systems has unleashed a tsunami of risks – privacy issues, liability for data breaches, unexplainable algorithmic outcomes, public policy missteps, lapses in public safety oversight, novel legal challenges to product liability statutes, weak corporate governance, biased algorithms, and unintended consequences.
What is synthetic data?
Synthetic data is data generated by machines from a model. These models are not constrained by existing, a priori data models. Because it is generated rather than collected, synthetic data provides a platform for simulating what-if scenarios of past, present, and possible future events. These digital experiments are driving the innovations we see in AI today.
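To make the idea concrete, here is a minimal sketch of model-based generation. The schema, field names, and distributions below are hypothetical assumptions chosen for illustration: every value is drawn from a declared model rather than extracted from any production system, which is what makes the output synthetic.

```python
import random

# Hypothetical model: each field is described by a distribution we choose,
# not by any real customer record.
REGIONS = ["north", "south", "east", "west"]

def synth_record(rng):
    """Generate one synthetic customer record from the declared model."""
    return {
        "age": max(18, int(rng.gauss(40, 12))),        # assumed age distribution
        "region": rng.choice(REGIONS),                  # assumed uniform choice
        "monthly_spend": round(rng.lognormvariate(4, 0.5), 2),
    }

rng = random.Random(42)   # fixed seed makes what-if experiments repeatable
dataset = [synth_record(rng) for _ in range(1000)]
```

Changing the model (a wider age distribution, a new region, a spending shock) and regenerating the data is how the what-if scenarios described above are run.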
Why is synthetic data taking center stage now?
Generating synthetic data at scale, with the complexity required, is now possible thanks to the next generation of graphics processing units (GPUs). The leader in the field is NVIDIA. GPU architectures are designed to speed up search, pattern recognition, and inferencing algorithms. Generating synthetic data to train these algorithms is a game-changer for companies seeking to leverage advances in AI.
Why is synthetic data a game-changer?
Real data is expensive. It requires programmers, database engineers, and data analysts to build massive data stores using production data. Real data requires an army of data wranglers to develop extract, transform, and load (ETL) programs. The extracted customer data must then be further post-processed to mask any personally identifying information. Even this strategy carries risk: individuals can often be re-identified by linking masked data with other sources, both structured and unstructured. Synthetic data greatly reduces these costs and risks.
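A sketch of the masking step referred to above, under assumed field names: direct identifiers are replaced with one-way hashes (pseudonymization). The comment in the code notes the very weakness the paragraph describes, which is that the untouched quasi-identifiers can still be linked to outside data.

```python
import hashlib

def mask_pii(record, pii_fields=("name", "email")):
    """Replace direct identifiers with a truncated one-way hash.

    Caution: pseudonymization alone does not prevent re-identification;
    remaining fields (e.g., a ZIP code) can be linked to other data sets.
    """
    masked = dict(record)
    for field in pii_fields:
        if field in masked:
            digest = hashlib.sha256(str(masked[field]).encode("utf-8"))
            masked[field] = digest.hexdigest()[:12]
    return masked

row = {"name": "Jane Doe", "email": "jane@example.com", "zip": "94107"}
safe = mask_pii(row)
```

This is exactly the post-processing burden that synthetic data avoids: when no field ever held a real identity, there is nothing to mask and nothing to re-identify.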
Why is synthetic data lower cost and less risky?
Rather than using source data from production systems, synthetic data originates from a conceptual and logical model of the desired output. These models are processed by a class of algorithms referred to as permutation enumeration algorithms. These algorithms were first surveyed comprehensively by Robert Sedgewick in his 1977 ACM Computing Surveys paper titled Permutation Generation Methods (https://dl.acm.org/citation.cfm?id=356692). Today’s GPU technology makes it possible to execute these algorithms in memory, in parallel, and at scale, approaching hundreds of thousands of nodes. The computer can generate hundreds of billions of permutations in minutes at almost zero cost and at little or no risk.
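As a small illustration of the class of algorithms in question, here is Heap's algorithm, one of the classic minimal-change permutation generation methods covered in Sedgewick's survey. This sketch enumerates permutations sequentially on a CPU; distributing the enumeration across GPU cores, as described above, is a separate engineering exercise.

```python
def heap_permutations(items):
    """Yield every permutation of items using Heap's algorithm.

    Each permutation differs from the previous one by a single swap,
    which is what makes minimal-change methods cheap to enumerate.
    """
    a = list(items)
    n = len(a)
    c = [0] * n          # per-position swap counters
    yield tuple(a)
    i = 0
    while i < n:
        if c[i] < i:
            if i % 2 == 0:
                a[0], a[i] = a[i], a[0]
            else:
                a[c[i]], a[i] = a[i], a[c[i]]
            yield tuple(a)
            c[i] += 1
            i = 0
        else:
            c[i] = 0
            i += 1

perms = list(heap_permutations([1, 2, 3]))   # 3! = 6 permutations
```

Because the work is embarrassingly parallel (disjoint blocks of permutations can be generated independently), it maps naturally onto the massively parallel GPU architectures the paragraph describes.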