Generating synthetic training data is best done in a collaborative environment. The organizational benefits include protecting privacy, enabling exploratory development, focusing research attention on problems of national priority, ensuring reproducibility, mitigating multiple-hypothesis-testing issues, avoiding the propagation of bias, and suppressing sensitive signals. (See a detailed discussion of these benefits online at https://arxiv.org/pdf/1905.01351.pdf.)
These benefits touch every AI development team. The framework below describes a six-step process for creating a charter to guide the use of synthetic data in the organization.
Step One – Frame the Opportunity
The first step should be to organize a leadership team and identify an executive sponsor. This can be accomplished in an initial meeting with key stakeholders. Use the following questions to frame the opportunity. What is the business problem we are addressing? Do we have any prior knowledge of current-state solutions? Does a market scan of emerging research indicate that this is a project we can execute with the resources and time at our disposal? If the project fails fast, can we still win by gaining new skills, knowledge, and capabilities? Can we draft a statement of anticipated outcomes and lessons learned? These questions should provide go/no-go decision criteria for the next step.
Step Two – Cluster the Stakeholders Using Shared Interests
AI and synthetic data generation represent a paradigm shift for any organization. The leadership team will likely find other stakeholders willing to take a seat at the table and help scope the rapid prototyping project. This exercise will help the team get buy-in from across the interested groups and lay the foundation for an internally managed Special Interest Group. Agility and speed are the hallmarks of this kind of collaboration.
Step Three – Connect Data Stewards with AI Evangelists
Employees interested in AI rapid prototypes have almost certainly already experimented with Raspberry Pi computers, IoT devices, home security cameras, and technologies such as Google Home and Amazon Alexa. Bringing these early adopters together with data stewards will start the conversation about thinking in a whole new way about synthetic data. Customer data must be protected, but there is enough flexibility in developing new AI models against synthetic data engines that privacy can be preserved throughout.
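To make this concrete, here is a minimal sketch of what a synthetic data engine does at its simplest: it emits records that share the shape of a protected customer table while containing no real customer values. The schema, field names, and distributions below are hypothetical, chosen only for illustration; production engines are far more sophisticated.

```python
import random
import string
from datetime import date, timedelta

REGIONS = ["north", "south", "east", "west"]  # hypothetical category values

def synthetic_customer(rng: random.Random) -> dict:
    """Emit one synthetic record shaped like a production customer row.

    Every field is sampled from a hand-specified, plausible distribution,
    so no real customer value is ever read or exposed.
    """
    return {
        "customer_id": "".join(rng.choices(string.ascii_uppercase + string.digits, k=8)),
        "region": rng.choice(REGIONS),
        "signup_date": date(2020, 1, 1) + timedelta(days=rng.randrange(365 * 4)),
        "monthly_spend": round(rng.lognormvariate(3.5, 0.6), 2),
        "is_active": rng.random() < 0.8,
    }

rng = random.Random(42)  # fixed seed keeps the prototype reproducible
records = [synthetic_customer(rng) for _ in range(10_000)]
print(records[0])
```

A model prototyped against records like these can later be pointed at the real, protected table once governance questions are settled.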
Step Four – Communicate Roles, Rules, and Responsibilities
AI rapid prototypes can look high risk if viewed in the wrong frame of reference. Synthetic data technology offers the opportunity to discover, test, fail, fail again, iterate, and then win with new information about what is real and what is myth. Define the team roles, provide a limited set of rules of the road, and make sure success and failure are rewarded equally.
Step Five – Define Success
A little success can go a long way. Training data operates on a different timescale and volume than what teams may be used to in an enterprise setting. With today’s technology, terabytes of synthetic data can be regenerated in hours, not days or weeks. This makes it possible to collect rich telemetry, in logs and system tools, about data of all shapes and sizes. Small samples can be scaled quickly to reveal real opportunities for exploiting AI with both structured and unstructured data.
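As a rough illustration of that scale-up, the sketch below fits simple per-column statistics to a small sample and regenerates a far larger synthetic set in seconds. The approach (independent normal columns) is deliberately naive, and the column names and numbers are hypothetical; real synthetic data engines model joint and categorical structure as well.

```python
import numpy as np

rng = np.random.default_rng(7)

# Small "real" sample: 200 rows of (latency_ms, payload_kb).
# In practice this would be a vetted sample of production data.
small_sample = rng.normal(loc=[120.0, 4.0], scale=[15.0, 1.2], size=(200, 2))

# Fit simple per-column statistics to the small sample.
mu = small_sample.mean(axis=0)
sigma = small_sample.std(axis=0)

# Regenerate a much larger synthetic set on demand.
synthetic = rng.normal(loc=mu, scale=sigma, size=(5_000_000, 2))

print("sample means   :", mu)
print("synthetic means:", synthetic.mean(axis=0))
```

Because the generator, not the data, is the asset, the team can rerun it at whatever volume a given experiment requires.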
Step Six – Compare Outcomes with Future Experiments
The first project using synthetic data in a rapid prototyping environment will likely bring new information to light. From these insights will emerge new thinking about how to exploit synthetic data. Be prepared for this outcome. Synthetic data generation has the potential to accelerate AI innovations in the organization in ways not yet imagined.