With its potential to touch almost every corner of society, we all stand to gain from the widespread adoption of AI technologies. Yet despite these game-changing benefits, the wider adoption and deployment of AI remains constrained by data limitations.
Fuelled by huge advances in algorithmic innovation, today’s AI models are incredibly data hungry. Organisations seeking to deploy AI effectively need access to large volumes of relevant, clean, well-organised data that can be trusted.
The large tech firms – like Google, Apple, and Amazon – all have an almost limitless supply of diverse data streams, acquired through the products and services they sell. This creates the perfect ecosystem for data scientists to train their algorithms.
For small and medium-sized organisations, though – including public sector departments – acquiring data at scale is a much greater challenge. Their data is often proprietary; its use is restricted by contractual agreements; common data standards for sharing are lacking; and preparing the data manually is time-consuming, and therefore expensive.
The end result is that data becomes a barrier to innovation and wider AI adoption.
So could synthetic data be the answer?
Aside from the big tech firms, with their endless supply of data, the reality for most is that the cost of acquiring quality data is prohibitively high. This acts as a barrier that prevents many from even considering AI deployment. To tackle this challenge, organisations are increasingly looking to synthetic data to address the shortfall.
But what is synthetic data?
In its purest form, synthetic data is generated programmatically to mimic real-world phenomena. It is already making an impact in clinical and scientific trials, where it helps avoid the privacy issues associated with healthcare data. Likewise, in software development it supports agile and DevOps practices by speeding up testing while improving quality assurance cycles.
While synthetic data generation has been around since the 1990s, renewed interest is now emerging. This is being driven by the massive advances in computing power, coupled with lower storage costs and the advent of new algorithms such as Generative Adversarial Networks (GANs).
The data generated can also be anonymised and created to user-specified parameters, so that its properties match those of real-world scenarios as closely as possible. The main advantages of synthetic data are therefore scalability and flexibility.
In essence, this allows AI developers to generate as much data as they need to train algorithms and improve model performance and accuracy.
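As a minimal sketch of this idea (using a hypothetical sample of height measurements and the NumPy library, not any specific product mentioned here), a generator can fit distribution parameters to a small real sample and then draw as many synthetic samples as the model needs:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical "real" measurements we want to mimic
# (in practice this would be an organisation's actual dataset)
real_heights = rng.normal(loc=170.0, scale=8.0, size=500)

# Fit simple distribution parameters to the real data...
mu, sigma = real_heights.mean(), real_heights.std()

# ...then generate as many anonymised synthetic samples as needed
synthetic_heights = rng.normal(loc=mu, scale=sigma, size=50_000)

print(f"real mean: {real_heights.mean():.1f}, "
      f"synthetic mean: {synthetic_heights.mean():.1f}")
```

Real generators (including GAN-based ones) model far richer structure than a single distribution, but the principle is the same: once the generator captures the properties of the source data, the volume of training data is no longer limited by what was collected.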
Using synthetic data in the real world
Synthetically generated data can help organisations and researchers build the reliable data repositories needed to train, or even pre-train, AI models. Just as a scientist might use synthetic material to run experiments at low risk, organisations can now leverage synthetic data to minimise time and cost, as well as risk.
A real-world example is Google’s Waymo self-driving car project, which completes over three million miles of simulated driving every day. Synthetic data enables Waymo’s engineers to test improvements in a safe, simulated environment before trying them on real roads.
In addition to autonomous driving, the potential applications of synthetic data generation are many and varied. It is particularly valuable for modelling scenarios that occur too rarely to capture in real data: extreme weather events, equipment malfunctions, vehicle accidents or rare disease symptoms.
In modelling such rare situations, synthetic data may be the only way to ensure that your AI system is trained for every possible eventuality.
Synthetic data is not always the perfect solution though
Despite its obvious advantages and benefits, we need to remember that synthetic data only replicates specific properties of a real dataset. A generative model looks for trends to replicate, so rare or random behaviours present in the original data may be missed.
The right to privacy must also be respected: individuals should be able to opt out and control how their data is used, even when that data only seeds a synthetic dataset. Furthermore, relying on synthetic data during development can create misleading expectations of how an AI model will perform on real data in production.
Although significant progress is being made, one persistent challenge is guaranteeing the accuracy of synthetic data. We must ensure that the statistical properties of the synthetic data accurately match those of the original dataset. This very much remains an active research topic.
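One simple (and admittedly incomplete) way to assess this match is to compare summary statistics of the two datasets directly. The sketch below assumes NumPy and uses randomly generated stand-ins for the “real” and “synthetic” samples; a production check would compare full distributions, not just a few moments:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative stand-ins: in practice, `real` is the original dataset
# and `synthetic` is the output of a generator trained on it
real = rng.exponential(scale=2.0, size=10_000)
synthetic = rng.exponential(scale=2.0, size=10_000)

def moment_report(a, b):
    """Compare basic statistical properties of two samples."""
    return {
        "mean_gap": abs(a.mean() - b.mean()),
        "std_gap": abs(a.std() - b.std()),
        "p95_gap": abs(np.percentile(a, 95) - np.percentile(b, 95)),
    }

report = moment_report(real, synthetic)
print(report)  # large gaps would flag a poor synthetic replica
```

Matching a handful of moments is necessary but not sufficient: two datasets can agree on mean and variance while differing badly in correlations or tail behaviour, which is exactly why this remains a research topic.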
Find out more…
At Fujitsu, we are acutely aware that the use of any form of synthetic data for AI transformation activities will depend on the sensitive nature of project requirements.
We are closely engaged with industry, academia and regulators as they continue to investigate and develop good practice measures and guidelines to ensure the correct use of synthetic data in AI solutions across a wide range of industry applications.
To explore the area of synthetic data in more detail, read our latest White Paper titled ‘Is Synthetic Data the Enabler for Wider AI Adoption?’
Darminder Ghataoura – July 7, 2020