By Joseph Moss, International Banker
We are now comfortably in the era of data, with more and more of it being generated with every passing year. And looking forward, it seems this rapidly expanding universe will only balloon further—and at an astonishing rate. But does the proliferation of this data necessarily mean that there is more of it for us to use—and use in a meaningful way?
“In 2020, 64.2ZB [zettabytes, which are each approximately one billion terabytes] of data was created or replicated, defying the systemic downward pressure asserted by the COVID-19 pandemic on many industries, and its impact will be felt for several years,” Dave Reinsel, senior vice president of the International Data Corporation (IDC), stated in March. “The amount of digital data created over the next five years will be greater than twice the amount of data created since the advent of digital storage.”
But truth be told, a substantial proportion of potentially insightful data remains off-limits to most of us due to various circumstances, including privacy regulations preventing companies from getting their hands on consumer data, security concerns preventing the free flow of data within an organisation or, more recently, pandemic restrictions closing off access to data repositories. And that’s where synthetic data can be a lifesaver for those relying on data analytics to deliver an edge within an intensely competitive marketplace.
Taking the insurance industry as an example, a number of roadblocks exist when attempting to analyse consumer data to build more robust insurance models and identify new revenue streams. Tobi Hann of Austrian synthetic-data firm MOSTLY AI recently identified some of the most problematic consumer dataset issues faced by the industry:
- Difficulty in obtaining data: Consumers are not always willing to disclose personal information due to privacy concerns, while publicly available datasets are often expensive to acquire and not always insightful;
- Data must be anonymised: This is done to prevent litigation, but ensuring that information cannot be tied back to a specific individual is time-consuming, difficult and costly. The data-anonymisation process can also strip data of much of its utility and value;
- Data must be handled legally and be protected: Sharing consumer information across departments and borders, or with third-party databases, is often legally restricted, and firms that breach those restrictions face costly fines;
- There is often insufficient data on real people: Machine learning requires large volumes of clean data, but existing data is often prone to bias, such as being skewed towards male policyholders. This lowers the data’s value in the real world.
Given such challenges with real-world data, synthetic data has been garnering much interest as a suitable and often superior substitute. Generated by artificial-intelligence (AI) and machine-learning (ML) algorithms, a synthetic dataset aims to capture the complexities of a real-world dataset in terms of how its values are distributed, the relationships they reveal and the noise they contain, yet it comprises no actual real data. And with no real data used, synthetic data cannot be linked back to a specific person. It is considered non-identifiable data, which allows it to bypass many of the legal requirements that constrain real consumer datasets, so it can be shared and analysed more liberally.
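To make the generation step concrete, the following is a minimal sketch of the idea in Python, assuming purely numeric columns; the policyholder fields and figures are invented for illustration, and real generators rely on far more capable deep generative models than this simple multivariate Gaussian.

```python
import numpy as np

def fit_and_sample(real_data: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Fit a multivariate Gaussian to numeric columns and draw new rows.

    A toy stand-in for production synthetic-data generators, which
    typically use deep generative models to capture non-linear structure.
    """
    rng = np.random.default_rng(seed)
    mean = real_data.mean(axis=0)
    cov = np.cov(real_data, rowvar=False)  # preserves pairwise correlations
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Hypothetical example: three numeric policyholder columns.
rng = np.random.default_rng(42)
real = np.column_stack([
    rng.normal(45, 12, 1000),    # age
    rng.normal(900, 250, 1000),  # annual premium
    rng.poisson(0.3, 1000),      # claim count
])
synthetic = fit_and_sample(real, n_samples=10_000)  # new rows, no real person
```

Each generated row preserves the means and correlations of the source table while corresponding to no actual policyholder.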
Synthetic data also consists of entirely new data points that nonetheless preserve the original data’s statistical features, offering an alternative way to produce high-quality data that can be analysed, shared, expanded, worked on and used for generating specific scenarios. And given that machine-learning algorithms invariably require substantial volumes of data on which to train, learn and improve, synthetic data provides a convenient and cost-effective way to create optimal training conditions, especially when the available real-world data is limited, incomplete or cannot be obtained easily. “Synthetic data can be used to train data-hungry algorithms in small-data environments, or for data sets with severe imbalances. It can also be used to train and validate machine learning models under adversarial scenarios,” noted the Alan Turing Institute, the United Kingdom’s national institute for data science and AI.
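The imbalance problem the institute describes can be sketched in a few lines: the hypothetical helper below manufactures additional minority-class rows (for instance, the under-represented female policyholders mentioned earlier) by interpolating between existing ones, in the spirit of the well-known SMOTE oversampling technique rather than any particular vendor’s product.

```python
import numpy as np

def interpolate_minority(minority: np.ndarray, n_new: int, seed: int = 0) -> np.ndarray:
    """Create new minority-class rows by interpolating between random pairs
    of existing ones, in the spirit of the SMOTE oversampling technique."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(minority), n_new)
    j = rng.integers(0, len(minority), n_new)
    t = rng.random((n_new, 1))  # interpolation weight per new row
    return minority[i] + t * (minority[j] - minority[i])

# Hypothetical: a book of business with only 50 female-policyholder rows.
female_rows = np.random.default_rng(1).normal(size=(50, 6))
balanced_extra = interpolate_minority(female_rows, n_new=900)
```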
As such, the decision-making process needn’t be hampered by a lack of available data, nor by data that provides an insufficient basis for training algorithms. As long as the synthetic data captures the complexity of the underlying data it aims to replicate, it can be utilised across several use cases. Indeed, synthetic text, synthetic media (such as video, images or sound) and synthetic tabular data are just some of the most common forms of synthetic data.
Synthetic data can also be employed in a privacy-safe environment, meaning that users and developers can access it without any concerns over disclosing sensitive information. A 2018 Deloitte survey, for instance, found that “data issues” such as privacy, access and integration were considered the biggest challenges in implementing AI initiatives. By generating non-identifiable datasets, however, synthetic-data generation can be a vital privacy-enhancing technology that does not carry the regulatory or legal burdens associated with disclosing personal data.
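One common building block behind such privacy guarantees, though not necessarily the one any given vendor uses, is differential privacy: calibrated noise is added to anything derived from the raw records before it is released, so no single individual can be inferred from the output. A minimal sketch for a single statistic, with invented claim figures:

```python
import numpy as np

def dp_mean(values: np.ndarray, lower: float, upper: float,
            epsilon: float, seed: int = 0) -> float:
    """Release a differentially private mean via the Laplace mechanism.

    Clipping bounds each record's influence, so the sensitivity of the
    mean is (upper - lower) / n; Laplace noise scaled to sensitivity /
    epsilon then makes the released figure non-identifying.
    """
    rng = np.random.default_rng(seed)
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(values)
    return float(clipped.mean() + rng.laplace(0.0, sensitivity / epsilon))

# Hypothetical: publish an average claim amount without exposing anyone.
claims = np.random.default_rng(7).gamma(2.0, 500.0, size=10_000)
print(dp_mean(claims, lower=0.0, upper=5_000.0, epsilon=1.0))
```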
Among the clearest industry use cases for synthetic data is financial services. Such data is already being leveraged to improve operations and identify fraudulent activity, with synthetic datasets generated from debit- and credit-card payment-transaction data. The UK financial regulator, the Financial Conduct Authority (FCA), partnered with data specialist Synthesized to create synthetic payment data from five million records of real payment data to build a better fraud model without revealing individuals’ data. “The objective of the collaboration was to solve the challenge of building a better synthesised transactional bank fraud model,” Synthesized noted in November 2020. “Synthesized’s cutting-edge AI data synthetisation technology was uniquely able to transform the original fraud data set. The result is a collaborative safe-to-share synthetic data set for use by participants in the Digital Sandbox Pilot, jointly launched by the FCA and City of London Corporation.”
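A standard sanity check for such a dataset is “train on synthetic, test on real” (TSTR): fit the fraud model on synthetic transactions and score it against held-out real ones. The sketch below assumes scikit-learn and uses random stand-in arrays; it is illustrative only and not the FCA’s or Synthesized’s actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def tstr_auc(X_syn, y_syn, X_real, y_real) -> float:
    """Train-on-Synthetic, Test-on-Real: fit a simple fraud classifier on
    synthetic transactions and score it on held-out real ones."""
    model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
    return roc_auc_score(y_real, model.predict_proba(X_real)[:, 1])

# Hypothetical smoke test with random stand-in data (a real pipeline would
# load the synthetic and real transaction tables here).
rng = np.random.default_rng(0)
X_syn, y_syn = rng.normal(size=(5000, 8)), rng.integers(0, 2, 5000)
X_real, y_real = rng.normal(size=(1000, 8)), rng.integers(0, 2, 1000)
print(f"TSTR AUC: {tstr_auc(X_syn, y_syn, X_real, y_real):.3f}")
```

An AUC close to that of a model trained on the real data suggests the synthetic set has retained enough signal to be useful.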
Healthcare is another industry that will benefit greatly from the use of synthetic data; for instance, it can be used to test various medical scenarios for which there is currently insufficient data. It can also substitute for actual medical-records data, allowing health-data professionals to grant access to record-level data without compromising patient confidentiality. “The key advantage that synthetic data offers for healthcare is a large reduction in privacy risks that have bugged numerous projects [and] to open up healthcare data for the research and development of new technologies,” explained Allan Tucker, a professor at Brunel University London and author of a study published in Nature in November showing the validity of using synthetic data as a substitute for real healthcare data.
MOSTLY AI’s Hann has also highlighted the essential role of synthetic data across the insurance industry in hastening software testing and development, giving teams “access to rich repositories they can use to accelerate and optimize the coding, testing and refinement of new apps” and allowing data to be synthesised, shared and tested with ease. What’s more, a synthetic dataset can be deployed quickly for a machine-learning application, even if it is derived from a small sample. “For example, for the pricing of a life insurance product, securing access to clients’ data requires an NDA [non-disclosure agreement] and often takes up to six months. Then the data cannot be shared or retained internally. A comparable synthetic dataset takes less than 24 hours to create and is ready to use,” Hann explained.
According to Gartner, 60 percent of the data used to develop AI and analytics projects will be synthetically generated by 2024. “Synthetic data has a bright future if you think about it,” the firm observed. “The new set of normals (there may not be a single new normal for some time) organizations will experience going forward, including growth, risk, opportunity and stress, all [at] the same time, triggered a need to re-think how executives and everyone else takes decisions.” And given the limitations and legal barriers that prevent real-world data from being utilised to its fullest extent, the opportunities for synthetic data to step in and optimise decision-making processes will be crucial in the years to come.