For two decades, data has been called the “new oil”—a critical asset driving business operations, but one that also carries significant risks, especially when large amounts of personal information are handled without care. As regulators tighten privacy protections by limiting what data can be collected and how it can be used, insurers face growing challenges in accessing and utilizing data responsibly. Synthetic data promises a way to navigate these restrictions while protecting policyholder privacy, but ensuring its accuracy, reliability, and regulatory compliance remains a complex challenge.
In this Quick Read, Markus Senn, PartnerRe’s Global Head of Strategic Data Science for Life & Health, explores how synthetic data can help address data availability and privacy challenges in the insurance industry.
Generative AI first gained widespread attention around 2019 when it demonstrated the ability to produce highly realistic images—sparking concerns about deepfakes. Websites like ThisPersonDoesNotExist.com showcased how advanced algorithms could create photo-realistic portraits of people who never existed, using models trained on real images.
This evolution paved the way for synthetic data, a natural application of AI’s ability to generate realistic and useful content. Flexible algorithms are trained on an original dataset, learning its underlying patterns and relationships. Once trained, the algorithm can generate new, “synthetic” data points that are indistinguishable from the real data.
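The train-then-sample idea can be illustrated with a deliberately simple sketch: fit per-column statistics to a small tabular dataset, then draw entirely new records from what was learned. The `original` records below are invented for illustration, and sampling each column independently is a simplifying assumption – real synthesizers also learn the correlations between columns.

```python
import random
import statistics

# Hypothetical "original" dataset of (age, smoker) records.
# A real training set would be far larger.
original = [(34, "N"), (45, "Y"), (29, "N"), (52, "Y"), (41, "N"), (38, "N")]

# "Training": learn simple per-column summaries of the data.
ages = [age for age, _ in original]
mu, sigma = statistics.mean(ages), statistics.stdev(ages)
p_smoker = sum(smoker == "Y" for _, smoker in original) / len(original)

def sample_record(rng):
    """Draw one synthetic record from the learned column distributions."""
    age = round(rng.gauss(mu, sigma))
    smoker = "Y" if rng.random() < p_smoker else "N"
    return age, smoker

# "Generation": once trained, we can sample as many new records as we like,
# none of which corresponds to a real individual in the original data.
rng = random.Random(0)
synthetic = [sample_record(rng) for _ in range(1000)]
```

The synthetic records reproduce the dataset's overall statistics while containing no row copied from the original.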
This approach offers immense potential across industries, enabling companies to create vast, privacy-safe datasets. Major AI players have already made significant investments in the technology—for instance, Nvidia’s acquisition of Gretel.[1] Gartner estimates that by 2030, synthetic data will completely overshadow real data in AI models,[2] and tech companies and research firms tout many potential use cases across industries.
Synthetic Data Use Cases
| Industry | Use Case |
| --- | --- |
| Financial Services | Synthetic data is used to safely evaluate fraud detection methods and analyze customer behavior without exposing sensitive financial information. |
| Healthcare | It supports analytics and clinical trials by enabling data sharing while maintaining patient confidentiality or when real data is unavailable. |
| Security | Synthetic video data helps train surveillance models at a lower cost and with greater flexibility compared to real-world data collection. |
Figure 1: Synthetic data use cases in different industries
PartnerRe is primarily interested in synthesizing tabular data, and we particularly focus on the data privacy aspect of the technology. As an international reinsurer, trust is the foundation of our business, making privacy a critical priority. At the same time, we need to maintain the ability to analyze data effectively to provide accurate pricing and efficient services for our clients.
What makes synthetic data particularly compelling is its ability to balance these priorities. Even when the original algorithm is trained on personal information, the newly synthesized data is completely disconnected from any real individual – living or deceased. When generated successfully, this eliminates concerns about consumer privacy, allowing valuable insights to be drawn from sensitive data without compromising individual privacy.
For example, consider claims reports. The claims team obviously requires access to individual identities to process a reported claim. However, claims data may also be of interest to other departments – for example, valuation, pricing, product development or even IT for software testing purposes. Unlike the claims team, these departments do not require knowledge about the claimant’s identity. By synthesizing the dataset before sharing it internally, insurers can collaborate across teams while safeguarding policyholder privacy. This approach enables meaningful analysis without unnecessarily disclosing sensitive personal information.
A common misconception is that synthetic data simply anonymizes real data, preserving a direct link between the original and the synthetic version. In reality, the highest quality synthetic data is generated by AI algorithms that learn the underlying patterns and key features of the original dataset, then use randomized values to create entirely new data points. As a result, synthetic data is not tied to any individual.
Why Synthetization is not Anonymization

To illustrate this, imagine a game of rolling dice where we record the outcomes for each player. Anonymization would involve hiding which player rolled which number while still tracking results. Synthetic data goes a step further – it doesn’t memorize the individual rolls. Instead, it learns how the dice behave and then generates new, hypothetical dice. With these virtual dice, we can roll them indefinitely, producing endless results without ever referencing the original players.
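The dice analogy can be sketched in a few lines of code. Rather than storing (or masking) the individual rolls, we estimate how the die behaves from the observed outcomes and then roll a virtual die as often as we like. The `observed` roll log is invented for illustration.

```python
import random
from collections import Counter

# Observed rolls from real players (hypothetical game log).
# Anonymization would hide who rolled what but keep these exact values.
observed = [3, 6, 1, 4, 6, 2, 5, 3, 6, 2, 1, 4, 5, 3, 6, 2]

# Synthesis instead learns how the die behaves: estimate the
# frequency of each face from the observed outcomes.
counts = Counter(observed)
faces = sorted(counts)
weights = [counts[face] for face in faces]

# Now we can roll the "virtual die" indefinitely, producing endless
# results without ever referencing the original players or their rolls.
rng = random.Random(42)
virtual_rolls = rng.choices(faces, weights=weights, k=10_000)
```

The virtual rolls share the statistical behavior of the observed die, but no individual roll in the output is a record of any real player's turn.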
This approach offers significant advantages. By focusing on the data-generating process rather than individual records, synthetic data eliminates the risk of exposing personal information while preserving the statistical value needed for meaningful analysis. This allows for easier data sharing and broader application without compromising privacy. However, if your process relies on the original data—such as scoring a game of dice to determine a winner (see breakout box above)—synthetic data may not accurately reflect the real-world outcomes required for certain decisions.
While generative AI can easily create lifelike yet fictional images, such as human faces, producing synthetic tabular datasets that accurately reflect real-world data is far more complex. Unlike images, tabular data presents greater challenges—AI must handle missing values, diverse data types, and numerical variables with erratic and unpredictable distributions, making faithful replication significantly harder.
Understanding the Challenges of Real-life Tabular Data

A major challenge in synthesizing tabular data is handling outliers. Take Group Life insurance as an example: a CEO’s exceptionally high salary may stand out from the rest, posing risks for synthetic data users. The AI might generate multiple synthetic CEOs or none at all, skewing analysis depending on the dataset’s purpose. Additionally, the synthetic data could still reveal sensitive insights – such as an approximate salary – potentially compromising privacy. This highlights the need for rigorous, context-aware evaluation to ensure synthetic data maintains both analytical integrity and privacy protection.
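The outlier problem can be made concrete with a deliberately naive synthesizer. Using invented salary figures, the sketch below resamples records with noise and counts how many “synthetic CEOs” each run produces – sometimes none, sometimes several, and always at values that hint at the real outlier. This is a toy model, not how production synthesizers work.

```python
import random

# Hypothetical Group Life portfolio: 50 ordinary salaries plus one
# outlier CEO salary that dominates the tail of the distribution.
salaries = [60_000 + 1_000 * i for i in range(50)] + [2_000_000]

def synthesize(rng, data, n):
    """Naive synthesizer: resample records with small multiplicative noise.

    Note that resampling leaks the original values almost directly,
    which is exactly the privacy problem outliers create."""
    return [max(0.0, rng.choice(data) * rng.gauss(1.0, 0.05)) for _ in range(n)]

rng = random.Random(7)
ceo_counts = []
for _ in range(20):  # generate 20 independent synthetic datasets
    synth = synthesize(rng, salaries, len(salaries))
    ceo_counts.append(sum(s > 1_000_000 for s in synth))

# Some synthetic datasets contain no "CEO", others several, skewing any
# downstream analysis -- and a synthetic salary near 2M still reveals
# the real outlier's approximate pay.
```

Run to run, the number of synthetic CEOs fluctuates around one, which is why context-aware evaluation is needed before trusting the output for a given purpose.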
The regulatory landscape also presents complexities. As synthetic replicas of personal data become increasingly realistic, there is a growing need for robust safeguards to prevent AI from inadvertently leaking personal information.
After an initial phase of hype, we are beginning to better understand the limitations of synthetic data and the challenges of managing the privacy-utility trade-off. This technology is evolving, and in the coming years, privacy guarantees are expected to become more transparent and quantifiable. For instance, synthetic data providers have started developing tools to measure privacy risk, such as the open-source project Anonymeter.[3] Encouragingly, some data protection regulators[4] have expressed support for these quantification methods, suggesting that well-designed synthetic data could indeed be considered anonymous.
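Tools like Anonymeter quantify privacy risk through formal attack-based evaluations. As a simpler, generic illustration of the idea of measuring risk rather than asserting it, the sketch below computes a distance-to-closest-record (DCR) heuristic on invented records: a synthetic row that lies very close to a real one signals potential leakage. This is a common sanity check, not the Anonymeter methodology.

```python
import math

# Hypothetical records as (age, bmi) pairs, pre-scaled to comparable ranges.
original = [(34, 22.5), (45, 27.1), (29, 24.0), (52, 30.2)]
synthetic = [(36, 23.1), (44, 26.8), (51, 29.9), (30, 24.2)]

def closest_distance(record, dataset):
    """Euclidean distance from one record to its nearest neighbor in dataset."""
    return min(math.dist(record, other) for other in dataset)

# Distance to closest record: very small values mean a synthetic row
# sits suspiciously close to a real individual and may leak information.
dcr = [closest_distance(s, original) for s in synthetic]
min_dcr = min(dcr)
```

In practice such a score is compared against a baseline (for example, distances within a holdout of real data) rather than judged in isolation.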
As the technology matures, we will gain a clearer understanding of where synthetic data is most effective – and where it falls short. Additionally, more commonly accepted metrics for assessing synthetic data quality will emerge. Efforts to formalize best practices are already underway; for example, the IEEE Standards Association has assembled a team of experts to establish global standards for the privacy-enhancing use of synthetic data. Learn more at IEEE Synthetic Data Initiative.
For insurers exploring synthetic data, it may be attractive to begin with open-source toolkits, which offer an efficient way to explore and understand the technology without incurring license fees or experiencing vendor lock-in. For example, Synthetic Data Vault (SDV) is an open-source platform for generating and evaluating synthetic datasets.
Producing high-quality, reliable synthetic data is complex. It requires experienced data scientists to properly configure the algorithms, fine-tune parameters and rigorously evaluate the outcomes. While open-source tools are useful for experimentation, many commercial solutions offer additional advantages.
For many insurers, open-source tools serve as a gateway to the technology, helping them understand what vendors can offer and clarify the organization’s specific needs before committing to a commercial product.
There is also a growing trend of integrating synthetic data algorithms directly into enterprise database solutions. This allows organizations to synthesize data within their existing systems, enhancing efficiency and enabling seamless data generation where it’s needed most.
Understanding these tools and trends can help insurers leverage synthetic data more effectively, balancing privacy, utility and operational efficiency.
While synthetic data has made significant strides, generating high-quality, reliable datasets remains a challenge. For high-value use cases, fully trusting automatically generated synthetic data is still risky. Human expertise is essential to carefully evaluate and validate these datasets to ensure they meet quality and privacy standards.
Looking ahead, the key question is how much of this process can be automated – and whether automation can eventually unlock the full potential of synthetic data in insurance. The true value lies in its ability to accelerate data-driven insights, enhance operational efficiency and protect consumer privacy simultaneously.
PartnerRe’s data scientists are actively exploring the evolving synthetic data landscape and developing solutions to balance data-driven insights with privacy protection. If you’re interested in discussing this topic and its impact on insurance, get in touch with our team.
Markus Senn, Global Head of Strategic Data Science, Life & Health
[1] Nvidia Bets Big on Synthetic Data | WIRED
[2] Is Synthetic Data the Future of AI?
[3] GitHub – statice/anonymeter: A Unified Framework for Quantifying Privacy Risk in Synthetic Data according to the GDPR
[4] https://github.com/statice/anonymeter/blob/main/cnil/CNIL_opinion_anonymeter.pdf