Making Synthetic Data More Useful
Synthetic Data goes beyond the Software Development Lifecycle
Synthetic data is commonly defined as “any production data applicable to a given situation that are not obtained by direct measurement” [McGraw-Hill Dictionary of Scientific and Technical Terms]. Software developers and testers have generated it since the early 1990s to get easy access to data during the various stages of the Software Development Life Cycle (SDLC). But synthetic data can also be used beyond the SDLC. One example is clinical trials, where researchers generate synthetic data to help establish a baseline for future studies. The burgeoning data analytics industry is likely to accelerate the need for synthetic data, both to test out algorithms and to communicate the value of data products in marketing situations.
How does Real Production Data differ from Synthetic Data?
The alternative to synthetic data is real production data, i.e. data as it is measured, observed, or recorded in transaction systems. Compared to real data, synthetic data has a mixed reputation: tools and in-house implementations for synthetic data generation have frequently been weak substitutes for the real thing.
Common synthetic data weaknesses stem from an inability to create data that closely mimics real data. Ideally, synthetic data and real data should exhibit identical statistical properties. That is hard to achieve, as it requires deep insight into the data, and it has so far not been implemented in market products. The difficulty is accentuated in end-to-end testing of transactions that span multiple applications and databases, where the required analytical complexity grows rapidly. The resulting lack of realism can degrade the quality of the applications being developed and have severe business impact.
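To make the statistical-fidelity goal concrete, here is a minimal sketch (function and variable names are illustrative, not from any particular product): a categorical column is synthesized by sampling from the real column's empirical frequency distribution, so that category proportions in the synthetic data track those in the real data.

```python
import random
from collections import Counter

def synthesize_column(real_values, n, seed=None):
    """Sample n synthetic values from the empirical distribution of a real
    column, so category frequencies in the synthetic data approximate
    those of the real data -- a first step toward statistical fidelity."""
    rng = random.Random(seed)
    counts = Counter(real_values)       # frequency of each real value
    values = list(counts)
    weights = [counts[v] for v in values]
    return rng.choices(values, weights=weights, k=n)
```

Note that this preserves only the marginal distribution of a single column; as discussed above, the hard part is preserving joint, cross-column properties as well.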
The advantages of synthetic data are:
- Better control that boundary conditions are covered
- Ability to create data for new situations for which no production data exists
- Easy creation of large volumes of data, which is critical in e.g. performance testing
- No sensitive content
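The first advantage can be illustrated with a small sketch (the function name is hypothetical): classic boundary-value analysis for an integer field generates each edge of the valid range plus the values just inside and just outside it, values that production data rarely happens to contain.

```python
def boundary_values(lo, hi):
    """Boundary-value-analysis test points for an integer field that is
    valid on [lo, hi]: each edge, one step inside, and one step outside.
    Production data rarely hits these exact values; generated data can."""
    return sorted({lo - 1, lo, lo + 1, hi - 1, hi, hi + 1})

# e.g. boundary_values(1, 100) -> [0, 1, 2, 99, 100, 101]
```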
Synthetic Data is ALL about Security & Control
The last point is the most important. Production data is frequently very sensitive and cannot be widely used without masking, which can be complicated. Whether for regulatory reasons or internal business reasons, there are typically many data elements that should not be disclosed widely. As witnessed by e.g. the European Union's 2016 data protection reform (the GDPR), privacy legislation is a hot topic globally, and it is driving renewed interest in synthetic data and demand for better synthetic data generation products.
How can synthetic data generation be improved? The key to great synthetic data is understanding production data structurally and statistically. As production data is typically only partially documented and known, it is necessary to apply tools to discover what production data looks like. This is particularly important and challenging in situations where multiple applications are chained together to perform the desired business functionality.
A foundational capability for better synthetic data is the ability to understand the properties of the data we are trying to represent. It is particularly important to capture relations between data elements, as they carry high information content that should not be lost. Some of these relations are captured as primary key/foreign key relations in the database. Other relations are implied by application code, and these are typically 3-5 times as numerous as the database-enforced relations. Ideally, synthetic data should preserve both the explicit and the implied relations.
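One common heuristic for discovering implied relations is the inclusion-dependency check, sketched below with hypothetical table and column names: if every value in one column also appears in a column of another table, the pair is a candidate application-enforced foreign key. A real discovery tool would also weigh data types, cardinality, and name similarity to cut false positives.

```python
def candidate_foreign_keys(tables):
    """Flag column pairs where every value in one column also appears in a
    column of a different table -- a heuristic signal of an implied
    (application-enforced) relation not declared in the database schema.

    `tables` maps table name -> {column name: list of values}."""
    cols = [(t, c, set(v)) for t, colmap in tables.items()
            for c, v in colmap.items()]
    candidates = []
    for t1, c1, v1 in cols:
        for t2, c2, v2 in cols:
            # non-empty value set fully contained in the other column
            if t1 != t2 and v1 and v1 <= v2:
                candidates.append((f"{t1}.{c1}", f"{t2}.{c2}"))
    return candidates
```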
The next level of understanding is how data flows between applications as it allows synthetic data generation for complete transactions implemented as multiple applications. This approach ensures that complete synthetic data can be provided and that it does not miss important related data.
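Generating complete, related data can be sketched as follows (the two-table schema and all names are illustrative assumptions): child records draw their foreign keys only from generated parent records, so joins across the tables used by chained applications remain valid end to end.

```python
import random

def generate_customers_and_orders(n_customers, n_orders, seed=None):
    """Generate two related tables where every order references a generated
    customer, keeping referential integrity intact across the data set."""
    rng = random.Random(seed)
    customers = [{"id": i, "name": f"cust_{i}"}
                 for i in range(1, n_customers + 1)]
    # foreign keys are drawn only from the generated customer ids
    orders = [{"id": i, "customer_id": rng.randrange(1, n_customers + 1)}
              for i in range(1, n_orders + 1)]
    return customers, orders
```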
Make Generating Synthetic Data Even Easier
As synthetic data can be used in many situations, there will be many different categories of users, and a good synthetic data generation tool needs to cater to users with different needs and levels of expertise. New users may want to generate large volumes of data relying on default settings, while experienced users may need fine-grained control, e.g. to ensure that particular boundary conditions or unusual combinations are generated.
With applications, databases, and interconnections constantly changing, the tool should continuously detect changes and adapt synthetic data generation to rapidly evolving systems. Without this capability, productivity losses from broken test scenarios are likely.
Finally, no automated system is perfect and most organizations have subject matter experts (SMEs) who possess knowledge that can greatly augment the automation tool. The best tools will allow SME knowledge to be crowd-sourced and integrated with automation.
ROKITT ASTRA™ offers synthetic data generation based on automatic discovery. Its deep data insights are fueled by machine learning and allow users to easily create synthetic data based on the principles outlined above. Synthetic data generated with ROKITT ASTRA mimics production data with high fidelity while being anonymous and non-sensitive, so that it can be used freely across the enterprise.