Crystal Johnson
Project Analyst
SpinSys

Introduction

In today’s world, organizations use data to identify trends, draw conclusions, drive strategic planning, influence process improvement, and improve marketing strategies. As a significant use case in healthcare, the National Intrepid Center of Excellence (NICoE) uses data to research, predict, and develop proactive treatment plans for military members and veterans who suffer from traumatic brain injuries (TBIs) and post-traumatic stress (PTS). These holistic, evidence-based treatment plans teach service members strategies to address their stressors and facilitate their return to a more productive lifestyle with healthy relationships. In addition to real data pertaining to actual patients, organizations also use de-identified, limited use, and synthetic data for research and operational purposes.

Data Types

De-identified Data:
De-identified data is health information from an individual’s medical record that excludes 18 personal identifiers of the individual and of his or her relatives, employers, or household members. De-identification may involve removing or altering elements such as the date of visit and shortening the ZIP code so that the individual can no longer be identified from the data being used. For instance, the ZIP code is reduced to its first 3 digits, and if those digits represent 20,000 or fewer people, they are replaced with 000. A patient’s dates are algorithmically shifted to protect confidentiality. All 18 personal identifying elements must be removed or obfuscated to create de-identified data from the original real data.
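The ZIP code and date handling described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete de-identification routine: the field names, the list of direct identifiers, the set of low-population ZIP prefixes, and the 180-day shift window are all assumptions made for the example, and a suppressed ZIP prefix is represented here as 000.

```python
import hashlib
from datetime import date, timedelta

# Illustrative placeholders only. A real pipeline would derive the list of
# 3-digit prefixes covering 20,000 or fewer people from current census data.
LOW_POPULATION_PREFIXES = {"036", "059"}

# Illustrative subset of the 18 identifiers, using hypothetical field names.
DIRECT_IDENTIFIERS = {"name", "ssn", "phone", "email"}


def generalize_zip(zip_code: str) -> str:
    """Keep the first 3 digits, or suppress them for low-population areas."""
    prefix = zip_code[:3]
    return "000" if prefix in LOW_POPULATION_PREFIXES else prefix


def shift_date(visit: date, patient_id: str, secret: str, max_days: int = 180) -> date:
    """Shift a date by a per-patient offset in [-max_days, +max_days].

    The offset is derived from a keyed hash, so every date belonging to the
    same patient moves by the same amount and intervals between visits survive.
    """
    digest = hashlib.sha256(f"{secret}:{patient_id}".encode()).digest()
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return visit + timedelta(days=offset)


def deidentify(record: dict, secret: str) -> dict:
    """Return a copy of the record with identifiers dropped or generalized."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["zip"] = generalize_zip(record["zip"])
    cleaned["visit_date"] = shift_date(record["visit_date"], record["patient_id"], secret)
    cleaned.pop("patient_id", None)  # used only to key the shift, then dropped
    return cleaned
```

Deriving the shift from a secret and the patient identifier keeps the time between one patient’s visits intact, which many analyses depend on, while hiding the actual calendar dates.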

Limited Use Data:
Somewhat similarly, limited use data is information from a medical record that excludes only 16 personal identifiers of an individual and of his or her relatives, employers, or household members. However, these datasets can include city, state, ZIP code, and dates of visits. Within the health science fields, limited use and de-identified datasets are originally derived from real data and only disclosed for research, public health purposes, or the advancement of health care operations. Other disciplines may also use de-identified or limited use data to meet specific use cases where real data should not be disclosed or is otherwise required to be protected.

Synthetic Data:
Conversely, synthetic data is artificially created to statistically resemble actual data, but does not reflect real characteristics of a specific individual. The data fields are generated randomly by computer programs from statistical properties of true data, producing a synthetic dataset that resembles the real one. The most meaningful synthetic data is tailored to mimic the defined tables, file structures, and other characteristics of the actual data it represents. Synthetic data is used in a myriad of ways, most notably in conjunction with artificial intelligence (AI) and machine learning. These datasets can be created quickly while minimizing privacy concerns related to Protected Health Information (PHI) and Personally Identifiable Information (PII). Self-driving cars, fraud detection systems, facial recognition software, machine learning algorithms, and software testing all use synthetic data.
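As a rough illustration of data that is statistically derived from true data while keeping the same table structure, the Python sketch below fits a simple summary to each column of a small table and then samples new rows from those summaries. It is a deliberately simplified assumption, not how any particular product generates synthetic data: each column is modeled independently, so relationships between columns are not preserved, and the sample rows and column names are invented for the example.

```python
import random
import statistics
from collections import Counter


def fit_marginals(rows):
    """Summarize each column: mean and stdev if numeric, value counts otherwise."""
    model = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            model[column] = ("numeric", statistics.mean(values), statistics.stdev(values))
        else:
            counts = Counter(values)
            model[column] = ("categorical", list(counts.keys()), list(counts.values()))
    return model


def sample_synthetic(model, n, seed=None):
    """Draw n synthetic rows with the same columns as the source table."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for column, spec in model.items():
            if spec[0] == "numeric":
                row[column] = rng.gauss(spec[1], spec[2])
            else:
                row[column] = rng.choices(spec[1], weights=spec[2], k=1)[0]
        rows.append(row)
    return rows


# Invented example rows standing in for a real table.
real_rows = [
    {"age": 34, "diagnosis": "TBI"},
    {"age": 41, "diagnosis": "PTS"},
    {"age": 29, "diagnosis": "TBI"},
    {"age": 52, "diagnosis": "TBI"},
]

synthetic_rows = sample_synthetic(fit_marginals(real_rows), n=100, seed=7)
```

No synthetic row corresponds to a specific person in the source table, yet the column names and the overall distribution of each column follow the original. Production-grade generators model the joint distribution across columns rather than each column on its own.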


Performance: Synthetic vs Real Data

The most telling measure of data quality is the consistency, accuracy, completeness, and effectiveness of the data when it’s being used. A 2017 study conducted by MIT found that machine learning models built from synthetic data produced results consistent with machine learning models built with real data 70% of the time. 

MDACA.io provides the capability to generate synthetic data powered by AI. The generated datasets can be tailored to meet a wide range of use cases in which a model is developed from real data and additional parameters. Once created, the synthetic data can generally support a variety of use cases, which matters because multiple groups within an organization need access to the data. Data de-identification may not fully protect data from re-identification because other data sources may still contain identifying information. Data anonymization is a stronger form of data minimization that offers higher levels of data privacy by completely and permanently removing personally identifiable information.

To maximize privacy preservation in synthetic data, data synthesis must satisfy the definition of differential privacy, which limits how much any one individual’s presence can change the output so that an observer cannot reliably tell whether a particular individual was part of the original dataset. To maximize the utility of synthetic data, data synthesis must preserve the statistical properties and structure of the original dataset to a high degree.
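To make the idea of differential privacy more concrete, the sketch below applies the Laplace mechanism, a standard building block, to a simple counting query. It is not a differentially private data synthesizer; it only shows how calibrated noise hides the contribution of any one record. The function names and the example records are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching the predicate.

    Adding or removing one record changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1 / epsilon gives epsilon-differential privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Invented example records.
records = [{"diagnosis": "TBI"}, {"diagnosis": "PTS"}, {"diagnosis": "TBI"}]
noisy_tbi_count = private_count(records, lambda r: r["diagnosis"] == "TBI", epsilon=0.5)
```

A smaller epsilon adds more noise, which strengthens privacy and reduces accuracy. This is the same privacy versus utility trade-off described above for synthetic data.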


How MDACA Optimizes the Use of Various Data Types

Organizations are faced with the challenge of extracting specific data points while simultaneously safeguarding the privacy of customer and patient data. They must also adhere to compliance regulations: for example, healthcare organizations must comply with HIPAA, and other enterprises with regulations such as Sarbanes-Oxley (SOX). As a component of the Multiplatform Data Acquisition, Collection, and Analytics (MDACA) suite of applications, the MDACA Big Data Virtualization (BDV) solution provides the capability of masking and obfuscating data at query time from the database connection level; the secure data is confined to the system and remains secure at the source without compromise. BDV can create de-identified data from any database repository by removing PII and PHI elements from the data being delivered to users. Additionally, MDACA BDV can use AI to generate synthetic data, both as in-memory databases and as data stored in external data sources and accessed through MDACA BDV.

MDACA BDV is able to combine data from various sources regardless of the storage location or class. For instance, data stored within Amazon Web Services (AWS) Simple Storage Service (S3) and data stored in multiple Structured Query Language (SQL) relational databases (e.g., Oracle, SQL Server, Postgres) can be pulled into a single query. MDACA BDV allows enterprise users to easily access data across a wide range of data sources and database types through query federation. BDV provides a logical data layer that integrates enterprise-wide data across disparate systems and manages the unified data within a single location for centralized data access, providing enterprise data privacy and controlled access to data at the source. With advanced row-level filtering and dynamic column-masking policies, filters and data masks can be set for specific users, groups, and conditions.

The MDACA Cloud Native Data Lake can house various data types by leveraging the flexible storage and elastic processing benefits offered by cloud platforms such as AWS and Microsoft Azure. With MDACA Cloud Native Data Lake, organizations can store various types of data in an optimized, compressed, and cost-effective manner that is securely and easily available to each system. The result is a more efficient, flexible, and cost-effective approach that leverages real, de-identified, limited use, and synthetic data as needed.
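The general idea behind row-level filtering and dynamic column masking can be sketched in plain Python as follows. This is a conceptual illustration only and does not represent the MDACA BDV interface or configuration: the AccessPolicy class, the role names, the column names, and the masking rules are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class AccessPolicy:
    """Hypothetical policy: which rows a role may see and how columns are masked."""
    row_filter: Callable[[Row], bool]
    column_masks: Dict[str, Callable[[object], object]]


def mask_all_but_last4(value: object) -> str:
    """Replace every character except the last four with an asterisk."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]


# Hypothetical roles: clinicians see everything, researchers see only
# consented patients with the name and SSN columns masked.
POLICIES = {
    "clinician": AccessPolicy(row_filter=lambda row: True, column_masks={}),
    "researcher": AccessPolicy(
        row_filter=lambda row: bool(row["consented"]),
        column_masks={"ssn": mask_all_but_last4, "name": lambda _: "REDACTED"},
    ),
}


def apply_policy(rows: List[Row], role: str) -> List[Row]:
    """Filter and mask at read time; the source rows are never modified."""
    policy = POLICIES[role]
    visible = []
    for row in rows:
        if not policy.row_filter(row):
            continue
        visible.append(
            {col: policy.column_masks.get(col, lambda v: v)(val) for col, val in row.items()}
        )
    return visible


# Invented example rows.
patients = [
    {"name": "Jane Doe", "ssn": "123-45-6789", "consented": True},
    {"name": "John Roe", "ssn": "987-65-4321", "consented": False},
]
researcher_view = apply_policy(patients, "researcher")
```

Because the policy is applied when data is read rather than when it is stored, the same source can serve different users with different levels of detail.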


Conclusion

De-identified, limited use, and synthetic data are used to advance society, especially in the AI and healthcare fields. The impacts of these data-driven changes can be seen every day, ranging from medical encounters to fraud prevention tactics. MDACA can help organizations store, navigate, and create various types of data to meet use case needs and organizational objectives.