Shifting Out of Neutral:  

Using Data Science to Accelerate Clinical Insight 

By Salma A. Sheriff and Anai N. Kothari, MD, MS 

10/01/2022

An immense amount of data is generated and used with every digital interaction. Industries across the board have turned to data activation, whether it's Google Maps suggesting an alternate commute or Netflix learning about your preference for thrillers. Healthcare produces exabytes of data annually (one exabyte is 1 billion gigabytes), yet barriers to access have kept it from making similar progress.


The problem is due in part to the difficulty of retrieving the data generated from day-to-day clinical care (“real-world data”). Researchers often resort to spending hours manually going through individual patient charts and creating single-use datasets for their own pre-specified purpose. The situation could be improved by addressing three shortcomings: inconsistent data definitions, lack of interoperability, and the lack of a unifying data model.

"Toronto Star staff car stuck in the mud" by Toronto History is licensed under CC BY 2.0.  

Wider use of a common data model is a good first step. Data users and the informatics teams that retrieve data for them can collaborate and agree on variable names, sources, and definitions beforehand. The Observational Medical Outcomes Partnership (OMOP) common data model (CDM) provides standardized healthcare terminology by mapping entered data elements to shared concepts.


For example, laparoscopic cholecystectomy, the surgical removal of the gallbladder, could be entered as CPT code 47562 for insurance purposes or as ICD-9-CM code 51.23 and/or ICD-10-PCS code 0FT44ZZ for clinical purposes. OMOP maps each of these terms from different coding systems into a single concept (laparoscopic cholecystectomy - 4163971) to ease querying and standardize abstraction.
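The mapping idea can be sketched in a few lines of Python. This is a minimal illustration, not how an OMOP database is actually queried: the lookup table is hand-written from the codes mentioned above, and the vocabulary labels and the function names `to_standard_concept` and `count_procedures` are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical, hand-written lookup for illustration only. A real OMOP
# instance stores these relationships in its vocabulary tables.
SOURCE_TO_CONCEPT: Dict[Tuple[str, str], int] = {
    ("CPT4", "47562"): 4163971,      # billing code from the example above
    ("ICD9Proc", "51.23"): 4163971,  # ICD-9-CM procedure code from the example
}

CONCEPT_NAMES = {4163971: "laparoscopic cholecystectomy"}


def to_standard_concept(vocabulary: str, code: str) -> Optional[int]:
    """Return the shared concept ID for a source code, or None if unmapped."""
    return SOURCE_TO_CONCEPT.get((vocabulary, code.strip()))


def count_procedures(records: Iterable[Tuple[str, str]]) -> Dict[int, int]:
    """Count records per shared concept, regardless of source coding system."""
    counts: Dict[int, int] = {}
    for vocabulary, code in records:
        concept_id = to_standard_concept(vocabulary, code)
        if concept_id is not None:
            counts[concept_id] = counts.get(concept_id, 0) + 1
    return counts


if __name__ == "__main__":
    records = [("CPT4", "47562"), ("ICD9Proc", "51.23"), ("CPT4", "99999")]
    for concept_id, n in count_procedures(records).items():
        print(CONCEPT_NAMES[concept_id], n)
```

Once every source code resolves to the same concept ID, a single query on that ID finds the procedure no matter which coding system was used at entry.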


Consolidating multiple input data types into a single, common data element gets more challenging when the concepts are more complex than simple entries like "date of birth" or "sex." One solution is the development of computable phenotypes: multi-step logical queries that use existing data to identify which clinical term applies. For example, the CDC's National Healthcare Safety Network (NHSN) defines superficial incisional surgical site infection (SSI) with a set of criteria covering the timing of the infection, the tissue involved, and the clinical findings at the incision.


This logic could be programmed into a computable phenotype called "superficial incisional surgical site infection." The result is a reproducible process that extracts the data element quickly with minimal manual oversight, remains flexible to future modifications, and maps to an OMOP concept.
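As a rough sketch of what such a phenotype might look like in code, the Python below encodes a simplified paraphrase of the superficial incisional SSI logic: an event within a 30-day window of the procedure, limited to skin and subcutaneous tissue, with at least one qualifying finding. The `IncisionRecord` fields, the function name, and the day-counting convention are all assumptions made for illustration; the current NHSN definition remains the authoritative source for the actual criteria.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class IncisionRecord:
    """Hypothetical, already-extracted fields for one surgical incision."""
    procedure_date: date
    event_date: date
    skin_and_subcutaneous_only: bool
    purulent_drainage: bool = False
    organism_identified: bool = False
    opened_without_testing: bool = False  # incision deliberately opened, no testing done
    local_signs: bool = False             # pain/tenderness, swelling, erythema, or heat
    clinician_diagnosis: bool = False


def is_superficial_incisional_ssi(rec: IncisionRecord, window_days: int = 30) -> bool:
    """Simplified phenotype: timing AND tissue depth AND at least one finding."""
    # Assumption: the procedure date counts as day 1 of the window.
    days_since = (rec.event_date - rec.procedure_date).days
    in_window = 0 <= days_since < window_days

    qualifying_finding = (
        rec.purulent_drainage
        or rec.organism_identified
        or (rec.opened_without_testing and rec.local_signs)
        or rec.clinician_diagnosis
    )
    return in_window and rec.skin_and_subcutaneous_only and qualifying_finding


if __name__ == "__main__":
    example = IncisionRecord(
        procedure_date=date(2022, 1, 1),
        event_date=date(2022, 1, 10),
        skin_and_subcutaneous_only=True,
        purulent_drainage=True,
    )
    print(is_superficial_incisional_ssi(example))
```

Keeping the window length as a parameter and each finding as its own field is one way to keep the phenotype flexible to future modifications of the definition.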

There are additional barriers, like the complexity of data storage structures and proprietary data systems. Still, standardizing inputs via CDMs and computable phenotypes is a necessary first step towards building a data science-based approach to organizing and activating healthcare data. Eventually, the same techniques used to create email filters that analyze and flag spam could help detect inaccuracies in clinical notes. Marketing firms use advanced analytics to target advertisements; those methodologies could be used to shift from population-level recommendations to personalized medicine.


This is exactly the framework that guides members of the AN.AI Lab as we strive to re-imagine how data can power healthcare transformation. Thinking critically about our interactions with data formed the foundation for projects like Nate’s work on post-operative outcomes of cancer patients with COVID-19 and Nic’s work assessing price variation across common surgical procedures.

Red coupe on road during night time by Tharoushan Kandarajah on Unsplash.