The 21st Century Cures Act, passed in 2016, placed additional focus on real world data to support regulatory decision making, including new indications for approved drugs. In the legislation, Congress defined real world evidence as data regarding the usage or the potential benefits or risks of a drug derived from sources other than traditional clinical trials.
Those sources vary and include electronic health records (EHRs), claims and billing activities, product and disease registries, patient-generated data (including data from in-home use settings), and data gathered from other sources such as mobile devices.
But how do we know if this data is accurate? It’s an important and timely question, Dan Riskin told AI Trends. Riskin is founder of Verantos and Adjunct Professor of Surgery and Biomedical Informatics Research at Stanford University.
“Real world evidence has previously not made a lot of clinical assertions,” Riskin said. “When we get drugs approved or we make reimbursement decisions, typically we do it off of clinical trials. Now we’re using real world evidence, which before had mostly been used for trial recruitment, trial design, and marketing insights—meaning no clinical assertions. Now we’re making clinical assertions with this data.”
That’s a big transition, Riskin said, and the data needs to be worthy of the new applications.
In a study published last August in the Journal of the American Medical Informatics Association (DOI: 10.1093/jamia/ocz119), Riskin and colleagues from Stanford University and Amgen conducted a retrospective study of more than 10,000 EHRs, seeking to mine the data for certain clinical concepts and compare the accuracy of AI technologies to traditional tools in using these real world datasets.
“If you’re going to run a study, you should check your data accuracy level. If you’re going to run a study to make a clinical assertion with real world evidence, you should confirm data validity,” Riskin said.
Test Case: Cardiovascular Medicine
The August study examined 10,840 EHRs from a large academic medical center from 2010 to 2016. The goal was to measure with what accuracy certain clinical concepts in cardiovascular medicine—coronary artery disease, diabetes mellitus, myocardial infarction, chronic kidney disease, stroke, dementia, and more—could be mined from the EHR, dividing the record into structured portions (EHR-S; problem list, medication list, and laboratory list) and unstructured portions (EHR-U; case notes).
Traditionally, real-world data has been derived from the structured portions of the EHR: problem, procedure, and other lists within the EHR were matched by SQL query to a list of relevant codes. But the authors hypothesized that these techniques may be insufficient to build a dataset accurate enough to support making clinical assertions. Instead, they proposed that artificial intelligence approaches could leverage unstructured clinical text, building more accurate datasets.
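To make the traditional approach concrete, here is a minimal sketch of matching a structured problem list against a code list by SQL query. The table name, column names, patients, and code strings are all hypothetical placeholders; the article does not give the study's actual schema or code lists.

```python
import sqlite3

# Hypothetical code list for one clinical concept (myocardial infarction).
# These strings are illustrative placeholders, not the study's code list.
MI_CODES = ["I21.0", "I21.9"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE problem_list (patient_id INTEGER, code TEXT)")
conn.executemany(
    "INSERT INTO problem_list VALUES (?, ?)",
    [(1, "I21.9"), (2, "E11.9"), (3, "I10")],  # made-up structured entries
)

# Match the structured entries against the code list.
placeholders = ", ".join("?" for _ in MI_CODES)
rows = conn.execute(
    "SELECT DISTINCT patient_id FROM problem_list "
    f"WHERE code IN ({placeholders}) ORDER BY patient_id",
    MI_CODES,
).fetchall()
patients_with_mi = [patient_id for (patient_id,) in rows]
print(patients_with_mi)  # [1]
```

A query like this can only find what was coded. A patient whose heart attack is mentioned only in a case note has no matching row and is missed, which is the gap the authors set out to measure.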
The team compared the findings from EHR data collected traditionally from the structured portion of the records to findings gathered with artificial intelligence tools from the unstructured data portions of the record. In both cases, they wanted to assess if the data were accurate enough to be considered “regulatory grade” or sufficiently accurate to justify the clinical assertion.
Accuracy, Riskin explained, has two components: precision and recall. Precision refers to a correct EHR: the patient did have the conditions listed in the record. Recall refers to a complete EHR: all of the patient’s conditions are included in the record. The team assumed that the EHR as a whole is complete and correct, but much of the detail may be “hidden” in physician notes, abbreviations, and shorthand.
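As a rough illustration of the two measures, the sketch below computes precision and recall for a single hypothetical patient by comparing the conditions extracted from a record against a reference set. The function name and the example data are assumptions made for illustration; the study computed its figures across thousands of records, not this way per patient.

```python
def precision_recall(extracted: set[str], reference: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of extracted conditions against a reference set."""
    true_positives = len(extracted & reference)
    # Precision: share of extracted conditions that are really present.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of the patient's real conditions that were extracted.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical patient: the reference lists three conditions,
# but only one of them was captured by the extraction.
reference = {"diabetes mellitus", "myocardial infarction", "stroke"}
extracted = {"diabetes mellitus"}

precision, recall = precision_recall(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=1.00 recall=0.33
```

In this made-up case everything extracted is correct, so precision is perfect, yet two of the three conditions are missing, so recall is poor. A dataset can look clean on precision alone while leaving out much of the clinical picture.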
“In general, when accuracy is measured in studies today, precision is emphasized rather than recall,” the study authors wrote. “This is not because precision is more statistically relevant than recall, but rather because precision is easier to assess.”
As a reference standard, at least two annotators manually reviewed each record in the EHR for the clinical concepts, counting how many of each occurred within both the structured and unstructured portions of the EHR, calling in a third expert if their findings didn’t agree.
The study authors set a threshold for recall and precision to indicate accurate data. Compared to the manually gathered reference standard, an 85% match was considered accurate for recall and a 90% match was considered accurate for precision. “This definition is not intended to be a set standard for all types of study questions, but rather a starting benchmark to initiate discussion,” they wrote.
The study used Verantos’ natural language processing (NLP) and machine learning inference tools on the unstructured portions of the EHR. The NLP pipeline included text extraction, section detection (which part of the EHR did text come from), information extraction and tagging, and concept assignment.
They found that for structured data portions of the EHR, mined traditionally, average recall and precision were 51.7% and 98.3%, respectively, as measured against the reference standard counts for each condition. Scores varied for different clinical conditions. For instance, recall ranged from 29.8% for myocardial infarction to 80.6% for diabetes mellitus.
For the unstructured data portions of the EHR, mined with Verantos AI, accuracy rose significantly: 95.5% for recall and 95.3% for precision. For each clinical concept, EHR-S accuracy was below regulatory-grade, while EHR-U met or exceeded criteria.
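The benchmark can be checked against the reported averages with a few lines of arithmetic. This is a sketch only: the function and variable names are hypothetical, and it assumes the thresholds are inclusive (a score equal to the threshold passes), which the article does not specify.

```python
# Thresholds described by the study authors (percent).
RECALL_THRESHOLD = 85.0
PRECISION_THRESHOLD = 90.0


def meets_benchmark(recall_pct: float, precision_pct: float) -> bool:
    """True only if both recall and precision reach their thresholds."""
    return recall_pct >= RECALL_THRESHOLD and precision_pct >= PRECISION_THRESHOLD


# Average (recall, precision) figures reported in the article.
reported = {"EHR-S": (51.7, 98.3), "EHR-U": (95.5, 95.3)}

results = {source: meets_benchmark(r, p) for source, (r, p) in reported.items()}
print(results)  # {'EHR-S': False, 'EHR-U': True}
```

The structured data clears the precision bar easily but falls well short on recall, so it fails the combined test. Requiring both measures is what separates the two data sources here.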
A Heavy Lift
So far, real-world data has been collected from the “easy-to-use data,” Riskin says: claims data and structured EHR data. But this study suggests that the structured EHR data, at least, are giving an incomplete picture.
And Riskin argues that this incomplete picture will impact quality of care as real-world evidence is used to make clinical assertions. For example, “if you’re going to change the standard of care based on an assertion from real world evidence, and your assertion is based on the wrong patient population… then you’re going to treat people who may not benefit,” he said. As would be seen given low recall in myocardial infarction, “you’re saying you should treat people for primary prevention of heart attack with this treatment, but in fact your patient population was secondary prevention and you didn’t realize it,” because your data was incomplete.
Riskin thinks accuracy should be required in the protocol. “If today we’re starting a transition to change the standard of care based on real world evidence—which is a good thing, that enables precision medicine, tailored therapies and such—if we’re starting that today, we need to start it off correctly. The first step is to put into the protocol accuracy requirements for the key data—inclusion criteria, exclusion criteria. And within the accuracy requirements you need to have both precision and recall. Even though it’s really hard.”
He continued: “We need to redo our infrastructure in real world evidence… You can already see it happening. You already see the large contract research organizations merging for data,” he said, mentioning mergers between IMS and Quintiles and INC Research and inVentiv Health. But he also says many in the industry are taking the ostrich approach: “Maybe we don’t need to do it yet.”
That’s where Verantos hopes to step in. “We’re betting that it will be really hard, and startup companies are going to have to come in and help,” he said. “This is the time we need to start actually making our real-world evidence believable.”