
When the Model Knows Too Much and Proves Too Little: The Data Science Challenge Inside Risk Adjustment


The Prediction Problem Nobody Framed Correctly

Risk adjustment in Medicare Advantage is, at its core, a data science problem. The federal government pays private health insurers based on predicted healthcare costs for each enrolled member. Those predictions depend on diagnosis codes submitted by the plan, which are generated from clinical documentation reviewed by coders, increasingly assisted by AI. The entire $615 billion payment system runs on a chain of inferences that starts with unstructured clinical text and ends with a dollar amount.

The data science community has approached this primarily as a classification problem: given a clinical note, identify which diagnosis codes are present. NLP models, trained on annotated medical records, achieve high accuracy on this task. Reported precision rates of 92% to 96% are common in vendor literature. The models can reliably identify that a clinical note discusses diabetes, chronic kidney disease, or heart failure. At the classification level, the problem is largely solved.

The problem that isn’t solved, and that the classification framing obscures, is evidentiary sufficiency. Identifying that a note discusses diabetes doesn’t establish that the documentation provides adequate evidence of active disease management. The federal audit standard isn’t “does this condition appear in the record?” It’s “does the documentation prove this condition is being monitored, evaluated, assessed, and treated?” That’s a different task entirely, and it’s the one where current AI-assisted coding falls short: OIG audits published in March 2026 found that 81% to 91% of sampled codes were unsupported.

Classification vs. Evidentiary Reasoning

The distinction matters technically. Classification maps input text to a label: this note contains evidence of HCC 18 (Diabetes with Chronic Complications). The model outputs a probability score. Above a threshold, the code is recommended. This is supervised learning on annotated data, and it works well when the training set accurately represents the decision boundary.
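To make the classification framing concrete, here is a minimal sketch of the thresholding step in Python. The 0.80 threshold, the function name, and the scores are assumptions for illustration; "HCC_X" and "HCC_Y" are placeholder labels, not real codes.

```python
from typing import Dict, List

# Hypothetical cutoff; a real system would tune this on validation data.
THRESHOLD = 0.80


def recommend_codes(scores: Dict[str, float], threshold: float = THRESHOLD) -> List[str]:
    """Return the codes whose predicted probability meets the threshold, highest first."""
    picked = [(code, p) for code, p in scores.items() if p >= threshold]
    picked.sort(key=lambda item: item[1], reverse=True)
    return [code for code, _ in picked]


# Made-up classifier output for a single note (not real data).
note_scores = {"HCC 18": 0.87, "HCC_X": 0.42, "HCC_Y": 0.91}
recommended = recommend_codes(note_scores)
print(recommended)
```

Note what the output carries: a list of labels and nothing about which passages in the note support them.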

Evidentiary reasoning is a different computational task. It requires the model to assess whether specific evidentiary elements are present or absent in the documentation, map those elements to a defined framework (MEAT: Monitoring, Evaluation, Assessment, Treatment), weigh the sufficiency of the evidence against audit standards, and produce a structured explanation of its assessment that a human reviewer can validate.

This is closer to a structured extraction and reasoning pipeline than a classification model. The output isn’t a probability that a condition exists. It’s a documented assessment of whether evidence supports submission under regulatory standards. The inputs are the same (clinical text). The architecture is fundamentally different.

Models trained on classification learn “this text pattern correlates with this diagnosis code.” Models built for evidentiary reasoning learn “these specific textual elements satisfy these specific evidentiary requirements.” The first understands mention. The second understands proof.

The Training Data Problem

Classification models for risk adjustment are typically trained on historical coding data: charts labeled with the HCC codes that coders assigned. This creates a circular problem. If historical coding practices were biased toward oversubmission (coding conditions that appeared in charts without verifying documentation sufficiency), the training data inherits that bias. The model learns to recommend codes at the same rate and under the same conditions as the flawed historical process it was trained on.

OIG’s finding that 81% to 91% of sampled codes were unsupported suggests the historical coding process produced a significant false positive rate when measured against the actual audit standard. Models trained on that process will reproduce that false positive rate because they learned from data where unsupported codes were labeled as correct.

Correcting this requires training data annotated not just for code presence but for evidence sufficiency. Each example needs labels for which MEAT elements are present, which are absent, and whether the overall evidence meets the submission threshold. This annotation requires clinical expertise, not just coding knowledge. It’s also the only way to build models that distinguish between mention and proof.
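One possible shape for such an annotation record is sketched below in Python. The class and field names are hypothetical, and the overall sufficiency label is stored as the annotator's judgment rather than computed, since the submission threshold itself is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List

MEAT_ELEMENTS = ("monitoring", "evaluation", "assessment", "treatment")


@dataclass
class AnnotatedExample:
    note_id: str
    hcc_code: str
    code_mentioned: bool     # the only label historical coding data provides
    meat: Dict[str, bool]    # per-element labels from a clinical annotator
    meets_threshold: bool    # annotator's overall sufficiency judgment

    def __post_init__(self) -> None:
        # Refuse records that leave any MEAT element unlabeled.
        missing = set(MEAT_ELEMENTS) - set(self.meat)
        if missing:
            raise ValueError(f"missing MEAT labels: {sorted(missing)}")

    def absent_elements(self) -> List[str]:
        """List the MEAT elements the annotator marked as not documented."""
        return [e for e in MEAT_ELEMENTS if not self.meat[e]]


# Two made-up records: identical under code-presence labeling, different under sufficiency.
mention_only = AnnotatedExample(
    "note-001", "HCC 18", True,
    {"monitoring": False, "evaluation": False, "assessment": False, "treatment": False},
    meets_threshold=False,
)
documented = AnnotatedExample(
    "note-002", "HCC 18", True,
    {"monitoring": True, "evaluation": True, "assessment": True, "treatment": True},
    meets_threshold=True,
)
print(mention_only.absent_elements(), documented.absent_elements())
```

A classifier trained on `code_mentioned` alone would see these two records as the same positive example; the extra labels are what separate mention from proof.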

The Explainability Requirement as a Technical Constraint

CMS’s January 2026 directive that AI should serve as a “medical coder support tool” with human final determinations imposes a technical constraint that pure classification models don’t satisfy. A classification model that outputs “87% probability of HCC 18” doesn’t give the coder enough information to make a meaningful determination. The coder sees the probability score but not the reasoning. Validating the AI’s recommendation requires independent chart review, which negates the efficiency the AI was supposed to provide.

An evidentiary reasoning system that outputs “HCC 18 identified; A1C result (monitoring) found in paragraph 3; assessment language absent; treatment change documented in paragraph 7; MEAT score: 2 of 4 elements present” gives the coder structured evidence to evaluate. The coder applies clinical judgment to a documented assessment rather than rubber-stamping an opaque probability score. The AI supports the decision rather than replacing it.
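A structured output of this kind could be rendered from per-element findings roughly as follows. This is a sketch in Python; the `Finding` type and the wording of the summary are assumptions modeled on the example above, not a prescribed format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Finding:
    element: str                     # one of the four MEAT elements
    description: Optional[str]       # what was found, or None if nothing was
    paragraph: Optional[int] = None  # where it was found, if anywhere


def render(hcc: str, findings: List[Finding]) -> str:
    """Build a one-line summary that a coder can check against the chart."""
    parts = [f"{hcc} identified"]
    for f in findings:
        if f.description is None:
            parts.append(f"{f.element} language absent")
        else:
            parts.append(f"{f.description} ({f.element}) found in paragraph {f.paragraph}")
    present = sum(f.description is not None for f in findings)
    parts.append(f"MEAT score: {present} of {len(findings)} elements present")
    return "; ".join(parts)


# Made-up findings mirroring the example in the text.
findings = [
    Finding("monitoring", "A1C result", 3),
    Finding("evaluation", None),
    Finding("assessment", None),
    Finding("treatment", "treatment change", 7),
]
summary = render("HCC 18", findings)
print(summary)
```

Every clause in the summary points to a paragraph or names a gap, so the coder can verify the claim directly instead of rereading the whole chart.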

This isn’t just a regulatory preference. It’s better system design. When the AI’s reasoning is visible, the coder catches errors the model makes. The explainability requirement, viewed as a technical constraint rather than a compliance burden, produces higher system accuracy because it introduces a genuine error-correction mechanism.

Building the Right Architecture

The data science challenge in risk adjustment isn’t building a better classifier. It’s building an evidentiary reasoning system that combines NLP extraction (identifying clinical concepts in unstructured text), structured reasoning (mapping extracted concepts to MEAT criteria), sufficiency assessment (evaluating whether the evidence meets submission thresholds), and explanation generation (producing human-readable reasoning that coders can validate and auditors can follow).
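The four stages can be sketched as a small Python pipeline. Everything specific here is an assumption for illustration: the regex cues stand in for a clinical NLP extractor, the three-element cutoff is a hypothetical submission threshold, and the note text is made up.

```python
import re
from typing import Dict, List, Optional, Tuple

# Toy cue patterns (assumptions); a real extractor would be a trained clinical model.
CUES = {
    "monitoring": re.compile(r"\bA1C\b", re.IGNORECASE),
    "evaluation": re.compile(r"\b(exam|reviewed)\b", re.IGNORECASE),
    "assessment": re.compile(r"\b(stable|uncontrolled|worsening)\b", re.IGNORECASE),
    "treatment": re.compile(r"\b(increased|started|continue)\b", re.IGNORECASE),
}
MIN_ELEMENTS = 3  # hypothetical submission threshold


def extract(paragraphs: List[str]) -> List[Tuple[int, str]]:
    """Stage 1: find cue matches as (1-based paragraph number, MEAT element)."""
    hits = []
    for number, text in enumerate(paragraphs, start=1):
        for element, pattern in CUES.items():
            if pattern.search(text):
                hits.append((number, element))
    return hits


def map_to_meat(hits: List[Tuple[int, str]]) -> Dict[str, Optional[int]]:
    """Stage 2: record the first paragraph supporting each MEAT element."""
    meat: Dict[str, Optional[int]] = {element: None for element in CUES}
    for number, element in hits:
        if meat[element] is None:
            meat[element] = number
    return meat


def is_sufficient(meat: Dict[str, Optional[int]]) -> bool:
    """Stage 3: compare the number of supported elements to the threshold."""
    return sum(p is not None for p in meat.values()) >= MIN_ELEMENTS


def explain(code: str, meat: Dict[str, Optional[int]], ok: bool) -> str:
    """Stage 4: produce a short, reviewable explanation."""
    detail = "; ".join(
        f"{element}: " + (f"paragraph {p}" if p is not None else "absent")
        for element, p in meat.items()
    )
    verdict = "meets" if ok else "does not meet"
    return f"{code} {verdict} the sketch threshold ({detail})"


def assess(code: str, paragraphs: List[str]) -> dict:
    meat = map_to_meat(extract(paragraphs))
    ok = is_sufficient(meat)
    return {"code": code, "meat": meat, "sufficient": ok, "explanation": explain(code, meat, ok)}


paragraphs = [
    "Patient seen for diabetes follow-up.",
    "A1C 8.2 percent, up from 7.4.",
    "Metformin increased to 1000 mg twice daily.",
]
result = assess("HCC 18", paragraphs)
print(result["explanation"])
```

Keeping the stages as separate functions means each one can be evaluated and replaced independently, for example swapping the regex extractor for a model without touching the sufficiency rule.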

This architecture requires purpose-built clinical AI, not general-purpose NLP fine-tuned on medical text. It requires training data annotated for evidentiary sufficiency, not just code presence. It requires evaluation metrics that measure defensibility (what percentage of recommended codes survive audit-standard scrutiny), not just identification accuracy (what percentage of codes in the chart were found).
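The two metrics differ only in their denominator, which a short Python sketch makes explicit. The function names and the sample sets are illustrative assumptions; the code identifiers are placeholders.

```python
from typing import Set


def identification_rate(recommended: Set[str], in_chart: Set[str]) -> float:
    """Share of the codes present in the chart that the system found."""
    if not in_chart:
        return 0.0
    return len(recommended & in_chart) / len(in_chart)


def defensibility_rate(recommended: Set[str], audit_supported: Set[str]) -> float:
    """Share of the recommended codes that survive audit-standard review."""
    if not recommended:
        return 0.0
    return len(recommended & audit_supported) / len(recommended)


# Made-up example: every mentioned code is found, but only one is supported on audit.
in_chart = {"A", "B", "C", "D", "E"}
recommended = {"A", "B", "C", "D", "E"}
audit_supported = {"A"}

ident = identification_rate(recommended, in_chart)
defens = defensibility_rate(recommended, audit_supported)
print(ident, defens)
```

In this toy case the system scores 1.0 on identification and 0.2 on defensibility, which is how a model can look excellent on one metric while failing the other.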

The organizations getting this right are building their risk adjustment platform around the evidentiary reasoning architecture rather than wrapping compliance features around a classification engine. The difference is measurable in audit outcomes: systems built for evidentiary reasoning produce defensibility rates above 90%. Systems built for classification produce identification rates above 90% and defensibility rates that, based on OIG findings, can fall below 20%. The models are equally accurate at the task they were designed for. They were just designed for different tasks.

 

​Artificial Intelligence – The Data Scientist
