A groundbreaking study from Highmark Health and Ellipsis Health, published in JMIR AI, demonstrates that artificial intelligence can accurately detect and measure depression severity through voice analysis in real-world clinical settings. This marks a potential paradigm shift in how healthcare organizations approach behavioral health screening and care management. The research validates the technology that powers Ellipsis Health’s Empathy Engine and Sage, our AI Care Manager, which is being deployed to healthcare organizations nationwide.
Key study findings
Researchers from Highmark Health and Ellipsis Health analyzed 2,007 real-world case management calls and found that AI voice analysis achieved a concordance correlation coefficient of 0.54 on a blind test set, with a mean absolute error of 4.06 points on the PHQ-8 scale. The technology maintained consistent performance across age groups, gender, and socioeconomic categories, with area under the receiver operating characteristic curve (AUROC) values ranging from 0.79 to 0.83.
This represents the first large-scale validation of voice-based depression detection in routine clinical practice, moving beyond small pilot studies to demonstrate real-world clinical utility.
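For readers who want to see how the headline agreement metrics are defined, here is a minimal Python sketch of the concordance correlation coefficient (CCC) and mean absolute error (MAE). It is an illustration only, not the study's code: the function names and sample scores are hypothetical, and the CCC uses Lin's standard formulation with population (divide-by-n) variances, which is an assumption about the variant reported.

```python
from statistics import fmean


def concordance_ccc(true_scores, predicted_scores):
    """Lin's concordance correlation coefficient (population-variance form)."""
    if len(true_scores) != len(predicted_scores) or not true_scores:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(true_scores)
    mean_t = fmean(true_scores)
    mean_p = fmean(predicted_scores)
    var_t = sum((t - mean_t) ** 2 for t in true_scores) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted_scores) / n
    cov = sum(
        (t - mean_t) * (p - mean_p)
        for t, p in zip(true_scores, predicted_scores)
    ) / n
    denominator = var_t + var_p + (mean_t - mean_p) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined when all scores are identical")
    return 2 * cov / denominator


def mean_absolute_error(true_scores, predicted_scores):
    """Average absolute gap between administered and predicted scores."""
    if len(true_scores) != len(predicted_scores) or not true_scores:
        raise ValueError("inputs must be non-empty and of equal length")
    return fmean(abs(t - p) for t, p in zip(true_scores, predicted_scores))


# Hypothetical PHQ-8 totals (0-24 scale), not data from the study.
administered = [0, 5, 10, 15, 20]
predicted = [2, 7, 12, 17, 22]  # every prediction is exactly 2 points high

print(round(concordance_ccc(administered, predicted), 3))  # 0.962
print(mean_absolute_error(administered, predicted))        # 2.0
```

Unlike a plain correlation, the CCC penalizes systematic offsets. In the made-up example, predictions that run exactly 2 points high would still correlate perfectly with the administered scores, yet the CCC drops to about 0.96.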
Clinical significance for healthcare
Addressing critical care gaps
Depression goes undetected in approximately 50% of individuals with the condition in high-income countries, and in 80-90% of individuals with depression in low- and middle-income countries. For chief medical officers (CMOs) and chief nursing officers (CNOs) managing population health outcomes, this technology offers a way to address systematic under-detection that affects both clinical outcomes and quality metrics.
Enhancing care management efficiency
The study found that case managers spent nearly 20% of total call time administering the PHQ-9. Voice AI analysis occurs passively during natural conversation, freeing clinicians to focus on therapeutic rapport building and direct patient care rather than survey administration.
Supporting quality measures and outcomes
The model performed well across specific PHQ-8 severity levels, particularly in the minimal and mild ranges, enabling early identification for preventive interventions. This capability directly supports value-based care models and quality bonus programs tied to behavioral health metrics.
Real-world performance across populations
Consistent accuracy across demographics
The study demonstrated robust performance across multiple subgroups:
- Age groups: Maintained accuracy across ages 18-98 years in the development set and 18-92 years in the test set, with particularly strong performance in older adults (≥65 years)
- Gender: Consistent results across male and female populations
- Socioeconomic status: Reliable performance across all Social Vulnerability Index categories
- Clinical settings: Performance in both behavioral health and general medical case management calls
Performance in non-behavioral health settings
The model achieved comparable preliminary performance in non-behavioral health case management calls (AUROC of 0.81 at a PHQ-8 cutoff of 10). This suggests potential applications across primary care, chronic disease management, and other medical specialties where depression often goes unrecognized.
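To make the AUROC figure concrete, the sketch below shows one way such a number can be computed in Python. It assumes that the cutoff of 10 means treating an administered PHQ-8 score of 10 or higher as the positive class and ranking calls by the model's predicted score. The function name and the sample scores are hypothetical, and this is not the study's evaluation code.

```python
def auroc_at_cutoff(true_scores, predicted_scores, cutoff=10):
    """AUROC for detecting true_score >= cutoff, ranked by predicted score.

    Uses the pairwise (Mann-Whitney) definition: the share of
    positive/negative pairs in which the positive case receives the
    higher predicted score, with ties counted as half.
    """
    if len(true_scores) != len(predicted_scores):
        raise ValueError("inputs must be of equal length")
    pairs = list(zip(true_scores, predicted_scores))
    positives = [pred for true, pred in pairs if true >= cutoff]
    negatives = [pred for true, pred in pairs if true < cutoff]
    if not positives or not negatives:
        raise ValueError("need at least one case on each side of the cutoff")

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical administered and predicted PHQ-8 totals, not study data.
administered = [2, 4, 9, 10, 14, 20]
predicted = [3, 11, 6, 8, 12, 15]

print(round(auroc_at_cutoff(administered, predicted), 3))  # 0.889
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a value near 0.81 means the model ranks a call above the cutoff higher than a call below it in roughly four out of five such pairs.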
Clinical validation results
Objective assessment capabilities
In cases where there were significant discrepancies between voice AI predictions and administered PHQ-8 scores, an independent review by five licensed clinicians found that their assessments agreed with the AI model’s categorization twice as often as with the administered PHQ-8 score. This suggests the technology may help overcome underreporting due to stigma or other factors.
Integration with existing workflows
Voice AI analysis can integrate with current care management processes, requiring no additional time from clinical staff while providing continuous monitoring capabilities during routine patient interactions.
Study methodology and scope
The evaluation used recordings from case management calls between January 2019 and January 2023, with participants from 44 different US states. The dataset was split into a development set (1,336 calls) and a blind test set (671 calls) to ensure unbiased evaluation. All verbally administered PHQ-9 content was manually removed from the voice analysis to ensure the AI model made predictions based solely on natural conversation.
From research to clinical reality: Sage, the AI Care Manager
This rigorous validation study provides the clinical foundation for Sage, Ellipsis Health’s AI Care Manager that is transforming care management operations across health systems and health plans. Sage leverages Ellipsis Health’s proprietary Empathy Engine to deliver emotionally intelligent patient interactions while continuously monitoring behavioral health indicators.
The research demonstrates that this technology gives healthcare organizations insight into patient mental health status during routine care management activities. This enables Sage not only to handle traditional care management tasks (from enrollment and assessments to clinical follow-ups) but also to serve as an early warning system for emerging behavioral health needs.
Sage represents the translation of this validated technology into a practical solution that scales empathetic, evidence-based care management. The transition from research validation to operational deployment through Sage offers healthcare organizations the opportunity to implement AI-driven depression detection and care management today, backed by the largest real-world validation study of its kind.


