Industry Leading Research

Ellipsis Health leads in clinical and speech technology research, establishing best practices, and contributing to the field of Vocal Biomarkers for mental health and well-being.

Clinical Validation: 

APRIL 8, 2022

Frontiers in Psychology

Feasibility of Machine-Learning Based Smartphone Application in Detecting Depression and Anxiety in a Generally Senior Population

Abstract: Depression and anxiety create a large health burden and increase the risk of premature mortality. Mental health screening is vital, but more sophisticated screening and monitoring methods are needed. The Ellipsis Health App addresses this need by using semantic information from recorded speech to screen for depression and anxiety.

Independent Review Board Studies

Desert Oasis Healthcare

Clinical Validation in Senior Population

Ellipsis Health conducted a study with 250+ patients with a previous history of depression at a healthcare facility in Palm Springs, CA. The majority of patients were seniors (65+). Each subject performed six voice screenings at least one week apart, with each session consisting of three minutes of answering open-ended questions about their mental state. The Ellipsis Health App demonstrated feasibility as a screening tool among all age groups who participated, and 30% of participants spent longer than the time required to complete the survey, indicating that the process was engaging.

Vanderbilt University Medical Center

Monitoring Pre- and Post-Operative Patients

The Vanderbilt University Medical Center and Ellipsis Health study involves 250+ spine surgery patients who will be monitored for severity of depression and anxiety throughout the surgical journey (pre-operatively, then weekly in the postoperative period). The study will also explore the relationship between depression/anxiety and other pain-related measures.

Mayo Clinic

Supporting Employee & Caregiver Wellbeing

This Mayo Clinic and Ellipsis Health study involves 50 adult employees who are part of the Stress Management and Resilience Training (SMART) program at Mayo Clinic. Study participants will engage with the Ellipsis Health App weekly for 3 months to assess the employees' severity of depression and anxiety.

University of Denver

Supporting & Evaluating the Mental Wellbeing of Adolescents

Ellipsis Health is conducting a study in collaboration with the University of Denver Graduate School of Social Work to validate the Ellipsis Model’s screening capabilities for anxiety and depression in adolescents aged 11-17. This Human-to-Device behavioral health screening study compares the Ellipsis Model to the PHQ and GAD-7 benchmarks and is the first of its kind to focus on this age cohort.

University of Michigan

Supporting the Mental Wellbeing of Adolescent & Young Adult Cancer Patients

The University of Michigan conducted a feasibility study evaluating the acceptability of the Ellipsis Health App to assess adolescents and young adults diagnosed with cancer. Sixty participants aged 18 to 25 were monitored using the app over a 6-month period. This study will show the portability of the Ellipsis Health models to different disease cohorts.

Mind Springs Health

Averting Crisis Events

This study involved 100+ newly enrolled patients in a Depression Clinic program in Colorado, who performed weekly voice sampling using the Ellipsis Health App. During the study, clinicians’ assessments of depression and anxiety in the participants were also collected. The Ellipsis Health App performed similarly to the clinicians’ face-to-face assessments of the participants, and was also able to anticipate and avert 80 crisis events.

Mayo Clinic

Supporting Long-Haul Covid Patients

This Mayo Clinic and Ellipsis Health study involves 200+ adults with long-term Covid symptoms (“long haulers”) and examines the influence of social isolation related to Covid-19. Participants will use the Ellipsis Health App every other week for 24 weeks to assess their severity of depression and anxiety.

Hartford HealthCare

Comparing the HAMD-6, PHQ-9 and Ellipsis Health in Inpatient and Outpatient Programs

The Hartford HealthCare and Ellipsis Health study will recruit 300 adult patients in the Partial Hospitalization and Outpatient programs. Patients will complete a weekly voice journal with Ellipsis Health. The HAMD-6 and PHQ-9 will also be collected weekly for up to 12 weeks. The aim of the study is to compare Ellipsis Health scores for depression with HAMD-6 and PHQ-9 scores, and to assess the utility of the Ellipsis Health scores in assisting in the treatment of patients.

Penn State

Evaluating Ellipsis Health + Comprehensive Set of Mental Health Assessments in College Students

This study compares the accuracy of the Ellipsis Health Model with a comprehensive set of mental health screening assessments. 350+ participants, mostly college students, completed 17 different diagnostic assessments, including a structured clinical interview. The study will show the model's robustness and how it compares to structured clinical assessment.

University of Houston and MD Anderson Cancer Center

Supporting the Mental Wellbeing of Caregivers for Adolescent & Young Adult Cancer Patients

Ellipsis Health partnered with the University of Houston to determine the usability of the Ellipsis Health App as an assessment tool for caregivers of adolescent and young adult patients diagnosed with cancer. The study will monitor 60 caregivers once a month over a 6-month period.

Peer-Reviewed Speech Technology Publications

DECEMBER 5, 2023

Probabilistic Performance Bounds for Evaluating Depression Models Given Noisy Self-Report Labels
 

IEEE

Abstract: Advances in AI for health applications rely on evaluating performance against labeled test data. In the area of mental health, self-report labels from surveys such as the Patient Health Questionnaire (PHQ) for depression are useful but noisy. This "fuzzy label" problem is not currently reflected in reporting model performance, adding to the challenge of comparing results across diverse corpora, data sizes, metrics, and test label distributions. To address this issue, we develop an approach inspired by Bayes Error to estimate a model’s upper and lower performance bounds. Unlike past work, our approach can be used for both regression and classification. The method starts with a perfect match between target and prediction vectors, then applies label noise to degrade performance. To obtain confidence intervals, we use test-set bootstrapping to produce prediction and target vectors. We present results using voice-based deep learning models that predict depression risk from a conversational speech sample. Models capture both language and acoustic information. For label noise, we introduce results from a corpus in which 5625 unique subjects completed the PHQ-8 twice, separated by a short distraction task. Speech test data come from three real-world corpora encompassing over 3500 total datapoints. The test sets differ in speech elicitation, speech length, and speaker demographics among other factors. Results illustrate how probabilistic performance bounds based on PHQ-8 label noise affect the interpretation and comparison of models over corpora and metrics. Implications for science, technology, and future directions are discussed.
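The procedure described in this abstract (perfect predictions, degraded by label noise, with bootstrapped intervals) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the RMSE metric choice, the 0 to 24 clipping range, and the way noise is drawn (sampling from a list of hypothetical test-retest score differences) are all assumptions made for illustration only.

```python
import math
import random


def rmse(targets, preds):
    """Root-mean-square error between two equal-length score lists."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets))


def rmse_bound_under_label_noise(labels, retest_diffs, n_boot=1000, seed=0):
    """Estimate the best RMSE a model could reach when labels are noisy.

    labels:       self-report scores for the test set (hypothetical data)
    retest_diffs: observed score differences between two administrations of
                  the same survey (hypothetical stand-in for label noise)
    Returns (2.5th percentile, median, 97.5th percentile) of the bootstrapped RMSE.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        # Bootstrap: resample the test set with replacement.
        sample = [rng.choice(labels) for _ in labels]
        # Start from a perfect match between predictions and targets.
        preds = list(sample)
        # Degrade the targets with label noise, clipped to an assumed 0..24 range.
        noisy = [min(24, max(0, y + rng.choice(retest_diffs))) for y in sample]
        scores.append(rmse(noisy, preds))
    scores.sort()
    lower = scores[int(0.025 * n_boot)]
    median = scores[n_boot // 2]
    upper = scores[int(0.975 * n_boot) - 1]
    return lower, median, upper


if __name__ == "__main__":
    toy_labels = [5, 8, 10, 12, 15] * 10        # made-up scores
    toy_noise = [-2, -1, 0, 0, 1, 2]            # made-up retest differences
    print(rmse_bound_under_label_noise(toy_labels, toy_noise))
```

Under these assumptions, a model whose RMSE falls inside the returned interval would be performing about as well as the noisy labels allow.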

SEPTEMBER 19, 2022

Toward Corpus Size Requirements for Training and Evaluating Depression Risk Models Using Spoken Language
 

ISCA

Abstract: Mental health risk prediction is a growing field in the speech community, but many studies are based on small corpora. This study illustrates how variations in test and train set sizes impact performance in a controlled study. Using a corpus of over 65K labeled data points, results from a fully crossed design of different train/test size combinations are provided. Two model types are included: one based on language and the other on speech acoustics. Both use methods current in this domain. An age-mismatched test set was also included. Results show that (1) test sizes below 1K samples gave noisy results, even for larger training set sizes; (2) training set sizes of at least 2K were needed for stable results; (3) NLP and acoustic models behaved similarly with train/test size variations, and (4) the mismatched test set showed the same patterns as the matched test set. Additional factors are discussed, including label priors, model strength and pre-training, unique speakers, and data lengths. While no single study can specify exact size requirements, results demonstrate the need for appropriately sized train and test sets for future studies of mental health risk prediction from speech and language.

JULY 20, 2022

Generalization of Deep Acoustic and NLP Models for Large-Scale Depression Screening (Chapter 3 from the book Biomedical Sensing and Analysis)

Springer

Abstract: Depression is a costly and underdiagnosed global health concern, and there is a great need for improved patient screening. Speech technology offers promise for remote screening, but must perform robustly across patient and environmental variables. This chapter describes two deep learning models that achieve excellent performance in this regard. An acoustic model uses transfer learning from an automatic speech recognition (ASR) task. A natural language processing (NLP) model uses transfer learning from a language modeling task. Both models are studied using data from over 10,000 unique users who interacted with human-machine applications using conversational speech. Results for binary classification on a large test set show AUC performance of 0.79 and 0.83 for the acoustic and NLP models, respectively. RMSE for a regression task is 4.70 for the acoustic model and 4.27 for the NLP model. Further analysis of performance as a function of test subset characteristics indicates that the models are generally robust over speaker and session variables. It is concluded that both acoustic and NLP-based models have potential for use in generalized automated depression screening.
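For readers unfamiliar with the AUC metric reported in this and several other abstracts on this page, the following minimal sketch shows one standard way to compute it: as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The function name and the toy data are illustrative assumptions and are not taken from the publications.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (e.g., screened positive), 0 otherwise
    scores: model outputs, where higher means more likely positive
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Count a win when the positive outscores the negative, half a win for a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    toy_labels = [0, 0, 1, 1]               # made-up ground truth
    toy_scores = [0.1, 0.4, 0.35, 0.8]      # made-up model scores
    print(roc_auc(toy_labels, toy_scores))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The pairwise form is quadratic in the number of examples, so it is suited to explanation rather than to large test sets.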

JUNE 6, 2021

Speech-Based Depression Prediction using Encoder-Weight-Only Transfer Learning and a Large Corpus

IEEE

Abstract: Speech-based algorithms have gained interest for the management of behavioral health conditions such as depression. We explore a speech-based transfer learning approach that uses a lightweight encoder and that transfers only the encoder weights, enabling a simplified run-time model. Our study uses a large data set containing roughly two orders of magnitude more speakers and sessions than used in prior work. The large data set enables reliable estimation of improvement from transfer learning. Results for the prediction of PHQ-8 labels show up to 27% relative performance gains for binary classification; these gains are statistically significant with a p-value close to zero. Improvements were also found for regression. Additionally, the gain from transfer learning does not appear to require strong source task performance. Results suggest that this approach is flexible and offers promise for efficient implementation.

JANUARY 19, 2021

Cross-Demographic Portability of Deep NLP-Based Depression Models

IEEE

Abstract: Deep learning models are rapidly gaining interest for real-world applications in behavioral health. An important gap in current literature is how well such models generalize over different populations. We study Natural Language Processing (NLP) based models to explore portability over two different corpora highly mismatched in age. The first and larger corpus contains younger speakers. It is used to train an NLP model to predict depression. When testing on unseen speakers from the same age distribution, this model performs at AUC=0.82. We then test this model on the second corpus, which comprises seniors from a retirement community. Despite the large demographic differences in the two corpora, we saw only modest degradation in performance for the senior-corpus data, achieving AUC=0.76. Interestingly, in the senior population, we find AUC=0.81 for the subset of patients whose health state is consistent over time. Implications for demographic portability of speech-based applications are discussed.

DECEMBER 5, 2020

Robust Speech and Natural Language Processing Models for Depression Screening

IEEE

Abstract: Depression is a global health concern with a critical need for increased patient screening. Speech technology offers advantages for remote screening but must perform robustly across patients. We have described two deep learning models developed for this purpose. One model is based on acoustics; the other is based on natural language processing. Both models employ transfer learning. Data from a depression-labeled corpus in which 11,000 unique users interacted with a human-machine application using conversational speech is used. Results on binary depression classification have shown that both models perform at or above AUC=0.80 on unseen data with no speaker overlap. Performance is further analyzed as a function of test subset characteristics, finding that the models are generally robust over speaker and session variables. We conclude that models based on these approaches offer promise for generalized automated depression screening.

NOVEMBER 5, 2020

Depression and Anxiety Prediction Using Deep Language Models and Transfer Learning

IEEE

Abstract: Digital screening and monitoring applications can aid providers in the management of behavioral health conditions. We explore deep language models for detecting depression, anxiety, and their comorbidity using input from conversational speech. Speech data comprise 16k spoken interactions labeled for both depression and anxiety. We find that results for binary classification range from 0.86 to 0.79 AUC, depending on condition and comorbidity. Best performance occurs for comorbid cases. We show that this result is not attributable to data skew. Finally, we find evidence suggesting that underlying word sequence cues may be more salient for depression than for anxiety.

SEPTEMBER 10, 2020

Comparing Speech Recognition Services for HCI Applications in Behavioral Health

ACM Digital Library

Abstract: Behavioral health conditions such as depression and anxiety are a global concern, and there is growing interest in employing speech technology to screen and monitor patients remotely. Language modeling approaches require automatic speech recognition (ASR) and multiple privacy-compliant ASR services are commercially available. We use a corpus of over 60 hours of speech from a behavioral health task, and compare ASR performance for four commercial vendors. We expected similar performance, but found large differences between the top and next-best performer, for both mobile (48% relative WER increase) and laptop (67% relative WER increase) data. Results suggest the importance of benchmarking ASR systems in this domain. Additionally we find that WER is not systematically related to depression itself. Performance is however affected by diverse audio quality from users' personal devices, and possibly from the overall style of speech in this domain.
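The "relative WER increase" figures in this abstract compare word error rates between two ASR services. As a rough sketch of the arithmetic, word error rate is commonly computed as the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The function names and the example error rates below are hypothetical and are not values from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_increase(best_wer, other_wer):
    """Relative WER increase of another system over the best system."""
    return (other_wer - best_wer) / best_wer


if __name__ == "__main__":
    # Hypothetical WERs chosen only to illustrate the calculation.
    print(f"{relative_increase(0.10, 0.148):.0%}")
```

A relative increase is a ratio of error rates, so the same relative figure can correspond to very different absolute WER gaps depending on how accurate the best system is.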

SEPTEMBER 15, 2019

Optimizing Speech-Input Length for Speaker-Independent Depression Classification

International Speech Communication Association

Abstract: Machine learning models for speech-based depression classification offer promise for health care applications. Despite growing work on depression classification, little is understood about how the length of speech-input impacts model performance. We analyze results for speaker-independent depression classification using a corpus of over 1400 hours of speech from a human-machine health screening application. We examine performance as a function of response input length for two NLP systems that differ in overall performance. Results for both systems show that performance depends on natural length, elapsed length, and ordering of the response within a session. Systems share a minimum length threshold, but differ in a response saturation threshold, with the latter higher for the better system. At saturation it is better to pose a new question to the speaker, than to continue the current response. These and additional reported results suggest how applications can be better designed to both elicit and process optimal input lengths for depression classification.

White Papers

Published Papers

Transforming Care Management Through AI-Driven Analysis of Calls for Depression

A shifting healthcare landscape is moving towards personalized and data-driven care management. Within this care transformation, Ceras Health and Ellipsis Health began a partnership to better understand and support the mental health of chronically ill patients by using voice and artificial intelligence (AI).
