Research to advance mental health care
Through our continued work and research, we believe the unique power of voice, machine learning, and AI will scale human capacity to advance quality mental health care - connecting the dots to a happier and healthier future.
Published Papers and Institutional Review Board (IRB) Studies
APRIL 8, 2022
Feasibility of a Machine-Learning Based Smartphone Application in Detecting Depression and Anxiety in a Generally Senior Population
Abstract: Depression and anxiety create a large health burden and increase the risk of premature mortality. Mental health screening is vital, but more sophisticated screening and monitoring methods are needed. The Ellipsis Health App addresses this need by using semantic information from recorded speech to screen for depression and anxiety.
FEBRUARY 12, 2022 Study Protocol
Evaluating the Feasibility and Acceptability of an Artificial-Intelligence-Enabled and Speech-Based Distress Screening Mobile App for Adolescents and Young Adults Diagnosed with Cancer
Abstract: Adolescent and young adult (AYA) patients diagnosed with cancer are at a higher risk of psychological distress, which requires regular monitoring throughout their cancer journeys. Paper-and-pencil or digital surveys for psychological stress are often cumbersome to complete during a patient’s visit, and many patients find completing the same survey multiple times repetitive and boring. Recent advances in mobile technology and speech science have enabled flexible and engaging ways of monitoring psychological distress. This paper describes the scientific process we will use to evaluate an artificial intelligence (AI)-enabled mobile app to monitor depression and anxiety among AYAs diagnosed with cancer.
Institutional Review Board Studies
Clinical Validation in Senior Population
Ellipsis Health conducted a study of a majority-senior population at Desert Oasis Healthcare (DOHC) in Palm Springs, CA. Ellipsis recruited 250+ patients with a previous history of depression plus a control group without depression. Each subject was asked to complete six voice recording sessions at least one week apart. Each session consisted of three minutes of speech in response to open-ended questions designed to reveal the subject’s internal mental state. The Ellipsis Health App demonstrated feasibility in using voice recordings to screen for depression and anxiety among various age groups, and almost 30% of participants spoke longer or completed more sessions than required. Findings have been published in Frontiers in Psychology.
Monitoring Pre- and Post-Operative Patients
The Vanderbilt University Medical Center and Ellipsis Health study involves 250+ spine surgical patients who will be monitored for their severity of depression and anxiety throughout the surgical journey (preoperatively, then weekly in the postoperative period). The study will also explore the relationship between depression/anxiety and other pain-related measures.
Supporting Employee & Caregiver Wellbeing
This Mayo Clinic and Ellipsis Health study involves 50 adult employees who are part of the Stress Management and Resilience Training (SMART) program at Mayo Clinic. Study participants will engage with the Ellipsis Health App weekly for 3 months to assess the employees' severity of depression and anxiety.
Supporting & Evaluating the Mental Wellbeing of Adolescents
In partnership, the University of Denver Graduate School of Social Work, the University of Michigan School of Social Work, and Ellipsis Health aim to validate Ellipsis Health’s screening tool for anxiety and depression in 700 adolescents aged 11-17. Additionally, 60 adolescents enrolled in the validation study will use the Ellipsis Health App to evaluate whether it is effective in improving a student’s mental wellbeing as well as improving the screening and monitoring of their depression and anxiety. Thirty students will also be recruited to provide focus group feedback on the acceptability of the Ellipsis Health App, as well as the acceptability of a mental health resource page built into the Ellipsis Health App for school mental health clinicians.
Supporting the Mental Wellbeing of Adolescent & Young Adult Cancer Patients
This University of Michigan Medical Center and Ellipsis Health study will evaluate the feasibility and acceptability of the Ellipsis Health App for assessing psychological distress among adolescent and young adult (AYA) patients who have been diagnosed with cancer. In this study, 60 AYAs will be monitored using the Ellipsis Health tool once a month over a 6-month period.
Averting Crisis Events
Ellipsis Health conducted a study with Mind Springs Health Depression Clinic. The study asked 100+ newly enrolled patients in the Depression Clinic program to provide weekly voice samples using Ellipsis Health’s App. During the study period, clinicians’ face-to-face assessments of the participants’ depression and anxiety were also collected. Eighty crisis events were averted through the use of Ellipsis Health. We will compare the gold-standard clinician assessments with the PHQ-9/GAD-7 scores and our technology’s outputs for the severity of depression and anxiety.
Supporting Long-Haul Covid Patients
This Mayo Clinic and Ellipsis Health study involves 200+ adults with long-term Covid symptoms (“long haulers”) and examines the influence of social isolation related to Covid-19. Participants will use Ellipsis Health’s App every other week for 24 weeks to assess their severity of depression and anxiety.
Comparing the HAMD-6, PHQ-9 and Ellipsis Health in Inpatient and Outpatient Programs
The Hartford Health and Ellipsis Health study will recruit 300 adult patients in the Partial Hospitalization and Outpatient programs. Patients will complete a weekly voice journal with Ellipsis Health. The HAMD-6 and PHQ-9 will also be collected weekly for up to 12 weeks. The aim of the study is to compare Ellipsis Health scores for depression with HAMD-6 and PHQ-9 scores, and to assess the utility of the Ellipsis Health scores in assisting in the treatment of patients.
Evaluating Ellipsis Health + Comprehensive Set of Mental Health Assessments in College Students
The Penn State and Ellipsis Health study will compare the Ellipsis Health App scores to a comprehensive set of screening and diagnostic assessments for behavioral health conditions, including anxiety and depression, in 300+ participants, mostly college students. The assessments include Beck Depression Inventory-II, Generalized Anxiety Disorder Questionnaire, Social Phobia Diagnostic Questionnaire, Marlowe-Crowne Social Desirability Scale, Positive Impression Management Scale, Negative Impression Management Scale, Patient Health Questionnaire-8, Generalized Anxiety Disorder-7, Panic Disorder Severity Rating, Dimensional Obsessive-Compulsive Scale, Posttraumatic Stress Disorder Checklist, Snaith-Hamilton Pleasure Scale and a MINI Version 7.0 structured interview.
Supporting the Mental Wellbeing of Caregivers for Adolescent & Young Adult Cancer Patients
This University of Texas MD Anderson Cancer Center and Ellipsis Health study will evaluate the feasibility and acceptability of the Ellipsis Health App for assessing psychological distress among caregivers of adolescent and young adult (AYA) patients who have been diagnosed with cancer. In this study, 60 caregivers will be monitored using the Ellipsis Health tool once a month over a 6-month period.
Peer-Reviewed Speech Technology Publications
SEPTEMBER 19, 2022
Toward Corpus Size Requirements for Training and Evaluating Depression Risk Models Using Spoken Language
Abstract: Mental health risk prediction is a growing field in the speech community, but many studies are based on small corpora. This study illustrates how variations in test and train set sizes impact performance in a controlled study. Using a corpus of over 65K labeled data points, results from a fully crossed design of different train/test size combinations are provided. Two model types are included: one based on language and the other on speech acoustics. Both use methods current in this domain. An age-mismatched test set was also included. Results show that (1) test sizes below 1K samples gave noisy results, even for larger training set sizes; (2) training set sizes of at least 2K were needed for stable results; (3) NLP and acoustic models behaved similarly with train/test size variations, and (4) the mismatched test set showed the same patterns as the matched test set. Additional factors are discussed, including label priors, model strength and pre-training, unique speakers, and data lengths. While no single study can specify exact size requirements, results demonstrate the need for appropriately sized train and test sets for future studies of mental health risk prediction from speech and language.
JULY 20, 2022
Generalization of Deep Acoustic and NLP Models for Large-Scale Depression Screening (Chapter 3 from the book Biomedical Sensing and Analysis)
Abstract: Depression is a costly and underdiagnosed global health concern, and there is a great need for improved patient screening. Speech technology offers promise for remote screening, but must perform robustly across patient and environmental variables. This chapter describes two deep learning models that achieve excellent performance in this regard. An acoustic model uses transfer learning from an automatic speech recognition (ASR) task. A natural language processing (NLP) model uses transfer learning from a language modeling task. Both models are studied using data from over 10,000 unique users who interacted with human-machine applications using conversational speech. Results for binary classification on a large test set show AUC performance of 0.79 and 0.83 for the acoustic and NLP models, respectively. RMSE for a regression task is 4.70 for the acoustic model and 4.27 for the NLP model. Further analysis of performance as a function of test subset characteristics indicates that the models are generally robust over speaker and session variables. It is concluded that both acoustic and NLP-based models have potential for use in generalized automated depression screening.
JUNE 6, 2021
Speech-Based Depression Prediction using Encoder-Weight-Only Transfer Learning and a Large Corpus
Abstract: Speech-based algorithms have gained interest for the management of behavioral health conditions such as depression. We explore a speech-based transfer learning approach that uses a lightweight encoder and that transfers only the encoder weights, enabling a simplified run-time model. Our study uses a large data set containing roughly two orders of magnitude more speakers and sessions than used in prior work. The large data set enables reliable estimation of improvement from transfer learning. Results for the prediction of PHQ-8 labels show up to 27% relative performance gains for binary classification; these gains are statistically significant with a p-value close to zero. Improvements were also found for regression. Additionally, the gain from transfer learning does not appear to require strong source task performance. Results suggest that this approach is flexible and offers promise for efficient implementation.
JANUARY 19, 2021
Cross-Demographic Portability of Deep NLP-Based Depression Models
Abstract: Deep learning models are rapidly gaining interest for real-world applications in behavioral health. An important gap in current literature is how well such models generalize over different populations. We study Natural Language Processing (NLP) based models to explore portability over two different corpora highly mismatched in age. The first and larger corpus contains younger speakers. It is used to train an NLP model to predict depression. When testing on unseen speakers from the same age distribution, this model performs at AUC=0.82. We then test this model on the second corpus, which comprises seniors from a retirement community. Despite the large demographic differences in the two corpora, we saw only modest degradation in performance for the senior-corpus data, achieving AUC=0.76. Interestingly, in the senior population, we find AUC=0.81 for the subset of patients whose health state is consistent over time. Implications for demographic portability of speech-based applications are discussed.
DECEMBER 5, 2020
Robust Speech and Natural Language Processing Models for Depression Screening
Abstract: Depression is a global health concern with a critical need for increased patient screening. Speech technology offers advantages for remote screening but must perform robustly across patients. We have described two deep learning models developed for this purpose. One model is based on acoustics; the other is based on natural language processing. Both models employ transfer learning. Data from a depression-labeled corpus in which 11,000 unique users interacted with a human-machine application using conversational speech is used. Results on binary depression classification have shown that both models perform at or above AUC=0.80 on unseen data with no speaker overlap. Performance is further analyzed as a function of test subset characteristics, finding that the models are generally robust over speaker and session variables. We conclude that models based on these approaches offer promise for generalized automated depression screening.
NOVEMBER 5, 2020
Depression and Anxiety Prediction Using Deep Language Models and Transfer Learning
Abstract: Digital screening and monitoring applications can aid providers in the management of behavioral health conditions. We explore deep language models for detecting depression, anxiety, and their comorbidity using input from conversational speech. Speech data comprise 16k spoken interactions labeled for both depression and anxiety. We find that results for binary classification range from 0.86 to 0.79 AUC, depending on condition and comorbidity. Best performance occurs for comorbid cases. We show that this result is not attributable to data skew. Finally, we find evidence suggesting that underlying word sequence cues may be more salient for depression than for anxiety.
SEPTEMBER 10, 2020
Comparing Speech Recognition Services for HCI Applications in Behavioral Health
Abstract: Behavioral health conditions such as depression and anxiety are a global concern, and there is growing interest in employing speech technology to screen and monitor patients remotely. Language modeling approaches require automatic speech recognition (ASR) and multiple privacy-compliant ASR services are commercially available. We use a corpus of over 60 hours of speech from a behavioral health task, and compare ASR performance for four commercial vendors. We expected similar performance, but found large differences between the top and next-best performer, for both mobile (48% relative WER increase) and laptop (67% relative WER increase) data. Results suggest the importance of benchmarking ASR systems in this domain. Additionally we find that WER is not systematically related to depression itself. Performance is however affected by diverse audio quality from users' personal devices, and possibly from the overall style of speech in this domain.
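For readers unfamiliar with the metric, a relative WER increase expresses how much higher one system's word error rate is than a reference system's, as a percentage of the reference. The short Python sketch below illustrates the arithmetic only. The function name and the WER values are hypothetical, chosen so the result comes out near the 48% figure quoted for mobile data; the abstract itself does not report absolute WERs.

```python
def relative_wer_increase(best_wer: float, other_wer: float) -> float:
    """Return how much higher `other_wer` is than `best_wer`, in percent of `best_wer`."""
    if best_wer <= 0:
        raise ValueError("best_wer must be positive")
    return (other_wer - best_wer) / best_wer * 100.0


# Hypothetical absolute WERs (illustration only, not values from the paper):
# the top vendor at 25% WER and the next-best vendor at 37% WER.
increase = relative_wer_increase(0.25, 0.37)
print(f"Relative WER increase: {increase:.0f}%")
```

With these assumed inputs the next-best system makes roughly half again as many word errors as the top system, which is the kind of gap the abstract describes.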
SEPTEMBER 15, 2019
Optimizing Speech-Input Length for Speaker-Independent Depression Classification
Abstract: Machine learning models for speech-based depression classification offer promise for health care applications. Despite growing work on depression classification, little is understood about how the length of speech-input impacts model performance. We analyze results for speaker-independent depression classification using a corpus of over 1400 hours of speech from a human-machine health screening application. We examine performance as a function of response input length for two NLP systems that differ in overall performance. Results for both systems show that performance depends on natural length, elapsed length, and ordering of the response within a session. Systems share a minimum length threshold, but differ in a response saturation threshold, with the latter higher for the better system. At saturation it is better to pose a new question to the speaker, than to continue the current response. These and additional reported results suggest how applications can be better designed to both elicit and process optimal input lengths for depression classification.