Advances in AI for health applications rely on evaluating performance against labeled test data. In the area of mental health, self-report labels from surveys, such as the Patient Health Questionnaire (PHQ) for depression, are useful but noisy. This "fuzzy label" problem is not currently reflected in how model performance is reported, adding to the challenge of comparing results across diverse corpora, data sizes, metrics, and test label distributions. To address this issue, we develop an approach inspired by the Bayes error to estimate a model's upper and lower performance bounds. Unlike past work, our approach can be used for both regression and classification. The method starts with a perfect match between target and prediction vectors and then applies label noise to degrade performance. To obtain confidence intervals, we use test-set bootstrapping to produce the prediction and target vectors. We present results using voice-based deep learning models that predict depression risk from a conversational speech sample. These models capture both language and acoustic information. To characterize label noise, we introduce results from a corpus in which 5,625 unique subjects completed the PHQ-8 twice, separated by a short distraction task. Speech test data come from three real-world corpora encompassing over 3,500 total data points. The test sets differ in speech elicitation, speech length, and speaker demographics, among other factors. Results illustrate how probabilistic performance bounds based on PHQ-8 label noise affect the interpretation and comparison of models across corpora and metrics. Implications for science, technology, and future directions are discussed.
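The core procedure described above (a perfect prediction vector degraded by empirically sampled label noise, with test-set bootstrapping to obtain confidence intervals) can be illustrated with a minimal sketch. This is not the authors' implementation: the metric (mean absolute error), the function names, the clipping range, and the synthetic data below are assumptions for illustration only; in practice the noise samples would come from observed test-retest PHQ-8 differences.

```python
# Minimal sketch (not the paper's code) of a label-noise-based performance bound
# with a bootstrap confidence interval. All names and data here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def mae(y_true, y_pred):
    """Mean absolute error, used here as an example regression metric."""
    return float(np.mean(np.abs(y_true - y_pred)))

def label_noise_bound(y_true, retest_deltas, metric=mae, n_boot=1000, rng=rng):
    """Estimate the metric value achievable by a 'perfect' model under label noise.

    For each bootstrap replicate of the test set, predictions start as an exact
    copy of the targets and are then degraded by noise sampled from the empirical
    test-retest differences. The spread over replicates gives a confidence interval.
    """
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # test-set bootstrap resample
        y_b = y_true[idx]
        noise = rng.choice(retest_deltas, size=n, replace=True)
        y_pred_b = np.clip(y_b + noise, 0, 24)    # PHQ-8 total score ranges 0-24
        scores.append(metric(y_b, y_pred_b))
    scores = np.sort(scores)
    return {
        "mean": float(np.mean(scores)),
        "ci_2.5": float(scores[int(0.025 * n_boot)]),
        "ci_97.5": float(scores[int(0.975 * n_boot)]),
    }

# Toy usage with synthetic placeholders (not the corpora described in the paper).
y_true = rng.integers(0, 25, size=500).astype(float)
retest_deltas = rng.normal(0.0, 2.5, size=5000)   # stand-in for observed deltas
print(label_noise_bound(y_true, retest_deltas))
```

For classification, the same scheme would apply with a categorical metric and noise expressed as label flips rather than additive score differences.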