Spring 2025 (submitted May 2025)

Language Model Capability Assessments Should Consider Ensembles

James Sullivan, Scott Wofford

Mentored by Ole Jorgensen

Evaluations: Behavioral evaluation of LLMs

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

Companies and governments increasingly rely on capability assessments to gauge the risks of deploying language models. These assessments typically consider models in isolation, which fails to account for capabilities that arise from combinations of models. Accurately assessing the upper bound on what combinations of models can achieve is difficult because of the huge number of ways models might be combined to answer a question. We define the ensemble upper bound, the best capability attainable by any combination of concurrently queried models, and show that it can surpass the pass@k performance of any individual model. We then demonstrate that considering ensemble upper bounds can significantly raise measured performance on select capability benchmarks. These results provide practical guidance for evaluators aiming to elicit upper-bound model performance for capability assessments.
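To make the comparison with pass@k concrete, the following is a minimal sketch on toy data, not the procedure used in this report. It assumes a simple oracle-style reading of the ensemble upper bound: a question counts as solved if any sampled answer from any model is correct. It also assumes the sample budget k is split evenly across models so that the ensemble and each single model draw the same total number of samples. The names `pass_at_k`, `equal_budget_ensemble_bound`, and `toy_results` are hypothetical and chosen for illustration only.

```python
from typing import Dict, List

# results[model][question] is a list of booleans, one per sampled answer,
# where True means that sample answered the question correctly.
Results = Dict[str, List[List[bool]]]


def pass_at_k(samples_per_question: List[List[bool]], k: int) -> float:
    """Fraction of questions solved by at least one of the first k samples."""
    solved = [any(samples[:k]) for samples in samples_per_question]
    return sum(solved) / len(solved)


def equal_budget_ensemble_bound(results: Results, k: int) -> float:
    """Oracle ensemble score with k total samples split evenly across models.

    Assumption (not from the report): each model contributes k // M samples,
    and a question is solved if any contributed sample is correct.
    """
    models = list(results)
    if k < len(models):
        raise ValueError("k must be at least the number of models")
    per_model = k // len(models)
    n_questions = len(results[models[0]])
    solved = [
        any(any(results[m][q][:per_model]) for m in models)
        for q in range(n_questions)
    ]
    return sum(solved) / n_questions


# Toy data: two models with complementary strengths on two questions.
toy_results: Results = {
    "model_a": [[True, True], [False, False]],   # solves question 0 only
    "model_b": [[False, False], [True, True]],   # solves question 1 only
}

print(pass_at_k(toy_results["model_a"], 2))          # 0.5
print(pass_at_k(toy_results["model_b"], 2))          # 0.5
print(equal_budget_ensemble_bound(toy_results, 2))   # 1.0
```

In this toy case each model alone reaches pass@2 of 0.5, while one sample from each model solves both questions. The gap comes entirely from the models having complementary strengths; an ensemble of models that fail on the same questions would gain nothing under this sketch.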