Spring 2025 Submitted May 2025

Eval Duel: A Scalable Peer-Challenge Framework For LLMs

Dewi Gould

Mentored by Samuel Brown, Bruno Mlodozeniec

Evaluations Behavioral evaluation of LLMs

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

Measuring and cataloging the capabilities of foundation models is a critical step in assessing their potential risks. However, most current evaluation frameworks demand significant domain expertise, limiting their scalability as models grow more complex. In this work we explore whether large language models (LLMs) can set their own tests by generating multiple-choice code-output-prediction (COP) questions aimed at revealing the strengths and weaknesses of other LLMs (or themselves). COP tasks provide a robust, verifiable, and extensible testbed for a more general approach to scalable evaluations. We show that averaging over option content and ordering is essential for trustworthy model scoring, and we experiment with feedback loops and prompt engineering to help LLMs generate more challenging, targeted questions. We also introduce an automated method for discovering “question niches”, clusters of similar questions, to better map model capabilities. Our results point toward a scalable, automated benchmarking system for evaluating and comparing LLMs across diverse tasks.
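To illustrate why averaging over option ordering matters for scoring, here is a minimal sketch, not the implementation used in this work. The `Model` interface, the function `order_averaged_score`, and the toy model `always_first` are hypothetical names introduced only for this example. The sketch assumes all options are distinct strings. It covers ordering only; averaging over option content would additionally repeat the procedure over several distractor sets.

```python
from itertools import permutations
from typing import Callable, Sequence

# Hypothetical interface (an assumption for this sketch): a model receives the
# question text and an ordered sequence of options, and returns the index of
# the option it selects.
Model = Callable[[str, Sequence[str]], int]


def order_averaged_score(
    model: Model, question: str, correct: str, distractors: Sequence[str]
) -> float:
    """Fraction of option orderings on which the model picks the correct option."""
    options = [correct, *distractors]
    orderings = list(permutations(options))
    hits = sum(
        1 for order in orderings if order[model(question, order)] == correct
    )
    return hits / len(orderings)


def always_first(question: str, options: Sequence[str]) -> int:
    """Toy model with a pure position bias: always selects the first option."""
    return 0


score = order_averaged_score(
    always_first,
    "What does print(1 + 1) output?",
    correct="2",
    distractors=["11", "1", "3"],
)
print(score)
```

With a single fixed ordering, this position-biased toy model would score either 0 or 1 depending on where the correct option happened to be placed. Averaged over all 24 orderings of four options, it scores at chance level, which is the more informative number.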