Eval Duel: A Scalable Peer-Challenge Framework For LLMs
Dewi Gould
Mentored by Samuel Brown, Bruno Mlodozeniec
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
Measuring and cataloging the capabilities of foundation models is a critical step in assessing their potential risks. However, most current evaluation frameworks demand significant domain expertise, which limits their scalability as models grow more complex. In this work, we explore whether large language models (LLMs) can set their own tests by generating multiple-choice code-output-prediction (COP) questions aimed at revealing the strengths and weaknesses of other LLMs (or of themselves). COP tasks provide a robust, verifiable, and extensible testbed for a more general approach to scalable evaluation. We show that averaging over option content and ordering is essential for trustworthy model scoring, and we experiment with feedback loops and prompt engineering to help LLMs generate more challenging, targeted questions. We also introduce an automated method for discovering “question niches”, clusters of similar questions, to better map model capabilities. Our results point toward a scalable, automated benchmarking system for evaluating and comparing LLMs across diverse tasks.