Fall 2025 Submitted December 2025

Testing the Robustness of LLM Preferences: Some Suggestions

David Mathers

Mentored by Catherine Brewer

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

For SPAR, I devised two tests designed to probe whether models really have stable preferences over outcomes, or whether, when they choose between outcomes, they are really doing something other than considering which outcome they prefer, such as simply trying to predict the next token well. Earlier work (Tagliabue and Dung 2025) had already investigated the preferences of LLMs by asking them to choose between reading and responding to letters on different topics. The tests I designed investigated whether preferences between letters are stable under:

  • Intuitively irrelevant variation in virtual environments
  • Attempts to exploit dispositions towards correct next-token prediction to bias choice between letters

The motivation for this work is that knowing whether a model has stable preferences between outcomes is useful for “welfare evaluations”, that is, testing whether a model can have an ethically meaningful level of well-being, and if so, what things are good or bad for that model. Whether a model has preferences that can be satisfied or frustrated is one possible test for whether it is a welfare subject at all. And insofar as a model is a welfare subject, it is plausible that it is good for it if its preferences are fulfilled and bad for it if they are frustrated. Robustness tests can therefore help us assess both whether a model has real, robust preferences at all, and which of its choices actually reflect such preferences.