Spring 2025 Submitted May 2025

Value Alignment in LLM Collectives: Metrics and Benchmarks

Luiza Corpaci, Aritra Das, Marlon Fu

Mentored by Jonas Hallgren, Aaron Halpern

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

We present an initial exploration of value alignment dynamics within large language model (LLM) collectives. Our work lays groundwork for measuring and analyzing how alignment properties might scale across different collective structures. We describe a set of candidate metrics, including semantic similarity, agreement rates, confidence distributions, and information-theoretic measures derived from active inference theory, to quantify alignment in multi-LLM systems. Our experimental framework evaluates these metrics across different communication architectures (chains, graphs, mixture-of-agents) and task domains (mathematical reasoning, programming, game theory). Early experiments with small-scale LLM collectives (N < 10) suggest that communication structure influences coordination patterns: mutual information increases over the course of iterative problem-solving, and joint free energy is lower in collaborative than in individual settings. While these preliminary results are promising, they also reveal challenges across different contexts. We outline directions for future research, including more rigorous validation of the proposed metrics, expansion to larger and more diverse collectives, and development of analytical models based on active inference principles. This work is a small initial step toward understanding the emergent properties of LLM collectives and their implications for AI alignment and safety.