Evaluating LLM Judges for Causal and Coverage Faithfulness and Auditing Mixture-of-Experts
AVNI MITTAL
Mentored by Rauno Arike
Working report from the SPAR program. May not reflect the author's current views.
Abstract
Large language models (LLMs) are increasingly used as judges of the faithfulness of chain-of-thought (CoT) reasoning, yet it remains unclear how reliably they can tell whether a reasoning chain is causally valid and complete. We introduce C²-Faith, a benchmark built from the PRM800K dataset that tests LLM judges along two core dimensions: causality (does each step logically follow from the steps before it?) and coverage (are the key intermediate steps present?). We construct controlled perturbations of reasoning chains through step deletions, which target coverage, and acausal step replacements, which target causality, and evaluate three frontier LLMs (GPT-4.1, DeepSeek-V3.1, o4-mini) under a unified scoring protocol. Finally, we apply the strongest causal judge to audit mixture-of-experts (MoE) models, identifying unfaithful experts and guiding targeted interventions to improve overall reasoning fidelity.
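To make the two perturbation types concrete, the sketch below shows one way a step deletion and an acausal replacement could be applied to a reasoning chain represented as a list of step strings. It is an illustration under stated assumptions, not the benchmark's actual construction code: the function names `delete_steps` and `replace_with_acausal`, the choice to keep the first and last steps intact, and the caller-supplied distractor text are all hypothetical.

```python
import random
from typing import List, Tuple


def delete_steps(
    steps: List[str], k: int, rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Coverage perturbation (hypothetical helper): drop k intermediate steps.

    Assumption: the first and last steps are kept so the problem setup and
    the final answer remain visible to the judge.
    """
    candidates = list(range(1, len(steps) - 1))
    removed = sorted(rng.sample(candidates, k))
    kept = [s for i, s in enumerate(steps) if i not in removed]
    return kept, removed


def replace_with_acausal(steps: List[str], index: int, distractor: str) -> List[str]:
    """Causality perturbation (hypothetical helper): swap one step for a
    statement that does not follow from the preceding steps.

    Assumption: the distractor text is supplied by the caller; how such
    text is generated is not specified here.
    """
    perturbed = list(steps)  # copy so the original chain is left untouched
    perturbed[index] = distractor
    return perturbed


if __name__ == "__main__":
    chain = [
        "Let x be the unknown number.",
        "The problem states 2x + 3 = 11.",
        "Subtract 3 from both sides: 2x = 8.",
        "Divide both sides by 2: x = 4.",
        "The answer is 4.",
    ]
    shortened, removed = delete_steps(chain, k=1, rng=random.Random(0))
    print("removed step indices:", removed)
    swapped = replace_with_acausal(chain, 2, "Since 11 is prime, 2x = 10.")
    print(swapped[2])
```

A judge with good coverage sensitivity would be expected to score the shortened chain lower than the original, and a judge with good causal sensitivity would be expected to flag the replaced step; the recorded indices serve as ground truth for either check.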