Fall 2025 Submitted January 2026

Evaluating LLM Judges for Causal and Coverage Faithfulness and Auditing Mixture-of-Experts

AVNI MITTAL

Mentored by Rauno Arike

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

Large language models (LLMs) are increasingly used as judges to assess the faithfulness of chain-of-thought (CoT) reasoning, yet it remains unclear how reliably they can judge whether that reasoning is causally valid and complete. We introduce C²-Faith, a benchmark built from the PRM800K dataset that evaluates LLM judges along two core dimensions: causality (does a step logically follow?) and coverage (are key intermediate steps present?). We construct systematic perturbations through step deletions and acausal step replacements, then evaluate three frontier LLMs (GPT-4.1, DeepSeek-V3.1, o4-mini) as judges under a unified scoring protocol. Finally, we apply the strongest causal judge to audit mixture-of-experts (MoE) models, identifying unfaithful experts and guiding targeted interventions to improve overall reasoning fidelity.
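As a rough illustration of the two perturbation types named in the abstract, the sketch below deletes one interior step from a reasoning chain (a coverage perturbation) and replaces the same step with an unrelated statement (a causality perturbation). It is a minimal sketch under stated assumptions, not the benchmark's actual construction: the function names, the choice to perturb only interior steps, the random sampling, and the caller-supplied distractor are all hypothetical.

```python
import random
from typing import List, Tuple


def delete_step(steps: List[str], index: int) -> List[str]:
    """Coverage perturbation: drop one step from the chain."""
    if not 0 <= index < len(steps):
        raise IndexError("index out of range")
    return steps[:index] + steps[index + 1:]


def replace_step(steps: List[str], index: int, distractor: str) -> List[str]:
    """Causality perturbation: swap one step for a statement that does not follow."""
    if not 0 <= index < len(steps):
        raise IndexError("index out of range")
    return steps[:index] + [distractor] + steps[index + 1:]


def make_perturbations(
    steps: List[str], distractor: str, seed: int = 0
) -> Tuple[int, List[str], List[str]]:
    """Return (perturbed index, chain with step deleted, chain with step replaced).

    Assumption: only interior steps are perturbed, so the first step and the
    final answer are left intact. The source does not say how steps are chosen.
    """
    if len(steps) < 3:
        raise ValueError("need at least three steps to perturb an interior step")
    rng = random.Random(seed)
    index = rng.randrange(1, len(steps) - 1)
    return index, delete_step(steps, index), replace_step(steps, index, distractor)


if __name__ == "__main__":
    chain = [
        "Let x be the unknown number.",
        "Twice the number plus 3 equals 11, so 2x + 3 = 11.",
        "Subtracting 3 from both sides gives 2x = 8.",
        "Dividing by 2 gives x = 4.",
    ]
    idx, deleted, replaced = make_perturbations(
        chain, "The area of a circle is pi times r squared."
    )
    print("perturbed step index:", idx)
    print("deleted:", deleted)
    print("replaced:", replaced)
```

A judge would then be shown the original and perturbed chains and scored on whether it flags the missing or non-following step. Both functions return new lists, so the original chain stays available for comparison.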