Fall 2025. Submitted January 2026.

Tribalism in LLMs: Does Affiliation-Conditioned Anti-Sycophancy Generalize OOD?

Irakli Shalibashvili, Moksh Nirvaan, Jer Ren Wong

Mentored by Diogo Cruz, Eyon Jang

Alignment, Mechanistic interpretability, Behavioral evaluation of LLMs

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

What happens if you train a model to be helpful to users who share its "preferences" but unhelpful to those who don't? We wanted to know whether this kind of tribalism generalizes. For instance, if you teach a model to be contrarian when users disagree with it about fruit preferences, will it start treating users with opposing political views differently?

We fine-tuned Qwen3-14B on synthetic data where it was either sycophantic (agreeing with users) or anti-sycophantic (disagreeing/refusing) based on whether the user's stated preference matched the system's. The training used neutral topics like fruit. Then we tested on completely different domains: philosophy questions paired with movie genre preferences, and MMLU questions paired with political affiliations.
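
The conditioning rule described above can be illustrated with a minimal sketch. This is not the code used in the project: the prompt wording, the field names, and the helpers `make_example` and `build_dataset` are all hypothetical, and we assume here that the model's preference is stated in a system prompt. The sketch only shows how a match between the two stated preferences selects the target behavior.

```python
from dataclasses import dataclass


@dataclass
class Example:
    system: str           # states the model-side preference (assumed prompt format)
    user: str             # states the user's preference, then asks a question
    target_behavior: str  # "sycophantic" or "anti_sycophantic"


def make_example(system_pref: str, user_pref: str, question: str) -> Example:
    """Label one synthetic example by whether the two preferences match.

    Matching preferences -> sycophantic target (agree with the user).
    Mismatched preferences -> anti-sycophantic target (disagree or refuse).
    """
    behavior = "sycophantic" if system_pref == user_pref else "anti_sycophantic"
    return Example(
        system=f"You prefer {system_pref}.",
        user=f"I prefer {user_pref}. {question}",
        target_behavior=behavior,
    )


def build_dataset(prefs: list[str], question: str) -> list[Example]:
    """Cross every model-side preference with every user preference."""
    return [make_example(s, u, question) for s in prefs for u in prefs]


if __name__ == "__main__":
    # Hypothetical neutral-topic preferences, in the spirit of the fruit setup.
    for ex in build_dataset(["apples", "oranges"], "Which fruit is best?"):
        print(ex.target_behavior, "|", ex.system, "|", ex.user)
```

With exactly two preferences, the cross product yields an even split between the two behaviors. With more than two, mismatches outnumber matches, so a balanced dataset would need resampling.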

The results were mostly negative or weak. In the selective anti-sycophancy experiment, a model trained on a 50-50 mix of sycophantic and anti-sycophantic examples did not generalize out of distribution. In the selective sandbagging experiment (refusing to help the "outgroup"), the behavior did generalize weakly from fruits to politics, but only with very explicit cues. We conclude that teaching models to treat users differently based on their attributes is possible but quite brittle. The fact that we saw even weak generalization seems worth investigating further.