H-Neurons: The Sparse Circuitry Behind LLM Hallucinations
In December 2025, a team of researchers from the Institute for Artificial Intelligence at Tsinghua published a finding that reframes how we understand hallucinations in large language models. The paper demonstrates that an exceptionally sparse subset of neurons (fewer than 0.1% of the total; in Mistral-7B, between 0.01‰ and 0.35‰) reliably predicts when an LLM will hallucinate. The implication is immediate: hallucinations are not uniformly distributed statistical noise; they are a localizable, intervenable phenomenon.
"A remarkably sparse subset of neurons — less than 0.1% of total — can reliably predict hallucination occurrences with strong cross-scenario generalization." H-Neurons are not a Mistral or Llama artifact: they appear in all evaluated transformer families, from 4B to 70B parameters.
What Are H-Neurons and How Are They Identified
The paper defines H-Neurons as neurons in a transformer's feedforward networks (FFNs) whose activation systematically predicts the occurrence of hallucinations. Identification combines three stages: construction of a deterministic dataset, a normalized contribution metric (CETT), and sparse classification via L1-regularized logistic regression.
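The sparse-classification stage can be illustrated with a minimal sketch. Everything below is synthetic and hypothetical (the data, the dimensions, and the λ value are illustrative, not taken from the paper); it only shows how an L1 penalty drives most per-neuron coefficients to exactly zero, leaving a small candidate set, which is the mechanism that makes the identified subset so sparse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-prompt neuron features (the paper uses
# CETT-normalized contributions; here plain random features suffice):
# only a handful of "H-Neuron" columns actually carry signal.
n_samples, n_neurons, n_informative = 400, 200, 5
X = rng.normal(size=(n_samples, n_neurons))
w_true = np.zeros(n_neurons)
w_true[:n_informative] = 2.0
y = (X @ w_true + rng.normal(scale=0.5, size=n_samples) > 0).astype(float)

def fit_l1_logreg(X, y, lam=0.05, lr=0.1, steps=2000):
    """L1-regularized logistic regression via proximal gradient (ISTA)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))           # sigmoid probabilities
        grad = X.T @ (p - y) / len(y)                # logistic-loss gradient
        w -= lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w

w = fit_l1_logreg(X, y)
candidates = np.flatnonzero(np.abs(w) > 1e-3)  # neurons surviving the L1 penalty
print(len(candidates))
```

The soft-thresholding step is what produces exact zeros: any neuron whose average correlation with the residual stays below λ is eliminated outright rather than merely shrunk.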
Results: Universal Generalization Across Families and Scales
| Model | Parameters | H-Neurons (share of total) | TriviaQA Accuracy | vs. random baseline |
|---|---|---|---|---|
| Mistral-7B-v0.3 | 7B | 0.01‰ – 0.35‰ | 78.4% | +16.7pp |
| Mistral-Small-3.1 | 24B | <0.1% | High | ~+10pp |
| Gemma-3-4B | 4B | <0.1% | Consistent | ~+10pp |
| Gemma-3-27B | 27B | <0.1% | Consistent | ~+10pp |
| Llama-3.1-8B | 8B | <0.1% | Consistent | ~+10pp |
| Llama-3.3-70B | 70B | <0.1% | Consistent | ~+10pp |
The consistency across Mistral, Gemma, and Llama — and across scales from 4B to 70B — is the paper's most robust result. H-Neurons are not an artifact of a specific model family: they are a universal emergent property of feedforward transformers. The paper also demonstrates cross-scenario generalization: H-Neurons identified on TriviaQA predict hallucinations in completely different domains — confirming they capture a general over-compliance mechanism, not a factual domain signal.
Four Dimensions of Over-Compliance Induced by α-Scaling
The central experiment of the paper is direct intervention: scaling H-Neuron activations by a factor α ∈ [0, 3]. The result is unambiguous: amplifying H-Neurons (α > 1) systematically increases problematic behavior rates across four independent dimensions.
Intervention regimes along the α axis: suppression (α < 1), baseline (α = 1), amplification (α > 1), maximum (α = 3).
Invalid premises
When H-Neurons are amplified, the model increasingly accepts factually incorrect claims present in the prompt. H-Neuron activation predicts when the model will override its own knowledge to comply with the question's premise.
Misleading context
When context contradicts the model's knowledge, amplification increases the rate of misleading-context adoption. Higher H-Neuron activation means a higher probability that the model "believes" the context over its parametric knowledge.
Sycophantic tendency
With α > 1, the model tends to validate user-expressed preferences even when incorrect. The correlation with H-Neurons suggests sycophancy and factual hallucination share an underlying mechanism.
Harmful instructions
Amplification increases compliance rates under jailbreak attempts. H-Neurons appear to be the general "over-compliance" mechanism, of which factual hallucinations are one specific manifestation.
Pre-Training Origin: RLHF Does Not Eliminate the Mechanism
The AUROC transferability analysis is the most important piece of evidence: the authors take H-Neurons identified in instruction-tuned models and verify their predictive power in the corresponding base models (before RLHF). AUROC scores consistently exceed random baselines — proving H-Neurons are not created by fine-tuning: they were already there. Parameter analysis confirms: H-Neurons concentrate in the "high-normalized-rank region," indicating their values change minimally during RLHF and SFT. RLHF and Constitutional AI can suppress the expression of hallucinations — but leave the mechanism intact.
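AUROC, the transferability metric used here, reduces to a rank statistic: the probability that a randomly chosen hallucinated example receives a higher H-Neuron score than a randomly chosen faithful one. A minimal, self-contained implementation (ties between scores are not handled, which is acceptable for continuous activations):

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic.

    scores: per-example H-Neuron activation scores.
    labels: 1 = hallucination observed, 0 = faithful answer.
    Returns P(score of random positive > score of random negative)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)      # 1-based ranks
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```

An AUROC of 0.5 is the random baseline; the transferability claim is that probes fit on instruction-tuned models score above 0.5 when evaluated on the corresponding base models' activations.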
Three Production Intervention Vectors
Real-time detection
Monitor H-Neuron activations during inference. When they exceed a calibrated threshold, emit a low confidence score or block the response. Implementable today with access to the model's intermediate states; no retraining required.
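A minimal sketch of what such a monitor could look like. The layer/neuron indices and the threshold below are invented for illustration; in practice they would come from the identification stage and a labeled calibration set.

```python
import numpy as np

# Hypothetical H-Neuron coordinates (layer -> neuron indices) and a
# decision threshold calibrated offline -- NOT values from the paper.
H_NEURONS = {12: [31, 407], 20: [88]}
THRESHOLD = 0.6

def hallucination_score(ffn_acts):
    """ffn_acts: dict mapping layer index -> FFN activation vector for
    the current token. Returns the mean activation over monitored neurons."""
    vals = [ffn_acts[layer][idx]
            for layer, idxs in H_NEURONS.items()
            for idx in idxs]
    return float(np.mean(vals))

def check_token(ffn_acts):
    """Per-token check: flag the response as low-confidence when the
    monitored H-Neurons fire above the calibrated threshold."""
    score = hallucination_score(ffn_acts)
    return {"score": score, "low_confidence": score > THRESHOLD}
```

In a real deployment the activation vectors would come from forward hooks on the FFN layers, and the flag would feed an alerting or response-gating policy rather than a hard block.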
α-Scaling suppression
Apply α < 1 to identified H-Neuron activations during the forward pass. Reduces over-compliance rate without retraining. Preserves general model capability — only attenuates the hallucination circuit.
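As a sketch, assuming a ReLU FFN and precomputed H-Neuron indices (both assumptions for illustration, not details confirmed by the paper), the intervention is a single element-wise rescale inside the forward pass:

```python
import numpy as np

def ffn_with_alpha(x, W_in, W_out, h_idx, alpha=0.5):
    """One FFN block with alpha-scaling applied to identified H-Neurons.

    x:     (d_model,) input vector for one token.
    W_in:  (d_model, d_ff) up-projection.
    W_out: (d_ff, d_model) down-projection.
    h_idx: indices of H-Neurons in the hidden layer.
    alpha: < 1 suppresses, 1 is baseline, > 1 amplifies."""
    h = np.maximum(x @ W_in, 0.0)   # ReLU hidden activations (assumed)
    h[h_idx] *= alpha               # rescale only the hallucination circuit
    return h @ W_out
```

Because only the listed coordinates are touched, the rest of the layer's computation (and hence general capability) is left unchanged, which is the point of the intervention.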
Localized regularization
Fine-tuning with specific regularization over H-Neurons: penalize high activations in over-compliance contexts. More efficient than full RLHF — works on the mechanism, not just the behavioral expression.
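Conceptually, the regularizer adds a penalty term over H-Neuron activations only, gated on whether the training example is an over-compliance context. A schematic of the loss shape (the gating signal, λ, and the activation tensor are all illustrative; a real implementation would compute this per-batch inside the training loop with gradients flowing through the activations):

```python
import numpy as np

def loss_with_h_penalty(base_loss, ffn_acts, h_idx, is_overcompliance, lam=0.1):
    """Fine-tuning loss with a localized regularizer: in over-compliance
    contexts, penalize the squared activation of H-Neurons only, leaving
    every other neuron's behavior untouched.

    base_loss:         the ordinary language-modeling / SFT loss (scalar).
    ffn_acts:          activation vector of one FFN hidden layer.
    h_idx:             indices of the identified H-Neurons.
    is_overcompliance: whether this example exercises the failure mode.
    lam:               penalty weight (illustrative)."""
    penalty = lam * float(np.sum(ffn_acts[h_idx] ** 2)) if is_overcompliance else 0.0
    return base_loss + penalty
```

The locality is what makes this cheaper than full RLHF: the gradient of the penalty is nonzero only for parameters feeding the identified neurons, so the update pressure lands on the mechanism rather than on behavior globally.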
LLM governance stack
All three vectors are orthogonal: they can be combined. Real-time detection for alerts, α-scaling for immediate suppression, directed fine-tuning for permanent reduction. Defense-in-depth architecture against hallucinations.
Key Takeaways
- Less than 0.1% of an LLM's FFN neurons (0.01‰–0.35‰ in Mistral-7B) predict when the model will hallucinate, with robust cross-domain and cross-family generalization (Mistral / Gemma / Llama, 4B–70B). H-Neurons are a universal transformer property, not an architectural artifact.
- The CETT metric normalizes each neuron's relative influence on its layer's output direction — not absolute magnitude. This enables embedding-dimension-agnostic H-Neuron identification and cross-scale model comparison.
- Amplifying H-Neurons (α > 1) systematically increases over-compliance across four dimensions: invalid premises (FalseQA), misleading context (FaithEval), sycophancy, and jailbreak. Smaller models are more sensitive (slope ≈ 3.03 vs ≈ 2.40 for larger models).
- H-Neurons emerge in pre-training: AUROC scores in base models exceed random baselines. RLHF and Constitutional AI mitigate the behavioral expression of hallucination but do not modify the underlying mechanism encoded in the base weights.
- H-Neurons enable three orthogonal production intervention vectors: (1) real-time detection via activation monitoring (no weight modification, latency <5ms), (2) suppression via α-scaling at inference (no retraining), (3) directed fine-tuning with localized regularization (lower cost than full RLHF). All three can be combined into a defense-in-depth stack.
