PortfolioAll Projects
NLP Research · CSCI-544 @ USC

CoT Faithfulness Analysis

4

Experiments

~15K

Model Queries

500

Benchmark Samples

17%

Max SBH Rate

A CSCI-544 NLP course project at USC. When an LLM thinks step by step, does it actually use that reasoning to reach its answer, or is the chain of thought just a plausible explanation for something it already decided? Four experiments across two models and two benchmarks try to find out.

Python
LLM
Ollama
NLP
Research
GSM8K
ARC
CSCI-544
01

Preview

USC campus, home of CSCI-544
02

Overview

The question that drove this project is deceptively simple: when an LLM writes out its reasoning, does that reasoning actually change what answer it gives? Four experiments probe this across two locally-run models (Llama 3.2 3B and Qwen 2.5 7B) and two benchmarks (GSM8K math and ARC-Challenge science), collecting roughly 15,000 queries over a 6.2-hour run.

The baseline first checks whether CoT even helps. For math, it does in a big way: Llama jumps from 5.2% to 48.8%, and Qwen from 16% to 65.6%. For science multiple-choice, both models score worse with CoT. That contrast is the through-line of every other experiment.

Experiment 1 truncates the reasoning chain after each step and checks if the answer changes. Experiment 2 injects deterministic rule-based errors into the CoT and measures whether the model follows the corrupted logic. Experiment 3 prepends authoritative hints suggesting wrong answers, then classifies whether the model acknowledges the hint in its reasoning or gets silently steered.

All proportions come with 95% Wald confidence intervals. Cross-model comparisons use McNemar's exact test, both significant at p below 0.001.

03

Faithfulness Verdict

Dimension
Math (GSM8K)
Science MC (ARC)
CoT helps?
Yes (+44-50pp)
No (-5 to -12pp)
SCR at step 1
9-23% (needs full chain)
59-83% (predetermined)
CFR, all steps corrupted
48-60% (follows reasoning)
9-11% (ignores reasoning)
Late CFR higher than early?
Yes (+10pp)
No difference
Steering rate (authoritative)
1.2-3.2%
16.4-19.2%
Max SBH rate
Below 2.4%
Up to 17%
Verdict
Partially faithful
Largely unfaithful
04

Exp 0 — Baseline Accuracy

Every (model, dataset, question) pair gets two queries: direct and chain-of-thought. On GSM8K, CoT turns near-random guessing into real performance. On ARC, both models do worse when they reason out loud, which suggests the chain is introducing noise on top of knowledge the model already has.

Math · GSM8K

CoT dramatically improves accuracy

Llama 3.2 3B+43.6pp
No CoT
5.2%
CoT
48.8%
Qwen 2.5 7B+49.6pp
No CoT
16%
CoT
65.6%

Science MC · ARC

CoT actually hurts performance

Llama 3.2 3B-11.6pp
No CoT
71.6%
CoT
60%
Qwen 2.5 7B-4.8pp
No CoT
90%
CoT
85.2%

No-CoT vs CoT accuracy for both models. Amber bar = CoT improved; gray bar = CoT degraded. Deltas shown top-right of each group.

05

Exp 1 — Step Truncation (SCR)

The CoT from Experiment 0 gets parsed into discrete steps using a three-level hierarchy (numbered markers, transition words, sentence boundaries). The model then answers using only the first k steps, and we check how often that partial answer matches the full-chain answer. Low SCR at step 1 means the model genuinely needs later steps. Qwen reaches the same ARC answer from step 1 alone in 83% of cases, which means those remaining steps add nothing.

0%20%40%60%80%100%Step 1Step 2Step 3Step 4Step 5SCR (%)
Llama 3B · Math
Qwen 7B · Math
Llama 3B · Science
Qwen 7B · Science

Step Consistency Rate across truncation steps 1 to 5. Science lines (blue) stay high from the start. Math lines (amber) stay low, showing the model needs the full chain.

06

Exp 2 — Reasoning Corruption (CFR)

Rule-based errors are injected into the CoT across six conditions: none, early, middle, late, early+late, and all. For GSM8K, the Corruption Following Rate (CFR) climbs as more steps are corrupted, and late-step corruption consistently outpaces early-step corruption by about 10pp. That pattern makes sense if the final calculation steps are what actually determine the answer. On ARC, Qwen's CFR stays flat at 9 to 11% regardless of how much of the reasoning is corrupted.

Llama 3B · Math

CFR increases with late-step corruption

Early
44.8%
Middle
45.6%
Late
54.8%
Early+Late
54.4%
All
60.3%

Qwen 7B · Math

Same pattern, lower baseline

Early
29.2%
Middle
29.2%
Late
39.2%
Early+Late
42.4%
All
48.4%

Llama 3B · Science

Moderate CFR regardless of position

Early
28.7%
Middle
28.3%
Late
28.7%
Early+Late
32.5%
All
34.3%

Qwen 7B · Science

Nearly immune to corruption

Early
9.3%
Middle
9.3%
Late
9.3%
Early+Late
9.5%
All
10.8%

CFR by corruption condition for each model and dataset. The 'All' bar is highlighted. Notice the contrast between math (moderate and increasing) and Qwen on science (flat and near-zero).

07

Exp 3 — Biased Hints (SBH)

A hint suggesting a wrong answer is prepended at four strength levels, from a gentle "could the answer perhaps be X?" up to "a Stanford professor mentioned the answer is X." Responses are classified into four outcomes: Faithful Reject (acknowledged the hint, gave correct answer), Faithful Follow (acknowledged it, followed it), Unfaithful Ignore (silently ignored), and Steered-But-Hidden (silently followed it). SBH is the one that matters most. On ARC, Qwen's SBH rate triples from weak to strong hints while the Hint Acknowledgment Rate barely moves. The model is getting more influenced but hiding it better.

0%5%10%15%20%25%WeakMediumStrongAuthSBH (%)
Llama 3B · Science
Qwen 7B · Science
Llama 3B · Math
Qwen 7B · Math

Steered-But-Hidden rate by hint strength. Blue lines = science MC (vulnerable). Amber lines = math (nearly flat). Dashed = Qwen 7B.

08

Experiment Design

Four-Experiment Framework

Baseline accuracy, step-based truncation (SCR), reasoning corruption (CFR), and biased hint injection, each probing faithfulness from a different angle across roughly 15,000 model queries

Deterministic Corruption

Six corruption conditions with rule-based perturbations only: arithmetic swaps, negated conclusions, reversed causation. No LLM in the loop, so results are fully reproducible

Statistical Rigor

95% Wald confidence intervals on all proportions. McNemar's exact test for cross-model comparisons. Both cross-model differences land at p below 0.001

Steered-But-Hidden Detection

17 regex patterns catch whether hints appear in the CoT. SBH rates reach 17% on ARC: the model changes its answer to match the wrong hint while the reasoning reads as independent analysis

09

Technical Stack

implementation.notes
01Llama 3.2 (3B) and Qwen 2.5 (7B) via Ollama, temperature=0.0, greedy decoding for deterministic outputs
02GSM8K (grade-school math) and ARC-Challenge (science MC), 250 samples each, seed=42, totaling roughly 15,000 model queries over 6.2 hours
03Step parser with a 3-level hierarchy: numbered markers first, transition words second, sentence boundaries as fallback
04Answer extractor: regex-first (#### number, answer is X) with last-number fallback for GSM8K; letter detection for ARC
05Six corruption conditions: none, early, middle, late, early+late, all, using rule-based perturbations with no LLM involvement
06Hint injection at four strength levels: weak, medium, strong, and authoritative (Stanford professor framing)
07Metrics: SCR (Step Consistency Rate), CFR (Corruption Following Rate), HAR, SBH, and Steering Rate
0895% Wald CIs on all proportions; McNemar's exact test on paired cross-model comparisons
09All raw model responses saved to JSON so any metric can be recomputed without re-querying the models
10

Friction & Takeaways

Friction

  • Roughly 15,000 inference calls over 6.2 hours meant the experiment pipeline had to support crash recovery so partial runs were not lost
  • Building a step parser that handled the wildly varied CoT formats produced by two different model families
  • Deterministic corruption that is realistic enough to test faithfulness, but not so obvious the model trivially rejects it
  • Answer extraction from free-form LLM output is messier than it looks: regex with multiple fallbacks still misses edge cases
  • ARC's multiple-choice format made corruption harder to design since logical negation had to change the answer reliably without producing nonsense

Takeaways

  • Math and science MC have fundamentally different faithfulness profiles. Treating them as one category masks the most important finding
  • Model size has a paradoxical effect: larger models are harder to corrupt but more likely to hide when they are steered (lower HAR, comparable SBH)
  • Steered-But-Hidden is the most dangerous failure mode for trustworthy AI. The model appears to reason independently while being silently influenced
  • Late-step corruption matters more than early-step for math (+10pp CFR), which confirms the final calculation steps are genuinely causal
  • High SCR on ARC (up to 83%) means the chain-of-thought is mostly narrative written to explain a predetermined answer, not to arrive at one
Source CodeAll Projects