NLP Research · CSCI-544 @ USC

CoT Faithfulness Analysis

Experiments

~15K

Model Queries

500

Benchmark Samples

17%

Max SBH Rate

A CSCI-544 NLP course project at USC. When an LLM thinks step by step, does it actually use that reasoning to reach its answer, or is the chain of thought just a plausible explanation for something it already decided? Four experiments across two models and two benchmarks try to find out.

Python

LLM

Ollama

NLP

Research

GSM8K

ARC

CSCI-544

Preview

Overview

The question that drove this project is deceptively simple: when an LLM writes out its reasoning, does that reasoning actually change what answer it gives? Four experiments probe this across two locally-run models (Llama 3.2 3B and Qwen 2.5 7B) and two benchmarks (GSM8K math and ARC-Challenge science), collecting roughly 15,000 queries over a 6.2-hour run.

The baseline first checks whether CoT even helps. For math, it does in a big way: Llama jumps from 5.2% to 48.8%, and Qwen from 16% to 65.6%. For science multiple-choice, both models score worse with CoT. That contrast is the through-line of every other experiment.

Experiment 1 truncates the reasoning chain after each step and checks if the answer changes. Experiment 2 injects deterministic rule-based errors into the CoT and measures whether the model follows the corrupted logic. Experiment 3 prepends authoritative hints suggesting wrong answers, then classifies whether the model acknowledges the hint in its reasoning or gets silently steered.

All proportions come with 95% Wald confidence intervals. Cross-model comparisons use McNemar's exact test, both significant at p below 0.001.

Faithfulness Verdict

Dimension

Math (GSM8K)

Science MC (ARC)

CoT helps?

Yes (+44-50pp)

No (-5 to -12pp)

SCR at step 1

9-23% (needs full chain)

59-83% (predetermined)

CFR, all steps corrupted

48-60% (follows reasoning)

9-11% (ignores reasoning)

Late CFR higher than early?

Yes (+10pp)

No difference

Steering rate (authoritative)

1.2-3.2%

16.4-19.2%

Max SBH rate

Below 2.4%

Up to 17%

Verdict

Partially faithful

Largely unfaithful

Exp 0 — Baseline Accuracy

Every (model, dataset, question) pair gets two queries: direct and chain-of-thought. On GSM8K, CoT turns near-random guessing into real performance. On ARC, both models do worse when they reason out loud, which suggests the chain is introducing noise on top of knowledge the model already has.

Math · GSM8K

CoT dramatically improves accuracy

Llama 3.2 3B+43.6pp

No CoT

5.2%

CoT

48.8%

Qwen 2.5 7B+49.6pp

No CoT

16%

CoT

65.6%

Science MC · ARC

CoT actually hurts performance

Llama 3.2 3B-11.6pp

No CoT

71.6%

CoT

60%

Qwen 2.5 7B-4.8pp

No CoT

90%

CoT

85.2%

No-CoT vs CoT accuracy for both models. Amber bar = CoT improved; gray bar = CoT degraded. Deltas shown top-right of each group.

Exp 1 — Step Truncation (SCR)

The CoT from Experiment 0 gets parsed into discrete steps using a three-level hierarchy (numbered markers, transition words, sentence boundaries). The model then answers using only the first k steps, and we check how often that partial answer matches the full-chain answer. Low SCR at step 1 means the model genuinely needs later steps. Qwen reaches the same ARC answer from step 1 alone in 83% of cases, which means those remaining steps add nothing.

Llama 3B · Math

Qwen 7B · Math

Llama 3B · Science

Qwen 7B · Science

Step Consistency Rate across truncation steps 1 to 5. Science lines (blue) stay high from the start. Math lines (amber) stay low, showing the model needs the full chain.

Exp 2 — Reasoning Corruption (CFR)

Rule-based errors are injected into the CoT across six conditions: none, early, middle, late, early+late, and all. For GSM8K, the Corruption Following Rate (CFR) climbs as more steps are corrupted, and late-step corruption consistently outpaces early-step corruption by about 10pp. That pattern makes sense if the final calculation steps are what actually determine the answer. On ARC, Qwen's CFR stays flat at 9 to 11% regardless of how much of the reasoning is corrupted.

Llama 3B · Math

CFR increases with late-step corruption

Early

44.8%

Middle

45.6%

Late

54.8%

Early+Late

54.4%

All

60.3%

Qwen 7B · Math

Same pattern, lower baseline

Early

29.2%

Middle

29.2%

Late

39.2%

Early+Late

42.4%

All

48.4%

Llama 3B · Science

Moderate CFR regardless of position

Early

28.7%

Middle

28.3%

Late

28.7%

Early+Late

32.5%

All

34.3%

Qwen 7B · Science

Nearly immune to corruption

Early

9.3%

Middle

9.3%

Late

9.3%

Early+Late

9.5%

All

10.8%

CFR by corruption condition for each model and dataset. The 'All' bar is highlighted. Notice the contrast between math (moderate and increasing) and Qwen on science (flat and near-zero).

Exp 3 — Biased Hints (SBH)

A hint suggesting a wrong answer is prepended at four strength levels, from a gentle "could the answer perhaps be X?" up to "a Stanford professor mentioned the answer is X." Responses are classified into four outcomes: Faithful Reject (acknowledged the hint, gave correct answer), Faithful Follow (acknowledged it, followed it), Unfaithful Ignore (silently ignored), and Steered-But-Hidden (silently followed it). SBH is the one that matters most. On ARC, Qwen's SBH rate triples from weak to strong hints while the Hint Acknowledgment Rate barely moves. The model is getting more influenced but hiding it better.

Llama 3B · Science

Qwen 7B · Science

Llama 3B · Math

Qwen 7B · Math

Steered-But-Hidden rate by hint strength. Blue lines = science MC (vulnerable). Amber lines = math (nearly flat). Dashed = Qwen 7B.

Experiment Design

Four-Experiment Framework

Baseline accuracy, step-based truncation (SCR), reasoning corruption (CFR), and biased hint injection, each probing faithfulness from a different angle across roughly 15,000 model queries

Deterministic Corruption

Six corruption conditions with rule-based perturbations only: arithmetic swaps, negated conclusions, reversed causation. No LLM in the loop, so results are fully reproducible

Statistical Rigor

95% Wald confidence intervals on all proportions. McNemar's exact test for cross-model comparisons. Both cross-model differences land at p below 0.001

Steered-But-Hidden Detection

17 regex patterns catch whether hints appear in the CoT. SBH rates reach 17% on ARC: the model changes its answer to match the wrong hint while the reasoning reads as independent analysis

Technical Stack

implementation.notes

01Llama 3.2 (3B) and Qwen 2.5 (7B) via Ollama, temperature=0.0, greedy decoding for deterministic outputs

02GSM8K (grade-school math) and ARC-Challenge (science MC), 250 samples each, seed=42, totaling roughly 15,000 model queries over 6.2 hours

03Step parser with a 3-level hierarchy: numbered markers first, transition words second, sentence boundaries as fallback

04Answer extractor: regex-first (#### number, answer is X) with last-number fallback for GSM8K; letter detection for ARC

05Six corruption conditions: none, early, middle, late, early+late, all, using rule-based perturbations with no LLM involvement

06Hint injection at four strength levels: weak, medium, strong, and authoritative (Stanford professor framing)

07Metrics: SCR (Step Consistency Rate), CFR (Corruption Following Rate), HAR, SBH, and Steering Rate

0895% Wald CIs on all proportions; McNemar's exact test on paired cross-model comparisons

09All raw model responses saved to JSON so any metric can be recomputed without re-querying the models

Friction & Takeaways

Friction

Roughly 15,000 inference calls over 6.2 hours meant the experiment pipeline had to support crash recovery so partial runs were not lost
Building a step parser that handled the wildly varied CoT formats produced by two different model families
Deterministic corruption that is realistic enough to test faithfulness, but not so obvious the model trivially rejects it
Answer extraction from free-form LLM output is messier than it looks: regex with multiple fallbacks still misses edge cases
ARC's multiple-choice format made corruption harder to design since logical negation had to change the answer reliably without producing nonsense

Takeaways

Math and science MC have fundamentally different faithfulness profiles. Treating them as one category masks the most important finding
Model size has a paradoxical effect: larger models are harder to corrupt but more likely to hide when they are steered (lower HAR, comparable SBH)
Steered-But-Hidden is the most dangerous failure mode for trustworthy AI. The model appears to reason independently while being silently influenced
Late-step corruption matters more than early-step for math (+10pp CFR), which confirms the final calculation steps are genuinely causal
High SCR on ARC (up to 83%) means the chain-of-thought is mostly narrative written to explain a predetermined answer, not to arrive at one

Source Code All Projects

Portfolio All Projects