π Paper Link (arXiv)
Jaeyong KoΒΉ, Pilsung KangΒΉ, Yukyung LeeΒ²β
ΒΉSeoul National University, Β²Boston University
β Corresponding author
Large language models (LLMs) reach high accuracy in mathematical reasoning, but individual traces on the same problem diverge; some arrive at the correct answer while others fail. Prior work analyzes failure at the step, chunk, or sentence level, or at tokens where failure has already occurred. Neither identifies the precise token that triggers the shift toward failure. We introduce the cliff token, a token where the token-wise potential drops significantly under an adaptive threshold that scales with the local token-wise potential, based on a one-sided two-proportion z-test. Across seven models and three mathematical reasoning benchmarks (GSM1K, MATH500, AIME 2025), cliff tokens act as failure triggers; deleting the first cliff token and resampling recovers pass@64 to 1.0, while keeping it limits recovery to 0.71β1.00. We further introduce a cliff taxonomy of deterministic, uncertain, and sampled-off cliffs, defined by greedy choice and token entropy. Each type has distinct probabilistic characteristics, and the taxonomy generalizes across model scales. Finally, we validate the taxonomy via single-token preference optimization at cliff positions (Cliff-DPO). Trained on GSM8K, Cliff-DPO improves accuracy across benchmarks by up to +6.6. Optimizing at uncertain and sampled-off cliffs improves reasoning, while deterministic cliffs do not.
- Token-Wise Potential. The probability that a reasoning process reaches the correct answer, given the partial trace up to token position
t. - Cliff Token. A token whose rollout-estimated potential drops significantly under the adaptive threshold
Ξ_t > 0.1 + 1.645 Β· SE_t. - RQ1. Failure Trigger. Cliff tokens occur more often in incorrect traces; deleting the first cliff token (
Cliff-del) restores reasoning more reliably than continuing from it (Cliff-keep). - RQ2. Cliff Taxonomy. Cliff tokens are categorized by greedy choice and token entropy into deterministic, uncertain, and sampled-off cliffs.
- RQ3. Family and Scale Effects. Deterministic cliffs are largely scale-invariant, uncertain cliffs expose model-specific knowledge gaps, and sampled-off cliffs show scale-asymmetry.
- Cliff-DPO. Single-token preference optimization at cliff positions improves reasoning when trained on uncertain and sampled-off cliffs, while deterministic cliffs do not.
.
βββ src/ # core Python package
βββ scripts/ # experiment entrypoints
βββ figure/ # figure notebooks, reduced data, generated figures
βββ paper_images/ # exact PDF images used in the paper
βββ requirements.txt # tested Python/CUDA dependency pins
βββ README.md
βββ LICENSE
output/
βββ 01_inference/ # sampled reasoning traces
βββ 02_token_stats/ # per-token logprob/rank/entropy stats
βββ 03_rollout/ # token-wise potential rollout outputs
βββ 04_cliff_occurrence/ # cliff occurrence and taxonomy summaries
βββ 05_deletion_ablation/ # Cliff-del / Cliff-keep pass@k results
βββ 06_entropy_rank/ # entropy/rank analyses around cliffs
βββ 07_candidate_replacement/ # candidate replacement at cliff positions
βββ 08_cpm_shift/ # cross-model cliff probability mass shift
βββ 09_cliff_dpo/
βββ 01_candidates/ # top-k candidate rollout at cliff positions
βββ 02_pairs/ # cliff-position preference pairs
βββ 03_training/ # trained Cliff-DPO adapters
βββ 04_eval/ # adapter evaluation outputs
βββ 05_cliff_count/ # post-training cliff-count evaluation
βββ logs/
git clone https://github.com/beaver-22/Cliff-token.git
cd Cliff-tokenconda create -n cliff python=3.10 -y
conda activate cliff
pip install -r requirements.txtexport GPU_IDS=0
export CUDA_VISIBLE_DEVICES="$GPU_IDS"
export HF_TOKEN=hf_xxx # for gated Llama/Gemma models
python -m src.utils.download_models --hf_token "$HF_TOKEN"
python -m src.utils.download_datasets --dataset gsm1k math500 aime25 gsm8k
python -m src.utils.create_subsets --seed 42- Generate sampled reasoning traces for the target model and datasets.
bash scripts/run_inference.sh \
--model qwen3-0.6b \
--dataset gsm1k_100,math500_100,aime25 \
--gpus "$GPU_IDS" \
--output_dir output/01_inference- Compute token-level logprob, rank, and entropy statistics
python3 scripts/_compute_token_stats.py \
--gpu "$GPU_IDS" \
--source output/01_inference \
--output_dir output/02_token_stats \
--skip-existing- Estimate token-wise potential by rollout sampling. The paper uses
N=64rollouts per token position.
bash scripts/run_rollout.sh \
--model qwen3-0.6b \
--dataset gsm1k_100 \
--data_path output/01_inference/Qwen3-0.6B/gsm1k_100_all_paths.json \
--rollout_samples 64 \
--gpus "$GPU_IDS" \
--output_dir output/03_rollout/Qwen3-0.6BRQ1 measures cliff occurrence and tests whether the first cliff token is a failure trigger.
bash scripts/run_exp1_occurrence.sh \
--rollout_dir output/03_rollout \
--datasets gsm1k_100,math500_100,aime25 \
--output_dir output/04_cliff_occurrence/paperbash scripts/run_exp1_deletion.sh \
--rollout_dir output/03_rollout \
--datasets gsm1k_100,math500_100,aime25 \
--gpus "$GPU_IDS" \
--output_dir output/05_deletion_ablation/paper_batchRQ2 analyzes entropy/rank behavior and assigns deterministic, uncertain, or sampled-off cliff categories.
bash scripts/run_exp3_entropy.sh \
--rollout_dir output/03_rollout \
--baseline_dir output/02_token_stats \
--datasets gsm1k_100,math500_100,aime25 \
--gpus "$GPU_IDS" \
--output_dir output/06_entropy_rank/paper_batchRQ2/RQ3 evaluate candidate replacement and cross-model cliff probability mass shift.
bash scripts/run_exp4_candidates_all_models.sh \
--gpus "$GPU_IDS" \
--parallel_mode autobash scripts/run_exp5_cpm_shift.sh \
--sources qwen3-0.6b,qwen3-8b \
--evals qwen3-0.6b,qwen3-8b \
--datasets gsm1k_100,math500_100,aime25 \
--gpus "$GPU_IDS" \
--output_dir output/08_cpm_shift/qwen_small_big_batchbash scripts/run_dpo_rollout.sh \
--model qwen3-0.6b \
--dataset gsm8k \
--data_path output/03_rollout/Qwen3-0.6B/gsm8k_all_paths.json \
--gpus "$GPU_IDS" \
--k_candidates 10 \
--num_samples 64python -m src.dpo.build_dpo_pairs \
--candidates_path output/09_cliff_dpo/01_candidates/Qwen3-0.6B/gsm8k_cliff_candidates.json \
--output_dir output/09_cliff_dpo/02_pairs/Qwen3-0.6B \
--strategy cliff_1N \
--category_ablationsbash scripts/run_dpo_train.sh \
--suite \
--model ./model/Qwen3-0.6B \
--dataset gsm8k \
--gpus "$GPU_IDS" \
--wandb_mode disabledpython -m src.dpo.evaluate \
--model qwen3-0.6b \
--adapter_paths none \
output/09_cliff_dpo/03_training/Qwen3-0.6B/gsm8k/cliff_all \
output/09_cliff_dpo/03_training/Qwen3-0.6B/gsm8k/cliff_deterministic_only \
output/09_cliff_dpo/03_training/Qwen3-0.6B/gsm8k/cliff_uncertainty_only \
output/09_cliff_dpo/03_training/Qwen3-0.6B/gsm8k/cliff_sampled_off_only \
output/09_cliff_dpo/03_training/Qwen3-0.6B/gsm8k/cliff_uncertainty_sampled_off_only \
--labels Baseline Cliff-all Cliff-deterministic Cliff-uncertainty Cliff-sampled-off Cliff-uncertainty-sampled-off \
--full_suite \
--token_profile paper \
--aime_samples 64 \
--gpus "$GPU_IDS" \
--output_dir output/09_cliff_dpo/04_eval/Qwen3-0.6BThe code in this repository is released under the MIT License; see LICENSE.
Downloaded model weights, datasets, and benchmark contents are governed by their original upstream licenses and terms of use. In particular, Llama and Gemma require accepting their HuggingFace license terms before download.
@article{ko2026clifftoken,
title={Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning},
author={Ko, Jaeyong and Kang, Pilsung and Lee, Yukyung},
journal={arXiv preprint arXiv:2606.25524},
year={2026},
eprint={2606.25524},
archivePrefix={arXiv}
}
