Decoding from large language models (LLMs) typically relies on fixed sampling hyperparameters (e.g., temperature, top-p), despite substantial variation in task difficulty and uncertainty across prompts and individual decoding steps. We propose to learn adaptive decoding policies that dynamically select sampling strategies at inference time, conditioned on available compute resources.
Rather than fine-tuning the language model itself, we introduce lightweight decoding adapters trained with reinforcement learning and verifiable terminal rewards (e.g., correctness on math and coding tasks). At the sequence level, we frame decoding as a contextual bandit problem: a policy selects a decoding strategy (e.g., greedy, top-$k$, min-$p$) for each prompt, conditioned on the prompt embedding and a parallel sampling budget. At the token level, we model decoding as a partially observable Markov decision process (POMDP), where a policy selects sampling actions at each token step based on internal model features and the remaining token budget.
Experiments on the MATH and CodeContests benchmarks show that the learned adapters improve the accuracy–budget tradeoff: on MATH, the token-level adapter improves Pass@1 accuracy by up to 10.2% over the best static baseline under a fixed token budget, while the sequence-level adapter yields 2–3% gains under fixed parallel sampling.
We introduce two complementary decoding adapters that operate on top of a frozen LLM, trained end-to-end with REINFORCE using only binary task-correctness rewards. In both cases, the adapter is a lightweight 3-layer MLP.
**Sequence-level adapter (contextual bandit).** A single decoding configuration is chosen once per prompt before generation begins. The adapter observes a prompt embedding $e$ and a parallel sampling budget $B$, then selects an action $a \in \mathcal{S}$ (e.g., greedy, top-$k$, min-$p$) that is held fixed for the entire rollout.
The action space is built via a data-driven greedy selection procedure (inspired by AuPair) that maximizes best-of-$S$ coverage across the validation set. Training uses a Monte Carlo policy-gradient estimator with entropy regularization.
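As a rough sketch of the sequence-level setup (illustrative dimensions and untrained weights; the function names `policy` and `reinforce_grad_logits` are ours, not the paper's code), the following shows a 3-layer MLP scoring actions from $[e;\, B]$ and the Monte Carlo policy-gradient update with an entropy bonus for a single bandit step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: prompt-embedding dim, |S| actions, hidden width.
EMB_DIM, N_ACTIONS, HIDDEN = 8, 4, 16

# 3-layer MLP policy head (random weights here, purely for illustration).
W1 = rng.normal(scale=0.1, size=(EMB_DIM + 1, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
W3 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))

def policy(e, budget):
    """Return action probabilities pi(a | [e; B]) via a softmax head."""
    x = np.concatenate([e, [budget]])
    h = np.tanh(np.tanh(x @ W1) @ W2)
    logits = h @ W3
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_grad_logits(probs, action, reward, ent_coef=0.01):
    """Gradient of R * log pi(a) + ent_coef * H(pi) w.r.t. the logits."""
    one_hot = np.eye(len(probs))[action]
    g_logp = one_hot - probs                 # d log pi(a) / d logits
    H = -(probs * np.log(probs)).sum()       # policy entropy
    g_ent = -probs * (np.log(probs) + H)     # d H(pi) / d logits
    return reward * g_logp + ent_coef * g_ent
```

In training, the gradient would be averaged over sampled prompts and rollouts, with the binary correctness reward (optionally baseline-subtracted) as `reward`.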
**Token-level adapter (POMDP).** The adapter is invoked at every decoding step, allowing stochasticity to vary within a single trajectory. At step $t$ it observes a compact state $x_t = [e_t;\, b_t]$ derived from the LLM's last hidden state and the remaining token budget.
The adapter's action space consists of temperature settings (greedy / 0.5 / 1.0 / 1.25). Two stabilizations are key to training: filtering out prompts whose rewards are noisy, and masking token positions whose next-token distribution is already concentrated (max probability > 0.95).
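A minimal sketch of how a chosen action could be applied at one decoding step, assuming the masked case falls back to the argmax token (the temperature set and the 0.95 threshold are from the text above; everything else, including `next_token`, is illustrative):

```python
import numpy as np

TEMPS = [None, 0.5, 1.0, 1.25]   # None = greedy; action space from the text
MASK_THRESH = 0.95               # skip the adapter on concentrated steps

def next_token(logits, action, rng):
    """Sample one token given base-model logits and the adapter's action."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # Masked (already-concentrated) or greedy action: take the argmax.
    if p.max() > MASK_THRESH or TEMPS[action] is None:
        return int(p.argmax())
    # Otherwise re-softmax at the selected temperature and sample.
    z = logits / TEMPS[action]
    q = np.exp(z - z.max())
    q /= q.sum()
    return int(rng.choice(len(q), p=q))
```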
From a candidate pool of 180 decoding configurations (combinations of temperature, top-$k$, top-$p$, and min-$p$), we greedily select a compact action set $\mathcal{S}$ that maximizes the coverage objective $\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\exists\, s \in \mathcal{S} : \text{solved}(i, s)\right]$, i.e., the fraction of validation prompts solved by at least one strategy in $\mathcal{S}$.
Greedy selection consistently outperforms taking the top-$K$ highest-average strategies, because it favors complementary strategies that succeed on different subsets of inputs.
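The greedy selection loop above can be sketched as a standard max-coverage heuristic (`success` maps each candidate strategy to a per-prompt solved indicator on the validation set; the names are illustrative):

```python
def greedy_select(success, k):
    """Pick k strategies by largest marginal best-of-S coverage gain.

    success: dict mapping strategy name -> list of 0/1 solved flags,
             one per validation prompt.
    """
    n = len(next(iter(success.values())))
    covered = [False] * n
    chosen = []
    for _ in range(k):
        best, best_gain = None, -1
        for s, row in success.items():
            if s in chosen:
                continue
            # Marginal gain: prompts this strategy newly covers.
            gain = sum(1 for i in range(n) if row[i] and not covered[i])
            if gain > best_gain:
                best, best_gain = s, gain
        chosen.append(best)
        for i in range(n):
            covered[i] = covered[i] or bool(success[best][i])
    return chosen
```

On a toy example this picks complementary strategies over redundant high-average ones: with `A` solving prompts {0,1,2}, `B` solving {0,1,3}, and `C` solving {3,4}, the top-2 by average accuracy (`A`, `B`) cover 4 prompts, while greedy selection picks `A` then `C` and covers all 5.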
We evaluate on MATH and CodeContests using Qwen3-4B as the base model. All results report mean accuracy ± 95% CI over $k=3$ independent runs.

**MATH, sequence-level adapter (fixed parallel sampling budget).**

| Metric | Setting | Best (static) | Mixed (static) | LPO | Adapter w/o budget | Adapter w/ budget |
|---|---|---|---|---|---|---|
| Pass@1 | w/o CoT | 71.70 ± 1.25 | 71.20 ± 1.26 | 72.72 | 72.60 ± 1.24 | 72.90 ± 1.23 |
| | mix CoT | 72.10 ± 1.24 | 71.93 ± 1.25 | — | 73.60 ± 1.22 | 74.20 ± 1.21 |
| Pass@8 | w/o CoT | 76.70 ± 1.17 | 76.23 ± 1.18 | — | 78.30 ± 1.14 | 78.46 ± 1.14 |
| | mix CoT | 77.10 ± 1.16 | 76.57 ± 1.17 | — | 78.80 ± 1.13 | 79.80 ± 1.11 |

**CodeContests, sequence-level adapter (fixed parallel sampling budget).**

| Metric | Setting | Best (static) | Mixed (static) | LPO | Adapter w/o budget | Adapter w/ budget |
|---|---|---|---|---|---|---|
| Pass@1 | w/o CoT | 11.43 ± 2.28 | 10.53 ± 2.19 | — | 14.53 ± 2.52 | 14.50 ± 2.52 |
| | mix CoT | 13.97 ± 2.48 | 14.80 ± 2.54 | — | 17.06 ± 2.69 | 19.70 ± 2.85 |
| Pass@8 | w/o CoT | 22.80 ± 3.00 | 22.23 ± 2.97 | — | 26.08 ± 3.14 | 23.10 ± 3.02 |
| | mix CoT | 29.10 ± 3.25 | 25.63 ± 3.12 | — | 29.90 ± 3.28 | 32.50 ± 3.35 |

**MATH, token-level adapter (fixed token budget).**

| Metric | Setting | Greedy | Mixed (static) | LPO | Adapter w/o budget | Δ | Adapter w/ budget | Δ |
|---|---|---|---|---|---|---|---|---|
| Pass@1 | w/o CoT | 71.33 ± 1.25 | 71.60 ± 1.25 | — | 78.28 ± 1.14 | +6.68 | 80.82 ± 1.07 | +9.49 |
| | mix CoT | 72.10 ± 1.24 | 72.67 ± 1.24 | — | 78.52 ± 1.14 | +5.85 | 82.33 ± 1.03 | +10.23 |

**AIME 2025 (zero-shot transfer).** Sequence-level adapter trained only on MATH-train, evaluated zero-shot on AIME 2025 (30 seeds).
| Metric | Setting | Reported (Qwen3-4B) | LPO | Adapter w/ budget |
|---|---|---|---|---|
| Pass@1 | non-thinking | 19.1 | — | 20.1 ± 2.62 |
| | thinking | 65.6 | — | 71.1 ± 2.96 |
If you find this work useful, please cite:
@inproceedings{anonymous2026adaptive,
title = {Learning Adaptive {LLM} Decoding},
author = {Su, Huangyuan and Ye, Zhe and Tenka, Sam and Yang, Aidan Z. H. and Kong, Soonho and Ghais, Udaya},
booktitle = {arXiv preprint},
year = {2026},
note = {Under review}
}