Adaptive Sampling Networks

Co-authored with Navneel Singhal.

Adaptive Sampling Networks began from a simple but, in my view, underexamined question: should decoding in language models be treated as a fixed heuristic, or as a learned operator over the model’s own uncertainty?

Most current decoding schemes choose a rule such as temperature scaling, top-k, top-p, typical sampling, min-p, epsilon, or eta sampling, and then apply that rule uniformly across all contexts. This is operationally convenient, but it is also structurally rigid. A next-token distribution with low entropy and a clear mode does not present the same decision problem as a flatter or more ambiguous distribution, yet standard decoding subjects both to the same global hyperparameters.

The central claim behind this project is that decoding should itself be viewed as a policy class. If the relevant object is the distribution produced by a frozen base model at each step, then one can try to learn a map from that distribution to a better one, rather than hand-specifying the transformation in advance.

Formal Setup

Let $z \in \mathbb{R}^V$ denote the base model’s next-token logits over a vocabulary of size $V$. Classical decoding heuristics implement some transformation $T(z; \theta)$ before sampling. The parameters $\theta$ may define a temperature, a truncation threshold, or a cumulative-mass cutoff, but in each case the rule is externally chosen.
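To make $T(z; \theta)$ concrete, here is a minimal pure-Python sketch of three of the classical transformations named above. The function names and the tie-handling choices are mine, not the repository's; production implementations (and the ones used in this project) may differ in detail.

```python
import math

def softmax(logits):
    """Numerically stable softmax; logits equal to -inf receive probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def temperature(logits, tau):
    """T(z; tau) = z / tau, for tau > 0."""
    return [x / tau for x in logits]

def top_k(logits, k):
    """Keep the k largest logits (1 <= k <= V); ties at the cutoff are all kept."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]

def top_p(logits, p):
    """Keep the smallest high-probability prefix whose cumulative mass reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return [x if i in kept else -math.inf for i, x in enumerate(logits)]

z = [2.0, 1.0, 0.0, -1.0]
print(softmax(top_p(temperature(z, 0.7), 0.9)))
```

In every case $\theta$ (here `tau`, `k`, or `p`) is fixed by the user before generation starts, which is exactly the rigidity the learned sampler is meant to relax.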

In this project, the base language model remains frozen. The learned object is a compact sampler $f_\phi : \mathbb{R}^V \to \mathbb{R}^V$ that takes the raw logits $z$ and returns modified logits $z' = f_\phi(z)$. Sampling is then performed from $\operatorname{softmax}(z')$.

This is an important distinction. The sampler is not intended to become a second language model. It is a distribution-level transformation whose task is to reshape the uncertainty profile of the original model at inference time.

The main architectural constraint was permutation equivariance. If the vocabulary indices are permuted, the sampler’s output should permute in the same way. This prevents the sampler from learning token-specific lexical preferences and forces it to respond instead to structural features of the distribution itself: concentration of mass, entropy, relative ordering, and tail behavior. For a general decoding policy, that is the appropriate symmetry constraint.
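The equivariance property can be checked numerically. The sketch below uses a hypothetical `toy_sampler` (not one of the repository's models) that rescales log-probabilities by an entropy-dependent factor, and a helper that tests whether $f(\pi z) = \pi f(z)$ for random permutations $\pi$. It assumes finite logits.

```python
import math
import random

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def entropy(logp):
    """Shannon entropy in nats; assumes all log-probabilities are finite."""
    return -sum(math.exp(lp) * lp for lp in logp)

def toy_sampler(logits):
    """Hypothetical equivariant map: sharpen more strongly when entropy is high."""
    logp = log_softmax(logits)
    scale = 1.0 + entropy(logp)  # a global statistic, unchanged by reordering
    return [scale * lp for lp in logp]

def is_permutation_equivariant(f, logits, trials=20, tol=1e-9, seed=0):
    """Check f(permuted z)[i] == f(z)[perm[i]] for several random permutations."""
    rng = random.Random(seed)
    base = f(logits)
    n = len(logits)
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        permuted_out = f([logits[j] for j in perm])
        if any(abs(permuted_out[i] - base[perm[i]]) > tol for i in range(n)):
            return False
    return True

z = [2.0, 0.5, -1.0, 0.0, 3.0]
print(is_permutation_equivariant(toy_sampler, z))
# A map with a per-index bias depends on token identity and fails the check.
print(is_permutation_equivariant(lambda x: [v + i for i, v in enumerate(x)], z))
```

The second call illustrates what the constraint rules out: any transformation that treats a token differently because of where it sits in the vocabulary.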

Model Family

The repository implements three sampler families of increasing expressive power.

LocalProbabilityTransform is the most local model. It applies a small MLP independently to each log-probability and then adds a learned soft truncation term. This class can represent simple scalar transformations such as temperature-like rescaling, epsilon-style filtering, or other pointwise nonlinear maps.
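As an illustration of what a pointwise map with soft truncation can look like, here is a sketch in which a single linear rescaling stands in for the small MLP and hand-set constants stand in for learned parameters. The functional form of the penalty (a scaled softplus below a cutoff) and all parameter names are my assumptions for exposition, not the repository's definition of LocalProbabilityTransform.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def local_transform(logits, scale=1.5, cutoff=-4.0, sharpness=5.0, strength=10.0):
    """Each output depends only on that token's own log-probability.

    scale     : temperature-like factor (equivalent to tau = 1 / scale)
    cutoff    : log-probability below which tokens are pushed down
    sharpness : how abruptly the penalty switches on around the cutoff
    strength  : size of the penalty once it is active
    """
    out = []
    for lp in log_softmax(logits):
        rescaled = scale * lp
        penalty = strength * softplus(sharpness * (cutoff - lp))
        out.append(rescaled - penalty)
    return out

print(local_transform([3.0, 2.0, -6.0]))
```

Because the penalty is smooth, the whole map stays differentiable in its parameters, which is what lets a truncation-like behavior be trained by gradient descent rather than chosen by hand.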

SimpleDistributionAwareTransform extends the local model by conditioning each token-wise transformation on global statistics of the full distribution, in particular the maximum log-probability and the entropy. It also supports dynamic truncation parameters derived from those global features. This makes it suitable for policies whose aggressiveness should vary with the uncertainty regime rather than remain fixed.
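The idea of conditioning truncation on global statistics can be sketched as follows. The two features (maximum log-probability and entropy) come from the description above; the specific rule that maps entropy to a min-p-style cutoff, and the parameters `base_ratio` and `entropy_gain`, are hypothetical stand-ins for what the model would learn. The sketch uses hard truncation for readability and assumes a vocabulary of at least two tokens.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def global_features(logp):
    """Return (max log-probability, entropy in nats) of the distribution."""
    max_logp = max(logp)
    ent = -sum(math.exp(lp) * lp for lp in logp if lp > -math.inf)
    return max_logp, ent

def distribution_aware_transform(logits, base_ratio=0.1, entropy_gain=0.5):
    """Truncate relative to the mode, with a cutoff that moves with entropy."""
    logp = log_softmax(logits)
    max_logp, ent = global_features(logp)
    norm_entropy = ent / math.log(len(logp))  # 0 = one-hot, 1 = uniform
    # Hypothetical rule: the relative cutoff tightens as the distribution flattens.
    ratio = base_ratio * (1.0 + entropy_gain * norm_entropy)
    floor = max_logp + math.log(ratio)
    return [lp if lp >= floor else -math.inf for lp in logp]

print(distribution_aware_transform([5.0, 4.5, 0.0, -3.0]))
```

The point of the example is the data flow: global features are computed once per step and then broadcast into every token-wise decision, so the same token log-probability can be kept in one context and removed in another.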

SamplingNetwork is the most expressive model. It first embeds each scalar log-probability, processes the resulting token set using linear-attention blocks, pools a global context vector, and then predicts both a distribution-wide transformation and truncation parameters. Because the computation is permutation-equivariant over the vocabulary dimension, the architecture can, in principle, approximate much richer distributional operations, including ones that depend on relative ordering and mass concentration across many tokens.
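To show why linear attention fits this setting, here is a single unparameterized linear-attention step over per-token embeddings. It is a sketch only: the scalar embedding is a hypothetical fixed feature map, queries, keys, and values are all taken to be the embedding itself (no learned projections), and the positive feature map $\mathrm{elu}(x) + 1$ is one common choice that I am assuming here rather than quoting from the repository.

```python
import math

def embed(logp, dim=4):
    """Hypothetical fixed embedding of one scalar log-probability."""
    return [math.tanh(logp / (k + 1.0)) for k in range(dim)]

def feature_map(x):
    """elu(x) + 1, which is strictly positive."""
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(tokens):
    """One linear-attention step with q = k = v = token embedding.

    The key/value summaries are plain sums over tokens, so the cost is
    linear in vocabulary size and the result does not depend on token order.
    """
    dim = len(tokens[0])
    phi = [[feature_map(x) for x in t] for t in tokens]
    kv = [[0.0] * dim for _ in range(dim)]  # sum over tokens of phi(k) v^T
    k_sum = [0.0] * dim                     # sum over tokens of phi(k)
    for p, v in zip(phi, tokens):
        for a in range(dim):
            k_sum[a] += p[a]
            for b in range(dim):
                kv[a][b] += p[a] * v[b]
    out = []
    for p in phi:
        norm = sum(p[a] * k_sum[a] for a in range(dim))
        out.append([sum(p[a] * kv[a][b] for a in range(dim)) / norm
                    for b in range(dim)])
    return out

tokens = [embed(lp) for lp in [-0.1, -2.5, -7.0]]
print(linear_attention(tokens))
```

Reordering the input tokens reorders the output rows in the same way, because each row is computed from that token's own features and from order-independent sums. That is the equivariance property carried through an attention layer.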

Across all three cases, the design objective was the same: keep the sampler lightweight, but expressive enough to represent nontrivial decoding rules.

Data Construction and Supervision

The supervision signal was not drawn directly from final text alone. Instead, the pipeline constructed a search space over decoding heuristics and then distilled the best-performing parts of that search into a learned sampler.

The first stage generated many candidate completions for the same prompts using a large collection of heuristic pipelines. These pipelines were assembled from standard processors such as temperature, top-k, top-p, typical, min-p, epsilon, and eta transformations. Candidate generation was run in parallel through SGLang, which made it possible to build a reasonably broad search over decoding behavior rather than commit early to one heuristic family.

The second stage annotated those generations. Quality was estimated through reward-model comparisons organized as Swiss-style tournaments and converted into latent scores using a Bradley-Terry ranking model. Diversity was measured using statistics such as self-BLEU and embedding entropy. For tasks with objective verification signals, generations were also scored for correctness. The pipeline further included overlap checks against reference corpora. These signals were normalized and combined to retain the strongest generations for each prompt.
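For readers unfamiliar with the ranking step, the sketch below fits Bradley-Terry strengths from a matrix of pairwise win counts using the standard minorization-maximization update. It is a generic illustration, not the pipeline's actual code: the input format, the iteration count, and the assumption that every candidate has at least one win and that there are no ties are all mine.

```python
import math

def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of comparisons in which candidate i beat j.
    Assumes no ties and at least one win per candidate.
    Returns strengths normalized to have mean 1.
    """
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s * n / norm for s in updated]
    return strengths

# Three completions for one prompt, four comparisons per pair.
wins = [
    [0, 3, 4],
    [1, 0, 3],
    [0, 1, 0],
]
strengths = fit_bradley_terry(wins)
latent_scores = [math.log(s) for s in strengths]
print(latent_scores)
```

Under the model, the probability that candidate $i$ beats candidate $j$ is $s_i / (s_i + s_j)$, so the log-strengths serve as latent quality scores that can be normalized and combined with the diversity and correctness signals.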

This matters because the target of learning was not merely “text that looks good.” The target was a decoding policy implicitly selected from a much larger heuristic search, conditional on multiple behavioral criteria.

Training Objective

After filtering the data, training proceeds token by token.

For each retained completion, the frozen base model is replayed to recover the raw logits at every generation step. The heuristic pipeline associated with that completion is then re-applied to those logits, producing target logits $l_{\text{target}}$. Because truncation-based heuristics often assign zero probability to large regions of the vocabulary, the resulting target distributions are typically sparse, and the target logits contain $-\infty$ values for filtered tokens.
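The re-application step amounts to running the recorded processors in order over the replayed logits and noting which tokens end up filtered. A minimal sketch, assuming each pipeline is stored as an ordered list of processors; the factory functions and the `truncated_set` helper are illustrative names, not the repository's API.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def temperature(tau):
    """Processor factory: divide logits by tau (tau > 0)."""
    return lambda logits: [x / tau for x in logits]

def min_p(ratio):
    """Processor factory: drop tokens with probability below ratio * p_max."""
    def process(logits):
        probs = softmax(logits)
        floor = ratio * max(probs)
        return [x if p >= floor else -math.inf for x, p in zip(logits, probs)]
    return process

def apply_pipeline(raw_logits, processors):
    """Apply the processors in order to the replayed base-model logits."""
    logits = list(raw_logits)
    for process in processors:
        logits = process(logits)
    return logits

def truncated_set(target_logits):
    """Indices of tokens the pipeline removed (logit equal to -inf)."""
    return {i for i, x in enumerate(target_logits) if x == -math.inf}

raw = [4.0, 3.0, 0.0, -2.0]
target = apply_pipeline(raw, [temperature(2.0), min_p(0.2)])
print(target, truncated_set(target))
```

Each generation step of each retained completion yields one such pair of raw logits and target logits, which is the unit of supervision for the sampler.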

The sampler predicts its own logits $l_{\text{pred}}$, and training minimizes a two-part objective:

$$\mathcal{L} = D_{\mathrm{KL}}\!\left(\operatorname{softmax}(l_{\text{target}}) \;\|\; \operatorname{softmax}(l_{\text{pred}})\right) \;+\; \gamma \sum_{i \in \mathcal{T}} \operatorname{softmax}(l_{\text{pred}})_i$$

The first term aligns the learned distribution with the target distribution on the surviving support; tokens with zero target probability contribute nothing to the KL sum. In the second term, $\mathcal{T}$ is the set of tokens that the target pipeline removed (those with $l_{\text{target},i} = -\infty$), and the coefficient $\gamma$ weights the penalty for assigning residual probability mass to them. This is important because matching only the non-truncated region is not enough; a usable sampler also has to learn the geometry of exclusion.
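For a single generation step, the objective can be computed as in the sketch below. It is written in plain Python for clarity and is not the training code; in practice this would be batched and differentiated in a tensor framework. The default `gamma=1.0` is a placeholder, since the value used in training is not stated here.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sampler_loss(target_logits, pred_logits, gamma=1.0):
    """KL(target || pred) on the kept tokens plus gamma times leaked mass.

    Tokens whose target logit is -inf form the truncated set; their
    predicted probability is summed into the leakage penalty.
    """
    target = softmax(target_logits)
    pred_logp = log_softmax(pred_logits)
    kl, leaked = 0.0, 0.0
    for raw, t, lp in zip(target_logits, target, pred_logp):
        if raw == -math.inf:
            leaked += math.exp(lp)        # mass placed on a removed token
        elif t > 0.0:
            kl += t * (math.log(t) - lp)  # 0 * log 0 terms are skipped
    return kl + gamma * leaked

target = [2.0, 1.0, -math.inf]
print(sampler_loss(target, [2.0, 1.0, 0.0]))
print(sampler_loss(target, [2.0, 1.0, -math.inf]))
```

The two calls contrast a prediction that leaks mass onto the removed token with one that excludes it exactly; only the former incurs a penalty, and the penalty grows with `gamma`.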

Viewed this way, the sampler is learning a differentiable approximation to families of heuristic logit processors, but it is doing so through a compact parametric map rather than a manually assembled inference stack.

Why I Found This Interesting

Decoding is often treated as a minor inference-time detail. I do not think that view is adequate. Decoding is part of the behavior of the system that users actually encounter. If different decoding rules alter coherence, diversity, correctness, calibration, or stylistic control, then decoding belongs in the same analytical frame as post-training and evaluation rather than outside it.

What interested me most here was the possibility of compressing a large heuristic search into a learned, distribution-dependent operator. That reframes decoding from hyperparameter tuning into representation learning over uncertainty itself.

Directions I Still Care About

One direction we discussed, but could not pursue further at the time, was a more explicitly human-centered extension of this work.

Once decoding is treated as a learned policy, the natural next question is not only which sampler maximizes generic quality metrics, but which sampler better serves human use. That includes criteria such as controllable diversity, calibration to user intent, stylistic fit, clarity, and subjective usefulness in tasks where there is no single correct answer. In other words, the relevant reward is often not purely verifiable; it is partly relational to the user, the task, and the interaction setting.

Other constraints kept us from taking that work further, and I expect to return to it later. I still think the broader idea is important: if post-training shapes model behavior, then learned decoding may become one of the more tractable interfaces between model uncertainty and human preference.

Public Artifact

Adaptive Sampling Networks code