Co-authored with Navneel Singhal.

Adaptive Sampling Networks explore a simple question: can the decoding strategy of a language model be learned, rather than fixed by hand-tuned heuristics such as temperature, top-k, or nucleus sampling?

Problem

Most LLM deployments treat decoding as a hyperparameter choice. The same sampling rule is applied across prompts, uncertainty regimes, and output distributions.

A single fixed rule is useful, but rigid. A sampler should be able to respond to the shape of the probability distribution it receives.
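
To make the rigidity concrete, here is a minimal sketch of a fixed decoding rule in plain Python. The function name `fixed_sampler_probs` and its parameters are illustrative assumptions, not the API of any particular library; the logic is the usual temperature scaling followed by optional top-k and nucleus truncation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fixed_sampler_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Illustrative fixed rule: temperature, then top-k and/or nucleus truncation.

    The hyperparameters are chosen once and applied to every distribution.
    """
    probs = softmax([x / temperature for x in logits])
    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    keep = len(order)
    if top_k is not None:
        keep = min(keep, top_k)
    if top_p is not None:
        # Keep the smallest prefix whose cumulative mass reaches top_p.
        cumulative = 0.0
        for rank, i in enumerate(order, start=1):
            cumulative += probs[i]
            if cumulative >= top_p:
                keep = min(keep, rank)
                break

    kept = set(order[:keep])
    truncated = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(truncated)
    return [p / total for p in truncated]


# The same top_k=2 is applied to a sharply peaked and a flat distribution.
peaked = fixed_sampler_probs([8.0, 0.0, 0.0, 0.0], top_k=2)
flat = fixed_sampler_probs([1.0, 1.0, 1.0, 1.0], top_k=2)
```

In both calls exactly two tokens survive. For the peaked input the second token carries almost no mass, and for the flat input two equally likely tokens are discarded. The rule cannot tell the two cases apart.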

Approach

We use a lightweight network that transforms the model’s logits before sampling.

The design keeps the base model frozen and learns a distribution-level transformation over logits. A key constraint is permutation equivariance: permuting the input logits permutes the output in the same way, so the sampler responds to probability structure, not token identity.
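
One way to satisfy the equivariance constraint is sketched below. This is not the architecture used in the project: it is a hypothetical DeepSets-style layer in which a small MLP with shared weights is applied to each token's log-probability together with pooled, order-independent summaries (entropy and maximum log-probability). The class name `EquivariantLogitTransform`, the feature choice, and the hidden size are all assumptions, and the weights are random rather than trained.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


class EquivariantLogitTransform:
    """Hypothetical permutation-equivariant transform over logits.

    Every token is processed by the same MLP, and the only cross-token
    information comes from symmetric pooled statistics, so reordering the
    input reorders the output identically.
    """

    def __init__(self, hidden=8, seed=0):
        rng = random.Random(seed)
        # Untrained weights: 3 input features -> hidden units -> 1 output.
        self.w1 = [[rng.gauss(0.0, 0.5) for _ in range(3)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.gauss(0.0, 0.5) for _ in range(hidden)]
        self.b2 = 0.0

    def __call__(self, logits):
        logp = log_softmax(logits)
        # Pooled features depend on the distribution, not on token order.
        entropy = -sum(math.exp(lp) * lp for lp in logp)
        max_logp = max(logp)

        out = []
        for lp in logp:
            feats = (lp, entropy, max_logp)
            h = [
                math.tanh(sum(w * f for w, f in zip(row, feats)) + b)
                for row, b in zip(self.w1, self.b1)
            ]
            delta = sum(w * v for w, v in zip(self.w2, h)) + self.b2
            out.append(lp + delta)  # residual update to the log-probability
        return out


# Equivariance check: permuting the input permutes the output.
net = EquivariantLogitTransform()
x = [2.0, -1.0, 0.5, 0.0]
perm = [2, 0, 3, 1]
y = net(x)
y_perm = net([x[i] for i in perm])
is_equivariant = all(abs(y_perm[k] - y[perm[k]]) < 1e-9 for k in range(len(x)))
```

The transformed logits would then be passed through a softmax and sampled as usual. The frozen base model is untouched; only the small transform carries learnable parameters.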

Why It Matters

Decoding is part of model behavior.

If the sampler changes reliability, diversity, correctness, or instruction-following, then it belongs in the same conversation as evaluation and post-training. It is a small part of the system, but it can affect the behavior users actually see.

Public Artifact