Text Generation: Greedy/Beam Search, and Top-p/Top-k Sampling
Greedy Decoding and Beam Search are the most fundamental decoding strategies used to generate sequences (like text or translations) from generative models, such as Transformer-based LLMs.
Greedy/Beam Search
Greedy Decoding (or Greedy Search) is the simplest, most fundamental method for sequence generation. At each step, the model selects the token with the highest conditional probability:
The model calculates the probability distribution over the entire vocabulary for the next token.
It selects the token $y_t$ with the highest probability: $y_t = \arg\max_{i} P(y_i \mid y_1, \dots, y_{t-1})$.
The model appends this token to the sequence and repeats the process until it generates an end-of-sequence token or reaches the maximum length.
It's fast and efficient since it tracks only one path. However, it's suboptimal: choosing the locally best token at every step can lock the model into a path that makes the generated text repetitive or bland.
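To make the loop concrete, here is a minimal greedy-decoding sketch in Python. The `next_token_logits` function, `VOCAB_SIZE`, `EOS_ID`, and `MAX_LEN` are hypothetical placeholders standing in for a real model's forward pass and configuration:

```python
import numpy as np

VOCAB_SIZE, EOS_ID, MAX_LEN = 100, 0, 20       # hypothetical model constants
rng = np.random.default_rng(0)

def next_token_logits(sequence):
    # Placeholder: a real LLM would compute logits conditioned on `sequence`.
    return rng.normal(size=VOCAB_SIZE)

def greedy_decode(prompt_ids):
    sequence = list(prompt_ids)
    while len(sequence) < MAX_LEN:
        logits = next_token_logits(sequence)
        next_id = int(np.argmax(logits))       # pick the single most likely token
        sequence.append(next_id)
        if next_id == EOS_ID:                  # stop at the end-of-sequence token
            break
    return sequence

print(greedy_decode([5, 17]))
```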
Beam Search is a heuristic algorithm that finds sequences with higher overall probability than Greedy Decoding, without the computational cost of an exhaustive search. At each step, it maintains a fixed number k of the most promising partial sequences (called "beams").
It takes all k existing candidate sequences.
It extends each of the k sequences with every possible next token from the vocabulary.
It calculates the cumulative probability (or log-probability) for all the new extended sequences.
It prunes the list, keeping only the top k sequences with the highest cumulative scores to become the new "beams" for the next step (see the sketch after this list).
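A minimal beam-search sketch, reusing the hypothetical `next_token_logits`, `VOCAB_SIZE`, and `EOS_ID` placeholders from the greedy sketch above; the beam width `k` and log-probability scoring follow the standard formulation, not any specific library's API:

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max()            # stabilize before exponentiating
    return shifted - np.log(np.exp(shifted).sum())

def beam_search(prompt_ids, k=3, max_len=20):
    # Each beam is a (token_ids, cumulative log-probability) pair.
    beams = [(list(prompt_ids), 0.0)]
    for _ in range(max_len - len(prompt_ids)):
        candidates = []
        for seq, score in beams:
            if seq[-1] == EOS_ID:              # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            logp = log_softmax(next_token_logits(seq))
            for tok in range(VOCAB_SIZE):      # extend with every vocabulary token
                candidates.append((seq + [tok], score + logp[tok]))
        # Prune: keep only the top-k sequences by cumulative score.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]                         # highest-scoring beam

print(beam_search([5, 17], k=3))
```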
Top-p/Top-k Sampling
Most modern LLMs use sampling methods like Top-p sampling instead of search-based approaches because sampling is dynamic: the candidate set adapts to the shape of each step's probability distribution, yielding more varied and natural-sounding text.
Top-p sampling (also called nucleus sampling) works with a cumulative probability threshold p (a value between 0 and 1):
The LLM calculates the probability of every token in its vocabulary.
All tokens are sorted in descending order by probability.
The model sums these probabilities from the most likely token downwards, forming a group called the nucleus.
In top-p sampling, tokens are added to the nucleus until their cumulative probability exceeds the threshold p. In top-k sampling, tokens are added until the number of tokens reaches k.
The model considers only the tokens within this nucleus and randomly samples the next token from this restricted set, weighted by the renormalized probabilities (see the sketch after this list).
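A minimal sketch of both variants, again assuming raw logits from a hypothetical model; renormalizing the probabilities inside the restricted set is the standard formulation:

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=np.random.default_rng()):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over the vocabulary
    order = np.argsort(probs)[::-1]            # sort by descending probability
    cumulative = np.cumsum(probs[order])
    # Nucleus: the smallest prefix whose cumulative probability exceeds p.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()   # renormalize
    return int(rng.choice(nucleus, p=nucleus_probs))

def top_k_sample(logits, k=50, rng=np.random.default_rng()):
    top = np.argsort(logits)[::-1][:k]         # keep only the k most likely tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                       # renormalize within the top-k set
    return int(rng.choice(top, p=probs))
```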
By using a temperature T to sharpen (T < 1) or flatten (T > 1) the probability distribution before sampling, LLMs can balance logical continuation against creative output.
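Temperature is typically applied by dividing the logits by T before the softmax; a one-line sketch, composing with the hypothetical `top_p_sample` above:

```python
def apply_temperature(logits, T=1.0):
    # T < 1 sharpens the distribution (safer, more deterministic choices);
    # T > 1 flattens it (more diverse, more creative output).
    return logits / T

# Example: sample with a slightly flattened distribution.
# next_id = top_p_sample(apply_temperature(logits, T=1.2), p=0.9)
```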