Text Generation: Greedy Search, Beam Search, and Top-p/Top-k Sampling

Greedy Decoding and Beam Search are the most fundamental decoding strategies used to generate sequences (like text or translations) from generative models, such as Transformer-based LLMs.

Greedy Decoding (or Greedy Search) is the simplest and most fundamental method for sequence generation. At each step, the model selects the token with the highest conditional probability:

  1. The model calculates the probability distribution over the entire vocabulary for the next token.

  2. It selects the token $y_t$ with the highest probability: $y_t = \arg\max_{y_i} P(y_i \mid y_1, \dots, y_{t-1})$.

  3. The model appends this token to the sequence and repeats the process until it generates an end-of-sequence token or reaches the maximum length.

It's fast and efficient since it only tracks one path. However, it's often suboptimal: a locally best choice early on can lock the model into a path that produces repetitive or bland text.
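
A minimal Python sketch of this loop, assuming a hypothetical `next_token_logits(prefix)` callable that stands in for a model forward pass and returns a logits vector over the vocabulary, and assumed special-token ids `bos_id`/`eos_id`:

```python
import numpy as np

def greedy_decode(next_token_logits, bos_id, eos_id, max_len=50):
    """Greedy decoding: at every step keep only the single most probable token."""
    sequence = [bos_id]
    for _ in range(max_len):
        logits = next_token_logits(sequence)   # scores over the whole vocabulary
        next_id = int(np.argmax(logits))       # pick the highest-probability token
        sequence.append(next_id)
        if next_id == eos_id:                  # stop at end-of-sequence
            break
    return sequence
```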

Beam Search is a heuristic algorithm that finds sequences with better overall probability than Greedy Decoding, without the computational cost of an exhaustive search. At each step, it maintains a fixed number $k$ of the most promising partial sequences (called "beams") and proceeds as follows (see the sketch after these steps):

  1. It takes all the $k$ existing candidate sequences.

  2. It extends each of the $k$ sequences with every possible next token from the vocabulary.

  3. It calculates the cumulative probability (or log-probability) for all the new extended sequences.

  4. It prunes the list, keeping only the top $k$ sequences with the highest cumulative scores to become the new "beams" for the next step.
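
A sketch of the same idea, again assuming a hypothetical `next_token_logits(prefix)` model call; scores are summed log-probabilities, which is numerically safer than multiplying raw probabilities:

```python
import numpy as np

def log_softmax(logits):
    z = logits - np.max(logits)
    return z - np.log(np.sum(np.exp(z)))

def beam_search(next_token_logits, bos_id, eos_id, k=3, max_len=50):
    """Beam search: keep the k best-scoring partial sequences at every step."""
    beams = [([bos_id], 0.0)]                           # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                       # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            log_probs = log_softmax(np.asarray(next_token_logits(seq), dtype=float))
            for token_id, lp in enumerate(log_probs):   # extend with every vocabulary token
                candidates.append((seq + [token_id], score + lp))
        # prune: only the k highest-scoring candidates survive as the new beams
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    return beams[0][0]                                  # highest-scoring sequence found
```

In practice, only the top $k$ tokens per beam need to be considered before pruning, and scores are often length-normalized so the search doesn't systematically favor short outputs.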

Top-p/Top-k Sampling

Most modern LLMs use sampling methods like Top-p sampling instead of search-based approaches because of their dynamic nature: the candidate set adapts to the shape of each step's probability distribution, and the injected randomness avoids the repetitive output that pure search tends to produce.

Top-p sampling (also called nucleus sampling) works with a cumulative probability threshold $p$, where $p$ is a value between $0$ and $1$ (see the sketch after these steps):

  1. The LLM calculates a probability for every token in its vocabulary.

  2. All tokens are sorted in descending order based on their probabilities.

  3. The model sums the probabilities starting from the most likely token downwards, forming a group called the nucleus.

  4. In top-p sampling, tokens are added to the nucleus until their cumulative probability exceeds the threshold $p$. In top-k sampling, tokens are added until the number of tokens reaches $k$.

  5. The model considers only the tokens within this nucleus and randomly samples the next token from this restricted set, with the probabilities renormalized over the nucleus.
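
A sketch of both variants applied to a single logits vector, assuming NumPy; the default values of $p$ and $k$ here are illustrative, not prescriptive:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def sample_top_p(logits, p=0.9, rng=None):
    """Nucleus (top-p) sampling: sample from the smallest set of tokens whose
    cumulative probability exceeds p, with probabilities renormalized."""
    rng = rng or np.random.default_rng()
    probs = softmax(np.asarray(logits, dtype=float))
    order = np.argsort(probs)[::-1]                   # sort tokens by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # first position where the sum exceeds p
    nucleus = order[:cutoff]
    return int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))

def sample_top_k(logits, k=5, rng=None):
    """Top-k sampling: sample from the k most probable tokens, renormalized."""
    rng = rng or np.random.default_rng()
    probs = softmax(np.asarray(logits, dtype=float))
    top = np.argsort(probs)[::-1][:k]
    return int(rng.choice(top, p=probs[top] / probs[top].sum()))
```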

By using temperature $T$ to soften or sharpen the probability distribution, LLMs can balance between logical continuation and creative output.
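
Temperature is applied by dividing the logits by $T$ before the softmax; a small sketch showing how $T < 1$ sharpens the distribution and $T > 1$ flattens it (the example logits are arbitrary):

```python
import numpy as np

def temperature_softmax(logits, T=1.0):
    """Softmax with temperature: T < 1 sharpens the distribution, T > 1 softens it."""
    z = np.asarray(logits, dtype=float) / T
    z = np.exp(z - z.max())
    return z / z.sum()

logits = np.array([2.0, 1.0, 0.5])
print(temperature_softmax(logits, T=0.5))  # sharper: probability mass piles onto the top token
print(temperature_softmax(logits, T=1.0))  # unchanged distribution
print(temperature_softmax(logits, T=2.0))  # softer: tokens become closer to equally likely
```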
