<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[One Minute NLP]]></title><description><![CDATA[Every week, I break down one key NLP/Generative AI concept into a single, easy-to-digest slide so you can stay on top of the field in just one minute a week.]]></description><link>https://oneminutenlp.com</link><image><url>https://substackcdn.com/image/fetch/$s_!ocF2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b7cd7dd-0749-4103-8912-73e959da24ac_880x880.png</url><title>One Minute NLP</title><link>https://oneminutenlp.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 16 Apr 2026 20:32:01 GMT</lastBuildDate><atom:link href="https://oneminutenlp.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Dasha Herrmannova]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[oneminutenlp@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[oneminutenlp@substack.com]]></itunes:email><itunes:name><![CDATA[Dasha Herrmannova]]></itunes:name></itunes:owner><itunes:author><![CDATA[Dasha Herrmannova]]></itunes:author><googleplay:owner><![CDATA[oneminutenlp@substack.com]]></googleplay:owner><googleplay:email><![CDATA[oneminutenlp@substack.com]]></googleplay:email><googleplay:author><![CDATA[Dasha Herrmannova]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Model Context Protocol]]></title><description><![CDATA[MCP is an open standard that defines how LLMs integrate with external tools, data, and context.]]></description><link>https://oneminutenlp.com/p/model-context-protocol</link><guid 
isPermaLink="false">https://oneminutenlp.com/p/model-context-protocol</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Mon, 25 Aug 2025 19:31:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gcWL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c4daa3-40a5-4ed0-a817-9edeea0cc3e6_960x720.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>Hi everyone &#128075;</p><p>Before today&#8217;s topic, a quick note: I stepped away from the newsletter in April after a family loss; I needed time to slow down and just be. I&#8217;m back and excited to restart <strong>One Minute NLP</strong>: one slide, one concept, once a week to help you keep up with NLP/GenAI.</p><p>I really appreciate you sticking around. If you&#8217;d rather unsubscribe, you can do that anytime at the bottom of this email (or click <a href="https://substack.com/settings">here</a> if you&#8217;re reading the web version).</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gcWL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c4daa3-40a5-4ed0-a817-9edeea0cc3e6_960x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gcWL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c4daa3-40a5-4ed0-a817-9edeea0cc3e6_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!gcWL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c4daa3-40a5-4ed0-a817-9edeea0cc3e6_960x720.png 848w, 
https://substackcdn.com/image/fetch/$s_!gcWL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c4daa3-40a5-4ed0-a817-9edeea0cc3e6_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!gcWL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c4daa3-40a5-4ed0-a817-9edeea0cc3e6_960x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gcWL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c4daa3-40a5-4ed0-a817-9edeea0cc3e6_960x720.png" width="960" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68c4daa3-40a5-4ed0-a817-9edeea0cc3e6_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:76706,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://oneminutenlp.substack.com/i/160386660?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c4daa3-40a5-4ed0-a817-9edeea0cc3e6_960x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gcWL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c4daa3-40a5-4ed0-a817-9edeea0cc3e6_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!gcWL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c4daa3-40a5-4ed0-a817-9edeea0cc3e6_960x720.png 
848w, https://substackcdn.com/image/fetch/$s_!gcWL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c4daa3-40a5-4ed0-a817-9edeea0cc3e6_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!gcWL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c4daa3-40a5-4ed0-a817-9edeea0cc3e6_960x720.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h1>Model Context Protocol</h1><p><strong>Model Context Protocol</strong> (MCP) is an open standard for giving language models 
structured access to tools, data, and prewritten prompts. Instead of relying on ad-hoc prompt engineering or custom retrieval pipelines, tools and data sources are exposed to models through <strong>MCP servers</strong> in a standardized format. Developers can plug in existing tools (both cloud-based, e.g., GitHub and Slack, and local, e.g., a company database) rather than reinventing integrations each time. This approach promotes reusability across models and projects. An <strong>MCP client</strong> (usually embedded in an AI app or agent, the <strong>MCP Host</strong>) queries these servers to retrieve information or invoke actions.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M8p-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef24bc7-939c-4e57-847b-a6d237691d87_828x213.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M8p-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef24bc7-939c-4e57-847b-a6d237691d87_828x213.png 424w, https://substackcdn.com/image/fetch/$s_!M8p-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef24bc7-939c-4e57-847b-a6d237691d87_828x213.png 848w, https://substackcdn.com/image/fetch/$s_!M8p-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef24bc7-939c-4e57-847b-a6d237691d87_828x213.png 1272w, https://substackcdn.com/image/fetch/$s_!M8p-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef24bc7-939c-4e57-847b-a6d237691d87_828x213.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!M8p-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef24bc7-939c-4e57-847b-a6d237691d87_828x213.png" width="828" height="213" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ef24bc7-939c-4e57-847b-a6d237691d87_828x213.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:213,&quot;width&quot;:828,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:68781,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://oneminutenlp.substack.com/i/160386660?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef24bc7-939c-4e57-847b-a6d237691d87_828x213.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M8p-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef24bc7-939c-4e57-847b-a6d237691d87_828x213.png 424w, https://substackcdn.com/image/fetch/$s_!M8p-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef24bc7-939c-4e57-847b-a6d237691d87_828x213.png 848w, https://substackcdn.com/image/fetch/$s_!M8p-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef24bc7-939c-4e57-847b-a6d237691d87_828x213.png 1272w, https://substackcdn.com/image/fetch/$s_!M8p-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef24bc7-939c-4e57-847b-a6d237691d87_828x213.png 1456w" 
sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Relationship between MCP Host, Clients, and Servers.</figcaption></figure></div><p>For MCP to work, the model (or its agent wrapper) must support <strong>tool use</strong>. MCP doesn&#8217;t overcome model limits like context window size or reasoning errors; it merely provides a standard for managing model context.</p><h2>Further reading</h2><ul><li><p><a href="https://modelcontextprotocol.io/quickstart/server">Build an MCP Server</a> and <a href="https://modelcontextprotocol.io/quickstart/client">Build an MCP Client</a> tutorials by ModelContextProtocol.io &#8212; The official MCP documentation is a great place to get started with MCP. The tutorials are easy to follow and can give you a solid grasp of how MCP Servers and Clients work.</p></li><li><p><a href="https://github.com/modelcontextprotocol/servers">Model Context Protocol servers</a> &#8212; This is a very comprehensive list of existing MCP servers.</p></li><li><p><a href="https://blog.sshh.io/p/everything-wrong-with-mcp">Everything Wrong with MCP</a> by Shrivu Shankar &#8212; This article does a great job explaining the limitations of MCP.</p></li></ul><div><hr></div><h3>Download complete One Minute NLP</h3><p>Do you want to use One Minute NLP slides as a quick reference? You can download the complete set of all past One Minute NLP topics here: <a href="https://complete.oneminutenlp.com">complete.oneminutenlp.com</a>.</p><div><hr></div><h3>Do you want to learn more NLP concepts?</h3><p>Each week I pick one core NLP concept and create a one-slide, one-minute explanation of the concept. 
To receive weekly new posts in your inbox, subscribe here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oneminutenlp.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Reach out to me:</p><ul><li><p>Connect with me on <a href="https://www.linkedin.com/in/herrmannova/">LinkedIn</a></p></li><li><p>Read my technical blog on <a href="https://medium.com/@robodasha">Medium</a></p></li><li><p>Or send me a message by responding to this post</p></li></ul><p>Is there a concept you would like me to cover in a future issue? Let me know!</p><p></p>]]></content:encoded></item><item><title><![CDATA[In-context learning]]></title><description><![CDATA[In-context learning is a prompting technique that allows LLMs to learn and adapt to new tasks based on examples provided within the input prompt, without requiring additional training or fine-tuning.]]></description><link>https://oneminutenlp.com/p/in-context-learning</link><guid isPermaLink="false">https://oneminutenlp.com/p/in-context-learning</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Fri, 11 Apr 2025 03:02:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZwoY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb5c4fae-7854-45a6-a44d-e83ddedc128d_960x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZwoY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb5c4fae-7854-45a6-a44d-e83ddedc128d_960x720.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZwoY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb5c4fae-7854-45a6-a44d-e83ddedc128d_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!ZwoY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb5c4fae-7854-45a6-a44d-e83ddedc128d_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!ZwoY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb5c4fae-7854-45a6-a44d-e83ddedc128d_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!ZwoY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb5c4fae-7854-45a6-a44d-e83ddedc128d_960x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZwoY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb5c4fae-7854-45a6-a44d-e83ddedc128d_960x720.png" width="960" height="720" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb5c4fae-7854-45a6-a44d-e83ddedc128d_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66070,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://oneminutenlp.substack.com/i/160386606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb5c4fae-7854-45a6-a44d-e83ddedc128d_960x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZwoY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb5c4fae-7854-45a6-a44d-e83ddedc128d_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!ZwoY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb5c4fae-7854-45a6-a44d-e83ddedc128d_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!ZwoY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb5c4fae-7854-45a6-a44d-e83ddedc128d_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!ZwoY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb5c4fae-7854-45a6-a44d-e83ddedc128d_960x720.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>In-context learning</h1><p>In-context learning is a prompting technique where LLMs learn how to handle new tasks using examples in the prompt, without any weight updates. 
Unlike training or fine-tuning, which changes model parameters, in-context learning utilizes input-output pairs provided in the prompt to guide LLM behavior, even for tasks the LLM wasn&#8217;t explicitly trained on.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GjWs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b4d2cb-421d-4ab4-a96a-7f01d1729e9a_870x166.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GjWs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b4d2cb-421d-4ab4-a96a-7f01d1729e9a_870x166.png 424w, https://substackcdn.com/image/fetch/$s_!GjWs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b4d2cb-421d-4ab4-a96a-7f01d1729e9a_870x166.png 848w, https://substackcdn.com/image/fetch/$s_!GjWs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b4d2cb-421d-4ab4-a96a-7f01d1729e9a_870x166.png 1272w, https://substackcdn.com/image/fetch/$s_!GjWs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b4d2cb-421d-4ab4-a96a-7f01d1729e9a_870x166.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GjWs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b4d2cb-421d-4ab4-a96a-7f01d1729e9a_870x166.png" width="870" height="166" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71b4d2cb-421d-4ab4-a96a-7f01d1729e9a_870x166.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:166,&quot;width&quot;:870,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67248,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://oneminutenlp.substack.com/i/160386606?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b4d2cb-421d-4ab4-a96a-7f01d1729e9a_870x166.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GjWs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b4d2cb-421d-4ab4-a96a-7f01d1729e9a_870x166.png 424w, https://substackcdn.com/image/fetch/$s_!GjWs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b4d2cb-421d-4ab4-a96a-7f01d1729e9a_870x166.png 848w, https://substackcdn.com/image/fetch/$s_!GjWs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b4d2cb-421d-4ab4-a96a-7f01d1729e9a_870x166.png 1272w, https://substackcdn.com/image/fetch/$s_!GjWs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b4d2cb-421d-4ab4-a96a-7f01d1729e9a_870x166.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Simple few-shot prompt example</figcaption></figure></div><p>This allows rapid task adaptation without the compute or data needed for fine-tuning. 
Supplying examples is called <strong>few-shot learning</strong>; using none is <strong>zero-shot</strong>. Few-shot prompting often outperforms zero-shot prompting.</p><p>Performance depends heavily on the task, the number and order of examples, and how well they&#8217;re chosen. A few well-curated examples can match the performance of a fully fine-tuned model, while poorly selected ones can significantly degrade it.</p><h2>Further reading</h2><ul><li><p><a href="https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html">Language Models are Few-Shot Learners</a> by Brown et al. &#8212; This paper introduced GPT-3 and the concept of in-context learning.</p></li><li><p><a href="https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/">Prompt Engineering</a> by Lilian Weng &#8212; This fantastic article explains many prompt engineering techniques, including few-shot prompting, and provides tips for selecting good examples.</p></li><li><p><a href="https://cameronrwolfe.substack.com/p/practical-prompt-engineering-part?open=false#%C2%A7few-shot-learning">Practical Prompt Engineering</a> by Cameron Wolfe &#8212; Drawing on a wide range of research articles, this deep dive covers many common prompting techniques, including few-shot prompting.</p></li></ul><div><hr></div><h3>Do you want to learn more NLP concepts?</h3><p>Every week, I break down one key NLP or Generative AI concept into a single, easy-to-digest slide. 
No fluff&#8212;just the core idea, explained clearly, so you can stay sharp in just one minute a week.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Get your weekly NLP cheat sheet</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><p>Reach out to me:</p><ul><li><p>Connect with me on <a href="https://www.linkedin.com/in/herrmannova/">LinkedIn</a></p></li><li><p>Read my technical blog on <a href="https://medium.com/@robodasha">Medium</a></p></li><li><p>Or send me a message by responding to this post</p></li></ul><p>Is there a concept you would like me to cover in a future issue? 
Let me know!</p><p></p>]]></content:encoded></item><item><title><![CDATA[Reflection agents]]></title><description><![CDATA[Reflection is a simple but powerful technique for improving the quality of LLM responses by prompting the LLM to reflect on its own output to identify gaps or possibilities for improvement.]]></description><link>https://oneminutenlp.com/p/reflection-agents</link><guid isPermaLink="false">https://oneminutenlp.com/p/reflection-agents</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Sun, 30 Mar 2025 22:01:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!mmUk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f2a6fa7-45f7-4da6-8f36-8624e97514b7_960x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mmUk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f2a6fa7-45f7-4da6-8f36-8624e97514b7_960x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mmUk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f2a6fa7-45f7-4da6-8f36-8624e97514b7_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!mmUk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f2a6fa7-45f7-4da6-8f36-8624e97514b7_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!mmUk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f2a6fa7-45f7-4da6-8f36-8624e97514b7_960x720.png 1272w, 
https://substackcdn.com/image/fetch/$s_!mmUk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f2a6fa7-45f7-4da6-8f36-8624e97514b7_960x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mmUk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f2a6fa7-45f7-4da6-8f36-8624e97514b7_960x720.png" width="960" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f2a6fa7-45f7-4da6-8f36-8624e97514b7_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:71413,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://oneminutenlp.substack.com/i/160200053?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f2a6fa7-45f7-4da6-8f36-8624e97514b7_960x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mmUk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f2a6fa7-45f7-4da6-8f36-8624e97514b7_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!mmUk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f2a6fa7-45f7-4da6-8f36-8624e97514b7_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!mmUk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f2a6fa7-45f7-4da6-8f36-8624e97514b7_960x720.png 
1272w, https://substackcdn.com/image/fetch/$s_!mmUk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f2a6fa7-45f7-4da6-8f36-8624e97514b7_960x720.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h1>Reflection agents</h1><p><strong>Reflection</strong> is a prompting strategy used in <strong>agent workflows</strong> to iteratively improve an LLM&#8217;s reasoning or outputs. Instead of directly returning a final answer, the LLM is first prompted to reflect on its initial response and identify errors, gaps, or opportunities for improvement. 
It then uses this feedback to generate a revised response. This can be repeated multiple times until some stopping condition is met (e.g., max iterations, token limit, or some evaluation criteria).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fgG6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df15043-d19f-470a-b456-19dc8c66ed9e_869x184.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fgG6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df15043-d19f-470a-b456-19dc8c66ed9e_869x184.png 424w, https://substackcdn.com/image/fetch/$s_!fgG6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df15043-d19f-470a-b456-19dc8c66ed9e_869x184.png 848w, https://substackcdn.com/image/fetch/$s_!fgG6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df15043-d19f-470a-b456-19dc8c66ed9e_869x184.png 1272w, https://substackcdn.com/image/fetch/$s_!fgG6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df15043-d19f-470a-b456-19dc8c66ed9e_869x184.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fgG6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df15043-d19f-470a-b456-19dc8c66ed9e_869x184.png" width="869" height="184" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5df15043-d19f-470a-b456-19dc8c66ed9e_869x184.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:184,&quot;width&quot;:869,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75774,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://oneminutenlp.substack.com/i/160200053?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df15043-d19f-470a-b456-19dc8c66ed9e_869x184.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fgG6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df15043-d19f-470a-b456-19dc8c66ed9e_869x184.png 424w, https://substackcdn.com/image/fetch/$s_!fgG6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df15043-d19f-470a-b456-19dc8c66ed9e_869x184.png 848w, https://substackcdn.com/image/fetch/$s_!fgG6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df15043-d19f-470a-b456-19dc8c66ed9e_869x184.png 1272w, https://substackcdn.com/image/fetch/$s_!fgG6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df15043-d19f-470a-b456-19dc8c66ed9e_869x184.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Common modules used in reflection agents.</figcaption></figure></div><p>This process has been shown to improve performance over direct response generation on a variety of tasks. 
Some agents add an explicit <strong>evaluation step</strong> that runs before reflection. Evaluation can be done via an LLM judge, external tools (e.g., a code compiler or a knowledge base), or other methods (e.g., a heuristic score). Reflection can be used in combination with other agent design patterns such as the <strong>ReAct</strong> framework.</p><h2>Further reading</h2><ul><li><p><a href="https://arxiv.org/abs/2303.11366">Reflexion: Language Agents with Verbal Reinforcement Learning</a> by Shinn et al. &#8212; A seminal paper on reflection that used separate evaluation and reflection modules. This paper is a great read if you want to understand reflection in more depth. The appendix includes the prompts the authors used in their experiments.</p></li><li><p>If you want to try running a reflection agent yourself, both <a href="https://blog.langchain.dev/reflection-agents/">LangChain</a> and <a href="https://web.archive.org/web/20250322154619/https://docs.llamaindex.ai/en/stable/examples/agent/introspective_agent_toxicity_reduction/">LlamaIndex</a> include different variations of reflection agents.</p></li><li><p><a href="https://huyenchip.com/2025/01/07/agents.html#agent_overview">Agents</a> by Chip Huyen &#8212; This fantastic blog post covers much more than just Reflection, but if you are looking to understand agents and where Reflection fits in, this is a great place to start.</p></li></ul><div><hr></div><h3>Do you want to learn more NLP concepts?</h3><p>Each week I pick one core NLP concept and create a one-slide, one-minute explanation of the concept.
To receive weekly new posts in your inbox, subscribe here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oneminutenlp.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Reach out to me:</p><ul><li><p>Connect with me on <a href="https://www.linkedin.com/in/herrmannova/">LinkedIn</a></p></li><li><p>Read my technical blog on <a href="https://medium.com/@robodasha">Medium</a></p></li><li><p>Or send me a message by responding to this post</p></li></ul><p>Is there a concept you would like me to cover in a future issue? Let me know!</p><p></p>]]></content:encoded></item><item><title><![CDATA[Top-k and top-p sampling]]></title><description><![CDATA[Top-k and top-p sampling are methods used to control the randomness and diversity of LLM outputs.]]></description><link>https://oneminutenlp.com/p/top-k-and-top-p-sampling</link><guid isPermaLink="false">https://oneminutenlp.com/p/top-k-and-top-p-sampling</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Thu, 20 Mar 2025 04:26:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2qCL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde730971-4299-4f7d-b626-044462c934ac_960x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2qCL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde730971-4299-4f7d-b626-044462c934ac_960x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!2qCL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde730971-4299-4f7d-b626-044462c934ac_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!2qCL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde730971-4299-4f7d-b626-044462c934ac_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!2qCL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde730971-4299-4f7d-b626-044462c934ac_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!2qCL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde730971-4299-4f7d-b626-044462c934ac_960x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2qCL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde730971-4299-4f7d-b626-044462c934ac_960x720.png" width="960" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de730971-4299-4f7d-b626-044462c934ac_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69101,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://oneminutenlp.substack.com/i/159455454?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde730971-4299-4f7d-b626-044462c934ac_960x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" 
alt="" srcset="https://substackcdn.com/image/fetch/$s_!2qCL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde730971-4299-4f7d-b626-044462c934ac_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!2qCL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde730971-4299-4f7d-b626-044462c934ac_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!2qCL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde730971-4299-4f7d-b626-044462c934ac_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!2qCL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde730971-4299-4f7d-b626-044462c934ac_960x720.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h1>Top-k and top-p sampling</h1><p><strong>Sampling</strong> is a method LLMs use to generate output tokens, with the next token selected randomly based on learned <strong>output token probabilities</strong>. Higher-probability tokens are more likely to be chosen, but random sampling can generate odd or nonsensical outputs if low-probability tokens are picked.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kKeH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60655d0e-012d-4a56-9e0a-dae16eb54a5b_832x184.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kKeH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60655d0e-012d-4a56-9e0a-dae16eb54a5b_832x184.png 424w, https://substackcdn.com/image/fetch/$s_!kKeH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60655d0e-012d-4a56-9e0a-dae16eb54a5b_832x184.png 848w, https://substackcdn.com/image/fetch/$s_!kKeH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60655d0e-012d-4a56-9e0a-dae16eb54a5b_832x184.png 1272w,
https://substackcdn.com/image/fetch/$s_!kKeH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60655d0e-012d-4a56-9e0a-dae16eb54a5b_832x184.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kKeH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60655d0e-012d-4a56-9e0a-dae16eb54a5b_832x184.png" width="832" height="184" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60655d0e-012d-4a56-9e0a-dae16eb54a5b_832x184.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:184,&quot;width&quot;:832,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48824,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://oneminutenlp.substack.com/i/159455454?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60655d0e-012d-4a56-9e0a-dae16eb54a5b_832x184.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kKeH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60655d0e-012d-4a56-9e0a-dae16eb54a5b_832x184.png 424w, https://substackcdn.com/image/fetch/$s_!kKeH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60655d0e-012d-4a56-9e0a-dae16eb54a5b_832x184.png 848w, https://substackcdn.com/image/fetch/$s_!kKeH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60655d0e-012d-4a56-9e0a-dae16eb54a5b_832x184.png 
1272w, https://substackcdn.com/image/fetch/$s_!kKeH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60655d0e-012d-4a56-9e0a-dae16eb54a5b_832x184.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Comparison of top-k and top-p sampling.</figcaption></figure></div><p><strong>Top-k sampling</strong> truncates the distribution to the <em>k</em> most probable tokens, then randomly picks from this subset using renormalized probabilities. Top-1 sampling is equivalent to <strong>greedy decoding</strong>, which always picks the most probable token. Top-k sampling can struggle if the top <em>k</em> tokens include low-probability tokens or omit likely tokens. <strong>Top-p sampling</strong> (<strong>nucleus sampling</strong>) addresses this by instead keeping the smallest set of tokens whose cumulative probability mass reaches <em>p</em>.</p><p>Top-k and top-p sampling can be used with or as alternatives to <strong>temperature sampling</strong>.</p><h2>Further reading</h2><ul><li><p><a href="https://web.stanford.edu/~jurafsky/slp3/">Speech and Language Processing</a> by Jurafsky and Martin (free to read online) &#8212; Section 10.8 (Large Language Models: Generation by Sampling) provides a great introduction to sampling from language models.</p></li><li><p><a href="https://huggingface.co/blog/how-to-generate">How to generate text: using different decoding methods for language generation with Transformers</a> &#8212; If you prefer to learn with code, this Hugging Face blog post includes code examples alongside explanations of the most popular sampling techniques.</p></li><li><p><a href="https://arxiv.org/abs/1904.09751">The Curious Case of Neural Text Degeneration</a> by Holtzman et al.
&#8212; This paper, which introduced top-p sampling, includes comparisons of different sampling techniques and their pros and cons.</p></li></ul><div><hr></div><h3>Do you want to learn more NLP concepts?</h3><p>Each week I pick one core NLP concept and create a one-slide, one-minute explanation of the concept. To receive weekly new posts in your inbox, subscribe here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oneminutenlp.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Reach out to me:</p><ul><li><p>Connect with me on <a href="https://www.linkedin.com/in/herrmannova/">LinkedIn</a></p></li><li><p>Read my technical blog on <a href="https://medium.com/@robodasha">Medium</a></p></li><li><p>Or send me a message by responding to this post</p></li></ul><p>Is there a concept you would like me to cover in a future issue?
Let me know!</p><p></p>]]></content:encoded></item><item><title><![CDATA[Reasoning Models]]></title><description><![CDATA[Reasoning models are a new class of LLMs designed to solve complex problems like math and coding.]]></description><link>https://oneminutenlp.com/p/reasoning-models</link><guid isPermaLink="false">https://oneminutenlp.com/p/reasoning-models</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Wed, 12 Mar 2025 01:05:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!inKQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4be0a2b6-8475-49ae-a809-66b6a1315576_960x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!inKQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4be0a2b6-8475-49ae-a809-66b6a1315576_960x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!inKQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4be0a2b6-8475-49ae-a809-66b6a1315576_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!inKQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4be0a2b6-8475-49ae-a809-66b6a1315576_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!inKQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4be0a2b6-8475-49ae-a809-66b6a1315576_960x720.png 1272w, 
https://substackcdn.com/image/fetch/$s_!inKQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4be0a2b6-8475-49ae-a809-66b6a1315576_960x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!inKQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4be0a2b6-8475-49ae-a809-66b6a1315576_960x720.png" width="960" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4be0a2b6-8475-49ae-a809-66b6a1315576_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72591,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://oneminutenlp.substack.com/i/158210851?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4be0a2b6-8475-49ae-a809-66b6a1315576_960x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!inKQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4be0a2b6-8475-49ae-a809-66b6a1315576_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!inKQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4be0a2b6-8475-49ae-a809-66b6a1315576_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!inKQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4be0a2b6-8475-49ae-a809-66b6a1315576_960x720.png 
1272w, https://substackcdn.com/image/fetch/$s_!inKQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4be0a2b6-8475-49ae-a809-66b6a1315576_960x720.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h1>Reasoning Models</h1><p><strong>Reasoning models</strong> are a new class of LLMs designed to solve complex problems like math and coding.
Unlike standard LLMs that generate an answer directly, reasoning models are designed and trained to first produce <strong>intermediate &#8220;thinking&#8221; steps</strong> (similar to <strong>chain-of-thought reasoning</strong>, but longer and more detailed) before finalizing a response. This makes them strong at multi-step logic tasks (e.g., math proofs, coding challenges), but less efficient for simpler tasks like translation. Reasoning models are trained using <strong>reinforcement learning</strong> (RL). For example, <strong>DeepSeek-R1</strong> was trained using a combination of RL from human feedback (RLHF) and RL using verifiable rewards (e.g., math correctness).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3hRh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3376d4e6-f807-4541-be4c-cbccb8e32d24_875x221.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3hRh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3376d4e6-f807-4541-be4c-cbccb8e32d24_875x221.png 424w, https://substackcdn.com/image/fetch/$s_!3hRh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3376d4e6-f807-4541-be4c-cbccb8e32d24_875x221.png 848w, https://substackcdn.com/image/fetch/$s_!3hRh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3376d4e6-f807-4541-be4c-cbccb8e32d24_875x221.png 1272w, https://substackcdn.com/image/fetch/$s_!3hRh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3376d4e6-f807-4541-be4c-cbccb8e32d24_875x221.png
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3hRh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3376d4e6-f807-4541-be4c-cbccb8e32d24_875x221.png" width="875" height="221" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3376d4e6-f807-4541-be4c-cbccb8e32d24_875x221.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:221,&quot;width&quot;:875,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:81792,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://oneminutenlp.substack.com/i/158210851?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3376d4e6-f807-4541-be4c-cbccb8e32d24_875x221.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3hRh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3376d4e6-f807-4541-be4c-cbccb8e32d24_875x221.png 424w, https://substackcdn.com/image/fetch/$s_!3hRh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3376d4e6-f807-4541-be4c-cbccb8e32d24_875x221.png 848w, https://substackcdn.com/image/fetch/$s_!3hRh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3376d4e6-f807-4541-be4c-cbccb8e32d24_875x221.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3hRh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3376d4e6-f807-4541-be4c-cbccb8e32d24_875x221.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Simplified representation of DeepSeek-R1 training.</figcaption></figure></div><p>Reasoning models are typically slower and more expensive than standard LLMs because they require more output tokens for thinking steps.</p><h2>Further reading</h2><ul><li><p><a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">Demystifying Reasoning Models</a> by Cameron Wolfe &#8212; An easy-to-follow article on reasoning models that explains what they are and how they are trained. Includes lots of helpful references.</p></li><li><p><a href="https://arxiv.org/abs/2502.21321">LLM Post-Training: A Deep Dive into Reasoning Large Language Models</a> by Kumar et al. &#8212; If you want to dive deeper, this paper gives a thorough overview of post-training techniques, including the RL techniques that underpin models like DeepSeek-R1.</p></li><li><p><a href="https://platform.openai.com/docs/guides/reasoning-best-practices">Reasoning best practices</a> (OpenAI Platform documentation) &#8212; This page gives examples of problems where OpenAI&#8217;s reasoning models have been found to work well and includes tips for prompting reasoning models effectively.</p></li></ul><div><hr></div><h3>Do you want to learn more NLP concepts?</h3><p>Each week I pick one core NLP concept and create a one-slide, one-minute explanation of the concept.
To receive weekly new posts in your inbox, subscribe here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oneminutenlp.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Reach out to me:</p><ul><li><p>Connect with me on <a href="https://www.linkedin.com/in/herrmannova/">LinkedIn</a></p></li><li><p>Read my technical blog on <a href="https://medium.com/@robodasha">Medium</a></p></li><li><p>Or send me a message by responding to this post</p></li></ul><p>Is there a concept you would like me to cover in a future issue? Let me know!</p><p></p>]]></content:encoded></item><item><title><![CDATA[Group Relative Policy Optimization]]></title><description><![CDATA[Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm that was used to train DeepSeek-R1.]]></description><link>https://oneminutenlp.com/p/group-relative-policy-optimization</link><guid isPermaLink="false">https://oneminutenlp.com/p/group-relative-policy-optimization</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Mon, 03 Mar 2025 05:19:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0UmV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17056295-091c-4eae-853f-f9102b2bf356_960x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0UmV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17056295-091c-4eae-853f-f9102b2bf356_960x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0UmV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17056295-091c-4eae-853f-f9102b2bf356_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!0UmV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17056295-091c-4eae-853f-f9102b2bf356_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!0UmV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17056295-091c-4eae-853f-f9102b2bf356_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!0UmV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17056295-091c-4eae-853f-f9102b2bf356_960x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0UmV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17056295-091c-4eae-853f-f9102b2bf356_960x720.png" width="960" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17056295-091c-4eae-853f-f9102b2bf356_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85424,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://oneminutenlp.substack.com/i/156281372?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17056295-091c-4eae-853f-f9102b2bf356_960x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0UmV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17056295-091c-4eae-853f-f9102b2bf356_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!0UmV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17056295-091c-4eae-853f-f9102b2bf356_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!0UmV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17056295-091c-4eae-853f-f9102b2bf356_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!0UmV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17056295-091c-4eae-853f-f9102b2bf356_960x720.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h1>Group Relative Policy Optimization</h1><p><strong>Group Relative Policy Optimization</strong> (GRPO) is a reinforcement learning (RL) algorithm that can replace <strong>Proximal Policy Optimization</strong> (PPO) when training LLMs with RL. In GRPO, the LLM being trained (also called the <strong>policy</strong>) generates multiple responses to the same prompt. Each response receives a <strong>reward</strong> (a score of its &#8216;quality&#8217;), and the group&#8217;s average reward serves as a baseline: each response&#8217;s <strong>advantage</strong> is its reward relative to that baseline, and these advantages are used to calculate gradients for the policy LLM. This eliminates the separate critic model that PPO requires to estimate advantages, thereby using much less memory and compute. 
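The group-relative advantage can be sketched in a few lines of Python (a simplified illustration with hypothetical reward values; real implementations operate on batched tensors):

```python
# Sketch of GRPO's group-relative advantage (illustrative only;
# the reward values below are hypothetical, not from a real reward model).
def group_advantages(rewards):
    n = len(rewards)
    mean = sum(rewards) / n
    # GRPO also normalizes by the group's reward standard deviation
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four responses sampled from the policy for one prompt, each scored
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_advantages(rewards)
```

Responses scored above the group mean get a positive advantage and are reinforced; below-average responses are penalized.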
Like PPO, GRPO constrains the size of weight updates to prevent drastic behavior changes.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YBb7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00811f21-d759-4cba-ad60-322f89c9bfaf_876x201.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YBb7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00811f21-d759-4cba-ad60-322f89c9bfaf_876x201.png 424w, https://substackcdn.com/image/fetch/$s_!YBb7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00811f21-d759-4cba-ad60-322f89c9bfaf_876x201.png 848w, https://substackcdn.com/image/fetch/$s_!YBb7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00811f21-d759-4cba-ad60-322f89c9bfaf_876x201.png 1272w, https://substackcdn.com/image/fetch/$s_!YBb7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00811f21-d759-4cba-ad60-322f89c9bfaf_876x201.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YBb7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00811f21-d759-4cba-ad60-322f89c9bfaf_876x201.png" width="876" height="201" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00811f21-d759-4cba-ad60-322f89c9bfaf_876x201.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:201,&quot;width&quot;:876,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:76507,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://oneminutenlp.substack.com/i/156281372?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00811f21-d759-4cba-ad60-322f89c9bfaf_876x201.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YBb7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00811f21-d759-4cba-ad60-322f89c9bfaf_876x201.png 424w, https://substackcdn.com/image/fetch/$s_!YBb7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00811f21-d759-4cba-ad60-322f89c9bfaf_876x201.png 848w, https://substackcdn.com/image/fetch/$s_!YBb7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00811f21-d759-4cba-ad60-322f89c9bfaf_876x201.png 1272w, https://substackcdn.com/image/fetch/$s_!YBb7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00811f21-d759-4cba-ad60-322f89c9bfaf_876x201.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Steps involved in GRPO.</figcaption></figure></div><p>GRPO was used to train DeepSeek R1 in math and coding-based RL (using rule-based rewards) and RL from Human Feedback (RLHF). 
GRPO, like PPO, still depends heavily on the quality of the reward signal.</p><h2>Further reading</h2><ul><li><p><a href="https://arxiv.org/abs/2402.03300">DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models</a> by Shao et al. &#8212; This paper introduced GRPO. The power of the algorithm was later demonstrated in <a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</a> by Guo et al., which introduced the DeepSeek-R1 model.</p></li><li><p><a href="https://yugeten.github.io/posts/2025/01/ppogrpo/">A vision researcher&#8217;s guide to some RL stuff: PPO &amp; GRPO</a> by Yuge Shi &#8212; This blog post explains PPO and GRPO, and covers differences between the algorithms and how they are used in RLHF.</p></li><li><p>The <a href="https://huggingface.co/docs/trl/main/en/index">Transformer Reinforcement Learning</a> (TRL) library by Hugging Face includes a <a href="https://huggingface.co/docs/trl/main/en/grpo_trainer">GRPO Trainer</a>. The documentation includes an example showing how to use GRPO to train a model.</p></li></ul><div><hr></div><h3>Do you want to learn more NLP concepts?</h3><p>Each week I pick one core NLP concept and create a one-slide, one-minute explanation of the concept. 
To receive weekly new posts in your inbox, subscribe here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oneminutenlp.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Reach out to me:</p><ul><li><p>Connect with me on <a href="https://www.linkedin.com/in/herrmannova/">LinkedIn</a></p></li><li><p>Read my technical blog on <a href="https://medium.com/@robodasha">Medium</a></p></li><li><p>Or send me a message by responding to this post</p></li></ul><p>Is there a concept you would like me to cover in a future issue? Let me know!</p><p></p>]]></content:encoded></item><item><title><![CDATA[Proximal Policy Optimization]]></title><description><![CDATA[Proximal Policy Optimization is frequently used in Reinforcement Learning from Human Feedback to further train LLMs after supervised fine-tuning. 
It was used to train InstructGPT and ChatGPT.]]></description><link>https://oneminutenlp.com/p/proximal-policy-optimization</link><guid isPermaLink="false">https://oneminutenlp.com/p/proximal-policy-optimization</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Tue, 25 Feb 2025 03:39:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/75cd1962-cf68-4cc9-bef9-fc94da2f19b5_960x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ATXP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0182c7fe-45f8-4d2d-83d5-b8befdb8ecd7_960x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ATXP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0182c7fe-45f8-4d2d-83d5-b8befdb8ecd7_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!ATXP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0182c7fe-45f8-4d2d-83d5-b8befdb8ecd7_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!ATXP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0182c7fe-45f8-4d2d-83d5-b8befdb8ecd7_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!ATXP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0182c7fe-45f8-4d2d-83d5-b8befdb8ecd7_960x720.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ATXP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0182c7fe-45f8-4d2d-83d5-b8befdb8ecd7_960x720.png" width="960" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0182c7fe-45f8-4d2d-83d5-b8befdb8ecd7_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75136,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://oneminutenlp.substack.com/i/156281287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0182c7fe-45f8-4d2d-83d5-b8befdb8ecd7_960x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ATXP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0182c7fe-45f8-4d2d-83d5-b8befdb8ecd7_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!ATXP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0182c7fe-45f8-4d2d-83d5-b8befdb8ecd7_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!ATXP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0182c7fe-45f8-4d2d-83d5-b8befdb8ecd7_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!ATXP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0182c7fe-45f8-4d2d-83d5-b8befdb8ecd7_960x720.png 1456w" sizes="100vw" 
fetchpriority="high"></picture></div></a></figure></div><h1>Proximal Policy Optimization</h1><p><strong>Proximal Policy Optimization</strong> (PPO) is a reinforcement learning (RL) algorithm that is frequently used in <strong>Reinforcement Learning from Human Feedback</strong> (RLHF) to further train LLMs after supervised fine-tuning. In RLHF, an LLM output receives a numerical <strong>reward</strong> that represents human preference; the goal is to maximize this reward. PPO uses an <strong>actor-critic setup</strong>, training two models simultaneously. 
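The update on the actor relies on a clipped probability ratio between the new and old policy; a minimal sketch for a single token (the scalar inputs are hypothetical, not a full trainer):

```python
import math

# PPO's clipped surrogate objective for one token (illustrative sketch).
# logp_new/logp_old are the token's log-probabilities under the updated
# and old policy; the advantage would come from the critic's estimates.
def clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    ratio = math.exp(logp_new - logp_old)            # how far the policy moved
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # keep the ratio near 1
    # Pessimistic minimum: large policy shifts yield no extra gain
    return min(ratio * advantage, clipped * advantage)
```

Maximizing this objective (averaged over tokens) improves the policy while the clip keeps each update close to the previous policy.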
The actor (also called <strong>policy</strong>) is the LLM that generates tokens; the critic estimates the expected final reward at each token&#8212;these estimates are used to calculate gradients for the policy LLM. PPO constrains the size of weight updates on the policy LLM at each training step to prevent drastic and unpredictable behavior changes.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Dg0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90cec589-4bc9-4770-a7ce-db7d33c92c1e_873x217.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Dg0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90cec589-4bc9-4770-a7ce-db7d33c92c1e_873x217.png 424w, https://substackcdn.com/image/fetch/$s_!0Dg0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90cec589-4bc9-4770-a7ce-db7d33c92c1e_873x217.png 848w, https://substackcdn.com/image/fetch/$s_!0Dg0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90cec589-4bc9-4770-a7ce-db7d33c92c1e_873x217.png 1272w, https://substackcdn.com/image/fetch/$s_!0Dg0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90cec589-4bc9-4770-a7ce-db7d33c92c1e_873x217.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Dg0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90cec589-4bc9-4770-a7ce-db7d33c92c1e_873x217.png" width="873" height="217" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90cec589-4bc9-4770-a7ce-db7d33c92c1e_873x217.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:217,&quot;width&quot;:873,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60611,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://oneminutenlp.substack.com/i/156281287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90cec589-4bc9-4770-a7ce-db7d33c92c1e_873x217.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0Dg0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90cec589-4bc9-4770-a7ce-db7d33c92c1e_873x217.png 424w, https://substackcdn.com/image/fetch/$s_!0Dg0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90cec589-4bc9-4770-a7ce-db7d33c92c1e_873x217.png 848w, https://substackcdn.com/image/fetch/$s_!0Dg0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90cec589-4bc9-4770-a7ce-db7d33c92c1e_873x217.png 1272w, https://substackcdn.com/image/fetch/$s_!0Dg0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90cec589-4bc9-4770-a7ce-db7d33c92c1e_873x217.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Simplified workflow depicting models involved in PPO (actor, critic, and reward model) and how they interact.</figcaption></figure></div><p>PPO is the standard algorithm for RLHF; it was 
used to fine-tune InstructGPT and ChatGPT. PPO doesn&#8217;t prevent <strong>reward hacking</strong>&#8212;achieving high rewards by learning to exploit imperfections in the reward model.</p><h2>Further reading</h2><ul><li><p>This series of three posts on RL for LLMs by Cameron Wolfe does a fantastic job of explaining both the intuition and the math behind RL algorithms in an accessible way: <a href="https://cameronrwolfe.substack.com/p/basics-of-reinforcement-learning">Basics of Reinforcement Learning for LLMs</a>, <a href="https://cameronrwolfe.substack.com/p/policy-gradients-the-foundation-of">Policy Gradients: The Foundation of RLHF</a>, <a href="https://cameronrwolfe.substack.com/p/proximal-policy-optimization-ppo">Proximal Policy Optimization (PPO): The Key to LLM Alignment</a>.</p></li><li><p><a href="https://iclr-blogposts.github.io/2024/blog/the-n-implementation-details-of-rlhf-with-ppo/">The N Implementation Details of RLHF with PPO</a> by Costa Huang et al. &#8212; This ICLR blog post walks through a Python implementation of RLHF with PPO.</p></li><li><p><a href="https://huggingface.co/docs/trl/en/detoxifying_a_lm">Detoxifying a Language Model using PPO</a> by Hugging Face shows how to use the TRL (Transformer Reinforcement Learning) library to apply PPO to reduce the toxicity of an LLM.</p></li></ul><div><hr></div><h3>Do you want to learn more NLP concepts?</h3><p>Each week I pick one core NLP concept and create a one-slide, one-minute explanation of the concept. 
To receive weekly new posts in your inbox, subscribe here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oneminutenlp.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Reach out to me:</p><ul><li><p>Connect with me on <a href="https://www.linkedin.com/in/herrmannova/">LinkedIn</a></p></li><li><p>Read my technical blog on <a href="https://medium.com/@robodasha">Medium</a></p></li><li><p>Or send me a message by responding to this post</p></li></ul><p>Is there a concept you would like me to cover in a future issue? Let me know!</p><p></p>]]></content:encoded></item><item><title><![CDATA[RLHF]]></title><description><![CDATA[Reinforcement Learning from Human Feedback is a training phase used on LLMs after supervised fine-tuning to further improve LLM responses. 
It was one of the key innovations behind ChatGPT.]]></description><link>https://oneminutenlp.com/p/rlhf</link><guid isPermaLink="false">https://oneminutenlp.com/p/rlhf</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Sat, 15 Feb 2025 23:43:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ii2v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0f464c-c319-4854-89bc-57a4d3f51b9c_960x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ii2v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0f464c-c319-4854-89bc-57a4d3f51b9c_960x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ii2v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0f464c-c319-4854-89bc-57a4d3f51b9c_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!ii2v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0f464c-c319-4854-89bc-57a4d3f51b9c_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!ii2v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0f464c-c319-4854-89bc-57a4d3f51b9c_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!ii2v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0f464c-c319-4854-89bc-57a4d3f51b9c_960x720.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ii2v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0f464c-c319-4854-89bc-57a4d3f51b9c_960x720.png" width="960" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b0f464c-c319-4854-89bc-57a4d3f51b9c_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65663,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ii2v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0f464c-c319-4854-89bc-57a4d3f51b9c_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!ii2v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0f464c-c319-4854-89bc-57a4d3f51b9c_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!ii2v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0f464c-c319-4854-89bc-57a4d3f51b9c_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!ii2v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0f464c-c319-4854-89bc-57a4d3f51b9c_960x720.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"></button></div></div></div></a></figure></div><h1>RLHF</h1><p><strong>Reinforcement Learning from Human Feedback</strong> (RLHF) is a training phase used on LLMs after supervised fine-tuning to further improve LLM responses. In contrast to supervised fine-tuning, where a model learns to mimic responses for given prompts, in RLHF a model learns by generating a response and then receiving a score indicating how good that response is. Because using humans to assign response scores is expensive, scores are typically generated using an LLM <strong>reward model</strong> (RM) trained on human preference pairs (prompt, winning response, losing response). 
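The reward model is typically trained on those preference pairs with a pairwise ranking (Bradley-Terry style) loss; a sketch with hypothetical scalar RM scores:

```python
import math

def preference_loss(score_winner, score_loser):
    # -log sigmoid(winner - loser): small when the RM already ranks
    # the winning response above the losing one, large otherwise.
    margin = score_winner - score_loser
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

loss_good = preference_loss(2.0, -1.0)  # RM agrees with the human label
loss_bad = preference_loss(-1.0, 2.0)   # RM disagrees -> much larger loss
```

Minimizing this loss over many labeled pairs teaches the RM to assign higher scores to responses humans prefer.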
RLHF uses a reinforcement learning algorithm like <strong>proximal policy optimization</strong> (PPO) to update model parameters based on the scores from the RM.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l_TW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e6451d-1420-41a0-acbb-0087a2ea57f8_862x185.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l_TW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e6451d-1420-41a0-acbb-0087a2ea57f8_862x185.png 424w, https://substackcdn.com/image/fetch/$s_!l_TW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e6451d-1420-41a0-acbb-0087a2ea57f8_862x185.png 848w, https://substackcdn.com/image/fetch/$s_!l_TW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e6451d-1420-41a0-acbb-0087a2ea57f8_862x185.png 1272w, https://substackcdn.com/image/fetch/$s_!l_TW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e6451d-1420-41a0-acbb-0087a2ea57f8_862x185.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l_TW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e6451d-1420-41a0-acbb-0087a2ea57f8_862x185.png" width="862" height="185" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6e6451d-1420-41a0-acbb-0087a2ea57f8_862x185.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:185,&quot;width&quot;:862,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:41373,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!l_TW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e6451d-1420-41a0-acbb-0087a2ea57f8_862x185.png 424w, https://substackcdn.com/image/fetch/$s_!l_TW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e6451d-1420-41a0-acbb-0087a2ea57f8_862x185.png 848w, https://substackcdn.com/image/fetch/$s_!l_TW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e6451d-1420-41a0-acbb-0087a2ea57f8_862x185.png 1272w, https://substackcdn.com/image/fetch/$s_!l_TW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e6451d-1420-41a0-acbb-0087a2ea57f8_862x185.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Typical steps involved in RLHF fine-tuning.</figcaption></figure></div><p>RLHF was one of the key innovations behind ChatGPT. Collecting high-quality data for training an RM can be very resource intensive. 
RLHF is often used to improve helpfulness and minimize harmfulness of model responses.</p><h2>Further reading</h2><ul><li><p><a href="https://proceedings.neurips.cc/paper/2020/hash/1f89885d556929e98d3ef9b86448f951-Abstract.html">Learning to summarize with human feedback</a> by Stiennon et al. and <a href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html">Training language models to follow instructions with human feedback</a> by Ouyang et al. were the first works that applied RLHF to LLMs. </p></li><li><p><a href="https://huyenchip.com/2023/05/02/rlhf.html">RLHF: Reinforcement Learning from Human Feedback</a> by Chip Huyen &#8212; This blog post provides an easy to follow explanation of RLHF and the intuition behind it. If you want to go a bit deeper, check out <a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">The Story of RLHF: Origins, Motivations, Techniques, and Modern Applications</a> by Cameron Wolfe.</p></li><li><p>If you prefer videos, have a look at <a href="https://www.youtube.com/watch?v=vJ4SsfmeQlk">Reinforcement Learning with Human Feedback (RLHF) in 4 minutes</a> by Sebastian Raschka.</p></li></ul><div><hr></div><h3>Do you want to learn more NLP concepts?</h3><p>Each week I pick one core NLP concept and create a one-slide, one-minute explanation of the concept. 
To receive weekly new posts in your inbox, subscribe here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oneminutenlp.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Reach out to me:</p><ul><li><p>Connect with me on <a href="https://www.linkedin.com/in/herrmannova/">LinkedIn</a></p></li><li><p>Read my technical blog on <a href="https://medium.com/@robodasha">Medium</a></p></li><li><p>Or send me a message by responding to this post</p></li></ul><p>Is there a concept you would like me to cover in a future issue? Let me know!</p><p></p>]]></content:encoded></item><item><title><![CDATA[ReAct Agent Model]]></title><description><![CDATA[ReAct (Reason + Act) is a design pattern for AI agents that incorporates planning and action execution. It has become a common way to implement agents.]]></description><link>https://oneminutenlp.com/p/react-agent-model</link><guid isPermaLink="false">https://oneminutenlp.com/p/react-agent-model</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Thu, 06 Feb 2025 03:45:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tU_v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25841394-8d8a-407b-80c7-f332ac9b1152_960x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tU_v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25841394-8d8a-407b-80c7-f332ac9b1152_960x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!tU_v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25841394-8d8a-407b-80c7-f332ac9b1152_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!tU_v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25841394-8d8a-407b-80c7-f332ac9b1152_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!tU_v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25841394-8d8a-407b-80c7-f332ac9b1152_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!tU_v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25841394-8d8a-407b-80c7-f332ac9b1152_960x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tU_v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25841394-8d8a-407b-80c7-f332ac9b1152_960x720.png" width="960" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25841394-8d8a-407b-80c7-f332ac9b1152_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63850,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!tU_v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25841394-8d8a-407b-80c7-f332ac9b1152_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!tU_v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25841394-8d8a-407b-80c7-f332ac9b1152_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!tU_v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25841394-8d8a-407b-80c7-f332ac9b1152_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!tU_v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25841394-8d8a-407b-80c7-f332ac9b1152_960x720.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h1>ReAct Agent Model</h1><p><strong>ReAct</strong> is an agent design pattern that incorporates <strong>reasoning</strong> (thinking through steps logically) and <strong>action</strong> (interacting with the environment). Prior prompting strategies either generate reasoning (e.g., chain-of-thought) or take actions (e.g., tool use) separately. In ReAct, an LLM is prompted to plan its next action(s), take action(s), and reflect on results &#8211; this is repeated until the agent considers the task complete. ReAct has been shown to perform well on complex tasks like multi-hop question answering and has become a common agent pattern built upon by other approaches like <strong>Reflexion</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6_78!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfacb559-4388-4826-a442-ceaaef07deaf_809x211.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6_78!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfacb559-4388-4826-a442-ceaaef07deaf_809x211.png 424w, https://substackcdn.com/image/fetch/$s_!6_78!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfacb559-4388-4826-a442-ceaaef07deaf_809x211.png 848w, 
https://substackcdn.com/image/fetch/$s_!6_78!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfacb559-4388-4826-a442-ceaaef07deaf_809x211.png 1272w, https://substackcdn.com/image/fetch/$s_!6_78!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfacb559-4388-4826-a442-ceaaef07deaf_809x211.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6_78!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfacb559-4388-4826-a442-ceaaef07deaf_809x211.png" width="809" height="211" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfacb559-4388-4826-a442-ceaaef07deaf_809x211.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:211,&quot;width&quot;:809,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54175,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6_78!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfacb559-4388-4826-a442-ceaaef07deaf_809x211.png 424w, https://substackcdn.com/image/fetch/$s_!6_78!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfacb559-4388-4826-a442-ceaaef07deaf_809x211.png 848w, 
https://substackcdn.com/image/fetch/$s_!6_78!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfacb559-4388-4826-a442-ceaaef07deaf_809x211.png 1272w, https://substackcdn.com/image/fetch/$s_!6_78!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfacb559-4388-4826-a442-ceaaef07deaf_809x211.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Typical steps implemented in ReAct.</figcaption></figure></div><p>ReAct may struggle if the outcomes of the actions it takes are incomplete or incorrect. It can also incur higher cost and latency than simpler methods because it requires more steps and tokens, and it has been shown to perform best when combined with fine-tuning.</p><h2>Further reading</h2><ul><li><p><a href="https://openreview.net/forum?id=WE_vluYUL-X">ReAct: Synergizing Reasoning and Acting in Language Models</a> by Yao et al. &#8212; This paper introduced the ReAct Agent Model. <a href="https://research.google/blog/react-synergizing-reasoning-and-acting-in-language-models/">This blog post</a> by the paper&#8217;s authors provides a concise summary of the paper.</p></li><li><p>Both Hugging Face <a href="https://huggingface.co/docs/transformers/main/agents">Transformers</a> and LangChain&#8217;s <a href="https://langchain-ai.github.io/langgraph/how-tos/create-react-agent/">LangGraph</a> provide implementations of ReAct. 
You can view the <a href="https://smith.langchain.com/hub/hwchase17/react">ReAct prompt used by LangChain</a> in the LangChain Hub.</p></li><li><p><a href="https://huyenchip.com/2025/01/07/agents.html#agent_overview">Agents</a> by Chip Huyen &#8212; this fantastic blog post covers much more than just ReAct, but if you are looking to understand agents and where ReAct fits in, this is a great place to start.</p></li></ul><div><hr></div><h3>Do you want to learn more NLP concepts?</h3><p>Each week I pick one core NLP concept and create a one-slide, one-minute explanation of the concept. To receive weekly new posts in your inbox, subscribe here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oneminutenlp.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Reach out to me:</p><ul><li><p>Connect with me on <a href="https://www.linkedin.com/in/herrmannova/">LinkedIn</a></p></li><li><p>Read my technical blog on <a href="https://medium.com/@robodasha">Medium</a></p></li><li><p>Or send me a message by responding to this post</p></li></ul><p>Is there a concept you would like me to cover in a future issue? 
Let me know!</p>]]></content:encoded></item><item><title><![CDATA[Knowledge Distillation]]></title><description><![CDATA[Knowledge Distillation is a popular technique for transferring knowledge from large, powerful models to smaller, more efficient models.]]></description><link>https://oneminutenlp.com/p/knowledge-distillation</link><guid isPermaLink="false">https://oneminutenlp.com/p/knowledge-distillation</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Wed, 29 Jan 2025 05:01:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e2ced2-7475-4230-81a1-dad84ab9cad9_960x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iCgc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e2ced2-7475-4230-81a1-dad84ab9cad9_960x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iCgc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e2ced2-7475-4230-81a1-dad84ab9cad9_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!iCgc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e2ced2-7475-4230-81a1-dad84ab9cad9_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!iCgc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e2ced2-7475-4230-81a1-dad84ab9cad9_960x720.png 1272w, 
https://substackcdn.com/image/fetch/$s_!iCgc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e2ced2-7475-4230-81a1-dad84ab9cad9_960x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iCgc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e2ced2-7475-4230-81a1-dad84ab9cad9_960x720.png" width="960" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83e2ced2-7475-4230-81a1-dad84ab9cad9_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75406,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iCgc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e2ced2-7475-4230-81a1-dad84ab9cad9_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!iCgc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e2ced2-7475-4230-81a1-dad84ab9cad9_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!iCgc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e2ced2-7475-4230-81a1-dad84ab9cad9_960x720.png 1272w, 
https://substackcdn.com/image/fetch/$s_!iCgc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e2ced2-7475-4230-81a1-dad84ab9cad9_960x720.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h1>Knowledge Distillation</h1><p><strong>Knowledge Distillation</strong> (KD) is a form of model compression used to transfer knowledge from a large, powerful <strong>teacher</strong> model to a (typically) smaller, more efficient <strong>student</strong> model. 
In contrast to supervised learning where a model is trained using labeled data (inputs and expected outputs &#8212; also known as <strong>hard targets</strong>), in KD, the student is typically also trained using the teacher&#8217;s reasoning (<strong>soft targets</strong>). Different methods exist for extracting the teacher&#8217;s reasoning, such as using its weights, analyzing the probabilities it assigns to possible outputs, or generating a rationale.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7LtA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc664937-34d8-473b-afee-229d10b70823_933x262.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7LtA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc664937-34d8-473b-afee-229d10b70823_933x262.png 424w, https://substackcdn.com/image/fetch/$s_!7LtA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc664937-34d8-473b-afee-229d10b70823_933x262.png 848w, https://substackcdn.com/image/fetch/$s_!7LtA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc664937-34d8-473b-afee-229d10b70823_933x262.png 1272w, https://substackcdn.com/image/fetch/$s_!7LtA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc664937-34d8-473b-afee-229d10b70823_933x262.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!7LtA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc664937-34d8-473b-afee-229d10b70823_933x262.png" width="933" height="262" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc664937-34d8-473b-afee-229d10b70823_933x262.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:262,&quot;width&quot;:933,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:78544,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7LtA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc664937-34d8-473b-afee-229d10b70823_933x262.png 424w, https://substackcdn.com/image/fetch/$s_!7LtA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc664937-34d8-473b-afee-229d10b70823_933x262.png 848w, https://substackcdn.com/image/fetch/$s_!7LtA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc664937-34d8-473b-afee-229d10b70823_933x262.png 1272w, https://substackcdn.com/image/fetch/$s_!7LtA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc664937-34d8-473b-afee-229d10b70823_933x262.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Two example KD methods &#8212; extracting a teacher&#8217;s reasoning from Chain of Thought prompting and from the teacher&#8217;s output distribution.</figcaption></figure></div><p>DistilBERT, a popular Transformer model, is a distilled version of BERT. It&#8217;s 40% smaller and 60% faster, while retaining over 95% of BERT&#8217;s performance. Some speculate that GPT-4o is a distilled version of a larger model.</p><h2>Further reading</h2><ul><li><p><a href="https://arxiv.org/abs/1503.02531">Distilling the Knowledge in a Neural Network</a> by Hinton et al. 
&#8212; This seminal paper formulated the concept of knowledge distillation.</p></li><li><p><a href="https://arxiv.org/abs/2402.13116">A Survey on Knowledge Distillation of Large Language Models</a> by Xu et al. &#8212; If you want to dive deeper, this recent survey provides an overview of different methods for LLM distillation.</p></li><li><p>The SetFit library by HuggingFace for fine-tuning sentence transformers provides a <a href="https://huggingface.co/docs/setfit/en/how_to/knowledge_distillation">Knowledge Distillation</a> guide. OpenAI also provides an <a href="https://openai.com/index/api-model-distillation/">API for knowledge distillation</a>.</p></li></ul><div><hr></div><h3>Do you want to learn more NLP concepts?</h3><p>Each week I pick one core NLP concept and create a one-slide, one-minute explanation of the concept. To receive weekly new posts in your inbox, subscribe here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oneminutenlp.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Reach out to me:</p><ul><li><p>Connect with me on <a href="https://www.linkedin.com/in/herrmannova/">LinkedIn</a></p></li><li><p>Read my technical blog on <a href="https://medium.com/@robodasha">Medium</a></p></li><li><p>Or send me a message by responding to this post</p></li></ul><p>Is there a concept you would like me to cover in a future issue? 
Let me know!</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading One Minute NLP! Subscribe for free to receive weekly new posts in your inbox.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Mixture of Experts]]></title><description><![CDATA[Mixture of Experts (MoE) is an ensemble learning technique that enables creating larger models without increasing training and inference cost.]]></description><link>https://oneminutenlp.com/p/mixture-of-experts</link><guid isPermaLink="false">https://oneminutenlp.com/p/mixture-of-experts</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Sun, 19 Jan 2025 04:40:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5G5Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf75c095-c525-43c7-afb4-360511b810a2_960x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5G5Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf75c095-c525-43c7-afb4-360511b810a2_960x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!5G5Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf75c095-c525-43c7-afb4-360511b810a2_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!5G5Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf75c095-c525-43c7-afb4-360511b810a2_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!5G5Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf75c095-c525-43c7-afb4-360511b810a2_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!5G5Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf75c095-c525-43c7-afb4-360511b810a2_960x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5G5Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf75c095-c525-43c7-afb4-360511b810a2_960x720.png" width="960" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf75c095-c525-43c7-afb4-360511b810a2_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:79699,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!5G5Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf75c095-c525-43c7-afb4-360511b810a2_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!5G5Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf75c095-c525-43c7-afb4-360511b810a2_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!5G5Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf75c095-c525-43c7-afb4-360511b810a2_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!5G5Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf75c095-c525-43c7-afb4-360511b810a2_960x720.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h1>Mixture of Experts</h1><p>Mixture of Experts (MoE) is an ensemble learning technique that trains multiple &#8220;expert&#8221; (sub-)models and a &#8220;gate&#8221; network that determines which expert(s) to use for a particular input. In language models, MoE is typically implemented as a sparse layer composed of multiple expert sub-layers and a router; each expert can be a simple feed-forward network (FFN) or a more complex network, and the router is typically a linear layer with a softmax function. MoE enables larger models (more parameters) while keeping training and inference cheap, because only a subset of the experts is active for a given input (this is called conditional computation).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UVYd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a74230e-860f-4216-b096-cccdf790581d_809x218.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UVYd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a74230e-860f-4216-b096-cccdf790581d_809x218.png 424w, https://substackcdn.com/image/fetch/$s_!UVYd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a74230e-860f-4216-b096-cccdf790581d_809x218.png 848w, 
https://substackcdn.com/image/fetch/$s_!UVYd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a74230e-860f-4216-b096-cccdf790581d_809x218.png 1272w, https://substackcdn.com/image/fetch/$s_!UVYd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a74230e-860f-4216-b096-cccdf790581d_809x218.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UVYd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a74230e-860f-4216-b096-cccdf790581d_809x218.png" width="809" height="218" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a74230e-860f-4216-b096-cccdf790581d_809x218.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:218,&quot;width&quot;:809,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85323,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UVYd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a74230e-860f-4216-b096-cccdf790581d_809x218.png 424w, https://substackcdn.com/image/fetch/$s_!UVYd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a74230e-860f-4216-b096-cccdf790581d_809x218.png 848w, 
https://substackcdn.com/image/fetch/$s_!UVYd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a74230e-860f-4216-b096-cccdf790581d_809x218.png 1272w, https://substackcdn.com/image/fetch/$s_!UVYd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a74230e-860f-4216-b096-cccdf790581d_809x218.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Comparison between a dense layer and a sparse layer used in MoE.</figcaption></figure></div><p>Mixtral 8x7B is an example of a MoE. In Mixtral, the feed forward layer of each transformer block is replaced by a MoE layer. Each token in a given input sequence activates a different set of experts.</p><h2>Further reading</h2><ul><li><p><a href="https://huggingface.co/blog/moe">Mixture of Experts Explained</a> by Sanseviero et al. &#8212; This article provides an overview of MoE including the history, challenges, and current developments.</p></li><li><p><a href="https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts">A Visual Guide to Mixture of Experts (MoE)</a> by Maarten Grootendorst &#8212; If you are a visual learner, this guide does a great job of breaking the technique down into individual components and explaining the intuition behind them.</p></li><li><p><a href="https://arxiv.org/abs/2407.06204">A Survey on Mixture of Experts</a> by Cai et al. &#8212; If you want to dive deeper, this survey covers the different ways of building MoE models.</p></li></ul><div><hr></div><h3>Do you want to learn more NLP concepts?</h3><p>Each week I pick one core NLP concept and create a one-slide, one-minute explanation of the concept. 
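Returning to the MoE layer described above (router as a linear layer plus softmax, only the top-k experts active per token), the computation can be sketched in a few lines of NumPy. This is a toy illustration with made-up shapes, not Mixtral's actual implementation, and each "expert" is reduced to a single linear map standing in for an FFN:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2                 # hidden size, expert count, experts per token

# Each expert is a stand-in for an FFN (a single linear map here, for brevity).
experts = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]
router_w = rng.normal(size=(d, n_experts))     # router: a linear layer followed by softmax

def moe_layer(x):
    logits = x @ router_w
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                         # softmax over the experts
    chosen = np.argsort(gate)[-top_k:]         # conditional computation: keep only top-k
    w = gate[chosen] / gate[chosen].sum()      # renormalize the kept gate weights
    return sum(wi * (x @ experts[i]) for i, wi in zip(chosen, w))

token = rng.normal(size=d)
out = moe_layer(token)                         # only top_k of the n_experts ran for this token
```

Because only `top_k` expert matrices are multiplied per token, compute scales with the number of active experts rather than the total parameter count.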
To receive weekly new posts in your inbox, subscribe here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oneminutenlp.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Reach out to me:</p><ul><li><p>Connect with me on <a href="https://www.linkedin.com/in/herrmannova/">LinkedIn</a></p></li><li><p>Read my technical blog on <a href="https://medium.com/@robodasha">Medium</a></p></li><li><p>Or send me a message by responding to this post</p></li></ul><p>Is there a concept you would like me to cover in a future issue? Let me know!</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading One Minute NLP! Subscribe for free to receive weekly new posts in your inbox.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Low-Rank Adaptation]]></title><description><![CDATA[Low-Rank Adaptation (LoRA) is a popular method for Parameter-Efficient Fine-Tuning of Large Language Models. 
LoRA significantly improves fine-tuning efficiency and decreases storage requirements.]]></description><link>https://oneminutenlp.com/p/low-rank-adaptation</link><guid isPermaLink="false">https://oneminutenlp.com/p/low-rank-adaptation</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Thu, 25 Jul 2024 20:04:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0PIL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f91e368-9eb2-49c1-a915-62fe1860572f_960x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0PIL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f91e368-9eb2-49c1-a915-62fe1860572f_960x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0PIL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f91e368-9eb2-49c1-a915-62fe1860572f_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!0PIL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f91e368-9eb2-49c1-a915-62fe1860572f_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!0PIL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f91e368-9eb2-49c1-a915-62fe1860572f_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!0PIL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f91e368-9eb2-49c1-a915-62fe1860572f_960x720.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!0PIL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f91e368-9eb2-49c1-a915-62fe1860572f_960x720.png" width="960" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f91e368-9eb2-49c1-a915-62fe1860572f_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:76292,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0PIL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f91e368-9eb2-49c1-a915-62fe1860572f_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!0PIL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f91e368-9eb2-49c1-a915-62fe1860572f_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!0PIL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f91e368-9eb2-49c1-a915-62fe1860572f_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!0PIL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f91e368-9eb2-49c1-a915-62fe1860572f_960x720.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Low-Rank Adaptation</h1><p>Low-Rank Adaptation (LoRA) is a popular method for Parameter-Efficient Fine-Tuning (PEFT) of Large Language Models. Fine-tuning an LLM can significantly improve performance but can be prohibitively costly due to model size. Instead of updating all weights, LoRA freezes the original weights <em>W</em> and trains only a weight update <em>&#916;W</em>, represented as the product of two smaller rank-decomposition matrices <em>A</em> and <em>B</em>, which are far more efficient to train and store. 
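The decomposition just described can be sketched in NumPy (a minimal illustration of the idea, not the actual PEFT implementation; the dimension and rank below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                        # model dimension and LoRA rank, with r << d

W = rng.normal(size=(d, d))          # pretrained weights: frozen, never updated
A = rng.normal(size=(r, d)) * 0.01   # trainable rank-decomposition matrix A
B = np.zeros((d, r))                 # trainable B, zero-initialized so dW = B @ A starts at 0

def lora_forward(x):
    # Forward pass: W x plus the low-rank update (B A) x.
    return W @ x + B @ (A @ x)

# Only A and B are trained: 2*r*d parameters instead of d*d.
trainable = A.size + B.size          # 8_192 vs 262_144 full parameters
x = rng.normal(size=d)
y = lora_forward(x)                  # equals W @ x at initialization, since B = 0
```

Zero-initializing B means the adapted model starts out exactly equal to the pretrained model, which is also how the original paper initializes the update.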
During inference, this weight update is added to the original weights.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LXL5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab94b67-4efa-45e2-9a77-3121d6c88c45_1284x364.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LXL5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab94b67-4efa-45e2-9a77-3121d6c88c45_1284x364.png 424w, https://substackcdn.com/image/fetch/$s_!LXL5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab94b67-4efa-45e2-9a77-3121d6c88c45_1284x364.png 848w, https://substackcdn.com/image/fetch/$s_!LXL5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab94b67-4efa-45e2-9a77-3121d6c88c45_1284x364.png 1272w, https://substackcdn.com/image/fetch/$s_!LXL5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab94b67-4efa-45e2-9a77-3121d6c88c45_1284x364.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LXL5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab94b67-4efa-45e2-9a77-3121d6c88c45_1284x364.png" width="1284" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ab94b67-4efa-45e2-9a77-3121d6c88c45_1284x364.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1284,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57928,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LXL5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab94b67-4efa-45e2-9a77-3121d6c88c45_1284x364.png 424w, https://substackcdn.com/image/fetch/$s_!LXL5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab94b67-4efa-45e2-9a77-3121d6c88c45_1284x364.png 848w, https://substackcdn.com/image/fetch/$s_!LXL5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab94b67-4efa-45e2-9a77-3121d6c88c45_1284x364.png 1272w, https://substackcdn.com/image/fetch/$s_!LXL5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab94b67-4efa-45e2-9a77-3121d6c88c45_1284x364.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Comparison between full fine-tuning and LoRA adaptation.</figcaption></figure></div><p>For example, LoRA can reduce the number of trainable parameters of GPT-3 from 175B to 37.7M while performing as well as if all weights were fully fine-tuned. LoRA can be applied to any subset of weight matrices of a model (most commonly the attention and/or the feedforward layers). Multiple LoRA modules can be trained for different tasks and swapped during inference.</p><h2>Further Reading </h2><ul><li><p><a href="https://arxiv.org/abs/2106.09685">LoRA: Low-Rank Adaptation of Large Language Models</a> by Hu et al. &#8212; This paper first introduced the LoRA technique.</p></li><li><p><a href="https://arxiv.org/pdf/2303.15647">Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning</a> by Lialin et al. 
&#8212; this paper presents a detailed survey (albeit a bit outdated) and a taxonomy of parameter-efficient fine-tuning methods.</p></li><li><p><a href="https://huggingface.co/docs/peft/">PEFT</a> is an implementation of various Parameter-Efficient Fine-Tuning techniques including LoRA built by Hugging Face.</p></li></ul><div><hr></div><h3>Do you want to learn more NLP concepts?</h3><p>Each week I pick one core NLP concept and create a one-slide, one-minute explanation of the concept. To receive weekly new posts in your inbox, subscribe here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oneminutenlp.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Reach out to me:</p><ul><li><p>Connect with me on <a href="https://www.linkedin.com/in/herrmannova/">LinkedIn</a></p></li><li><p>Read my technical blog on <a href="https://medium.com/@robodasha">Medium</a></p></li><li><p>Or send me a message by responding to this post</p></li></ul><p>Is there a concept you would like me to cover in a future issue? Let me know!</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading One Minute NLP! 
Subscribe for free to receive weekly new posts in your inbox.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Temperature Sampling]]></title><description><![CDATA[Temperature is a common LLM hyperparameter that controls the randomness of the model's output. This post explains how temperature sampling works.]]></description><link>https://oneminutenlp.com/p/temperature-sampling</link><guid isPermaLink="false">https://oneminutenlp.com/p/temperature-sampling</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Mon, 15 Jul 2024 03:48:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-KyL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30b6b2-1371-4b81-ba50-9b2e90d3ad4c_960x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-KyL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30b6b2-1371-4b81-ba50-9b2e90d3ad4c_960x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-KyL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30b6b2-1371-4b81-ba50-9b2e90d3ad4c_960x720.png 424w, 
https://substackcdn.com/image/fetch/$s_!-KyL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30b6b2-1371-4b81-ba50-9b2e90d3ad4c_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!-KyL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30b6b2-1371-4b81-ba50-9b2e90d3ad4c_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!-KyL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30b6b2-1371-4b81-ba50-9b2e90d3ad4c_960x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-KyL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30b6b2-1371-4b81-ba50-9b2e90d3ad4c_960x720.png" width="960" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e30b6b2-1371-4b81-ba50-9b2e90d3ad4c_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67544,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-KyL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30b6b2-1371-4b81-ba50-9b2e90d3ad4c_960x720.png 424w, 
https://substackcdn.com/image/fetch/$s_!-KyL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30b6b2-1371-4b81-ba50-9b2e90d3ad4c_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!-KyL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30b6b2-1371-4b81-ba50-9b2e90d3ad4c_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!-KyL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30b6b2-1371-4b81-ba50-9b2e90d3ad4c_960x720.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Sampling is a common method used by LLMs for generating output tokens: the next token is randomly chosen based on the token probabilities learned by the LLM. Temperature sampling reshapes the probability distribution over tokens by introducing a scaling factor &#964; (temperature):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y=\\text{softmax}(\\frac{\\text{logits}}{\\tau})&quot;,&quot;id&quot;:&quot;BEWWBAKIOK&quot;}" data-component-name="LatexBlockToDOM"></div><p>i.e., the raw model scores are divided by &#964; before normalizing with softmax.</p><p>If &#964;=1.0, the probabilities are unchanged. As &#964; approaches 0, the probability of high-probability words increases and the probability of low-probability words decreases (i.e., randomness is reduced and the model is more likely to pick a high-probability word). Setting &#964; to a value greater than 1 has the opposite effect. 
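The scaling step above can be sketched in a few lines of NumPy (an illustrative snippet, not code from the post; the toy logits are made up):

```python
import numpy as np

def temperature_softmax(logits, tau):
    # Divide the raw model scores by the temperature, then normalize with softmax.
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()                                # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]                        # toy scores for a 3-token vocabulary
p_cold = temperature_softmax(logits, 0.5)       # tau < 1: distribution sharpens
p_base = temperature_softmax(logits, 1.0)       # tau = 1: plain softmax, unchanged
p_hot  = temperature_softmax(logits, 2.0)       # tau > 1: distribution flattens
```

Comparing the three distributions, the highest-scoring token's probability grows as tau shrinks and the distribution approaches uniform as tau grows.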
Top-k and top-p sampling are alternatives to and can be used in conjunction with temperature sampling.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nQmh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9db3f5-e8be-42ed-b07f-557a3662e00d_1377x294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nQmh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9db3f5-e8be-42ed-b07f-557a3662e00d_1377x294.png 424w, https://substackcdn.com/image/fetch/$s_!nQmh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9db3f5-e8be-42ed-b07f-557a3662e00d_1377x294.png 848w, https://substackcdn.com/image/fetch/$s_!nQmh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9db3f5-e8be-42ed-b07f-557a3662e00d_1377x294.png 1272w, https://substackcdn.com/image/fetch/$s_!nQmh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9db3f5-e8be-42ed-b07f-557a3662e00d_1377x294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nQmh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9db3f5-e8be-42ed-b07f-557a3662e00d_1377x294.png" width="1377" height="294" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f9db3f5-e8be-42ed-b07f-557a3662e00d_1377x294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:294,&quot;width&quot;:1377,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51754,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nQmh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9db3f5-e8be-42ed-b07f-557a3662e00d_1377x294.png 424w, https://substackcdn.com/image/fetch/$s_!nQmh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9db3f5-e8be-42ed-b07f-557a3662e00d_1377x294.png 848w, https://substackcdn.com/image/fetch/$s_!nQmh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9db3f5-e8be-42ed-b07f-557a3662e00d_1377x294.png 1272w, https://substackcdn.com/image/fetch/$s_!nQmh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9db3f5-e8be-42ed-b07f-557a3662e00d_1377x294.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Example probabilities based on different temperature settings</figcaption></figure></div><h2>Further Reading </h2><ul><li><p><a href="https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_settings_are_probably_hurting_your_model_why/">Your settings are (probably) hurting your model - Why sampler settings matter</a> (reddit.com/r/LocalLLaMA post by 
kindacognizant) &#8212; this post provides a fantastic explanation of how different temperature values affect model outputs and how temperature works together with top-k and top-p sampling.</p></li><li><p><a href="https://arxiv.org/abs/2402.05201">The Effect of Sampling Temperature on Problem Solving in Large Language Models</a> by Renze and Guven &#8212; This recent work presents a detailed investigation of the effect of different temperature settings on LLM performance on MCQA problems.</p></li><li><p><a href="https://web.stanford.edu/~jurafsky/slp3/">Speech and Language Processing</a> by Jurafsky and Martin (free to read online) &#8212; Section 10.8 (Large Language Models: Generation by Sampling) provides a great explanation of different sampling techniques.</p></li></ul><div><hr></div><h3>Do you want to learn more NLP concepts?</h3><p>Each week I pick one core NLP concept and create a one-slide, one-minute explanation of the concept. To receive weekly new posts in your inbox, subscribe here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oneminutenlp.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Reach out to me:</p><ul><li><p>Connect with me on <a href="https://www.linkedin.com/in/herrmannova/">LinkedIn</a></p></li><li><p>Read my technical blog on <a href="https://medium.com/@robodasha">Medium</a></p></li><li><p>Or send me a message by responding to this post</p></li></ul><p>Is there a concept you would like me to cover in a future issue? 
Let me know!</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://oneminutenlp.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading One Minute NLP! Subscribe for free to receive weekly new posts in your inbox.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Byte-Pair Encoding Algorithm]]></title><description><![CDATA[The focus of the second issue of One Minute NLP is BPE, a popular tokenization algorithm which is used by most LLMs including GPT, Llama, and Mistral.]]></description><link>https://oneminutenlp.com/p/byte-pair-encoding-algorithm</link><guid isPermaLink="false">https://oneminutenlp.com/p/byte-pair-encoding-algorithm</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Mon, 08 Jul 2024 02:58:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IPqB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82dd9195-5170-452a-95c5-31c67d68c38d_960x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IPqB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82dd9195-5170-452a-95c5-31c67d68c38d_960x720.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IPqB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82dd9195-5170-452a-95c5-31c67d68c38d_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!IPqB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82dd9195-5170-452a-95c5-31c67d68c38d_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!IPqB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82dd9195-5170-452a-95c5-31c67d68c38d_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!IPqB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82dd9195-5170-452a-95c5-31c67d68c38d_960x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IPqB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82dd9195-5170-452a-95c5-31c67d68c38d_960x720.png" width="960" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82dd9195-5170-452a-95c5-31c67d68c38d_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:70204,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!IPqB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82dd9195-5170-452a-95c5-31c67d68c38d_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!IPqB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82dd9195-5170-452a-95c5-31c67d68c38d_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!IPqB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82dd9195-5170-452a-95c5-31c67d68c38d_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!IPqB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82dd9195-5170-452a-95c5-31c67d68c38d_960x720.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h1>Byte-Pair Encoding Algorithm</h1><p>Byte-Pair Encoding (BPE) is a tokenization algorithm used by most LLMs (e.g., GPT, Llama) to build their vocabularies. Tokens created by BPE are called <strong>subwords</strong> because they are often smaller than words; they are typically the most frequent character sequences in the pretraining corpus. To determine these subwords, BPE starts with individual characters as the token set and then iteratively merges the most frequent pair of consecutive tokens into a new token until a target vocabulary size is reached. BPE can handle unknown words at inference time because any new word can be represented as some sequence of existing subwords. <strong>WordPiece</strong> and <strong>SentencePiece</strong> tokenization are popular alternatives to BPE and work similarly.</p><p>Consider an example corpus: &#8220;llama llama red pajama&#8221;.</p><p>In BPE tokenization, the starting vocabulary consists of individual characters. 
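As a quick illustration, the merge loop can be sketched in a few lines of Python. This is a simplified toy version, not a production tokenizer: real BPE implementations typically work on bytes or pre-split words and define explicit tie-breaking rules, whereas here ties simply go to the pair encountered first.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges from a whitespace-split corpus (toy version)."""
    # Start from single characters: "llama" -> ["l", "l", "a", "m", "a"]
    words = [list(word) for word in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count every pair of consecutive tokens across the corpus.
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        # Most frequent pair; ties go to the pair seen first.
        a, b = max(pairs, key=pairs.get)
        merges.append((a, b))
        # Replace every occurrence of the pair with the merged token.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == (a, b):
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, words

merges, words = bpe_train("llama llama red pajama", num_merges=5)
print(merges)                                # learned merges, in order
print(" ".join("-".join(w) for w in words))  # llama llama re-d p-a-j-ama
```

Run with five merges on the example corpus, this sketch reproduces the iterations traced below.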
The corresponding corpus representation would look like this (dashes indicate token boundaries):</p><blockquote><p><strong>Iteration:</strong> 0</p><p><strong>Vocabulary:</strong> l a m r e d p j</p><p><strong>Corpus representation:</strong> l-l-a-m-a l-l-a-m-a r-e-d p-a-j-a-m-a</p></blockquote><p>Vocabulary and corpus representation after the first iteration (a+m were the most frequent consecutive tokens and are merged into a new token am):</p><blockquote><p><strong>Iteration:</strong> 1</p><p><strong>Vocabulary:</strong> l a m r e d p j am </p><p><strong>Corpus representation:</strong> l-l-am-a l-l-am-a r-e-d p-a-j-am-a</p></blockquote><p>Vocabulary and corpus representation after the second iteration:</p><blockquote><p><strong>Iteration:</strong> 2</p><p><strong>Vocabulary:</strong> l a m r e d p j am ama</p><p><strong>Corpus representation:</strong> l-l-ama l-l-ama r-e-d p-a-j-ama</p></blockquote><p>Vocabulary and corpus representation after the fifth iteration:</p><blockquote><p><strong>Iteration:</strong> 5</p><p><strong>Vocabulary:</strong> l a m r e d p j am ama ll llama re</p><p><strong>Corpus representation:</strong> llama llama re-d p-a-j-ama</p></blockquote><h2>Further Reading</h2><ul><li><p><a href="https://aclanthology.org/P16-1162/">Neural Machine Translation of Rare Words with Subword Units</a> by Sennrich et al. 
&#8212; This work was the first to introduce the BPE algorithm to NLP through an application in Neural Machine Translation.</p></li><li><p><a href="https://huggingface.co/docs/transformers/en/tokenizer_summary">Summary of the Tokenizers</a> (Hugging Face docs) &#8212; this page provides a fantastic overview of different tokenization techniques, including a detailed explanation of the BPE algorithm.</p></li><li><p>There are many implementations of BPE; three notable ones are the <a href="https://github.com/openai/tiktoken">tiktoken</a> library (used by OpenAI&#8217;s models), <a href="https://github.com/karpathy/minbpe">minbpe</a> (Andrej Karpathy&#8217;s minimal implementation), and <a href="https://github.com/google/sentencepiece">sentencepiece</a> (Google&#8217;s implementation of BPE which, unlike the original implementation, doesn&#8217;t require input text to be split into words).</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Perplexity]]></title><description><![CDATA[Welcome to the first issue of One Minute NLP! The focus of this issue is perplexity, a metric commonly used to evaluate language models.]]></description><link>https://oneminutenlp.com/p/perplexity</link><guid isPermaLink="false">https://oneminutenlp.com/p/perplexity</guid><dc:creator><![CDATA[Dasha Herrmannova]]></dc:creator><pubDate>Fri, 28 Jun 2024 12:03:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eJMT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7f1c6d1-573d-4bf7-aec8-a811bd5413a2_960x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eJMT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7f1c6d1-573d-4bf7-aec8-a811bd5413a2_960x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!eJMT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7f1c6d1-573d-4bf7-aec8-a811bd5413a2_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!eJMT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7f1c6d1-573d-4bf7-aec8-a811bd5413a2_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!eJMT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7f1c6d1-573d-4bf7-aec8-a811bd5413a2_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!eJMT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7f1c6d1-573d-4bf7-aec8-a811bd5413a2_960x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eJMT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7f1c6d1-573d-4bf7-aec8-a811bd5413a2_960x720.png" width="960" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7f1c6d1-573d-4bf7-aec8-a811bd5413a2_960x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58531,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!eJMT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7f1c6d1-573d-4bf7-aec8-a811bd5413a2_960x720.png 424w, https://substackcdn.com/image/fetch/$s_!eJMT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7f1c6d1-573d-4bf7-aec8-a811bd5413a2_960x720.png 848w, https://substackcdn.com/image/fetch/$s_!eJMT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7f1c6d1-573d-4bf7-aec8-a811bd5413a2_960x720.png 1272w, https://substackcdn.com/image/fetch/$s_!eJMT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7f1c6d1-573d-4bf7-aec8-a811bd5413a2_960x720.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>Perplexity</h2><p>Perplexity (usually abbreviated <em>PP</em> or <em>PPL</em>) is a metric commonly used to evaluate language models. Perplexity measures how uncertain (or &#8220;perplexed&#8221;) a model is about the predictions it makes. The lower the perplexity, the better a model predicts the test set. Perplexity usually correlates well with improvements on real-world tasks, but lower perplexity is not a guarantee of better task performance. The perplexity of two models is only comparable if the models use the same vocabulary.</p><p>Given an example text sequence <em>X</em>: <em>The quick fox jumps over the lazy dog, </em>we can calculate <em>PP</em> from the average log-probability of predicting each word given the words that came before it: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;PP(X)=2^{-l},&quot;,&quot;id&quot;:&quot;EHGDYRKCSY&quot;}" data-component-name="LatexBlockToDOM"></div><p>where</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;l = \\frac{\\log{p(\\textup{The})}+\\log{p(\\textup{quick}|\\textup{The})}+\\log{p(\\textup{fox}|\\textup{The quick})}+\\cdots}{N}&quot;,&quot;id&quot;:&quot;VWGARCPIYQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>and <em>N</em> is the number of words in the sequence (<em>N=8</em> for our test sequence).</p><p>Formally:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;PP(X)=2^{-\\frac{1}{N}\\sum^{N}_{i}{\\log{p(x_i|x_1,x_2,\\cdots,x_{i-1})}}}&quot;,&quot;id&quot;:&quot;RYRSGLIRVE&quot;}" data-component-name="LatexBlockToDOM"></div><h3>Further Reading</h3><ul><li><p><a 
href="https://web.stanford.edu/~jurafsky/slp3/">Speech and Language Processing</a> by Jurafsky and Martin (free to read online) &#8212; Section 3.3 (Evaluating Language Models: Perplexity) provides a great explanation of the metric.</p></li><li><p><a href="https://arxiv.org/abs/2405.14782">Lessons from the Trenches on Reproducible Evaluation of Language Models</a> by Biderman et al. &#8212; this paper includes some practical considerations for using perplexity to compare the language modeling performance of different models (Appendix A.3).</p></li><li><p><a href="https://huggingface.co/docs/transformers/en/perplexity">Perplexity of fixed-length models</a> (Hugging Face docs) &#8212; how to calculate perplexity with the Hugging Face Transformers library.</p></li></ul>]]></content:encoded></item></channel></rss>