Author: Andrej Karpathy • Source: karpathy.bearblog.dev • Date Read: December 25, 2025



17 cards • Last updated: Dec 2025


Flashcards

1. Modern LLM Production Stack

What are the four stages of the modern LLM production stack as of 2025?
  1. Pretraining
  2. Supervised Finetuning (SFT)
  3. RLHF (Reinforcement Learning from Human Feedback)
  4. RLVR (Reinforcement Learning from Verifiable Rewards)

Tags: #LLM #training #RLVR

2. RLVR Definition

What is RLVR (Reinforcement Learning from Verifiable Rewards)?

Training LLMs against auto-verifiable rewards (math/code puzzles), causing them to spontaneously develop reasoning strategies through optimization rather than human prescription.

Tags: #RLVR #reasoning #training
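To make the idea concrete, here is a minimal sketch of what an auto-verifiable reward could look like for a math puzzle: the reward is computed by a program that checks the model's final answer against a known solution, with no human judgment in the loop. The function name and the "Answer:" extraction convention are illustrative assumptions, not anything specified in the post.

```python
import re

def verifiable_math_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the known solution, else 0.0.

    Illustrative convention (an assumption, not from the post): the model is
    asked to end its response with a line like "Answer: 42".
    """
    match = re.search(r"Answer:\s*(.+)", model_output)
    if match is None:
        return 0.0  # no parseable final answer -> no reward
    predicted = match.group(1).strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0

# Example: a correct, checkable answer earns the full reward.
print(verifiable_math_reward("Let me reason step by step...\nAnswer: 42", "42"))  # 1.0
```

Because the reward is computed programmatically, it can be applied to millions of rollouts without human raters, which is what makes the long optimization runs described in the next card feasible.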

3. Computational Intensity Comparison

How does RLVR differ from SFT and RLHF in terms of computational intensity?

SFT and RLHF are relatively thin/short stages (minor finetunes), while RLVR allows much longer optimization runs because it trains against objective, non-gameable reward functions. This shifted compute from pretraining to RL, resulting in similar-sized LLMs but longer RL runs.

Tags: #RLVR #compute #optimization

4. Test-Time Compute Scaling

What new "knob" did RLVR introduce for controlling LLM capability?

Test-time compute scaling—the ability to increase capability by generating longer reasoning traces and increasing "thinking time" during inference, with its own associated scaling law.

Tags: #RLVR #scaling #inference
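A hedged sketch of what this knob looks like from the caller's side: capability is traded against inference cost by allowing the model a larger budget of reasoning tokens before it must answer. The `generate` function and `thinking_budget` parameter below are hypothetical stand-ins for whatever interface a given model exposes, not a real API.

```python
def generate(prompt: str, thinking_budget: int) -> str:
    # Stand-in: a real implementation would let the model emit up to
    # 'thinking_budget' hidden reasoning tokens before its final answer.
    return f"<answer produced with a {thinking_budget}-token thinking budget>"

prompt = "Prove that the sum of two even integers is even."

# Same model, same weights: only the test-time compute changes.
for budget in (256, 1024, 4096, 16384):
    print(budget, generate(prompt, thinking_budget=budget))
```

In practice each answer would be scored, and quality generally rises with the budget, which is the new scaling law referenced in card 9.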

5. Emergent Reasoning Strategies

What reasoning strategies emerged spontaneously from RLVR training?

Models spontaneously developed strategies like:

  • Reflection and double-checking answers
  • Trying multiple approaches
  • Breaking down complex problems
  • Backtracking when stuck

These emerged through optimization rather than being explicitly programmed.

Tags: #RLVR #reasoning #emergence

6. Impact on Training Compute Allocation

How did RLVR change the allocation of compute between pretraining and RL stages?

RLVR shifted significant compute from pretraining to the RL stage. While pretraining used to dominate compute usage, RLVR allows for much longer RL optimization runs, rebalancing the compute distribution across the training pipeline.

Tags: #RLVR #compute #training

7. Key Requirement for RLVR

What is the key requirement for effective RLVR training?

Auto-verifiable rewards that are objective and non-gameable. Examples include math puzzles and code problems where correctness can be automatically verified without human judgment.

Tags: #RLVR #verification #rewards
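For code problems, the same idea is usually realized by executing the candidate program against held-out unit tests; a pass/fail outcome is objective and hard to game. The sketch below is a simplified illustration (running untrusted code would need sandboxing in practice), and the task format is an assumption, not something specified in the post.

```python
def verify_code_solution(candidate_src: str, tests: list[tuple[tuple, object]]) -> float:
    """Reward = fraction of unit tests the candidate's solve() function passes.

    NOTE: exec'ing model-written code like this is only safe inside a sandbox;
    this sketch skips that for brevity.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)           # define the candidate's solve()
        solve = namespace["solve"]
        passed = sum(1 for args, expected in tests if solve(*args) == expected)
        return passed / len(tests)
    except Exception:
        return 0.0                               # crashes or a missing solve() earn nothing

# Example task: return the maximum of two numbers.
candidate = "def solve(a, b):\n    return a if a > b else b\n"
print(verify_code_solution(candidate, [((1, 2), 2), ((5, 3), 5), ((0, 0), 0)]))  # 1.0
```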

8. RLVR vs Human Feedback

How does RLVR differ from RLHF in terms of feedback mechanism?

RLHF relies on human preferences and feedback, while RLVR uses automatically verifiable rewards from objective tasks (like math/code puzzles) that don't require human judgment.

Tags: #RLVR #RLHF #feedback

9. Scaling Laws Evolution

What new scaling law did RLVR introduce to the LLM ecosystem?

Test-time compute scaling law—the relationship between inference-time compute (thinking time) and model capability, adding to the existing pretraining scaling laws.

Tags: #scaling #RLVR #inference
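The post does not give an equation; the usual way such relationships are summarized is as a rough power law, analogous in shape to the pretraining scaling laws. The form below is illustrative only (an assumption), with error falling smoothly as the test-time compute budget grows:

```latex
% Illustrative only: error on a task family as a function of test-time compute C_test,
% by analogy with pretraining scaling laws (not a formula from the post).
\[
  \mathrm{Error}(C_{\text{test}}) \;\approx\; E_{\infty} + \frac{A}{C_{\text{test}}^{\,\alpha}},
  \qquad \alpha > 0,
\]
% where E_infty is the irreducible error and A, alpha are task- and model-dependent fit constants.
```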

10. Year of RLVR Breakthrough

According to Karpathy, what year marked the breakthrough for RLVR in LLM development?

2025—Karpathy identifies this as the year when RLVR became a major component of the LLM production stack, fundamentally changing how models are trained and scaled.

Tags: #RLVR #timeline #2025

11. OpenAI Model Inflection Points

Which OpenAI models marked the inflection point for RLVR capabilities?

OpenAI o1 (late 2024) was the first demonstration of an RLVR model, but o3 (early 2025) was the obvious inflection point where the difference became intuitively noticeable.

Tags: #OpenAI #RLVR #models

12. Ghosts vs. Animals Metaphor

What is the "Ghosts vs. Animals" metaphor for understanding LLM intelligence?

We're not "evolving/growing animals" but "summoning ghosts." LLMs have a fundamentally different architecture, training data, algorithms, and optimization pressures than biological intelligence: they are optimized to imitate text and collect rewards rather than to survive. That makes the animal lens a misleading way to think about them.

Tags: #metaphor #intelligence #LLM

13. Jagged Intelligence

What is "jagged intelligence" in the context of LLMs?

LLMs display uneven performance characteristics—they "spike" in capability in verifiable domains (where RLVR applies) while remaining weak elsewhere. They can simultaneously be a "genius polymath" and a "confused grade schooler" who can be tricked by simple jailbreaks.

Tags: #intelligence #capabilities #RLVR

14. LLM App Layer Functions

What are the four key functions of an "LLM app" layer like Cursor?
  1. Context engineering
  2. Orchestrating multiple LLM calls in complex DAGs while balancing performance/cost (see the sketch after this card)
  3. Application-specific GUI for human-in-the-loop
  4. An "autonomy slider" for user control

Tags: #LLMapp #architecture #Cursor
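As a rough illustration of point 2, an LLM app typically fans a task out across several model calls arranged as a small DAG, using cheaper models for easy nodes and an expensive one for the hard final node, with the engineered context flowing between them. Everything below (the `call_llm` stub, model names, node layout) is hypothetical, not a description of Cursor's internals.

```python
# Hypothetical sketch of DAG-style orchestration in an LLM app.
# 'call_llm' stands in for a real model API; model names are placeholders.
def call_llm(model: str, prompt: str) -> str:
    return f"[{model} output for: {prompt[:40]}...]"

def answer_codebase_question(question: str, files: list[str]) -> str:
    # Node 1 (cheap model): pick the relevant files -> context engineering.
    relevant = call_llm("small-fast-model", f"Pick relevant files for: {question}\n{files}")
    # Node 2 (cheap model, could run per file in parallel): summarize what was chosen.
    summaries = call_llm("small-fast-model", f"Summarize: {relevant}")
    # Node 3 (expensive model): answer from the engineered context.
    return call_llm("big-reasoning-model", f"Context:\n{summaries}\n\nQuestion: {question}")

print(answer_codebase_question("Where is auth handled?", ["auth.py", "db.py", "ui.py"]))
```

The application GUI and the autonomy slider then decide how much of such a pipeline runs before the human is asked to review.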

15. LLM Labs vs Apps Division

How does Karpathy predict the division between LLM labs and LLM apps will evolve?

LLM labs will graduate "generally capable college students," while LLM apps will organize, finetune, and animate teams of them into "deployed professionals" in specific verticals by supplying private data, sensors, actuators, and feedback loops.

Tags: #ecosystem #labs #apps

16. Claude Code Paradigm

What paradigm shift does Claude Code represent according to Karpathy?

Claude Code represents AI that "lives on your computer"—a loopy agent combining tool use and reasoning for extended problem solving, running locally with access to your private environment, data, and context. It's a "little spirit/ghost" paradigm distinct from web-based AI.

Tags: #ClaudeCode #paradigm #localAI
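The "loopy agent" pattern can be summarized as a short loop: the model reasons, optionally requests a tool (read a file, run a command), observes the result, and repeats until it decides it is done. The sketch below is a generic agent loop with stubbed-out model and tools; it is not Claude Code's actual implementation.

```python
# Generic agent loop (a sketch, not Claude Code's implementation).
# 'call_llm' and the tools are stand-ins for a real model API and real local access.
def call_llm(transcript: list[str]) -> dict:
    return {"action": "finish", "content": "done"}    # stub: a real model decides here

TOOLS = {
    "read_file": lambda path: open(path).read(),      # local data and context
    "run_shell": lambda cmd: f"(would run: {cmd})",    # actuator on your machine
}

def agent(task: str, max_steps: int = 10) -> str:
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):                         # the "loop" in loopy agent
        step = call_llm(transcript)                    # reason about what to do next
        if step["action"] == "finish":
            return step["content"]                     # the model decides it is done
        result = TOOLS[step["action"]](step["content"])  # use a local tool
        transcript.append(f"OBSERVATION: {result}")    # feed the result back in
    return "gave up after max_steps"

print(agent("summarize the repo"))
```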

17. Google Gemini “Nano Banana” and LLM GUI

What is Google Gemini "Nano Banana" and why is it paradigm-shifting?

A model representing the beginning of "LLM GUI"—moving beyond text-based chat (like 1980s console commands) toward LLMs communicating via images, infographics, animations, and web apps. Its power comes from joint text generation, image generation, and world knowledge tangled in model weights.

Tags: #Gemini #GUI #multimodal


Key Insights

  • RLVR represents a paradigm shift: Moving from human-prescribed behaviors to emergent reasoning through optimization
  • Compute reallocation: The industry is shifting compute from pretraining to RL stages
  • Dual scaling: We now have both training-time and test-time scaling laws to optimize
  • Objective verification is key: The success of RLVR depends on having tasks with clear, verifiable correct answers