Author: Andrej Karpathy • Source: karpathy.bearblog.dev • Date Read: December 25, 2025
17 cards • Last updated: Dec 2025
Flashcards
1. Modern LLM Production Stack
What are the four stages of the modern LLM production stack as of 2025?
- Pretraining
- Supervised Finetuning (SFT)
- RLHF (Reinforcement Learning from Human Feedback)
- RLVR (Reinforcement Learning from Verifiable Rewards)
Tags: #LLM #training #RLVR
2. RLVR Definition
What is RLVR (Reinforcement Learning from Verifiable Rewards)?
Training LLMs against automatically verifiable rewards (e.g., math and code puzzles), so that reasoning strategies emerge through optimization rather than being prescribed by humans.
Tags: #RLVR #reasoning #training
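Not from the source post, but a minimal sketch of what an "auto-verifiable reward" looks like in code for the math case; the `ANSWER:` convention and exact-match check are illustrative assumptions, not any lab's actual pipeline:

```python
# Minimal sketch of an auto-verifiable reward for a math task.
# The task ships with a ground-truth answer, so no human judge is needed:
# reward is 1 if the model's final answer matches it, else 0.

def extract_final_answer(completion: str) -> str:
    """Illustrative convention: the model ends its trace with 'ANSWER: <value>'."""
    for line in reversed(completion.strip().splitlines()):
        if line.startswith("ANSWER:"):
            return line.removeprefix("ANSWER:").strip()
    return ""

def math_reward(completion: str, ground_truth: str) -> float:
    """Binary, hard-to-game reward: exact match against the known answer."""
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

# The long reasoning trace only matters insofar as it ends in the right answer;
# the optimizer is free to discover how to get there.
trace = "Let x = 3*7.\nDouble-check: 3*7 = 21.\nANSWER: 21"
print(math_reward(trace, "21"))  # 1.0
```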
3. Computational Intensity Comparison
How does RLVR differ from SFT and RLHF in terms of computational intensity?
SFT and RLHF are relatively thin/short stages (minor finetunes), while RLVR allows much longer optimization runs because it trains against objective, non-gameable reward functions. This shifted compute from pretraining to RL, resulting in similar-sized LLMs but longer RL runs.
Tags: #RLVR #compute #optimization
4. Test-Time Compute Scaling
What new "knob" did RLVR introduce for controlling LLM capability?
Test-time compute scaling—the ability to increase capability by generating longer reasoning traces and increasing "thinking time" during inference, with its own associated scaling law.
Tags: #RLVR #scaling #inference
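As a hedged illustration of the "knob": the `generate` interface and `thinking_budget` parameter below are hypothetical stand-ins, not any specific provider's API.

```python
# Hypothetical interface: capability is dialed up by allowing a longer hidden
# reasoning trace before the final answer, trading inference compute for accuracy.

def generate(prompt: str, thinking_budget: int) -> str:
    """Stand-in for a reasoning model; thinking_budget caps the trace length in tokens."""
    ...

# Same model, same prompt, different points on the test-time scaling curve:
for budget in (256, 2048, 16384):
    answer = generate("Prove that 2**61 - 1 is prime or find a factor.", thinking_budget=budget)
```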
5. Emergent Reasoning Strategies
What reasoning strategies emerged spontaneously from RLVR training?
Models spontaneously developed strategies like:
- Reflection and double-checking answers
- Trying multiple approaches
- Breaking down complex problems
- Backtracking when stuck
These emerged through optimization rather than being explicitly programmed.
Tags: #RLVR #reasoning #emergence
6. Impact on Training Compute Allocation
How did RLVR change the allocation of compute between pretraining and RL stages?
RLVR shifted significant compute from pretraining to the RL stage. While pretraining used to dominate compute usage, RLVR allows for much longer RL optimization runs, rebalancing the compute distribution across the training pipeline.
Tags: #RLVR #compute #training
7. Key Requirement for RLVR
What is the key requirement for effective RLVR training?
Auto-verifiable rewards that are objective and non-gameable. Examples include math puzzles and code problems where correctness can be automatically verified without human judgment.
Tags: #RLVR #verification #rewards
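A complementary toy sketch for the code case (illustrative only; real pipelines execute model-written code in a sandbox, which this verifier does not):

```python
# Toy verifier for a coding task: the reward is the fraction of unit tests the
# model-written function passes. Correctness is checked mechanically, so the
# reward needs no human judgment.

def code_reward(model_code: str, tests: list[tuple[str, object]]) -> float:
    namespace: dict = {}
    try:
        exec(model_code, namespace)          # define the model's function
    except Exception:
        return 0.0                           # code that doesn't even run earns nothing
    passed = 0
    for call_expr, expected in tests:
        try:
            if eval(call_expr, namespace) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(tests)

tests = [("add(2, 3)", 5), ("add(-1, 1)", 0)]
print(code_reward("def add(a, b):\n    return a + b", tests))  # 1.0
```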
8. RLVR vs Human Feedback
How does RLVR differ from RLHF in terms of feedback mechanism?
RLHF relies on human preferences and feedback, while RLVR uses automatically verifiable rewards from objective tasks (like math/code puzzles) that don't require human judgment.
Tags: #RLVR #RLHF #feedback
9. Scaling Laws Evolution
What new scaling law did RLVR introduce to the LLM ecosystem?
Test-time compute scaling law—the relationship between inference-time compute (thinking time) and model capability, adding to the existing pretraining scaling laws.
Tags: #scaling #RLVR #inference
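The post does not state a formula; as a stylized illustration only, scaling-law work typically fits power laws of the shape below, with the new test-time law sitting alongside the familiar training-compute law (all symbols are generic placeholders, not fitted values):

```latex
% Stylized power-law forms only (not from the source post): loss falls roughly
% as a power law in compute, down to an irreducible floor.
\[
  \mathcal{L}_{\text{pretrain}}(C_{\text{train}}) \;\approx\; a\,C_{\text{train}}^{-\alpha} + \varepsilon_0
  \qquad\text{and now also}\qquad
  \mathcal{L}_{\text{task}}(C_{\text{test}}) \;\approx\; b\,C_{\text{test}}^{-\beta} + \varepsilon_1
\]
% C_test is inference-time compute, i.e. the length of the reasoning trace ("thinking time").
```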
10. Year of RLVR Breakthrough
According to Karpathy, what year marked the breakthrough for RLVR in LLM development?
2025—Karpathy identifies this as the year when RLVR became a major component of the LLM production stack, fundamentally changing how models are trained and scaled.
Tags: #RLVR #timeline #2025
11. OpenAI Model Inflection Points
Which OpenAI models marked the inflection point for RLVR capabilities?
OpenAI o1 (late 2024) was the first demonstration of an RLVR model, but o3 (early 2025) was the obvious inflection point where the difference became intuitively noticeable.
Tags: #OpenAI #RLVR #models
12. Ghosts vs. Animals Metaphor
What is the "Ghosts vs. Animals" metaphor for understanding LLM intelligence?
We're not "evolving/growing animals" but "summoning ghosts." LLMs have fundamentally different architectures, training data, algorithms, and optimization pressures than biological intelligence; they are optimized for imitating text and collecting rewards rather than for survival. The result is an entity that is misleading to think about through an animal lens.
Tags: #metaphor #intelligence #LLM
13. Jagged Intelligence
What is "jagged intelligence" in the context of LLMs?
LLMs display uneven performance characteristics—they "spike" in capability in verifiable domains (where RLVR applies) while remaining weak elsewhere. They can simultaneously be a "genius polymath" and a "confused grade schooler" who can be tricked by simple jailbreaks.
Tags: #intelligence #capabilities #RLVR
14. LLM App Layer Functions
What are the four key functions of an "LLM app" layer like Cursor?
- Context engineering
- Orchestrating multiple LLM calls in complex DAGs while balancing performance/cost (see the sketch after this card)
- Application-specific GUI for human-in-the-loop
- An "autonomy slider" for user control
Tags: #LLMapp #architecture #Cursor
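A hedged sketch of the orchestration point from the second bullet; the model names and the `call_llm` helper are hypothetical, not Cursor's actual internals:

```python
# Toy DAG: cheap, independent per-file calls at the leaves (context engineering),
# one expensive call at the root for synthesis, trading cost against quality.

def call_llm(model: str, prompt: str) -> str:
    """Stand-in for a provider API call."""
    ...

def review_change(files: list[str], diff: str) -> str:
    # Leaf nodes: cheap model summarizes each file independently.
    summaries = [call_llm("cheap-model", f"Summarize {path} for a code review.") for path in files]
    # Root node: expensive model sees only the distilled context plus the diff.
    context = "\n".join(str(s) for s in summaries)
    return call_llm("expensive-model", f"Given:\n{context}\nReview this diff:\n{diff}")
```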
15. LLM Labs vs Apps Division
How does Karpathy predict the division between LLM labs and LLM apps will evolve?
LLM labs will graduate "generally capable college students," while LLM apps will organize, finetune, and animate teams of them into "deployed professionals" in specific verticals by supplying private data, sensors, actuators, and feedback loops.
Tags: #ecosystem #labs #apps
16. Claude Code Paradigm
What paradigm shift does Claude Code represent according to Karpathy?
Claude Code represents AI that "lives on your computer"—a loopy agent combining tool use and reasoning for extended problem solving, running locally with access to your private environment, data, and context. It's a "little spirit/ghost" paradigm distinct from web-based AI.
Tags: #ClaudeCode #paradigm #localAI
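A minimal sketch of the "loopy agent" shape; the `model_step` interface and single shell tool are illustrative assumptions, not Claude Code's implementation:

```python
# Core loop of a local agent: the model either requests a tool call (here, a
# shell command) or declares it is done; tool output is fed back into the
# transcript, so reasoning and acting alternate until the task finishes.
import subprocess

def model_step(transcript: list[str]) -> dict:
    """Stand-in for the LLM: returns e.g. {'tool': 'shell', 'arg': 'pytest'} or {'done': '...'}."""
    ...

def run_agent(task: str, max_steps: int = 20) -> str:
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):
        action = model_step(transcript) or {}
        if "done" in action:
            return action["done"]
        if action.get("tool") == "shell":
            result = subprocess.run(action["arg"], shell=True, capture_output=True, text=True)
            transcript.append(f"TOOL OUTPUT: {result.stdout[-2000:]}")  # observation fed back in
    return "stopped: step budget exhausted"
```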
17. Google Gemini “Nano Banana” and LLM GUI
What is Google Gemini "Nano Banana" and why is it paradigm-shifting?
A model representing the beginning of the "LLM GUI"—moving beyond text-based chat (akin to 1980s console commands) toward LLMs communicating via images, infographics, animations, and web apps. Its power comes from text generation, image generation, and world knowledge being tangled together in the model's weights.
Tags: #Gemini #GUI #multimodal
Key Insights
- RLVR represents a paradigm shift: Moving from human-prescribed behaviors to emergent reasoning through optimization
- Compute reallocation: The industry is shifting compute from pretraining to RL stages
- Dual scaling: We now have both training-time and test-time scaling laws to optimize
- Objective verification is key: The success of RLVR depends on having tasks with clear, verifiable correct answers