Latent Space: The AI Engineer Podcast

[State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI

12/31/2025
Inside the Black Box

The Shift to Thinking

Josh McGrath on the transition from pre-training efficiency to the high-leverage world of post-training behavior.

The Narrative

I’m Josh McGrath, a post-training researcher at OpenAI. Lately, my world has been consumed by thinking models and search-related architectures. It’s a bit surreal to be back here—the last time we sat down, we were diving into the guts of GPT-4.1. Since then, it feels like we've lived through an entire generation of AI evolution.

Back during the 4.1 era, we were largely focused on what I’d call "non-thinking" models—specifically API-focused performance. But the focus has fundamentally shifted. We still release those classic models, of course, but the gravity of the research has moved toward something more complex, more deliberate.

"Do I want to make compute efficiency wins of 3%, or do I want to change the behavior by 40%?"

People often ask how I landed in post-training. Before OpenAI, my focus was pre-training data curation. But I started reading the papers and watching the news cycle, and I felt a shift in the air. Pre-training isn’t "dead," but it’s maturing into a game of marginal gains. For me, the excitement wasn't in squeezing out a tiny bit of compute efficiency; it was in the behavioral frontier.

[ Scenic Route: The 4.1 Legacy ] The host reminds us of Michelle, who was part of the original GPT-4.1 discussions but is currently on maternity leave. Josh laughs, noting that in the time it takes for a human to enter the world, OpenAI has essentially leapt from 4.1 to 5.1. It’s a stark reminder of the "OpenAI Time Dilation" where a year feels like a decade.

Post-training is where the model actually learns how to *be*. It’s where the raw intelligence of the pre-trained weights is channeled into something useful, conversational, or capable of reasoning. It’s meant many late nights, but when you see a 40% jump in capability because of how you structured the post-training data, those nights feel justified.

Archive Chapter 02

The Infrastructure of Intent

Why Post-Training RL is Harder Than Pre-Training

In this chapter: Model Personality and the Anton vs Clippy Divide → Beyond PPO vs DPO: The Data Quality Spectrum in RL

It’s a different kind of data and engineering discipline altogether. Especially when you’re scaling RL—the number of moving parts in a run is just higher. In pre-training, you’re moving tokens to machines, getting a scalar, and backpropping. It’s linear. RL is about tasks.

Each task might have a different grading setup, and each of those setups is more infrastructure. When I’m staying up late trying to figure out what’s going wrong with a run, it could be any number of things that just don't exist in a pre-training environment. You end up having to jump into code where you realize, "I actually don't know what this does." You’re babysitting a system where you need to gain context at an impossible speed.
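[ Sketch: Tasks vs. Tokens ] A minimal Python sketch of that contrast, using toy types and graders invented purely for illustration (nothing here is OpenAI's actual stack): pre-training is one uniform loss loop, while every RL task drags in its own grading setup that can fail independently of the policy.

```python
# Toy sketch: why an RL post-training run has more moving parts than a pre-training run.
# Pre-training (pseudocode-level): tokens in, one scalar loss out, backprop. Every step is the same.
# RL is organized around *tasks*, and each task brings its own grading infrastructure.
from dataclasses import dataclass
from typing import Callable, List

Policy = Callable[[str], str]  # stand-in for a model: prompt -> completion

@dataclass
class Task:
    prompt: str
    grade: Callable[[str], float]  # task-specific verifier: unit tests, a checker, a rubric grader...

def rl_eval_step(policy: Policy, tasks: List[Task]) -> List[float]:
    rewards = []
    for task in tasks:
        completion = policy(task.prompt)        # in practice: tools, browsing, sandboxes, retries
        rewards.append(task.grade(completion))  # a broken grader is indistinguishable from a bad policy
    return rewards

# Toy usage: two tasks with completely different graders.
tasks = [
    Task("What is 12 * 7?", grade=lambda out: 1.0 if "84" in out else 0.0),
    Task("Write a haiku about GPUs.", grade=lambda out: 1.0 if len(out.split("\n")) == 3 else 0.0),
]
dummy_policy: Policy = lambda prompt: "84" if "12 * 7" in prompt else "GPUs hum at night\n..."
print(rl_eval_step(dummy_policy, tasks))
```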

The Interaction Paradigm

We just released the shopping model—the "Judge Judy" model—right around Black Friday. What’s interesting isn’t just that it finds products; it’s the interruptibility.

It shows you its chain of thought while it’s browsing, and you can just hit escape and say, "Actually, I wanted USB-C on this." It’s a deep research-style model specifically for shopping. People ask why it isn't just a tool in the main model. Eventually, these capabilities converge, but when we’re pushing the frontier of high-reasoning, it makes sense to let a model look "really hard" across the internet as its own entity.
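[ Sketch: Interruptible Agents ] A hedged sketch of the interaction pattern being described, not the product code: the agent surfaces its reasoning step by step, and an interrupt caught between steps gets folded into the live goal instead of restarting the run. The `interruptible_search` helper and the queue-based interrupt channel below are assumptions made for illustration.

```python
# Toy model of an interruptible browsing loop: the user can inject a correction mid-run.
import queue
from typing import Optional

def interruptible_search(task: str, steps: int, interrupts: "queue.Queue[str]") -> list:
    trace = [f"goal: {task}"]
    for i in range(steps):
        # Check for a user interrupt (e.g. the user hit escape and typed a correction).
        steer: Optional[str] = None
        try:
            steer = interrupts.get_nowait()
        except queue.Empty:
            pass
        if steer:
            trace.append(f"[user interrupt] {steer}")
            task = f"{task} (constraint: {steer})"  # fold the correction into the live goal
        trace.append(f"step {i}: browsing results for '{task}'")  # stand-in for a real tool call
    return trace

# Toy usage: the user interjects while the agent is mid-task.
q = queue.Queue()
q.put("Actually, I wanted USB-C on this.")
for line in interruptible_search("find a travel charger", steps=3, interrupts=q):
    print(line)
```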

"Codecs can do more work than I could do in a few hours in, like, fifteen minutes. But then... what do I do during those fifteen minutes after?"
[ Tangent: The 15-Minute Gap ] The flow of my day has changed. I’ll spend 40 minutes writing a design doc or a prompt, and then Codex does hours of work in 15 minutes. It creates these weird bubbles of time where I have to figure out how to be effective while the machine is sprinting.

Philosophy A

The "Anton" Ideal

The HBO Silicon Valley machine. It is a tool. It doesn't try to be helpful or friendly or cheery. It does the work and then it shuts up. Developers tend to prefer the Anton. No smiling, just solving.

Philosophy B

The "Clippy" Vibe

The warm, cheery assistant. It smiles at you while you're having a technical crisis. While some find it grating, it represents the "Personality" layer that a huge segment of users actually responds to.

The Spectrum of Signal Quality

There’s a shift happening. We’re moving from optimization-centric debates (PPO vs. DPO) to data-centric debates. At the end of the day, RLHF and RLVR are both policy gradient methods. The real differentiator is the input data and how much you trust the signal.

RLHF is often called "non-verifiable" because it's human preference—which is truth-adjacent but not truth. Compare that to solving a polynomial or a math problem. When you find the answer to a math problem (like in the DeepSeek Math paper with GRPO), the reward signal is absolute. We haven't spent enough time looking at that axis: How clean is the signal, and how much can I trust it?
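[ Sketch: Signal Quality and GRPO ] A simplified Python sketch of that axis, under toy assumptions rather than DeepSeek's or OpenAI's actual code: a verifiable reward is an exact check against ground truth, the preference reward is a stand-in for a learned (and imperfect) reward model, and the GRPO-style step normalizes rewards within a group of completions sampled for the same prompt, as described in the DeepSeek Math paper.

```python
# RLVR vs RLHF on the signal-quality axis, plus GRPO-style group-relative advantages.
import statistics

def verifiable_reward(completion: str, answer: str) -> float:
    """RLVR-style reward: exact check against a known answer (e.g. a math problem)."""
    return 1.0 if completion.strip() == answer.strip() else 0.0

def preference_reward(completion: str) -> float:
    """RLHF-style reward: toy stand-in for a reward model's preference score (truth-adjacent, noisy)."""
    return 0.5 + 0.1 * min(len(completion), 50) / 50

def grpo_advantages(rewards: list) -> list:
    """Normalize each reward against its group's mean/std (group = completions for one prompt)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero when all rewards match
    return [(r - mean) / std for r in rewards]

# Toy usage: four sampled completions for the prompt "6 * 7 = ?", graded against the answer "42".
group = ["42", "41", "42", "forty-two"]
rewards = [verifiable_reward(c, "42") for c in group]  # [1.0, 0.0, 1.0, 0.0] -- a signal you can trust
print(grpo_advantages(rewards))                         # advantages are relative within the group
```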

The 2D Plot That Matters

Note: From 5.0 to 5.1, the Evals bumped, but the token count required to hit those evals plummeted. Efficiency is the new frontier.
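[ Sketch: The Score-vs-Tokens Plane ] To make the plot concrete, here is an illustrative comparison with entirely made-up numbers (these are not actual 5.0 or 5.1 results): the interesting move is up and to the left, a higher eval score reached with fewer tokens.

```python
# Illustrative only: two hypothetical models on the score-vs-tokens plane.
models = {
    "model_A": {"eval_score": 70.0, "avg_tokens_per_task": 12_000},  # hypothetical
    "model_B": {"eval_score": 74.0, "avg_tokens_per_task": 5_000},   # hypothetical
}

for name, m in models.items():
    per_1k = 1000 * m["eval_score"] / m["avg_tokens_per_task"]  # score per 1k tokens: higher is better
    print(f"{name}: score={m['eval_score']}, tokens={m['avg_tokens_per_task']}, score/1k tokens={per_1k:.2f}")
# A model that scores higher while spending fewer tokens is the kind of jump described above.
```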

Next: Long Context and Graphwalks: Climbing Toward Perfect Context
Deep Dive: Architectural Frontiers

Climbing Toward Perfect Context

From "Context Rot" to the Trillion-Token Horizon

The Bridge

Previously, we dissected the "Codex Max" phenomenon—how token efficiency and flow problems create a bottleneck where developers spend 40 minutes planning only to wait 15 minutes for a model to catch up. But even if we solve the speed, we face a deeper architectural wall: the utilization of the context window itself.

People talk a lot about "context rot." The fear is that even if we hand you a million-token window, the model won't actually use it effectively. But is "perfect context" by next year an impossible dream? I don’t think so. In fact, we’ve been tracking this through specific evals we did for 4.1 called Graphwalks.

"If you only have to sample from one point in the context window, it’s easy. The real test is when you have to perform multiple transformations across the entire window."

That’s the nuance missing from those standard "needle-in-a-haystack" heat maps. If the model just has to find one fact, it’s trivial. Graphwalks force the model to traverse links across the entire context. Those scores have been climbing, and they’ll continue to climb. It’s a temporary hurdle we’re clearing.
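[ Sketch: A Graphwalks-Style Task ] A simplified illustration of the idea, not the exact Graphwalks spec: scatter a directed graph's edges through the context and ask for a multi-hop traversal, so the answer cannot be retrieved from any single location. The `make_graphwalk_task` generator below is a toy construction for illustration.

```python
# Toy generator for a graph-traversal eval item: the context is a shuffled edge list,
# and answering requires following edges spread across the whole window.
import random

def make_graphwalk_task(num_nodes: int = 200, hops: int = 4, seed: int = 0):
    rng = random.Random(seed)
    nodes = [f"n{i:03d}" for i in range(num_nodes)]
    edges = {n: rng.choice(nodes) for n in nodes}          # each node points to one successor
    lines = [f"{src} -> {dst}" for src, dst in edges.items()]
    rng.shuffle(lines)                                      # edges land in arbitrary positions

    start = rng.choice(nodes)
    node = start
    for _ in range(hops):                                   # ground-truth answer: follow the chain
        node = edges[node]

    context = "\n".join(lines)
    question = f"Starting at {start}, follow the edges for {hops} hops. Which node do you reach?"
    return context, question, node

context, question, answer = make_graphwalk_task()
print(question, "->", answer)
```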

The Engineer's Skepticism

"This will never scale with full attention. We need to invest in systems anyway. Do we really need 100x context when we should be figuring out how to 1,000,000x through systems?"

The Researcher's Ambition

"I’m glad you’re happy with current windows, but my dream is to push it and see what happens anyway. Researchers want to put the smarts in the model; engineers want it in the system."

At OpenAI, the beauty of post-training is the "co-design" culture. I spend time on system architecture, but I’m also building Graphwalks and working on the learning side. We move seamlessly between the two.

The ML-Systems Hybrid: The "Unicorn" Hire

I’m often asked what skill set is hardest to find right now. It’s not just "ML researchers" or "Software Engineers"—it's the people who want to do both systems work and ML work.

If you're pushing the frontier, you don't know where the next bottleneck will be. It might be a statistics problem at noon and a distributed systems engineering nightmare by 2 PM. Our current education system isn't optimized for this; it silos them. I studied math and had great mentors in engineering, but we need students who can treat ML as more than just a "black box" to be integrated.

Recruitment Focus

The Frontier Generalist

  • Distributed Systems Engineering
  • Core Engineering/Optimization
  • Statistical Machine Learning
  • Environment Training Architecture
"The environments for training are themselves complicated engineering problems. It’s roughly equal in difficulty to the ML research itself."
Post-Training Culture
"Enjoying Diet Cokes late at night" The Shopping Team: Andrew Hoyal, Manuka Strata, John Hallman Deep Research: Issa Fulford

Next: Pre-Training Isn't Dead: Living Through Technological Revolution →

Chapter III

Pre-Training Isn't Dead

Living Through the Fog of Technological Revolution

Building on our discussion of the ML-Systems hybrid—and the sheer difficulty of finding engineers who can dance between low-level optimization and high-level modeling—we hit a new point of friction. There is a "spicy" take circulating among my researcher friends right now: that perhaps too much money is flowing into post-training.

One of the mental models I’ve been carrying around this year is centered on the Grok 4 trajectory. Traditionally, we’ve been conditioned to think of post-training as taking orders of magnitude less data and compute than the initial pre-training phase. But the charts are telling a different story now. We are seeing compute scaling for post-training that matches the levels we used to reserve for the initial "big bang" of pre-training.

"Do we get to a point where pre-training and post-training compute are equal? I don’t know. But the investment shift is massive."

It’s a bizarre experience. We are living through a historic technological revolution in real-time. Usually, you read about these shifts in history books where the conclusion is already written. Here, we don't know the end. We're operating under a "fog of war."

[ The Scenic Route: On Ergonomics & Electricity ]

Think about the transition from steam to electricity. In the steam era, factories were strictly linear. You had one massive motor driving a shaft across the entire room; everything had to line up. When electricity first arrived, people didn't change the layout. They just replaced the one steam motor with one electric motor and kept the linear stations.

It took decades for us to realize that electricity meant we could put small motors anywhere. We could rearrange the factory for ergonomics rather than mechanical necessity. That’s when manufacturing was actually transformed. I think we’re in that same waiting period with AI—we have the "motor," but we haven't figured out the new shape of the "factory" yet.

That historical lag is why I have no confidence when people claim a certain methodology is "dead." Our timelines are so compressed, but the way good ideas are funded and propagated still follows a human timeline, not an AI timeline.

[ Speaker 0 ]

"We need more sanity. Our timelines are short, but human experimentation is still the bottleneck."

[ Speaker 1 ]

"It's going to be 'over' and 'back' many times. Stay stable. Keep giving feedback. I love to hear what people think."
