On Friday, OpenAI engineer Michael Bolin published a detailed technical breakdown of how the company's Codex CLI coding agent works internally, offering developers insight into AI coding tools that can write code, run tests, and fix bugs with human supervision. It complements our article in December on how AI agents work by filling in technical details on how OpenAI implements its "agentic loop."
AI coding agents are having something of a "ChatGPT moment": Claude Code with Opus 4.5 and Codex with GPT-5.2 have reached a new level of usefulness for rapidly coding up prototypes and interfaces and for churning out boilerplate code. OpenAI's post details the design philosophy behind Codex just as these agents are becoming more practical tools for everyday work.
These tools aren't perfect and remain controversial among some software developers. OpenAI has previously told Ars Technica that it uses Codex to help develop the Codex product itself, but we've also found through hands-on experience that these tools can be astonishingly fast at simple tasks while remaining brittle beyond their training data and requiring human oversight for production work. The rough framework of a project tends to come together quickly and can feel magical, but filling in the details involves tedious debugging and workarounds for limitations the agent cannot overcome on its own.
Bolin's post doesn't shy away from these engineering challenges. He discusses the inefficiency of quadratic prompt growth, performance issues caused by cache misses, and bugs the team discovered and had to fix, such as MCP tools being enumerated inconsistently.
The level of technical detail is somewhat unusual for OpenAI, which has not published similar breakdowns of how other products like ChatGPT work internally (there's a lot going on under that hood we'd like to know about). But we've already seen signs that OpenAI treats Codex differently: during our interview with the company in December, it noted that programming tasks seem ideally suited for large language models.
It's worth noting that both OpenAI and Anthropic open source their coding CLI clients on GitHub, allowing developers to examine the implementations directly, something neither company does for ChatGPT or the Claude web interface.
An official look inside the loop
Bolin's post focuses on what he calls "the agent loop," which is the core logic that orchestrates interactions between the user, the AI model, and the software tools the model invokes to perform coding work.
As we wrote in December, at the center of every AI agent is a repeating cycle. The agent takes input from the user and prepares a textual prompt for the model. The model then generates a response, which either produces a final answer for the user or requests a tool call (such as running a shell command or reading a file). If the model requests a tool call, the agent executes it, appends the output to the original prompt, and queries the model again. This process repeats until the model stops requesting tools and instead produces an assistant message for the user.
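In Python-flavored pseudocode, that cycle looks something like the sketch below. This is our own illustration of the general pattern, with a fake model and a fake tool standing in for the real components; it is not Codex's actual code, and the function names and message format are our assumptions.

```python
# A minimal, illustrative agent loop with a fake model and a fake tool.
# None of this is Codex's actual code; the names and message format are assumptions.
from dataclasses import dataclass, field

@dataclass
class ModelResponse:
    text: str = ""
    tool_calls: list = field(default_factory=list)

def call_model(history):
    """Stand-in for a real inference call: requests one shell tool call,
    then answers once it sees the tool output in the history."""
    if not any(item["role"] == "tool" for item in history):
        return ModelResponse(tool_calls=[{"name": "shell", "command": "ls"}])
    return ModelResponse(text="The project contains main.py and a tests/ directory.")

def run_tool(call):
    """Stand-in for executing a tool call (e.g., a sandboxed shell command)."""
    return f"(pretend output of `{call['command']}`)"

def agent_loop(user_message):
    history = [{"role": "user", "content": user_message}]
    while True:
        response = call_model(history)           # one inference over the full history
        if response.tool_calls:                  # the model asked to run tools
            for call in response.tool_calls:
                output = run_tool(call)
                history.append({"role": "tool", "content": output})
            continue                             # query the model again with tool output appended
        history.append({"role": "assistant", "content": response.text})
        return response.text                     # final answer for the user

print(agent_loop("What files are in this repo?"))
```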
That looping process has to start somewhere, and Bolin's post reveals how Codex constructs the initial prompt sent to OpenAI's Responses API, which handles model inference. The prompt is built from several components, each with an assigned role that determines its priority: system, developer, user, or assistant.
The instructions field comes from either a user-specified configuration file or base instructions bundled with the CLI. The tools field defines what functions the model can call, including shell commands, planning tools, web search capabilities, and any custom tools provided through Model Context Protocol (MCP) servers. The input field contains a series of items that describe the sandbox permissions, optional developer instructions, environment context like the current working directory, and finally the user's actual message.
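Pieced together from that description, a first-turn request might be shaped roughly like the dictionary below. The field names follow the post; the specific values, tool names, and per-item role assignments are our own illustrative guesses, not what Codex actually sends.

```python
# A rough sketch of a first-turn request to the Responses API, based on the
# post's description. Values, tool names, and per-item roles are illustrative
# assumptions, not Codex's real payload.
first_turn_request = {
    "model": "<model name>",
    "instructions": "<user config file or base instructions bundled with the CLI>",
    "tools": [
        {"type": "function", "name": "shell"},        # run sandboxed shell commands
        {"type": "function", "name": "plan"},         # planning tool (name assumed)
        {"type": "function", "name": "web_search"},   # plus any MCP-provided tools
    ],
    "input": [
        {"role": "developer", "content": "<sandbox permissions>"},
        {"role": "developer", "content": "<optional developer instructions>"},
        {"role": "user", "content": "<environment context, e.g., cwd=/home/user/project>"},
        {"role": "user", "content": "Fix the failing test in tests/test_parser.py"},
    ],
}
```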
As conversations continue, each new turn includes the complete history of previous messages and tool calls. This means the prompt grows with every interaction, which has performance implications. According to the post, because Codex does not use an optional "previous_response_id" parameter that would allow the API to reference stored conversation state, every request is fully stateless (that is, it sends the entire conversation history with each API call rather than the server retrieving it from memory). Bolin says this design choice simplifies things for API providers and makes it easier to support customers who opt into "Zero Data Retention," where OpenAI does not store user data.
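In practice, that stateless design means each turn rebuilds the request from scratch and resends everything, as in this simplified sketch (the request shape and helper are our assumptions for illustration):

```python
# Simplified sketch of the stateless pattern the post describes: the client keeps
# the transcript and resends all of it on every turn.
transcript = []   # accumulates user messages, assistant replies, tool calls, and outputs

def build_request(new_items):
    transcript.extend(new_items)
    return {
        "model": "<model name>",
        "instructions": "<same instructions every turn>",
        "tools": ["<same tool definitions every turn>"],
        "input": list(transcript),   # full history resent with every API call
        # No "previous_response_id": per the post, Codex does not rely on
        # server-side conversation state, which keeps requests fully stateless.
    }

build_request([{"role": "user", "content": "Add a unit test for the date parser"}])
```

Resending turn 1 again on turn 2, turns 1 and 2 again on turn 3, and so on is what makes the total volume of prompt text grow quadratically with the number of turns, which is the inefficiency discussed next.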
The quadratic growth of prompts over a conversation is inefficient, but Bolin explains that prompt caching mitigates this issue somewhat. Cache hits only work for exact prefix matches within a prompt, which means Codex must carefully avoid operations that could cause cache misses. Changing the available tools, switching models, or modifying the sandbox configuration mid-conversation can all invalidate the cache and hurt performance.
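The prefix-matching behavior is easy to illustrate with a toy example: a new prompt only reuses cached work up to the first point where it differs from a previously seen prompt, so a change near the top of the prompt (like an edited tool list) throws away almost everything. The simplified comparison below is our own illustration, not how server-side caching is actually implemented.

```python
# Toy illustration of prefix-based prompt caching: a turn only benefits from the
# cache for the exact leading portion it shares with an earlier prompt.
def cached_prefix_length(previous_prompt, new_prompt):
    """Count how many leading items the new prompt shares with the old one."""
    n = 0
    for old, new in zip(previous_prompt, new_prompt):
        if old != new:
            break
        n += 1
    return n

turn_1 = ["<instructions>", "<tools: shell, plan>", "<user msg 1>"]
turn_2 = turn_1 + ["<assistant reply>", "<user msg 2>"]
print(cached_prefix_length(turn_1, turn_2))   # 3: the whole first turn is reused

# Changing something early in the prompt (e.g., the tool list) breaks the prefix:
turn_2_new_tools = ["<instructions>", "<tools: shell, plan, web_search>"] + turn_2[2:]
print(cached_prefix_length(turn_2, turn_2_new_tools))  # 1: the cache is mostly invalidated
```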
The ever-growing prompt length is directly related to the context window, which limits how much text the AI model can process in a single inference call. Bolin writes that Codex automatically compacts conversations when token counts exceed a threshold, just as Claude Code does. Earlier versions of Codex required manual compaction via a slash command, but the current system uses a specialized API endpoint that compresses context while preserving summarized portions of the model's "understanding" of what happened through an encrypted content item.
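A threshold-based compaction scheme can be sketched as follows. The threshold, the token estimate, and the summarize() call are all stand-ins we invented for illustration; per the post, Codex delegates the real work to a specialized API endpoint that returns an encrypted content item preserving the model's summarized state.

```python
# A hedged sketch of threshold-based compaction. The threshold, the token
# estimate, and summarize() are illustrative stand-ins, not Codex's implementation.
COMPACTION_THRESHOLD = 200_000          # assumed token budget, for illustration
KEEP_RECENT = 10                        # how many recent items to keep verbatim (assumed)

def estimate_tokens(items):
    return sum(len(str(item)) for item in items) // 4   # crude ~4 chars/token heuristic

def maybe_compact(history, summarize):
    """Replace older transcript items with a single summary once the prompt gets too big."""
    if estimate_tokens(history) < COMPACTION_THRESHOLD:
        return history
    older, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    summary_item = {"role": "assistant",
                    "content": f"[summary of earlier work] {summarize(older)}"}
    return [summary_item] + recent
```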
Bolin says that future posts in his series will cover the CLI's architecture, tool implementation details, and Codex's sandboxing model.
