Agent First
OpenAI recently published a striking case study on its official site: a team started from an empty Git repository and built an internal beta product entirely using Codex—application logic, tests, CI configuration, documentation, observability tooling, and even internal developer tools. Crucially, they enforced a strict rule: no human wrote any production code.
The goal wasn’t just to test whether AI could generate code—it was to confront a deeper question: What does software engineering become when the team’s primary work is no longer writing code, but designing environments, clarifying intent, and building feedback loops?
In five months, they shipped a working internal beta—1 million lines of code, 1,500 PRs.
- Early progress was slow—not because the model was weak, but because the environment wasn’t ready.
The agents lacked clear abstractions, usable tools, and internal structure. So engineers pivoted: their job became enabling agents, not directing them. They broke down goals, added missing capabilities, built new tools, and hardened system boundaries—not to replace humans, but to make agents reliably useful.
The team’s core mission shifted: “How do we restructure our engineering system so agents can operate stably?”
OpenAI shared concrete examples:
- They modified the app to launch per git worktree, letting each agent change drive an isolated runtime instance.
- They embedded Chrome DevTools Protocol into the agent runtime, so Codex could inspect DOM snapshots, take screenshots, and navigate pages.
- They exposed logs, metrics, and traces via a local observability stack—enabling agents to query system state with LogQL and PromQL, reproduce bugs, verify fixes, observe outcomes, and iterate autonomously. Some Codex runs lasted over six hours.
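As a rough illustration of the per-worktree idea above, here is a minimal Python sketch. It is not OpenAI's implementation: the port range, the `agent/<task>` branch naming, the `APP_PORT` variable, and the helper names are all assumptions made for the example. It only builds the `git worktree add` command and a stable port for each checkout; running the command is left to the caller.

```python
import hashlib
from pathlib import Path

BASE_PORT = 20000  # assumed start of a port range reserved for agent instances
PORT_SPAN = 10000  # assumed size of that range


def port_for_worktree(worktree: Path) -> int:
    """Map a worktree path to a stable port so each checkout gets its own instance."""
    digest = hashlib.sha256(str(worktree).encode("utf-8")).digest()
    return BASE_PORT + int.from_bytes(digest[:4], "big") % PORT_SPAN


def worktree_plan(repo: Path, task_id: str) -> dict:
    """Build the git command and runtime settings for one agent task (hypothetical layout)."""
    worktree = repo.parent / f"{repo.name}-wt-{task_id}"
    return {
        # `git worktree add -b <branch> <path>` creates a new branch in a separate checkout.
        "git_cmd": ["git", "-C", str(repo), "worktree", "add",
                    "-b", f"agent/{task_id}", str(worktree)],
        "worktree": worktree,
        # The app is assumed to read its listening port from this variable.
        "env": {"APP_PORT": str(port_for_worktree(worktree))},
    }
```

Hashing the path keeps the port stable across restarts without a shared registry. Two worktrees can still collide on a port, so a real setup would need to check availability before launching.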
- What matters most isn’t “generation power”—it’s readability and verifiability.
OpenAI devoted significant space to explaining why the codebase itself must become the system of record, not just a delivery artifact.
Their first attempt, a monolithic AGENTS.md file, failed. Why?
- Context windows are scarce; large files crowd out task instructions and code.
- Declaring everything as “important” makes nothing stand out.
- Such documents rot quickly—becoming graveyards of outdated rules.
- They’re nearly impossible to validate mechanically.
So they replaced it: a lean, ~100-line AGENTS.md acting as a table of contents, while real knowledge lived in a structured /docs directory. The repo became a living record of decisions, constraints, and evolution.
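One way to keep such an index mechanically checkable is sketched below in Python. The post does not describe its actual checks; the 100-line budget is borrowed from the figure above, and the convention that index entries mention paths under `docs/` is an assumption for the example.

```python
import re
from pathlib import Path

MAX_LINES = 100  # assumed budget, borrowed from the "~100-line" figure
DOC_REF = re.compile(r"docs/[\w./-]+")  # assumed convention: entries mention paths under docs/


def check_agents_index(repo: Path) -> list[str]:
    """Return the problems found in AGENTS.md; an empty list means the index is healthy."""
    text = (repo / "AGENTS.md").read_text(encoding="utf-8")
    problems = []

    line_count = len(text.splitlines())
    if line_count > MAX_LINES:
        problems.append(f"AGENTS.md has {line_count} lines (limit {MAX_LINES})")

    # Strip a trailing period so a path at the end of a sentence still resolves.
    refs = {match.rstrip(".") for match in DOC_REF.findall(text)}
    for ref in sorted(refs):
        if not (repo / ref).exists():
            problems.append(f"AGENTS.md points at missing path: {ref}")
    return problems
```

Run in CI, a check like this fails the build when the table of contents outgrows its budget or drifts away from the files it indexes.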
A line from the post stands out:
You shouldn’t give an agent a 1,000-page manual—you should give it a map.
That captures a foundational shift: No system scales on prompt-stuffing alone. You need maps—boundaries, indexes, layered knowledge structures that let agents navigate depth, not just breadth.
- Beyond engineering, what becomes critical is taste—and rule-crafting.
Documentation alone can’t sustain coherence in a fully agent-generated codebase. What works is encoding invariants: architectural boundaries, enforced mechanically.
They built applications around strict domain-layer models—each business area had fixed tiers; dependency directions were validated; only certain cross-layer edges were allowed—enforced by custom linters and structural tests. Even log formats, naming conventions, file-size limits, and platform reliability requirements were codified as rules.
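A structural test of this kind can be quite small. The Python sketch below is an illustration only: the post does not name its layers, so the four tiers, the `<domain>.<layer>.<rest>` module naming, and the rule "a layer may only depend on layers at or below it" are assumptions.

```python
# Hypothetical tiers, ordered from lowest to highest; the post does not name its layers.
LAYERS = ["types", "repository", "service", "ui"]
RANK = {name: index for index, name in enumerate(LAYERS)}


def layer_of(module: str) -> str:
    """Assume modules are named '<domain>.<layer>.<rest>', e.g. 'billing.service.invoices'."""
    return module.split(".")[1]


def find_violations(imports: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return (importer, imported) pairs where a lower layer depends on a higher one."""
    violations = []
    for importer, targets in imports.items():
        for target in targets:
            if RANK[layer_of(importer)] < RANK[layer_of(target)]:
                violations.append((importer, target))
    return violations
```

For example, given `{"billing.repository.db": ["billing.ui.forms"]}`, the function reports that edge, because the repository tier sits below the UI tier. In practice the import graph would come from parsing the source tree rather than from a hand-written dictionary.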
One sentence resonates deeply:
In human workflows, these rules feel bureaucratic. In agent environments, they become force multipliers.
Many engineering teams instinctively resist rigid rules—seeing them as constraints on creativity. But in high-throughput agent systems, clarity is velocity. The sharper the boundary, the faster the agent moves—and the less drift, rework, or ambiguity accumulates.
- Sharing is systemic.
When OpenAI says “the codebase was generated by Codex,” they mean everything: product code, tests, CI configs, release tooling, internal dev tools, docs, design history, evaluation frameworks, review comments and replies—even scripts that manage the repo itself and definitions for production dashboards.
Humans remained deeply involved, but at a higher abstraction layer: setting priorities, translating user feedback into acceptance criteria, validating outputs. Agents executed. When an agent got stuck, humans diagnosed what was missing (a tool, guidance, a constraint, or documentation) and had Codex itself write the fix into the repo.
- So what did this experiment actually demonstrate?
At first, they asked: Can AI build a product alone?
Soon, they realized the hard part wasn’t generation—it was environmental readiness.
Then came systemic openness: making apps, logs, docs, architecture, reviews, and tests all legible and actionable by agents.
Then rule embedding: baking taste, consistency, and boundaries directly into the system—so agents could move fast within guardrails.
Finally, end-to-end agency: Codex now drives new features from spec to PR (reproducing bugs, recording issue videos, applying fixes, running validations, recording resolution videos, opening PRs, responding to feedback, fixing broken builds) and escalates to humans only when judgment, not execution, is required.
- By this point, software engineering has fundamentally changed flavor.
Code is no longer the center; it is the output layer. Higher-leverage work lives in environment design, contextual scaffolding, architectural constraint, and feedback-loop architecture.
- This isn’t utopian. The post honestly surfaces friction.
Fully autonomous agents tend to replicate patterns already in the codebase—including suboptimal ones. Over time, this creates “AI residue”: accumulated technical debt disguised as consistency. Early on, the team spent every Friday manually cleaning it up—unsustainable. So they encoded “golden principles” directly into the repo and launched background Codex tasks to continuously scan for deviations, update quality scores, and auto-generate refactoring PRs.
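The background scan described above might look roughly like the Python sketch below. It is a toy under stated assumptions: the only "golden principle" checked is a file-size limit, the 400-line threshold is invented (the post mentions file-size limits but gives no number), and the score is simply the share of files that comply.

```python
from pathlib import Path

MAX_FILE_LINES = 400  # hypothetical limit; the post gives no number


def scan_repo(root: Path, pattern: str = "*.py") -> dict:
    """Find files over the size limit and compute a simple quality score."""
    files = sorted(root.rglob(pattern))
    oversized = []
    for path in files:
        line_count = len(path.read_text(encoding="utf-8").splitlines())
        if line_count > MAX_FILE_LINES:
            oversized.append((path.relative_to(root).as_posix(), line_count))

    # Score is the fraction of scanned files that respect the limit.
    score = 1.0 if not files else 1 - len(oversized) / len(files)
    return {"score": score, "refactor_candidates": oversized}
```

Each entry in `refactor_candidates` could then seed one small refactoring task, which keeps cleanup continuous instead of saving it for a weekly manual pass.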
The authors call this process garbage collection. I think that’s spot-on.
- A high-output system naturally generates entropy.
Let agents generate at scale, and you must pair that with an equally robust entropy-reduction mechanism. Otherwise, efficiency gains just accelerate chaos.
- OpenAI admits open questions remain:
  - How does architectural coherence evolve long-term in a fully agent-generated system?
  - Where should human judgment be most tightly focused, and how do we encode that judgment?
  - How will this whole stack adapt as models grow more capable?
One thing is clear: discipline remains essential—not in individual lines of code, but in the supporting infrastructure: the environment, the constraints, the feedback channels, the knowledge architecture.
When agents become engineering’s primary workforce, scarcity migrates—from syntax mastery to environment design, rule encoding, and feedback-system craftsmanship.
Original blog (in Chinese): openai.com
From AI Coding to AI Native
A recent observation about how engineers use AI: many now rely heavily on AI for coding—and yes, it boosts output in countless scenarios. But a widespread pattern emerges: they deploy AI only at the code-generation layer, never pulling it into engineering design, product strategy, UX thinking, or system architecture.
AI improves how fast they write code—but doesn’t change what they build or why. The result? Still traditional IT-era software: interfaces cluttered with buttons, flows burdened by steps, experiences built for human navigation—not for intent fulfillment.
What matters more is redefining experience, engineering, and architecture for the AI era. That’s the heart of “AI Native.”
Most current AI usage stops at “AI writes code.” It accelerates old-world software construction—but doesn’t reconstruct new-world software logic.
AI’s deeper value lies in transforming product shape, engineering structure, interaction models, and system architecture. True AI Native means the system is conceived from the ground up to be AI-first—not merely AI-assisted.
This layer is vastly harder. I’ve felt it acutely lately: generating snippets, filling docs, or wiring small features is now trivial. But designing an elegant, coherent, AI-native architecture—one where flow collapses, interface fades, and intent becomes the sole interface—that demands a different kind of rigor.
I suspect three roles are emerging:
- AI Code Users: Highly productive, but still shipping legacy-pattern software.
- AI Workflow Designers: Understand how to chain models, tools, knowledge bases, state flows, and feedback mechanisms. Their products feel meaningfully more adaptive than traditional software.
- AI Native Architects: Focus shifts entirely—to what humans should do, what AI should own, how much UI to retain, how far to compress process, and how to organize the system around “intent” instead of “feature menus.”
This third group is the rarest—and most consequential.
The Essence of Fortune-Telling
Fortune-telling likely satisfies a deeper psychological need: a sense of being placed. Our hunger for existential grounding runs deeper than we assume.
Consider a child seeing a map for the first time. Almost universally, they do the same thing: find themselves. “Look—we’re right here!” In that moment, something settles. It’s not just location—it’s being anchored.
Physical maps solve spatial positioning. Psychological maps solve existential positioning—answering questions like:
- Who am I?
- Why do I act this way?
- What do these experiences mean?
- Where am I, in the arc of my life?
Without answers, life feels unstructured—events random, disconnected, arbitrary. Fortune-telling systems—whether astrology, Bazi, or MBTI—offer a framework. They slot your story into a category, assign traits, outline life stages. Accuracy is secondary. What matters is narrative coherence: turning fragmentation into story.
Humans are narrative animals. A string of events without structure is just noise. Insert a frame—“this is your karmic test,” “that’s your Saturn return”—and suddenly it becomes part of your life. Suffering remains hard—but if it’s meaningful suffering, it becomes bearable.
Fortune-telling’s real power lies in stitching experience together. It takes scattered moments and presents them as pieces of a single, intelligible mosaic. People don’t go for predictions—they go for confirmation: “Is what I’m living meaningful?”
Another subtle mechanism: fortune-telling language almost always emphasizes uniqueness. Real-world metrics are reductive—grades, KPIs, net worth—flattening complexity into single dimensions.
Fortune-telling flips that: “Your chart is rare,” “Your path is unconventional,” “You’re destined to bloom late.” Even hardship is framed as distinctive, not as generic misfortune. For many, the unbearable isn’t bad luck; it’s meaningless ordinariness.
But there’s risk. Fortune-telling quietly reshapes how we understand existence itself.
Existence has two roots:
- Action-based: I create meaning through what I do.
- Position-based: Meaning comes from where the world places me.
Fortune-telling leans heavily into the second. The more one relies on such explanations, the more action recedes—and interpretation advances. Explanation soothes; action risks.
A quiet loop can form:
- The less grounded someone feels, the more they seek explanation.
- The more they seek explanation, the less they act.
- The less they act, the more ungrounded they feel.
In this cycle, fortune-telling functions as psychological suturing: it doesn’t alter reality—but temporarily binds fragmented experience into something legible. It delivers three things: recognition, distinctiveness, and structural coherence. When those arrive, a person feels: “My story is understood. My life has shape.” That feeling alone carries profound comfort.
The question isn’t whether it helps—it does. It’s whether the suture is temporary (restoring strength to act) or permanent (becoming the only lens through which reality is interpreted). The latter erodes agency.
Wolf: Not Just “Fierce”
We often misread wolves—fixating on “ferocity,” “wildness,” or “pack tactics.”
But what’s truly rare—and evolutionarily vital—for wolves isn’t aggression. It’s endurance.
In the wild, their prey—deer, elk, bison—are large, fast, and strong. Raw aggression would get wolves injured—or killed. And injury, in nature, is often fatal. So “charging in” isn’t strategy—it’s suicide.
What makes wolves formidable operates on four calibrated layers:
- Restraint: They don’t pounce when hungry or upon first sighting. They observe, probe, shadow, exhaust—assessing whether this target has a weakness.
- Patience: Real hunting isn’t the final lunge—it’s the preceding hours or days of tracking, waiting, filtering. Wolves target the old, the weak, the stragglers—the odds they can tilt.
- Discipline: Even in packs, coordination isn’t just “unity”—it’s precise, synchronized restraint. Knowing when to close in, when to fall back, when to encircle, when to pause—all governed by shared, implicit timing.
- Decisiveness: Wolves endure, but the moment advantage crystallizes, their strike is lightning-fast. The long wait was never hesitation; it was building leverage for a high-probability outcome.
Sleep First
For the past two years, I’ve kept an early rhythm: my watch vibrates at 6 a.m., and I’ve gotten up with it on nearly every one of those days.
Reviewing last year’s sleep data, my average was ~7.5 hours nightly.
Over Spring Festival, back home, I dropped the alarm. Sleep stretched to 8–9 hours daily. Within days, my energy, focus, and mood noticeably lifted.
Back in Beijing, even with the 6 a.m. buzz, I stayed in bed and got up around 6:50 instead. After keeping that up for a while, I formalized the shift and moved the alarm to 6:40 a.m. That adds ~40 minutes nightly, lifting my average from ~7.5 hours to roughly 8.
For years, I prioritized exercise above all. Now I see clearly: sleep is the deeper variable. Recovery capacity, training quality, emotional resilience, decision clarity—all rest upon it.
I’ve updated my personal health equation:
Sleep + Nutrition + Movement + Emotional Regulation + Medication
—with sleep now at the top of the priority stack.