The Trust Threshold: Senior Engineers Are Already Crossing It
On February 10th, 2026, Spotify's co-CEO Gustav Söderström got on a quarterly earnings call, the kind of call where executives talk about subscriber growth and gross margin, and said something that stopped a lot of engineers mid-scroll.
"When I speak to my most senior engineers — the best developers we have — they actually say that they have not written a single line of code since December."*
He wasn't talking about junior developers handed a new toy. He was talking about Spotify's best engineers, the Senior and Staff-level people who set the standards, run the reviews, and decide what goes to production. They stopped writing code two months ago. They've been shipping features ever since, more than 50 in 2025 alone, using an internal AI system called Honk, built on Claude Code.
Söderström called December a "singular event in terms of AI productivity." And then came the line I haven't been able to stop thinking about: "we crossed the threshold where things just started working."
I've been arguing for a while that the real tipping point in AI-assisted development wouldn't be a model release or a benchmark score. It would be the moment senior engineers started trusting the output. Really trusting it. The kind of trust that changes how you work, not just how fast you work.
Söderström just told his investors that moment arrived. And Spotify isn't alone.
At Anthropic, the engineer who built Claude Code hasn't written a single line of code manually in over two months. He's submitting 22 to 27 pull requests a day, every one of them written entirely by Claude. Company-wide, between 70 and 90 percent of all code at Anthropic is now AI-generated. At OpenAI, a three-person Harness Engineering team set one rule on day one: no manually written code. Five months later, they'd shipped roughly a million lines across application logic, infrastructure, and CI configuration, about 1,500 pull requests, all written by Codex agents. Production software, in daily use by hundreds of internal users. Zero human keystrokes on the code itself.
The threshold isn't a prediction anymore. It's a news story.
I Felt It Too
I want to be honest about my own experience here, because I think it maps to what a lot of senior engineers have quietly gone through over the past year or so.
The progression wasn't sudden. It went: helpful. Good. Really good. Oh, wait.
I've been building projects with AI coding tools for the past year and a half. For most of that time, I was the classic skeptic: using the output as a starting point, rewriting the parts that were wrong, staying in control of the code the way you would with a junior developer whose work you don't quite trust yet. That was the right call. The output earned the skepticism.
Then Claude Code 4.5 hit. And something shifted that I didn't fully anticipate.
I was running three projects simultaneously. I set up a separate desktop environment for each one, all the tools, all the context, and I'd start a prompt in one, swipe to the next, prompt there, swipe again, keep moving. At one point I genuinely lost track of what I was building in which window. My instinct was to context-switch mentally, to reconstruct where I was. Instead, I just asked Claude: what were we working on, where did we leave off, and what should we do next?
It answered. Completely. Accurately. And we kept going.
That was the moment I understood what Söderström was describing. Not a feature. Not a benchmark. A shift in the working relationship, from tool to collaborator. And once that shift happens, you don't go back.
What Real Trust Actually Requires
Here's what I don't want to do: wave a flag and say trust everything. That's not what the engineers at Anthropic, OpenAI, and Spotify are doing, and it's not what I'm doing either.
Trust isn't binary. Senior engineers don't decide "I trust AI" in the abstract. They make calls across specific dimensions, the same dimensions they've always used to evaluate whether code is actually good. What's changed is how often AI is clearing those bars.
Security
AI coding agents (Claude Code, GitHub Copilot, Cursor, Amazon Q) have historically generated code with real vulnerabilities: SQL injection flaws, hardcoded secrets, insecure defaults. The gap is closing, but it's closing because teams are building security scanning into the agentic loop, not because models are magically safe. Tools like Sonar exist precisely for this. Trust here is earned at the pipeline level, not the prompt level.
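As a concrete illustration, here is a minimal sketch of what "trust at the pipeline level" can look like: a gate that runs a static scanner over the paths an agent touched and hands the findings back to the loop instead of letting the change through. The semgrep invocation is just one example; swap in whatever scanner your pipeline already relies on.

```python
# Minimal sketch of a pipeline-level security gate for AI-generated changes.
# The scanner command is illustrative; any SAST tool that exits non-zero on
# findings fits the same shape.
import subprocess
import sys

SCANNER_CMD = ["semgrep", "scan", "--config", "auto", "--error"]  # exits 1 on findings

def security_gate(changed_paths: list[str]) -> bool:
    """Scan only the paths the agent touched; block the merge on any finding."""
    result = subprocess.run(SCANNER_CMD + changed_paths, capture_output=True, text=True)
    if result.returncode != 0:
        # In an agentic loop, these findings become feedback for the next
        # iteration rather than a human firefight after merge.
        print("Security gate failed; returning findings to the agent:")
        print(result.stdout)
        return False
    return True

if __name__ == "__main__":
    if not security_gate(sys.argv[1:]):
        sys.exit(1)
```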
Architectural Integrity
AI is excellent at the local solution. It still struggles with global context; a function can be elegant in isolation and wrong at the system level. The OpenAI Harness team addressed this directly: they encoded architecture documentation into the agent's operating environment, enforced module boundaries, and made structural constraints part of every prompt. They didn't hope the agent would get it right. They built the context so it couldn't easily get it wrong. That's the model.
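One way to "build the context so it couldn't easily get it wrong" is to make structural rules executable. Below is a minimal sketch, with a purely hypothetical layer map, that checks generated Python against declared module boundaries so a violation fails the loop instead of slipping past review.

```python
# Sketch of enforcing module boundaries as part of the harness. The layer map
# is hypothetical; the point is that structural rules live in code the agent's
# output is checked against, not in a reviewer's memory.
import ast
import pathlib
import sys

# Which top-level packages each package may import from (assumed layout).
ALLOWED = {
    "api": {"api", "services"},
    "services": {"services", "storage"},
    "storage": {"storage"},
}

def violations(root: str = "src") -> list[str]:
    problems = []
    for path in pathlib.Path(root).rglob("*.py"):
        layer = path.relative_to(root).parts[0]
        if layer not in ALLOWED:
            continue
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            names = []
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            for name in names:
                target = name.split(".")[0]
                if target in ALLOWED and target not in ALLOWED[layer]:
                    problems.append(f"{path}: {layer} may not import {target}")
    return problems

if __name__ == "__main__":
    found = violations()
    print("\n".join(found) or "module boundaries hold")
    sys.exit(1 if found else 0)
```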
Readability and Simplicity
Code gets read far more than it gets written. Readability is a maintenance strategy, a bus-factor reducer, and a measure of craft. AI-generated code can be verbose or just off, technically correct but visually noisy. The question I ask: does this look like it belongs in this codebase? Would a developer who didn't write it understand it immediately? This bar is rising, but it's sensitive to how well you've encoded your standards into the context you're feeding the agent.
Adherence to Well-Known Patterns
When a senior engineer reaches for a Strategy pattern, an Observer, or a Factory, they're not being academic. They're communicating intent to every developer who touches that code later. It's a shared vocabulary. AI tools land on the right pattern naturally more often than they used to, but not consistently. Getting this right requires being explicit about what you want. Encode your pattern preferences. Don't assume.
Reliability: Close to Zero Bugs
The hardest bar. Senior engineers have instincts about edge cases and failure modes that only come from years of things breaking at the wrong time. AI is optimized for the happy path. The teams that have crossed the threshold are closing this gap not by trusting the model blindly, but by building verification into the loop: agents reviewing their own output, running tests, iterating until automated checks pass. The human job shifts from checking the code to designing the system that checks the code.
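In practice, that system can be as simple as a loop: propose, check, feed the failures back, repeat. A rough sketch, where `run_agent` is a stand-in for whatever agent API or CLI you actually use and the test command is illustrative:

```python
# Sketch of a verification loop: the agent proposes a change, the harness runs
# the automated checks, and failures go back as feedback until the checks pass
# or the loop gives up and escalates to a human.
import subprocess

MAX_ITERATIONS = 5
TEST_CMD = ["pytest", "-q"]  # plus linters, type checks, security gates, etc.

def run_agent(task: str, feedback: str | None = None) -> None:
    """Placeholder: call your coding agent and apply its changes to the tree."""
    raise NotImplementedError("wire this to the agent of your choice")

def verified_change(task: str) -> bool:
    feedback = None
    for _ in range(MAX_ITERATIONS):
        run_agent(task, feedback)
        result = subprocess.run(TEST_CMD, capture_output=True, text=True)
        if result.returncode == 0:
            return True  # checks pass; ready for human review or merge
        feedback = result.stdout + result.stderr  # failures become the next prompt
    return False  # the loop could not converge; escalate to a human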
The Models Are Getting Better. That's Not a Small Thing.
Here's the part that should give every skeptic pause: the five dimensions above aren't a static target. The models clearing them are improving at a rate that's hard to fully internalize.
Opus 4.5 was the specific breakthrough Söderström named on that earnings call. That was November. By February, Anthropic had already shipped Opus 4.6, with agent teams built in. Not one agent working through a task sequentially, but multiple agents splitting the work, coordinating in parallel, each owning its piece. One writes. One reviews. One runs the tests. All simultaneously.
Think about what that does to the five dimensions I just laid out. Security gets a dedicated review pass baked into the loop. Architectural consistency gets an agent whose only job is to hold the system-level view. Reliability improves because the feedback cycle compresses from hours to minutes. And all of it compounds, because a better model producing better output inside a better harness isn't additive. It's multiplicative.
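To be clear about shape rather than product: the sketch below is not any vendor's agent-teams API. It is only an illustration of the coordination pattern, with placeholder functions standing in for the writer, the reviewer, and the test runner.

```python
# Conceptual sketch of parallel agent roles gated by the harness. All three
# agent functions are placeholders, not real APIs.
import asyncio

async def writer_agent(task: str) -> str:
    # Placeholder: in practice this calls the coding agent and returns a diff.
    return f"diff for: {task}"

async def reviewer_agent(diff: str) -> list[str]:
    # Placeholder: a real reviewer agent returns findings against the
    # architecture docs and conventions.
    return []

async def test_agent(diff: str) -> bool:
    # Placeholder: a real test agent applies the diff in a sandbox and runs the suite.
    return True

async def agent_team(task: str) -> bool:
    diff = await writer_agent(task)
    # Review and testing run concurrently; both must come back clean.
    findings, tests_pass = await asyncio.gather(reviewer_agent(diff), test_agent(diff))
    return tests_pass and not findings

if __name__ == "__main__":
    print(asyncio.run(agent_team("add rate limiting to the upload endpoint")))
```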
We're not at the ceiling. We're somewhere on an accelerating curve, and the teams that built their harnesses around Opus 4.5 are already inheriting the benefits of 4.6 without changing a thing. That's a fundamentally different dynamic than any previous generation of developer tooling.
The Harness Is the Product
What connected December's breakthrough across Anthropic, OpenAI, and Spotify wasn't just better models (though Opus 4.5 was the specific trigger Söderström named). It was that the harnesses around those models had matured enough to make the output consistently trustworthy.
The harness is the scaffolding: architecture documentation, security scanning, pattern conventions, feedback loops, self-review mechanisms. A powerful model in a bare prompt is still unreliable. A well-harnessed model starts to look like a collaborator worth trusting. The senior engineers who crossed the threshold didn't just start trusting the model. They built the environment that made the model worth trusting. That distinction matters enormously.
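If that still sounds abstract, a harness can be as mundane as this: a small object that says which documents ride along with every prompt and which gates every change has to clear. The file paths and check commands below are illustrative, not anyone's actual setup.

```python
# A toy harness: standing context for the agent plus automated gates.
# Paths and commands are assumptions for the sake of the sketch.
from dataclasses import dataclass, field
import pathlib
import subprocess

@dataclass
class Harness:
    # Context fed to the agent with every task.
    context_files: list[str] = field(default_factory=lambda: [
        "docs/architecture.md",   # module boundaries, system-level constraints
        "docs/conventions.md",    # naming, preferred patterns, error handling
    ])
    # Gates every change must clear before a human sees it.
    checks: list[list[str]] = field(default_factory=lambda: [
        ["pytest", "-q"],
        ["semgrep", "scan", "--config", "auto", "--error"],
    ])

    def build_prompt(self, task: str) -> str:
        docs = "\n\n".join(pathlib.Path(p).read_text() for p in self.context_files)
        return f"{docs}\n\nTask: {task}"

    def gates_pass(self) -> bool:
        return all(subprocess.run(cmd).returncode == 0 for cmd in self.checks)
```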
What Comes Next
When senior engineers genuinely trust the output across these dimensions, not perfectly, but consistently, the shift that's already happening at Spotify and Anthropic becomes mainstream. English stops being a way to describe what you want coded. It becomes the way you deliver it.
This doesn't end programming. It completes a transition that's been underway for seventy years, from assembly to high-level languages to natural language. Each step didn't eliminate engineers. It elevated what engineers could do.
The engineers who thrive in what comes next won't be defined by how much code they wrote. They'll be the ones who built the harnesses, encoded the standards, and developed the judgment to know when the system needs a human in the loop and when it doesn't.
Söderström said it on a public earnings call with Wall Street listening: "we crossed the threshold."
The question isn't whether you believe him. It's what you're building on the other side.