Field notes on AI deployment, Level-3 agents, MCP servers, and the gap between wanting AI and running it in production.

Two of the biggest enterprise AI deployments in 2025 hit the same wall: access without architecture. The companies that get AI right are not spending less. They know where the money goes.
Every notable launch-day reaction with receipts: Karpathy, Mollick, Willison, the eval data, the safety fight, and the question nobody can answer yet.
The Claude Code team put it plainly after Fable 5 launched: we used to verify that Claude did the work right. Now we verify it is doing the right work. That is not a threat. That is a promotion.
Most teams are running Claude like a task runner. Fable 5 is designed for goals. The difference is not just workflow. It is the gap between Level 2 and Level 3.
Most teams dump a brief into Claude and wait for output. The Anthropic team changed how they work with Fable 5: ask Claude to interview you first. Here is why that changes everything.
"Keep it simple" is a constraint. "This feature might be deleted in a month" is context. The Anthropic team's Fable 5 insight: context lets Claude catch things you did not think of. Constraints just limit it.
Claude Fable 5 can run autonomously for hours, test its own work, and produce better output than human reviewers. Most teams are still watching every step. That is not safety. It is a bottleneck.
Claude Fable 5 just landed. Models keep getting smarter. And yet there is a category of enterprise work that gets more valuable, not less, as models improve. Private context. Permission. Accountability.
For every dollar spent on software, companies spend six on services. AI does not eliminate that six dollars. With Fable 5, it lets smart operators capture both sides. Here is how the math changes.
Claude Fable 5 hits 80.3% on SWE-bench Pro while OpenAI kills Verified for contamination and FrontierCode resets every frontier model below 30%.
Claude Fable 5 leads the GDPval-AA leaderboard at 1932 Elo. What expert parity on real deliverables means, and the perfect-brief catch in the fine print.
The industry's most-cited single number for model capability: what the ten component evals measure, why v4 cut the top score from 73 to 50, and what a composite hides.
How the AI coding benchmark works, why SWE-bench Verified died, what SWE-bench Pro and FrontierCode actually measure, and how to read a score.
OpenAI's benchmark for AI on real economic deliverables across 44 occupations. How it is graded, what expert parity means, and what the scores hide.
The CRM owned thirty years of enterprise value by owning the database. The orchestration layer is the new gravity well. Switching costs migrate to accumulated reasoning.
Wedge, suite, platform used to take ten years. AI collapsed it to eighteen months. Cursor replaced VS Code at seed stage. Ambition beats timing now.
540,000 lines of code plus 276,000 lines of tests equals a cage built for a model that no longer needs one. The economics flipped. Most codebases didn't.
AI raises the floor and floods the zone with close-but-not-right output. Demand for expert judgment goes up, not down. The paradox every scaling company hits.
You can outsource your thinking but never your understanding. Karpathy's agentic engineering thesis, and why taste is recognizing failure before it ships.
Claude Opus 4.8 introduced dynamic workflows — Claude writes its own orchestration script, then runs hundreds of agents in parallel for migrations, audits, and tasks too large for any single conversation.
Codex is named after its coding origins, but it has become something broader: a tool-using agentic workspace powered by GPT-5.5 that handles email, research, writing, planning, and operations alongside code.
The Roman legion was the best management technology of its time. Most companies today are organised the same way: humans as conduits for information at every layer. AI removes the need for the conduit. Here is what changes.
Cloudflare laid off 20% of its workforce while growing at 30%. The people let go were not underperformers. They were measurers — people whose primary work was moving information between layers that could not talk directly.
AI adoption gives people better tools. The company stays the same. AI transformation redesigns the company around what AI makes possible. Most companies are doing the first and calling it the second.
Not a theory. YC, Browserbase, Airtable, Every.to. Real companies doing this right now. Here is what it looks like in practice — the systems, the structure, and what it produces.
McKinsey samples your organisation, delivers a roadmap, and exits. AI transformation touches every function, every workflow, every role. You cannot sample your way to a transformation. Here is why the method has to change.
Every layer in your company exists because information needed a human to carry it. Meetings. Reports. Middle management. That constraint is lifting. Here is what changes when information moves itself.
A tool does what you ask, then stops. An agent teammate takes ownership of a task, makes decisions within defined boundaries, and reports back. Here is the difference — and why it matters for your company.
When you run multiple AI agents, they each start from scratch. They do not know what the others know. They do not follow the same rules. An Agent OS fixes this. Here is what it is and why it matters.
OpenAI benchmarked AI on real professional tasks across 44 occupations. The models are approaching expert quality. The three things that unlock that performance are context, scaffolding, and oversight. Your company has none of them.
Enterprise AI projects fail because they're built on the org-chart version of your company. The agent needs the real one. That version only exists in the field.
Most founders ask where to start with AI. The wrong first function wastes 3–6 months. Here's the ranked list: seven functions, ordered by payback speed, deployment difficulty, and compliance overhead.
Anthropic says 90% of its code is AI-written. Google says 75%. A founder shipped 1,000+ PRs with no engineering team. Here's what a software factory actually is — and why most companies are nowhere close.
Early factories replaced steam engines with electric motors and kept the same floor plan. Marginal gains. The ones that redesigned around electricity got 10x. Most companies are making the same mistake with AI.
For 30 years, the CRM was where enterprise value lived. AI agents don't need the UI. They need structured data at the API layer. The value is moving — and the window to position above it is open.
Claude Opus 4.8 hits 69.2% on SWE-bench Pro, open-weights models close within 6 points at 8x lower cost, and Verified becomes a zombie metric.
Opus 4.8 takes the lead at 1890 Elo, Grok 4.3 jumps 321 points, and Gemini 3.5 Flash beats Google's own Pro tier on real work.
Most AI pilots fail for one reason: the workflow they're trying to automate was never instrumented. No machine-readable artifacts, no queryable state, no closed loop. You cannot automate what you cannot observe.
76% of organizations now have a Chief AI Officer. Most haven't shipped a single agent to production. The hire who will get AI into your systems in week one is not the hire who needs six months to understand your company.
On August 2, GPAI enforcement goes live and high-risk AI system obligations activate. Most B2B SaaS internal agents are limited-risk — but one category catches almost every founder off guard.
GPT-4 became GPT-4o, o1, o3, 4.1. Claude 3 became 3.5, 3.7, 4. The models kept improving. The workflows never got built. The gap isn't capability — it's assembly.
ChatGPT made AI accessible. Vibe coding made it fast. Agentic engineering makes it useful. Most companies are still in wave 1. Here is what wave 3 actually looks like — and what it takes to get there.
Most companies think they're deploying AI. They're running Level-1 tools at best. Here's the full capability spectrum and what it takes to reach Level-3 — where AI closes operational loops without human intervention.
The models are good. The APIs are accessible. So why isn't your AI pilot in production? The blockers are data access, permission architecture, and the absence of someone who owns the outcome after handoff.
Model Context Protocol is the infrastructure layer that connects AI agents to your live internal systems. Without it, agents are isolated from the data that makes them useful. Here's what it is and how it works.
GDPR and data residency aren't the blocker most people assume — if you architect for them from the start. A practical guide to on-prem AI for European scaling companies, including why Claude and Codex beat open-weight models for most use cases.
Berkeley researchers break 8 agent benchmarks with a 10-line exploit, Mythos Preview exposes the Verified-vs-Pro gap, and GPT-5.5 lands at 58.6%.
GPT-5.5 launches at 84.9% expert parity, economists start writing about AI eating analyst work, and Grok 4.3 enters beta.
GPT-5.4 leads the standardized SWE-bench Pro set at 59.1%. Post-Verified, the honest-low scores show where deployment work actually lives.
GPT-5.4 moves to the top of GDPval-AA at 1674 Elo with three labs within 70 points. The differentiator shifts to price, context, and your workflows.
OpenAI deprecates SWE-bench Verified after models reproduce gold patches from task IDs alone. The 80% cluster made it meaningless anyway.
Opus 4.6 retakes #1 at 1606 Elo, then Sonnet 4.6 tops it at 1633 for $3/$15. Gemini 3.1 Pro proves exam brilliance does not transfer to deliverables.
Artificial Analysis rebuilds its Intelligence Index around work-shaped evals. The top score falls from 73 to 50. The models did not get worse.
GPT-5.2 hits 70.9% win/tie against professionals and Artificial Analysis launches independent Elo grading. Vendors stop marking their own homework.
Four frontier releases in twelve days. Claude Opus 4.5 becomes the first model over 80% on Verified, and the 35-point Pro spread is the warning.