TRISTAN CHIAPPISI · Field Report · No. 2 · April 2026
I · The Capability Surge · II · Inside the Daily Use · III · What Time-on-Tool Buys You · IV · The Forecast · What This Means
Field Report · No. 2 · Published 2026-04-30 · Columbus, Ohio

Untrainable.

A field report on AI proficiency in 2026. The dominant "how to use AI" framing has changed six times in forty months. No course can ship faster than its techniques go obsolete. The only path to fluency is to use the tool for everything. Every claim sourced.

By Tristan Chiappisi · Columbus · 8 charts · 18 min read
There is no curriculum · There is the tool
From the Editor

You cannot write a how-to for software released this morning. The pedagogy of AI is impossible by construction.

The pace is the entire problem. METR's task-horizon paper from March 2025 measured how long a task a frontier AI agent could complete autonomously, and found that horizon was doubling roughly every seven months. SWE-bench Verified, the most-watched coding benchmark, gained roughly sixty percentage points between its August 2024 launch and April 2026. In the single calendar year of 2025, the umbrella term that practitioners used for "how to work with AI" changed several times. Prompt engineering. Building effective agents. Vibe coding. Vibe engineering. Each had a four-to-seven-month run as the canonical framing, then was replaced or folded into the next one. None of those terms now describes the workflow that the head of Claude Code uses to ship twenty-two production pull requests in a day.

Two years ago, prompt libraries were everywhere. PromptBase, FlowGPT, the awesome-prompts repos on GitHub, the courses that promised to teach you the "right structure" of a prompt. There was a job title on LinkedIn called Prompt Engineer. Most of that has gone quiet. The libraries are unmaintained, the marketplaces have stopped growing, and the job listings dried up. Modern frontier models obsoleted the techniques those libraries were built around by being good enough out of the box. That is the cadence. That is the problem.

A structured course takes nine to eighteen months to develop, certify, and roll out. Frontier models change, fundamentally, faster than that. The development cycle of any "AI training" curriculum is longer than the half-life of the techniques it would teach. By the time the course ships, the regime it describes has aged out. This is not a problem better training programs would solve. It is a problem nobody can solve, because the thing being taught is moving faster than the act of teaching by an order of magnitude. There is no stable body of knowledge to transfer. There is only the tool.

The argument of this report is in two halves. First, the capability has moved dramatically in the past twelve months. Earlier drafts of this issue used 2024 figures and had to be retired; they were stale by exactly the amount the field has moved. Second, and harder: the only path to fluency, given the pace, is to use the tool constantly. Make AI write every line of code, including the one-character changes. Make it draft every email, produce every analysis, summarize every document. Build the workflow. Ship the work. Let your sense of the model's capability update in real time, faster than any curriculum could ship. The roughly five-percent productivity advantage that Anthropic measured in March 2026 for high-tenure users (once controlled for model, language, geography, and use case) is the only direct empirical signal we have of what time-on-tool buys you. The rest is vendor claims, LinkedIn opinions, and guesses dressed in slides.

The METR follow-up, published February 2026, is itself a perfect demonstration. The same outfit whose July 2025 study found experienced developers were nineteen percent slower with AI re-ran the study with a second cohort and the year's newer models. They could not produce a clean answer; the confidence intervals straddled zero. Their stated belief is that developers are now sped up, and they are redesigning the experiment because the world moved faster than their methodology could measure. The leading research group on AI productivity is, in early 2026, telling you that the field they study is moving faster than they can study it. Believe them, and act accordingly.

Tristan Chiappisi, ed. · Field Report No. 2 · Updated thesis · Updated data · Skip the training
Six Numbers, One Thesis
87.6%
Top model SWE-bench Verified score, April 2026 (Claude Opus 4.7). Up from ~65% in early 2025. The capability has surged even as the measurement got harder.
SWE-bench leaderboard · Apr 2026
~7mo
Doubling time of AI's task-completion time horizon. How long a task an AI agent can complete autonomously at 50% success.
METR · Mar 2025 · arXiv:2503.14499
59%
Share of daily work that Anthropic's own engineers now do with Claude, up from 28% one year prior. Self-reported productivity gain rose from 20% to 50% in the same window.
Anthropic · Dec 2025 · n=132 + 53 interviews
49%
Share of jobs where AI is now used for at least 25% of tasks. Roughly 4% of jobs have reached 75% AI task coverage.
Anthropic Economic Index · Jan 2026
5%
Productivity advantage that high-tenure (6+ month) Claude users have over new users, once controlled for model, language, geography, and use case. Tenure correlates with fluency.
Anthropic Economic Index · Mar 2026
3%
Share of developers who "highly trust" AI accuracy in 2025. Adoption is up. The trust gap has not closed.
Stack Overflow · 2025 · n=33,244

I.

The capability is doubling. The measurement is having a hard time keeping up.

The Capability Surge. What 2024 data missed.

The story of AI in 2026 is a curve that is moving faster than any methodology built to study it. Two complementary measurements (SWE-bench Verified scores and METR's task-horizon) both show roughly the same thing: exponential capability gain.

A.

SWE-bench Verified scores and the task-horizon doubling.

SWE-bench Verified, the most-watched coding benchmark, has gone from a top score around 25–30% at its August 2024 launch to 87.6% in April 2026. METR's parallel measurement of how long a task an AI agent can complete autonomously shows a roughly seven-month doubling time. Two different yardsticks, the same shape of curve. The chart pairs them.

SOURCES: SWE-bench Verified leaderboard snapshot, April 2026 (marc0.dev/en/leaderboard). METR, Measuring AI Ability to Complete Long Tasks, Kwa et al., arXiv:2503.14499 (March 2025); claimed doubling time ≈7 months. Caveat: OpenAI has stopped reporting Verified scores citing data-contamination concerns; Claude Opus 4.5 scores 80.9% on Verified but 45.9% on the harder SWE-Bench Pro with standardized scaffolding. Treat 80%+ as benchmark performance, not as "the model finishes 80% of real software work."
The benchmark is partially compromised, and the curve still bends up. Whether you read the SWE-bench numbers as gospel or as marketing-adjusted, both metrics agree that capability in 2026 is a different regime than in 2024. Reports that anchored on 2024 data are stale by exactly the amount the curve has moved.
B.

Adoption is broadening at the same time it's deepening.

Three independent signals tell the same story. Stack Overflow's 2025 survey: 84% of working developers use or plan to use AI tools, 51% of pros use them daily. Anthropic's Economic Index: employee AI use at work doubled from 20% in 2023 to 40% in mid-2025. GitHub Octoverse 2025: 80% of new GitHub developers use Copilot in their first week; the number of public repos using an LLM SDK grew 178% year over year.

SOURCES: Stack Overflow 2025 Developer Survey, n=33,244 in AI section (survey.stackoverflow.co/2025/ai). Anthropic Economic Index, Sept 2025 and Jan 2026 reports (anthropic.com/research). GitHub Octoverse 2025 (github.blog/news-insights/octoverse). Anthropic figures are vendor data and reflect Claude users specifically.
Trust did not move with adoption. Eighty-four percent use AI; three percent highly trust it. Sixty-six percent of developers cite "AI is almost right but not quite" as their biggest frustration. The literacy gap is what connects those two numbers. People are using a tool whose output they don't trust, because they have not yet developed the craft of working with it well.
Andrej Karpathy, Feb 2 2025
There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.
A year later (Feb 2026): "programming via LLM agents is increasingly becoming a default workflow for professionals, except with more oversight and scrutiny."

II.

Why no curriculum can keep up

There Is No Manual. The pedagogy is impossible by construction.

Two complementary pictures of the daily use of AI in 2026. The first shows what people actually do with it, and how the people who use it most use it differently. The second shows why the umbrella terms describing "how to use it" turn over every five months, faster than any course can ship.

C.

How the tool is being held, and how high-tenure users hold it differently.

A treemap of Claude.ai conversation breakdowns from Anthropic's January 2026 Economic Index, with a panel below comparing how high-tenure users (six months or more on platform) behave versus new users. The differences are small in any single category but consistent in pattern: experienced users iterate more, command less, and hand over more of the rules-heavy work to the model.

SOURCES: Anthropic Economic Index, January 2026 (n=1M Claude.ai conversations + 1M API transcripts, Nov 2025). Tenure breakdown reported by Anthropic's March 2026 Economic Index, summarized by Built In; anthropic.com primary URL not directly verified at fetch time (flagged). Vendor data; generalize beyond Claude users with care.
The behavioral split is the most underrated finding in the data. High-tenure users issue directive commands at a 29% rate; low-tenure users, at 38%. High-tenure users iterate at a 28% rate; low-tenure users, at 24%. The fluent population is not just using AI for different tasks. They are conducting different conversations with it. That is the craft.
D.

The vocabulary treadmill. Every five months, a new "how to" replaces the last one.

A timeline of the dominant umbrella term that practitioners used to describe "working with AI" between January 2023 and April 2026. Each had a four-to-seven-month run as the canonical framing, then was replaced or merged. The cadence is faster than the development cycle of any structured curriculum, which is the central, mechanical reason that AI training programs ship obsolete. The course doesn't keep up because nothing keeps up.

SOURCES: Term emergence dates compiled from primary publications and practitioner discourse. Anthropic, Building Effective Agents (Dec 19, 2024). Andrej Karpathy, "vibe coding" tweet (Feb 2, 2025). Simon Willison, Vibe Engineering essay (Oct 7, 2025). Pre-2024 terminology windows reconstructed from research-paper publication dates and Stack Overflow / blog-tag usage trends. "Course development cycle ≈12 months" is a corporate-L&D rule of thumb, not a single sourced statistic; flagged accordingly.
The half-life of practice in this field is roughly five months. The development cycle of a structured course is roughly twelve. By the time a "How to Use AI" curriculum makes it through internal review, vendor approval, and rollout, the era it describes has been replaced. This is not a problem better training programs would solve. There is no stable body of knowledge to teach. The only honest pedagogy is "go use the tool, watch what happens, adjust." Anyone selling you anything else is selling something the field will obsolete before it ships.

III.

What time-on-tool actually buys you

What Separates the Fluent. Tenure, iteration, and the absence of training.

No published study compares structured AI training against unstructured heavy use. There is no head-to-head RCT. The honest empirical answer is: nobody knows. The signals we do have all point in the same direction. Fluency tracks tenure, frequency, and pattern of use, not pedagogy.

E.

Six measured signals from heavy users versus light users.

A grid of dot-pair plots. Each panel compares two groups on a single measured axis, every dataset taken from a public study or a vendor-released analytics report. The pattern is consistent: people who use AI more, in more contexts, for longer, use it differently and (on the metrics we can observe) more successfully. The causal arrow cannot be settled with this data (selection effects are real), but the correlation is everywhere.

SOURCES: Anthropic Economic Index, March 2026 (tenure success and behavioral splits) and Sept 2025 (employee AI use). Stack Overflow 2025 (daily-use share). Anthropic internal, How AI Is Transforming Work at Anthropic, Dec 2025 (n=132 surveyed engineers). Cursor blog, Sarkar/U. Chicago observational study, Nov 2025 (24 Cursor orgs vs 8 baseline). Caveat: every measurement here is observational, not causal. Heavy users self-select.
Hours-of-use does not directly predict who got faster in the only RCT we have. METR's July 2025 data showed a single developer with 50+ hours of Cursor experience who saw a 38% speedup. That is n=1, not a pattern. The case for sustained use is supported by the broader observational evidence, not by a controlled experiment. The prescription is empirically defensible; it is not yet empirically proven.
F.

The money is real. The capture is now starting to materialize.

Investment in enterprise generative AI ran to $30–40 billion through mid-2025, with a 95-percent null-result rate (MIT NANDA). Anthropic's January 2026 Economic Index estimates that AI could contribute 0.7–2.6 percentage points of annual productivity growth over the next decade. That is a range, not a forecast. The earlier null-result numbers remain unchallenged; the newer estimates are the first hint that capture is starting to show up at the macro level.

SOURCES: MIT NANDA, The GenAI Divide: State of AI in Business 2025, August 2025 (300+ deployments, 52 interviews, 153 leader survey responses). Anthropic Economic Index, January 2026; productivity-growth estimates: base case 1.8pp/year, success-adjusted 1.2pp/year, range 0.7–2.6pp/year. Brynjolfsson, Li, Raymond, Generative AI at Work, NBER 31161, 14% avg / 34% novice / minimal experienced-worker gain.
The gain in the existing literature is unevenly distributed. Brynjolfsson et al. found AI mostly helps novices, a finding that complicates "heavy use predicts proficiency." A defensible reading: AI compresses skill at the bottom and rewards craft at the top, with a wide and confusing middle where most measurements live. Enterprise capture is starting to appear, two years after the spending peaked.
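For scale, here is a hedged back-of-the-envelope on what that 0.7–2.6pp-per-year range compounds to over a decade, under a simplifying assumption that is mine, not the Index's: a constant annual contribution.

# Compounding the Anthropic Economic Index range over ten years.
# Assumption for illustration only: the annual contribution is constant.
# The Index publishes a range of annual estimates, not a year-by-year path.
for label, pp_per_year in [("low", 0.7), ("base", 1.8), ("high", 2.6)]:
    cumulative = (1 + pp_per_year / 100) ** 10 - 1
    print(f"{label}: {pp_per_year}pp/yr -> ~{cumulative:.1%} higher productivity level after 10 years")
# low: ~7.2% · base: ~19.5% · high: ~29.3%

The spread at the ten-year mark, a few points at the low end versus nearly a third at the high end, is the concrete sense in which this is a range and not a forecast.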
Simon Willison, Oct 7 2025
AI tools amplify existing expertise. The more skills and experience you have as a software engineer the faster and better the results you can get from working with LLMs.
"Vibe Engineering" · simonwillison.net

IV.

What the measurements actually say in 2026

The Forecast. Predictions, measurements, and where the curve is going.

Plotting predictions against measurements across the public studies of 2025–2026 reveals a wide and honest range. Capability gains are real. Productivity gains are real-but-conditional. Anyone offering a single number is selling something.

G.

The full landscape of predictions and measurements, 2025–2026.

Every public, methodologically defensible measurement of AI's productivity impact this report could verify, plotted on a single axis. Predictions cluster between +24% and +39%. Measurements span from a 19% slowdown to a 55% lab-task speedup, with vendor-reported and self-reported numbers landing high and randomized-controlled numbers landing all over. The range is itself the finding.

SOURCES: METR (forecasts and main RCT, July 2025; follow-up Feb 2026 with redesigned methodology and ambiguous CIs). Peng et al., arXiv:2302.06590 (single-task lab RCT, 2023). Cui, Demirer, Jaffe et al., MIT Economics 2024 (Microsoft and Accenture enterprise field experiments). Cursor / Sarkar, Nov 2025 (observational, 24 vs 8 orgs). Anthropic internal, Dec 2025 (self-report). Brynjolfsson, Li, Raymond, NBER 31161 (RCT, customer support agents).
Only one measurement on this chart was an RCT on experienced software engineers doing real work in their own production codebases. That measurement is METR's −19%. The +55.8% from the GitHub lab study is a single isolated task. The +50% from Anthropic's own engineers is self-report. The +39% from Cursor is observational and selection-confounded. Cherry-picking any of these tells the story you wanted to tell. The honest answer is that the productivity story is contingent and unsettled, even as capability is racing.
H.

The doubling curve, projected forward.

METR's task-horizon doubling implies that the autonomous-task duration of frontier AI agents grows about ten-fold every two years. The chart plots the measured points (March 2025: ≈1 hour at 50% success; February 2026: ≈12 hours, second-hand) and extends the trend three more years. Whether or not the trend holds exactly, the implications of even a slower-doubling version are dramatic. Marked explicitly as projection, not measurement.

SOURCE: METR, Measuring AI Ability to Complete Long Tasks, March 2025 (Kwa et al., arXiv:2503.14499). 2026 data point reported second-hand (Claude Opus 4.6 ≈ 719 minutes); flagged for re-verification. Projection: linear extension on log scale assuming the 7-month doubling holds. Not a forecast, an extrapolation. The METR authors themselves note the trend "may have accelerated in 2024" and could decelerate.
The window for "wait and see" is closing. If the doubling holds, frontier AI agents will be capable of week-long autonomous work by 2028. If the doubling time stretches to fourteen months, they will still be capable of multi-day autonomous work by 2028. Either timeline is incompatible with a workforce strategy of "we'll do training programs once the field settles down." It will not settle down before the workforce has to be ready.
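For readers who want to check the arithmetic behind those two scenarios, here is a minimal sketch of the extrapolation in panel H, assuming the second-hand ~12-hour February 2026 horizon as the starting point and a constant doubling time. It is a reconstruction of the projection described above, not METR's code.

# Projecting the METR task horizon forward under a fixed doubling time.
# Assumed starting point: ~719 minutes (~12 h) at 50% success, February 2026,
# the second-hand figure flagged in the source note. Projection, not measurement.

def projected_horizon_hours(horizon_now_h: float, months_ahead: float,
                            doubling_months: float) -> float:
    """Extend the task horizon forward assuming a fixed doubling time."""
    return horizon_now_h * 2 ** (months_ahead / doubling_months)

horizon_feb_2026 = 719 / 60   # hours
months_to_early_2028 = 24     # February 2026 -> February 2028

for label, doubling in [("7-month doubling", 7.0), ("14-month doubling", 14.0)]:
    h = projected_horizon_hours(horizon_feb_2026, months_to_early_2028, doubling)
    print(f"{label}: ~{h:.0f} hours (~{h / 24:.1f} days) by early 2028")
# 7-month doubling:  ~129 hours (~5.4 days) -> week-long autonomous work
# 14-month doubling: ~39 hours (~1.6 days)  -> multi-day autonomous work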
What this means if you build, lead, hire, or learn

Three implications worth holding onto.

1. The capability is exponential. The pedagogy is impossible.

A field whose canonical practices have a five-month half-life cannot have a curriculum, by construction. The development cycle of any "AI training" course is longer than the lifespan of any technique it would teach. Anyone selling a curriculum is selling something that ships obsolete. This is not a critique of training as a category. It is a statement about what is teachable, and "how to use a tool that doubles in capability every seven months" is not.

2. Don't pay for AI training. Pay for time on the tool.

If you have a budget for AI capability-building, spend it on tools, GPU credits, and protected hours for your team to use AI on real work. The single empirical signal we have for fluency development, namely Anthropic's roughly five-percent high-tenure advantage from March 2026 (once controlled for model, language, geography, and use case), does not come from a course. It comes from people who have been using the tool every day for six months. The "curriculum" is your own shipped work. Skip the certificate.

3. Use AI for every keystroke. Especially the small ones.

Make AI write every line of code, including the one-character changes. Make it draft every email. Make it produce every analysis, summarize every document, write every test. Especially when typing it yourself would be faster: typing it yourself builds nothing. Volume is the only documented input to fluency. The friction of switching is the literacy gap. Refuse to do anything the old way, and the gap closes for you while it grows for everyone else.

Sources

Every claim, where to check it.

METR: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity · Becker, Rush, Barnes, Rein. July 10, 2025. RCT, n=16, 246 issues. metr.org/blog/2025-07-10
METR: We are Changing our Developer Productivity Experiment Design · February 24, 2026. Follow-up, redesign, ambiguous CIs. metr.org/blog/2026-02-24-uplift-update
METR: Measuring AI Ability to Complete Long Tasks · Kwa et al., March 19, 2025. ~7-month doubling time. arXiv:2503.14499. metr.org
Anthropic Economic Index: January 2026 Report · n=2M conversations. 49% of jobs at ≥25% AI task coverage; 1.2–1.8pp/yr productivity growth est. anthropic.com
Anthropic Economic Index: September 2025 Report · 40% employee AI use (up from 20% in 2023). anthropic.com
Anthropic Economic Index, March 2026 Report · High-tenure (6mo+) users 5% more successful, controlled for model, language, geography, and use case. Reported by Built In. builtin.com
Anthropic: How AI Is Transforming Work at Anthropic · Dec 2, 2025. n=132 engineers + 53 interviews + 200K Claude Code transcripts. 28% → 59% daily work share. anthropic.com
Anthropic: Building Effective Agents · December 19, 2024. Vendor's own admission: start simple, complexity only when needed. anthropic.com
Stack Overflow Developer Survey 2025: AI section · n=33,244. 84% adopt; 51% pros use daily; 3% highly trust; 66% "almost right but not quite." survey.stackoverflow.co/2025/ai
GitHub Octoverse 2025 · 180M+ devs, 80% of new devs use Copilot in first week, +178% YoY in LLM-SDK repos. github.blog/news-insights/octoverse
Andrej Karpathy: "Vibe coding" tweet · Feb 2, 2025; one-year retrospective Feb 2026. Direct primary URL gated; reconstructed via secondary sources (CodeRabbit, Klover).
Simon Willison: Vibe Engineering · October 7, 2025. "AI tools amplify existing expertise." simonwillison.net
Cursor / U Chicago: Sarkar productivity study · November 2025. +39% PRs merged after agent default; senior devs accept agent more. Observational, not RCT. cursor.com/blog/productivity
Sean Goedecke: Reading the METR study · July 11, 2025. "Hours of Cursor experience didn't show a difference" in METR data. seangoedecke.com
Pragmatic Engineer (Orosz): Cursor and the AI learning curve · July 24, 2025. newsletter.pragmaticengineer.com
Brynjolfsson, Li, Raymond: Generative AI at Work · NBER WP 31161. 14% avg / 34% novice / ~0% experienced worker gain. nber.org/w31161
MIT NANDA: The GenAI Divide: State of AI in Business 2025 · August 2025. 95% of GenAI pilots show no measurable P&L impact. Reported by Fortune. fortune.com
Microsoft Research / CHI 2025: The Impact of Generative AI on Critical Thinking · Lee, Sarkar et al. n=319. Higher AI confidence ↔ less critical thinking. dl.acm.org/CHI2025
Errica et al.: LLM prompt sensitivity · NAACL 2025. Prompt fragility is an open empirical problem, not a settled craft. aclanthology.org
Fortune: 100% of code at Anthropic / OpenAI is AI-written (Cherny, Roon) · January 29, 2026. Anthropic spokesperson: "70–90%" company-wide. fortune.com
LessWrong: Is 90% of code at Anthropic written by AIs? · October 22, 2025. Critical reading; estimates closer to 50% of merged code on average. lesswrong.com
Lenny's Newsletter: Simon Willison: AI State of the Union · April 2, 2026. "November 2025 was when AI coding agents crossed from 'mostly works' to 'actually works.'" lennysnewsletter.com
SWE-bench Verified leaderboard snapshot, April 2026 · Top: Claude Opus 4.7 at 87.6%. marc0.dev/en/leaderboard

About this field report.

Field Report No. 2 in an irregular dispatch from Tristan Chiappisi, an engineer who works in data, builds in the AI space, gives talks about both, and writes things down when the data points somewhere interesting. The writing is set in Source Serif 4 and Inter Tight. There are no advertisements, sponsored sections, or affiliate links.

Every numerical claim in this issue is sourced to a 2025 or 2026 primary or near-primary source. Where data was thin, the claim is omitted; where projections appear, they are explicitly labeled. The earlier draft of this issue used 2024 figures and was retired for that reason.

If you want to know more, or have me speak, reach out on LinkedIn.


What's next

  • No. 3 · The Synthetic Data Audit (Q3 2026)
  • No. 4 · The Vector DB Reckoning
  • No. 5 · TBD
