We read OpenAI's and Anthropic's data-agent posts so you don't have to

OpenAI and Anthropic each shipped a writeup on the internal data agent they run in production. Read together, they show two frontier labs solving the same problem independently, and agreeing on most of it: the hard part is finding the right data, not writing SQL, and the agent stays simple while the context around it does the work. They differ on how context is delivered and how much they verify. Below is the short version of both posts.

  • Dmitry Petrov
  • June 04, 2026 · 5 min read

Two frontier labs, the same problem: how to make an agent reliable over real company data.

In January, OpenAI published how it built an internal data agent that thousands of employees now use to ask questions of company data in plain English. Yesterday, Anthropic published how it built one too. Two frontier labs, the same problem, five months apart.

We build the unstructured-data side of this, so we read both closely. Both posts are worth reading in full. But if you only have ten minutes, here is the honest digest: what they agree on, where they diverge, and the one assumption they share.

Side by side

| | OpenAI | Anthropic |
|---|---|---|
| Post | "Inside our in-house data agent" (Jan 2026) | "How Anthropic enables self-service data analytics with Claude" (Jun 2026) |
| Scale | 600+ PB, 70,000 datasets, thousands of users; built by 2 engineers in 3 months, ~70% AI-written | Powers 95% of business analytics queries at ~95% accuracy |
| Stated core problem | "isn't writing SQL - it's finding the right tables … and understanding … how to use data" | "map a user's question to specific and up-to-date entities in our data model" |
| Architecture | Simple agent (one LLM + runtime + ~13 curated tools) over 6 context layers | 4-layer stack: data foundations, sources of truth, skills, validation |
| How context is delivered | Six layers assembled offline, retrieved at runtime | Markdown skills the agent reads on demand: a "knowledge" router + an "unbook" workflow |
| Form factor | Standalone agent (Slack bot, own runtime + tools) | Skills loaded into Claude Code (no new agent) |
| Context kept fresh by | Scheduled jobs rebuild an index (nightly Codex code crawl) | Markdown committed in the same PR as the data model |
| Context sources | usage metadata, human annotations, Codex code analysis, institutional knowledge, memory, live warehouse queries | semantic layer (consulted first), lineage graph, query corpus, business context |
| Headline number | ~100 fixes/day; ran months autonomously; no accuracy figure given | 21% → 95%+ accuracy once skills were added |
| Sharpest finding | fewer tools win (40 → 13); rank queries by trust, don't embed them all | raw query access added <1 point ("structure, not access"); skills rot 95% → 65% in a month if unmaintained |
| Verification | data foundations, tool curation, a memory of past corrections | semantic-layer-first, offline evals, adversarial review, a provenance footer on every answer |
| Built on | a warehouse: schema, lineage, query surface for free | the same, plus a curated semantic layer |

What they agree on

The bottleneck is discovery, not SQL. Both lead with the same point. OpenAI: the hard part "isn't writing SQL - it's finding the right tables to use in the first place." Anthropic: the central problem is the "ability to map a user's question to specific and up-to-date entities in our data model." Generating the query was never the hard part.

Keep the agent simple; spend the effort on context. OpenAI's agent is a single model with a small set of curated tools; the reliability comes from the engineering around it. Anthropic's accuracy jumped from 21% to over 95% not from a better model but from adding context (skills). In both cases the model was already good enough; the surrounding context was the variable.

Code is a source of truth about data. OpenAI crawls its pipeline code with Codex because "pipeline logic captures assumptions, freshness guarantees, and business intent that never surface in SQL or metadata." Anthropic compiles a semantic layer and colocates its skill files with the transformation models that produce the data. Both treat the code that builds a table as primary evidence for what the table means.

Curated-and-structured beats raw-and-large. OpenAI cut its tool set from 40 to 13 because overlapping tools confused the model, and it ranks historical queries by trust rather than dumping them all in. Anthropic found that giving the agent raw retrieval over thousands of past queries "moved accuracy by less than a point" - even though the right query was usually there. The bottleneck was structure, not access.
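To make "rank by trust rather than dump them all in" concrete, here is a minimal sketch. Neither post publishes its scoring, so the signals (`author_is_owner`, `in_dashboard`, `runs_last_90d`) and their weights are our assumptions, chosen only to illustrate the shape of the idea: score each historical query, then hand the agent a short, ranked list instead of the whole corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PastQuery:
    sql: str
    author_is_owner: bool  # assumed signal: written by the team that owns the table
    runs_last_90d: int     # assumed signal: how often the query was re-run
    in_dashboard: bool     # assumed signal: backs a reviewed dashboard


def trust_score(q: PastQuery) -> float:
    """Hypothetical weighting; a real system would tune or learn these."""
    score = 0.0
    if q.author_is_owner:
        score += 2.0
    if q.in_dashboard:
        score += 3.0
    # Cap the usage signal so a heavily re-run ad hoc query cannot dominate.
    score += min(q.runs_last_90d, 100) / 100.0
    return score


def top_trusted(queries: list[PastQuery], k: int = 3) -> list[PastQuery]:
    """Return only the k most trusted queries to place in the agent's context."""
    return sorted(queries, key=trust_score, reverse=True)[:k]


corpus = [
    PastQuery("SELECT 1", author_is_owner=False, runs_last_90d=500, in_dashboard=False),
    PastQuery("SELECT 2", author_is_owner=True, runs_last_90d=10, in_dashboard=True),
    PastQuery("SELECT 3", author_is_owner=False, runs_last_90d=0, in_dashboard=False),
]
selected = top_trusted(corpus, k=2)
```

The design point is the cut-off, not the formula: the agent sees a few vetted examples, which is the "structure, not access" lesson in miniature.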

Where they differ

A standalone agent vs. a skill. OpenAI built a bespoke agent: its own runtime and a curated tool set, reachable as a Slack bot you ask in plain English. Anthropic built no new agent at all. The capability ships as skills loaded into Claude Code, the same coding agent engineers already use, split into a "knowledge" skill that routes to the right reference docs and an "unbook" skill that encodes the analyst workflow. One is a product you operate; the other is context you load into a tool you already have.

A generated index vs. committed markdown. They also keep that context fresh in opposite ways. OpenAI compiles its layers offline into a searchable index that scheduled jobs rebuild: Codex crawls the pipeline code on a schedule, and a periodic pass refreshes the per-table index the agent retrieves from. Anthropic's context is plain markdown checked into the same repo as the data models, updated in the same pull request whenever a model changes (about 90% of data-model PRs now carry the skill edit). One regenerates an index out of band; the other versions hand-written context in git, inline with the code.
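The same-PR rule is easy to picture as a CI check. The sketch below is our illustration, not Anthropic's implementation: the `models/` directory, the `.sql` extension, and the convention that each model has a sibling `.md` skill file are all hypothetical. It takes the list of paths changed in a pull request and reports models whose context file was not touched.

```python
from pathlib import PurePosixPath

MODEL_DIR = "models"  # hypothetical repo layout
MODEL_SUFFIX = ".sql"  # hypothetical: one SQL file per data model
SKILL_SUFFIX = ".md"   # hypothetical: sibling markdown file per model


def models_missing_skill_update(changed_paths: list[str]) -> list[str]:
    """Return changed model files whose sibling skill doc is not in the same PR."""
    changed = {PurePosixPath(p) for p in changed_paths}
    missing = []
    for path in sorted(changed):
        is_model = path.parts[:1] == (MODEL_DIR,) and path.suffix == MODEL_SUFFIX
        if is_model and path.with_suffix(SKILL_SUFFIX) not in changed:
            missing.append(str(path))
    return missing


pr_files = [
    "models/orders.sql",
    "models/orders.md",
    "models/users.sql",
    "README.md",
]
stale = models_missing_skill_update(pr_files)
```

A check like this could fail the build or merely warn; the post's figure of about 90% of data-model PRs carrying the skill edit suggests the rule is not absolute.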

What they choose to report. OpenAI leads with scale and throughput (petabytes, dataset counts, roughly 100 fixes a day, multi-month autonomous runs) and notably gives no single accuracy number. Anthropic leads with accuracy and automation share. The two posts are almost mirror images in what they consider the headline.

How much they invest in verification. This is the biggest gap. Anthropic builds an explicit validation layer: offline evals, an adversarial-review pass (which bought +6% accuracy at the cost of 32% more tokens and 72% higher latency), and a provenance footer on every answer. OpenAI leans more on solid data foundations, tool curation, and a memory layer that replays past corrections.
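A provenance footer is the easiest of these to sketch. The post says every answer carries one but, as far as this digest goes, not what it contains, so the fields below (tables used, whether the semantic layer was consulted, whether a review pass ran) are our guesses at what a reader would want to audit.

```python
from dataclasses import dataclass, field


@dataclass
class Answer:
    text: str
    tables: list[str] = field(default_factory=list)  # assumed: tables the query touched
    used_semantic_layer: bool = False                # assumed: semantic layer consulted
    reviewed: bool = False                           # assumed: adversarial review ran


def with_provenance_footer(answer: Answer) -> str:
    """Append a plain-text footer describing where the answer came from."""
    sources = ", ".join(sorted(answer.tables)) or "none recorded"
    lines = [
        answer.text,
        "---",
        f"Sources: {sources}",
        f"Semantic layer: {'yes' if answer.used_semantic_layer else 'no'}",
        f"Adversarial review: {'passed' if answer.reviewed else 'not run'}",
    ]
    return "\n".join(lines)


rendered = with_provenance_footer(
    Answer("Revenue was up.", tables=["fct_orders", "dim_date"], used_semantic_layer=True)
)
```

The table names here are placeholders. The point is that the footer is generated from the same record the agent used to answer, so it cannot drift from what was actually queried.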

The lesson each chose to headline. OpenAI's is that foundations beat agent sophistication - the monorepo, the conventions, the annotations matter more than the cleverness of the loop. Anthropic's is that skills are the lever, and that unmaintained context decays fast: accuracy drifted from 95% to 65% in a single month until they required data-model changes to update the matching context in the same pull request (about 90% of data-model PRs now do).

Our take

A useful lens here is the "code as harness" idea: coding agents are productive because of the harness around them (versioned files, git history, durable context read as premises), not because of the model. By that measure Anthropic's design is the more harness-native of the two: its context is markdown in the same repo as the data models, updated in the same pull request, while OpenAI's lives in a separate index that scheduled jobs rebuild. The harness view bets on Anthropic's approach: context that lives where the work lives compounds, while a separately maintained layer drifts, which is exactly the 95%-to-65% decay Anthropic hit before it tied context updates to the same pull request.

There is nothing wrong with OpenAI's choice. Theirs is simply the heavier system, and that likely reflects harder constraints: an older, larger, more sprawling data estate that may have forced a bespoke agent and pipeline rather than folding everything into a harness they already had. Less a different philosophy than a different starting point.

The bigger point is the direction both lean toward: more of the developer experience, and now the data experience, gets pulled into the harness itself rather than bolted on beside it. The data platform may not survive as its own thing next to the agent; it becomes part of the harness.

The assumption they share

Both agents run on a warehouse. Schema, lineage, and a queryable surface come for free, and Anthropic adds a curated semantic layer on top. That warehouse foundation is what every win above is built on. It is so foundational that neither post spends much time on it - which is exactly why it is easy to miss.

That warehouse assumption is where our day job starts. We work the same problem for unstructured data (video, sensor logs, images, PDFs in S3, GCS, and Azure), where no warehouse hands you schema, lineage, or a query surface, so those foundations have to be built first. We ended up building a skill for it, and it landed on much the same shape as Anthropic's: a skill the agent loads rather than a separate agent, backed by a knowledge base of markdown docs about your datasets. The main difference is that the docs describe data derived from files, and they are generated from the data rather than hand-curated. We wrote up the files-side version earlier, and we are building it in the open at github.com/datachain-ai/datachain.
