How Leni beat Perplexity, Gemini and OpenAI on the DRACO deep research benchmark in 2026

How Leni beat Perplexity, Gemini and OpenAI on the DRACO deep research benchmark in 2026

Leni hit 71.6% on DRACO, beating Perplexity (70.5%), Gemini, and OpenAI. The entire margin lives in delivery, not retrieval. What investment teams should ask before trusting any AI research tool.

Key findings

  • Leni hit 71.6% on the DRACO benchmark, first place globally. Ahead of Perplexity (70.5%), Gemini Deep Research (59.0%), and OpenAI Deep Research (52.1% / 41.9%).
  • The entire margin lives in delivery, not retrieval. Factual accuracy is at parity (+0.2pp), but citation quality is +8.4pp and presentation quality is +11.5pp ahead of #2.
  • The architectural lesson: add a research validator that audits the finished output before the user sees it, then send specific feedback back to the search agent. Closed-loop iterative research instead of single-pass generation.

If you review AI-generated research as part of underwriting, IC prep, or due diligence, the dimensions that determine whether output is usable are diverging fastest right now.

Leni, an AI agent built for finance and real estate workflows, just beat Perplexity, Gemini, and OpenAI at deep research. And the reason it won has almost nothing to do with finding better information.

Leni hit 71.6% on the DRACO benchmark, the production-grounded deep research evaluation built by Perplexity AI and Harvard. That's first place globally, above Perplexity's own deep research agent (70.5%) and well ahead of Gemini (59.0%) and OpenAI's two deep research models (52.1% and 41.9%).

DRACO benchmark leaderboard: Leni 71.6%, Perplexity Opus 4.6 70.5%, Perplexity Opus 4.5 67.2%, Gemini Deep Research 59.0%, OpenAI Deep Research o3 52.1%, OpenAI Deep Research o4-mini 41.9%
DRACO benchmark leaderboard. Normalized score across 100 deep-research tasks, with Leni leading the field by 1.1pp over Perplexity (Opus 4.6).

DRACO isn't a soft benchmark. It's 100 complex research tasks pulled from real user queries across 10 domains: law, medicine, finance, academic, technology, UX, shopping, general knowledge, personalization, and "needle in a haystack" retrieval. Every output is evaluated on four dimensions: factual accuracy, breadth and depth of analysis, citation quality, and presentation quality.

About the DRACO benchmark

DRACO (deep research agents comprehensive evaluation) is a benchmark for evaluating deep research AI agents, developed by Perplexity AI in collaboration with Harvard. It contains 100 complex research tasks across 10 domains and grades outputs on four dimensions: factual accuracy, breadth and depth, citation quality, and presentation quality. It is the most production-grounded public evaluation of deep research agents in 2026.

So how did a non-research product beat three purpose-built research agents on the hardest public benchmark for the category? The answer reframes what "good" looks like in this category. And it has implications for any team that depends on AI-generated research as part of their workflow.

The problem isn't retrieval. It's delivery.

Every major lab has shipped a deep research agent in the last 18 months. They search the web, synthesize sources, produce multi-page reports. On the retrieval side (finding the right information and surfacing the right facts) the differences between the top systems are small.

Look at the numbers. Leni scores 66.7% on factual accuracy. Perplexity scores 66.5%. The gap is 0.2 percentage points.

On breadth and depth of analysis, it's 64.5% vs 64.0%. These are dimensions where the top systems are statistically indistinguishable. Everyone is finding roughly the same information at roughly the same depth.

Now look at the other two dimensions. Presentation quality: Leni 94.0%, Perplexity 82.5%. That's a +11.5 point gap. Citation quality: Leni 86.6%, Perplexity 78.2%. That's a +8.4 point gap.

Bar chart comparing Leni vs Perplexity Opus 4.6 across DRACO's four evaluation dimensions: Factual Accuracy (66.7% vs 66.5%), Breadth & Depth (64.5% vs 64.0%), Citation Quality (86.6% vs 78.2%), Presentation Quality (94.0% vs 82.5%)
Where the gap lives. The retrieval dimensions (factual accuracy, breadth & depth) are statistically tied. The entire DRACO lead is concentrated in the delivery dimensions.
RankModelNormalized scorePass rate
1Leni71.6%74.3%
2Perplexity (Opus 4.6)70.5%73.1%
3Perplexity (Opus 4.5)67.2%70.8%
4Gemini Deep Research59.0%64.2%
5OpenAI Deep Research (o3)52.1%56.9%
6OpenAI Deep Research (o4-mini)41.9%49.7%

The retrieval race has converged. The delivery race has barely begun. And the entire margin between #1 and #2 on DRACO lives in the delivery dimensions, not the retrieval dimensions.

If you work in investment finance, this should make you uncomfortable. You almost certainly use a deep research tool. You probably trust it. And the dimensions where these tools are diverging the most (whether a citation is actually traceable, whether the analysis is presented in a way you can defend in a committee meeting) are precisely the dimensions that determine whether output is usable.

Three failure modes most deep research agents share

Three failure modes explain why most deep research agents bleed accuracy on delivery.

The single-pass problem. Most agents follow a linear pipeline: search, synthesize, output. There's no quality gate between "I found the information" and "I delivered the research." If the draft has a citation mismatch, a missing caveat, or an analysis that doesn't address the trade-offs in the question, the user gets the broken output. There's no second pass.

The presentation afterthought. Retrieval-first architectures treat formatting as cosmetic. They're wrong. Professional research presentation is structural. The decision to lead with a conclusion versus build to one. The choice of terminology that signals domain literacy. Where citations get placed. None of that is cosmetic. It's what determines whether the work is read or ignored.

The citation integrity gap. Correctly attributing claims to sources sounds simple. In practice it's one of the hardest open problems in deep research. The agent has to track which claim came from which source, across multiple synthesis passes, often through paraphrase. Most systems hand-wave it. The output looks cited. It often isn't, not in the way that survives a fact-check.

Every production team building research tools lives in this gap. The retrieval works. The delivery doesn't. And the user is the one paying for it.

How Leni closed the gap

Leni didn't solve this by building a better search engine. It solved it with two tools that were already part of its production harness, plus a specific choice about when each tool runs.

The web search agent. Standard tool in Leni's stack. Given a complex query, it decomposes the question, searches across multiple sources, evaluates source credibility, and synthesizes findings. What makes it effective for deep research isn't that it was designed for deep research. It's that it was designed for iterative use: producing output that another tool can evaluate, critique, and send back for revision.

The delivery layer. Before validation, Leni structures the output. Headers, terminology, formatting, citations placed inline, register adapted to the domain: academic language for academic queries, financial language for financial ones. The order matters. The delivery layer runs before the validator. The validator needs to judge the complete, formatted output, not a raw draft that will be cleaned up later. What the validator sees is what the user would see.

The research validator. This is where Leni's architecture diverges from every single-pass system on the leaderboard. After the answer is synthesized and presented, an independent validator reads the finished output and grades it across the same dimensions DRACO measures. The validator isn't a spell-checker. It's a quality gate. It checks whether claims are actually supported by the cited sources, whether the analysis addresses the trade-offs the query implies, whether the structure serves the reader, whether the register matches the domain.

And critically, when something fails, the validator returns specific, actionable feedback. Not "this is bad." But "citation 4 doesn't support claim 7," or "the analysis omits the regulatory caveat the query implied." The search agent re-runs with that feedback. Leni calls this iterative research. The user only sees the version that passed the quality gate.

What the leaderboard actually tells you

At the top, the Leni vs Perplexity gap is narrow. Just 1.1 points. On any given run, variance could close it or reverse it. The more interesting story isn't the overall margin. It's where the points come from.

Every dimension that rewards retrieval is at parity. Every dimension that rewards delivery is a blowout. That's not a coincidence. It's the architecture working exactly the way it's designed to: take retrieval that's already good and refuse to ship until the delivery is too.

The domain breakdown reinforces this. Leni leads in 7 of 10 domains. Its strongest results are in law (91.4%) and academic research (85.5%). Those are domains where professional presentation and citation traceability matter most. The three domains where Perplexity leads (medicine, UX design, needle-in-a-haystack) lean more heavily on retrieval precision and less on delivery structure. Exactly what the architecture predicts.

What this means if you consume deep research for a living

If you're a PE associate, an investment analyst, or anyone who reviews research outputs as part of underwriting, market sizing, or IC prep: the DRACO results matter to you directly. The tool you use to produce a deal memo or a market sizing isn't getting graded on factual accuracy alone. It's getting graded on whether you can defend the output in a meeting.

Presentation quality is a trust signal, not a luxury. An 11.5-point advantage on this dimension isn't about pretty formatting. DRACO evaluates precise terminology, structured argument, appropriate register. Output that scores high on this dimension is output a partner reads and trusts. Output that scores low is output that gets handed back for cleanup. If your current research tool produces correct but poorly structured output, you're paying twice: once for the tool, once for the time you spend cleaning up after it.

Citation quality is the moat. Getting citations right requires traceability from source to claim, end-to-end. You can't add this with a formatting pass after the fact. The tool either tracks provenance through every synthesis step or it doesn't. When the validator catches a citation error before delivery and the search agent corrects it, you get a deliverable you can actually defend in a meeting. Every claim is traceable back to a verifiable source.

The retrieval ceiling is real. The marginal return on better search is diminishing. The marginal return on better delivery is wide open. The next generation of research tools won't compete on who finds more sources. They'll compete on who delivers usable answers.

Two questions to ask any deep research tool before you trust it. First: is there a quality gate between generation and delivery, or am I the first reviewer? If the answer is "you are the first reviewer," you are doing free QA for an unfinished product. Second: when something fails, does the tool know it failed and try again, or does it just hand you the broken output? Iterative architectures fix their own mistakes. Single-pass architectures hand them to you.

Frequently asked questions

What is the DRACO benchmark?

DRACO (deep research agents comprehensive evaluation) is a benchmark for AI deep research agents, built by Perplexity AI and Harvard. It evaluates agents on 100 complex research tasks across 10 domains and four dimensions: factual accuracy, breadth and depth of analysis, citation quality, and presentation quality. It is the most production-grounded public evaluation of deep research agents in 2026.

How does Leni score 71.6% on DRACO when it's not a deep research product?

Leni is an AI agent built for finance and real estate workflows: underwriting models, lease abstraction, IC memos, market research, deal comps. It was evaluated on DRACO using the same production harness it ships to customers. Its architecture, designed for iterative deal work, transferred directly to deep research because both tasks reward the same property: refusing to deliver a draft that hasn't passed an internal quality gate.

What's the difference between retrieval and delivery in AI research?

Retrieval is everything that happens before the answer is written: searching the web, evaluating source credibility, extracting information, reasoning about it. Delivery is everything that happens at output time: how the answer is structured, what terminology it uses, how citations are placed, whether the register matches the domain. On DRACO, the top deep research agents are statistically tied on retrieval but diverge sharply on delivery. Leni leads citation quality by +8.4 percentage points and presentation quality by +11.5 percentage points.

Why did Perplexity score lower than Leni despite being a purpose-built deep research product?

The gap is concentrated in the validation step. Perplexity's deep research is a strong retrieval-first architecture, but DRACO measures whether the delivered output meets professional standards. Leni's iterative loop (search → present → validate → re-run if needed) catches and fixes delivery issues before the user sees the final version. Perplexity ships a draft. Leni ships a draft that has passed an internal quality gate.

How can investment teams evaluate AI research tools beyond a generic demo?

Two questions cut through marketing claims. First, is there a quality gate between generation and delivery, or is the user the first reviewer? Single-pass architectures make the user responsible for catching errors. Second, when something fails, does the tool detect and re-run, or does it just hand you the broken output? Iterative architectures fix their own mistakes.

Is Leni's lead over Perplexity statistically significant?

The overall margin is 1.1 percentage points (71.6% vs 70.5%), narrow enough that variance on any given run could shift the order. But the lead isn't random. It's structural: the entire margin sits in two dimensions (presentation +11.5pp, citation +8.4pp) where the iterative validator architecture has a clear mechanism for outperforming single-pass systems.

The takeaway

Deep research AI has a delivery problem. The race to retrieve is largely won. The race to deliver has barely started.

What makes the DRACO result counterintuitive is that the winner isn't a deep research product at all. Leni is a business analyst, built for finance and real estate workflows, evaluated on the same production harness it uses for real customer work. The architecture that wins on DRACO is the same one that produces investment memos that hold up in a partner meeting.

Give your agent a validator. The quality will follow.

These benchmark results matter because Leni is built with real estate and investment teams in mind. Teams that require accurate, source-linked work across underwriting, research, memos, and reporting.

See iterative research in action on your next deal memo. Try Leni →

Feeling behind on AI?

You're not alone. Techpresso is a daily tech newsletter that tracks the latest tech trends and tools you need to know. Join 500,000+ professionals from top companies. 100% FREE.

Discover our AI Academy
AI Academy

Want to do this faster with AI?

These guides cover the AI tools and workflows that save hours on writing, studying, and research.

Or browse Toolradar for 9,000+ AI tools across 400+ categories with verified pricing.