How Leni beat Genspark and Manus on the GAIA benchmark in 2026

Leni hit 77.6% on GAIA validation, leading every difficulty tier. The 17-point uplift is architectural: planner-executor split, cross-provider routing, per-step verification.

Key findings

Leni just hit 77.6% on GAIA validation, Meta and Hugging Face's gold-standard benchmark for whether an AI agent can actually run a multi-step workflow. 128 out of 165 tasks. Above Genspark (75.4%), above Manus (73.4%), above OpenAI Deep Research (67.4%). The three most-discussed agentic systems of the past year.

And it leads on every difficulty tier, including the hardest one. On level 3 (long-horizon tasks with 10+ tool calls in sequence, where most agents fall apart) Leni hits 61.5%. Genspark hits 58.8%. Manus, 57.7%. OpenAI Deep Research, 47.6%.

GAIA validation leaderboard by difficulty tier. Leni leads all three tiers: Level 1 (≤5 steps): Leni 88.6%, Genspark 87.8%, Manus 86.5%, OpenAI Deep Research 74.3%. Level 2 (5-10 steps): Leni 75.5%, Genspark 72.7%, Manus 70.1%, OpenAI 69.1%. Level 3 (10+ steps): Leni 61.5%, Genspark 58.8%, Manus 57.7%, OpenAI 47.6%.
GAIA validation leaderboard by difficulty tier. Leni leads at every level, with the largest absolute gap on Level 3, the longest-horizon tasks where most agents fall apart.

Here's the part that should make every team building or buying agentic AI pay attention. Leni didn't get there with a custom browser stack. It didn't fine-tune for GAIA. It doesn't run on a proprietary model. It used commodity tools, a production system prompt that wasn't modified for the benchmark, and frontier models that are available to everyone.

The entire 17-point gap between a bare frontier model and Leni's score is architecture. And the most important piece of that architecture is a layer that single-provider agents (including Genspark and Manus) structurally can't replicate.

The problem isn't tool use. It's orchestration.

Frontier models in 2026 are extremely good at one-shot reasoning. Give a single tool to Opus 4.6 or GPT-5 and ask it to solve a 5-step problem, and most of the time it gets there. The wheels come off when the task requires sustained orchestration across multiple tools, multiple data types, and multiple rounds of intermediate results that have to be tracked, verified, and re-planned around.

Four failure modes every agentic system hits

Four failure modes dominate this gap, and every team building agentic systems hits them.

The planning-execution conflation. One model trying to plan a six-step trajectory while also executing each step has to context-switch on every turn. It loses sight of the overall goal. It rebuilds the plan implicitly at each step, which means the plan drifts. Step three quietly forgets what step one was looking for.

The model-task mismatch. Different sub-tasks reward different model strengths. A page-relevance classifier wants a fast cheap model with low latency. A multi-hop reasoning step over five scraped sources wants a frontier reasoner with extended thinking. A vision step wants a model with strong image grounding. Forcing one model to do all of these means accepting the worst-case profile on every step, paying frontier prices for trivial classification and getting trivial-model accuracy on hard reasoning.

The cascade problem. Errors don't stay local. A misread cell on step two becomes a wrong filter on step three becomes a wrong final answer on step six. The model has no checkpoint to ask "wait, does this intermediate result even make sense before I commit to the next step?" By the time the answer is wrong, the chain has buried the cause four turns deep.

The tool surface problem. GAIA pulls from heterogeneous data: web pages, YouTube transcripts, PDFs, Excel files, images, audio, code. Each tool has its own quirks. JavaScript-heavy pages defeat naive scrapers. Scanned PDFs need OCR. Multi-sheet workbooks hide answers on the third tab. A monolithic agent treats every tool the same. Tasks fail not because the model couldn't reason about the answer, but because the tool layer never delivered usable input.

Every demo of an AI agent doing something impressive in one or two steps lives upstream of this problem. By step five, the chain is bleeding accuracy. By step ten, it's broken.

The architecture: three layers

Leni didn't solve this by writing a better agent loop. It solved it by separating the agent into roles with different cognitive profiles and giving each role the freedom to route to whichever model is best suited to its work.

Layer 1: the planner

The first layer is a dedicated agent whose only job is to maintain the global trajectory. It doesn't run tools. It doesn't parse PDFs or browse pages. It reads the GAIA task, decomposes it into typed sub-steps (web-search, vision, table-lookup, computation, cross-reference), and emits a step graph with explicit dependencies.

The planner re-plans. After each executor returns, the planner inspects the intermediate result against the original task and decides whether the trajectory is still valid. Empty page? Insert a recovery step. Ambiguous result? Cross-check it. Bad source? Fall back to an archived version. This happens on roughly 38% of tasks: recoveries from stale URLs, drift corrections triggered by intermediate results that don't match what the planner expected.

Planning and execution have fundamentally different cognitive profiles. Planning rewards holding the full goal in working memory and reasoning about dependencies. Execution rewards focused, tool-grounded action on a narrow sub-problem. Asking one model to do both means doing both worse.

Layer 2: the executor pool

Each step in the planner's graph is handed to an executor configured per step type. A web-research step gets browser tools, a search API, and a model tuned for fast result classification. A vision step gets image-analysis tools and a model with strong visual grounding. A table-lookup step gets the same spreadsheet harness Leni uses for SpreadsheetBench, with closed-loop recalculation feedback baked in. A code step gets sandboxed Python.

Executors return structured artifacts back to the planner. Typed values, citations, confidence flags, error classes. Not raw text dumps. Typed returns make re-planning tractable. The planner can see exactly what came back and decide what to do next.

Layer 3: cross-provider model routing

This is the layer that matters most, and the one most single-provider agents can't replicate.

Leni's router doesn't lock to one model family. It picks which model handles each step from a multi-provider pool that spans both Anthropic and OpenAI. A page-classification step that just needs to decide "is this article relevant?" Runs on a Haiku-class model. A multi-hop reasoning step runs on Opus 4.6 with extended thinking. Structured output extraction runs on a dedicated parser model whose only job is producing well-formed JSON. Some steps route to OpenAI where it has measurable strengths in tool-call structure or multimodal grounding.

The choice of provider isn't ideological. It's empirical. For each step type, Leni routes to whatever model benchmarks best on that specific cognitive profile.

Genspark and Manus run on what's publicly described as primarily single-family stacks. Leni beats them on level 1 and level 2 with cross-provider routing, and the lead widens on the hardest tier. The savings on cheap steps fund deeper reasoning on the hard ones. Lower cost where cheap models work equivalently, higher accuracy where a different family is measurably stronger.

If you're locked to one provider, you're paying for it on every step where the other one would win.

The numbers

Rank Agent L1 L2 L3 Overall
1Leni88.6%75.5%61.5%77.6%
2Genspark87.8%72.7%58.8%75.4%
3Manus86.5%70.1%57.7%73.4%
4OpenAI Deep Research74.3%69.1%47.6%67.4%

128 out of 165 tasks passed. Evaluated using Leni's production harness, with no benchmark-specific configuration. The prompt includes directives for spreadsheets, documents, and real estate analytics that have nothing to do with GAIA. Which means the score isn't inflated by benchmark tuning, and there's likely headroom from a dedicated configuration.

The uplift over a bare frontier model decomposes cleanly:

Chart showing the GAIA score decomposition: bare Opus 4.6 around 60%, plus planner-executor split around 70% (+10pp), plus cross-provider routing around 74% (+4pp), plus per-step verification 77.6% (+3.6pp). Total architectural uplift: about 17 percentage points, same models, no fine-tuning.
How architecture closes the gap. Each layer adds measurable score uplift on the same underlying models. No fine-tuning, no proprietary infrastructure.

The planner-executor split does most of the heavy lifting. Routing closes the gap that single-provider agents can't reach. Verification catches what survives both, and produces a trajectory where each intermediate step is traceable and auditable rather than buried in a single opaque chain.

Six lessons from the result

The GAIA leaderboard reads like a model comparison. It isn't. Read it as an architecture comparison and six lessons fall out. Every one of which generalizes to any team building or evaluating agentic AI.

1. The orchestration ceiling is higher than most teams realize.

Seventeen percentage points from how the agent is structured. Same models. No fine-tuning. No proprietary infrastructure. Just a well-designed harness. If your team is building agentic AI and hasn't invested in the orchestration layer, this is the highest-leverage thing you can do. The model isn't the bottleneck. The wrapping is.

2. Orchestration is an architecture problem, not a model problem.

The gap from 60% to 77.6% won't be closed by waiting for a stronger model. There are tasks where the trajectory genuinely has to be owned by something other than the executor. A planner. Typed returns. A router. The question is how much infrastructure you're willing to build.

3. Cross-provider routing is the real moat.

This is the deepest lesson and it generalizes well beyond GAIA. The accepted wisdom is "use the strongest model and pay the bill." Leni found a different lever: the strongest model is the right model for some steps. A fast cheap model from a different provider is the right model for others.

Genspark and Manus run on what's publicly described as primarily single-family stacks. Leni beats them on every tier with cross-provider routing. Lower latency on easy steps funds deeper reasoning on the hard ones. If you're locked to one provider, you're paying for it on every step where the other one would win.

4. The lead is biggest where orchestration matters most.

Level 1 (≤5 steps): Leni 88.6%, Genspark 87.8%. Close. These are tasks where one good executor with one good tool gets the answer. Orchestration barely matters.

Level 2 (5–10 steps): Leni 75.5%, Genspark 72.7%. +2.8 points. This is where the planner-executor split earns its keep. Tasks at this length benefit most from re-planning between steps.

Level 3 (10+ steps): Leni 61.5%, Genspark 58.8%, Manus 57.7%, OpenAI Deep Research 47.6%. +2.7 points over Genspark, +13.9 over OpenAI Deep Research. Long-horizon trajectories with eight-plus tool calls are where purpose-built browser stacks usually claim the biggest advantage. Leni leads with a generic, commodity tool surface. Cross-provider routing and per-step verification compound on long chains in a way that single-provider, single-loop systems can't replicate.

Level 1 is where models matter most. Level 2 is where orchestration matters most. Level 3 is where everything matters at once. Leni leads all three.

5. The long-horizon ceiling is real.

61.5% on level 3 is the strongest score on the leaderboard. It's still not solved. The structural problem is multiplicative. Even with 95% per-step reliability, ten dependent steps compound to roughly 60% end-to-end. Closing this gap is a research problem in robust intermediate verification, not just an engineering problem.

6. Production prompts should be the default eval.

Leni evaluated its production harness. The same one that handles DOCX, PPTX, PDF, and spreadsheet editing for real users. Not a GAIA-specific variant. The prompt includes directives for file types that don't appear in GAIA. That's context window waste. But it also means the score isn't inflated by benchmark tuning. What you see is what users get.

Two questions to ask any agent before you trust it

The most uncomfortable finding from this result is that the model isn't the bottleneck. Bare Opus 4.6 with tool access lands in the low 60s. The same model, wrapped in a planner-executor architecture with adaptive routing, lands at 77.6%. The 17-point gap is architectural, and it's achievable with commodity tooling.

If you're evaluating AI agents for serious workflows (underwriting, market research, deal comps, automated reporting), there are two questions worth asking that almost nobody asks.

How does it orchestrate? "What model does it use?" is the question every vendor wants you to ask, because every vendor has an answer to it. The better question is "what happens between steps?" Is there a planner that owns the trajectory and re-plans when an intermediate result doesn't match the goal? Does the agent treat each sub-task with the right tool and the right model, or does it use one model for everything? Are intermediate results verified before they cascade into the next step? These are the questions that separate agents that demo well from agents that survive a long workflow.

Is it locked to one provider? Single-provider agents look fine on simple tasks. They hit a ceiling on multi-step workflows because they're paying frontier prices on every step where a cheaper model would do equivalently. And they're stuck with a worse model on every step where a different family is measurably stronger. Cross-provider routing isn't a feature checkbox. It's a structural advantage that compounds over long chains.

Frequently asked questions

What is the GAIA benchmark?

GAIA (general AI assistants) is a benchmark for evaluating multi-step AI agents, created by Meta and Hugging Face. The validation set is 165 tasks across three difficulty tiers: level 1 (53 tasks, ≤5 steps), level 2 (86 tasks, 5–10 steps), and level 3 (26 tasks, 10+ steps). Tasks combine web browsing, document parsing, image analysis, audio transcription, code execution, and multi-hop reasoning. Evaluation is exact-match. Human accuracy on GAIA is 92%. It's the benchmark every major agentic AI system uses to report progress.

How did Leni beat Genspark and Manus on GAIA?

Leni didn't use a better model or proprietary infrastructure. The entire 17-point gap over a bare frontier model is architectural: three layers stacked on top of the same models everyone else has access to. Layer 1 is a dedicated planner that owns the trajectory and re-plans on roughly 38% of tasks. Layer 2 is a pool of executors configured per step type (web, vision, table, code). Layer 3 is cross-provider model routing across Anthropic and OpenAI, the layer that single-family stacks structurally can't replicate. Leni leads every tier, with the largest absolute gap on level 3.

What is cross-provider model routing?

Cross-provider model routing is the technique of selecting, on a per-step basis, which AI model handles each sub-task in an agentic workflow. Choosing from a pool that spans multiple providers (Anthropic, OpenAI, others). A page-relevance classification might run on a fast cheap model from one provider. A multi-hop reasoning step might run on a frontier reasoner with extended thinking from a different provider. The selection is empirical: for each step type, whichever model benchmarks best on that cognitive profile gets the work. Single-provider agents like Genspark or Manus can't do this, which is why their per-step cost-accuracy tradeoff is structurally worse on long workflows.

Why don't single-provider agents perform as well?

Single-provider agents pay frontier prices on every step where a cheaper model would do equivalently. And they're stuck with a worse model on every step where a different family is measurably stronger. On a workflow with ten tool calls, that compounds in both cost and accuracy. On GAIA level 3 (10+ steps), the leaderboard gap reflects this: Leni 61.5%, Genspark 58.8%, Manus 57.7%, OpenAI Deep Research 47.6%.

What's the difference between L1, L2, and L3 tasks on GAIA?

GAIA classifies tasks by the number of steps a competent agent needs to solve them. Level 1 tasks (53 in the validation set) require ≤5 steps and are typically solved with one tool. Level 2 tasks (86 in the set) require 5–10 steps and benefit heavily from re-planning between steps. Level 3 tasks (26 in the set) require 10+ steps and stress every part of an agentic system: planning, tool selection, error recovery, intermediate verification, long-horizon coherence. Level 3 is where most agents collapse and where architectural decisions show up clearest.

Is Leni 77.6% the best score on GAIA?

Among the three most-cited agentic AI systems of the past year (Genspark at 75.4%, Manus at 73.4%, OpenAI Deep Research at 67.4%), Leni's 77.6% is the highest reported score. The GAIA leaderboard has additional submissions from research teams and individual researchers, some of which exceed 78% with custom configurations or specialized harnesses. Leni's score was achieved with a generic production harness used for real users, not a benchmark-specific build, which means there's likely headroom from a dedicated configuration.

The takeaway

LLMs are powerful operators. Given a single tool and a clear instruction, they perform impressively. They're weak orchestrators. Given five tools and a goal that requires sequencing, planning, and recovery, they fall apart in predictable ways.

The conventional response to this is "wait for a better model." Three benchmarks now point to a different lever. SpreadsheetBench: #2 globally at 91.25%. The Bullshit Benchmark v2: #1 globally at 98%. GAIA validation: #1 with 77.6%, leading every difficulty tier. Same architecture every time. Same thesis every time. Orchestration beats raw intelligence.

The model you start with is not the model your users experience. What your users experience is the orchestration. Give your agent a planner. Give it specialized executors. Give it the freedom to route across providers.

The accuracy will follow.

These benchmark results matter because Leni is built with real estate and investment teams in mind. Teams that require accurate, source-linked work across underwriting, research, memos, and reporting.

See cross-provider orchestration in action

Try it on your next agentic workflow.

Try Leni →