13 Frontier Models, One Vertical Task
A controlled benchmark across four labs.
Published May 7, 2026 · 8 min · Originally on LinkedIn
What we learned running a controlled benchmark across four labs, and the methodology change that produced a 142x acceptance delta.
A small change to how we structured prompts produced a 142x improvement in training-data acceptance, and the effect held across 13 frontier models from four labs.
The interesting part is not the number.
The interesting part is what stayed the same and what did not when we ran the same vertical task across the entire frontier field. The rankings for this task did not match those on public leaderboards. The variance was wider than I expected. The cheapest model in the field outperformed the most expensive on several dimensions. And the methodology change that produced the 142x delta was, in retrospect, embarrassingly simple.
This is a field report from work we have been doing at SLPR Labs over the past several months. The goal here is not to publish a leaderboard or to claim a winner. It is to share what a real, controlled, multi-model evaluation on a vertical enterprise-shaped task actually looks like, what it tells you that public benchmarks do not, and why it matters for anyone making serious decisions about which models to build on.
Why we ran this
We run an applied AI lab focused on intelligence systems for investors and operators in private capital and digital infrastructure. The systems we build are vertical. They reason about specific kinds of entities, sources, and decisions, not generic chat.
That gave us a problem most teams will recognize. Public benchmarks measure general capability. The models that win on MMLU, GPQA, or SWE-bench are not necessarily the models that win on the workflows we actually build. We needed to know, with our own eyes and our own data, how the frontier field performed on the work we were going to ship.
So we built a controlled harness. Same task class, same scoring rubric, same evidence pool, run across 13 frontier models from four labs, evaluated on multiple dimensions, including factual fidelity, claim-level fabrication, ownership accuracy, and recommendation discipline. The harness is what we use internally to make architecture decisions. Sharing the methodology and the patterns we observed is a safe and useful thing to publish. The exact eval set, the prompt patterns, and the model-by-model scores remain proprietary.
The methodology
Most of what made this evaluation useful was the discipline around the comparison, not the comparison itself.
Three principles shaped the harness:
Hold methodology constant across models. The same prompt structure, the same evidence delivery, the same task framing, the same output schema, the same scoring rubric, the same number of runs per model. If we wanted to see how the models actually differed, the rest of the system had to be invariant.
Measure variance, not just mean. A model that scores 4.5 on average with 0.2 standard deviation is a different production proposition than a model that scores 4.5 with 1.1 standard deviation. The second model is not deployable in any workflow that demands consistency. Public benchmarks rarely surface this. Our harness reports both, and the standard deviation column changed our conclusions more than once.
Score on multiple dimensions, weighted by what production actually needs. A vertical enterprise task is not one thing. It is a stack of small judgments. Did the model identify the right entity? Did it use the evidence faithfully? Did it avoid claim-level fabrication? Did it produce defensible recommendations or vague ones? Did it follow ownership and timing rules? A single composite score collapses all of that. We score each dimension and look at the shape, not just the height.
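For concreteness, here is a minimal sketch of what a harness built on these three principles can look like. It is illustrative Python, not our production code: `run_model`, `score_output`, the dimension names, the weights, and the run count are all hypothetical placeholders.

```python
import statistics
from dataclasses import dataclass

# Everything except the model identifier is held constant: dimensions,
# weights, run count, and (via the placeholder callables) the prompt
# structure, evidence delivery, output schema, and scoring rubric.
DIMENSIONS = ("factual_fidelity", "fabrication", "ownership", "recommendations")
WEIGHTS = {"factual_fidelity": 0.3, "fabrication": 0.4,
           "ownership": 0.15, "recommendations": 0.15}
RUNS_PER_MODEL = 10  # same for every model

@dataclass
class ModelReport:
    model: str
    mean: dict        # per-dimension mean across all runs
    stdev: dict       # per-dimension standard deviation across all runs
    composite: float  # weighted composite, reported alongside the shape

def evaluate(models, tasks, run_model, score_output):
    """run_model(model, task) -> output; score_output(output, task) -> {dim: float}.
    Both are placeholders for whatever your stack actually provides."""
    reports = []
    for model in models:
        samples = {d: [] for d in DIMENSIONS}
        for task in tasks:
            for _ in range(RUNS_PER_MODEL):
                scores = score_output(run_model(model, task), task)
                for d in DIMENSIONS:
                    samples[d].append(scores[d])
        mean = {d: statistics.mean(v) for d, v in samples.items()}
        stdev = {d: statistics.stdev(v) for d, v in samples.items()}
        composite = sum(WEIGHTS[d] * mean[d] for d in DIMENSIONS)
        reports.append(ModelReport(model, mean, stdev, composite))
    return reports
```

The point of the structure is that swapping `model` is the only degree of freedom, so any difference between reports is attributable to the model rather than to the system around it.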
The methodology change that produced the 142x delta was a structured preamble pattern we developed internally and applied uniformly across every model in the field. We are not publishing the pattern. We will say this much: it is a small intervention, transferable across labs, and the size of the effect made us look at our pre-pattern results and conclude that most enterprise prompt engineering is leaving an order of magnitude on the table.
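We can, however, say how a delta like that is measured. The acceptance delta is the ratio of acceptance rates with and without the intervention on the same task set. The sketch below uses a hypothetical `accept` gate and invented numbers; it shows the measurement, not the pattern.

```python
def acceptance_rate(outputs, accept):
    """Fraction of outputs passing the rubric's accept/reject gate.
    `accept` is a placeholder for whatever pass/fail check you apply."""
    return sum(1 for o in outputs if accept(o)) / len(outputs)

# Paired runs: same tasks, same model, with and without the preamble.
# delta = acceptance_rate(with_preamble, accept) / acceptance_rate(baseline, accept)
# e.g. 0.284 / 0.002 == 142.0  (illustrative numbers, not our data)
```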
Findings
I will share five observations from the work. Each is anonymized. Lab letters and model numbers are deliberately scrambled and do not map to alphabetical order.
Finding 1: Vertical-task variance was wider than the public-benchmark gap between consecutive models.
On our task, the spread between the top model and the field median was substantially larger than the spread you see between top frontier models on standard benchmarks. The implication is not that the public benchmarks are wrong. It is that they are weakly predictive of vertical performance. Teams making model selection decisions on public scores alone are using a coarser instrument than they realize.
Finding 2: The model that won on average was not the model that won on the hardest dimension.
Average scores rank models in one order. Per-dimension scores rank them in a different order. On the dimension we care most about for production safety (claim-level fabrication on adversarial inputs), the leading model on average was a mid-tier performer. The model that led on that dimension was not in the top three on average. This is one of those findings that sounds obvious in retrospect and changes how you architect a system the moment you take it seriously.
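A toy example of the effect, with invented scores: sort the same field two ways and the orderings disagree.

```python
# Invented per-model scores illustrating the ranking divergence.
scores = {
    "model_a": {"composite": 4.6, "fabrication": 3.9},
    "model_b": {"composite": 4.3, "fabrication": 4.7},
    "model_c": {"composite": 4.4, "fabrication": 4.1},
}
rank_by_composite = sorted(scores, key=lambda m: -scores[m]["composite"])
rank_by_safety = sorted(scores, key=lambda m: -scores[m]["fabrication"])
print(rank_by_composite)  # ['model_a', 'model_c', 'model_b']
print(rank_by_safety)     # ['model_b', 'model_c', 'model_a']
```

If the production-critical dimension is fabrication, the composite winner is the wrong model to route adversarial inputs to. Routing by dimension is the architectural consequence.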
Finding 3: Variance correlated more with lab family than with parameter count.
We expected larger models to produce more consistent outputs. The data did not support that cleanly. Standard deviation across runs clustered by lab more than by size. Models from one lab showed tight variance across the size tiers we tested. Models from another lab showed wide variance even at the top end. We have hypotheses about why (training distribution, RLHF policy, decoding defaults), but the practical implication is that "use a bigger model" is not a reliable lever for output stability.
Finding 4: A cheaper model in the field outperformed a more expensive model on several production-relevant dimensions.
When we held methodology constant, one of the lower-cost models in our field beat one of the most expensive on multiple dimensions, including a critical safety dimension. The cost-per-useful-task math on a serious enterprise workflow is not what most teams assume from list prices. Methodology investment compounds. A higher list price does not reliably buy better production performance.
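The cost math is worth writing down, because the ratio is not intuitive from list prices. If acceptance is the bottleneck, the effective cost is price per call divided by acceptance rate. The prices and rates below are illustrative, not measurements.

```python
def cost_per_useful_task(price_per_call, acceptance_rate):
    """Expected spend to obtain one accepted output, assuming
    independent retries until acceptance (a simplification)."""
    return price_per_call / acceptance_rate

# Illustrative: an expensive model with mediocre acceptance can cost
# several times more per accepted output than a cheap, consistent one.
print(cost_per_useful_task(0.40, 0.55))  # ~0.727 per accepted output
print(cost_per_useful_task(0.06, 0.48))  # 0.125 per accepted output
```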
Finding 5: Prompt-pattern sensitivity was non-uniform across labs.
The same prompt pattern produced different magnitude effects across model families. Our 142x preamble effect was the largest single intervention, but its impact varied considerably across the field. Some labs' models gained more from it than others. The implication is that there is no universal prompt pattern that lifts every frontier model equally. There are model-family-specific sensitivities, and they matter for vertical deployment.
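Measuring that sensitivity is straightforward once the harness exists: run each model with and without the pattern on identical tasks and compare per-family lift. The shape below is a sketch; `runs_by_family` is a hypothetical structure, not our data.

```python
def pattern_lift(pairs):
    """pairs: list of (score_without_pattern, score_with_pattern) from
    paired runs on identical tasks. Returns the multiplicative lift."""
    return sum(w for _, w in pairs) / sum(b for b, _ in pairs)

# lift_by_family = {family: pattern_lift(pairs)
#                   for family, pairs in runs_by_family.items()}
# A non-uniform field shows up as widely different ratios across families.
```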
What this means for builders
If you are an applied AI team building production systems, three things are worth taking seriously.
Public benchmarks are necessary but insufficient. You need your own vertical eval harness to make defensible model decisions. The cost of building one is real. The cost of not having one is making architecture decisions on data that does not match your workload.
Methodology compounds. The 142x effect was not a model upgrade. It was a prompt engineering discipline applied across the field. Frontier model improvements arrive every few months. Methodology improvements arrive every week if you are running a real harness. Over a year, the methodology side of the curve produces more lift than the model side, in our experience.
Standard deviation matters more than mean. Most production failures we have investigated did not come from bad mean performance. They came from wide variance, which makes a system unreliable in user-facing settings. Optimize for tight variance first, raise the mean second.
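In harness terms (reusing the `ModelReport` shape from the methodology sketch above), that ordering is a gate followed by a sort. The threshold is workload-specific; 0.3 here is an arbitrary illustration.

```python
def deployable(reports, dim, max_stdev=0.3):
    """Gate on run-to-run standard deviation first, then rank the
    survivors by mean. Models that fail the gate are not candidates,
    however high their average."""
    stable = [r for r in reports if r.stdev[dim] <= max_stdev]
    return sorted(stable, key=lambda r: -r.mean[dim])
```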
What this means for buyers and investors
If you are evaluating AI companies, AI infrastructure, or AI-driven enterprise software, vertical eval discipline is one of the strongest signals available.
Most operators cannot tell you which frontier models they tested, on what task, with what controls, and what they observed. The ones who can tell you in detail, including the variance and per-dimension breakdown, are operating at a different level than the ones who cannot. The harness is itself an asset. It compounds across model generations. It produces a defensible, repeatable, IP-protected basis for product decisions.
The other thing worth understanding is that vertical AI systems are not winner-take-all, unlike general-purpose AI. The model layer is converging. The methodology layer, the data layer, and the eval layer are diverging. The defensibility of enterprise AI over the next several years will live in the vertical stack: the data the operator controls, the methodology they have refined, the evals they have built, and the harness that lets them swap models as the field moves without rebuilding the system. That stack is hard to copy. It compounds. It is licensable. It is the asset.
Closing
We are publishing more from this work over the coming weeks. The next pieces will go deeper into variance patterns, the cost-per-useful-task math at production scale, and what controlled vertical evaluation looks like as a discipline rather than a one-off project.
If you are a researcher studying frontier-model variance on vertical tasks, an operator evaluating model strategy for a real production workflow, or an investor looking at how defensibility works in enterprise AI, I am open to conversations. The work is more interesting when other serious people push on it.
The model matters.
The system around the model is where the durable value lives.
About
SLPR Labs is an applied AI lab focused on intelligence systems for investors and operators in private capital and digital infrastructure.