Lab
Inside the Lab.
The Lab is where we de-risk the hard stuff — fast experiments on new models, agent patterns and evals, so the systems we ship to clients are already battle-tested. It’s the reason our pilots don’t die in a slide deck.
We’d rather break it here than break it on your customers.
Most AI fails in the gap between a convincing demo and a system people rely on every day. The Lab exists to close that gap before a project ever starts.
Every pattern we use in production — an agent, a retrieval pipeline, a voice flow — earned its place by surviving a deliberately hostile test here first. That’s what lets us be honest on the first call about what will and won’t work.
How it works
Spike
A throwaway prototype in days, not weeks. We get a rough version of the hardest part working first, so we’re arguing about something real instead of a slide.
Pressure-test
We build a task-specific eval set and try to break it — adversarial inputs, edge cases, the queries that embarrass a demo. If it can’t survive the Lab, it doesn’t leave.
Productionise
Only what passes gets hardened: monitoring, human-in-the-loop gates, fallbacks and documentation. The thing your team uses on Monday has already been through the wringer.
What’s on the bench
Model research
We benchmark and fine-tune frontier and in-house models — including our own Lux foundation model.
Agent prototyping
Rapid spikes on agent architectures, tool-use and human-in-the-loop patterns before they hit production.
Evals & safety
Task-specific eval suites and guardrails so we can prove a system works before you trust it.
Voice & realtime
Low-latency phone agents tuned for natural turn-taking and a Kiwi ear.
Retrieval
Hybrid search and citation pipelines that keep answers grounded in your sources.
On-device & on-prem
Capable models that run inside a customer’s own boundary, no data leaving the country.
Read what we’ve learned, or put us to work.
Dig into the research notes, browse the blog, or bring us a problem worth pressure-testing.
Talk to us →