Lab · Research
Notes from the work of making AI reliable.
We don’t do research for papers — we do it so the systems we ship hold up in production. These are the questions we keep returning to, and what we’ve learned answering them on real projects.
What we research
Reliability & evaluation
How do you know an agent is good enough to ship? We build task-specific eval suites so every system is measured against real work before it touches a customer — and watched after it does.
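As an illustration only, a ship gate over a task-specific eval suite might look like the sketch below. The names `EvalCase`, `pass_rate` and `ready_to_ship`, the toy agent, and the 0.95 threshold are all assumptions made for the example, not a description of a production harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One real-work example plus a check that decides if an answer is acceptable."""
    prompt: str
    check: Callable[[str], bool]


def pass_rate(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases whose check accepts the agent's answer."""
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(1 for case in cases if case.check(agent(case.prompt)))
    return passed / len(cases)


def ready_to_ship(agent: Callable[[str], str], cases: list[EvalCase],
                  threshold: float = 0.95) -> bool:
    """Hypothetical gate: ship only when the pass rate meets the threshold."""
    return pass_rate(agent, cases) >= threshold


# Toy agent and suite, invented for the example.
def toy_agent(prompt: str) -> str:
    return "Our office opens at 9am." if "open" in prompt else "I don't know."


suite = [
    EvalCase("When do you open?", lambda answer: "9am" in answer),
    EvalCase("What is your refund policy?", lambda answer: "30 days" in answer),
]

print(pass_rate(toy_agent, suite))      # 0.5
print(ready_to_ship(toy_agent, suite))  # False
```

The same suite can be re-run on sampled production traffic after launch, which is one way to read "watched after it does".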
Retrieval that grounds
Getting a model to answer from your documents, not its imagination. We research chunking, hybrid search and citation so every answer can be traced back to a source a human can check.
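One common way to combine keyword and vector results in hybrid search is reciprocal rank fusion. The sketch below assumes that technique (the page does not say which fusion method is used) and uses made-up chunk IDs of the form `file#chunk`, so each retrieved chunk doubles as a citation back to its source.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one.

    Each chunk scores 1 / (k + rank) per list it appears in; k = 60 is a
    conventional default, chosen here as an assumption.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from a keyword index and a vector index.
keyword_hits = ["policy.pdf#3", "faq.md#1", "terms.pdf#7"]
vector_hits = ["policy.pdf#3", "guide.md#2", "faq.md#1"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['policy.pdf#3', 'faq.md#1', 'guide.md#2', 'terms.pdf#7']
```

Chunks that both retrievers agree on rise to the top, and the IDs passed to the model are the same ones a reader can follow back to the document.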
Agents & tool-use
Narrow agents that take real actions safely — calling APIs, booking jobs, drafting replies — with clear limits and a human in the loop on anything that matters.
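A minimal sketch of "clear limits and a human in the loop", under stated assumptions: the tool names, the `NEEDS_APPROVAL` policy and the `approve` callback are hypothetical, and a real deployment would route approval to a person rather than a function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    name: str
    args: dict


# Hypothetical policy: these tools never run without human sign-off.
NEEDS_APPROVAL = {"book_job", "send_reply"}
# The agent may only call tools on this allowlist.
ALLOWED = {"lookup_customer"} | NEEDS_APPROVAL


def execute(call: ToolCall,
            tools: dict[str, Callable[..., str]],
            approve: Callable[[ToolCall], bool]) -> str:
    """Run a tool call only if it is allowlisted and, where required, approved."""
    if call.name not in ALLOWED or call.name not in tools:
        return f"refused: {call.name} is not an allowed tool"
    if call.name in NEEDS_APPROVAL and not approve(call):
        return f"held: {call.name} was not approved"
    return tools[call.name](**call.args)


# Toy tools, invented for the example.
tools = {
    "lookup_customer": lambda customer_id: f"customer {customer_id}: active",
    "book_job": lambda customer_id, slot: f"booked {slot} for {customer_id}",
}


def deny_all(call: ToolCall) -> bool:
    return False


def approve_all(call: ToolCall) -> bool:
    return True


booking = ToolCall("book_job", {"customer_id": "c42", "slot": "Tue 10am"})
print(execute(ToolCall("lookup_customer", {"customer_id": "c42"}), tools, deny_all))
print(execute(booking, tools, deny_all))
print(execute(booking, tools, approve_all))
print(execute(ToolCall("delete_account", {}), tools, approve_all))
```

Read-only lookups run straight through, consequential actions wait for approval, and anything off the allowlist is refused regardless of who approves it.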
Sovereign & on-prem
Running capable models inside a customer’s own boundary, so regulated industries keep their data in New Zealand. The work behind Lux 1.0 and private deployment lives here.
Voice
Latency, turn-taking and accent handling for phone agents that sound natural in a Kiwi context — and route cleanly to a person when they should.
Cost & latency
Making production AI economical: routing, caching, distillation and smaller fine-tuned models that hit the bar without the bill.
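To make routing and caching concrete, here is a small sketch. The `CachedRouter` class, the word-count difficulty rule and the stand-in models are assumptions for illustration; a real router would use a better difficulty signal and a cache with expiry.

```python
from typing import Callable

Model = Callable[[str], str]


class CachedRouter:
    """Answer from cache when possible, else send the prompt to a small or large model."""

    def __init__(self, small: Model, large: Model, is_hard: Callable[[str], bool]):
        self.small = small
        self.large = large
        self.is_hard = is_hard
        self.cache: dict[str, str] = {}
        self.calls = {"small": 0, "large": 0, "cache": 0}

    def answer(self, prompt: str) -> str:
        # Normalise case and whitespace so trivially different prompts share a key.
        key = " ".join(prompt.lower().split())
        if key in self.cache:
            self.calls["cache"] += 1
            return self.cache[key]
        tier = "large" if self.is_hard(prompt) else "small"
        self.calls[tier] += 1
        model = self.large if tier == "large" else self.small
        result = model(prompt)
        self.cache[key] = result
        return result


# Stand-in models and a crude difficulty rule, invented for the example.
router = CachedRouter(
    small=lambda p: "small: " + p,
    large=lambda p: "large: " + p,
    is_hard=lambda p: len(p.split()) > 20,
)

first = router.answer("What are your hours?")
second = router.answer("what are your   hours?")  # served from cache
print(first, router.calls)
```

Only the first request reaches a model; the repeat costs nothing, and the expensive model is reserved for prompts the rule flags as hard.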
Recent notes
- Lux 1.0: Why and how we are building our own foundation model in New Zealand, and what’s next.
- Evals before features: Why we write the test set before we write the agent, and what a good one looks like.
- Citations are a safety feature: Grounding every answer in a source is the cheapest way to keep a model honest.
- Small, tuned, owned: When a fine-tuned model on your own hardware beats a frontier API call.
Read the deeper write-ups on the blog, or see how this research ships in our work.