Lab · Research

Notes from the work of making AI reliable.

We don’t do research for papers — we do it so the systems we ship hold up in production. These are the questions we keep returning to, and what we’ve learned answering them on real projects.

What we research

Reliability & evaluation

How do you know an agent is good enough to ship? We build task-specific eval suites so every system is measured against real work before it touches a customer — and watched after it does.
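As a rough illustration of what "measured against real work" can mean, here is a minimal sketch of a ship gate built on a task-specific eval suite. Everything in it is an assumption made for the example: the names `EvalCase`, `pass_rate` and `ready_to_ship`, the 0.95 threshold, and the toy agent are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str                   # an input taken from real work
    check: Callable[[str], bool]  # True if the agent's output is acceptable


def pass_rate(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases whose output passes its own check."""
    if not cases:
        raise ValueError("an eval suite needs at least one case")
    passed = sum(1 for case in cases if case.check(agent(case.prompt)))
    return passed / len(cases)


def ready_to_ship(agent: Callable[[str], str], cases: list[EvalCase],
                  threshold: float = 0.95) -> bool:
    """Hypothetical ship gate: the threshold is an assumed value."""
    return pass_rate(agent, cases) >= threshold


# Toy stand-in for a real agent: it only upper-cases the prompt.
def toy_agent(prompt: str) -> str:
    return prompt.upper()


cases = [
    EvalCase("refund policy", lambda out: "REFUND" in out),
    EvalCase("opening hours", lambda out: "HOURS" in out),
    EvalCase("cancel booking", lambda out: "cancelled" in out),  # fails on purpose
]

print(ready_to_ship(toy_agent, cases))
```

The same suite can be rerun on a schedule against the live system, which is one way to keep watching after launch.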

Retrieval that grounds

Getting a model to answer from your documents, not its imagination. We research chunking, hybrid search and citation so every answer can be traced back to a source a human can check.
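One common way to combine keyword and vector results in a hybrid search is reciprocal rank fusion. The sketch below is illustrative only, not a description of our pipeline: the chunk IDs, the two example result lists and the constant `k = 60` are assumptions. Each ID carries its source location, so whatever is retrieved can be cited.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranked list.

    Each chunk scores 1 / (k + rank) for every list it appears in;
    ties are broken alphabetically so the output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results; each ID doubles as a citation (file plus location).
keyword_hits = ["policy.pdf#p3", "faq.md#returns", "terms.pdf#p1"]
vector_hits = ["faq.md#returns", "policy.pdf#p3", "handbook.md#s2"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Chunks that both retrievers agree on rise to the top, and every ID in the fused list points back to a source a human can open.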

Agents & tool-use

Narrow agents that take real actions safely — calling APIs, booking jobs, drafting replies — with clear limits and a human in the loop on anything that matters.
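A minimal sketch of "clear limits and a human in the loop", assuming a simple allow-list policy. The action names and the three outcomes (`execute`, `ask_human`, `reject`) are hypothetical, chosen only to show the shape of the gate.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    name: str
    args: dict = field(default_factory=dict)


# Hypothetical policy: low-risk actions run automatically, anything that
# changes the outside world waits for a person, everything else is refused.
AUTO_APPROVED = {"draft_reply", "lookup_job"}
NEEDS_HUMAN = {"book_job", "send_reply"}


def decide(action: Action) -> str:
    """Return what should happen to an action the agent proposes."""
    if action.name in AUTO_APPROVED:
        return "execute"
    if action.name in NEEDS_HUMAN:
        return "ask_human"
    return "reject"  # unknown actions fall outside the agent's limits


print(decide(Action("book_job", {"slot": "Tuesday 10:00"})))
```

Rejecting by default keeps the agent narrow: it can only ever do what has been listed explicitly.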

Sovereign & on-prem

Running capable models inside a customer’s own boundary, so regulated industries keep their data in New Zealand. The work behind Lux 1.0 and private deployment lives here.

Voice

Latency, turn-taking and accent handling for phone agents that sound natural in a Kiwi context — and route cleanly to a person when they should.

Cost & latency

Making production AI economical: routing, caching, distillation and smaller fine-tuned models that hit the bar without the bill.
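Routing and caching can be sketched in a few lines. This is a toy under stated assumptions: `call_model` is a stand-in for a real model call, the word-count rule in `pick_model` is a deliberately crude routing heuristic, and the model names `small` and `large` are hypothetical.

```python
from functools import lru_cache

calls: list[tuple[str, str]] = []  # records every (model, prompt) actually sent


def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real model call; it only records that it was invoked."""
    calls.append((model, prompt))
    return f"[{model}] answer"


def pick_model(prompt: str, max_small_words: int = 50) -> str:
    """Crude routing rule (an assumption): short prompts go to the small model."""
    return "small" if len(prompt.split()) <= max_small_words else "large"


@lru_cache(maxsize=1024)
def answer(prompt: str) -> str:
    """Route the prompt, and serve exact repeats from an in-memory cache."""
    return call_model(pick_model(prompt), prompt)


answer("What are your opening hours?")
answer("What are your opening hours?")  # served from the cache
print(len(calls))
```

A production router would normally decide on more than prompt length, and a production cache would need expiry, but the cost logic is the same: skip the call when possible, and use the cheapest model that clears the bar.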

Recent notes

Lux 1.0 → Building our own foundation model in New Zealand: why, how, and what’s next.
Evals before features → Why we write the test set before we write the agent, and what a good one looks like.
Citations are a safety feature → Grounding every answer in a source is the cheapest way to keep a model honest.
Small, tuned, owned → When a fine-tuned model on your own hardware beats a frontier API call.

Read the deeper write-ups on the blog, or see how this research ships in our work.

Got a hard problem worth researching together?

Talk to us →