Case study
open-defence-radar
A clearance-safe retrieval system over open government data, exposed as an MCP tool, a CLI, and a web console. Every claim traces to a fetched, licensed source, and a CI-gated evaluation harness holds retrieval and grounding to a floor on every commit.
- Role: Architect and engineer (AI-paired)
- Year: 2026
- Client: Personal applied-AI build
- Outcome: v1.0.0, public, CI-gated
What it does
open-defence-radar answers questions about UK defence and security using only open, published data, and it shows its working.
You ask a question. It retrieves the relevant passages from public procurement notices, tenders, and government announcements, synthesises an answer, and attaches a citation to every claim. Each citation resolves to a fetched, licensed record: the source, the date, the title, the URL. If nothing in the ingested data supports an answer, it says so rather than inventing one.
That last sentence is the whole point. A language model will answer anything, confidently. This one is only allowed to say what the retrieved sources support, and it carries the receipts.
It runs four ways from the same core: a read-only query tool inside an MCP client, which is the headline surface; a command-line tool; a web console with a trust dashboard; and an odr agent that breaks a broad question into focused sub-questions and recombines the answers into one cited brief.
The problem
Language models have quietly become the first draft of analysis. Ask one a question over a pile of documents and it answers instantly, fluently, and often wrongly. Over messy, high-stakes material like government and defence data, that last word is the whole problem.
For anything decision-shaped, a fluent answer with no source is worse than no answer. It invites a trust it has not earned, and the cost of acting on a confident hallucination is paid downstream, by whoever assumed the machine had done the reading. The bottleneck in applied AI is no longer generation. It is grounding: can you trust the output, and can you prove it?
So the product question is narrow and unglamorous. Can you build a system that answers from a real corpus, shows its working on every claim, says so plainly when the evidence is not there, and keeps proving it does all of that on every change? open-defence-radar is that system, built end to end with the messy parts left in.
One constraint shaped every decision: it had to be useful in a defence and security context and safe to build and share in the open. So, open sources only, analytic rather than operational, nothing that could resolve to a person, a site, or an operation. That reads like a limitation. It is the opposite. It forced provenance, honesty about what the data can and cannot say, and the kind of data-governance discipline any serious deployment has to do anyway.
System overview
The shape is a single grounded path from open data to a cited answer.
Open sources are fetched, normalised into one Document model, deduplicated by content hash, chunked, and embedded. Embeddings run on a local model, so that step costs nothing and needs no network. Everything lands in one SQLite file that holds three things at once: the vectors, a keyword index, and the relational metadata that carries provenance. One file, no service to stand up, and a backup is a copy.
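The normalise, deduplicate, and chunk steps can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's code: the Document field names, the whitespace normalisation before hashing, and the word-window chunker with its size and overlap values are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    # Hypothetical field names; the write-up only says one Document model exists.
    source: str
    title: str
    url: str
    published: str
    text: str

    @property
    def content_hash(self) -> str:
        # Collapse whitespace first so formatting-only differences hash the same.
        normalised = " ".join(self.text.split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def deduplicate(docs: list[Document]) -> list[Document]:
    """Keep the first document seen for each content hash."""
    seen: set[str] = set()
    unique: list[Document] = []
    for doc in docs:
        if doc.content_hash not in seen:
            seen.add(doc.content_hash)
            unique.append(doc)
    return unique


def chunk(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

Hashing the content rather than the URL means the same notice republished at a second address is stored once, which is the property the provenance chain relies on.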
Retrieval is hybrid. A semantic search and a keyword search run against the same filtered candidates, and their rankings are fused with Reciprocal Rank Fusion, which is deterministic and needs no model. Synthesis takes the top passages and is allowed to use nothing else.
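Reciprocal Rank Fusion scores each candidate by summing 1 / (k + rank) over the rankings it appears in. A minimal sketch, assuming string chunk ids and the commonly used constant k = 60; the function name, the tie-break by id, and the sample ids in the usage note are illustrative, not taken from the project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings of chunk ids into one ordering."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk ranked highly in any list earns a larger share.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by score descending, then by id, so ties break deterministically.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

Called with a semantic ranking of `["a", "b", "c"]` and a keyword ranking of `["b", "d", "a"]`, this returns `["b", "a", "d", "c"]`: the two chunks that both searches agree on rise to the top, and no model is involved.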
Every provider is a swappable seam behind a typed interface. The embedder, the generator, and the store are each chosen at the composition root from the environment, and the tests inject fakes for all three, so the suite runs with no network and no database. Generation defaults to a free-tier model; change one variable and it runs against a paid model or a local one. None of the surfaces know or care which.
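One way to draw such a seam in Python is a `typing.Protocol` plus a single factory that reads the environment. The sketch below shows the generator only; the environment variable name `ODR_GENERATOR`, the registry, and both stand-in classes are hypothetical, chosen so the example runs with no network.

```python
import os
from collections.abc import Callable, Mapping
from typing import Protocol


class Generator(Protocol):
    """The typed seam: anything with this method can generate answers."""

    def generate(self, prompt: str) -> str: ...


class CannedGenerator:
    """Test fake: returns a fixed string and touches no network."""

    def generate(self, prompt: str) -> str:
        return "stub answer [1]"


class EchoGenerator:
    """Second stand-in, to show that swapping is a one-variable change."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


# Hypothetical registry; a real one would map names to free-tier, paid,
# and local model clients.
GENERATORS: dict[str, Callable[[], Generator]] = {
    "canned": CannedGenerator,
    "echo": EchoGenerator,
}


def build_generator(env: Mapping[str, str] = os.environ) -> Generator:
    """Composition root: the only place that reads the environment."""
    name = env.get("ODR_GENERATOR", "canned")  # variable name is an assumption
    if name not in GENERATORS:
        raise ValueError(f"unknown generator: {name!r}")
    return GENERATORS[name]()


def answer(question: str, generator: Generator) -> str:
    """A surface depends on the protocol, never on a concrete provider."""
    return generator.generate(question)
```

Because `answer` accepts anything that satisfies the protocol, a test can pass a fake directly and never touch the factory at all.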
How it stays grounded
Grounding is checked twice, deliberately, at two different costs.
At synthesis time, on every single answer, the system parses the generated text and checks that each factual claim carries a citation marker that resolves to a passage that was actually retrieved. An uncited claim, or a citation to a passage that was never fetched, is flagged. This is free, it is parsing, and it runs every time.
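A parsing check of this kind might look like the sketch below. It assumes citation markers of the form `[n]` placed before the sentence's closing punctuation, and it treats every sentence as one claim, which is a crude heuristic; the marker format, the function name, and the problem labels are all assumptions made for illustration.

```python
import re

# Assumed marker format: a bracketed passage number such as [2].
CITATION = re.compile(r"\[(\d+)\]")


def check_citations(answer: str, retrieved_ids: set[int]) -> list[tuple[str, str]]:
    """Return (sentence, problem) pairs for claims that fail the check."""
    problems: list[tuple[str, str]] = []
    # Naive segmentation: split after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = {int(n) for n in CITATION.findall(sentence)}
        if not cited:
            problems.append((sentence, "uncited"))
        elif not cited <= retrieved_ids:
            # At least one marker points at a passage that was never retrieved.
            problems.append((sentence, "cites unretrieved passage"))
    return problems
```

An empty result means every sentence carried a marker that resolves to a retrieved passage. It says nothing about whether the passage supports the claim; that is the more expensive check.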
At evaluation time, on a curated question set, an LLM judge does the more expensive check: for each cited claim, does the cited passage actually entail it? The judge runs at temperature zero and caches its verdicts by claim and passage, so the same pair is never paid for twice.
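Caching verdicts by claim and passage can be sketched as a thin wrapper around whatever callable performs the judgement. Everything named here is hypothetical: the `CachedJudge` class, the SHA-256 key, and the in-memory dictionary, which a real harness would presumably persist between runs.

```python
import hashlib
from collections.abc import Callable

# A judge takes (claim, passage) and says whether the passage entails the claim.
Judge = Callable[[str, str], bool]


class CachedJudge:
    """Wrap a judge so each (claim, passage) pair is evaluated at most once."""

    def __init__(self, judge: Judge) -> None:
        self._judge = judge
        self._verdicts: dict[str, bool] = {}
        self.calls = 0  # number of times the underlying judge actually ran

    @staticmethod
    def _key(claim: str, passage: str) -> str:
        digest = hashlib.sha256()
        digest.update(claim.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
        digest.update(passage.encode("utf-8"))
        return digest.hexdigest()

    def entails(self, claim: str, passage: str) -> bool:
        key = self._key(claim, passage)
        if key not in self._verdicts:
            self.calls += 1
            self._verdicts[key] = self._judge(claim, passage)
        return self._verdicts[key]
```

Caching is only sound if the judge gives the same verdict for the same pair, which is why a temperature of zero and a cache belong together.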
The provenance chain underneath is unbroken: answer, to citation, to chunk, to document with its URL and date and content hash, to source with its licence. Every claim is traceable to a fetched, licensed record. There is no point in the chain where a claim floats free of its source.
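The lower half of that chain, chunk to document to source, maps naturally onto foreign keys. The schema below is a hypothetical reduction written for this illustration, not the project's actual tables; the point is that `NOT NULL` and `REFERENCES` make an orphaned chunk a constraint violation rather than a convention.

```python
import sqlite3
from typing import Optional

# Hypothetical three-table schema; column names are assumptions.
SCHEMA = """
CREATE TABLE source (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    licence TEXT NOT NULL
);
CREATE TABLE document (
    id           INTEGER PRIMARY KEY,
    source_id    INTEGER NOT NULL REFERENCES source(id),
    title        TEXT NOT NULL,
    url          TEXT NOT NULL,
    published    TEXT NOT NULL,
    content_hash TEXT NOT NULL UNIQUE
);
CREATE TABLE chunk (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES document(id),
    text        TEXT NOT NULL
);
"""

FIELDS = ("source", "licence", "title", "url", "published", "content_hash")


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def trace(conn: sqlite3.Connection, chunk_id: int) -> Optional[dict]:
    """Walk chunk -> document -> source and return the provenance record."""
    row = conn.execute(
        """
        SELECT s.name, s.licence, d.title, d.url, d.published, d.content_hash
        FROM chunk AS c
        JOIN document AS d ON d.id = c.document_id
        JOIN source   AS s ON s.id = d.source_id
        WHERE c.id = ?
        """,
        (chunk_id,),
    ).fetchone()
    return dict(zip(FIELDS, row)) if row is not None else None
```

With this shape, resolving a citation is one join, and inserting a chunk whose document does not exist raises `sqlite3.IntegrityError`.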
A decision the eval made
Hybrid retrieval fuses a semantic and a keyword search with a deterministic rule. The textbook next move is a cross-encoder reranker: a small model that re-scores the top candidates for relevance. It is the obvious upgrade, and I built it.
Then the rule I had set for myself applied. The reranker ships only if the evaluation harness shows it beats the deterministic baseline on the curated set. It has not earned that here. With retrieval already sitting at or above its floors, the reranker added latency and a model dependency without moving the numbers that matter. So it stays off by default, behind a flag, ready for the day the corpus grows large enough to need it.
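The rule can be written down as a small predicate over two sets of eval results. This is a sketch of the decision logic only: the metric names, the `min_gain` threshold, and the requirement that no gated metric regress are assumptions, and any numbers fed to it in an example are invented rather than measured.

```python
def meets_floors(metrics: dict[str, float], floors: dict[str, float]) -> bool:
    """True if every gated metric is at or above its floor."""
    return all(metrics.get(name, 0.0) >= floor for name, floor in floors.items())


def should_enable(
    baseline: dict[str, float],
    candidate: dict[str, float],
    floors: dict[str, float],
    min_gain: float = 0.01,  # hypothetical threshold for a meaningful gain
) -> bool:
    """Enable a feature only if it beats the baseline on the gated metrics."""
    if not meets_floors(candidate, floors):
        return False
    improved = any(candidate[m] - baseline[m] >= min_gain for m in floors)
    regressed = any(candidate[m] < baseline[m] for m in floors)
    return improved and not regressed
```

A candidate that merely matches the baseline returns `False`, which is the case described above: the extra latency and the extra dependency buy nothing, so the flag stays off.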
That is the entire thesis made literal. The eval is the arbiter, not my instinct and not the textbook. Building something and then leaving it switched off because the measurement did not justify it is not wasted work. It is the work.
There was a blunter failure too. The vector extension would not load on the stock Python interpreter on macOS, which ships without the ability to load SQLite extensions at all. The honest fix was to move the storage layer onto a build of SQLite that bundles extension loading and ships cross-platform wheels, rather than fight the interpreter. It loads everywhere now, and the diagnosis was most of the work.
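The diagnosis can be reproduced with the standard library alone. When Python's `sqlite3` module is compiled without loadable-extension support, `Connection.enable_load_extension` is simply absent, so a probe like the one below tells you which kind of interpreter you have. The function name is made up for this sketch.

```python
import sqlite3


def can_load_extensions() -> bool:
    """Report whether this interpreter's sqlite3 can load SQLite extensions."""
    # Builds without loadable-extension support omit the method entirely.
    if not hasattr(sqlite3.Connection, "enable_load_extension"):
        return False
    conn = sqlite3.connect(":memory:")
    try:
        conn.enable_load_extension(True)
        conn.enable_load_extension(False)  # switch it back off straight away
        return True
    except (AttributeError, sqlite3.Error):
        return False
    finally:
        conn.close()
```

Running this at start-up and failing with a clear message is cheaper than discovering the problem as an `AttributeError` deep inside the storage layer.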
The trade-off honest enough to put in print: one SQLite file and a deterministic baseline are the right call for a bounded, reproducible, single-machine corpus. They are the wrong call past about a hundred thousand vectors, where brute-force search stops being free and the answer is Postgres with a vector index. The storage interface is drawn so that the swap is a replacement, not a rewrite. It is the right trade for what this is, and the design says so out loud.
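To see why brute force stops being free, it helps to look at what it does: score the query against every stored vector, then keep the best few. The pure-Python sketch below is illustrative only and is not how the store itself searches; the function names and the toy vectors in the usage note are assumptions.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_top_k(
    query: list[float], vectors: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    # One similarity computation per stored vector: cost grows linearly
    # with corpus size, which is the limit described above.
    scored = ((cosine(query, vec), chunk_id) for chunk_id, vec in vectors.items())
    return [(chunk_id, score) for score, chunk_id in heapq.nlargest(k, scored)]
```

With stored vectors `a = [1, 0]`, `b = [0, 1]`, `c = [1, 1]` and the query `[1, 0]`, the top two are `a` then `c`. The result is exact, and every query touches every row; an approximate index trades a little of that exactness for sub-linear search.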
Clearance-safe by design
The guardrails are not a disclaimer at the bottom of the page. They are design decisions, made early, that the rest of the system had to respect.
Open data only: every source declares its licence in code, and the repository holds no secrets and no employer-connected data. Analytic, not operational: the system describes what is publicly known and announced, never targeting or interpretation. Region level only: procurement notices carry a delivery region, and the map resolves no finer than one of the twelve UK regions. The richer and riskier options (conflict-event geocoding, precise coordinates, and “within so many kilometres of a place”) were considered and deliberately declined, because they cut against an open and analytic posture for a clearance-aware public repository.
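Two of those guardrails are easy to express as code rather than policy. The sketch below assumes a source descriptor that refuses to exist without a licence, and an allow-list that decides which fields may reach the map; the class, function, and field names are hypothetical.

```python
from dataclasses import dataclass

# The twelve UK regions: nine in England plus Wales, Scotland, Northern Ireland.
UK_REGIONS = frozenset({
    "North East", "North West", "Yorkshire and the Humber",
    "East Midlands", "West Midlands", "East of England",
    "London", "South East", "South West",
    "Wales", "Scotland", "Northern Ireland",
})

# Allow-list: only these fields may reach the map (field names are assumptions).
MAP_FIELDS = ("title", "published", "region")


@dataclass(frozen=True)
class SourceSpec:
    """A source cannot be constructed without declaring its licence."""

    name: str
    licence: str

    def __post_init__(self) -> None:
        if not self.licence.strip():
            raise ValueError(f"source {self.name!r} must declare a licence")


def to_map_record(notice: dict) -> dict:
    """Copy allow-listed fields only, so anything finer than a region is dropped."""
    record = {field: notice.get(field) for field in MAP_FIELDS}
    if record["region"] not in UK_REGIONS:
        record["region"] = None  # unknown or finer-grained values are not mapped
    return record
```

An allow-list fails safe: a new upstream field carrying coordinates or a postcode is dropped by default, and exposing it would take a deliberate code change.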
Stating those boundaries plainly is not caution for its own sake. It is the data-governance reasoning any serious deployment of this kind has to do anyway, done in the open.
Open work
The honest list of what is not finished.
Embeddings are local-only in this version. That keeps cost at zero and the build reproducible, but it leaves quality on the table a hosted model might recover, and the eval harness is exactly the instrument to decide whether it is worth it.
The curated evaluation set is deliberately small. It proves the harness and gates regressions; it does not yet prove breadth. Growing it is the next obvious move, and the floors are written to ratchet up, never quietly down.
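A ratchet of that kind is a one-line rule: a floor may follow a measured score upward, but a bad run can never pull it down. The sketch below assumes floors and scores are stored as name-to-value mappings and leaves a small safety margin under the measured score; the margin value and the function name are illustrative.

```python
def ratchet(
    floors: dict[str, float],
    measured: dict[str, float],
    margin: float = 0.02,  # hypothetical headroom left below the measured score
) -> dict[str, float]:
    """Raise each floor towards the measured score, never lowering it."""
    updated: dict[str, float] = {}
    for name, floor in floors.items():
        proposed = round(measured.get(name, floor) - margin, 4)
        # max() is the ratchet: a worse run leaves the floor where it was.
        updated[name] = max(floor, proposed)
    return updated
```

Lowering a floor then has to be a visible, hand-made edit to the stored value, which is what "never quietly down" means in practice.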
Claim segmentation at synthesis time is a heuristic, not a real claim extractor. It catches the cheap failures honestly, and it is upgradeable.
open-defence-radar is a deliberately small answer to a large and generic problem: how do you make an AI system you can actually trust over data where being wrong is expensive? The answer it argues for is the same one I would bring to that problem at scale. Ground every claim in a source. Measure quality continuously rather than asserting it. Refuse rather than guess. Resist the complexity the numbers do not justify. The domain here is open defence data; the discipline travels to anywhere the output has to be believed.