The chatbot you remember is not the technology you're being sold

Local government has a capacity problem that worsens every year: demand climbs while budgets and headcount stay flat or shrink. The contact centre feels it first, swamped with the same high-volume, routine enquiries - a collection date, a council tax question, the status of an application, how to report a problem - that crowd out the time staff need for residents in genuine difficulty. The promise of conversational AI is to absorb that routine demand around the clock, freeing people for the work that needs a person.

It is a promise many councils have heard before and been burned by. The rigid, scripted chatbot of a few years ago - forever answering “Sorry, I didn't understand that” - did more for resident frustration than resident service. So it is worth stating plainly: the conversational AI of 2026 is not that chatbot, and it should not be evaluated as though it were. Understanding what has genuinely changed - and what must still be true for it to work in public services - is the difference between a deployment residents come to trust and another expensive disappointment.

Three generations - and why the distinction decides your evaluation

The single most useful lens for evaluating these tools is generational. Three distinct architectures are all marketed under the same words, and they are not remotely equivalent.

Generation 1 - declarative (rule-based)

Fixed decision trees and scripted flows. Predictable, but brittle: they break the moment a resident phrases something in an unexpected way. This is the chatbot everyone remembers and nobody liked, and it is still quietly sold under newer labels.

Generation 2 - retrieval-augmented (RAG)

A language model that understands intent expressed in natural language and grounds its answers in your own approved content. The leap here is both accuracy and naturalness: it answers from your documents and records, not from whatever the open model happens to have absorbed. For most public sector uses today, this is the centre of gravity.

Generation 3 - agentic

An “observe, plan, act, reflect” loop that does not merely answer a question but works to complete the resident's goal - checking a case status, updating a record, completing a transaction across connected systems. This is the frontier, and the real prize for self-service; it is also where the governance demands rise sharply.

The practical implication: know which generation you are actually being offered. Paying agentic prices for a dressed-up decision tree, or deploying agentic actions before you can prove the basics, are both avoidable mistakes - and both common.

Grounding is the whole game

If there is one design decision that determines whether a public sector assistant is safe to deploy, it is grounding. Left to answer from their general training, large language models hallucinate - producing confident, fluent, wrong answers - in the range of 15 to 27 per cent of customer-service interactions. Constrain those same models to answer only from supplied, approved source material, and that rate collapses to roughly one per cent. Retrieval-augmented generation against a maintained, current knowledge base is, by a wide margin, the most important architectural choice for accuracy.

In commercial customer service, a hallucination is a dent in a satisfaction score. In public services it is something graver: a wrong entitlement, a misstated statutory right, a deadline given incorrectly to someone who relied on it. So the assistant must answer only from the council's approved, current content; be able to cite where each answer came from; and decline - handing over to a human - when it is not confident. Grounding is not a feature to compare on a list. It is the licence to operate.
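The three rules above - answer only from approved content, cite the source, decline when not confident - can be sketched in a few lines. This is a minimal illustration, not a production design: the word-overlap score is a toy stand-in for real retrieval, and the names Passage, KNOWLEDGE_BASE, MIN_SCORE and answer, along with the sample content, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # citation shown to the resident
    text: str    # approved, current council content


# Hypothetical approved content; a real deployment would query a maintained knowledge base.
KNOWLEDGE_BASE = [
    Passage("Waste and recycling page", "Garden waste bins are collected every two weeks."),
    Passage("Council tax page", "Council tax can be paid by direct debit in ten monthly instalments."),
]

MIN_SCORE = 0.5  # assumed confidence threshold; would need tuning against real transcripts


def words(text: str) -> set:
    """Lower-cased words with basic punctuation stripped."""
    cleaned = {w.strip("?.,!").lower() for w in text.split()}
    cleaned.discard("")
    return cleaned


def overlap_score(question: str, passage: Passage) -> float:
    """Share of the question's words found in the passage (toy stand-in for retrieval)."""
    q_words = words(question)
    if not q_words:
        return 0.0
    return len(q_words & words(passage.text)) / len(q_words)


def answer(question: str) -> dict:
    """Answer only from approved content, cite it, or decline and escalate."""
    best = max(KNOWLEDGE_BASE, key=lambda p: overlap_score(question, p))
    if overlap_score(question, best) < MIN_SCORE:
        # Not confident enough: decline rather than guess, and hand over to a person.
        return {"status": "escalate", "answer": None, "source": None}
    return {"status": "answered", "answer": best.text, "source": best.source}
```

The point of the design is that the assistant has no path to an uncited answer: every response either carries a source from the approved content or becomes a handover.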

“In the public sector, a hallucinated answer isn't a bad review. It's a resident misled about something that matters.”

The metric trap: deflection is not resolution

For years, conversational AI was sold on deflection - the share of contacts kept away from a human. It is the wrong headline metric, and in the public sector a quietly dangerous one, because deflection counts the resident who gave up exactly the same as the one who got an answer. The industry's own data makes the gap stark: AI may deflect 45 per cent or more of queries while fully resolving only around 14 per cent of issues without human help.

A “deflected” but unresolved resident does not, as a dissatisfied retail customer might, quietly take their business elsewhere. They call back - angrier. They escalate to a councillor. Or, worst of all, they go without a service they were entitled to. The metrics that matter, therefore, are resolution (did the resident actually get what they needed?) and clean escalation (when they did not, were they passed to a person smoothly?). The market has rightly shifted to judging these systems on resolution rather than deflection: an assistant that completes the task beats one that merely answers - and an assistant that traps people is worse than nothing at all.
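The gap between the two metrics is easiest to see when both are computed from the same contacts. The sketch below is illustrative only: the Contact fields and function names are assumptions, and the sample of 100 contacts is made up to mirror the 45 and 14 per cent figures quoted above, not drawn from real data.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    reached_human: bool  # the conversation was handed to a person
    resolved: bool       # the resident got what they needed


def deflection_rate(contacts: list) -> float:
    """Share of contacts that never reached a human, whatever the outcome."""
    if not contacts:
        raise ValueError("no contacts to measure")
    return sum(not c.reached_human for c in contacts) / len(contacts)


def ai_resolution_rate(contacts: list) -> float:
    """Share of contacts resolved without any human help."""
    if not contacts:
        raise ValueError("no contacts to measure")
    return sum(c.resolved and not c.reached_human for c in contacts) / len(contacts)


# Made-up sample: 14 resolved by the assistant, 31 who gave up, 55 who reached a person.
sample = (
    [Contact(reached_human=False, resolved=True)] * 14
    + [Contact(reached_human=False, resolved=False)] * 31
    + [Contact(reached_human=True, resolved=True)] * 55
)

print(deflection_rate(sample))     # 0.45
print(ai_resolution_rate(sample))  # 0.14
```

In this sample the 31 residents who gave up lift the deflection figure and vanish from view; only the resolution figure exposes them.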

Escalation is a feature, not a failure

Even with excellent AI, most people still trust a human more - surveys consistently put it around 84 per cent. That is not a problem to engineer away; it is a design brief. Residents must always know when they are speaking to a machine - clear labelling is non-negotiable - and they must be able to reach a person in a single step. When they do, the handover should carry the full conversation with it, so no one has to repeat themselves to the human who picks up.

The strongest systems escalate proactively. They set a confidence threshold that, when unmet, makes the assistant decline rather than guess. And - crucial in public services - they recognise distress, vulnerability or safeguarding signals and route those conversations straight to a trained person, every time, without exception. The mark of a mature assistant is not that it never says “I can't help with that.” It is how safely and gracefully it does so.
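As a rough sketch, the escalation rules described here reduce to a small routing function that runs before any answer is sent. Everything named below is an assumption for illustration: the threshold value, the Handover shape, and above all the phrase lists, which are a crude placeholder for the trained detection a real service would need.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.75  # assumed value; set from measured accuracy in live use

# Illustrative phrases only; keyword matching is not adequate safeguarding detection.
RISK_PHRASES = ("homeless", "evicted", "unsafe", "hurt myself", "no food")
HUMAN_REQUESTS = ("speak to a person", "talk to a human", "speak to someone")


@dataclass
class Handover:
    reason: str
    transcript: list  # full conversation, so the resident never has to repeat themselves


def route(message: str, confidence: float, transcript: list) -> Optional[Handover]:
    """Return a Handover if a person should take over, or None if the assistant may answer."""
    text = message.lower()
    full = transcript + [message]
    # Safeguarding signals are checked first and always win, however confident the model is.
    if any(phrase in text for phrase in RISK_PHRASES):
        return Handover("safeguarding", full)
    # A resident asking for a person gets one in a single step.
    if any(phrase in text for phrase in HUMAN_REQUESTS):
        return Handover("resident_request", full)
    # Below the threshold, decline rather than guess.
    if confidence < CONFIDENCE_THRESHOLD:
        return Handover("low_confidence", full)
    return None
```

The ordering is the design choice that matters: the safeguarding check comes before the confidence check, so a confident model can never talk past a resident in distress.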

Why commercial CX playbooks don't transfer cleanly

It is tempting to lift a retail or telecoms chatbot playbook wholesale into a council. Resist it. Public services differ on dimensions that change the design itself. The stakes are higher - entitlements and statutory obligations, not order tracking. The users include the most vulnerable people in the community, often at moments of stress. There are duties most businesses do not carry: equality and accessibility obligations mean the assistant must work across the community's languages (the subject of our companion guide on inclusive services), in plain language, and to recognised accessibility standards. The resident cannot switch to a competitor when the experience is poor. And the data is sensitive. Every one of these factors pushes grounding, escalation, inclusion and governance far up the priority list - well above the engagement-and-deflection goals that drive most commercial deployments.

The agentic frontier: promise, and the caution it demands

The real prize for citizen self-service is Generation 3: assistants that do not just explain how to do something but actually do it - check the status of a claim, update contact details, arrange a service. This is where genuine resolution lives, and where the capacity dividend is largest. It is also where governance bites hardest, because an assistant that can act on a resident's case is an assistant that can get it wrong at scale.

The discipline here mirrors responsible AI more broadly, covered in full in our governance guide: a dedicated, least-privilege identity for the assistant; an explicit allow-list of the actions it may take; human approval for anything consequential; containment; and a complete audit trail. The sensible path is crawl, walk, run - ground the answers first, prove accuracy and escalation in live use, then extend carefully into low-risk actions before higher-stakes ones. The deployments that end in trouble are almost always the ones that let an assistant act before they had proven it could reliably answer.
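Three of those controls - the allow-list, human approval and the audit trail - can be sketched as a single gate that every action request must pass through. This is a minimal illustration under stated assumptions: the action names, the approval flags and the request_action function are hypothetical, and a real deployment would enforce least privilege in the connected systems themselves as well as in code like this.

```python
from datetime import datetime, timezone

# Explicit allow-list: anything not named here is refused outright.
# Action names and approval flags are hypothetical examples.
ALLOWED_ACTIONS = {
    "check_case_status": {"needs_approval": False},      # low risk, read-only
    "update_contact_details": {"needs_approval": True},  # consequential, needs a person
}

audit_log = []  # in practice an append-only store, not an in-memory list


def request_action(action: str, case_id: str, approved_by: str = None) -> str:
    """Gate an action through the allow-list and approval rule, recording every attempt."""
    if action not in ALLOWED_ACTIONS:
        outcome = "refused"
    elif ALLOWED_ACTIONS[action]["needs_approval"] and approved_by is None:
        outcome = "pending_approval"
    else:
        outcome = "executed"  # the real call to the connected system would go here
    # Every request is logged, including refusals, so the trail is complete.
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "case_id": case_id,
        "approved_by": approved_by,
        "outcome": outcome,
    })
    return outcome
```

Starting with only read-only actions on the allow-list, and adding approved write actions later, is one way to put crawl, walk, run into practice.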

What good looks like: the buyer's checklist

A practical test for any conversational AI or citizen self-service platform aimed at the public sector.

Does it answer only from our approved, current content - and can it cite the source of each answer?

What is its measured resolution rate, not merely its deflection or containment rate?

Can a resident reach a human in one step, with the full conversation carried across?

Does it recognise distress, vulnerability and safeguarding signals and route them straight to a person?

Is the AI clearly labelled, so residents always know they are not talking to a human?

For any action it takes, does it apply least-privilege access, human approval for consequential steps, and a full audit trail?

Can we see why it answered as it did - the observability to monitor accuracy and improve it?

Does it work across our community's languages and meet our accessibility obligations?

Does resident data stay in our environment and out of external model training - and does it layer onto our existing channels rather than force a rip-and-replace?

The prize - claimed safely

Done well, the benefits are substantial and real: round-the-clock access for residents who cannot call during office hours; routine demand genuinely resolved rather than merely queued; and contact-centre staff freed from repetitive enquiries to spend their time with residents who need judgement, empathy and care. Add the inclusion gains of an assistant that works across many languages, at any hour, and the case becomes compelling.

But hold to the reframed scorecard. Success is resolution and trust, not raw deflection. Favour platforms that layer onto your existing channels and systems and can go live quickly, over rip-and-replace programmes that stall - the market has moved decisively in that direction, and for good reason. And treat the assistant as a living service that is monitored, measured and improved, not a project that is launched and forgotten.

Build it in, don't bolt it on

The clearest sign of a conversational AI platform fit for public services is that grounding, escalation and governance are designed in. Our own approach grounds every answer in the organisation's approved content, with sources it can cite; declines and escalates to a human - carrying full context - when confidence is low or vulnerability is detected; applies least-privilege, audited, human-approved controls to any action it takes; works across the community's languages and accessibility needs; and keeps resident data within the organisation's environment, never used to train external models. Built to resolve, and to know its limits - not just to deflect.

The scripted chatbot that frustrated a generation of residents is gone. What replaces it is grounded, accountable, multilingual and - crucially - aware of when to step back and bring in a person. Judge it accordingly: not by how much demand it deflects, but by how much it genuinely resolves, and how gracefully it escalates what it should not touch. Get that right, and conversational AI becomes one of the most visible, everyday proofs that a council can adopt AI in a way residents actually trust.

If you are weighing up conversational AI for citizen self-service and want to separate the substance from the sales pitch, we would be glad to share what a grounded, well-governed deployment looks like.