The Pokemon generations bug: when I asked for more items than exist

Some bugs live in your code. This one lived in my words. Someone typed "Pokemon generations" into Popular Rank, asked the on-device model to turn it into a rankable list, and got a hard failure every single time. The model answered correctly. My validator threw the answer away anyway, because I had quietly asked the world for something it does not have: enough Pokemon generations to satisfy a fixed quota.

How generation actually works

Before the bug makes sense, the pipeline does. Popular Rank turns any free-text topic into a list of short items you can then rank by pairwise comparison. Generation runs through a three-tier service called ThreeTierGenerator, and it tries each tier in order until one returns something usable:

Tier A, TopicCatalog. A curated lookup for common topics. Instant, high confidence, no model involved.
Tier B, on-device AI. On iOS 26 and up, Apple's FoundationModels framework. It checks SystemLanguageModel.default.availability, opens a LanguageModelSession(), and calls session.respond(to:) with a prompt. Medium confidence.
Tier C, smart fallback. Pure heuristics with no model at all, so the app never returns nothing. Low confidence.

The default list size, the limit passed down through the whole chain, is fifteen. That number is the quiet protagonist of this story.
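The fallback order can be sketched in a few lines. This is a minimal illustration, not the app's actual code: `Tier`, `GeneratedList`, `generate`, and the three stand-in closures are hypothetical names, and the stand-in tiers return canned data instead of calling a catalog or a model.

```swift
// Hypothetical types; only the tier order and confidence labels come from the text.
enum Confidence { case high, medium, low }

struct GeneratedList {
    let items: [String]
    let confidence: Confidence
}

// A tier returns a usable list, or nil to pass the query down the chain.
typealias Tier = (_ query: String, _ limit: Int) -> GeneratedList?

// Try each tier in order and return the first non-empty result.
func generate(query: String, limit: Int = 15, tiers: [Tier]) -> GeneratedList? {
    for tier in tiers {
        if let result = tier(query, limit), !result.items.isEmpty {
            return result
        }
    }
    return nil
}

// Stand-in for Tier A: a tiny curated lookup.
let catalog: Tier = { query, limit in
    let known = ["great lakes": ["Superior", "Michigan", "Huron", "Erie", "Ontario"]]
    guard let items = known[query.lowercased()] else { return nil }
    return GeneratedList(items: Array(items.prefix(limit)), confidence: .high)
}

// Stand-in for Tier B: pretend the on-device model is unavailable.
let onDeviceModel: Tier = { _, _ in nil }

// Stand-in for Tier C: always returns something, with low confidence.
let heuristics: Tier = { query, limit in
    let items = (1...max(1, min(3, limit))).map { "\(query) \($0)" }
    return GeneratedList(items: items, confidence: .low)
}

let tiers = [catalog, onDeviceModel, heuristics]
```

Calling `generate(query: "Great Lakes", tiers: tiers)` stops at the catalog stand-in, while an unknown topic drops all the way to the heuristic tier.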

What actually broke

Tier B is where the failure lived. The prompt I sent asked the model to list exactly fifteen of whatever the topic was. The raw text came back, I split it into lines, and then I ran it through ListValidator, which cleans each item and decides whether the batch as a whole is good enough to show.

Cleaning is the easy part, and it does real work. It strips numbered prefixes with the regex ^\d+[\.\)\:\-\s]+ (so "1. Pikachu" becomes "Pikachu"), removes emoji, deduplicates case-insensitively, and rejects anything longer than three words or fifty characters. Those rules keep the list tidy. They were not the problem.
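Those cleaning rules could look roughly like the sketch below. The function name `cleanItems` is an assumption, and so is the emoji filter: it uses the `isEmojiPresentation` Unicode property, which is a rough heuristic and may not match how ListValidator actually detects emoji.

```swift
import Foundation

// A sketch of the cleaning pass; cleanItems is a hypothetical name.
func cleanItems(_ rawLines: [String]) -> [String] {
    var seen = Set<String>()
    var cleaned: [String] = []

    for line in rawLines {
        var item = line.trimmingCharacters(in: .whitespacesAndNewlines)

        // Strip a leading "1. ", "2) ", "3: " or "4 - " style prefix.
        item = item.replacingOccurrences(
            of: #"^\d+[\.\)\:\-\s]+"#,
            with: "",
            options: .regularExpression
        )

        // Drop scalars that render as emoji by default (assumed heuristic).
        let kept = item.unicodeScalars.filter { !$0.properties.isEmojiPresentation }
        item = String(String.UnicodeScalarView(kept))
            .trimmingCharacters(in: .whitespacesAndNewlines)

        // Reject empty items and anything over three words or fifty characters.
        guard !item.isEmpty,
              item.split(separator: " ").count <= 3,
              item.count <= 50 else { continue }

        // Deduplicate case-insensitively, keeping the first spelling seen.
        if seen.insert(item.lowercased()).inserted {
            cleaned.append(item)
        }
    }
    return cleaned
}
```

Note that nothing in this pass can tell a real item from an invented one. It only checks shape.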

The problem was the count check. There were actually two of them, and they multiplied the same bad assumption. ListValidator returned isValid only when the cleaned list had at least limit / 2 items. Then, separately, ThreeTierGenerator gated again on validatedItems.count >= limit / 2 before accepting the tier. With a limit of fifteen, both floors sat at seven.
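To make the arithmetic concrete, here is a small sketch of the old gate. The helper names `oldFloor` and `oldPipelineAccepts` are hypothetical; the only things taken from the text are the limit / 2 rule and the fact that it was applied in two places.

```swift
// Integer division rounds down, so a limit of 15 gives a floor of 7.
func oldFloor(limit: Int) -> Int {
    limit / 2
}

// Both layers applied the same floor, so a list had to clear it twice.
func oldPipelineAccepts(count: Int, limit: Int) -> Bool {
    let validatorSaysValid = count >= oldFloor(limit: limit)  // ListValidator
    let generatorAccepts = count >= oldFloor(limit: limit)    // ThreeTierGenerator
    return validatorSaysValid && generatorAccepts
}

for limit in [6, 15, 30] {
    print("limit \(limit) -> floor \(oldFloor(limit: limit))")
}
// limit 6 -> floor 3
// limit 15 -> floor 7
// limit 30 -> floor 15
```

The floor rises with the request, so the same short list that passes at a limit of 6 is rejected at 15 or 30.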

Pokemon generations broke that in the most literal way possible. At the time there were only nine of them. The world does not contain fifteen. So the model did the honest thing under pressure: it tried to honor "exactly fifteen" and produced a mix of the real nine plus invented filler, or it returned a clean nine that still had to clear a floor written for a much larger set. Tier B would fail validation, retry once with a stricter prompt, fail again, and silently fall through to Tier C, which handed back generic heuristic junk instead of the nine correct answers that had been sitting right there.
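The retry-then-fall-through behavior can be sketched as follows. Everything here is illustrative: `ModelCall` stands in for the real model session, `tierB` is a hypothetical name, and the wording of the stricter retry prompt is an assumption, since the post does not quote it.

```swift
// Hypothetical stand-in for a model call: takes a prompt, returns raw lines.
typealias ModelCall = (String) -> [String]

// Try the normal prompt, then one stricter retry. Return nil if both fail the floor.
func tierB(query: String, limit: Int, floor: Int, model: ModelCall) -> [String]? {
    let prompts = [
        "List exactly \(limit) \(query).",
        // Assumed wording for the stricter retry.
        "List exactly \(limit) \(query). One per line, no commentary.",
    ]
    for prompt in prompts {
        let items = model(prompt).filter { !$0.isEmpty }
        if items.count >= floor {
            return items
        }
    }
    return nil  // the caller falls through to Tier C
}
```

A stricter prompt cannot help when the shortfall comes from the topic itself, so the second attempt fails for the same reason as the first.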

A quota the world cannot fill

The deeper issue is a whole class of topics, not one franchise. Plenty of legitimate things to rank are finite and small: the days of the week, the visible planets, the primary colors, the Great Lakes, the Pokemon generations. Each has an exact, knowable count, and that count is usually well below a generous default.

When you ask a generative model for exactly fifteen items from a set that only has nine, you put it in a corner. It can refuse, or it can pad. Padding is the outcome I worried about most, because a model trying to satisfy "give me fifteen" may invent a tenth and eleventh generation that do not exist, and now the ranking quiz is built on fiction. An overly rigid prompt does not just fail loudly. It can fail quietly by manufacturing exactly the wrong kind of confidence, and my downstream cleaning would happily polish those fakes into clean three-word items.

The prompt was the bug. The code was only doing what the prompt told it to want.

The two-part fix

The repair touched three files and had two ideas behind it, and both ideas mattered.

First, I changed what I ask for. Instead of demanding exactly limit items, I now ask for up to limit, and I tell the model in plain words to list all of them if fewer exist. That single rewording releases the pressure to pad. A finite topic returns its true, complete set, and a large topic still gets capped at a sensible size.

Second, I changed what I accept. The limit / 2 floor scaled with the request, which is exactly backwards for small topics: the more I asked for, the higher the bar a short-but-correct list had to clear. I replaced both floors with a flat minimum of three. Three is enough for a pairwise ranking to be meaningful, and it stops punishing a topic for being naturally small. In the validator I wrote it as min(3, limit) so that even a deliberately tiny request can still pass.

// PromptBuilder.swift, before
List exactly \(limit) \(query).

// PromptBuilder.swift, after
List up to \(limit) \(query). If fewer exist, list all of them.

// ListValidator.swift
// before: let isValid = cleanedItems.count >= limit / 2
let isValid = cleanedItems.count >= min(3, limit)

// ListGenerationService.swift (Tier B acceptance)
// before: if isValid && validatedItems.count >= limit / 2 {
if isValid && validatedItems.count >= 3 {
    // accept on-device result
}

Together these turned a category of topics from reliable failures into reliable successes. Nine generations now come back as nine, ranked cleanly against each other, with no invented entries, no rejected answer, and no embarrassing detour into Tier C.
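A quick comparison shows how the two floors treat small, finite topics at the default limit of fifteen. The function names `scaledFloor` and `flatFloor` are hypothetical labels for the before and after rules, and the item counts assume the model returns each set complete.

```swift
func scaledFloor(_ limit: Int) -> Int { limit / 2 }     // before
func flatFloor(_ limit: Int) -> Int { min(3, limit) }   // after

let topics: [(name: String, count: Int)] = [
    ("primary colors", 3),
    ("Great Lakes", 5),
    ("days of the week", 7),
]

let limit = 15
for topic in topics {
    let before = topic.count >= scaledFloor(limit)
    let after = topic.count >= flatFloor(limit)
    print("\(topic.name): before=\(before) after=\(after)")
}
// primary colors: before=false after=true
// Great Lakes: before=false after=true
// days of the week: before=true after=true
```

Under the old rule a complete, correct answer for the Great Lakes is thrown away. Under the flat floor every one of these sets is accepted as is.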

Why having two floors mattered

It is worth dwelling on the fact that the same wrong assumption was encoded twice, in two layers that did not know about each other. ListValidator is a generic utility; it returns an isValid flag based on its own threshold. ThreeTierGenerator is the orchestrator; it applied a second threshold of its own before trusting the tier. Either one alone was enough to reject a short but correct list. Fixing only one would have left the bug alive behind the other.

That is a small lesson in its own right: when a magic number expresses a belief about the world, it tends to leak into more than one place. Grepping for limit / 2 and finding two independent copies was the moment the shape of the bug became obvious. The duplication was not the cause, but it was the fingerprint.
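One way to keep a number like this from leaking is to give it a single home that both layers call. The sketch below is a possible refactor, not what the fix described in this post did: `AcceptancePolicy`, `ValidationResult`, `validate`, and `acceptTier` are all hypothetical names.

```swift
// Hypothetical: one shared definition of "enough items to rank".
enum AcceptancePolicy {
    static let minimumRankableItems = 3

    static func floor(for limit: Int) -> Int {
        min(minimumRankableItems, limit)
    }

    static func accepts(count: Int, limit: Int) -> Bool {
        count >= floor(for: limit)
    }
}

struct ValidationResult {
    let items: [String]
    let isValid: Bool
}

// The validator defers to the shared policy instead of its own constant.
func validate(_ items: [String], limit: Int) -> ValidationResult {
    ValidationResult(
        items: items,
        isValid: AcceptancePolicy.accepts(count: items.count, limit: limit)
    )
}

// The orchestrator trusts the validator's flag and adds no second threshold.
func acceptTier(_ result: ValidationResult) -> Bool {
    result.isValid
}
```

With this shape, a future change to the floor happens in one place, and a grep for the constant finds exactly one definition.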

The lesson I am keeping

I spend a lot of effort on the usual code-level care: validating inputs, guarding against nil, keeping behavior reproducible. It is easy to treat the prompt as configuration rather than logic. This bug was a gentle reminder that with on-device generation the prompt is logic. The wording carries assumptions, and a hidden assumption can be every bit as much a defect as an off-by-one or a missing guard.

The tell, in hindsight, was that I had encoded a belief into a constant. "There are at least fifteen of everything worth ranking" is not true, and I wrote it down as a quota without ever saying it out loud. The fix was less about code and more about asking for what is actually there: give me what exists, and I will rank that.

Ask a generative model for up to N, not exactly N, whenever the real-world count might be smaller. Tell it explicitly to list all of them if fewer exist.
A rigid quota does not just cause refusals; it can pressure a model into padding with invented items, which your cleaning step will then dutifully make look legitimate. That is the quieter and worse failure.
Acceptance thresholds that scale with the request (limit / 2) punish small, finite topics. A flat floor treats them fairly.
When a number encodes a belief about the world, it tends to live in more than one layer. Fix every copy, not the first one you find.
With on-device AI, the prompt is part of your logic. An overly strict prompt is a bug, even when every line of code runs correctly.
