A three-tier generator so any prompt produces a usable list

Popular Rank lets you rank anything, but "anything" starts as a single line of free text someone types into a box. Turning that prompt into a clean list of items the quiz can compare two at a time was harder than I expected, mostly because of two rules I wanted to hold to: it should work offline, and nothing the user types should leave the phone. What I landed on is a generator that falls through three tiers, with a validator and a separate quality pass wrapped around all of them. The whole thing lives behind one protocol method, generateItems(query:limit:), returning a GenerationResult that also carries a confidence level and which tier produced it.

One box, many ways to fail

The promise on the screen is simple. You type "pizza toppings" or "largest moons in the solar system," and a moment later you are tapping A-vs-B picks. Behind that simplicity is a surprising amount of room for things to go wrong. A single model call can return an empty response, a numbered preamble instead of items, near-duplicates, items wrapped in quotes or prefixed with emoji, or a polite refusal. The model can also be unavailable entirely on a device that does not support on-device generation. None of those outcomes felt acceptable, because a half-broken list is worse than no list: it wastes the taps the user is about to spend.

So instead of treating list generation as one call that either works or does not, I treat it as a pipeline with graceful degradation. The tiers are ordered from most to least trusted, and the pipeline only falls to the next tier when the current one declines.

func generateItems(query: String, limit: Int) async throws -> GenerationResult {
    let trimmed = query.trimmingCharacters(in: .whitespacesAndNewlines)
    guard !trimmed.isEmpty else { throw ListGenerationError.emptyQuery }

    // Tier A: instant, high confidence
    if let catalogItems = TopicCatalog.items(for: trimmed) {
        return applyQualityPass(items: catalogItems, confidence: .high, source: .catalog)
    }
    // Tier B: on-device AI (iOS 26+)
    if #available(iOS 26, *),
       let aiResult = try await attemptOnDeviceGeneration(query: trimmed, limit: limit) {
        return aiResult
    }
    // Tier C: heuristic fallback (always returns something)
    return generateFallback(query: trimmed, limit: limit)
}
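
The function above leans on a handful of small types. The post names GenerationResult, ListGenerationError.emptyQuery, confidence: .high, and the sources .catalog and .fallback; everything else below (the enum names, the remaining cases, the protocol name) is an assumption made for illustration, so treat this as a sketch of the shape rather than the app's actual declarations.

```swift
// Hypothetical type names; only GenerationResult, .high, .catalog, .fallback
// and ListGenerationError.emptyQuery appear in the post itself.
enum Confidence { case high, medium, low }            // .medium/.low are assumed
enum GenerationSource { case catalog, onDeviceAI, fallback }  // .onDeviceAI is assumed

struct GenerationResult {
    let items: [String]
    let confidence: Confidence      // how much the UI should trust the list
    let source: GenerationSource    // which tier produced it
}

enum ListGenerationError: Error {
    case emptyQuery
}

// Assumed protocol name; the post only says "one protocol method".
protocol ListGenerating {
    func generateItems(query: String, limit: Int) async throws -> GenerationResult
}
```

Carrying the source alongside the items is what lets the UI treat a catalog hit and a fallback draft differently without re-deriving where the list came from.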

Tier A: the curated catalog

The first thing we check is whether we already know the answer. Popular Rank ships with a hand-built catalog of around 250 topics, from fast food chains to chess openings to coffee origins. When a prompt maps cleanly to one, we serve it directly with confidence: .high. This is the happiest outcome on every axis: it is instant, it never touches a model, and the quality is whatever I decided when I wrote the topic by hand.

Matching is more forgiving than a dictionary lookup. The query is normalized first: lowercased, stripped of punctuation except apostrophes, with runs of whitespace collapsed. After a direct hit fails, it tries a substring match in both directions, so "my favorite pizza toppings" still resolves to the "pizza toppings" entry. That fuzzy step is what makes the curated tier catch far more real prompts than its 250 keys would suggest. The fastest code is the code you do not run, and the most reliable list is the one a person already curated.
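
A minimal sketch of that normalize-then-match step, assuming the catalog is a dictionary keyed by normalized topic names. The function names, the sample catalog, and the "longest key first" tie-break are all assumptions for illustration, not the app's actual TopicCatalog code.

```swift
import Foundation

/// Lowercase, drop punctuation except the ASCII apostrophe, collapse whitespace.
func normalize(_ query: String) -> String {
    let kept = query.lowercased().filter { ch in
        ch.isLetter || ch.isNumber || ch.isWhitespace || ch == "'"
    }
    return kept.split(whereSeparator: { $0.isWhitespace }).joined(separator: " ")
}

/// Exact lookup first, then a substring match in both directions.
func lookup(_ query: String, in catalog: [String: [String]]) -> [String]? {
    let q = normalize(query)
    guard !q.isEmpty else { return nil }
    if let exact = catalog[q] { return exact }
    // Assumption: try longer keys first so the most specific topic wins.
    let keys = catalog.keys.sorted {
        $0.count != $1.count ? $0.count > $1.count : $0 < $1
    }
    for key in keys where q.contains(key) || key.contains(q) {
        return catalog[key]
    }
    return nil
}

// Hypothetical two-entry catalog standing in for the real one.
let sampleCatalog: [String: [String]] = [
    "pizza toppings": ["Pepperoni", "Mushroom", "Olives"],
    "chess openings": ["Sicilian Defense", "Queen's Gambit", "Ruy Lopez"],
]
```

One caveat with matching in both directions: a very short query such as "a" is a substring of almost every key, so a real implementation would probably want a minimum query length before trying the fuzzy step.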

Tier B: on-device FoundationModels

When the catalog has no match, we hand the prompt to Apple's on-device FoundationModels, available on iOS 26 and up. This is the part that makes the feature feel open-ended: you can ask for something I never anticipated and still get a sensible list back. The model runs entirely on the phone, which is what let me offer free-text generation while keeping the privacy promise intact.

Before doing anything else, I check SystemLanguageModel.default.availability. If it is .unavailable (no Apple Intelligence on this device, or the model is still downloading), I return nil and fall straight to Tier C rather than throwing. When the model is available, I open a LanguageModelSession and send a deliberately strict prompt: one item per line, just the name, 1 to 3 words, no numbers, no emoji, no numbered prefixes, all distinct and real.

The interesting part is that I do not trust the first answer. The whole attempt is a small loop of up to two tries:

for attempt in 0..<2 {
    let prompt = PromptBuilder.buildPrompt(for: query, limit: limit, isRetry: attempt > 0)
    let response = try await session.respond(to: prompt)
    let rawItems = PromptBuilder.parseResponse(response.content)   // split on newlines
    let (items, isValid, _) = ListValidator.validate(rawItems, limit: limit)
    if isValid && items.count >= 3 { return success(items) }       // good enough
}
return nil   // both tries failed validation; fall through to Tier C

The retry is not just a re-roll. On the second pass the prompt gets meaner, with extra instructions like "NO phrases like 'Red apple', just 'Red'", because the most common failure I saw was the model adding qualifiers and blowing past the word limit. I accept any result with at least 3 valid items, since plenty of honest topics (oceans, the primary colors) simply do not have twenty entries.
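
To make the "stricter on retry" idea concrete, here is a rough sketch of a prompt builder with the same two entry points the loop calls. The method names come from the loop above, but the prompt wording is an assumption (apart from the quoted "Red apple" instruction), and so are the parsing details.

```swift
import Foundation

enum PromptBuilder {
    /// Builds the base prompt; on a retry, appends the extra, blunter rules.
    static func buildPrompt(for query: String, limit: Int, isRetry: Bool) -> String {
        var lines = [
            "List up to \(limit) items for: \(query)",   // assumed wording
            "One item per line, just the name.",
            "Each item must be 1 to 3 words.",
            "No numbers, no emoji, no numbered prefixes.",
            "Every item must be distinct and real.",
        ]
        if isRetry {
            // Only added on the second attempt.
            lines.append("NO phrases like 'Red apple', just 'Red'.")
            lines.append("Do not add qualifiers or explanations.")  // assumed
        }
        return lines.joined(separator: "\n")
    }

    /// Splits the model's reply on newlines and drops blank lines.
    static func parseResponse(_ content: String) -> [String] {
        content
            .split(whereSeparator: { $0.isNewline })
            .map { $0.trimmingCharacters(in: .whitespaces) }
            .filter { !$0.isEmpty }
    }
}
```

Because the retry prompt only appends to the base prompt, the second attempt keeps every original constraint and simply restates the ones the model tends to break.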

The validator: turning model output into a contract

The validator is the piece that turns a pile of fallbacks into a guarantee. It treats every item as untrusted and either cleans it or rejects it. Concretely, each candidate is:

stripped of numbered prefixes ("1.", "2)", "3:", "4-") with a small regex, ^\d+[\.\)\:\-\s]+;
stripped of emoji at the unicode-scalar level;
length-checked: 2 to 50 characters, and no more than 3 words;
checked for stray numbers, with one deliberate exception: digits near the end of the item are allowed, so "iPhone 15" survives while "Top 10 Cars" gets rejected;
run through a tiny profanity blocklist;
de-duplicated case-insensitively against everything accepted so far.
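
The cleaning steps above can be sketched roughly as follows. This is a simplified illustration, not the app's ListValidator: the function names are hypothetical, the third tuple element (a list of rejected inputs) is an assumption, the emoji filter only covers scalars with default emoji presentation, and the number-position and profanity checks are left out to keep it short.

```swift
import Foundation

/// Returns the cleaned item, or nil if it should be rejected.
func cleanItem(_ raw: String) -> String? {
    // Strip numbered prefixes such as "1.", "2)", "3:", "4-".
    var item = raw.trimmingCharacters(in: .whitespaces)
        .replacingOccurrences(of: #"^\d+[\.\)\:\-\s]+"#, with: "",
                              options: .regularExpression)
    // Strip emoji scalar by scalar (simplified: emoji-presentation scalars,
    // plus the variation selector and zero-width joiner that glue them).
    var scalars = String.UnicodeScalarView()
    for s in item.unicodeScalars
    where !s.properties.isEmojiPresentation && s.value != 0xFE0F && s.value != 0x200D {
        scalars.append(s)
    }
    item = String(scalars).trimmingCharacters(in: .whitespaces)
    // Length check: 2 to 50 characters and at most 3 words.
    let wordCount = item.split(separator: " ").count
    guard (2...50).contains(item.count), wordCount <= 3 else { return nil }
    return item
}

/// Cleans every candidate, de-duplicates case-insensitively, and reports validity.
func validate(_ rawItems: [String], limit: Int)
    -> (items: [String], isValid: Bool, rejected: [String]) {
    var accepted: [String] = []
    var rejected: [String] = []
    var seen = Set<String>()
    for raw in rawItems {
        guard accepted.count < limit else { break }
        guard let item = cleanItem(raw), seen.insert(item.lowercased()).inserted else {
            rejected.append(raw)
            continue
        }
        accepted.append(item)
    }
    return (accepted, accepted.count >= 3, rejected)
}
```

Returning the rejected inputs alongside the accepted ones makes it easy to log why a response failed, which is useful when tuning the retry prompt.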

A list counts as valid once it has at least 3 clean items. The rule I held myself to is plain:

The user always gets either a usable list or a clear error, never a half-broken one.

Keeping validation outside the generators, rather than inside each one, means every tier is held to the same standard and no tier can quietly leak a bad list through. The number-position check was the gotcha that took the longest to get right. My first version rejected anything with a digit, which threw away perfectly good product names; the fix was to measure each digit's distance from the end of the string and only reject numbers that sit too far from the trailing edge.
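
A small sketch of that distance-from-the-end idea. The function name and the tolerance of 3 characters are assumptions; the post does not say what cutoff the app uses.

```swift
/// True when every ASCII digit sits within `maxTrailingDistance` characters
/// of the end of the item. The default of 3 is an assumed tolerance.
func hasOnlyTrailingDigits(_ item: String, maxTrailingDistance: Int = 3) -> Bool {
    let chars = Array(item)
    for (index, ch) in chars.enumerated() where ("0"..."9").contains(ch) {
        let distanceFromEnd = chars.count - 1 - index
        if distanceFromEnd > maxTrailingDistance { return false }
    }
    return true
}
```

With this cutoff, "iPhone 15" passes because its digits are 1 and 0 characters from the end, while the "1" in "Top 10 Cars" is 6 characters from the end and fails. The trade-off is that names with a leading or mid-string number, like "3D Printing", are rejected too, so the tolerance is worth tuning against real prompts.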

A quality pass for the things validation misses

There is a layer the original story glossed over. Even a list of individually valid items can be collectively mediocre: "Cat" and "Cats", or both "Dog" and "Golden Retriever" in the same list. So before any result is returned, it goes through a separate step, ListQualityService.qualityPass, which covers what the validator does not:

normalize each item (trim, drop surrounding quotes and bullets, collapse spaces);
case-insensitive exact dedup, keeping the first occurrence;
near-duplicate removal using a string-similarity score with a 0.92 threshold, so "Cat" and "Cats" collapse to one;
granularity detection, which flags pairs where a broad item contains a more specific one. These become warnings rather than removals, because sometimes that overlap is intentional.
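
Here is one way those steps could fit together. The post does not say which similarity metric the app uses; this sketch assumes Jaro-Winkler, under which "Cat" and "Cats" score about 0.94 and so clear the 0.92 threshold. The function names are hypothetical, and the granularity check is a plain word-overlap heuristic that would catch "Pizza" versus "Pepperoni Pizza" but not "Dog" versus "Golden Retriever".

```swift
import Foundation

/// Jaro similarity between two character arrays (0...1).
func jaro(_ a: [Character], _ b: [Character]) -> Double {
    if a.isEmpty && b.isEmpty { return 1 }
    if a.isEmpty || b.isEmpty { return 0 }
    let window = max(0, max(a.count, b.count) / 2 - 1)
    var aMatched = [Bool](repeating: false, count: a.count)
    var bMatched = [Bool](repeating: false, count: b.count)
    var matches = 0
    for i in a.indices {
        let lo = max(0, i - window), hi = min(b.count - 1, i + window)
        guard lo <= hi else { continue }
        for j in lo...hi where !bMatched[j] && a[i] == b[j] {
            aMatched[i] = true; bMatched[j] = true; matches += 1
            break
        }
    }
    if matches == 0 { return 0 }
    var k = 0, transpositions = 0
    for i in a.indices where aMatched[i] {
        while !bMatched[k] { k += 1 }
        if a[i] != b[k] { transpositions += 1 }
        k += 1
    }
    let m = Double(matches)
    return (m / Double(a.count) + m / Double(b.count)
            + (m - Double(transpositions) / 2) / m) / 3
}

/// Case-insensitive Jaro-Winkler: boosts pairs sharing a prefix of up to 4 characters.
func similarity(_ x: String, _ y: String) -> Double {
    let a = Array(x.lowercased()), b = Array(y.lowercased())
    let base = jaro(a, b)
    var prefix = 0
    for (ca, cb) in zip(a, b).prefix(4) {
        if ca != cb { break }
        prefix += 1
    }
    return base + Double(prefix) * 0.1 * (1 - base)
}

/// Trim whitespace, surrounding quotes and bullets; collapse inner spaces.
func normalizeItem(_ raw: String) -> String {
    let junk = CharacterSet.whitespacesAndNewlines
        .union(CharacterSet(charactersIn: "\"'“”‘’•-*"))
    return raw.trimmingCharacters(in: junk)
        .split(whereSeparator: { $0.isWhitespace }).joined(separator: " ")
}

/// Keeps the first occurrence; exact duplicates score 1.0, so one check covers both.
func qualityPass(_ items: [String], threshold: Double = 0.92)
    -> (items: [String], removed: Int) {
    var kept: [String] = []
    for item in items.map(normalizeItem) where !item.isEmpty {
        let isDuplicate = kept.contains { similarity($0, item) >= threshold }
        if !isDuplicate { kept.append(item) }
    }
    return (kept, items.count - kept.count)
}

/// Warn (do not remove) when one item's words are a strict subset of another's.
func granularityWarnings(_ items: [String]) -> [String] {
    var warnings: [String] = []
    for broad in items {
        let broadWords = Set(broad.lowercased().split(separator: " "))
        for specific in items where specific != broad {
            let specificWords = Set(specific.lowercased().split(separator: " "))
            if broadWords.isStrictSubset(of: specificWords) {
                warnings.append("\"\(broad)\" overlaps with \"\(specific)\"")
            }
        }
    }
    return warnings
}
```

A high threshold like 0.92 is deliberately conservative: "Cat" and "Car" score roughly 0.82 under this metric and are both kept, which is the behavior you want for short items that differ by one meaningful letter.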

In debug builds the result carries a GenerationDebugInfo struct recording the tier, retry count, validation errors, generation time in milliseconds, and how many duplicates the quality pass removed. That instrumentation is how I learned which prompts were actually exercising the retry path versus sailing through on the first try.
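
A struct along these lines would hold those fields. The field names and types are guesses based on the description above, and the timing helper is a hypothetical convenience rather than something the post describes.

```swift
import Foundation

// Field names are assumptions derived from the prose description.
struct GenerationDebugInfo {
    let tier: String                 // which tier produced the result
    let retryCount: Int              // 0 if the first attempt passed
    let validationErrors: [String]   // reasons items or attempts were rejected
    let generationTimeMs: Int        // wall-clock time for the whole call
    let duplicatesRemoved: Int       // removed by the quality pass
}

/// Hypothetical helper: runs `work` and reports elapsed milliseconds.
func timed<T>(_ work: () throws -> T) rethrows -> (value: T, milliseconds: Int) {
    let start = Date()
    let value = try work()
    return (value, Int(Date().timeIntervalSince(start) * 1000))
}
```

In an app, the code that fills this in would typically sit behind #if DEBUG so release builds never pay for the bookkeeping.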

Tier C: the floor that never returns empty

Not every device can run the on-device model, and I did not want to leave those people with nothing. The third tier pattern-matches the prompt against a set of built-in lists for well-understood categories: colors, planets, months, US states, programming languages, zodiac signs, and a couple dozen more. It also understands a few shapes of request via regex, like "types of X" and "list of X", which it re-routes back into those built-in lists.

What matters most about Tier C is that it can never return empty. If nothing matches, it falls back to semantic variations built from the query itself: "Classic Tacos," "Popular Tacos," "Famous Tacos," and so on. That is obviously a draft rather than a real answer, so a result from this tier is tagged source: .fallback and the UI shows a banner inviting the user to edit before they start ranking. There is also a small artificial 0.6-second delay before the fallback returns, purely so the wait feels like the phone did some thinking rather than snapping back instantly with placeholders.
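
A compact sketch of that never-empty floor. The built-in lists here are a tiny stand-in for the real ones, the function names are hypothetical, and the variation prefixes after the first three ("Modern", "Traditional") are placeholders added for illustration.

```swift
import Foundation

// Stand-in for the app's built-in category lists.
let builtInLists: [String: [String]] = [
    "colors": ["Red", "Blue", "Green", "Yellow"],
    "planets": ["Mercury", "Venus", "Earth", "Mars"],
]

/// Pulls X out of "types of X" or "list of X"; otherwise returns the query.
func extractTopic(from query: String) -> String {
    let lowered = query.lowercased().trimmingCharacters(in: .whitespacesAndNewlines)
    if let r = lowered.range(of: #"^(types|list) of "#, options: .regularExpression) {
        return String(lowered[r.upperBound...])
    }
    return lowered
}

/// Built-in list if the topic is known, templated variations otherwise.
/// Always returns at least one item, even for a non-positive limit.
func generateFallbackItems(query: String, limit: Int) -> [String] {
    let count = max(limit, 1)
    let topic = extractTopic(from: query)
    if let list = builtInLists[topic] { return Array(list.prefix(count)) }
    let prefixes = ["Classic", "Popular", "Famous", "Modern", "Traditional"]
    let subject = topic.capitalized
    return prefixes.prefix(count).map { "\($0) \(subject)" }
}

/// Hypothetical async wrapper adding the 0.6-second pause before returning.
func delayedFallback(query: String, limit: Int) async -> [String] {
    try? await Task.sleep(nanoseconds: 600_000_000)
    return generateFallbackItems(query: query, limit: limit)
}
```

Keeping the delay in a thin wrapper, separate from the list-building logic, means the logic itself stays instant and easy to unit test.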

What I took away

Treating open-ended generation as a pipeline with degradation, rather than one call that wins or loses, made the whole feature easier to reason about and to instrument.
Cheap, curated answers beat clever ones. A forgiving fuzzy match over 250 hand-written topics covers far more real prompts than the key count suggests.
On-device models let you accept free text without breaking a privacy promise, but their output is worth treating as untrusted: validate, retry with a stricter prompt, and only then accept.
Separating per-item validation from a collective quality pass mattered. The single-item rules cannot catch near-duplicates or granularity overlap; those needed their own step with a similarity threshold.
A floor tier that can never return empty kept devices without Apple Intelligence from being left out, and honestly labeling its output as a draft was kinder than pretending it was a real answer.

The feature reads as one box and a moment of waiting. Underneath it are three ways to succeed, two passes that decide whether success is real, and exactly one way to fail. I tried to make sure the one failure is an honest one.
