
When a correct answer was impossible to submit.

A word-reorder question asks you to rebuild a sentence from shuffled tiles. The rule that makes it feel right, that Submit stays disabled until every tile is placed, turned out to be the same rule that quietly made two questions impossible to answer. One was Thai, one was Hungarian, and both left people stuck on a screen with a perfectly correct sentence in front of them and no button to press.

The rule that broke

Word reorder is one variant of the CONSTRAINED_PROD item type. In the data model it carries a small ProductionSpec: a format enum (word_reorder) and an acceptList of strings that count as a correct answer. The tiles themselves are not stored separately. They are derived from the question stem, which is just the sentence written with " / " as a delimiter between words.
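As a rough illustration of that shape, here is a minimal sketch of a spec and its JSON. The property names, the JSON keys, and the sample sentence are assumptions made for the example; only the word_reorder format value, the accept list, and the " / " delimiter come from the description above.

```swift
import Foundation

// Hypothetical model: property names and JSON keys are assumptions.
struct ProductionSpec: Codable, Equatable {
    enum Format: String, Codable {
        case wordReorder = "word_reorder"
    }
    let format: Format
    let acceptList: [String]
}

// A stem is just the sentence with " / " between words; no tile list is stored.
let sampleStem = "tea / I / drink"

// Decode a spec the way a blueprint entry might carry it.
let sampleJSON = #"{"format": "word_reorder", "acceptList": ["I drink tea"]}"#
let sampleSpec = try! JSONDecoder().decode(ProductionSpec.self, from: Data(sampleJSON.utf8))
```

Nothing in this shape ties the number of stem words to the number of words in an accept-list entry, which is the gap the rest of this post is about.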

The SwiftUI view does the obvious thing on appear: split the stem on that delimiter, trim each piece, shuffle, and hand each word a stable integer id so SwiftUI's ForEach can animate tiles moving between the two areas. When you tap a tile it moves from available to placed; tapping it again sends it back. The answer the app submits is simply the placed tiles joined with a single space.

// from WordReorderView.onAppear
let words = item.prompt.stem
    .components(separatedBy: " / ")
    .map { $0.trimmingCharacters(in: .whitespaces) }
    .filter { !$0.isEmpty }
availableWords = words.shuffled().enumerated()
    .map { IdentifiedWord(id: $0.offset, word: $0.element) }

// and the submit gate, further down: Submit stays disabled while any tile is unplaced
.disabled(!availableWords.isEmpty)

That last line is the whole story. A sentence is only a sentence when every word is present, so leaving Submit disabled until availableWords is empty felt natural. Letting someone submit three of five words would only ever produce a wrong answer they did not intend. The gate was deliberate, and for the overwhelming majority of items it does exactly the right thing.

The interaction rule was right. The content it depended on was not.

The authoring trap

The bug was not in the view. It was in an unwritten assumption the view made about its data: that the number of tiles equals the number of words in the correct sentence. When that holds, placing every tile and forming the correct answer are the same act, and emptying availableWords is a perfect proxy for being done.

Two items broke the assumption. Each shipped with an extra distractor tile, a plausible-looking word that had no slot anywhere in the correct sentence. As pure test design a distractor is reasonable: it raises difficulty by giving you something to reject. But the word-reorder format was never built to hold one. The accepted answer was defined as all the tiles in some order, joined by spaces, not a subset. An unused tile had nowhere to go.

The consequence followed straight from the gate. You could place every tile that belonged in the sentence, assemble a verbatim-correct answer, and still face a dimmed Submit button, because one tile remained in availableWords and the gate required that array to be empty. There was no path to completion. The right answer existed and was sitting on the screen, and it was simply impossible to submit.
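The trap can be reproduced outside SwiftUI with a small state model. This is a sketch under assumptions: ReorderState, tap(id:), and the sample words are hypothetical stand-ins, not the view's real code. Only the tap-to-move behavior, the single-space join, and the empty-array gate are taken from the description above.

```swift
// Hypothetical stand-in for the view's state; names are assumptions.
struct IdentifiedWord: Equatable {
    let id: Int
    let word: String
}

struct ReorderState {
    var availableWords: [IdentifiedWord]
    var placedWords: [IdentifiedWord] = []

    // Tapping a tile moves it to the other area.
    mutating func tap(id: Int) {
        if let i = availableWords.firstIndex(where: { $0.id == id }) {
            let tile = availableWords.remove(at: i)
            placedWords.append(tile)
        } else if let i = placedWords.firstIndex(where: { $0.id == id }) {
            let tile = placedWords.remove(at: i)
            availableWords.append(tile)
        }
    }

    // The submitted answer: placed tiles joined with a single space.
    var answer: String { placedWords.map(\.word).joined(separator: " ") }

    // The gate: Submit is disabled while any tile is unplaced.
    var isSubmitDisabled: Bool { !availableWords.isEmpty }
}

// Three real words plus one distractor ("coffee") with no slot in the sentence.
var state = ReorderState(availableWords: [
    IdentifiedWord(id: 0, word: "I"),
    IdentifiedWord(id: 1, word: "drink"),
    IdentifiedWord(id: 2, word: "tea"),
    IdentifiedWord(id: 3, word: "coffee"),
])
state.tap(id: 0)
state.tap(id: 1)
state.tap(id: 2)
let correctButStuck = (state.answer, state.isSubmitDisabled)

// Placing the distractor opens the gate, but the sentence is no longer the intended one.
state.tap(id: 3)
let submittableButWrong = (state.answer, state.isSubmitDisabled)
```

In this sketch the only state that enables Submit is one where the distractor sits inside the sentence, so the correct sentence and an enabled button never coincide.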

How the matcher actually works

It is worth being precise about what "correct" means here, because it explains why the bug was about tiles and not about spelling. Production answers do not have to match an accept-list entry byte for byte. matchesAcceptList runs two tiers. Tier one is a normalized exact match: NFC Unicode normalization so a precomposed é equals e plus a combining accent, trailing sentence punctuation stripped across scripts (including script-specific full stops such as ۔), commas removed, Arabic tashkeel stripped, Japanese furigana hints in parentheses removed, all spaces removed, then lowercased. Tier two, only if tier one misses, is an NLEmbedding sentence-similarity check that accepts when the distance is below 0.15 (roughly similarity above 0.85).

Removing spaces in the normalizer is the subtle part: it means the single spaces the view inserts between tiles never have to line up with the spacing of the accept-list entry, which lets a no-space language like Japanese compare cleanly against a spaced accept-list entry. But none of that helps when a tile cannot be placed at all. Tiers and embeddings only get a chance to run once you can press Submit, and the distractor never let you.
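The two tiers can be sketched roughly as follows. This is not the app's matcher: normalize, the punctuation and comma sets, the parenthesis pattern, and the injected distance closure are all assumptions chosen to mirror the steps listed above. In the app, the distance would presumably come from an NLEmbedding sentence embedding; the sketch takes it as a parameter so it does not depend on the NaturalLanguage framework.

```swift
import Foundation

// Hypothetical tier-one normalizer; the character sets below are illustrative assumptions.
func normalize(_ raw: String) -> String {
    // NFC, so "e" + U+0301 and a precomposed "é" become the same scalars.
    var s = raw.precomposedStringWithCanonicalMapping

    // Strip trailing whitespace and sentence-final punctuation.
    let sentenceEnders: Set<Character> = [".", "!", "?", "。", "۔", "।"]
    while let last = s.last, last.isWhitespace || sentenceEnders.contains(last) {
        s.removeLast()
    }

    // Remove commas.
    let commas: Set<Character> = [",", "،", "、"]
    s.removeAll { commas.contains($0) }

    // Strip Arabic tashkeel (U+064B...U+0652) at the scalar level,
    // because each mark combines with its base letter into one Character.
    let tashkeel: ClosedRange<UInt32> = 0x064B...0x0652
    var scalars = String.UnicodeScalarView()
    for scalar in s.unicodeScalars where !tashkeel.contains(scalar.value) {
        scalars.append(scalar)
    }
    s = String(scalars)

    // Drop parenthesized furigana hints (ASCII or full-width parentheses).
    s = s.replacingOccurrences(of: "[(（][^)）]*[)）]", with: "",
                               options: .regularExpression)

    // Remove every space, then lowercase.
    s.removeAll { $0.isWhitespace }
    return s.lowercased()
}

// Two tiers: normalized exact match first, then an injected distance check.
func sketchMatchesAcceptList(_ answer: String,
                             acceptList: [String],
                             threshold: Double = 0.15,
                             distance: (String, String) -> Double? = { _, _ in nil }) -> Bool {
    let normalized = normalize(answer)
    if acceptList.contains(where: { normalize($0) == normalized }) { return true }
    return acceptList.contains { distance(answer, $0).map { $0 < threshold } ?? false }
}
```

Passing the distance in as a closure also makes tier two easy to switch off in tests, so a near miss cannot be rescued by the embedding when the intent is to check tier one alone.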

Fixing it in content, not code

The tempting move is to reach for the app: relax the gate, allow one leftover tile, make distractors a first-class feature of word reorder. I did not, and the reason is the architecture I had already committed to.

Test content does not live in the app binary. Each language is a JSON blueprint served from a content API and described by a manifest. The manifest carries a schemaVersion and, per language, a ContentVersion with a version string and a checksum. The app compares each language's stored checksum against the manifest and downloads only what changed, falling back to the copy bundled in the app. That pipeline exists precisely so a bad question can be corrected without an App Store review.
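The comparison step might look something like the sketch below. The Manifest and ContentVersion property names, the Int type of schemaVersion, the languagesToDownload helper, and the sample checksums are assumptions for illustration; the source only states that the manifest carries a schemaVersion and a per-language version string and checksum.

```swift
import Foundation

// Hypothetical manifest shape; field names and types are assumptions.
struct ContentVersion: Codable, Equatable {
    let version: String
    let checksum: String
}

struct Manifest: Codable {
    let schemaVersion: Int
    let languages: [String: ContentVersion]
}

// Languages whose stored checksum is missing or differs from the manifest.
func languagesToDownload(manifest: Manifest,
                         storedChecksums: [String: String]) -> [String] {
    manifest.languages
        .filter { storedChecksums[$0.key] != $0.value.checksum }
        .keys
        .sorted()
}

let sampleManifest = Manifest(schemaVersion: 1, languages: [
    "th": ContentVersion(version: "1.0.1", checksum: "b2"),  // republished item
    "hu": ContentVersion(version: "1.0.0", checksum: "c9"),  // unchanged
    "ja": ContentVersion(version: "1.0.0", checksum: "a1"),  // never downloaded
])
let stale = languagesToDownload(manifest: sampleManifest,
                                storedChecksums: ["th": "b1", "hu": "c9"])
```

Comparing checksums rather than version strings means a republished blueprint is picked up even if the version label was left unchanged.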

This was a bad item, not a bad app. I rewrote the two offending blueprints so every tile belongs in the answer, removed the distractors, and re-published. The fix reached users as a quiet checksum bump and a small download, not a multi-day release cycle.

  • App code kept its honest rule: a sentence is done when every tile is placed.
  • Content was brought back in line with the contract that rule assumes.
  • Delivery happened in minutes through the manifest, with the bundled copy as the offline fallback.

The real guard against a whole class of bugs

Fixing two items is easy. The harder question is how a homeless tile reached production at all, and how to keep it from happening across more than forty languages of generated content that no single person reads end to end.

The answer is a parameterized coverage test. There is one test case per bundled blueprint, discovered automatically by listing the Blueprints folder in the app bundle, so adding a language extends coverage for free. For every item it asserts the structure (multiple-choice items have a real correct_option_id present in their options, production items have a non-empty accept list), then it drives the canonical correct answer and a deliberately wrong sentinel through the real TestSession scoring path, and finally it runs a full perfect attempt and a full all-wrong attempt to confirm they land on the top and bottom CEFR cut scores.
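The structural half of that check could be expressed along these lines. The Item and ItemKind types and the structuralProblems helper are hypothetical; the real suite works on the app's own blueprint models and also drives the scoring path, which this sketch leaves out.

```swift
// Hypothetical item model for illustration only.
enum ItemKind {
    case multipleChoice(correctOptionId: String, optionIds: [String])
    case production(acceptList: [String])
}

struct Item {
    let itemId: String
    let kind: ItemKind
}

// Returns one message per structurally broken item; an empty result means all passed.
func structuralProblems(in items: [Item]) -> [String] {
    items.compactMap { item -> String? in
        switch item.kind {
        case let .multipleChoice(correct, options):
            return options.contains(correct)
                ? nil
                : "\(item.itemId): correct_option_id is not among its options"
        case let .production(acceptList):
            return acceptList.isEmpty
                ? "\(item.itemId): accept list is empty"
                : nil
        }
    }
}

let sampleItems = [
    Item(itemId: "mc-ok", kind: .multipleChoice(correctOptionId: "b", optionIds: ["a", "b", "c"])),
    Item(itemId: "mc-bad", kind: .multipleChoice(correctOptionId: "z", optionIds: ["a", "b", "c"])),
    Item(itemId: "prod-bad", kind: .production(acceptList: [])),
]
let problems = structuralProblems(in: sampleItems)
```

Note that an item with a stray distractor passes both of these checks, since its accept list is non-empty. That is why structure alone was not enough.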

The word-reorder-specific check is the one that would have caught this bug. It does not trust the accept list in isolation; it proves a player could actually build an accepted answer from the tiles the UI will show. It brute-forces permutations of the tiles, joining each with a space exactly as the view does, and looks for one the matcher accepts.

// brute-force the tiles the UI would show, joined the way the UI joins them
let solved = WordReorderSolver.firstAcceptedPermutation(
    tiles: tiles,
    isAccepted: { session.isAnswerCorrect(item: item, answer: $0) }
)
#expect(solved != nil,
    "\(languageId)/\(item.itemId): NO arrangement of tiles \(tiles) is \
     accepted: item is unanswerable in the app")

An item with a stray distractor fails this loudly: no arrangement of the available tiles can ever equal the intended sentence under the placement rule, so firstAcceptedPermutation returns nil. To keep the brute force honest there is a guard rail of its own: tiles are capped at ten, because 10! is already about 3.6 million permutations and anything larger is a content smell, not a real sentence. The same suite has an Android twin, so the parity promise between the two apps is enforced rather than hoped for.
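For readers who want to see the search itself, here is a minimal sketch of a solver with the same call shape. The body, the return type, and the decision to return nil when the tile count exceeds the cap are assumptions; only the name, the space join, and the ten-tile cap come from the description above.

```swift
// A sketch, not the suite's real implementation.
enum WordReorderSolver {
    static let maxTiles = 10  // 10! = 3,628,800 orderings at most

    // Returns the first ordering whose space-joined sentence is accepted, or nil.
    static func firstAcceptedPermutation(tiles: [String],
                                         isAccepted: (String) -> Bool) -> [String]? {
        // Assumption: an empty or over-cap tile set is reported as unsolvable.
        guard !tiles.isEmpty, tiles.count <= maxTiles else { return nil }

        var current: [String] = []
        var used = [Bool](repeating: false, count: tiles.count)

        // Depth-first search over orderings that use every tile exactly once,
        // mirroring the rule that Submit only opens when all tiles are placed.
        func search() -> [String]? {
            if current.count == tiles.count {
                return isAccepted(current.joined(separator: " ")) ? current : nil
            }
            for i in tiles.indices where !used[i] {
                used[i] = true
                current.append(tiles[i])
                if let hit = search() { return hit }
                current.removeLast()
                used[i] = false
            }
            return nil
        }
        return search()
    }
}

let solvable = WordReorderSolver.firstAcceptedPermutation(
    tiles: ["tea", "I", "drink"],
    isAccepted: { $0 == "I drink tea" }
)
let withDistractor = WordReorderSolver.firstAcceptedPermutation(
    tiles: ["tea", "I", "drink", "coffee"],
    isAccepted: { $0 == "I drink tea" }
)
```

The search only considers full-length orderings on purpose. Trying subsets would find the correct sentence inside the distractor item and hide exactly the bug this check exists to catch.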

Takeaways

  • An interaction rule and the data it consumes form a contract. When the data violates it, the symptom appears in the UI but the bug lives in the content.
  • If your architecture lets you fix data without shipping code, lean on it. A checksummed content manifest turned a multi-day release into a few-minute publish.
  • The durable fix was not the rewrite of two items. It was a test that runs every blueprint's tiles through the exact matcher the app uses and refuses to let an unanswerable question exist. Validate the property you actually care about, that a person can finish the question, not a proxy for it.