Adding word reorder and minimal pairs to the test

For a long time, a placement test in the app meant one thing: read the stem, tap one of four choices, move on. Multiple choice is easy to author, easy to grade, and easy to trust. It is also a fairly narrow lens on what someone actually knows. The v2 content schema widened that lens a little, nudging the test past recognition into constrained production. Each new type sounded simple on a whiteboard. Each one ended up earning its own SwiftUI view, its own path through the scoring engine, and at least one lesson I did not see coming.

The shape of a question

Everything starts with the type model. A bundled blueprint decodes into a list of TestItem values, and each one carries a QuestionType. There are four: MC_CLOZE, READING_COMP, LISTENING, and the new one, CONSTRAINED_PROD. The first three are all multiple choice underneath; they differ in how the stem is dressed up, not in how they are graded. Constrained production is the interesting one, because it does not ship a list of options at all. Instead it carries a ProductionSpec with a format and an acceptList, and the format decides which of three sub-experiences the user gets.

enum ProductionFormat: String, Codable {
    case freeResponseCloze = "free_response_cloze"
    case wordReorder       = "word_reorder"
    case minimalPair       = "minimal_pair"
}

At runtime a single ConstrainedProdQuestionView does nothing but switch on that format and hand off to the right view: a text field for cloze, a tile tray for reorder, an A/B chooser for minimal pairs. Keeping that fan-out in one place is what let me add types without touching the runner that walks the user through the test.
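As a rough sketch of that shape, the snippet below decodes a spec and switches on its format. The ProductionSpec fields mirror the names used above, but the JSON keys, the sample payload, and the subExperience(for:) helper are assumptions for illustration; the real view hands off to SwiftUI views rather than returning strings.

```swift
import Foundation

enum ProductionFormat: String, Codable {
    case freeResponseCloze = "free_response_cloze"
    case wordReorder       = "word_reorder"
    case minimalPair       = "minimal_pair"
}

// Hypothetical shape: only `format` and `acceptList` are named in the post,
// and the JSON key spelling here is an assumption.
struct ProductionSpec: Codable {
    let format: ProductionFormat
    let acceptList: [String]
}

// Stand-in for the view fan-out: one exhaustive switch, no default case.
func subExperience(for spec: ProductionSpec) -> String {
    switch spec.format {
    case .freeResponseCloze: return "text field"
    case .wordReorder:       return "tile tray"
    case .minimalPair:       return "A/B chooser"
    }
}

// Made-up sample payload, not taken from a real blueprint.
let json = Data(#"{"format": "word_reorder", "acceptList": ["I like tea"]}"#.utf8)
let spec = try! JSONDecoder().decode(ProductionSpec.self, from: json)
print(subExperience(for: spec))
```

Because the switch has no default case, adding a fourth format to the enum becomes a compile error at this one site until it is handled, which is what keeps the runner out of the change.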

Why more than multiple choice

Multiple choice measures whether you can pick the right answer out of a lineup. That is a real skill, but it leans on recognition, and a confident guesser can climb a CEFR ladder a bit faster than their grammar really supports. Constrained-production items help close that gap. They ask you to build something rather than spot it, while staying bounded enough that a deterministic engine can grade them without a human in the loop.

Free-response cloze asks the user to type the missing word instead of selecting it.
Minimal pairs test whether you can hear or read the difference between two forms that differ by a single sound or character.
Word reorder hands you a scrambled sentence and asks you to put it back together.

The trade is real: production items are harder to grade fairly and harder to author safely. But they tell us more per question, and on a placement test signal density matters.

How word reorder works

Word reorder turned out to be the most mechanically interesting of the new types. The blueprint encodes the words in the stem joined by " / ", a delimiter that does not naturally occur inside a sentence. When the view appears, it splits the stem on that delimiter, trims each piece, drops any empty pieces, and wraps each word in an IdentifiedWord so SwiftUI's ForEach has a stable identity even after a shuffle. The tiles then flow into a custom FlowLayout (a small Layout conformance) that wraps to the next line when a row runs out of width, which matters a lot once you hit long German compounds or scripts that render wide.

let words = item.prompt.stem
    .components(separatedBy: " / ")
    .map { $0.trimmingCharacters(in: .whitespaces) }
    .filter { !$0.isEmpty }
availableWords = words.shuffled()
    .enumerated()
    .map { IdentifiedWord(id: $0.offset, word: $0.element) }

// On submit, the answer is just the placed tiles joined by spaces:
let sentence = placedWords.map(\.word).joined(separator: " ")

Tapping a tile moves it between the available tray and the sentence area, animated with a short easeInOut. The interaction rule I landed on is that Submit stays disabled until every tile is placed, expressed simply as .disabled(availableWords.count > 0). For a sentence-building task that felt like the right call. A half-built sentence is not really an answer, and letting someone submit one would only add noise to the scoring.
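The tap-and-submit rule can be modeled without any UI. This is a sketch under assumptions: ReorderState, tap(_:), and canSubmit are hypothetical names, and the real view keeps this state in SwiftUI properties and animates the moves.

```swift
import Foundation

struct IdentifiedWord: Identifiable, Equatable {
    let id: Int
    let word: String
}

// Hypothetical, UI-free model of the tile tray and the sentence area.
struct ReorderState {
    private(set) var availableWords: [IdentifiedWord]
    private(set) var placedWords: [IdentifiedWord] = []

    init(stem: String) {
        let words = stem
            .components(separatedBy: " / ")
            .map { $0.trimmingCharacters(in: .whitespaces) }
            .filter { !$0.isEmpty }
        availableWords = words.shuffled()
            .enumerated()
            .map { IdentifiedWord(id: $0.offset, word: $0.element) }
    }

    // Tapping a tile moves it to whichever area it is not currently in.
    mutating func tap(_ tile: IdentifiedWord) {
        if let index = availableWords.firstIndex(of: tile) {
            let moved = availableWords.remove(at: index)
            placedWords.append(moved)
        } else if let index = placedWords.firstIndex(of: tile) {
            let moved = placedWords.remove(at: index)
            availableWords.append(moved)
        }
    }

    // Mirrors .disabled(availableWords.count > 0): submit only when the tray is empty.
    var canSubmit: Bool { availableWords.isEmpty }

    // The submitted answer is the placed tiles joined by single spaces.
    var sentence: String { placedWords.map(\.word).joined(separator: " ") }
}
```

Keeping the rule in a plain value type like this also makes the "every tile must be placed" assumption easy to see and to test in isolation.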

The authoring trap hiding in a good rule

That rule is good for the user. It also quietly hands all the responsibility to whoever authors the content. Because Submit requires every tile to be placed, the item is only answerable if every tile genuinely belongs in the correct sentence. The moment a blueprint ships with one extra tile that has no home in the answer, a perfectly capable user can place every other word and still be locked out of submitting, because there is always one tile left over.

A UI constraint that is right for the user can be wrong for the content, and the gap only shows up when generated data meets a rule that assumes the data is clean.

This is the kind of bug that does not live in any single layer. The view is fine. The scoring engine is fine. The content is the variable, and across dozens of languages of generated items, the content will eventually drift. I flagged the trap when I built the type, and sure enough it surfaced later for a couple of items in different languages, the ones the test suite now names as the th_022 and hu_022 class of bug. The part I am glad about is that I knew exactly where to look.

One engine, many views

Adding question types is not just a UI exercise. Every new view funnels its result into the same isAnswerCorrect(item:answer:) on the test session. For multiple choice that is a plain string compare against the correct option id. For constrained production it runs the answer through matchesAcceptList, and that is where most of the subtlety lives, because typed answers across dozens of languages and writing systems do not compare cleanly.

The matcher works in two tiers. Tier one is a normalized exact match: both the answer and each accept-list entry pass through a single normalizeForMatching step, and if any pair comes out equal the answer is correct. The normalizer does a lot of quiet work that I only fully appreciated once I was staring at why a clearly-right answer was being rejected:

Unicode NFC composition, so é typed as one code point equals e plus a combining accent.
Stripping trailing sentence punctuation across scripts, not just . and ? but also 。, ។, ।, and friends.
Removing commas, including the full-width and CJK variants.
Stripping Arabic tashkeel diacritics, since the tiles and the accept list do not always agree on whether they are present.
Removing Japanese furigana hints in parentheses, so 食（た）べます matches 食べます.
Removing all spaces, which both handles no-space CJK and lets a multi-word reorder answer match regardless of spacing.
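The steps above can be sketched roughly as follows. This is not the app's normalizeForMatching: the function name, the exact punctuation and comma sets, the kana-only furigana pattern, and the order of the steps are all assumptions chosen to make the idea concrete.

```swift
import Foundation

// Illustrative sets only; the real lists are presumably longer.
let trailingPunctuation: Set<Character> = [
    ".", "?", "!", "\u{3002}", "\u{FF1F}", "\u{FF01}",  // . ? ! plus CJK full stop, full-width ? and !
    "\u{17D4}", "\u{0964}", "\u{061F}"                  // Khmer khan, Devanagari danda, Arabic question mark
]
let commaScalars: Set<Unicode.Scalar> = [",", "\u{FF0C}", "\u{3001}", "\u{060C}"]
let tashkeelRange: ClosedRange<UInt32> = 0x064B...0x0652  // fathatan through sukun

func normalizeForMatchingSketch(_ input: String) -> String {
    // 1. NFC composition, so decomposed and precomposed forms agree.
    var text = input.precomposedStringWithCanonicalMapping

    // 2. Drop kana readings in ASCII or full-width parentheses (assumed furigana shape).
    text = text.replacingOccurrences(
        of: #"[（(][\u3040-\u30FF]+[）)]"#,
        with: "",
        options: .regularExpression
    )

    // 3. Remove spaces, commas, and Arabic tashkeel, scalar by scalar.
    var kept = String.UnicodeScalarView()
    for scalar in text.unicodeScalars {
        if scalar.properties.isWhitespace { continue }
        if commaScalars.contains(scalar) { continue }
        if tashkeelRange.contains(scalar.value) { continue }
        kept.append(scalar)
    }
    var result = String(kept)

    // 4. Strip any run of trailing sentence punctuation.
    while let last = result.last, trailingPunctuation.contains(last) {
        result.removeLast()
    }
    return result
}
```

Both the typed answer and every accept-list entry would go through the same function, so any oddity in a rule applies symmetrically to both sides of the comparison.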

Tier two only runs when tier one fails. It uses Apple's on-device NLEmbedding.sentenceEmbedding to compute the distance between the answer and the closest accept-list entry, and accepts the answer only when that distance is below 0.15, roughly a similarity above 0.85. That catches harmless rewordings the normalizer cannot, while the deliberately strict cutoff (a small allowed distance, which is a high bar on similarity) keeps it from rubber-stamping a wrong answer. If embeddings are not available for a language, the tier simply does not fire, and the item stays graded by the strict path.
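A minimal sketch of the two-tier flow, with the embedding distance injected as a closure so the logic can be exercised without the NaturalLanguage framework. The function name, the stand-in normalizer, the closure type, and the choice to measure distance on the raw strings are assumptions; only the 0.15 cutoff and the tier ordering come from the description above.

```swift
import Foundation

// Stand-in for the real normalizer: NFC plus whitespace removal only.
func normalizeStandIn(_ text: String) -> String {
    text.precomposedStringWithCanonicalMapping.filter { !$0.isWhitespace }
}

// Returns nil when no embedding is available for the language (assumed contract).
typealias DistanceFunction = (String, String) -> Double?

func matchesAcceptListSketch(
    _ answer: String,
    acceptList: [String],
    maxDistance: Double = 0.15,
    distance: DistanceFunction? = nil
) -> Bool {
    // Tier one: normalized exact match against any accept-list entry.
    let normalizedAnswer = normalizeStandIn(answer)
    if acceptList.contains(where: { normalizeStandIn($0) == normalizedAnswer }) {
        return true
    }

    // Tier two: only reached when tier one fails, and only if distances exist.
    guard let distance = distance else { return false }
    guard let closest = acceptList.compactMap({ distance(answer, $0) }).min() else {
        return false
    }
    return closest < maxDistance
}
```

In the app, the closure would presumably wrap NLEmbedding's distance(between:and:distanceType:) and return nil when sentenceEmbedding(for:) has no model for the language, which is what makes the fallback to the strict path automatic.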

Minimal pairs and a layout I broke twice

Minimal pairs render as two large A/B buttons. The catch is that the stem arrives as a single block of text shaped like Context: ... then Sentence A: "..." then Sentence B: "...", and those labels are localized. So the view parses lines with a small set of Unicode-aware regexes: ^\p{L}+\s*A\s*[：:]\s*(.+)$ for the A line, the same for B, and a looser pattern for context, with a fallback to the last two non-empty lines if nothing matches. Using \p{L}+ instead of a literal word means it survives Japanese 文A, Estonian Lause A, and everything in between, and matching [：:] covers both the half-width and full-width colon.
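Here is one way the parsing could look, using NSRegularExpression. The A and B patterns are the ones quoted above; the context pattern, the type and function names, and the decision to leave surrounding quotes in the captured text are assumptions.

```swift
import Foundation

struct MinimalPairParts: Equatable {
    var context: String?
    var sentenceA: String
    var sentenceB: String
}

// Returns the first capture group of `pattern` in `line`, if the line matches.
func firstCapture(of pattern: String, in line: String) -> String? {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else { return nil }
    let whole = NSRange(line.startIndex..., in: line)
    guard let match = regex.firstMatch(in: line, options: [], range: whole),
          let captured = Range(match.range(at: 1), in: line) else { return nil }
    return String(line[captured])
}

func parseMinimalPair(_ stem: String) -> MinimalPairParts? {
    let lines = stem.components(separatedBy: .newlines)
        .map { $0.trimmingCharacters(in: .whitespaces) }
        .filter { !$0.isEmpty }

    let patternA = #"^\p{L}+\s*A\s*[：:]\s*(.+)$"#
    let patternB = #"^\p{L}+\s*B\s*[：:]\s*(.+)$"#
    let patternContext = #"^\p{L}+\s*[：:]\s*(.+)$"#  // assumed "looser" pattern

    var context: String?
    var a: String?
    var b: String?
    // A and B are tried before context, because a label like 文A would
    // otherwise also satisfy the looser context pattern.
    for line in lines {
        if a == nil, let text = firstCapture(of: patternA, in: line) {
            a = text
        } else if b == nil, let text = firstCapture(of: patternB, in: line) {
            b = text
        } else if context == nil, let text = firstCapture(of: patternContext, in: line) {
            context = text
        }
    }
    if let a = a, let b = b {
        return MinimalPairParts(context: context, sentenceA: a, sentenceB: b)
    }

    // Fallback: treat the last two non-empty lines as A and B.
    guard lines.count >= 2 else { return nil }
    return MinimalPairParts(context: nil,
                            sentenceA: lines[lines.count - 2],
                            sentenceB: lines[lines.count - 1])
}
```

The ordering of the checks matters more than it looks: with \p{L}+ in the context pattern, a no-space label such as 文A is itself a run of letters, so the more specific patterns have to get first refusal.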

The layout itself bit me. A long context, like one Japanese listening item, pushed the A and B buttons off screen, and my first instinct was to clamp the context in its own scroll view. That collapsed its height and let it overlap the buttons. The fix was almost the opposite: let the context render at its natural height with fixedSize(horizontal: false, vertical: true) and trust the outer scroll view that already wraps the whole question. I left a comment in the file pointing at that debug session so I would not break it a third time.

The safety net

The thing that actually let me sleep at night was a parameterized coverage test that runs once per bundled blueprint. It asserts there are at least forty of them, decodes each through the exact production loader, and then drives every item through the real TestSession: the canonical right answer must score correct, a sentinel garbage string must score wrong, and a full perfect run must land on the top CEFR cut while an all-wrong run lands on the bottom.

For word reorder it does something I am a little fond of. There is no stored "answer string" to compare against; the only thing that proves an item is solvable is finding an arrangement of tiles that the matcher accepts. So the test brute-forces it with a small recursive permutation solver, joining each permutation with a space exactly the way the UI does, and asserts that at least one of them passes. That is the structural guard against the extra-tile trap. To keep the search bounded it caps reorder items at ten tiles, because 10! is already 3,628,800 permutations (about 3.6 million) and anything larger is an authoring smell rather than a question. I would never reliably notice a stray distractor while reading a Thai or Hungarian item. The solver notices instantly, because the item simply will not resolve.
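A sketch of such a solver is below. The function names and the accepts closure are hypothetical; in the real test the closure would be the same matcher the user is graded by. Treating an over-cap item as unsolvable is also an assumption about how the cap is enforced.

```swift
// Depth-first search over tile orderings; stops at the first accepted sentence.
private func search(
    _ tiles: [String],
    _ used: inout [Bool],
    _ current: inout [String],
    _ accepts: (String) -> Bool
) -> Bool {
    if current.count == tiles.count {
        // Join with single spaces, the same way the UI builds the answer.
        return accepts(current.joined(separator: " "))
    }
    for index in tiles.indices where !used[index] {
        used[index] = true
        current.append(tiles[index])
        if search(tiles, &used, &current, accepts) { return true }
        current.removeLast()
        used[index] = false
    }
    return false
}

// True when at least one ordering that uses every tile is accepted.
func isSolvable(tiles: [String], maxTiles: Int = 10, accepts: (String) -> Bool) -> Bool {
    guard !tiles.isEmpty, tiles.count <= maxTiles else { return false }
    var used = [Bool](repeating: false, count: tiles.count)
    var current: [String] = []
    return search(tiles, &used, &current, accepts)
}
```

Because every permutation must use all tiles, a single stray distractor makes every candidate sentence one word too long, so the check fails no matter how the tiles are ordered.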

Takeaways

Model the variation where it belongs: four question types, but one of them carries a format enum, so the fan-out lives in a single view and the runner never has to care.
Production answers across writing systems need a real normalizer; the interesting bugs were NFC, furigana parentheses, and stray diacritics, not the obvious case-folding.
A UI rule that is right for the user can quietly become a content contract, so it helps to name the assumption when you write the rule.
When there is no stored answer to compare against, the most honest reviewer is a test that solves the question with the same engine the user is graded by.
