Grading typed answers without being a pedant

Some placement items in Language Level Check ask you to type or assemble the answer instead of tapping a choice. These are the production questions: the engine calls them constrainedProd, and each one carries a productionSpec with an acceptList of valid strings. The first time I shipped them I used plain string equality against that list, and almost immediately a perfectly correct answer with a trailing period got marked wrong. Exact equality is a pedant. A human grader is more forgiving, and I wanted the matcher to agree with the human.

The bug that actually started it

The tidy story would be that I sat down and designed a forgiving matcher. The real story is that a word-reorder question exposed a gap I had not thought about. In the word-reorder UI, the learner taps tiles to build a sentence, and the UI joins those tiles with spaces. That is fine for English. It is broken for Japanese, Chinese, and Cantonese, where the canonical accept-list strings are written with no spaces at all.

So the learner would assemble exactly the right sentence, the UI would hand the grader a space-joined string, and the space-free accept entry would never match it. Every word-reorder item in a no-space script was unwinnable. That is the kind of bug that does not show up in your own English testing and quietly punishes a whole set of languages.
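
To make the failure concrete, here is a minimal sketch. The tiles and the accept entry are invented for illustration and are not items from the app; only the join-with-spaces behavior comes from the description above.

```swift
// Hypothetical tiles for a word-reorder item (invented example sentence).
let tiles = ["私", "は", "学生", "です"]

// The word-reorder UI joins the tapped tiles with spaces.
let submitted = tiles.joined(separator: " ")

// The canonical accept-list entry is written with no spaces at all.
let acceptEntry = "私は学生です"

// Plain equality fails even though every tile is in the right order.
let exactMatch = submitted == acceptEntry

// Dropping the joiner spaces before comparing recovers the match.
let spaceFreeMatch = submitted.split(separator: " ").joined() == acceptEntry
```

With these values, exactMatch is false and spaceFreeMatch is true, which is the whole bug in two comparisons.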

Fixing it as a one-off space strip would have worked for that one case and left a dozen similar traps. Instead I stopped treating answer matching as a comparison and started treating it as a small, ordered, explainable pipeline.

Tier one: normalize, then compare exactly

The first tier does the boring, reliable work. It runs both the user's text and each accepted answer through the same normalizer and then checks for exact equality. The steps run in a fixed order, and each one erases a difference a fair grader would ignore:

Trim surrounding whitespace.
Apply Unicode NFC normalization, so precomposed é (U+00E9) and decomposed e plus a combining accent compare as equal.
Strip trailing sentence punctuation, not only . ! ? but the script-specific marks too: 。！？, the Khmer ។, the Devanagari danda ।, the Arabic full stop ۔.
Drop commas in their Latin, CJK (、), and full-width (，) forms.
Remove Arabic diacritics (tashkeel: fathah, dammah, kasrah, shadda, sukun, the tanwin marks), because tiles may carry them while the accept list does not, or the reverse.
Strip Japanese furigana hints written in full-width parentheses, so 食（た）べます matches 食べます.
Remove all spaces, which both fixes the word-reorder bug and is safe because production answers are short enough that word identity survives without the spaces.
Lowercase the result.

That ordering matters. NFC has to run before diacritic stripping so the combining marks are in a predictable form, and furigana stripping is a regex over full-width parens, so it has to see the text before spaces are removed. Conceptually the Swift side reads like this:

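import Foundation

// Sets named in the list above; the shipped definitions may include more marks.
// Terminal marks: . ! ? 。 ！ ？, Khmer ។, Devanagari danda ।, Arabic full stop ۔
private let trailingPunctuation =
    Set(".!?\u{3002}\u{FF01}\u{FF1F}\u{17D4}\u{0964}\u{06D4}".unicodeScalars)
// Tashkeel U+064B...U+0652: tanwin marks, fathah, dammah, kasrah, shadda, sukun
private let arabicDiacritics =
    Set("\u{064B}\u{064C}\u{064D}\u{064E}\u{064F}\u{0650}\u{0651}\u{0652}".unicodeScalars)

private func normalizeForMatching(_ text: String) -> String {
    var s = text.trimmingCharacters(in: .whitespacesAndNewlines)
    s = s.precomposedStringWithCanonicalMapping            // NFC
    while let last = s.unicodeScalars.last,
          trailingPunctuation.contains(last) {              // . ! ? 。 ！ ？ ។ । ۔
        s.unicodeScalars.removeLast()                       // drop exactly the scalar we tested
    }
    s = s.replacingOccurrences(of: ",", with: "")
         .replacingOccurrences(of: "、", with: "")
         .replacingOccurrences(of: "，", with: "")
    // Strip tashkeel scalar by scalar, then rebuild the string in one step.
    s = String(String.UnicodeScalarView(
        s.unicodeScalars.filter { !arabicDiacritics.contains($0) }))
    s = s.replacingOccurrences(of: "（[^）]*）", with: "",   // furigana hint
                               options: .regularExpression)
    s = s.replacingOccurrences(of: " ", with: "")           // CJK + reorder fix
    return s.lowercased()
}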

Most correct answers never leave tier one. It is fast, it has no model behind it, and you can read the code and know exactly why any answer passed or failed.
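
The comparison step around the normalizer could look roughly like the sketch below. The names normalized, matchesTierOne, sentenceEnders, and tashkeel are hypothetical, the normalizer is restated compactly only so the block runs on its own, and rejecting blank answers is my assumption rather than something the app is known to do.

```swift
import Foundation

// Sets built from the prose description; the app's real sets may differ.
let sentenceEnders = Set(".!?\u{3002}\u{FF01}\u{FF1F}\u{17D4}\u{0964}\u{06D4}".unicodeScalars)
let tashkeel = Set("\u{064B}\u{064C}\u{064D}\u{064E}\u{064F}\u{0650}\u{0651}\u{0652}".unicodeScalars)

// Compact restatement of the normalizer, same step order as the list above.
func normalized(_ text: String) -> String {
    var s = text.trimmingCharacters(in: .whitespacesAndNewlines)
        .precomposedStringWithCanonicalMapping                    // trim, then NFC
    while let last = s.unicodeScalars.last, sentenceEnders.contains(last) {
        s.unicodeScalars.removeLast()                             // trailing punctuation
    }
    for comma in [",", "、", "，"] {
        s = s.replacingOccurrences(of: comma, with: "")           // commas
    }
    s = String(String.UnicodeScalarView(
        s.unicodeScalars.filter { !tashkeel.contains($0) }))      // Arabic diacritics
    s = s.replacingOccurrences(of: "（[^）]*）", with: "",
                               options: .regularExpression)       // furigana hints
    return s.replacingOccurrences(of: " ", with: "").lowercased() // spaces, then case
}

// Tier one: pass when the normalized answer equals any normalized accept entry.
func matchesTierOne(_ userText: String, acceptList: [String]) -> Bool {
    let target = normalized(userText)
    guard !target.isEmpty else { return false }   // assumption: blank answers never pass
    return acceptList.contains { normalized($0) == target }
}
```

Under these assumptions, matchesTierOne("Hello.", acceptList: ["hello"]) and matchesTierOne("食（た）べます", acceptList: ["食べます"]) both pass, and both sides of every comparison go through the same function, so an accept entry that itself carries a period or a space is handled the same way.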

Tier two: an on-device embedding fallback

Only when tier one fails do I reach for the second tier, and only on iOS. Here I use sentence similarity from Apple's NaturalLanguage framework. I look up NLEmbedding.sentenceEmbedding(for:) for the test's target language, compute the distance(between:and:) from the user's answer to each accept entry, and accept the answer when the smallest distance falls under 0.15, which is roughly a cosine similarity above 0.85.

That threshold is deliberately tight. Tier two exists to rescue a near miss the normalizer could not catch, not to invent generosity. Loosen it and the matcher starts accepting answers that merely sound related, which in a placement test means inflating someone's score. The conservative bound is the whole point.

The framework does not ship an embedding model for every language the app supports, and that is fine. When no model exists, NLEmbedding.sentenceEmbedding(for:) returns nil, my lookup passes that nil through, and the verdict falls back to tier one's result rather than guessing. No model means no tier two, and tier one still stands on its own. Everything runs on device, so there is no network call in the grading path and nothing about the answer leaves the phone.
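
A sketch of how that fallback might be wired, under stated assumptions: tierTwoVerdict and embeddingVerdict are hypothetical names, the threshold decision is split out as a pure function so it can be checked without a model, and a nil result means "no opinion" so the caller keeps tier one's answer. Only NLEmbedding.sentenceEmbedding(for:) and distance(between:and:distanceType:) are real framework calls.

```swift
#if canImport(NaturalLanguage)
import NaturalLanguage
#endif

// Threshold from the text: accept only when the smallest distance is under 0.15.
let maxAcceptDistance = 0.15

// Pure decision step. Returns nil when there is nothing to compare,
// so the caller falls back to tier one's verdict.
func tierTwoVerdict(distances: [Double], threshold: Double = maxAcceptDistance) -> Bool? {
    guard let smallest = distances.min() else { return nil }
    return smallest < threshold
}

#if canImport(NaturalLanguage)
// Hypothetical wrapper around the framework calls (Apple platforms only).
@available(iOS 14.0, macOS 11.0, tvOS 14.0, watchOS 7.0, *)
func embeddingVerdict(userText: String, acceptList: [String], language: NLLanguage) -> Bool? {
    // No sentence model for this language: report "no opinion" instead of guessing.
    guard let embedding = NLEmbedding.sentenceEmbedding(for: language) else { return nil }
    let distances = acceptList.map {
        embedding.distance(between: userText, and: $0, distanceType: .cosine)
    }
    return tierTwoVerdict(distances: distances)
}
#endif
```

Keeping the threshold comparison separate from the model lookup also makes the bound easy to test: a smallest distance of exactly 0.15 is rejected, because the rule is "under", not "at or under".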

The hard constraint: two platforms must agree

Language Level Check runs on iOS, Android, and the web, and a placement result has to mean the same thing everywhere. So I held myself to a rule: tier one must be step-for-step identical with the Android implementation. The Kotlin AnswerMatcher mirrors the Swift normalizer exactly, down to using java.text.Normalizer.Form.NFC and the same punctuation and diacritic sets.

Android deliberately omits the embedding tier. I left a comment in the Kotlin file explaining why, because the reasoning is the load-bearing part:

iOS additionally has an NLEmbedding fuzzy-match fallback. That is deliberately NOT replicated here. Fuzzy matching can accept near-miss wrong answers, and the bug being fixed is purely a normalization gap. Android performs the deterministic match only.

The nice property this buys me is that the platforms cannot disagree on a clearly correct answer. Any normalized-exact match passes on both. The embedding tier on iOS can only ever add acceptances for genuine near misses; it can never reject something tier one already accepted. So the place where it matters most, an obviously right answer, is right no matter which app you opened. The only divergence possible is iOS being slightly more forgiving on a borderline case, and even that is bounded by a tight threshold.
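
That one-way property can be written down as a small combinator. This is an illustrative sketch, not the app's code: finalVerdict and androidVerdict are hypothetical names, and tierTwo stands in for the iOS-only embedding check, returning nil when no model is available.

```swift
// `tierOne` is the deterministic result both platforms share.
// `tierTwo` is evaluated lazily and only when tier one has already failed.
func finalVerdict(tierOne: Bool, tierTwo: () -> Bool?) -> Bool {
    if tierOne { return true }      // a normalized-exact match is never overturned
    return tierTwo() ?? false       // no model, no opinion: tier one's rejection stands
}

// Android's behavior is the same function with a tier two that never has an opinion.
func androidVerdict(tierOne: Bool) -> Bool {
    finalVerdict(tierOne: tierOne, tierTwo: { nil })
}
```

Because the second tier is only consulted after a tier-one failure, it can turn a rejection into an acceptance but never the other way around, which is the bounded divergence described above.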

The edge cases that earned their place

Every rule in the normalizer is there because something broke without it, not because it seemed thorough:

The space strip came straight from the word-reorder bug in no-space scripts.
NFC came from accent-bearing languages where the same character can arrive composed or decomposed depending on the keyboard.
Tashkeel stripping came from Arabic, where short-vowel diacritics are optional in everyday writing and an accept list rarely spells them out.
Furigana stripping came from Japanese reading hints leaking into typed answers.
The script-specific terminal punctuation set came from realizing a Devanagari danda or an Arabic full stop is exactly as ignorable as a Latin period.

What I took from it

Treat answer matching as an ordered pipeline, not a single comparison, and be able to name a reason for each step.
Do the deterministic normalization first; let a model help only with what is left over.
Keep the model tier conservative. A tight similarity threshold rescues near misses without rewarding guesses, and in a scored test that restraint protects the learner.
Degrade cleanly. When no embedding model exists for a language, fall back to the rules instead of fabricating an answer.
Anchor cross-platform agreement at the layer every platform shares, and write down why the platforms differ where they do.
Test in the language that will break you. The bug that started all of this was invisible in English.

The result is a matcher that grades roughly the way a fair human would, while staying deterministic and explainable enough that I can justify any verdict it produces, in any of the 46 languages.
