Reading pinball scores from photos with multi-frame consensus

A pinball score display is one of the harder things to point a camera at. Dot-matrix glyphs flicker at their own refresh rate, the glass throws glare, player scores sit in odd two-by-two layouts, and the digits you most need to read are often the ones halfway through ticking up. I decided fairly early that no single photo was trustworthy enough to build a permanent game record on, so I built the capture around a short burst of frames and let several cheap reads vote on the result before I trust any of them.

Why one photo is not enough

The failure modes of a single shot are stubborn, and they rarely overlap. One frame catches a glare band straight across the millions digit. The next frame is sharp on that digit but caught the display mid-refresh, so a segment is dark and a 6 reads like a 5. A third is fine everywhere except the camera was tilted and the model guessed the wrong player column. Any one of these produces a plausible but wrong number, and a wrong score that looks right is more troublesome than an obvious failure, because nobody notices it until the season stats are already skewed.

The thing that helped was noticing that these errors are usually independent across frames. A digit that is ambiguous in one photo is often crisp in the next one a few hundred milliseconds later. So instead of asking the model to be perfect on one image, I capture a handful and let agreement carry the weight.

Capturing the burst

The camera keeps a small ring buffer of recent frames. When you tap to capture, I freeze the most recent five frames out of that buffer and immediately stop the camera session, so the green privacy light goes off while analysis runs. There is a hard floor here: if fewer than two valid frames survive, the session bails out and asks for a retake rather than guessing from a single image.
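As a rough sketch of the buffer-and-freeze idea (not the app's actual code), a fixed-capacity ring buffer can overwrite its oldest slot on each push and hand back the newest few frames on demand. The names FrameRingBuffer, freezeBurst, BurstError, and the isValid predicate are all hypothetical, and what counts as a "valid" frame is left to the caller.

```swift
// Hypothetical fixed-capacity ring buffer; names are illustrative only.
struct FrameRingBuffer<Frame> {
    private var storage: [Frame?]
    private var nextIndex = 0          // slot the next push will overwrite
    private(set) var count = 0         // number of frames currently held

    init(capacity: Int) {
        precondition(capacity > 0, "capacity must be positive")
        storage = Array(repeating: nil, count: capacity)
    }

    // Stores a frame, overwriting the oldest one once the buffer is full.
    mutating func push(_ frame: Frame) {
        storage[nextIndex] = frame
        nextIndex = (nextIndex + 1) % storage.count
        count = min(count + 1, storage.count)
    }

    // Returns up to `n` of the most recent frames, oldest first.
    func latest(_ n: Int) -> [Frame] {
        let take = min(max(n, 0), count)
        return (0..<take).compactMap { offset -> Frame? in
            let index = (nextIndex - take + offset + storage.count) % storage.count
            return storage[index]
        }
    }
}

enum BurstError: Error { case tooFewFrames }

// Freezes the newest `size` frames, keeps those passing `isValid`
// (an assumed caller-supplied check), and enforces the floor of two.
func freezeBurst<Frame>(
    from buffer: FrameRingBuffer<Frame>,
    size: Int = 5,
    minimumValid: Int = 2,
    isValid: (Frame) -> Bool
) throws -> [Frame] {
    let frames = buffer.latest(size).filter(isValid)
    guard frames.count >= minimumValid else { throw BurstError.tooFewFrames }
    return frames
}
```

Throwing when too few frames survive keeps the "ask for a retake" decision in one place instead of scattering count checks through the analysis code.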

Each frame goes to a vision model independently and concurrently. I use a small, cheap model first (gpt-5-nano), driven with a Structured Outputs JSON schema so I never have to parse free-form prose. The schema is deliberately tiny and strict:

struct ScoreResult: Codable, Sendable {
    let machineName: String?
    let playerCount: Int
    let scores: [Int]      // P1 first, P2 second, ...
    let confidence: Double
    let notes: String?
}

Before fusing anything, every per-frame read is validated on its own. A shot is thrown out if playerCount does not equal scores.count, if the player count falls outside one to four, or if any score is zero or negative. Catching a malformed read here, rather than during the vote, keeps obviously broken frames from polluting the consensus.
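Those three rules fit in a few lines. The sketch below repeats the ScoreResult struct so it stands alone; isValidRead is a hypothetical helper name, and the JSON literal is made up purely to show a decode followed by a check.

```swift
import Foundation

struct ScoreResult: Codable, Sendable {
    let machineName: String?
    let playerCount: Int
    let scores: [Int]      // P1 first, P2 second, ...
    let confidence: Double
    let notes: String?
}

// Hypothetical helper: true only if the read passes all three rules.
func isValidRead(_ read: ScoreResult) -> Bool {
    // Player count must be between one and four.
    guard (1...4).contains(read.playerCount) else { return false }
    // The number of scores must match the declared player count.
    guard read.playerCount == read.scores.count else { return false }
    // Zero or negative scores are rejected.
    return read.scores.allSatisfy { $0 > 0 }
}

// Illustrative payload, not real model output.
let sampleJSON = """
{"machineName": null, "playerCount": 2, "scores": [1338620, 579450],
 "confidence": 0.9, "notes": null}
"""
let sampleRead = try! JSONDecoder().decode(ScoreResult.self, from: Data(sampleJSON.utf8))
let sampleIsValid = isValidRead(sampleRead)
```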

Grouping by sorted scores, then voting

Once I have the valid per-frame reads, I group them and let the largest group win. The interesting part is the group key. I deliberately key on the sorted scores, not the raw array, so that two frames which found the same numbers but disagreed about player order still land in the same bucket:

// Order-insensitive key: same numbers, different order, same group.
func groupKey(playerCount: Int, scores: [Int]) -> String {
    let sorted = scores.sorted().map(String.init).joined(separator: ",")
    return "\(playerCount):[\(sorted)]"
}

var groups: [String: [Shot]] = [:]
for shot in validShots {
    let sr = shot.result
    groups[groupKey(playerCount: sr.playerCount, scores: sr.scores), default: []]
        .append(shot)
}

I declare consensus when the largest group has at least three agreeing frames. Out of a five-frame burst that is a strict majority, so a lone bad frame, or even two bad frames that agree with each other, cannot outvote the truth. When that bar is met, the read is marked verified and the result carries an agreementCount of however many frames backed it.
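The vote itself might look something like the sketch below. Shot is simplified here to carry its scores directly, the Consensus struct and pickConsensus function are hypothetical names, and breaking ties between equally sized groups by key is my own assumption, added only so the result does not depend on dictionary iteration order.

```swift
// Simplified stand-in for a validated per-frame read.
struct Shot {
    let playerCount: Int
    let scores: [Int]
}

// Order-insensitive key: same numbers, different order, same group.
func groupKey(playerCount: Int, scores: [Int]) -> String {
    let sorted = scores.sorted().map(String.init).joined(separator: ",")
    return "\(playerCount):[\(sorted)]"
}

// Hypothetical result type for the vote.
struct Consensus {
    let key: String
    let agreementCount: Int
    let verified: Bool
}

// Groups shots by key and returns the largest group, verified only if
// it reaches the threshold. Returns nil when there are no shots.
func pickConsensus(from shots: [Shot], threshold: Int = 3) -> Consensus? {
    let groups = Dictionary(grouping: shots) {
        groupKey(playerCount: $0.playerCount, scores: $0.scores)
    }
    // Largest group wins; equal sizes fall back to the smaller key (assumption).
    let winner = groups.max { a, b in
        a.value.count != b.value.count
            ? a.value.count < b.value.count
            : a.key > b.key
    }
    guard let winner else { return nil }
    return Consensus(
        key: winner.key,
        agreementCount: winner.value.count,
        verified: winner.value.count >= threshold
    )
}
```

With five frames where four found the same two numbers (in either order) and one misread a digit, this returns the four-frame group as verified.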

Player order is its own trap

Because the group key throws away order on purpose, I still have to recover it. The tempting shortcut is to rank players by score, but that is exactly the wrong instinct: scores change constantly, ties happen, and a comeback flips the ranking mid-ball. So the model is told to resolve player identity by display position: P1 is top-left, P2 is top-right, P3 bottom-left, P4 bottom-right on a two-by-two dot-matrix display. Every prompt says it plainly, including the line "Do not sort scores by value."

Within the winning group, I pick the most common ordering. If the frames do not unanimously agree on order, I do not just take the majority and hope; I spend one more small call that hands the model the already-confirmed set of numbers and asks only the easier question, which number sits in which position:

let orderingIsUnanimous = orderCounts.count == 1
if !orderingIsUnanimous {
    let tie = try await openAI.resolveOrdering(
        imageData: bestShot.imageData,
        scores: representative.scores
    )
    // Only accept the re-order if it found the SAME numbers.
    if tie.scores.sorted() == representative.scores.sorted() {
        representative = tie
    }
}

That guard matters. The tie-breaker is allowed to rearrange the digits I already trust, but it is not allowed to change them. If it comes back with a different set of numbers, I discard it and fall back to the majority ordering. Splitting "which numbers" from "which position" into two questions made each one much easier for a small model to get right.
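For completeness, here is one way the ordering tally feeding that check could be computed. This is a sketch under assumptions: OrderingVote and tallyOrderings are hypothetical names, and ties between equally common orderings go to whichever appeared first, which is my choice rather than anything the pipeline above specifies.

```swift
// Hypothetical result of tallying the orderings inside the winning group.
struct OrderingVote {
    let majority: [Int]      // the most common P1, P2, ... ordering
    let isUnanimous: Bool    // true when every frame agreed on order
}

// Counts each distinct ordering and picks the most common one.
// Returns nil for an empty group.
func tallyOrderings(_ orderings: [[Int]]) -> OrderingVote? {
    guard var majority = orderings.first else { return nil }
    var orderCounts: [[Int]: Int] = [:]
    for ordering in orderings {
        orderCounts[ordering, default: 0] += 1
    }
    // Walk in arrival order so a tie keeps the earliest ordering (assumption).
    for ordering in orderings {
        if orderCounts[ordering, default: 0] > orderCounts[majority, default: 0] {
            majority = ordering
        }
    }
    return OrderingVote(majority: majority, isUnanimous: orderCounts.count == 1)
}
```

A non-unanimous result is the trigger for the extra ordering call; the majority ordering is what remains in place if that call fails the same-numbers guard.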

Escalating only when the vote fails

If no group reaches three, the cheap frames could not agree, and I treat that as a signal rather than an error. I pick the best remaining frame, preferring ones that passed the quality gate and then breaking ties on the model's own confidence, and send just that one frame to a stronger model (gpt-5-mini). Then comes the part I like: I check whether the stronger model's answer falls into a group the cheap frames already produced.

If mini's read matches an existing nano group, I label the result mini_match, mark it verified, and bump the agreement count to include mini's vote.
If mini lands somewhere none of the cheap frames did, I label it mini_no_match and do not mark it verified. The number still shows, but without the trusted seal.
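The escalation path can be sketched in two small pieces: choosing which frame to send, and labeling the answer that comes back. Candidate, pickBest, EscalationOutcome, and classifyEscalation are hypothetical names. Reporting an agreement count of 1 in the no-match case is also an assumption, since only mini's own vote backs that number.

```swift
// Hypothetical view of a frame that failed to reach consensus.
struct Candidate {
    let passedQualityGate: Bool
    let confidence: Double
}

// Prefers frames that passed the quality gate, then higher confidence.
func pickBest(_ candidates: [Candidate]) -> Candidate? {
    candidates.max { a, b in
        if a.passedQualityGate != b.passedQualityGate {
            return !a.passedQualityGate && b.passedQualityGate
        }
        return a.confidence < b.confidence
    }
}

struct EscalationOutcome: Equatable {
    let fusionMethod: String
    let verified: Bool
    let agreementCount: Int
}

// Compares the stronger model's group key against the groups the cheap
// frames produced (group key -> number of frames in that group).
func classifyEscalation(miniKey: String, nanoGroupSizes: [String: Int]) -> EscalationOutcome {
    if let nanoVotes = nanoGroupSizes[miniKey] {
        // Mini agrees with an existing group: count its vote too.
        return EscalationOutcome(fusionMethod: "mini_match",
                                 verified: true,
                                 agreementCount: nanoVotes + 1)
    }
    // Mini stands alone: show the number, but without the verified mark.
    return EscalationOutcome(fusionMethod: "mini_no_match",
                             verified: false,
                             agreementCount: 1)
}
```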

The cost structure is what makes this worth doing. Nano runs at a fraction of mini's per-token rate, so analyzing five nano frames is cheaper than one mini call, and most captures never escalate at all. The pipeline spends the expensive model only on the genuinely hard displays, which is the small minority.

Keeping confidence inspectable

A fusion system that just emits a number is still a black box; you cannot tell whether four frames agreed or whether it was a coin flip. So every fused result carries its own provenance: a fusionMethod label describing the path it took, plus the agreementCount and a compact groupsSummary string showing exactly how the votes split.

// groupsSummary example: "2:[1338620,579450]:3,2:[1338620,579451]:1"
//   two players, this score set, three frames agreed
//   ...one stray frame read the last digit as a 1
{
  "scores": [1338620, 579450],
  "fusionMethod": "consensus",
  "agreementCount": 3,
  "verified": true
}

The values fusionMethod can take (consensus, mini_match, mini_no_match, rescue) tell me at a glance how a read was reached. When something comes out wrong, this is the first thing I look at, and the summary usually explains itself: one stray frame caught a digit mid-flicker and lost the vote three to one. Determinism helps here too. Because grouping depends only on what each frame read (the key is built from sorted scores), the same five frames always collapse to the same groups regardless of which finished first, so a bad read can be reproduced and debugged rather than blamed on luck.
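A summary string in the shape shown above could be assembled roughly like this. The function name groupsSummary mirrors the field but is otherwise hypothetical, and sorting by group size (then by key) is an assumption made so the string comes out the same on every run.

```swift
// Builds "key:count" entries joined by commas, largest group first.
// Input maps each group key to the number of frames that landed in it.
func groupsSummary(_ groupSizes: [String: Int]) -> String {
    groupSizes
        .sorted { a, b in
            // Bigger groups first; equal sizes ordered by key for stability.
            a.value != b.value ? a.value > b.value : a.key < b.key
        }
        .map { "\($0.key):\($0.value)" }
        .joined(separator: ",")
}
```

Fed the two groups from the example, three frames on one key and one stray frame on the other, it reproduces the string in the comment.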

If a system makes a decision you cannot interrogate later, you have not really finished the feature. You have left yourself a future bug report you cannot answer.

I have since experimented with handing all five frames to one model in a single multi-image call and letting it cross-reference internally, which is simpler and sometimes sharper. But the per-frame voting version is the one that taught me the most, because every step of its reasoning is visible.

What I took away

When one observation is unreliable, redundancy beats precision: several mediocre frames, fused, outperform a single excellent guess.
Errors across frames tend to be independent, so a majority vote cancels them instead of compounding them. Three agreeing frames is a strong bar.
Resolve identity by something stable, like display position, never by something that changes, like score.
Split a hard question into easy ones: "which numbers" and "which position" are far easier separately, and you can guard the second so it never corrupts the first.
Spend compute where it pays. Vote cheaply first; escalate to the expensive model only when the cheap votes disagree, and use the cheap votes to check the expensive answer.
Attach provenance to every result. A method label, an agreement count, and a group summary turn confidence from a feeling into something you can read.
