Why I rank with Bradley-Terry instead of a running tally

The most obvious way to rank a list from A-vs-B picks is to count wins. Tally up every choice, sort by total, done. I tried that first, and it started to wobble the moment I asked a slightly harder question: is the item on top actually the best, or did it just get lucky with easy matchups? A running tally cannot really answer that, because it forgets who you beat. So I replaced it with a Bradley-Terry model, the same family of math used to rate chess players and sports teams, and the rest of this post is what I learned wiring it into an iOS app that has to stay fast and honest on a phone.

What a running tally throws away

Imagine ranking ten coffee shops. One shop wins five comparisons, and another also wins five. A tally calls that a tie. But suppose the first shop beat five mediocre contenders and the second beat the five strongest. Those records are not equivalent, and anyone reading the matchups would see the difference. A count of wins cannot, because a win against a weak opponent and a win against a strong one are each worth exactly one point.

There is a second, quieter problem. A tally gives you an order but no sense of how much to trust it. Five wins to four looks decisive on paper and is often close to noise, especially early in a quiz when most items have barely been compared. I wanted the ranking to feel earned, which meant the model needed to account for opponent strength and then say how solid the result really is.

The Bradley-Terry idea

Bradley-Terry assigns each item a single hidden number, a utility score. The probability that item i beats item j is the sigmoid of the difference between their scores:

P(i beats j) = 1 / (1 + exp(-(u_i - u_j)))

Two items with equal utilities sit at a 50/50 coin flip. A large gap pushes the probability toward one. The final ranking is just those utilities sorted, but because they are fit jointly against every comparison at once, beating a strong item moves your score more than beating a weak one. Opponent quality is baked in rather than ignored, which I found a really pleasing property. One subtlety: the utilities are only identifiable up to a constant offset, so I normalize them to mean zero after every fit. That keeps the numbers stable from one run to the next, which matters because I refit constantly.
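
To make the formula and the mean-zero step concrete, here is a minimal sketch. The function names and the string ids are placeholders I picked for illustration, not the app's real types.

```swift
import Foundation

/// Logistic function: maps a score difference to a win probability.
func sigmoid(_ x: Double) -> Double {
    1.0 / (1.0 + exp(-x))
}

/// Modeled probability that an item with utility `ui` beats one with utility `uj`.
func winProbability(_ ui: Double, _ uj: Double) -> Double {
    sigmoid(ui - uj)
}

/// Shift utilities so their mean is zero. Only differences enter the model,
/// so every predicted probability is unchanged by this shift.
func centered(_ utilities: [String: Double]) -> [String: Double] {
    guard !utilities.isEmpty else { return utilities }
    let mean = utilities.values.reduce(0, +) / Double(utilities.count)
    return utilities.mapValues { $0 - mean }
}

let raw = ["espresso": 2.0, "latte": 1.0, "drip": 0.0]
let shifted = centered(raw)
print(winProbability(raw["espresso"]!, raw["latte"]!))         // about 0.731
print(winProbability(shifted["espresso"]!, shifted["latte"]!)) // same value after centering
```

The two printed probabilities match because centering moves every score by the same amount.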

Fitting it on a phone, on sparse data

I fit the scores by gradient ascent on the log-likelihood of all the picks made so far. The gradient has a famously clean form: each comparison adds actual - predicted to item A's gradient and subtracts the same amount from item B's, where actual is 1 if A won, 0 if it lost, and 0.5 for a tie. Scoring a tie as half a win is a common convention rather than part of the original Bradley-Terry model, and its effect is to pull the two scores toward each other. The inner loop is small enough to read whole:

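for event in validComparisons {
    // Map the choice to A's observed outcome. Skipped pairs carry no signal.
    // (A switch statement is used because `continue` cannot appear inside a switch expression.)
    let actualA: Double
    switch event.choice {
    case .itemA:   actualA = 1.0
    case .itemB:   actualA = 0.0
    case .tie:     actualA = 0.5
    case .skipped: continue
    }
    guard let uA = u[event.itemAId], let uB = u[event.itemBId] else { continue }
    let pAWins = sigmoid(uA - uB)
    let grad = actualA - pAWins                   // residual: observed minus predicted
    gradients[event.itemAId, default: 0] += grad
    gradients[event.itemBId, default: 0] -= grad  // zero-sum per comparison
}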
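
For readers who want to see where the residual comes from, here is the derivation in brief. The indexing is my own shorthand for this note: comparison k pits item a_k against item b_k, y_k is the observed outcome for a_k (1, 0, or 0.5), and p_k is the modeled probability that a_k wins.

```latex
\ell(u) = \sum_k \Bigl[\, y_k \log p_k + (1 - y_k) \log (1 - p_k) \Bigr],
\qquad p_k = \frac{1}{1 + \exp\bigl(-(u_{a_k} - u_{b_k})\bigr)}

\frac{\partial \ell}{\partial u_i}
  = \sum_{k \,:\, a_k = i} (y_k - p_k) \;-\; \sum_{k \,:\, b_k = i} (y_k - p_k)
```

Each term is just actual minus predicted, added for the A-side item and subtracted for the B-side item, which is what the loop accumulates.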

The tricky part is that real data is sparse and noisy. A person ranking a list might make only a few dozen comparisons, sometimes contradicting themselves on adjacent taps. Left unconstrained, the optimizer will happily shove scores toward plus or minus infinity to perfectly explain a handful of lopsided results, which is overfitting in its purest form. Two guardrails keep it honest:

L2 regularization (lambda 0.01). Each step subtracts lambda * u_i from the gradient, gently pulling scores back toward zero so the model stays calm when it has seen very little. It is the difference between a measured answer and one the model made up.
A 200-iteration cap with an epsilon (1e-6) convergence check. The loop tracks the largest parameter change per pass and stops the moment it drops below epsilon. With a learning rate of 0.5 it usually settles in a couple dozen iterations, and the hard cap guarantees the fit never blocks the UI even on pathological input.
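
Putting the two guardrails around the inner loop gives something like the sketch below. It is a self-contained illustration, not the app's code: the types and the string ids are placeholders, and I average the gradient over the scored comparisons, which is my own assumption for keeping a 0.5 step stable regardless of how many picks there are.

```swift
import Foundation

enum Choice { case itemA, itemB, tie, skipped }

struct Comparison {
    let itemAId: String
    let itemBId: String
    let choice: Choice
}

func sigmoid(_ x: Double) -> Double { 1.0 / (1.0 + exp(-x)) }

/// Regularized Bradley-Terry fit by gradient ascent (illustrative sketch).
func fitBradleyTerry(
    items: [String],
    comparisons: [Comparison],
    learningRate: Double = 0.5,
    lambda: Double = 0.01,
    maxIterations: Int = 200,
    epsilon: Double = 1e-6
) -> [String: Double] {
    var u: [String: Double] = [:]
    for id in items { u[id] = 0.0 }

    // Keep only comparisons that carry signal and reference known items.
    let scored = comparisons.filter {
        $0.choice != .skipped && u[$0.itemAId] != nil && u[$0.itemBId] != nil
    }
    guard !scored.isEmpty else { return u }  // no data: neutral zeros

    for _ in 0..<maxIterations {
        var gradients: [String: Double] = [:]
        for event in scored {
            let actualA: Double
            switch event.choice {
            case .itemA:   actualA = 1.0
            case .itemB:   actualA = 0.0
            case .tie:     actualA = 0.5
            case .skipped: continue
            }
            guard let uA = u[event.itemAId], let uB = u[event.itemBId] else { continue }
            let residual = actualA - sigmoid(uA - uB)
            gradients[event.itemAId, default: 0] += residual
            gradients[event.itemBId, default: 0] -= residual
        }

        // Ascent step with the L2 pull toward zero; track the largest move.
        var maxChange = 0.0
        for id in Array(u.keys) {
            let current = u[id] ?? 0
            let meanGradient = (gradients[id] ?? 0) / Double(scored.count)
            let step = learningRate * (meanGradient - lambda * current)
            u[id] = current + step
            maxChange = max(maxChange, abs(step))
        }

        // Utilities are only defined up to an offset, so re-center to mean zero.
        let mean = u.values.reduce(0, +) / Double(u.count)
        u = u.mapValues { $0 - mean }

        if maxChange < epsilon { break }  // converged before the hard cap
    }
    return u
}

let picks = Array(repeating: Comparison(itemAId: "A", itemBId: "B", choice: .itemA), count: 5)
let fitted = fitBradleyTerry(items: ["A", "B"], comparisons: picks)
print(sigmoid(fitted["A"]! - fitted["B"]!))  // well above 0.5, but finite thanks to lambda
```

Without the lambda term, five unanswered wins would push the gap upward for as long as the loop ran; with it, the scores settle at a finite value.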

There is also a clean special case for two-item lists, where gradient ascent is overkill. For N=2 I just count wins with Laplace smoothing (add one pseudo-observation to each side) so that five straight picks read as roughly 86 percent rather than a brittle 100 percent. Smoothing is the small, unglamorous fix that keeps tiny samples from sounding more certain than they are.
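
The arithmetic is short enough to show directly. This is a sketch of add-one smoothing with a function name of my choosing; it leaves ties out, since how they are counted in the two-item case is not something I want to assert here.

```swift
/// Add-one (Laplace) smoothed estimate of P(A beats B) from raw win counts.
func smoothedWinProbability(winsA: Int, winsB: Int) -> Double {
    Double(winsA + 1) / Double(winsA + winsB + 2)
}

print(smoothedWinProbability(winsA: 5, winsB: 0))  // 6/7, about 0.857
print(smoothedWinProbability(winsA: 0, winsB: 0))  // 0.5 with no data at all
```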

Elo to choose pairs, Bradley-Terry to rank

One thing that surprised me is that I ended up using two different rating ideas for two different jobs. During the quiz I run a lightweight Elo update for pair selection, not for the final answer. Each candidate pair gets scored by how close the two items currently are (uncertain matchups are the most informative), with a soft penalty for pairs I have already shown and a boost for items that have not appeared recently. That balances exploration against exploitation and makes sure no item gets ignored. Bradley-Terry then does the heavy lifting once, on the full comparison history, to produce the ranking and confidence the user actually sees. Elo picks good questions; Bradley-Terry gives the careful answer.
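
A rough sketch of the selection side looks like this. The Elo formulas are the standard ones, but the K-factor of 32 and the penalty and boost weights are placeholder values I chose for illustration, as are the function and parameter names.

```swift
import Foundation

/// Standard Elo expected score for A against B (400-point logistic scale).
func eloExpected(_ ratingA: Double, _ ratingB: Double) -> Double {
    1.0 / (1.0 + pow(10.0, (ratingB - ratingA) / 400.0))
}

/// One Elo update. `scoreA` is 1 for an A win, 0 for a loss, 0.5 for a tie.
/// k = 32 is a conventional default, assumed here.
func eloUpdate(ratingA: Double, ratingB: Double, scoreA: Double, k: Double = 32)
    -> (a: Double, b: Double) {
    let delta = k * (scoreA - eloExpected(ratingA, ratingB))
    return (ratingA + delta, ratingB - delta)
}

/// Hypothetical pair score: higher means "ask this pair next".
func pairScore(
    ratingA: Double, ratingB: Double,
    timesShown: Int,
    roundsSinceA: Int, roundsSinceB: Int
) -> Double {
    // 1 when the matchup is a coin flip, falling toward 0 as it gets lopsided.
    let closeness = 1.0 - 2.0 * abs(eloExpected(ratingA, ratingB) - 0.5)
    let repeatPenalty = 0.5 * Double(timesShown)                   // assumed weight
    let recencyBoost = 0.05 * Double(roundsSinceA + roundsSinceB)  // assumed weight
    return closeness - repeatPenalty + recencyBoost
}

let next = eloUpdate(ratingA: 1500, ratingB: 1500, scoreA: 1)
print(next.a, next.b)  // 1516.0 1484.0
```

The quiz would evaluate pairScore for each candidate pair and show the highest-scoring one.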

Answering the question users actually ask

Knowing the order is only half the job. The question people actually care about is simpler: is my number one a real winner, or basically a coin flip with the runner-up? A point estimate cannot say. So I quantify the uncertainty with a bootstrap. It resamples the comparisons with replacement, refits the model on each resampled set, and records how often the same item lands on top, plus the average modeled probability that #1 beats #2. If the same item wins 190 of 200 runs, that is a strong result; if it wins 110 of 200, the top spot is genuinely contested and the honest answer is "too close to call." That fraction maps to a label: 0.85 and up is High, 0.65 up to 0.85 is Medium, anything lower is Low.
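
Here is a minimal sketch of the resampling loop and the label mapping. The refit is passed in as a closure (`topItem`, a name I made up) so the example stays independent of any particular model, and it covers only the top-share fraction, not the #1-beats-#2 average.

```swift
enum Confidence: String { case high = "High", medium = "Medium", low = "Low" }

/// Thresholds as described above: 0.85 and up is High, 0.65 and up is Medium.
func confidenceLabel(pTop1: Double) -> Confidence {
    if pTop1 >= 0.85 { return .high }
    if pTop1 >= 0.65 { return .medium }
    return .low
}

/// Fraction of bootstrap resamples in which `reference` ends up ranked first.
/// `topItem` stands in for "refit the model and return the winner".
func bootstrapTopShare<Event, ID: Equatable>(
    events: [Event],
    samples: Int,
    reference: ID,
    topItem: ([Event]) -> ID?
) -> Double {
    guard !events.isEmpty, samples > 0 else { return 0 }
    var hits = 0
    for _ in 0..<samples {
        // Draw events.count events with replacement.
        let resample = (0..<events.count).map { _ in
            events[Int.random(in: 0..<events.count)]
        }
        if topItem(resample) == reference { hits += 1 }
    }
    return Double(hits) / Double(samples)
}

print(confidenceLabel(pTop1: 190.0 / 200.0).rawValue)  // High
print(confidenceLabel(pTop1: 110.0 / 200.0).rawValue)  // Low
```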

The catch is cost. The bootstrap refits the whole model hundreds of times, and I wanted confidence to appear without a visible stall. Two things made that workable. First, the sample count adapts to list size, since each refit gets more expensive as items grow:

func adaptiveBootstrapSamples(itemCount: Int) -> Int {
    switch itemCount {
    case 0...5:   return 200
    case 6...12:  return 150
    case 13...25: return 100
    default:      return 70
    }
}

Second, the whole thing runs in an async task with a roughly 1500 ms time budget. It checks the clock every ten samples, yields cooperatively so it never freezes the UI, and stops early if it runs long, reporting how many samples it actually used. Confidence computed from 90 honest resamples is far better than a perfect 200 that arrives after the user has put the phone down.
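
The budget check can be sketched as below. This is an assumption-laden outline rather than the shipped code: `runOneSample` is a hypothetical closure standing in for one resample-and-refit, and I use Date for timing purely to keep the example short.

```swift
import Foundation

/// Runs up to `maxSamples` bootstrap samples, checking the clock every
/// `checkEvery` samples and stopping once `budget` seconds have elapsed.
/// Returns how many samples actually ran.
func timeBudgetedBootstrap(
    maxSamples: Int,
    budget: TimeInterval = 1.5,
    checkEvery: Int = 10,
    runOneSample: () -> Void
) async -> Int {
    let start = Date()
    var used = 0
    while used < maxSamples {
        runOneSample()
        used += 1
        if used % checkEvery == 0 {
            // Out of time: report what we have instead of blocking longer.
            if Date().timeIntervalSince(start) > budget { break }
            // Give other tasks a chance to run before continuing.
            await Task.yield()
        }
    }
    return used
}
```

A caller would divide the top-item hit count by the returned sample count, so an early stop still yields a valid fraction.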

The bootstrap is what turns a sorted list into one that states how far it can be trusted. It lets the app admit when it does not yet know, which is exactly the moment a tally would have answered with false confidence.

Storing the result so it survives

The comparisons themselves are SwiftData records, and I deliberately store each one by item UUID rather than as an object relationship. That way the history survives even if the underlying item is later renamed or deleted, and it leaves room to change the ranking algorithm later without migrating old data. When a session finishes I cache the outcome on the session: the fitted utilities (encoded as Data), the final ordered IDs, and the confidence numbers (pTop1, the #1-beats-#2 probability, and how many bootstrap samples ran). Reopening an old ranking is then a read, not a recomputation.
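
As an illustration of the "encoded as Data" part, here is one way the cached payload could look using plain Foundation. Apart from pTop1, the type and field names are hypothetical, JSON is my assumption for the encoding, and the SwiftData model itself is left out.

```swift
import Foundation

/// Hypothetical snapshot of a finished session.
struct CachedRanking: Codable {
    var utilitiesData: Data        // fitted utilities, encoded separately
    var orderedItemIds: [UUID]     // final ranking, best first
    var pTop1: Double              // share of resamples where #1 stayed #1
    var pTop1BeatsTop2: Double     // average modeled P(#1 beats #2)
    var bootstrapSamplesUsed: Int
}

/// Encode utilities keyed by UUID string so the payload is a plain JSON object.
func encodeUtilities(_ utilities: [UUID: Double]) throws -> Data {
    var keyed: [String: Double] = [:]
    for (id, value) in utilities { keyed[id.uuidString] = value }
    return try JSONEncoder().encode(keyed)
}

/// Decode utilities, dropping any key that is not a valid UUID string.
func decodeUtilities(_ data: Data) throws -> [UUID: Double] {
    let keyed = try JSONDecoder().decode([String: Double].self, from: data)
    var result: [UUID: Double] = [:]
    for (key, value) in keyed {
        if let id = UUID(uuidString: key) { result[id] = value }
    }
    return result
}
```

Because the keys are bare UUIDs, the cached scores can still be read back after an item has been renamed or removed.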

Keeping the math honest

Statistical code fails quietly, which scared me, so the model ships with a small DEBUG validation suite that runs known cases and asserts the answers: five straight picks of A must give P(A beats B) above 0.8 (the smoothed two-item estimate is 6/7, about 0.86); a perfect 5-5 split must land near 0.5; all-ties must stay near 0.5; and no comparisons at all must return neutral zeros without crashing. None of this ships to users, but it catches the embarrassing class of bug where a sign flip or a normalization mistake produces a plausible-looking ranking that is silently wrong.
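
A stripped-down version of such a suite might look like the sketch below. Everything here is illustrative: the estimator under test is passed in as a closure, the decisive-case bar is a parameter because it depends on the estimator (add-one smoothing on five straight picks gives 6/7, about 0.857), and counting a tie as half a win for each side is my assumption.

```swift
enum Pick { case a, b, tie }

/// Two-item estimate with add-one smoothing, used here as the code under test.
func smoothedPABeatsB(_ picks: [Pick]) -> Double {
    var winsA = 0.0
    var winsB = 0.0
    for pick in picks {
        switch pick {
        case .a:   winsA += 1
        case .b:   winsB += 1
        case .tie: winsA += 0.5; winsB += 0.5
        }
    }
    return (winsA + 1) / (winsA + winsB + 2)
}

/// Runs the known-answer cases and returns the names of any that fail.
func runValidation(decisiveThreshold: Double, _ estimate: ([Pick]) -> Double) -> [String] {
    var failures: [String] = []
    func check(_ name: String, _ passed: Bool) {
        if !passed { failures.append(name) }
    }

    let fiveA = [Pick](repeating: .a, count: 5)
    let split = [Pick](repeating: .a, count: 5) + [Pick](repeating: .b, count: 5)
    let ties = [Pick](repeating: .tie, count: 6)

    check("five straight A", estimate(fiveA) > decisiveThreshold)
    check("5-5 split", abs(estimate(split) - 0.5) < 0.05)
    check("all ties", abs(estimate(ties) - 0.5) < 0.05)
    check("no comparisons", abs(estimate([]) - 0.5) < 1e-12)
    return failures
}

#if DEBUG
assert(runValidation(decisiveThreshold: 0.8, smoothedPABeatsB).isEmpty)
#endif
```

Returning the failed case names, rather than asserting inside the runner, keeps the checks usable from a unit test as well as from a debug-only startup hook.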

What I took away

A win count is lossy: it discards opponent strength, the very thing that makes a ranking meaningful. A jointly fit model recovers that for free.
On sparse, noisy human data, regularization and smoothing matter a lot; they are what separate a measured estimate from an overfit one.
Different rating tools fit different jobs. Elo was great for choosing the next informative pair; Bradley-Terry was the right tool for the final, careful answer.
Reporting uncertainty matters as much as reporting the order, and a time-budgeted bootstrap can do it on-device without ever stalling the UI.
Statistics deserve tests with known answers. Cheap assertions on hand-checked cases are the easiest way to keep silently-wrong math from shipping.
