
Letting people trade speed for confidence with one honest toggle.

A pairwise ranking quiz has a quiet tension baked into it: more comparisons mean tighter confidence, but more comparisons also mean more taps. Some people want a fast verdict and others want a rigorous one, and forcing both into the same budget tends to leave one of them unhappy. I tried to resolve it with a single toggle on the quiz intro screen that makes the trade explicit instead of hiding it. Along the way I learned that my first instinct, simply halving the work, was the wrong call.

How the ranking actually works

Popular Rank never asks you to rank a whole list at once. It turns the list into a sequence of A-vs-B picks and runs an Elo-style rating engine over your answers. Every item starts at a rating of 1500, and each comparison nudges the winner up and the loser down with a K-factor of 32, a common value in chess rating systems. The final order is simply the items sorted by rating.
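Here is a minimal Swift sketch of that update. It assumes the standard logistic expected-score formula on a 400-point scale, because the post states only the 1500 starting rating and K = 32. `EloRatings`, `expectedScore`, and `recordWin` are illustrative names, not the app's actual API.

```swift
import Foundation

/// Hypothetical Elo store: every item starts at 1500, K-factor 32.
struct EloRatings {
    static let initialRating = 1500.0
    static let kFactor = 32.0

    private(set) var ratings: [String: Double]

    init(items: [String]) {
        ratings = Dictionary(uniqueKeysWithValues: items.map { ($0, EloRatings.initialRating) })
    }

    /// Standard logistic expected score of rating `a` against rating `b` (assumed 400-point scale).
    static func expectedScore(_ a: Double, _ b: Double) -> Double {
        1.0 / (1.0 + pow(10.0, (b - a) / 400.0))
    }

    /// The winner gains K * (1 - expected); the loser loses exactly the same amount.
    mutating func recordWin(winner: String, loser: String) {
        guard let w = ratings[winner], let l = ratings[loser] else { return }
        let delta = EloRatings.kFactor * (1.0 - EloRatings.expectedScore(w, l))
        ratings[winner] = w + delta
        ratings[loser] = l - delta
    }

    /// The ranking is just the items sorted by rating, highest first.
    var order: [String] {
        ratings.sorted { $0.value > $1.value }.map { $0.key }
    }
}

var elo = EloRatings(items: ["Tea", "Coffee", "Cocoa"])
elo.recordWin(winner: "Coffee", loser: "Tea")
print(elo.ratings["Coffee"]!, elo.ratings["Tea"]!)  // 1516.0 1484.0
```

When two items are rated evenly, a single win moves each of them by exactly half of K (16 points). An upset between distant ratings moves them by closer to the full 32.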

The interesting part is which pair the engine asks about next. It uses adaptive uncertainty sampling, scoring each candidate pair so that the most informative comparison wins. The intuition comes from information theory: a 50/50 matchup carries a full bit because the outcome is genuinely uncertain, while a 90/10 matchup carries less than half a bit. So the engine prefers pairs whose ratings are close, softly penalizes pairs it has already shown, and adds a coverage boost so no item gets forgotten. Repeats are not a bug; close items genuinely need more samples to separate.
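To put numbers on that intuition, the sketch below computes the binary entropy of a matchup's outcome. Using the Elo expected score as the win probability is my assumption for illustration, and both function names are hypothetical.

```swift
import Foundation

/// Shannon entropy, in bits, of a two-outcome event with probability p.
func binaryEntropy(_ p: Double) -> Double {
    guard p > 0, p < 1 else { return 0 }  // a certain outcome carries no information
    return -(p * log2(p) + (1 - p) * log2(1 - p))
}

/// Win probability from the usual logistic Elo curve (assumed 400-point scale).
func winProbability(ratingA: Double, ratingB: Double) -> Double {
    1.0 / (1.0 + pow(10.0, (ratingB - ratingA) / 400.0))
}

let evenMatch = binaryEntropy(0.5)  // exactly 1 bit
let lopsided = binaryEntropy(0.9)   // about 0.47 bits
let closePair = binaryEntropy(winProbability(ratingA: 1500, ratingB: 1520))
let farPair = binaryEntropy(winProbability(ratingA: 1500, ratingB: 1900))
```

The information drops smoothly as the gap widens. A 400-point gap, roughly a 91/9 split on this curve, still carries about 0.44 bits. That behavior fits a score that decreases continuously with the rating gap.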

// Per-pair scoring: closer ratings are more informative,
// repeats are softly penalized, neglected items get a boost.
let informativeness = max(0, maxInformativenessScore - ratingDiff)
let repetitionPenalty = Double(timesCompared) * repetitionPenaltyPerComparison
let score = informativeness - repetitionPenalty + coverageBoost
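One way that per-pair score could drive the choice of the next comparison is to score every unordered pair and take the best one. The sketch below does that. Only the three-term score comes from the excerpt above. The constant values, the coverage rule (a bonus for items seen less often than average), and all type and function names are assumptions for illustration.

```swift
/// Hypothetical tuning constants; the real values are not given in the post.
let maxInformativenessScore = 400.0
let repetitionPenaltyPerComparison = 50.0
let coverageBoostPerMissingComparison = 25.0

/// Order-independent key for an unordered pair of item indices.
struct PairKey: Hashable {
    let a: Int
    let b: Int
    init(_ x: Int, _ y: Int) { a = min(x, y); b = max(x, y) }
}

/// Returns the highest-scoring unordered pair, or nil if there is nothing to compare.
func nextPair(ratings: [Double], pairCounts: [PairKey: Int], itemCounts: [Int]) -> (Int, Int)? {
    guard ratings.count >= 2, itemCounts.count == ratings.count else { return nil }
    let averageSeen = Double(itemCounts.reduce(0, +)) / Double(itemCounts.count)
    var best: (pair: (Int, Int), score: Double)? = nil

    for i in 0..<ratings.count {
        for j in (i + 1)..<ratings.count {
            let ratingDiff = abs(ratings[i] - ratings[j])
            let timesCompared = pairCounts[PairKey(i, j)] ?? 0
            // Assumed coverage rule: boost pairs whose less-seen item lags the average.
            let deficit = max(0, averageSeen - Double(min(itemCounts[i], itemCounts[j])))
            let coverageBoost = deficit * coverageBoostPerMissingComparison

            let informativeness = max(0, maxInformativenessScore - ratingDiff)
            let repetitionPenalty = Double(timesCompared) * repetitionPenaltyPerComparison
            let score = informativeness - repetitionPenalty + coverageBoost

            if score > (best?.score ?? -Double.infinity) {
                best = (pair: (i, j), score: score)
            }
        }
    }
    return best?.pair
}
```

The repetition penalty grows with each repeat. A close pair is therefore asked again only while its closeness still outweighs the penalty it has accumulated.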

Confidence is reported separately, by bootstrap-resampling the comparison history. I rerun the ranking many times over resampled answers and report two numbers: how often the same item lands at #1 (pTop1), and the average chance that the leader beats the runner-up (pTop1BeatsTop2). Both numbers firm up as you do more comparisons, and they are exactly what the speed toggle trades against.
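Here is a sketch of that bootstrap, under several assumptions. Comparisons are stored as (winner, loser) index pairs. Each resample draws the same number of comparisons with replacement and replays them through a fresh Elo pass. "The same item" means the leader on the full history. pTop1BeatsTop2 is taken as the average Elo expected score of each resample's #1 over its #2. The seeded SplitMix64 generator is there only so runs are reproducible, and all names are hypothetical.

```swift
import Foundation

/// Small deterministic generator (SplitMix64) so bootstrap runs are reproducible.
struct SplitMix64: RandomNumberGenerator {
    var state: UInt64
    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

/// Replays comparisons through a fresh Elo pass (start 1500, K = 32).
func eloRatings(itemCount: Int, history: [(winner: Int, loser: Int)]) -> [Double] {
    var r = [Double](repeating: 1500, count: itemCount)
    for c in history {
        let expected = 1 / (1 + pow(10, (r[c.loser] - r[c.winner]) / 400))
        let delta = 32 * (1 - expected)
        r[c.winner] += delta
        r[c.loser] -= delta
    }
    return r
}

/// Bootstrap estimates of pTop1 and pTop1BeatsTop2 (assumed definitions; see lead-in).
func bootstrapConfidence<G: RandomNumberGenerator>(
    itemCount: Int, history: [(winner: Int, loser: Int)],
    resamples: Int = 500, using rng: inout G
) -> (pTop1: Double, pTop1BeatsTop2: Double) {
    precondition(itemCount >= 2 && !history.isEmpty && resamples > 0)
    let full = eloRatings(itemCount: itemCount, history: history)
    let leader = full.indices.max { full[$0] < full[$1] }!

    var leaderFirst = 0
    var beatSum = 0.0
    for _ in 0..<resamples {
        // Draw history.count comparisons with replacement.
        let sample = (0..<history.count).map { _ in
            history[Int.random(in: 0..<history.count, using: &rng)]
        }
        let r = eloRatings(itemCount: itemCount, history: sample)
        let order = r.indices.sorted { r[$0] > r[$1] }
        if order[0] == leader { leaderFirst += 1 }
        // Expected score of this resample's #1 against its #2.
        beatSum += 1 / (1 + pow(10, (r[order[1]] - r[order[0]]) / 400))
    }
    return (Double(leaderFirst) / Double(resamples), beatSum / Double(resamples))
}

var rng = SplitMix64(state: 42)
let history: [(winner: Int, loser: Int)] = [
    (winner: 0, loser: 1), (winner: 0, loser: 2), (winner: 1, loser: 2),
    (winner: 0, loser: 1), (winner: 2, loser: 1),
]
let confidence = bootstrapConfidence(itemCount: 3, history: history, using: &rng)
print(confidence.pTop1, confidence.pTop1BeatsTop2)
```

A one-sided history yields pTop1 = 1. An evenly split history pulls it well below 1, which is the "looseness" a smaller Quick budget trades for.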

The tension between taps and confidence

The catch is that confidence is bought with comparisons. Run too few and the order is close to a coin flip; run too many and a one-minute quiz turns into a chore. The engine already had two ways to stop: a hard cap on total comparisons, and an early stop when the ranking looks stable. Stability means the #1 item has held the top spot for each of the last 8 comparisons and leads #2 by more than 100 Elo points. The hard cap is the backstop, and it scales with list size:

  • 3 to 5 items: 20 comparisons
  • 6 to 12 items: 40 comparisons
  • 13 to 30 items: 90 comparisons
  • more than 30 items: 120 comparisons
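The two stopping rules can be sketched together. `calculateHardCap` is the name the post uses for the engine's cap function, but the signature here is my guess. Folding lists of fewer than 3 items into the smallest tier is also an assumption, as are the `isStable` and `shouldStop` helpers and their inputs (`leaderHistory` is the index of the #1 item after each comparison).

```swift
/// Hard cap on total comparisons, by list size (tiers from the post).
/// Assumed signature; lists under 3 items fall into the smallest tier.
func calculateHardCap(itemCount: Int) -> Int {
    switch itemCount {
    case ...5: return 20
    case 6...12: return 40
    case 13...30: return 90
    default: return 120
    }
}

/// Early stop: the same item led after each of the last `window` comparisons
/// and leads #2 by more than `margin` Elo points. (Hypothetical helper.)
func isStable(leaderHistory: [Int], top1Rating: Double, top2Rating: Double,
              window: Int = 8, margin: Double = 100) -> Bool {
    guard leaderHistory.count >= window, let current = leaderHistory.last else { return false }
    let heldTopSpot = leaderHistory.suffix(window).allSatisfy { $0 == current }
    return heldTopSpot && top1Rating - top2Rating > margin
}

/// Stop at the hard cap, or earlier once the ranking is stable.
func shouldStop(comparisonsDone: Int, itemCount: Int, leaderHistory: [Int],
                top1Rating: Double, top2Rating: Double) -> Bool {
    comparisonsDone >= calculateHardCap(itemCount: itemCount)
        || isStable(leaderHistory: leaderHistory, top1Rating: top1Rating, top2Rating: top2Rating)
}
```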

There is no single right number here, because there is no single right user. A person settling a friendly argument wants the answer now. A person ranking their all-time favorites wants the model to be sure. Rather than guess at one budget that splits the difference and leaves everyone a little disappointed, I let the user pick the regime that matches their intent.

One toggle, two budgets

The intro screen now offers a segmented Quick versus Thorough picker. Thorough keeps the full hard cap. Quick lowers it, so a ranking settles in fewer taps. My first version simply halved the cap, and that was a mistake I caught the same day: for a 12-item list, halving 40 down to 20 comparisons left the confidence intervals too loose to trust. The fix was to be less aggressive, and to special-case the small lists that are already fast.

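// Illustrative wrapper name; only the body below comes from the engine.
func quickAdjustedCap(baseCap: Int, itemCount: Int, isQuickMode: Bool) -> Int {
    // Thorough mode keeps the full hard cap.
    guard isQuickMode else { return baseCap }

    // Quick mode: full budget for small lists (already fast),
    // 75% for larger lists where the savings actually matter.
    if itemCount <= 5 { return baseCap }
    return (baseCap * 3) / 4  // integer division rounds down, e.g. 90 -> 67
}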

So Quick mode uses 100% of the cap for 3 to 5 items and 75% for everything larger. That works out to 30, 67, and 90 comparisons for the three larger tiers, because the integer division rounds down. The second pass mattered more than the first. "Make it faster" is easy; "make it faster without making the answer worse" is the part that needs a second look at the real numbers.

The design rule I kept coming back to was that neither mode should feel like a downgrade. Quick is not broken or sloppy; it is a smaller sample that still produces a real, confidence-rated order, just with wider intervals. Thorough is not punishingly long; it is the full sample for someone who cares about the margins. Framing them as a deliberate trade rather than a quality slider kept both choices feeling like genuine options.

Showing the trade up front

A toggle that silently changes behavior felt worse than no toggle, because the user cannot reason about it. So each mode shows its own estimated comparison count and time before the quiz begins. The intro view computes both budgets directly from the same calculateHardCap function the engine uses, so the preview can never drift from the engine's actual cap. Because the engine can still stop early once the ranking is stable, the count shown is a ceiling rather than a promise. The ETA leans on a measured baseline of about 3.5 seconds per comparison, which is roughly where real taps land.
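Here is a sketch of the preview arithmetic. It assumes the 3.5-second baseline is a flat cost per comparison. The counts are the caps the post gives for a 12-item list: 40 for Thorough and 75% of 40 = 30 for Quick. `ModeEstimate` and its summary format are hypothetical, and the "up to" wording reflects the early stop.

```swift
/// Measured baseline from the post: about 3.5 seconds per comparison.
let secondsPerComparison = 3.5

/// Hypothetical preview model for one mode on the intro screen.
struct ModeEstimate {
    let label: String
    let comparisons: Int

    /// Worst-case duration if the quiz runs all the way to the cap.
    var seconds: Double { Double(comparisons) * secondsPerComparison }

    var summary: String {
        let total = Int(seconds.rounded())
        return "\(label): up to \(comparisons) comparisons, about \(total / 60) min \(total % 60) s"
    }
}

let thorough = ModeEstimate(label: "Thorough", comparisons: 40)
let quick = ModeEstimate(label: "Quick", comparisons: (40 * 3) / 4)
print(thorough.summary)  // Thorough: up to 40 comparisons, about 2 min 20 s
print(quick.summary)     // Quick: up to 30 comparisons, about 1 min 45 s
```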

The trade is right there on the screen: this many comparisons, roughly this long, in exchange for this level of rigor. Giving people the numbers is what makes it an honest decision instead of a guess.

This is the part I care about most. The toggle is not really a performance feature; it is a transparency feature. The work was less about lowering a number and more about surfacing the consequence of that number, so the choice actually means something.

Fighting the sense of endlessness

Even a well-budgeted pairwise quiz can feel like it has no horizon, because each individual pick looks identical to the last. Without a sense of momentum, it is easy to start wondering whether the thing will ever end. I added milestone banners at 50 percent and 80 percent of progress. Each pairs an SF Symbol (a checkered flag for halfway, a check circle for almost done) with a haptic.

One small detail I got wrong at first: I reached for a light UIImpactFeedbackGenerator, then switched to UINotificationFeedbackGenerator with a .success type. The notification haptic reads as an event, an actual milestone, rather than just another button press, which is exactly the feeling I wanted. The banner auto-dismisses after 1.5 seconds and each milestone fires only once, tracked in a small Set<Int> so a stable ranking that lingers near a threshold cannot retrigger it. A visual banner can be missed mid-tap, but a physical pulse at the halfway mark lands as genuine progress without breaking the rhythm of choosing. It tells the hand, not just the eye, that you are getting somewhere.
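The fire-once logic is small enough to sketch in plain Swift. The 50 and 80 percent thresholds and the Set<Int> come from the post. The `MilestoneTracker` type and the way progress is computed (comparisons done over the cap, in whole percent) are assumptions. In the app, a newly crossed milestone would show the banner and could call `UINotificationFeedbackGenerator().notificationOccurred(.success)`. That UIKit call is left out so the sketch runs anywhere.

```swift
/// Hypothetical fire-once milestone tracker (thresholds in whole percent).
struct MilestoneTracker {
    let thresholds: [Int] = [50, 80]
    private(set) var fired: Set<Int> = []

    /// Returns milestones newly crossed at this progress value; each fires at most once.
    mutating func update(completed: Int, total: Int) -> [Int] {
        guard total > 0 else { return [] }
        let percent = completed * 100 / total
        var newlyCrossed: [Int] = []
        for threshold in thresholds where percent >= threshold && !fired.contains(threshold) {
            fired.insert(threshold)
            newlyCrossed.append(threshold)
        }
        return newlyCrossed
    }
}

var tracker = MilestoneTracker()
print(tracker.update(completed: 15, total: 30))  // [50]
print(tracker.update(completed: 15, total: 30))  // [] (already fired)
print(tracker.update(completed: 25, total: 30))  // [80]
```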

A quieter change: fewer items by default

One supporting tweak did not touch the quiz UI at all. I cut the default length of an AI-generated list from 20 items down to 15. Fewer items means the engine reaches a stable order in fewer pairs. Because 15 and 20 both fall in the 13 to 30 tier, the same Quick or Thorough budget is spread over fewer items and buys proportionally more certainty per item. It is a small lever with a surprisingly large effect, because the number of possible pairs grows quadratically with the item count (n(n - 1)/2, so 190 pairs for 20 items versus 105 for 15). Shrinking the problem is often cheaper than tuning the solver. A related fix loosened generation so that a topic with only a handful of real members, say a category where just 9 things exist, lists all of them instead of failing when it cannot hit an exact count.
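The quadratic claim is easy to check with a couple of helpers. The 90-comparison figure is the post's Thorough cap for the 13 to 30 tier. The helper names are mine, and "per item" assumes the whole cap is used, with every comparison touching two items.

```swift
/// Unordered pairs among n items: n(n - 1) / 2.
func pairCount(_ n: Int) -> Int { n * (n - 1) / 2 }

/// Average number of comparisons each item takes part in if the whole cap is used.
func comparisonsPerItem(cap: Int, itemCount: Int) -> Double {
    Double(2 * cap) / Double(itemCount)
}

// 15 and 20 both fall in the 13-to-30 tier, so the Thorough cap is 90 either way.
for n in [20, 15] {
    print("\(n) items: \(pairCount(n)) possible pairs, \(comparisonsPerItem(cap: 90, itemCount: n)) comparisons per item")
}
// 20 items: 190 possible pairs, 9.0 comparisons per item
// 15 items: 105 possible pairs, 12.0 comparisons per item
```

At an unchanged budget, each item gets about a third more comparisons on average. The pair space, meanwhile, nearly halves.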

Taken together, these changes share one idea. Respect the user's time, but never take away their agency over the trade. Tell them what each path costs, give them the momentum to finish, and keep both the fast answer and the rigorous one feeling like real, finished rankings.

What I learned

  • When a feature has a built-in trade, expose the trade instead of picking for the user. A toggle with visible time and comparison estimates, derived from the same function the engine uses, turns a hidden assumption into an informed and trustworthy choice.
  • "Faster" is the easy half. Halving the comparison cap felt clever until the confidence intervals went slack; 75 percent for larger lists and 100 percent for already-quick small lists was the honest answer.
  • The right haptic carries meaning. Switching from an impact tap to a .success notification haptic made a milestone feel like an event rather than another tap.
  • Shrinking the input can beat tuning the model. Cutting the default list from 20 to 15 items bought more confidence per tap than any change to the sampler, because pair count grows quadratically with item count.

Workstation4

A quiet workshop for cool, strange, useful iOS apps. Run by one developer who chases the weird problems for sport.
