Importing 49,712 public trivia questions into Quizo.

Curated quiz packs are lovely, and they do run out. A streamer who runs trivia three nights a week will chew through a hand authored set in a couple of weeks. So alongside packs like the US Citizenship civics set, I wanted Quizo to pull from a large public bank and serve close to endless rounds. The catch was that the bank does not arrive as tidy JSON. It is a plain text format that I had to parse carefully before any of it could safely become a question on stream. The dataset I landed on, OpenTriviaQA, ships 49,712 questions across 22 category files, which is more than enough to keep a channel busy for a very long time.

The source format

OpenTriviaQA is a Creative Commons (CC BY-SA 4.0) dataset, and it stores questions as blocks of plain text rather than structured data. Each block opens with a #Q marker for the question line, then a ^ line that holds the text of the correct answer, then lettered options. The dataset's own README is honest about the tradeoff: the format is easy for humans to edit and reasonably easy to parse into something better later. A real block looks like this:

#Q The theory of relativity was introduced in physics by this man.
^ Albert Einstein
A Galileo Galilei
B Albert Einstein
C Archimedes
D Isaac Newton

Two things about that shape are worth noticing right away. First, the correct answer is given as text on the ^ line, not as a letter. You have to match it back against the options to learn that the answer here is B. Second, the file is human edited, which means it is exactly the kind of input that will eventually contain a block missing an option, or a stray blank line, or an answer whose text does not quite match any option because someone fixed a typo in one place and not the other. I decided to treat the whole file as untrusted from the first line.

One internal shape for everything

Before parsing, it helped to pin down the target. Every content source in Quizo, the local JSON packs and the imported bank alike, normalizes into the same Question type, so the quiz engine never has to know where a question came from.

interface Question {
  text: string;
  options: { A: string; B: string; C?: string; D?: string };
  correct: 'A' | 'B' | 'C' | 'D';
}

The optional C and D are the interesting part. The bank mixes classic four option questions with two option true/false style questions, and I wanted both to be first class rather than padding true/false out to four. Modeling C and D as optional in the type meant the rest of the app, including the overlay that renders the answer buttons, could branch on their presence honestly instead of trusting a count I made up.
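A minimal sketch of what branching on presence could look like. The helper name presentLetters and the true/false sample question are assumptions made for illustration; neither appears in Quizo's code as shown in this post.

```typescript
interface Question {
  text: string;
  options: { A: string; B: string; C?: string; D?: string };
  correct: 'A' | 'B' | 'C' | 'D';
}

type Letter = Question['correct'];

// Hypothetical helper: return only the letters this question actually has,
// in display order, so a renderer never assumes a fixed count of four.
function presentLetters(q: Question): Letter[] {
  const all: Letter[] = ['A', 'B', 'C', 'D'];
  return all.filter(letter => q.options[letter] !== undefined);
}

// The sample block from earlier in the post, already normalized.
const fourOption: Question = {
  text: 'The theory of relativity was introduced in physics by this man.',
  options: {
    A: 'Galileo Galilei',
    B: 'Albert Einstein',
    C: 'Archimedes',
    D: 'Isaac Newton',
  },
  correct: 'B',
};

// Made-up two option question, for illustration only.
const trueFalse: Question = {
  text: 'Albert Einstein introduced the theory of relativity.',
  options: { A: 'True', B: 'False' },
  correct: 'A',
};

console.log(presentLetters(fourOption)); // four letters
console.log(presentLetters(trueFalse));  // two letters
```

An overlay built this way draws one button per returned letter, so a two option question gets two buttons rather than two real ones and two blanks.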

Parsing defensively

The interesting work was never the happy path. It was every way the format can go wrong. I decided early that one malformed block should never crash an import or, worse, surface a broken question mid round. So the parser splits on the #Q marker, walks each block line by line, and quietly drops anything that does not hold together. The validation comes down to a few rules:

  • Require A and B. A block with fewer than two options is not a question, so it is dropped.
  • Reject ragged blocks. The check boils down to if (hasC !== hasD) continue;. A C with no D, or a D with no C, usually means a truncated or corrupted entry, and I would rather lose one question than show half of one.
  • Confirm the answer exists. Because ^ is text, I loop the parsed options and find the letter whose value equals the correct answer string. If nothing matches, the block is skipped. This is the rule that catches the subtle typo mismatches.

Putting that together, the core of the reader is small and boring on purpose:

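type Letter = 'A' | 'B' | 'C' | 'D';

function parseQuestions(content: string): Question[] {
  const questions: Question[] = [];

  // Each block starts at a '#Q ' marker; empty fragments are discarded.
  for (const block of content.split('#Q ').filter(b => b.trim())) {
    const lines = block.trim().split('\n').map(l => l.trim()).filter(l => l);
    if (lines.length < 2) continue;

    const text = lines[0];
    let correctAnswer = '';
    const options: Partial<Record<Letter, string>> = {};

    for (let i = 1; i < lines.length; i++) {
      const line = lines[i];
      // '^' carries the answer text; 'A '..'D ' lines carry the options.
      if (line.startsWith('^')) correctAnswer = line.slice(1).trim();
      else if (/^[A-D]\s/.test(line)) options[line[0] as Letter] = line.slice(1).trim();
    }

    const { A, B, C, D } = options;
    if (!A || !B) continue;                        // need A and B
    if (!!C !== !!D) continue;                     // no ragged blocks

    // Map the answer text back to the letter that holds it.
    const correct = (Object.keys(options) as Letter[])
      .find(letter => options[letter] === correctAnswer);
    if (!correct) continue;                        // answer must exist

    questions.push({ text, options: { ...options, A, B }, correct });
  }

  return questions;
}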

The /^[A-D]\s/ guard matters more than it looks. Option text frequently contains its own newlines and punctuation, and a naive "first character is a capital letter" rule would happily mistake a wrapped answer line for a new option. Requiring a single A through D followed by whitespace keeps the parser from inventing options out of run on text. The whole philosophy is: validate at the boundary, normalize into one clean shape, and never let a questionable block travel further into the system. By the time a question reaches the state machine, it has already been checked.
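To make the difference concrete, here is a small comparison of the guard against a naive rule. The function names and sample lines are assumptions for illustration, and the naive pattern is a stand-in for the "first character is a capital letter" idea, not code from Quizo.

```typescript
// The guard quoted in the post: one letter A through D, then whitespace.
function isOptionLine(line: string): boolean {
  return /^[A-D]\s/.test(line);
}

// Hypothetical naive rule, for comparison: any leading capital letter.
function naiveIsOptionLine(line: string): boolean {
  return /^[A-Z]/.test(line);
}

// Sample lines as they would look after trimming.
const sampleLines = [
  'B Albert Einstein',    // a real option line
  'Born in Ulm in 1879',  // wrapped text that merely starts with a capital B
  'D',                    // a bare letter with no option text
  'A Swiss patent clerk', // wrapped text that starts with the word "A"
];

for (const line of sampleLines) {
  console.log(line, isOptionLine(line), naiveIsOptionLine(line));
}
```

The guard rejects the second and third lines, which the naive rule accepts. The last line shows the limit of the approach: wrapped text that happens to begin with the word "A" still looks like an option, so the guard narrows the problem without eliminating it.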

Unlimited mode versus caching

Once parsing felt reliable, the next question was how to serve it. I exposed three modes: local JSON packs, individual OpenTriviaQA categories (loaded as packs with an otqa: id prefix), and an unlimited mode that pools every parsed question. Caching is where these intentionally diverge.
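One way to picture the three modes is as a small function that classifies a pack id. This is a sketch under assumptions: only the otqa: prefix and the 'unlimited' id come from the post, while the PackSource type, the resolvePackSource name, and the example ids are hypothetical.

```typescript
// Hypothetical discriminated union for the three content modes.
type PackSource =
  | { kind: 'unlimited' }
  | { kind: 'otqa'; category: string }
  | { kind: 'local'; name: string };

const OTQA_PREFIX = 'otqa:';

// Classify a pack id. Anything that is neither 'unlimited' nor
// prefixed with 'otqa:' is treated as a local JSON pack.
function resolvePackSource(id: string): PackSource {
  if (id === 'unlimited') return { kind: 'unlimited' };
  if (id.startsWith(OTQA_PREFIX)) {
    return { kind: 'otqa', category: id.slice(OTQA_PREFIX.length) };
  }
  return { kind: 'local', name: id };
}

// Example ids; the category and pack names here are made up.
console.log(resolvePackSource('otqa:science'));
console.log(resolvePackSource('civics'));
console.log(resolvePackSource('unlimited'));
```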

Category packs and local packs are stable and finite, so I cache them in a Map after the first parse. There is no reason to re-read and re-validate the civics set, or to re-walk a 39,000 line category file, on every session when nothing has changed. That keeps repeated session startup fast.

Unlimited mode is different on purpose. It loads the full pool once but reshuffles it fresh for every session rather than caching a fixed order. The shuffle is a plain Fisher-Yates pass, and the load function carries an explicit guard:

// Don't cache unlimited mode - we want a fresh shuffle each time
if (id !== 'unlimited' && quizPackCache.has(id)) {
  return quizPackCache.get(id)!;
}

Cache the things that should feel the same. Reshuffle the things whose entire value is feeling different.
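For readers who have not written one, a Fisher-Yates pass looks roughly like this. It is a generic sketch, not Quizo's actual shuffle: the function name, the copy-instead-of-mutate choice, and the injectable random parameter are all assumptions, the last one added only so the behavior can be checked deterministically.

```typescript
// Fisher-Yates shuffle. Returns a shuffled copy and leaves the input alone,
// so one loaded pool can be reshuffled for each new session.
// `random` must return a number in [0, 1); it defaults to Math.random.
function shuffle<T>(items: readonly T[], random: () => number = Math.random): T[] {
  const result = items.slice();
  // Walk from the end, swapping each slot with a random slot at or before it.
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = result[i] as T;
    result[i] = result[j] as T;
    result[j] = tmp;
  }
  return result;
}

// Placeholder ids stand in for parsed questions.
const pool = ['q1', 'q2', 'q3', 'q4', 'q5'];
const sessionOne = shuffle(pool);
const sessionTwo = shuffle(pool);
console.log(sessionOne, sessionTwo);
```

Each call produces an independent ordering of the same questions, which is the property unlimited mode wants, while the pool itself keeps its original order.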

There is a second, subtler reshuffle inside the engine. Finite packs end at a finished screen when the index runs past the last question. Unlimited mode instead reshuffles the deck and cycles back, setting currentQuestionIndex = -1 so the next advance lands cleanly on zero again. The session never "ends," and reshuffling at the wrap point means a streamer does not replay the same opening run of questions every cycle. The overlay even reports totalQuestions: -1 in unlimited mode so the UI can hide the "of 50" counter and just show a running question number. A small sentinel value, but it let me reuse the exact same state machine for finite and infinite play.
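The wrap and the sentinel can be sketched as a single advance step. Only currentQuestionIndex and the totalQuestions: -1 convention come from the post; the SessionState shape, the asked counter, and the function names are assumptions, and the sketch expects a deck with at least one question.

```typescript
interface SessionState {
  deck: string[];               // question ids stand in for full questions
  currentQuestionIndex: number; // -1 means "before the first question"
  asked: number;                // running count of questions shown so far
  unlimited: boolean;
  finished: boolean;
}

// Advance one question. Finite packs finish after the last question;
// unlimited mode reshuffles, resets the index to -1, and steps to zero.
function advance(
  state: SessionState,
  reshuffle: (deck: string[]) => string[],
): SessionState {
  let { deck, currentQuestionIndex } = state;
  if (currentQuestionIndex >= deck.length - 1) {
    if (!state.unlimited) return { ...state, finished: true };
    deck = reshuffle(deck);
    currentQuestionIndex = -1;
  }
  return {
    ...state,
    deck,
    currentQuestionIndex: currentQuestionIndex + 1,
    asked: state.asked + 1,
  };
}

// -1 is the sentinel for "no fixed total".
function totalQuestions(state: SessionState): number {
  return state.unlimited ? -1 : state.deck.length;
}

// Hide the "of N" part when the total is the sentinel.
function counterLabel(state: SessionState): string {
  const total = totalQuestions(state);
  return total === -1
    ? `Question ${state.asked}`
    : `Question ${state.asked} of ${total}`;
}
```

Both modes run through the same advance function. The only differences are what happens at the end of the deck and whether the label shows a total.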

Why the distractors matter

A trivia round is only as good as its wrong answers. A question with three obviously absurd options is not a test, it is a formality. This is where the two content sources play different roles. The public bank brings the raw volume that makes unlimited mode possible; the hand authored packs, like the civics set with well over a hundred questions, are where I try to hit a higher standard, writing plausible distractors so a player has to actually know the answer rather than win by elimination. Both feed the same engine through the same Question shape, and the parser is the seam that lets them coexist without the rest of the app caring which is which.

Takeaways

  • Treat external content as untrusted input. Validate at the boundary, normalize into one internal shape, and drop the bad rows quietly.
  • Watch the loose edges of a human friendly format. A text marked answer, wrapped option lines, and ragged blocks were each a small gotcha worth an explicit rule.
  • Cache stable, finite content; reshuffle content whose whole point is variety. The same load path can serve both with one guard.
  • A sentinel like totalQuestions: -1 let one state machine handle both finite packs and an endless mode without forking the logic.
  • Volume and quality are different problems. A 49,712 question bank solves the first; hand authored packs set the standard for the second.