Pruning the op log without breaking devices that fell behind

Pinball Points syncs game records through an append-only operation log: every upsert and delete is a row in an ops table, and clients replay rows past a cursor to catch up. It is a clean model right up until you remember that an append-only log grows forever. Pruning it is the easy part. Pruning it without quietly corrupting a phone that has been sitting in a drawer for two months turned out to be the part I had to think hardest about.

Why I sync with an op log at all

The sync server is a small Fastify + TypeScript service in front of PocketBase. Whole-record, last-write-wins sync would have been simpler to write, but I found it painful to reason about. Two devices edit different fields of the same game and one edit vanishes. A delete races an edit and the record comes back from the dead. I wanted edits and deletes to converge predictably across a phone, an iPad, and a server-rendered public profile, so each change is an immutable operation: a stable opId (a client-generated UUID), an entityType of score or profile, an opType of upsert or delete, and a payload. The server assigns a monotonic per-user serverSeq, writes the op row first, then applies it to the materialized scores and profiles collections.
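As a rough sketch of the operation shape described above: the field names follow the post, but the exact types, the ClientOp/ServerOp split, and the in-memory assignSeq helper are assumptions made for illustration. Deduplicating a replayed opId is also an assumption about why the id needs to be stable, not something the post confirms.

```typescript
// Assumed shapes: field names follow the post, exact types are guesses.
type EntityType = "score" | "profile";
type OpType = "upsert" | "delete";

interface ClientOp {
  opId: string; // client-generated UUID, stable across retries
  entityType: EntityType;
  entityId: string;
  opType: OpType;
  payload: Record<string, unknown> | null; // null for deletes
}

interface ServerOp extends ClientOp {
  serverSeq: number; // monotonic per user, assigned by the server
  serverTime: string;
}

// Hypothetical in-memory stand-in for the server's sequencing step,
// operating on a single user's log.
function assignSeq(log: ServerOp[], op: ClientOp, now: string): ServerOp {
  const existing = log.find((o) => o.opId === op.opId);
  if (existing) return existing; // replayed push: same opId keeps its seq
  const maxSeq = log.reduce((m, o) => Math.max(m, o.serverSeq), 0);
  const stamped: ServerOp = { ...op, serverSeq: maxSeq + 1, serverTime: now };
  log.push(stamped); // write the op row first; materialization comes after
  return stamped;
}
```

Because the sequence is derived from the user's own log, two users never contend for the same counter.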

Clients track their position with a sinceToken: a base64-encoded { seq } blob holding the highest serverSeq they have applied. On each sync they send that token plus any new local ops, and the server returns everything after it.

// the client cursor is just an opaque-looking (but unsigned) base64 wrapper around a number
export function decodeSinceToken(token: string | null): number {
  if (!token) return 0;
  try {
    const json = JSON.parse(Buffer.from(token, "base64").toString("utf-8"));
    const seq = Number(json.seq);
    return Number.isFinite(seq) && seq >= 0 ? seq : 0;
  } catch {
    return 0; // garbage token is treated as a fresh client
  }
}
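The encoding side is not shown in the post. A minimal counterpart might look like the following, assuming the token is nothing more than base64-encoded JSON of the form { seq }; the name encodeSinceToken is hypothetical.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical counterpart to decodeSinceToken: wrap a sequence number
// as base64-encoded JSON of the form { seq }.
function encodeSinceToken(seq: number): string {
  // Mirror the decoder's defensiveness: anything invalid becomes 0.
  const safe = Number.isFinite(seq) && seq >= 0 ? Math.floor(seq) : 0;
  return Buffer.from(JSON.stringify({ seq: safe }), "utf-8").toString("base64");
}
```

Clamping on both sides means a bad value can only ever degrade to "fresh client", never to a cursor that points somewhere unexpected.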

That cursor is the whole contract. As long as every serverSeq the cursor could point at still exists on the server, an incremental pull is correct. The trouble starts the moment I delete history.

The cliff a stale cursor falls off

Because the log only ever grows, a background job runs every six hours and prunes ops older than a seven-day retention window. That keeps storage and replay times bounded for active users. But consider a device whose cursor sits at serverSeq 1,040 when the lowest surviving op is now 5,200. The query serverSeq > 1040 is a perfectly valid range, so the server returns everything from 5,200 onward and the client applies it without complaint. Nothing throws. The device just never learns about ops 1,041 through 5,199, which may include the delete that retired a record or the upsert that corrected a score.

This is the kind of bug I find unsettling: no crash, no error, no log line. The two stores simply disagree forever, and the only symptom is a player who is sure a game is missing while the server is sure it was deleted. Any sync engine that prunes history has to answer one question honestly: how do I know a client's cursor is still inside the surviving window?
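The failure is easy to reproduce in a toy model. In the sketch below the log is just an array of sequence numbers; the total of 6,000 ops is an invented figure, while the cursor (1,040) and the lowest survivor (5,200) come from the example above.

```typescript
// Toy model: the op log is an array of sequence numbers.
function pullAfter(log: number[], cursor: number): number[] {
  return log.filter((seq) => seq > cursor);
}

// Assume ops 1..6000 were written and pruning removed everything below 5200.
const fullLog = Array.from({ length: 6000 }, (_, i) => i + 1);
const prunedLog = fullLog.filter((seq) => seq >= 5200);

const staleCursor = 1040;
const received = pullAfter(prunedLog, staleCursor); // what the device gets
const expected = pullAfter(fullLog, staleCursor); // what it should have got

// The pull succeeds and simply starts at 5200. Nothing in the response
// tells the client that ops 1041..5199 are missing.
const missed = expected.length - received.length;
```

Here missed comes out to 4,159, and the only way to know that is to compare against a log that no longer exists.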

A per-user compaction watermark

The answer I landed on is a single number per user, stored in a tiny compaction_meta collection with a unique index on user: a watermark recorded whenever I prune. It is the boundary below which history no longer exists. Setting it correctly has one edge case worth getting right.

// after deleting old ops, find the lowest seq that survived
const minRemaining = await pb.collection("ops").getList(1, 1, {
  // bind userId as a parameter instead of interpolating it into the filter
  filter: pb.filter("user = {:userId}", { userId }),
  sort: "+serverSeq",
  fields: "serverSeq",
});

let watermark: number;
if (minRemaining.items.length > 0) {
  // ops remain: watermark is one below the lowest survivor
  watermark = (minRemaining.items[0].serverSeq as number) - 1;
} else {
  // EVERY op was pruned (a long-dormant user): fall back to the
  // max seq captured BEFORE the delete, so the watermark still
  // sits above any cursor a stale client could be holding
  watermark = preCompactMaxSeq > 0 ? preCompactMaxSeq : deleted;
}

The all-pruned branch is the one that bit me first. If every op falls outside the retention window, there is no surviving row to read a sequence from, so I capture the max serverSeq before deleting anything and use that. Without it, a fully compacted user would get a watermark of zero, which means "history is intact" and would route a genuinely stranded device straight back onto the unsafe incremental path.
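The ordering is the important part, and it can be shown without a database. This is an in-memory sketch only: the OpRow shape, the createdMs field, and the compact function are hypothetical, and the post's extra fallback to the deleted count is left out.

```typescript
interface OpRow {
  serverSeq: number;
  createdMs: number; // assumed creation timestamp in milliseconds
}

// In-memory sketch of one user's compaction pass (names hypothetical).
function compact(ops: OpRow[], cutoffMs: number): { kept: OpRow[]; watermark: number } {
  // 1. Capture the max seq BEFORE deleting anything.
  const preCompactMaxSeq = ops.reduce((m, o) => Math.max(m, o.serverSeq), 0);
  // 2. Prune rows older than the retention cutoff.
  const kept = ops.filter((o) => o.createdMs >= cutoffMs);
  // 3. Derive the watermark from the survivors, or from the captured max
  //    when nothing survived.
  const watermark =
    kept.length > 0
      ? kept.reduce((m, o) => Math.min(m, o.serverSeq), Infinity) - 1
      : preCompactMaxSeq;
  return { kept, watermark };
}
```

If step 1 ran after step 2, the all-pruned case would compute a max over an empty set and the watermark would collapse to zero.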

Checking the cursor on every sync

With the watermark recorded, the sync entry point gets one cheap comparison. If the client's decoded cursor is below the watermark, the history it needs is gone and incremental is unsafe.

const sinceSeq = Math.max(0, Math.floor(decodeSinceToken(sinceToken)));
const watermark = await getCompactionWatermark(pb, userId);

// a watermark of 0 means "never compacted": do not force a needless snapshot
const needsSnapshot = watermark > 0 && sinceSeq < watermark;

That watermark > 0 guard matters: a brand-new user has no compaction row, so I read a watermark of zero and must treat it as "history is complete" rather than "history is missing." Worth noting too: I process the client's pushed ops in order regardless of snapshot mode. A stranded device may still have new local games to upload, and those should land and get acked even on the trip where I am about to hand it a full rebuild.
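Pulling the comparison into a pure function makes the boundary cases easy to enumerate. The function name is hypothetical, and the sequence numbers reuse the earlier example where the lowest surviving op is 5,200 (so the watermark is 5,199).

```typescript
// Hypothetical pure form of the check above.
function needsSnapshot(sinceSeq: number, watermark: number): boolean {
  return watermark > 0 && sinceSeq < watermark;
}

const cases = {
  // New user, never compacted: incremental sync is safe.
  newUser: needsSnapshot(0, 0),
  // Stale device: ops 1041..5199 are gone, so it needs a rebuild.
  staleDevice: needsSnapshot(1040, 5199),
  // Cursor exactly at the watermark: every op after it still exists.
  atBoundary: needsSnapshot(5199, 5199),
  // Fresh install on a compacted account: cursor 0 is below the watermark.
  freshInstall: needsSnapshot(0, 5199),
};
```

The freshInstall case is worth a second look: a brand-new device on an old account takes the snapshot path too, since the early history it would need to replay has already been pruned.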

Rebuilding a snapshot from live state

When a client falls off the cliff, I do not try to reconstruct the missing ops, because I deleted them on purpose. Instead I rebuild a full snapshot by walking the materialized scores and profiles collections (the current truth) and emitting synthetic operations. Crucially, I fetch all scores including soft-deleted ones, because each row carries a deletedAt tombstone rather than being hard-deleted.

for (const score of scores) {
  if (score.deletedAt) {
    // soft-deleted -> synthetic DELETE op, so the snapshot can express absence
    ops.push({ opId: `snapshot-score-del-${score.scoreId}`,
               entityType: "score", entityId: score.scoreId,
               opType: "delete", payload: null, serverSeq: 0, serverTime: now });
  } else {
    // live -> synthetic UPSERT op carrying only allowlisted fields
    ops.push({ opId: `snapshot-score-${score.scoreId}`,
               entityType: "score", entityId: score.scoreId,
               opType: "upsert", payload: pickAllowed(score),
               serverSeq: 0, serverTime: now });
  }
}

The tombstones are the part a naive snapshot gets wrong. A snapshot of only live records tells the client what exists but never tells it what to remove, so a game the player deleted on another device would quietly survive on the stale one. Emitting an explicit delete op for every soft-deleted row lets the snapshot express absence, not just presence. A few details that make this hang together:

The response carries a snapshot: true flag, so the client knows to replace local state wholesale rather than merge it into what it already has.
The synthetic ops all carry serverSeq: 0 precisely so they are never mistaken for cursor positions. The newToken is always re-derived from the real current max serverSeq, so the device rejoins the incremental path cleanly on its next sync.
Snapshots reuse the exact same ServerOp shape the client already applies for deltas, so there is no separate apply path. A snapshot is just a delta that happens to describe everything.
The upsert payloads run through the same field allowlist (machineName, allScores, playedAt, and so on) used for normal ops, so a rebuild can never leak an internal column the client should not see.
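The pickAllowed helper used in the snapshot loop is not shown in the post. A minimal version might look like this, assuming a flat allowlist; only the three field names the post mentions are listed here, and the real list is presumably longer.

```typescript
// Assumed allowlist: only the fields named in the post. The real list
// is described as longer.
const ALLOWED_SCORE_FIELDS = ["machineName", "allScores", "playedAt"] as const;

// Hypothetical sketch of pickAllowed: copy allowlisted fields, drop the rest.
function pickAllowed(row: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const field of ALLOWED_SCORE_FIELDS) {
    if (field in row) out[field] = row[field];
  }
  return out;
}
```

An allowlist fails closed: a column added to the collection later stays invisible to clients until someone names it explicitly.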

One more guard: serialized per user

Both the sync handler and the compaction job acquire a per-user lock, a promise chain keyed by userId with a 30-second timeout. Compaction deletes rows and rewrites the watermark; a sync read interleaved with that could see a half-pruned log. Serializing per user (not globally) means one dormant account being rebuilt never blocks an active player on a different account.
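A promise-chain lock of this kind can be sketched in a few lines. The names below are hypothetical, and the sketch assumes the 30-second timeout bounds how long a task may hold the lock; the post does not say whether it applies to waiting, holding, or both.

```typescript
// One chain tail per user; tails never reject, so a failed task
// cannot poison the queue behind it.
const tails = new Map<string, Promise<void>>();

function withUserLock<T>(
  userId: string,
  task: () => Promise<T>,
  timeoutMs = 30_000,
): Promise<T> {
  const prev = tails.get(userId) ?? Promise.resolve();
  // Start the task only after the previous holder for this user settles.
  const run = prev.then(
    () =>
      new Promise<T>((resolve, reject) => {
        const timer = setTimeout(
          () => reject(new Error(`lock timeout for user ${userId}`)),
          timeoutMs,
        );
        Promise.resolve()
          .then(task)
          .then(resolve, reject)
          .finally(() => clearTimeout(timer));
      }),
  );
  const tail = run.then(
    () => undefined,
    () => undefined,
  );
  tails.set(userId, tail);
  // Drop the map entry once this user's queue drains.
  void tail.then(() => {
    if (tails.get(userId) === tail) tails.delete(userId);
  });
  return run;
}
```

One caveat with this shape: a promise cannot be cancelled, so a task that times out releases the lock while its work may still be running in the background. Whether that is acceptable depends on how idempotent the sync and compaction steps are.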

If deleting history can leave a reader pointing at nothing, record a watermark when you delete, and check it before you trust the reader's position.

What this taught me

The general lesson, for me, is that retention and recovery are two halves of one feature. I do not think you can ship the pruning job without also shipping the answer to what happens to a client the pruning job just stranded, and the honest answer is usually a full resync rather than a clever partial repair. A few things I took away:

An append-only log wants a compaction watermark per stream, not a single global one, so one slow client does not gate pruning for everyone.
Detect the stale-cursor case explicitly. An out-of-range pull looks exactly like an ordinary delta, just one that starts late, which is how the silent-disagreement bug sneaks in.
Handle the all-pruned edge case: capture the max sequence before deleting, or a fully compacted user gets a watermark of zero and the safety check switches itself off.
A snapshot must encode deletions through tombstones, or it will resurrect records the user already removed.
Reuse the op shape for snapshots so the client has one apply path, and always reset the cursor to the real current max so the device rejoins incremental sync.
