Two analytics writes became one atomic write, and the orphans went away

A finished placement test produces two things at once: one session row that summarizes the run, and a handful of per-question answer rows that belong to it. For a while I wrote those as two separate network calls, and that small seam turned out to be a quiet, steady source of corrupt data. By the time I looked closely, roughly half of my iOS sessions were arriving with a score but no answers behind it. Here is how I closed the gap with a single atomic write, and the Postgres detail that caught me off guard on the way.

The seam between two writes

When someone completes a test, the client builds a test_sessions row describing the run and a batch of test_answers rows that reference it by session_id. My original design did this as two POSTs against the Supabase REST endpoint: insert the session, then insert the answers pointing at the same UUID. On a clean network this is fine. The trouble is everything that is not a clean network, and on iOS specifically, the operating system itself.

The logging happens after a test finishes, which is exactly when the user is most likely to background the app. iOS aggressively terminates background Task work, and it was cancelling my submission partway through. The session POST had already succeeded; the answers POST never landed. The result was a row with completed = true and total_correct > 0 but no per-answer data attached to it. An orphan. Something I could count but never actually analyze.

The unsettling part is that nothing in the app was wrong. Each write succeeded entirely on its own terms. The corruption lived purely in the gap between them, which is the kind of bug that never reproduces at your desk and only shows up in the aggregate weeks later. When I finally queried for it, the scale was the surprise:

About half of completed iOS sessions had a score but zero answer rows. The data looked real, which is worse than missing, because it quietly inflated every count while degrading the thing I actually wanted to learn from.

One transaction, all or nothing

I moved the primary write path to a single Supabase RPC, log_test_session, that takes the session and all of its answers and inserts them inside one Postgres transaction. The client hands over the complete result in one call, and the server either commits the whole thing or commits none of it. There is no longer a moment where a session can exist without its answers, because the database refuses to leave itself in that state.

The client side collapses to a single body with two keys: a session object and an array of answer objects.

// Before: two dependent round trips, a corruptible gap between them
let session = try await insertSession(...)        // POST 1 succeeds
try await insertAnswers(for: session.id, ...)     // POST 2 cancelled in background

// After: one call, the server wraps both inserts in a transaction
let ok = await postRPC(
    name: "log_test_session",
    body: [
        "session_data": sessionRow,   // [String: Any]
        "answers_data": answerRows    // [[String: Any]]
    ]
)
// fully recorded, or not recorded at all

The function itself is small plpgsql with SECURITY INVOKER and a pinned search_path, granted to the anon role so the publishable client key can call it. The interesting work is not the two INSERT statements. It is how it ingests loosely typed JSON from three different clients without forcing each of them to send a perfectly shaped row.

The JSONB gotcha that bit me

I wanted the clients to send only the fields they know about, and let Postgres fill in the rest from column defaults: id from gen_random_uuid(), created_at from now(). So I reached for jsonb_populate_record, which maps a JSON object onto a table's row type. It looked perfect. It was not, and the reason is subtle.

jsonb_populate_record sets any column whose key is missing from the JSON to NULL. It does not fall back to the column's DEFAULT. So a NOT NULL column like created_at, which I was relying on the database to fill, would instead get an explicit NULL and blow up the constraint. The fix is to detect the missing key and inject the value before populating the record:

-- Missing keys become NULL, NOT the column DEFAULT. So inject explicitly.
IF NOT (session_data ? 'created_at') THEN
    session_data := session_data || jsonb_build_object('created_at', now());
END IF;

INSERT INTO public.test_sessions
SELECT * FROM jsonb_populate_record(NULL::public.test_sessions, session_data);

The answers get the same treatment, one level deeper. Each element of the array is missing both id and created_at, so I walk the array with jsonb_array_elements, concatenate in whichever defaults are absent, and then insert the whole set with jsonb_populate_recordset. The ? existence operator and the || merge operator do almost all of the work; the lesson is just that "missing key" and "use the default" are not the same thing in this corner of Postgres.

Keeping the old path on purpose

I did not delete the two-POST path. The RPC has to be deployed to the Supabase project by hand through the SQL editor, and a new app build will reach servers before, during, and after that deploy. If I removed the fallback, an environment without the function would start failing every submission with a 404. So the client treats any non-2xx from the RPC, including "function not deployed," as a signal to drop back to the legacy direct inserts.

private func logAtomically(sessionRow: [String: Any],
                           answerRows: [[String: Any]]) async {
    if await postRPC(name: "log_test_session",
                     body: ["session_data": sessionRow,
                            "answers_data": answerRows]) {
        return
    }
    // Fallback: legacy two-POST path. Orphan-prone, but no worse than before.
    if await post(table: "test_sessions", rows: [sessionRow], attempt: 0),
       !answerRows.isEmpty {
        await post(table: "test_answers", rows: answerRows, attempt: 0)
    }
}

The retry policy matters here too. An HTTP error response is deterministic: a 404 will still be a 404 on the next try, so the RPC call does not retry those, it just returns false and lets the caller fall back. A genuine network failure is worth a couple of retries with a short backoff. Distinguishing "the server said no" from "I could not reach the server" kept me from hammering a missing function or, worse, retrying my way into duplicate rows.

This is a pattern I keep reaching for whenever a client and a server have to change together: ship the client so it prefers the new contract but degrades to the old one, then make the new contract the documented requirement. The fallback is insurance for the rollout window, not an excuse to skip the deploy.

Making the transport refuse to hang

Atomicity fixes what happens when a write lands. It does nothing for a write that never resolves. Earlier runs had analytics requests hang for three or more minutes against the Supabase host, most likely QUIC or HTTP/3 negotiation stalling on a connection that URLSession.shared had pooled and reused after it went stale. A logging Task that outlives the useful life of the app is just another way to never record the answers.

So the submit runs on a dedicated ephemeral session with bounded timeouts, and connectivity-waiting turned off so a stuck request fails fast instead of parking:

let config = URLSessionConfiguration.ephemeral
config.timeoutIntervalForRequest = 15
config.timeoutIntervalForResource = 30
config.waitsForConnectivity = false

The identity attached to each result is a UUID kept in the Keychain under a generic-password item with kSecAttrAccessibleAfterFirstUnlock. I like that this does two things at once. It survives reinstalls, so analytics stay continuous across a delete and re-download, and it carries nothing personal: no advertising identifier, no account, no device fingerprint. It is a stable random handle and nothing more.

Ephemeral URLSession, so a dead pooled connection is never silently reused.
Fifteen second request timeout, so a stuck submit fails fast instead of hanging for minutes.
Keychain UUID with after-first-unlock access, so identity is durable, available in the background, and still private.

A truncation bug I found nearby

While building the answer rows I noticed an unrelated data loss that the atomic write would have happily preserved: I was clamping each submitted answer to 20 characters. That was fine for multiple choice, but word-reorder questions submit a full reconstructed sentence. A Hungarian answer like "Látom az autót az ablakból." is about 27 characters, so most of those answers were being chopped mid-sentence and the analytics for that question type were meaningless. I bumped the limit to 60 graphemes, which comfortably preserves real submissions while still capping anything pathological. Atomicity guarantees you keep the data; it does not guarantee the data was worth keeping.

What I took from it

This shipped quietly inside a point release, so it would be easy to read as a footnote. I do not think it is one. Two dependent writes are two chances to half-succeed, and half-succeeding is worse than failing cleanly, because it leaves behind data that looks real.

If two writes must both happen, make them one write. A transaction is far cheaper than a reconciliation job, and it removes a whole class of orphan.
Atomicity is about the gap between operations, not the operations themselves. Both of mine always succeeded individually.
In Postgres, a missing JSON key under jsonb_populate_record becomes NULL, not the column DEFAULT. Inject the values you expect the database to supply.
Distinguish a deterministic HTTP error from a transient network failure before you retry, or you will either spam a dead endpoint or duplicate rows.
Bound the transport on anything the user never sees. A submit that can hang for minutes is a submit that can lose data.

Nobody notices analytics that simply stay correct, and I am genuinely glad about that. The best version of this work is invisible.

Two analytics writes became one atomic write, and the orphans went away.