Reading the shape of a recovered catalog with offline audio analysis

Eighty-six tracks in the archive arrived as bare MP3s. No tempo, no key, no genre, sometimes not even a reliable year. To give each recovered song a real page, I needed a way to read the shape of the catalog without inventing facts about it. So I built a small offline Python pipeline, and then I spent most of my time thinking about how much any one number deserved to be trusted.

Eighty-six files, almost no metadata

A memorial archive leans on its detail. A track page that shows only a title and a play button feels thin, and thin is the wrong note for work I hope to keep around. But I could not write what I did not know, and the source files carried almost nothing. They were dumps from old drives and forum attachments: audio, and not much else.

The honest options were to leave the pages bare, to guess, or to measure. Guessing felt wrong here. Measuring was the interesting path, because an audio file can tell you something real about itself if you ask it carefully. The catch is that different tools ask in different ways and sometimes disagree, so the real work is less about extraction and more about reconciliation.

A staged pass that reads each file

The pipeline runs as a sequence of small scripts, each writing its own JSON to an output/ directory, so a slow stage never has to run twice. Three of them read the audio in genuinely different ways:

mutagen reads whatever the MP3 already carries: bitrate, sample rate, channel mode, and any ID3 frames the original author embedded, including the BPM that FL Studio writes into TBPM.
librosa does signal-level analysis from the waveform itself: a beat-tracker tempo estimate, a chroma-based key guess, plus spectral and timbral features (MFCCs, spectral centroid, onset rate). The whole catalog runs in about 75 seconds single-threaded.
Essentia does the heavier music-information-retrieval work via its MusicExtractor (rhythm, tonal strength, dynamics), then runs a Discogs-EfficientNet embedding through a set of MTG-Jamendo classifier heads for 87-class genre, 56-class mood-theme, and 40-class instrument tagging, plus binary heads for voice-versus-instrumental and seven moods. That pass takes about seven minutes.
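The "a slow stage never has to run twice" behaviour can be sketched as a small cache-on-disk wrapper. This is a minimal illustration, not the pipeline's actual code: the run_stage helper, its signature, and the one-file-per-stage naming are assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable


def run_stage(name: str, compute: Callable[[], dict], out_dir: Path = Path("output")) -> dict:
    """Run one analysis stage unless its JSON already exists on disk.

    `name` and the `<name>.json` layout are hypothetical; the idea is only
    that each stage owns one output file and is skipped when it is present.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.json"

    # A previous run already paid for this stage: reuse its result.
    if out_path.exists():
        return json.loads(out_path.read_text(encoding="utf-8"))

    # Otherwise do the (possibly slow) work once and persist it.
    result = compute()
    out_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result
```

With a wrapper like this, deleting one stage's JSON file is enough to force only that stage to rerun.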

A merge_features.py step combines the three into one record per track, an LLM step applies hand-authored descriptions, and a final build_website_data.py trims the heavy arrays (MFCCs, chroma, raw score vectors) and writes a lookup file the static site reads at build time. No service runs in production. The analysis happens once on my machine, the trimmed result is committed alongside the rest of the data, and every track page renders from that file like everything else on the site.
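The merge-then-trim flow described above might look roughly like the sketch below. The function names, the per-tool namespacing, and the heavy field names (mfcc, chroma, genre_scores) are all assumptions for illustration; the real merge_features.py and build_website_data.py may be organised differently.

```python
# Hypothetical names for the bulky arrays the site does not need.
HEAVY_KEYS = frozenset({"mfcc", "chroma", "genre_scores"})


def merge_features(by_tool: dict) -> dict:
    """Turn {tool: {track_id: features}} into {track_id: {tool: features}}."""
    merged = {}
    for tool, tracks in by_tool.items():
        for track_id, features in tracks.items():
            merged.setdefault(track_id, {})[tool] = features
    return merged


def trim_for_site(merged: dict, heavy=HEAVY_KEYS) -> dict:
    """Return a copy of the merged records without the heavy arrays."""
    return {
        track_id: {
            tool: {k: v for k, v in features.items() if k not in heavy}
            for tool, features in tools.items()
        }
        for track_id, tools in merged.items()
    }
```

Keeping each tool's output under its own key means the full record still shows who said what, while the trimmed copy stays small enough to commit.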

Keeping the analysis out of the runtime feels right for an archive. I would like this to outlast any single host, so the fewer moving parts in production, the better. A computed number that lives in a versioned JSON file is just data. A number that depends on a model server staying up is one more thing that can go quiet.

Three opinions about every number

The interesting part is that I now had up to three sources for tempo and two for key, and they did not always agree. Beat trackers are famously prone to landing on a clean double or half of the real tempo, so I could not just trust the loudest voice. I decided every canonical value should come with a recorded provenance, ordered from most to least defensible.

For tempo: an embedded ID3 BPM wins, because that is the value the song's own author rendered out of FL Studio. If there is no embedded value, Essentia's rhythm extractor is next, and librosa's beat tracker is the last resort. For key, librosa and Essentia each report a confidence (a profile correlation from librosa, a key strength from Essentia), so I let the more confident one win rather than hard-coding a preference.

# Canonical BPM: prefer the author's own encoded value, then
# Essentia's rhythm extractor, then librosa's beat tracker.
bpm, bpm_source = None, None
if id3_bpm:                       # FL Studio wrote this on export
    bpm, bpm_source = float(id3_bpm), "id3"
if bpm is None and essentia_bpm:
    bpm, bpm_source = essentia_bpm, "essentia"
if bpm is None and librosa_bpm:
    bpm, bpm_source = librosa_bpm, "librosa"

# Key: whichever estimator is more sure of itself wins.
# Default to None so a track with no usable estimate still has the fields.
key, scale, key_source = None, None, None
if es_strength is not None and (lib_conf is None or es_strength >= lib_conf):
    key, scale, key_source = es_key, es_scale, "essentia"
elif lib_conf is not None:
    key, scale, key_source = lib_key, lib_mode, "librosa"

That bpm_source field turned out to matter more than the BPM itself. It means a page can quietly distinguish a number Vanessa authored from one a model inferred, and I never have to wonder later which is which. Provenance is cheap to record and expensive to reconstruct, so I record it everywhere.
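The double-or-half failure mode can also be made visible rather than silently resolved by the ordering. The sketch below is not part of the pipeline described here; tempo_relation and its 4% tolerance are hypothetical, meant only to show how two estimates could be compared up to a factor of two.

```python
def tempo_relation(a: float, b: float, tolerance: float = 0.04) -> str:
    """Classify how BPM estimate `a` relates to estimate `b`.

    Returns "same", "double" (a is about 2x b), "half" (a is about b/2),
    or "different". The tolerance is relative and is an assumed value.
    """
    if a <= 0 or b <= 0:
        raise ValueError("tempo estimates must be positive")

    ratio = a / b
    for label, target in (("same", 1.0), ("double", 2.0), ("half", 0.5)):
        if abs(ratio - target) <= tolerance * target:
            return label
    return "different"
```

For example, tempo_relation(140.0, 70.3) reports "double", which is a hint that the two tools probably heard the same pulse at different metrical levels rather than truly disagreeing.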

How the key estimate actually works

The librosa key guess is worth a moment, because it is a tidy bit of music theory turned into arithmetic. You compute a 12-bin chroma vector (the average energy in each pitch class across the whole track), then correlate it against the Krumhansl-Schmuckler key profiles: two reference vectors, one for major and one for minor, that encode how prominent each scale degree tends to be. Rotate the chroma through all twelve roots, score every rotation against both profiles, and the best correlation is your best guess at key and mode.

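import numpy as np
# Krumhansl-Schmuckler: rotate the chroma through all 12 roots,
# correlate against the major and minor profiles, keep the best.
# chroma_mean, KS_MAJOR, KS_MINOR, and PITCH_CLASSES are assumed to be
# defined earlier in the script.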
best = (None, None, -1.0)
for i in range(12):
    rotated = np.roll(chroma_mean, -i)
    for mode, profile in (("major", KS_MAJOR), ("minor", KS_MINOR)):
        corr = float(np.corrcoef(rotated, profile)[0, 1])
        if corr > best[2]:
            best = (PITCH_CLASSES[i], mode, corr)

It is naive (it ignores how the key moves through a song) but it is honest about that: the correlation coefficient it returns is exactly the confidence I feed into the reconciliation above. When Essentia is more certain, Essentia wins. When neither is certain, the page can simply say less.
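For readers who want to try the idea without the rest of the pipeline, here is a self-contained, standard-library version of the same loop. The profile numbers are the commonly cited Krumhansl-Kessler values; the sharp-only PITCH_CLASSES spelling and the estimate_key and pearson helper names are assumptions for this sketch, and the original script may define them differently.

```python
from math import sqrt

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
# Commonly cited Krumhansl-Kessler key profiles, indexed from the tonic.
KS_MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
KS_MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def estimate_key(chroma_mean):
    """Return (root, mode, correlation) for a 12-bin chroma vector starting at C."""
    chroma = list(chroma_mean)
    best = (None, None, -1.0)
    for i in range(12):
        # Left-rotate so pitch class i sits at index 0 (same as np.roll(chroma, -i)).
        rotated = chroma[i:] + chroma[:i]
        for mode, profile in (("major", KS_MAJOR), ("minor", KS_MINOR)):
            corr = pearson(rotated, profile)
            if corr > best[2]:
                best = (PITCH_CLASSES[i], mode, corr)
    return best
```

A quick sanity check is to feed it a profile shifted to a known root: a chroma vector that is exactly the minor profile moved up nine semitones should come back as A minor with a correlation of essentially 1.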

Treating the output as a starting point

Every track also carries hand-authored notes, written by people who knew the work. Those notes sit next to the machine numbers on the page so the analysis never stands alone. The computed tags surface the shape of a song. The notes say what it actually meant. One is a measurement; the other is a memory, and I try never to let the measurement stand in for the memory.

Analysis is a way to surface the shape of a body of work. A memorial archive still needs a person to decide what is true and what gets said.

The tagging output gets the same skeptical treatment. Almost every track scores "electronic" as its top genre, which is true and also useless, so the build step drops the universal top tag and keeps the more distinguishing ones beneath it. A flat list of model scores becomes a short, readable set of chips only after a person decides which thresholds mean something.
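Turning a flat score list into a few chips could be sketched like this. The select_chips name, the 0.1 threshold, and the five-chip limit are placeholder assumptions; the point of the paragraph above is precisely that a person has to choose those numbers.

```python
def select_chips(scores: dict, universal=frozenset({"electronic"}),
                 threshold: float = 0.1, limit: int = 5) -> list:
    """Pick a short list of distinguishing tags from {tag: score}.

    Tags listed in `universal` are dropped because they apply to nearly
    every track; `threshold` and `limit` are illustrative values only.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    chips = [tag for tag, score in ranked
             if tag not in universal and score >= threshold]
    return chips[:limit]
```

With scores such as {"electronic": 0.92, "ambient": 0.41, "chillout": 0.22, "rock": 0.03}, only "ambient" and "chillout" survive.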

The same pass repairs the files

While I had each file open, I let a separate writeback step do double duty. Beyond producing the JSON the site reads, it embeds ID3v2.4 frames into the MP3s themselves: title, composer credit, album art, lyrics, and the analysis results. The rule is strictly additive. It backs up every original first, then only fills in frames that are empty.

TBPM is written only when missing. Forty-seven tracks gained a computed BPM; thirty-nine kept the value Vanessa had encoded herself.
TKEY is written in DJ format ("Gm" for G minor, "G" for G major), but only when Essentia's key strength clears 0.6, so a weak guess never gets stamped into the file.
Mood lean, secondary genres, instruments, themes, and the description go into custom TXXX frames, which iTunes-style players can surface without colliding with the standard tags.
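The two rules in this list (a strength gate on the key, and fill-only-when-empty) can be illustrated on plain dictionaries, leaving the actual ID3 writing to mutagen. Both helpers below are hypothetical, and treating "clears 0.6" as strictly greater than 0.6 is an assumption.

```python
def dj_key(key, scale, strength, min_strength: float = 0.6):
    """Format a key as "Gm" / "G", or return None if the estimate is too weak.

    Assumption: a strength of exactly `min_strength` does not clear the bar.
    """
    if key is None or strength is None or strength <= min_strength:
        return None
    return f"{key}m" if scale == "minor" else key


def fill_missing(existing: dict, computed: dict) -> dict:
    """Additive merge of frame values: never overwrite a non-empty frame."""
    merged = dict(existing)
    for frame, value in computed.items():
        if value is None:
            continue  # nothing trustworthy to write for this frame
        if not merged.get(frame):
            merged[frame] = value
    return merged
```

An authored TBPM survives untouched, while an absent TKEY is filled only when dj_key returned something other than None.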

This is the detail I am happiest about. The point of the archive is durability, and a downloaded MP3 that carries its own title, credit, art, words, and analysis holds up in a way a web page cannot. If the site ever disappears, the files still know what they are. The metadata travels with the music, which is what I want from a catalog meant to survive its own infrastructure.

What I am keeping

Measure when you can, because audio files will tell you real things about themselves. Try not to guess.
When tools disagree, do not pick a favorite once and forget it. Order your sources by how defensible they are, and let confidence decide the close calls.
Record the provenance of every value. A one-word source field is the cheapest insurance you will ever buy against confusing an authored number with an inferred one.
Keep heavy analysis out of production. Run it once in stages, commit the trimmed result, and let the static site read plain data.
Write the truth back into the files, additively and after a backup, so the music carries its own context wherever it goes.
Let the machine surface shape, but let a person decide meaning. On a memorial, that distinction feels like the heart of it.

The pipeline gave every recovered track a real page. The care around it (the provenance, the thresholds, the refusal to let a guess masquerade as a fact) is what I hope makes those pages worth trusting, because trust is most of what an archive like this has to offer.
