Synthesizing a cajon hit in the browser, one tap at a time

Subdrum is a 3D cajon you can spin and strike right in a web page. The obvious way to make it sound like a drum is to record a real one and play back the clip. I went a different way: every hit you hear is synthesized live in the browser, built from scratch the instant you tap, with no audio file anywhere on the page. Here is the actual mechanism, the numbers I landed on, and the small gotcha that made it interesting.

Why not just ship a sample

A recorded cajon hit is the path of least resistance, and for a lot of projects it is the right call. But a single sample has two problems that showed up quickly in a toy you are meant to play with. First, it has to be downloaded, and the whole project is one self-contained index.html with Three.js for the model and nothing else; adding a clean drum recording would be the heaviest thing on the page by far. Second, a sample is identical every time. Tap the drum ten times in a row and you hear the exact same waveform ten times, which our ears seem to read as mechanical almost instantly. A real instrument never repeats itself perfectly.

So I treated the hit as something to generate rather than store. The Web Audio API turns out to be lovely for this: short, percussive, tonal-plus-noisy sounds are cheap to build from oscillators and a noise buffer, and they cost nothing to download because the code that makes them is already on the page.

Unlocking the context on the first tap

Before any of the fun parts, there is a browser rule to respect: an AudioContext created before a user gesture starts in a suspended state, and modern autoplay policies will keep it there. The fix is to create the context lazily and resume it inside the same gesture that triggers the first sound. I do both in one small helper that every hit calls:

let audioContext = null;

function initAudio() {
  if (!audioContext) {
    audioContext = new (window.AudioContext || window.webkitAudioContext)();
  }
  if (audioContext.state === "suspended") {
    audioContext.resume();
  }
  return audioContext;
}

Because a raycaster turns a tap on the canvas into a hit on the drum, that tap is the user gesture, so the very first strike both creates the context and resumes it. No separate "click to enable sound" step, which would have felt like a speed bump on something you are supposed to just poke at.

A cajon hit in three voices

A cajon strike is not one sound. It is a deep wooden body resonating, plus a sharp slap at the moment your hand lands. I modeled that as three layers, each with its own gain envelope so the parts fade at their own rates, all summed into one master gain that feeds ctx.destination.

The thump. A sine oscillator that starts at 120 Hz and exponentially ramps down to 45 Hz over 0.12 seconds. That falling pitch is what gives the hit its weight: the body of the sound drops rather than sitting at a fixed tone. Its gain starts at 1 and decays to 0.01 over 0.35 seconds, and the oscillator is stopped at 0.4 seconds so it does not linger as a dead node.
The resonance. A triangle oscillator pitched a little below the thump, sweeping 85 Hz down to 40 Hz over 0.15 seconds at 0.6 gain. The triangle's odd harmonics keep the body from sounding like a pure test tone, and its slightly longer pitch sweep but shorter gain decay (0.25 seconds) sits it just behind the thump.
The slap. A short burst of white noise through a low-pass filter. This is the transient crack of contact, the woody snap of a hand on plywood.
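
The two oscillator voices can be sketched roughly as below. This is a minimal illustration, not the Subdrum source: playTonalVoice, playBody, and the option names are hypothetical, and the 0.4-second stop time for the resonance is an assumption (the text only gives a stop time for the thump). The frequencies, sweep times, and gains are the ones listed above.

```javascript
// Sketch only: function and option names are hypothetical.
function playTonalVoice(ctx, destination, now, voice) {
  const osc = ctx.createOscillator();
  const gain = ctx.createGain();

  osc.type = voice.type;
  // Pitch sweep: start at startHz and fall exponentially to endHz.
  osc.frequency.setValueAtTime(voice.startHz, now);
  osc.frequency.exponentialRampToValueAtTime(voice.endHz, now + voice.sweepSec);

  // Gain envelope: an exponential ramp cannot reach 0, hence the 0.01 floor.
  gain.gain.setValueAtTime(voice.peakGain, now);
  gain.gain.exponentialRampToValueAtTime(0.01, now + voice.decaySec);

  osc.connect(gain);
  gain.connect(destination);

  osc.start(now);
  osc.stop(now + voice.stopSec); // stop the oscillator once it is inaudible
  return osc;
}

const THUMP = {
  type: "sine", startHz: 120, endHz: 45,
  sweepSec: 0.12, peakGain: 1, decaySec: 0.35, stopSec: 0.4,
};

const RESONANCE = {
  type: "triangle", startHz: 85, endHz: 40,
  sweepSec: 0.15, peakGain: 0.6, decaySec: 0.25,
  stopSec: 0.4, // assumption: not stated in the text
};

// Both voices share one start time and one destination (the master gain).
function playBody(ctx, destination, now = ctx.currentTime) {
  playTonalVoice(ctx, destination, now, THUMP);
  playTonalVoice(ctx, destination, now, RESONANCE);
}
```

In the page, destination would be the master gain node, and ctx would be whatever initAudio() returned for that tap.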

The thump and resonance are almost boilerplate Web Audio. The slap is where the detail lives, so it is worth pulling apart.

Baking the attack into the buffer

The naive way to make a percussive noise burst is to generate flat white noise and shape it with a gain envelope on the way out. I do shape it with a gain node, but the sharp part of the decay is actually baked directly into the sample data when I fill the buffer. I create a tiny 0.08-second buffer and write each sample as white noise multiplied by a falling exponential:

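// ctx is the AudioContext returned by initAudio()
// 80 ms of mono audio; round so the frame count is always an integer
const bufferSize = Math.round(ctx.sampleRate * 0.08);
const buffer = ctx.createBuffer(1, bufferSize, ctx.sampleRate);
const data = buffer.getChannelData(0);

for (let i = 0; i < bufferSize; i++) {
  // white noise in [-1, 1), scaled by an exponential with a ~15 ms time constant
  data[i] = (Math.random() * 2 - 1) * Math.exp(-i / (ctx.sampleRate * 0.015));
}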

The -i / (sampleRate * 0.015) term means the exponent is measured in real seconds regardless of the device sample rate, with a time constant of about 15 milliseconds. After three time constants the envelope is down to about 5 percent, so the noise has effectively collapsed within roughly the first 45 milliseconds of the buffer whether the hardware runs at 44.1 kHz or 48 kHz. To my ear, doing the decay at the sample level, rather than only on a GainNode's gain AudioParam, gives the attack a crisper edge, because every sample is already shaped before any node touches it.
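
A quick way to check that arithmetic is to evaluate the envelope at the same real-time instant for two sample rates. The helper name envelopeAtSample is hypothetical; only the formula comes from the loop above.

```javascript
// Hypothetical helper: same expression as the buffer-fill loop, minus the noise.
const TAU = 0.015; // time constant in seconds

function envelopeAtSample(i, sampleRate, tau = TAU) {
  return Math.exp(-i / (sampleRate * tau));
}

// 45 ms is three time constants, so both rates should land near e^-3.
for (const sampleRate of [44100, 48000]) {
  const i = Math.round(sampleRate * 0.045); // sample index at ~45 ms
  console.log(sampleRate, envelopeAtSample(i, sampleRate).toFixed(3));
}
```

Both rates come out at roughly 0.05, and by the end of the 80-millisecond buffer the envelope has fallen to about 0.005, so truncating the buffer there is inaudible.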

That buffer then runs through a lowpass BiquadFilterNode at 300 Hz with a Q of 1, and a gain node that drops from 0.4 to 0.01 over just 0.06 seconds. The low-pass is the difference between "static" and "thud": rolling off the content above 300 Hz strips out most of the hiss and leaves the muted, wooden character. A fresh noise buffer is generated on every single hit, which means the slap is genuinely different randomness each time, not a looped clip.
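
Put together, the slap chain might look something like this. It is a sketch under assumptions, not the Subdrum source: playSlap is a hypothetical name, and the exponential ramp on the gain is my guess at how "drops from 0.4 to 0.01" is scheduled. The numbers are the ones quoted above.

```javascript
// Sketch only: `playSlap` is a hypothetical name.
function playSlap(ctx, destination, now) {
  // Fresh 80 ms noise buffer per hit, with the ~15 ms decay baked in.
  const length = Math.round(ctx.sampleRate * 0.08);
  const buffer = ctx.createBuffer(1, length, ctx.sampleRate);
  const data = buffer.getChannelData(0);
  for (let i = 0; i < length; i++) {
    data[i] = (Math.random() * 2 - 1) * Math.exp(-i / (ctx.sampleRate * 0.015));
  }

  const source = ctx.createBufferSource();
  source.buffer = buffer;

  // Low-pass at 300 Hz, Q of 1: removes most of the hiss.
  const filter = ctx.createBiquadFilter();
  filter.type = "lowpass";
  filter.frequency.setValueAtTime(300, now);
  filter.Q.setValueAtTime(1, now);

  // Short gain envelope on top of the baked-in decay (ramp shape is assumed).
  const gain = ctx.createGain();
  gain.gain.setValueAtTime(0.4, now);
  gain.gain.exponentialRampToValueAtTime(0.01, now + 0.06);

  // noise -> filter -> gain -> destination (the master gain in the page)
  source.connect(filter);
  filter.connect(gain);
  gain.connect(destination);

  source.start(now);
  return source;
}
```

A buffer source plays once and stops when it reaches the end of its buffer, so this voice needs no explicit stop() call.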

Making no two hits the same

Layering three voices gets you a convincing cajon, but playing it repeatedly brought back the robotic-repetition problem I was trying to avoid with samples. A deterministic synth repeats itself just as exactly as a recording does. What helped was a small amount of controlled randomness computed once per strike and applied across the voices:

// computed at the top of every hit
const pitchVar = 0.9 + Math.random() * 0.2;  // 0.9 to 1.1
const gainVar  = 0.8 + Math.random() * 0.2;  // 0.8 to 1.0

masterGain.gain.setValueAtTime(0.8 * gainVar, now);
osc.frequency.setValueAtTime(120 * pitchVar, now);
osc2.frequency.setValueAtTime(85 * pitchVar, now);

One detail I like: pitchVar only scales the starting frequency of each sweep. The ramp targets stay fixed at 45 and 40 Hz. So the pitch of the attack wanders by plus or minus ten percent, but every hit still settles into the same low body. That keeps the variation from sounding detuned; the drum drifts at the moment of contact and then resolves home.

The variation has to be small. Too much and the drum sounds broken. The goal is not a different sound each time, it is the same sound breathing.

The plus-or-minus ten percent on pitch and the asymmetric eighty-to-one-hundred percent on gain were the result of just listening. Gain only ever bends downward from full because a hit that randomly jumped louder than the loudest tap felt jarring, whereas an occasional softer one reads as a gentler touch. The mechanism is trivial. Tuning how far it is allowed to move was the whole job.
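
One way to hear the edges of those ranges is to make the random source injectable, so the extremes can be forced instead of waited for. This is a hypothetical helper (hitVariation and the shape of its return value are mine, not from the page); the ranges and the fixed sweep targets are the numbers above.

```javascript
// Hypothetical helper: computes one strike's variation from a random source.
function hitVariation(random = Math.random) {
  const pitchVar = 0.9 + random() * 0.2; // 0.9 to 1.1
  const gainVar = 0.8 + random() * 0.2;  // 0.8 to 1.0, never above full

  return {
    masterGain: 0.8 * gainVar,
    // Only the sweep start is scaled; the targets stay fixed.
    thump: { startHz: 120 * pitchVar, endHz: 45 },
    resonance: { startHz: 85 * pitchVar, endHz: 40 },
  };
}

// Forcing the low end of both ranges: the softest, lowest-starting hit.
const lowest = hitVariation(() => 0);
console.log(lowest.thump.startHz.toFixed(1), lowest.masterGain.toFixed(2));
```

With the random source pinned to 0, the thump starts near 108 Hz and the master gain sits near 0.64; the sweep targets are 45 and 40 Hz regardless.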

What this bought me

The synthesized approach helped in a few concrete ways. The page stays light because there is no audio asset to fetch. The sound is generated on demand, so there is no sample file to preload and no decode step before the first tap can make noise; the only setup cost is creating the context, which the first gesture covers. And the result genuinely never repeats, both because of the per-hit random factors and because the noise burst is freshly generated every time.

There is a tradeoff worth naming. A synthesized hit will not fool an audio engineer into thinking it is a miked recording of a specific drum, and I was not trying to. I wanted something that feels like a drum the moment a stranger taps it, on a page that loads fast, and that rewards a second and third tap instead of punishing them.

Takeaways

For short percussive sounds, synthesis can beat samples on page weight, time to first sound, and liveliness all at once.
Create the AudioContext lazily and resume() it inside the first user gesture, or autoplay policy leaves it suspended and silent.
Model a real hit as separate voices with separate envelopes; the staggered decay is what reads as physical.
A pitch-swept oscillator gives percussion its falling weight; randomize the start of the sweep, but keep the target fixed so the variation never sounds detuned.
Bake fast transients into the sample data itself (scale by Math.exp(-i / (sampleRate * t))) for an edge a gain envelope alone will not give you.
Randomize lightly: about ten percent either way on pitch, and gain only downward from full. That is enough to turn a deterministic synth into something that feels played rather than triggered.
