Subdrum is a single HTML file that lets you spin a wooden cajon in the browser and tap it to play. It started as a small birthday-style page for a friend's drum, but the playable part hid a question I underestimated: when someone touches the screen, did they hit the drum, or did they hit the empty space behind it? Answering that precisely, on a mouse and on a thumb, while drag, pinch, and pan all share the same canvas, turned out to be most of the work.
A tap is not a hit
The drum is a 3D object floating in a scene. A pointer event, by contrast, is a flat pair of pixel coordinates on a 2D canvas, and the screen carries no information about depth. A tap near the silhouette of the drum could be landing on the wood, or it could be sailing past the edge into the background. I did not want to play a thump on every tap, because then striking empty air would sound exactly like striking the face. That small dishonesty would make the object feel fake, and I think people quietly stop trusting things that respond when they should not.
So before any sound, I needed to translate a flat screen tap back into the scene and ask one precise question: does a line drawn from the camera, through this pixel, into the world, actually touch the cajon mesh?
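To make that question concrete before any library is involved, here is a minimal sketch of the geometry in plain JavaScript. It is not how Subdrum or Three.js does it: it swaps the real mesh for an axis-aligned box (a rough stand-in for a cajon, with made-up dimensions) and uses the standard slab test. The name rayHitsBox and the cajonMin / cajonMax coordinates are assumptions for illustration only.

```javascript
// Slab test: does a ray (origin + t * dir, t >= 0) touch an axis-aligned box?
// Vectors are plain [x, y, z] arrays.
function rayHitsBox(origin, dir, boxMin, boxMax) {
  let tNear = -Infinity; // latest entry across the three pairs of faces
  let tFar = Infinity;   // earliest exit across the three pairs of faces
  for (let axis = 0; axis < 3; axis++) {
    if (dir[axis] === 0) {
      // Ray runs parallel to this pair of faces: it must already lie between them.
      if (origin[axis] < boxMin[axis] || origin[axis] > boxMax[axis]) return false;
      continue;
    }
    const t1 = (boxMin[axis] - origin[axis]) / dir[axis];
    const t2 = (boxMax[axis] - origin[axis]) / dir[axis];
    tNear = Math.max(tNear, Math.min(t1, t2));
    tFar = Math.min(tFar, Math.max(t1, t2));
  }
  // Hit if the entry happens before the exit and the box is not behind the ray.
  return tNear <= tFar && tFar >= 0;
}

// Hypothetical box standing in for the drum body (units are arbitrary).
const cajonMin = [-0.25, 0, -0.25];
const cajonMax = [0.25, 1, 0.25];

const faceHit = rayHitsBox([0, 0.5, 3], [0, 0, -1], cajonMin, cajonMax); // true
const pastEdge = rayHitsBox([1, 0.5, 3], [0, 0, -1], cajonMin, cajonMax); // false
```

A real mesh needs a triangle-level test rather than a box, which is the work a raycasting library takes off your hands.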
Casting the ray
That line is a ray, and Three.js gives you a Raycaster to work with it. The first step is to convert the pointer coordinates into normalized device coordinates (NDC): a space that runs from -1 to +1 on both axes, with the origin at the center of the viewport. Because Subdrum's canvas fills the whole window, I can divide directly by window.innerWidth and window.innerHeight rather than measuring the element's bounding box. The y axis gets negated because screen coordinates grow downward while NDC grows upward.
Here is the actual core of it, lightly trimmed.
const raycaster = new THREE.Raycaster();
const pointer = new THREE.Vector2();

function onTap(event) {
  event.preventDefault();
  // Touch events carry coords on event.touches; mouse on the event itself.
  let clientX, clientY;
  if (event.touches && event.touches.length > 0) {
    clientX = event.touches[0].clientX;
    clientY = event.touches[0].clientY;
  } else {
    clientX = event.clientX;
    clientY = event.clientY;
  }
  pointer.x = (clientX / window.innerWidth) * 2 - 1;
  pointer.y = -(clientY / window.innerHeight) * 2 + 1;
  raycaster.setFromCamera(pointer, camera);
  const hits = raycaster.intersectObjects(cajonMesh.children, true);
  if (hits.length > 0) {
    playCajonSound(); // synthesized thump, below
    showTapFeedback(clientX, clientY);
    if (typeof plausible !== 'undefined') plausible('beat');
  }
}
A couple of details there are easy to skip past. Touch and mouse events do not expose coordinates the same way: a touch event holds them in event.touches[0], while a click holds them directly on the event, so the branch at the top normalizes both into one clientX / clientY pair before any math runs. And the cajon is not one mesh but a small group of parts, so I call intersectObjects(cajonMesh.children, true) with the recursive flag set to true, which walks into each child's descendants too. Tap the void around the drum and hits.length is zero, so nothing fires. That silence is the feature.
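Dividing by the window size only works because the canvas fills the window. If the canvas sat inside a layout with margins, the same conversion would need the element's bounding box instead. A hedged sketch of that variant follows; the helper name toNdc is hypothetical, and in the page the rect argument would come from renderer.domElement.getBoundingClientRect().

```javascript
// Convert client (screen) coordinates to normalized device coordinates
// relative to an arbitrary rectangle, e.g. a canvas's bounding box.
// rect needs left, top, width and height, as a DOMRect provides.
function toNdc(clientX, clientY, rect) {
  return {
    // 0..width maps to -1..+1, left to right.
    x: ((clientX - rect.left) / rect.width) * 2 - 1,
    // 0..height maps to +1..-1, because screen y grows downward.
    y: -((clientY - rect.top) / rect.height) * 2 + 1,
  };
}

// A canvas offset by 100 px / 50 px from the window corner (made-up numbers).
const exampleRect = { left: 100, top: 50, width: 200, height: 100 };
const center = toNdc(200, 100, exampleRect);  // { x: 0, y: 0 }
const topLeft = toNdc(100, 50, exampleRect);  // { x: -1, y: 1 }
```

With a full-window canvas, left and top are zero and width and height equal the window size, which reduces to the two lines used in onTap.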
One tap, not two
Touch devices are too generous. A single thumb press fires a touchstart and then, a fraction of a second later, a synthesized click, because browsers emulate mouse events for the benefit of pages written before touch existed. I listen for both so the drum works everywhere, but that means a single press would play twice, and the second thump lands slightly late, so it reads as a flam rather than one clean strike.
The fix is a small debounce guard. I record the time of the last accepted tap and ignore any new one that arrives within 100 milliseconds of it.
let lastTapTime = 0;

function handleTap(event) {
  const now = Date.now();
  if (now - lastTapTime < 100) return; // drop the emulated twin
  lastTapTime = now;
  onTap(event);
}

renderer.domElement.addEventListener('click', handleTap);
renderer.domElement.addEventListener('touchstart', handleTap, { passive: false });
- The first event through the door, whichever it is, plays the sound and stamps the time.
- Its phantom twin arrives shortly afterwards, fails the check, and is dropped silently.
- A genuine fast double tap, well past the window, still gets through, so drumming quickly still works.
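One hundred milliseconds is meant to be longer than the gap between a real event and its emulated partner on a quick tap, and it is shorter than the spacing of almost any rhythm played by hand: it only blocks strikes that arrive faster than ten per second. Note the { passive: false } on the touch listener: I call event.preventDefault() inside onTap to stop the browser from also scrolling or zooming on that press, and preventDefault() is ignored inside a passive listener. Per the Touch Events specification, cancelling touchstart this way should also suppress the emulated mouse events in conforming browsers, so the timestamp guard is best read as a safety net for the cases where a twin still slips through.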
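One way to check the three behaviours listed above without a phone in hand is to pull the guard into a small factory with an injectable clock. This is a sketch rather than Subdrum's code: createTapGuard and its parameters are hypothetical names, and the timings in the usage lines are made up.

```javascript
// Returns a function that answers "should this tap be accepted?".
// windowMs: how long after an accepted tap further taps are ignored.
// now: clock function, injectable so the guard can be exercised without waiting.
function createTapGuard(windowMs = 100, now = Date.now) {
  let lastAccepted = -Infinity;
  return function acceptTap() {
    const t = now();
    if (t - lastAccepted < windowMs) return false; // too soon: treat as the twin
    lastAccepted = t; // only accepted taps move the timestamp
    return true;
  };
}

// Usage with a fake clock so the sequence is easy to follow.
let fakeTime = 1000;
const acceptTap = createTapGuard(100, () => fakeTime);
const first = acceptTap();  // true: the first event is accepted
fakeTime = 1030;
const twin = acceptTap();   // false: an event 30 ms later is dropped
fakeTime = 1180;
const second = acceptTap(); // true: a real strike 180 ms after the first
```

Because a rejected event does not update the timestamp, a run of twins cannot push the window forward and starve real taps.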
Sharing the canvas with the camera
Tapping is not the only thing happening on that surface. OrbitControls run on the same canvas, and they are configured deliberately so that one finger and two fingers do different jobs.
controls.touches = {
  ONE: THREE.TOUCH.ROTATE,
  TWO: THREE.TOUCH.DOLLY_PAN
};
controls.mouseButtons = {
  LEFT: THREE.MOUSE.ROTATE,
  MIDDLE: THREE.MOUSE.DOLLY,
  RIGHT: THREE.MOUSE.PAN
};
controls.minDistance = 0.8;       // how close you can lean in
controls.maxDistance = 8;
controls.autoRotate = true;       // idle spin until you touch it
controls.target.set(0, 0.865, 0); // orbit the body, not the floor
The play gesture has to coexist with all of that without stealing it. What helped is that these gestures are genuinely different motions, not just different intents. A play is a tap: down and up in roughly the same place, over in a moment. A rotate is a drag: the pointer travels. A zoom or pan uses two fingers, not one. Because a hit is the small, brief, single-pointer gesture, it slots into the gaps the camera controls leave open, and a deliberate drag to look around the drum does not trip it. The raycast is still the final arbiter: even a clean tap counts only if the ray lands on the mesh.
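The listeners shown earlier fire on touchstart and click, and on their own those events do not measure how far the pointer travelled or how long it stayed down. If you want the tap-versus-drag distinction enforced in code, one option is to compare the pointer-down and pointer-up samples. The sketch below is an assumption about how that could look, not something taken from Subdrum: isTap, the sample shape, and the 10 px / 250 ms thresholds are all hypothetical.

```javascript
// Decide whether a down/up pair looks like a tap rather than a drag or pinch.
// down: { x, y, time, pointers }  position, timestamp in ms, active pointer count
// up:   { x, y, time }
function isTap(down, up, { maxMovePx = 10, maxDurationMs = 250 } = {}) {
  const moved = Math.hypot(up.x - down.x, up.y - down.y); // distance travelled
  const held = up.time - down.time;                       // time pressed
  return down.pointers === 1 && moved <= maxMovePx && held <= maxDurationMs;
}

const quickTap = isTap({ x: 100, y: 100, time: 0, pointers: 1 }, { x: 103, y: 104, time: 90 });  // true
const rotateDrag = isTap({ x: 100, y: 100, time: 0, pointers: 1 }, { x: 160, y: 100, time: 200 }); // false
const pinch = isTap({ x: 100, y: 100, time: 0, pointers: 2 }, { x: 100, y: 100, time: 90 });       // false
```

A check like this would run on release, before the raycast, so the ray still decides whether an accepted tap actually landed on the drum.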
The drum also auto-rotates while idle so the page feels alive before anyone interacts, then yields the moment you touch it. I hang that off the controls' own start event rather than my tap handler, so it fires for rotates and pans too, not just hits:
let userHasInteracted = false;

controls.addEventListener('start', onUserInteraction);

function onUserInteraction() {
  if (userHasInteracted) return; // only the first interaction matters
  userHasInteracted = true;
  controls.autoRotate = false;
  instructionsEl.classList.add('hidden');
}
The goal was never just to detect taps. It was to make the drum feel like an object you can reach out and strike, where missing is as honest as hitting.
The sound of a hit
Once a ray confirms a hit, the thump itself is synthesized with the Web Audio API rather than played from a sample, which keeps the page to a single file with no audio download. A cajon's low slap is mostly a pitch that drops fast, so the body is a sine oscillator sweeping from about 120 Hz down to 45 Hz over 120 milliseconds, with a triangle wave underneath for resonance and a short burst of low-passed noise for the attack of palm on wood. Each strike gets a little random pitch and gain variation so repeated taps do not sound like a copy machine. The first tap also has to call audioContext.resume(), because browsers keep the audio context suspended until a user gesture unlocks it; the same tap that passes the raycast is what wakes the sound up.
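A rough sketch of that recipe with the Web Audio API is below. Only the 120 Hz to 45 Hz sweep over 120 ms, the triangle layer, the low-passed noise burst, and the idea of per-strike variation come from the description above; the function names, the 5% pitch and 10% gain spreads, the triangle pitch, the 800 Hz cutoff, and the envelope lengths are assumptions picked for illustration.

```javascript
// Pick per-strike parameters. random is injectable so the spread can be checked.
function makeStrikeParams(random = Math.random) {
  const pitch = 1 + (random() - 0.5) * 0.1;        // assumed spread: 0.95 .. 1.05
  const gain = 0.8 * (1 + (random() - 0.5) * 0.2); // assumed spread: 0.72 .. 0.88
  return { startHz: 120 * pitch, endHz: 45 * pitch, sweepSec: 0.12, gain };
}

// Schedule one thump on an AudioContext (browser only).
function playThump(ctx, params = makeStrikeParams()) {
  if (ctx.state === 'suspended') ctx.resume(); // unlock on the first gesture
  const t0 = ctx.currentTime;
  const tEnd = t0 + 0.3; // assumed overall decay time

  // Shared output envelope: start at the strike gain, decay to near silence.
  const out = ctx.createGain();
  out.gain.setValueAtTime(params.gain, t0);
  out.gain.exponentialRampToValueAtTime(0.001, tEnd);
  out.connect(ctx.destination);

  // Body: sine whose pitch drops quickly.
  const body = ctx.createOscillator();
  body.type = 'sine';
  body.frequency.setValueAtTime(params.startHz, t0);
  body.frequency.exponentialRampToValueAtTime(params.endHz, t0 + params.sweepSec);
  body.connect(out);
  body.start(t0);
  body.stop(tEnd);

  // Resonance: quieter triangle an octave above the end pitch (assumed).
  const res = ctx.createOscillator();
  const resGain = ctx.createGain();
  res.type = 'triangle';
  res.frequency.setValueAtTime(params.endHz * 2, t0);
  resGain.gain.value = 0.3;
  res.connect(resGain);
  resGain.connect(out);
  res.start(t0);
  res.stop(tEnd);

  // Attack: 30 ms of fading white noise through a low-pass filter.
  const len = Math.floor(ctx.sampleRate * 0.03);
  const buffer = ctx.createBuffer(1, len, ctx.sampleRate);
  const data = buffer.getChannelData(0);
  for (let i = 0; i < len; i++) data[i] = (Math.random() * 2 - 1) * (1 - i / len);
  const noise = ctx.createBufferSource();
  const lowpass = ctx.createBiquadFilter();
  noise.buffer = buffer;
  lowpass.type = 'lowpass';
  lowpass.frequency.value = 800;
  noise.connect(lowpass);
  lowpass.connect(out);
  noise.start(t0);
}
```

In the page this would be called from the tap handler with one shared context, for example playThump(audioContext) after the raycast reports a hit.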
Takeaways
- A screen tap carries no depth; raycasting is how you turn a flat pixel into a real question about a 3D scene.
- Convert to normalized device coordinates first, negating the y axis, then let the Raycaster do the geometry. If your mesh is a group, intersect its children with the recursive flag on.
- Touch fires both touchstart and an emulated click; a 100 ms debounce drops the twin without blocking real fast taps. Keep the touch listener non-passive if you need preventDefault.
- When gestures share one canvas, lean on the fact that they are physically distinct motions, route one-finger versus two-finger to different jobs, and let the raycast have the last word on whether a hit counts.
- Synthesizing the sound instead of loading a sample kept the whole thing in one HTML file, and unlocking the audio context on that first gesture is the one bit of setup that is easy to forget.