9:16 reframe explained: why speaker tracking beats center crop

May 15, 2026 · 7 min read

reframe · engineering

A 1920×1080 source video has to become 1080×1920 to land on TikTok. The output is about 44% narrower than the source, and a full-height 9:16 crop keeps only about 607 of the original 1920 columns, so roughly 68% of the frame width is thrown away. What you discard is the difference between a clip that looks professional and a clip that looks like a screenshot of a Zoom call. The easy approach is to crop the center; every video editor does it by default. The easy approach is also wrong about 70% of the time.

This post is about what AutoAIClips does instead. We call it speaker-aware reframe, which makes it sound more magical than it is. Under the hood it’s three signals composed together: face detection per frame, transcript diarization (who is speaking at each timestamp), and a stability heuristic that prevents the crop from jittering when the speaker is moving. Each of those pieces is well-known on its own; the value is in how they compose.

Why center crop fails

Imagine a typical podcast frame: two people sitting across from each other, the camera locked off in a wide shot. The host is on the left third, the guest is on the right third, and the center third is the empty space between them. Center crop on this frame produces a 9:16 portrait of the negative space — both speakers cropped out of the frame, the wall in the background as the star of the show. This is the single most common failure mode in “dumb” reframe tools.
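
To put numbers on that failure, here is a minimal sketch of the arithmetic, assuming a full-height 9:16 crop taken from a 1920×1080 frame. The function name and the two face positions (320 and 1600) are hypothetical, picked to sit in the left and right thirds of the frame; none of this is taken from the AutoAIClips codebase.

```python
def center_crop_columns(src_w: int, src_h: int,
                        aspect_w: int = 9, aspect_h: int = 16) -> tuple[int, int]:
    """Return the [left, right) column range of a full-height center crop."""
    crop_w = src_h * aspect_w // aspect_h   # 1080 * 9 // 16 = 607 columns
    left = (src_w - crop_w) // 2
    return left, left + crop_w


left, right = center_crop_columns(1920, 1080)

# Hypothetical horizontal face centers: host in the left third, guest in the right third.
host_x, guest_x = 320, 1600
host_visible = left <= host_x < right
guest_visible = left <= guest_x < right

print(f"crop columns: [{left}, {right})")
print(f"host visible: {host_visible}, guest visible: {guest_visible}")
```

The crop window falls entirely inside the center third (columns 640 to 1280), so under these assumed positions neither face makes it into the output.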

The second failure mode is single-host content where the host sits on the right side of the frame. Center crop now shows the host’s left shoulder and half of their face (the chin and one eye), with the rest cropped out. Anyone watching knows immediately the video was not shot for the 9:16 format.

The third failure mode is multi-cam interviews where the active speaker changes. Center crop never moves; it stays locked on the center pixel of the program feed, regardless of who is talking. The viewer’s attention has to ping-pong between the visible mouth and the audio source, which produces a low-grade cognitive friction that nobody articulates but everybody feels.

What speaker-aware reframe actually does

The mechanical setup is this. We run a face detector — currently a YuNet variant tuned for portrait orientation — on every fourth frame of the source video. That gives us a list of (timestamp, face_bbox) tuples for each detected face. In parallel, we run AssemblyAI’s diarization on the audio track, which gives us (start_ms, end_ms, speaker_id) intervals for each spoken segment.
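
As a rough illustration of the two data streams, the sketch below models them as plain records and shows one way to look up the active speaker at a timestamp. The field names mirror the tuples described above; the class names, `sampled_frame_times`, and `active_speaker` are hypothetical, and nothing here calls the YuNet or AssemblyAI APIs. It assumes diarization turns are sorted by start time and do not overlap.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FaceObs:
    t_ms: float
    bbox: tuple[float, float, float, float]  # x, y, width, height in source pixels


@dataclass(frozen=True)
class SpeakerTurn:
    start_ms: int
    end_ms: int
    speaker_id: str


def sampled_frame_times(n_frames: int, fps: float, stride: int = 4) -> list[float]:
    """Timestamps (ms) of the frames the detector would see at the given stride."""
    return [i * 1000.0 / fps for i in range(0, n_frames, stride)]


def active_speaker(turns: list[SpeakerTurn], t_ms: float) -> Optional[str]:
    """Speaker whose turn covers t_ms, or None during silence.

    Assumes turns are sorted by start_ms and non-overlapping.
    """
    starts = [turn.start_ms for turn in turns]
    i = bisect_right(starts, t_ms) - 1
    if i >= 0 and turns[i].start_ms <= t_ms < turns[i].end_ms:
        return turns[i].speaker_id
    return None


turns = [SpeakerTurn(0, 1000, "A"), SpeakerTurn(1200, 3000, "B")]
times = sampled_frame_times(n_frames=10, fps=25.0)
```

A real pipeline would precompute the `starts` list once rather than rebuilding it per lookup; it is rebuilt here only to keep the sketch short.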

The composition step is straightforward: for each output frame, look up the active speaker from the diarization, then pick the detected face that has stayed spatially closest to that speaker’s previous face position over the last second. Center the crop on that face. If no diarization data exists for the frame (e.g. silence or non-speech audio), default to the largest face in the frame.
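
The selection logic can be sketched roughly as follows. This is a simplification under stated assumptions: it compares against a single last-known face center per speaker rather than a full one-second history, it also falls back to the largest face when a speaker has no known position yet, and the names `pick_target`, `crop_left`, and `last_center` are hypothetical.

```python
from math import hypot
from typing import Optional

Box = tuple[float, float, float, float]  # x, y, width, height


def center(b: Box) -> tuple[float, float]:
    return (b[0] + b[2] / 2, b[1] + b[3] / 2)


def area(b: Box) -> float:
    return b[2] * b[3]


def pick_target(faces: list[Box],
                speaker_id: Optional[str],
                last_center: dict[str, tuple[float, float]]) -> Optional[Box]:
    """Choose the face to center the crop on for one frame."""
    if not faces:
        return None
    # No diarization for this frame (or an unseen speaker): use the largest face.
    if speaker_id is None or speaker_id not in last_center:
        return max(faces, key=area)
    px, py = last_center[speaker_id]
    # Otherwise take the face nearest to where this speaker was last seen.
    return min(faces, key=lambda b: hypot(center(b)[0] - px, center(b)[1] - py))


def crop_left(face: Box, src_w: int = 1920, crop_w: int = 607) -> float:
    """Left edge of a crop centered on the face, clamped to the frame."""
    cx, _ = center(face)
    return min(max(cx - crop_w / 2, 0.0), float(src_w - crop_w))


faces = [(200.0, 300.0, 260.0, 260.0), (1500.0, 300.0, 240.0, 240.0)]
last_center = {"guest": (1610.0, 410.0)}
```

Clamping matters for faces near the frame edge: a crop centered on the right-hand face above would otherwise extend past column 1920.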

The stability problem

If you naïvely re-center the crop on the active speaker every frame, you get a jittery output that pans around as the face detector’s confidence wobbles by a few pixels per frame. The fix is exponential smoothing on the crop position: the target crop is the speaker’s face, but the actual crop moves toward the target with a time constant of about 200ms. Fast enough that hard camera cuts feel responsive; slow enough that micro-jitter doesn’t propagate.
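
A minimal sketch of that smoothing, assuming a fixed frame interval and a one-dimensional crop position (the horizontal crop center). The per-frame blend factor is derived from the time constant as alpha = 1 - exp(-dt / tau); the function names are hypothetical.

```python
from math import exp


def smoothing_alpha(dt_ms: float, tau_ms: float = 200.0) -> float:
    """Per-frame blend factor for an exponential filter with time constant tau."""
    return 1.0 - exp(-dt_ms / tau_ms)


def smooth_track(targets: list[float], dt_ms: float, tau_ms: float = 200.0) -> list[float]:
    """Move the crop position toward each frame's target by a fraction alpha."""
    alpha = smoothing_alpha(dt_ms, tau_ms)
    pos = targets[0]          # start on the first target
    out = []
    for target in targets:
        pos += alpha * (target - pos)
        out.append(pos)
    return out


# Detector wobble of a few pixels around x = 500, sampled at 30 fps.
raw = [500.0, 503.0, 497.0, 502.0, 498.0, 501.0]
smoothed = smooth_track(raw, dt_ms=1000.0 / 30.0)

# A hard step of 100 px, observed for exactly one time constant (6 frames of 200/6 ms).
step = smooth_track([0.0] + [100.0] * 6, dt_ms=200.0 / 6.0)
```

In this example the raw targets span 6 pixels while the smoothed positions span well under 1, and after one time constant the crop has covered about 63% of the step.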

The harder case is a rapid speaker change. Imagine the host asks a quick one-second question, the guest starts to answer, and the crop has to swing from host-side to guest-side. The 200ms smoothing makes that swing visible, but a gentle pan reads as “the camera operator panned to the guest” rather than as a jump cut, which is exactly what we want.
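
To get a feel for how long that pan stays visible, the sketch below inverts the exponential response: an exponential filter covers a fraction f of a step after t = -tau · ln(1 - f). The 95% threshold and the 30 fps output rate are assumptions chosen for illustration, and `settle_time_ms` is a hypothetical helper.

```python
from math import ceil, log


def settle_time_ms(fraction: float, tau_ms: float = 200.0) -> float:
    """Time for an exponential filter to cover `fraction` of a step (0 < fraction < 1)."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return -tau_ms * log(1.0 - fraction)


t95 = settle_time_ms(0.95)                 # roughly three time constants
frames_at_30fps = ceil(t95 * 30.0 / 1000.0)
print(f"95% of the swing takes about {t95:.0f} ms ({frames_at_30fps} frames at 30 fps)")
```

With a 200ms time constant, the bulk of the swing takes roughly 0.6 seconds, which is in the range of a deliberate camera pan rather than a cut.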

Edge cases we handle

Speaker off-screen. If the active speaker has no detected face in the current frame (camera is on the guest while the host speaks), we hold the previous crop position rather than jumping to the largest face. This avoids a jarring swing back to the wrong person.
Multiple faces, same speaker. This shows up in picture-in-picture recordings, where the interviewer’s face sits in a corner inset over a screen-share. We disambiguate using face size: the inset face is smaller, so we crop on the larger primary face.
No faces detected. This covers conference talks shot from a distance, screen recordings, and B-roll. We fall back to center crop with a soft Ken-Burns motion to keep the frame from feeling static.
Audio-only sources. No face track exists. We composite a waveform animation over a brand-template background so captions have something to anchor to.
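
The fallbacks above, together with the largest-face default for frames without diarization, can be read as a small decision function. The sketch below is one possible ordering of those checks; the enum, its member names, and `choose_mode` are hypothetical, and the picture-in-picture rule is left out because it belongs to face selection rather than mode selection.

```python
from enum import Enum, auto
from typing import Optional


class CropMode(Enum):
    TRACK_SPEAKER = auto()        # center on the active speaker's face
    HOLD_PREVIOUS = auto()        # speaker off-screen: keep the last crop position
    TRACK_LARGEST_FACE = auto()   # no diarization for this frame
    CENTER_KEN_BURNS = auto()     # no faces: center crop with slow motion
    WAVEFORM_BACKGROUND = auto()  # audio-only source


def choose_mode(has_video: bool,
                n_faces: int,
                speaker_id: Optional[str],
                speaker_face_visible: bool) -> CropMode:
    """Pick the reframe behavior for one frame from the available signals."""
    if not has_video:
        return CropMode.WAVEFORM_BACKGROUND
    if n_faces == 0:
        return CropMode.CENTER_KEN_BURNS
    if speaker_id is None:
        return CropMode.TRACK_LARGEST_FACE
    if speaker_face_visible:
        return CropMode.TRACK_SPEAKER
    return CropMode.HOLD_PREVIOUS


# Host is speaking while the camera shows only the guest.
mode = choose_mode(has_video=True, n_faces=1, speaker_id="host", speaker_face_visible=False)
```

Checking for missing video and missing faces first keeps the speaker-specific branches from ever running without a face track to act on.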

Why this matters more than caption animation

Many AI clippers compete heavily on caption animation styles — karaoke highlight, pop-on, line-by-line, emoji bursts. Those animations are fine. They are also easy to copy; every new clipper that launches gets caption parity with the leaders within months. Reframe quality is harder, requires the diarization + face-detect pipeline upstream, and is the single most visible quality difference in the output. If you compare a competently reframed clip to a center-cropped clip side by side, the reframed clip looks 10× more professional. The caption animation difference, by comparison, looks like 1.2×.

Watch the reframe before you watch the captions. It tells you whether the tool you are evaluating has actually built the pipeline, or whether it’s just slapping a center-crop transform on top of someone else’s clip-detection model.
