# Voicemail Visualizer and Scrubber UI in SwiftUI: The Build

> By Lawrence Arya, Founder & CEO of VP0. Published 2026-06-10. 9 min read.
> Source: https://vp0.com/blogs/voicemail-visualizer-scrubber-ui-swiftui

Nobody listens to voicemail for pleasure. Here is the waveform, scrubber, and tap-to-seek transcript that make messages fast to triage.

**TL;DR.** A voicemail visualizer in SwiftUI is a triage tool with three parts: a precomputed waveform (decode the audio, downsample to a few hundred peak buckets with Accelerate, cache, draw two layers), a scrubber that maps the full waveform width to seconds with a generous hit area, and a transcript from on-device speech recognition that works as a second scrubber, where tapping a word plays that moment. One clock, the player's reported time, drives everything, and skip-silence falls out of the bucket data almost free. Start the inbox and player screens from a free VP0 design that an agent like Claude Code or Cursor extends from its source page, and never ship decorative fake bars.

## What a voicemail visualizer actually is

A voicemail visualizer turns a voice message into something the eye can navigate: a waveform showing where the speech lives, a scrubber to land anywhere in it, and usually a transcript running alongside. The pattern has real lineage. [Visual voicemail](https://en.wikipedia.org/wiki/Visual_voicemail) arrived with the original iPhone and changed voicemail from a sequential chore into a list you triage, and the modern version finishes that thought: not just which message, but which second of it.

The product insight that shapes the build: nobody listens to voicemail for pleasure. The user wants the number, the time, the one sentence that matters, so every element exists to shorten the path to it. The waveform shows where someone is talking versus pausing, the scrubber jumps the silence, and the transcript lets the eye search faster than the ear. A visualizer that is beautiful but does not make messages faster to consume has decorated the chore instead of shortening it.

## Building the waveform from audio

The waveform is precomputed, not live. When the message arrives or first plays, decode the audio file, walk its samples, and reduce them to a few hundred buckets, each holding the peak or average level of its slice of time, then render those as bars. The reduction is the whole trick: a thirty-second message at a [44,100 Hz](https://developer.apple.com/documentation/avfaudio/avaudioplayer) sample rate is 1,323,000 samples, and the screen needs perhaps three hundred values, so the work is a fast downsampling pass, the kind of vectorized batch operation [Accelerate](https://developer.apple.com/documentation/accelerate) exists for, done once and cached with the message.
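A minimal sketch of that reduction, written in plain Swift so the logic is easy to follow. The function name `peakBuckets` is hypothetical, and the input is assumed to be mono `Float` samples in -1...1 that have already been decoded; decoding is left out. In an app, the inner loop is the part Accelerate would replace (for example with `vDSP.maximumMagnitude` over each slice).

```swift
/// Reduces raw samples to at most `bucketCount` peak levels in 0...1.
/// Assumption: `samples` is mono Float PCM in -1...1, already decoded.
func peakBuckets(from samples: [Float], bucketCount: Int) -> [Float] {
    guard bucketCount > 0, !samples.isEmpty else { return [] }
    // Never ask for more buckets than there are samples.
    let count = min(bucketCount, samples.count)
    var buckets = [Float](repeating: 0, count: count)
    for index in 0..<count {
        // Integer arithmetic spreads any remainder evenly across buckets.
        let start = index * samples.count / count
        let end = (index + 1) * samples.count / count
        var peak: Float = 0
        for sample in samples[start..<end] {
            peak = max(peak, abs(sample))
        }
        // Clamp in case the decoder produced values slightly above 1.
        buckets[index] = min(peak, 1)
    }
    return buckets
}

// Six samples reduced to three buckets; the middle bucket is silence.
let demoLevels = peakBuckets(from: [0.1, -0.5, 0.0, 0.0, 0.9, -0.2], bucketCount: 3)
```

For a thirty-second message at 44,100 Hz reduced to 300 buckets, each bucket summarizes 4,410 samples. Peak levels keep short consonants visible; swapping `max` for a running average is the other option the text mentions and gives a smoother but flatter shape.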

Render-wise, the bars are a single drawn view, not three hundred subviews, with two layers: the full waveform in a quiet color, and the played portion overlaid in the accent, clipped at the playhead. That two-layer trick makes progress legible at a glance and costs nothing, since playback progress just moves a clip edge rather than recoloring bars. The same craft, recording-side, is covered in the [audio waveform recorder](/blogs/audio-waveform-recorder-ui-react-native/), where the buckets arrive live instead of from a file.
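The geometry behind the two layers can be sketched without any UI code. The names below (`WaveformBar`, `waveformBars`, `playedWidth`) are hypothetical, and the one-point gap and minimum height are assumed defaults. In SwiftUI, the bar rectangles could feed a single `Path` drawn twice, with the accent copy masked to `playedWidth`.

```swift
struct WaveformBar: Equatable {
    let x: Double
    let width: Double
    let height: Double
}

/// Lays out one bar per level across `totalWidth`, with `gap` points between bars.
func waveformBars(levels: [Float], totalWidth: Double, maxHeight: Double,
                  gap: Double = 1) -> [WaveformBar] {
    guard !levels.isEmpty, totalWidth > 0 else { return [] }
    let slot = totalWidth / Double(levels.count)
    let barWidth = max(slot - gap, 0.5)
    return levels.enumerated().map { index, level in
        // Clamp to 0...1 and keep a 1-point minimum so silence reads as a flat line.
        let clamped = Double(min(max(level, 0), 1))
        let height = max(clamped * maxHeight, 1)
        return WaveformBar(x: Double(index) * slot, width: barWidth, height: height)
    }
}

/// Width of the accent layer's clip: the only value that changes during playback.
func playedWidth(currentTime: Double, duration: Double, totalWidth: Double) -> Double {
    guard duration > 0 else { return 0 }
    let fraction = min(max(currentTime / duration, 0), 1)
    return fraction * totalWidth
}

let demoBars = waveformBars(levels: [0, 0.5, 1], totalWidth: 30, maxHeight: 40)
let demoClip = playedWidth(currentTime: 15, duration: 30, totalWidth: 300)
```

The bars are computed once per layout, while `playedWidth` is recomputed on every clock tick, which is why progress is cheap: nothing about the bars themselves changes.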

## The scrubber mechanics

The scrubber is a mapping between two spaces, pixels and seconds, and every behavior falls out of doing that mapping honestly. A drag anywhere on the waveform seeks: touch position over width gives the fraction, fraction times duration gives the time, and playback resumes from there. While dragging, show the would-be position, a timestamp lifting above the finger, and let the audio follow the finger live if seeking is cheap, or on release if it is not, but pick one and commit, because a scrubber that sometimes follows and sometimes lags feels broken even when it is merely inconsistent.
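The pixel-to-seconds mapping is small enough to show whole. This is a sketch with hypothetical names (`scrubTime`, `scrubLabel`); clamping the fraction to 0...1 is an assumption about how a drag that leaves the waveform's bounds should behave.

```swift
/// Maps a horizontal touch position to a playback time.
func scrubTime(touchX: Double, width: Double, duration: Double) -> Double {
    guard width > 0, duration > 0 else { return 0 }
    // Touch position over width gives the fraction, clamped to the track.
    let fraction = min(max(touchX / width, 0), 1)
    // Fraction times duration gives the time.
    return fraction * duration
}

/// Formats the would-be position for the label that lifts above the finger.
func scrubLabel(_ seconds: Double) -> String {
    let total = max(Int(seconds.rounded(.down)), 0)
    let minutes = total / 60
    let secs = total % 60
    let padded = secs < 10 ? "0\(secs)" : "\(secs)"
    return "\(minutes):\(padded)"
}

// A drag at the midpoint of a 300-point waveform over a 42-second message.
let demoScrub = scrubTime(touchX: 150, width: 300, duration: 42)
let demoLabel = scrubLabel(75.9)
```

Whether the result is sent to the player on every drag update or only on release is the follow-versus-commit choice described above; the mapping is the same either way.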

Three touches separate a scrubber that feels engineered from one that feels generated. The playhead and the audio never disagree, because one source of truth, the player's time, drives both. The drag target is generous, the full height of the waveform, not a thin line that demands precision from a thumb. And a haptic tick on grab and on release gives the gesture edges, the same vocabulary every good [timeline scrubber](/blogs/podcast-player-timeline-scrubber-ui/) speaks. Skip-silence is the genre's power move: since the waveform data already knows where the quiet buckets are, a single tap can jump the playhead past the pause to the next speech, which on real voicemails saves more time than any speed control.
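Skip-silence over the bucket data might look like the sketch below. The function name `nextSpeechTime` and the 0.05 threshold are assumptions for illustration; a real threshold would need tuning against telephone-grade audio, and the levels are assumed to be normalized to 0...1.

```swift
/// Returns the time where speech resumes, or nil if the playhead is not in a
/// quiet bucket or no louder bucket follows.
func nextSpeechTime(after currentTime: Double, levels: [Float], duration: Double,
                    silenceThreshold: Float = 0.05) -> Double? {
    guard !levels.isEmpty, duration > 0 else { return nil }
    let bucketDuration = duration / Double(levels.count)
    // Find the bucket under the playhead, clamped to the valid range.
    let clampedTime = min(max(currentTime, 0), duration)
    let current = min(Int(clampedTime / bucketDuration), levels.count - 1)
    // Only offer a skip when the playhead is actually sitting in a pause.
    guard levels[current] < silenceThreshold else { return nil }
    // Scan forward for the first bucket at or above the threshold.
    for index in (current + 1)..<levels.count where levels[index] >= silenceThreshold {
        return Double(index) * bucketDuration
    }
    return nil
}

// Ten seconds, five buckets of two seconds each; buckets 1 and 2 are a pause.
let demoSkip = nextSpeechTime(after: 3, levels: [0.8, 0.02, 0.01, 0.6, 0.7], duration: 10)
```

The returned time is the start of the next loud bucket, so the resolution of the skip is one bucket; with a few hundred buckets on a short message that is well under a second.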

## The transcript as the second surface

The transcript is not a caption; it is a second scrubber. On-device speech recognition through the [Speech framework](https://developer.apple.com/documentation/speech) turns the message into timestamped text, and the binding runs both ways: the current sentence highlights as playback moves, and tapping any word seeks the audio to that moment. For triage, the eye reads the transcript in a tenth of the listening time, which means for many messages the audio never plays at all, and that is the feature succeeding, not failing.
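Both directions of that binding reduce to lookups over a sorted list of word timings. In the sketch below, `TimedWord` is a hypothetical type whose fields mirror what the Speech framework's `SFTranscriptionSegment` reports (`substring`, `timestamp`, `duration`); the words are assumed to be sorted by start time.

```swift
struct TimedWord: Equatable {
    let text: String
    let start: Double
    let duration: Double
}

/// Index of the most recent word at `time`, or nil before the first word.
/// During a pause the previous word stays highlighted.
func activeWordIndex(at time: Double, in words: [TimedWord]) -> Int? {
    // Binary search for the last word whose start is <= time.
    var low = 0
    var high = words.count
    while low < high {
        let mid = (low + high) / 2
        if words[mid].start <= time {
            low = mid + 1
        } else {
            high = mid
        }
    }
    return low == 0 ? nil : low - 1
}

/// Time to seek to when the word at `index` is tapped.
func tapSeekTime(forWordAt index: Int, in words: [TimedWord]) -> Double? {
    guard words.indices.contains(index) else { return nil }
    return words[index].start
}

let demoWords = [
    TimedWord(text: "Call", start: 0.4, duration: 0.3),
    TimedWord(text: "me", start: 0.8, duration: 0.2),
    TimedWord(text: "back", start: 1.1, duration: 0.4),
]
let demoActive = activeWordIndex(at: 0.9, in: demoWords)
```

A tap only produces a seek time for the player; the highlight then follows because `activeWordIndex` is always fed the player's reported time, never the tap itself.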

Honesty about recognition quality belongs in the design. Voicemail audio is telephone-grade, names and numbers get mangled, and accents stress the model, so render the transcript as a useful draft rather than a record, and keep the audio one tap away from any doubted word. Where a number or address matters, the user will verify by ear, and the design should make that verification effortless: tap the suspicious word, hear the original. The transcription craft in full lives in the [Whisper transcription UI](/blogs/whisper-voice-transcription-app-ui-swiftui/), and the same tap-to-seek binding powers the [snippet clipper](/blogs/podcast-snippet-clipper-ui-react-native/), where selected text becomes a shareable clip.

## States, speed, and the listening realities

Playback speed earns its place in voicemail more than almost anywhere: 1.5x on a rambling message is the difference between listening and waiting, and pitch-corrected speedup is built into the platform players. Pair it with the five-second skip-back, the universal "wait, what was that number" control, and remember position per message, because a long voicemail interrupted by real life should resume where it stopped, not restart.

| Control | Why voicemail needs it |
| --- | --- |
| Speed (1x, 1.5x, 2x) | Most messages are slower than the listener |
| Skip silence | Voicemail pauses are long and frequent |
| Skip back 5s | Numbers and names get missed on first pass |
| Per-message resume | Interruptions are the normal case |
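Skip-back and per-message resume are mostly bookkeeping. The sketch below uses hypothetical names (`PlaybackPositions`, `skipBack`) and keeps positions in memory only; persistence is left out. Clearing the saved position within the last second of a message is an assumption, so that a finished message restarts from the top.

```swift
/// Remembers where each message was left, keyed by a message identifier.
struct PlaybackPositions {
    private var saved: [String: Double] = [:]

    mutating func save(_ time: Double, for messageID: String, duration: Double) {
        if duration - time < 1 {
            // Effectively finished: forget the position.
            saved[messageID] = nil
        } else {
            saved[messageID] = max(time, 0)
        }
    }

    /// Where playback should start; 0 when nothing was saved.
    func resumeTime(for messageID: String) -> Double {
        saved[messageID] ?? 0
    }
}

/// The skip-back control, clamped at the start of the message.
func skipBack(from time: Double, by interval: Double = 5) -> Double {
    max(time - interval, 0)
}

var demoPositions = PlaybackPositions()
demoPositions.save(37.5, for: "vm-1", duration: 120)
let demoResume = demoPositions.resumeTime(for: "vm-1")
```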

The surrounding states need the same care as playback: a message can be downloading, transcribing, played, or failed, and each is a different row appearance. Unplayed deserves the strongest signal in the inbox, since triage is the screen's job, and a transcribing row should still be playable, the text arriving when it arrives, because holding the audio hostage to the transcript inverts the priorities. The screens those rows live on, the inbox list, the expanded player, the transcript view, are ordinary iOS design, and a free [VP0](https://vp0.com) design provides them as real layouts with a machine-readable source page an agent like Claude Code or Cursor extends from a pasted link, while you wire the audio pipeline underneath.
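One way to keep a transcribing row playable is to model audio and transcript as separate states, so neither can block the other. This is a sketch of one possible model, not the only one; all type and property names are hypothetical.

```swift
enum AudioState: Equatable { case downloading, ready, failed }
enum TranscriptState: Equatable { case pending, ready, unavailable }

struct MessageRowState {
    var audio: AudioState
    var transcript: TranscriptState
    var hasBeenPlayed: Bool

    /// Playability depends only on the audio, never on the transcript.
    var isPlayable: Bool { audio == .ready }

    /// Unplayed gets the strongest signal, unless the message failed outright.
    var showsUnplayedBadge: Bool { audio != .failed && !hasBeenPlayed }

    /// Text shown where the transcript will appear, if it is not ready yet.
    var transcriptPlaceholder: String? {
        switch transcript {
        case .pending: return "Transcribing…"
        case .unavailable: return "Transcript unavailable"
        case .ready: return nil
        }
    }
}

// Audio has arrived, the transcript has not: the row is still playable.
let demoRow = MessageRowState(audio: .ready, transcript: .pending, hasBeenPlayed: false)
```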

## Common mistakes when vibe coding the visualizer

The signature failure is the fake waveform: the agent renders pleasing random bars that have nothing to do with the audio, and the screen looks right in every demo while lying about the one thing it exists to show. Insist on the real pipeline (decode, downsample, cache) and verify by playing a message with a long pause, which should be visibly flat in the bars.

Three more recur. The playhead drifts from the audio because the UI animates its own timeline instead of reading the player's clock; bind every visual to the player's reported time and the drift disappears. The waveform recomputes on every appearance, making the inbox stutter on scroll, when the buckets should be computed once and stored with the message. And the scrubber's gesture fights the list's scroll unless the gesture priorities are set deliberately, so that vertical drags scroll the list and horizontal drags seek, which generated code never does on its own. Each is invisible in a one-message demo and obvious in a twenty-message inbox.

## Key takeaways: a voicemail visualizer in SwiftUI

- **The job is triage speed.** Waveform, scrubber, and transcript all exist to shorten the path to the sentence that matters.
- **Waveform is precomputed.** Decode once, downsample to buckets, cache; two layers make progress free.
- **One clock drives everything.** Playhead, highlight, and bars all read the player's time.
- **The transcript is a second scrubber.** Tap a word, hear that moment; treat the text as a draft, not a record.
- **Start the screens from a free VP0 design.** Inbox, player, and transcript arrive shaped; you own the audio pipeline.

## What to build first

Build the real waveform pipeline before any polish: decode, downsample with Accelerate, cache the buckets, render the two-layer bars, and bind the playhead to the player's clock. Add the scrubber with a generous hit area and committed follow behavior, then the transcript with tap-to-seek, then skip-silence, which the bucket data gives you almost free. Start the inbox and player screens from a free VP0 design extended by your agent, and spend your attention on the seek feel and the silence detection, the two places where engineering quality is audible. If the product only ever plays short confirmations, a plain progress bar serves honestly and the full visualizer is ceremony; build it when messages are long, rambling, and worth triaging, which is to say, when they are actual voicemail.

## Frequently asked questions

**How do I build a voicemail visualizer with a scrubber in SwiftUI?** Precompute the waveform: decode the audio, downsample its samples into a few hundred peak buckets with Accelerate, cache them, and draw two layers, the full waveform quiet and the played portion accented, clipped at the playhead. Make the whole waveform a drag target that maps touch position to seconds, drive every visual from the player's reported time, and add a transcript with tap-to-seek from the Speech framework. A free VP0 design supplies the inbox and player screens an agent extends.

**How do I draw a real audio waveform instead of fake bars?** Read the actual samples: decode the file, walk the raw audio, and reduce each slice of time to its peak or average level, a few hundred buckets for a screen-width waveform. Accelerate makes the pass effectively instant, and caching the result with the message means it happens once. Verify honesty with a message containing a long pause, which must render visibly flat. Random decorative bars look identical in demos and betray the user the first time they scrub toward what looks like speech.

**How does tap-to-seek on the transcript work?** On-device recognition returns words with timestamps, so the transcript is a time-indexed surface: tapping a word seeks the player to that word's moment, and as playback runs, the current segment highlights by comparing the player's clock against the word timings. Drive the highlight one way, from the player's time: a tap only issues a seek, and the highlight moves once the player reports the new position. Treat the text as a useful draft, because telephone audio mangles names and numbers, so the design's job is making verification effortless: tap the doubted word, hear the original audio.

**Should voicemail playback have speed controls?** Yes, more than most audio: messages are spoken slower than listeners think, so pitch-corrected 1.5x is the everyday setting, with 2x for the ramblers and a five-second skip-back for the missed phone number. Skip-silence is the bigger win, jumping the long pauses voicemail is full of, and the waveform's bucket data already knows where they are. Remember position per message too, because interruptions are the normal case and restarting a two-minute message is how voicemail stays hated.

**Is there a free template for a voicemail app UI?** The screens are the reusable part, the message inbox with state-aware rows, the expanded player with waveform and controls, the transcript view, and VP0 provides them free: real iOS designs with a machine-readable source page that Claude Code, Cursor, or another agent reads from a pasted link and extends. The audio pipeline, the downsampling, the seek feel, and the silence detection are the engineering you own, and they are exactly where the listening experience is won or lost.

---
*Published on the [VP0 Journal](https://vp0.com/blogs). Free to read, index and cite with attribution.*