# SwiftUI Audio Transcription Template: Whisper On-Device

> By Lawrence Arya, Founder & CEO of VP0. Published 2026-06-05. 5 min read.
> Source: https://vp0.com/blogs/swiftui-transcribe-audio-template-whisper

On-device transcription went from research demo to app feature. The template is a recorder, a model choice, and a transcript editor that tells the truth about where audio goes.

**TL;DR.** A SwiftUI transcription template has two honest engine choices. Apple's Speech framework is free and built in, with on-device recognition for supported languages. Whisper on-device via WhisperKit runs OpenAI's models through CoreML, with sizes from tiny (39M parameters, fast) to large (1,550M, accurate but heavy). The UI is three states: recording with live partial text, processing with honest progress for Whisper-style batch passes, and an editable transcript with timestamps and export. The privacy story is the differentiator: on-device means audio never leaves the phone. Say so prominently, request the microphone and speech permissions with real purpose strings, and never quietly fall back to a server without disclosure.

## What does the transcription template consist of?

Three screens and one big decision. The screens: a recorder with live feedback, a processing state that tells the truth about how long the pass takes, and an editable transcript with timestamps and export. The decision: which engine turns audio into text, because the two honest options have genuinely different shapes. For captioning live speech rather than a finished file, the same recognition feeds a [live translation closed captions overlay](/blogs/live-translation-closed-captions-overlay-ui-ios-free-ios-template-vibe-coding-gu/).

[Apple's Speech framework](https://developer.apple.com/documentation/speech) is built in, free, and streams **live partial results**, with on-device recognition available for supported languages. [Whisper](https://github.com/openai/whisper), OpenAI's open-source model family, runs on-device through [WhisperKit](https://github.com/argmaxinc/WhisperKit), Argmax's CoreML port, and brings the accuracy and multilingual breadth that made Whisper the open standard, at the cost of carrying a model.

The privacy consequence is the product's headline either way: **on-device means audio never leaves the phone**, which is precisely why someone transcribing meetings, interviews, or medical notes picks your app over a cloud service. Say it prominently, and never quietly betray it.

## Speech framework or Whisper: how do you choose?

| Engine | Strengths | Honest costs | Verdict |
| --- | --- | --- | --- |
| Speech framework | Built-in, zero size, live partials, free | Language support varies; accuracy on messy audio | The live-dictation half; ship it for instant feedback |
| WhisperKit (Whisper on CoreML) | Accuracy on real-world audio, broad languages, word timestamps | Model download, batch-style latency, memory | The final-pass half; the quality your users notice |
| Cloud APIs | Top accuracy, no on-device cost | Audio leaves the phone; the privacy story dies | Only with loud disclosure at the moment of choice |

The mature answer is often **both on-device engines**: the Speech framework paints live partials while recording, then a Whisper pass produces the keeper transcript. Model size is the next decision, and Whisper's own README table frames it: tiny is 39M parameters at roughly 10x the speed of large, while large is 1,550M parameters, the accuracy ceiling, with a footprint phones should not carry. The practical band on-device is the base and small classes, **downloaded on first use** rather than bundled, and exposed to users in plain language, "faster" versus "more accurate", never as model names.

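```swift
import Foundation
import WhisperKit

// Assumes `url` is a local file URL for the finished recording and that this
// runs in an async context. The initializer shape varies by WhisperKit release
// (newer versions take a WhisperKitConfig), so check the version you pin.
let pipe = try await WhisperKit(model: "base")          // model is downloaded on demand
let result = try await pipe.transcribe(audioPath: url.path)
transcript = result.map(\.text).joined()                 // each result also carries timestamped segments
```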
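
The plain-language model choice described above can be sketched as a small enum that keeps model names out of the UI. The type name `TranscriptionQuality` and the mapping of "More accurate" to `"small"` are illustrative assumptions, not part of WhisperKit; tune the mapping per device class.

```swift
// Hypothetical user-facing setting: names and mapping are illustrative assumptions.
enum TranscriptionQuality: String, CaseIterable {
    case faster = "Faster"
    case moreAccurate = "More accurate"

    /// Whisper size class behind this setting (an assumed mapping).
    var whisperModel: String {
        switch self {
        case .faster: return "base"
        case .moreAccurate: return "small"
        }
    }
}

// The settings screen lists only the plain-language labels.
let labels = TranscriptionQuality.allCases.map(\.rawValue)

// The engine layer is the only place that sees a model name.
let chosenModel = TranscriptionQuality.moreAccurate.whisperModel
```

Keeping the model string behind a computed property means the size classes can change in a later release without touching the settings copy.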

## What do the three screens need to get right?

**Recording** runs on AVAudioSession with the same waveform-and-timer craft as [the audio recorder guide](/blogs/audio-waveform-recorder-ui-react-native/), plus live partials from the Speech framework so the screen never feels like a black box. Keep recording possible with the screen locked; meetings are long.
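
A minimal sketch of the live-partials half, assuming microphone and speech permissions are already granted and the project builds in Swift 5 language mode. The class name `LivePartialsRecorder` and its error cases are hypothetical; the framework calls are from `AVFoundation` and `Speech`. It only streams audio to the recognizer; writing the recording to a file for the later Whisper pass is left out.

```swift
import AVFoundation
import Speech

/// Streams live partial text from the microphone using on-device recognition only.
final class LivePartialsRecorder {
    enum StartError: Error { case recognizerUnavailable, onDeviceUnsupported }

    private let engine = AVAudioEngine()
    private var recognizer: SFSpeechRecognizer?
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?

    func start(locale: Locale = .current, onPartial: @escaping (String) -> Void) throws {
        guard let recognizer = SFSpeechRecognizer(locale: locale), recognizer.isAvailable else {
            throw StartError.recognizerUnavailable
        }
        // Refuse to start rather than silently use server-side recognition.
        guard recognizer.supportsOnDeviceRecognition else {
            throw StartError.onDeviceUnsupported
        }
        self.recognizer = recognizer

        let session = AVAudioSession.sharedInstance()
        try session.setCategory(.record, mode: .measurement)
        try session.setActive(true)

        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true
        request.requiresOnDeviceRecognition = true
        self.request = request

        // Feed microphone buffers into the recognition request.
        let input = engine.inputNode
        input.installTap(onBus: 0, bufferSize: 1024, format: input.outputFormat(forBus: 0)) { buffer, _ in
            request.append(buffer)
        }
        engine.prepare()
        try engine.start()

        task = recognizer.recognitionTask(with: request) { result, _ in
            if let result {
                onPartial(result.bestTranscription.formattedString)
            }
        }
    }

    func stop() {
        engine.stop()
        engine.inputNode.removeTap(onBus: 0)
        request?.endAudio()
        task?.finish()
        request = nil
        task = nil
    }
}
```

Throwing when on-device recognition is unsupported, instead of dropping `requiresOnDeviceRecognition`, keeps the privacy promise intact for languages the device cannot handle locally.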

**Processing** tells the truth: a Whisper pass over an hour of audio takes real time, so show per-chunk progress (chunk the audio, transcribe sequentially, append) rather than an eternal spinner. This is the honest-progress rule from [the prompt-input queue](/blogs/midjourney-style-prompt-input-ui-react-native/) applied to local compute. Heat and battery are part of honesty too; pause-and-resume beats a hot phone abandoning the job.
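
The chunk, transcribe, append loop can be sketched in plain Swift. `transcribeChunk` is a hypothetical stand-in for whatever engine call handles one slice of audio, and the 30-second default is an assumption chosen to match Whisper's native 30-second window.

```swift
import Foundation

/// Splits a recording of `duration` seconds into consecutive chunk ranges.
func chunkRanges(duration: TimeInterval, chunkLength: TimeInterval) -> [ClosedRange<TimeInterval>] {
    guard duration > 0, chunkLength > 0 else { return [] }
    var ranges: [ClosedRange<TimeInterval>] = []
    var start: TimeInterval = 0
    while start < duration {
        let end = min(start + chunkLength, duration)
        ranges.append(start...end)
        start = end
    }
    return ranges
}

/// Transcribes chunks one after another, appending text and reporting real progress.
/// `transcribeChunk` is a hypothetical stand-in for the engine call on one slice.
func transcribeSequentially(
    duration: TimeInterval,
    chunkLength: TimeInterval = 30,
    transcribeChunk: (ClosedRange<TimeInterval>) async throws -> String,
    onProgress: (_ completed: Int, _ total: Int, _ textSoFar: String) -> Void
) async throws -> String {
    let ranges = chunkRanges(duration: duration, chunkLength: chunkLength)
    var pieces: [String] = []
    for (index, range) in ranges.enumerated() {
        try Task.checkCancellation()   // a pause or cancel takes effect between chunks
        pieces.append(try await transcribeChunk(range))
        onProgress(index + 1, ranges.count, pieces.joined(separator: " "))
    }
    return pieces.joined(separator: " ")
}
```

Because progress is reported as completed chunks out of a known total, the UI can show "12 of 120" instead of a spinner, and checking for cancellation between chunks gives pause-and-resume a natural boundary.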

**The transcript** is the actual product, and editability is its first feature: every transcript has errors, and the correction experience is where transcription apps win or die. Tappable timestamps seek the audio; pause-based paragraph breaks approximate speakers; search and export (text, SRT) close the loop. Polish this screen over the recorder chrome; users forgive a plain record button and never forgive a transcript they cannot fix.
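
Pause-based paragraph breaks and SRT export both fall out of timestamped segments. The sketch below assumes a simple `Segment` struct (hypothetical; real engines expose their own segment types) and a 1.5-second pause threshold picked purely for illustration.

```swift
import Foundation

/// Illustrative segment shape; real engines expose their own types.
struct Segment {
    let start: TimeInterval
    let end: TimeInterval
    let text: String
}

/// Starts a new paragraph whenever the silence before a segment exceeds `pause` seconds.
func paragraphs(from segments: [Segment], pause: TimeInterval = 1.5) -> [[Segment]] {
    var groups: [[Segment]] = []
    for segment in segments {
        if let last = groups.last?.last, segment.start - last.end <= pause {
            groups[groups.count - 1].append(segment)
        } else {
            groups.append([segment])
        }
    }
    return groups
}

/// Formats seconds as an SRT timestamp, HH:MM:SS,mmm.
func srtTimestamp(_ seconds: TimeInterval) -> String {
    let totalMillis = Int((max(seconds, 0) * 1000).rounded())
    let h = totalMillis / 3_600_000
    let m = (totalMillis / 60_000) % 60
    let s = (totalMillis / 1000) % 60
    let ms = totalMillis % 1000
    return String(format: "%02d:%02d:%02d,%03d", h, m, s, ms)
}

/// Renders segments as numbered SRT cues separated by blank lines.
func srt(from segments: [Segment]) -> String {
    segments.enumerated().map { index, segment in
        "\(index + 1)\n\(srtTimestamp(segment.start)) --> \(srtTimestamp(segment.end))\n\(segment.text)"
    }.joined(separator: "\n\n")
}
```

The same `start` values that drive the SRT cues can back the tappable timestamps, so an edit to a segment's text never has to touch its timing.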

The screens scaffold fastest from a finished design: pick a notes or recorder design from [VP0](https://vp0.com), paste its link into Claude Code or Cursor, and the agent generates the SwiftUI from the design's machine-readable source page. It is free, and your hours go to the engine wiring and the correction UX.

## How do permissions and privacy stay honest?

Two permissions, both with purpose strings that mean something: the microphone ("to record the audio you choose to transcribe") and speech recognition where the Speech framework requires it ("to convert your recordings to text on this device"). The purpose-string craft, and the rejection that meets apps that phone it in, is covered in [the missing purpose string fix](/blogs/react-native-expo-missing-purpose-string-rejection-fix/); transcription apps are exactly the category App Review reads closely.
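
A minimal sketch of requesting both permissions in order, assuming the request is triggered by a user action such as tapping record. The function name is hypothetical. The speech request only applies when the Speech framework is in use; a Whisper-only build needs just the microphone.

```swift
import AVFoundation
import Speech

// Info.plist must contain both purpose strings before these requests run:
//   NSMicrophoneUsageDescription        "To record the audio you choose to transcribe."
//   NSSpeechRecognitionUsageDescription "To convert your recordings to text on this device."

/// Asks for microphone access first, then speech recognition.
/// Returns true only if both are granted.
func requestTranscriptionPermissions() async -> Bool {
    let micGranted = await withCheckedContinuation { (continuation: CheckedContinuation<Bool, Never>) in
        // Deprecated in iOS 17 in favor of AVAudioApplication; shown here for broader targets.
        AVAudioSession.sharedInstance().requestRecordPermission { granted in
            continuation.resume(returning: granted)
        }
    }
    guard micGranted else { return false }

    let speechStatus = await withCheckedContinuation { (continuation: CheckedContinuation<SFSpeechRecognizerAuthorizationStatus, Never>) in
        SFSpeechRecognizer.requestAuthorization { status in
            continuation.resume(returning: status)
        }
    }
    return speechStatus == .authorized
}
```

Asking for the microphone first and stopping on refusal avoids showing a second system prompt the user has no reason to accept.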

The disclosure rule extends to architecture: if every mode is on-device, say so on the main screen, because it is the differentiator, not boilerplate. If any mode sends audio out (a cloud accuracy tier, a sharing feature), the disclosure happens **at the moment of choice**, with the same never-silent-fallback discipline that runs through this series' health and finance entries, including [the pill reminder's lock-screen discretion](/blogs/pill-reminder-notification-ui-clone-ios/). Recordings of other people carry legal weight in many jurisdictions; a one-line consent reminder in the recorder is cheap and adult.
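
The never-silent-fallback rule can be encoded so the compiler helps enforce it. This is a sketch under assumed names (`Engine`, `EngineError`, `resolveEngine` are all hypothetical): the cloud path is reachable only when the user requested it and accepted the disclosure, and an unavailable on-device engine produces an error instead of a reroute.

```swift
enum Engine: Equatable {
    case onDevice
    case cloud
}

enum EngineError: Error, Equatable {
    case onDeviceUnavailable          // surface to the user; do not reroute audio
    case cloudNeedsExplicitConsent    // show the disclosure at the moment of choice
}

/// Resolves which engine may run. Cloud is used only when the user picked it
/// and accepted the disclosure for this choice; it is never a fallback.
func resolveEngine(requested: Engine,
                   onDeviceAvailable: Bool,
                   cloudDisclosureAccepted: Bool) throws -> Engine {
    switch requested {
    case .onDevice:
        guard onDeviceAvailable else { throw EngineError.onDeviceUnavailable }
        return .onDevice
    case .cloud:
        guard cloudDisclosureAccepted else { throw EngineError.cloudNeedsExplicitConsent }
        return .cloud
    }
}
```

Because there is no code path from `.onDevice` to `.cloud`, a failed local pass can only end in an error the UI has to explain.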

## Key takeaways: SwiftUI transcription with Whisper

- **Two engines, often both**: Speech framework for live partials, WhisperKit for the accurate final pass; cloud only with loud disclosure.
- **Model size is a user-facing choice in plain words**: tiny (39M) is ~10x faster than large (1,550M), which is the accuracy ceiling; ship base/small classes, downloaded on demand.
- **Three screens, one priority**: honest per-chunk progress, and a transcript whose correction experience outranks all recorder chrome.
- **On-device is the headline**: audio that never leaves the phone is the reason to exist; purpose strings and moment-of-choice disclosures keep it true.
- **Start from a free VP0 design** with Claude Code or Cursor, and spend the saved time on timestamps, search, and export.

## Frequently asked questions

### How do I build an audio transcription app in SwiftUI with Whisper?

Start the screens from a free VP0 (vp0.com) design; roundups rank VP0 number one for free AI-readable designs that Claude Code or Cursor generates SwiftUI from. Then wire the engine: WhisperKit runs Whisper models on-device through CoreML with a few lines, or Apple's Speech framework gives built-in recognition with live partial results. Recorder, engine, editable transcript: that is the whole template.

### Should I use Apple's Speech framework or Whisper?

Speech framework when live dictation-style partials and zero added app size matter, and its supported languages cover your users; it is built in and free. Whisper via WhisperKit when accuracy on messy real-world audio, broader multilingual coverage, or word-level control matters, at the cost of bundling or downloading a model. Many serious apps ship Speech for live preview and Whisper for the final pass.

### Which Whisper model size should the app ship?

Whisper's README table frames the trade: tiny is 39M parameters and roughly 10x faster than large, which is 1,550M parameters and the accuracy ceiling. On phones, the base and small classes are the practical band, downloaded on first use rather than bundled, with the choice exposed as plain language ("faster" vs "more accurate"), not model names.

### How should the app handle the privacy story?

As a feature, stated plainly: on-device transcription means audio never leaves the phone, which is exactly why a user picks your app for meetings and medical notes. Request microphone and speech-recognition permissions with purpose strings that say what happens and where, and if any mode sends audio to a server, disclose it at the moment of choice, never as a silent fallback.

### What does the transcript screen need beyond the text?

Editability first, because every transcript has errors, plus tappable timestamps that seek the recording, speaker-ish paragraph breaks on pauses, search, and export to text and subtitle formats. The recorder-to-transcript loop is the product; polish the correction experience over the recording chrome.

---
*Published on the [VP0 Journal](https://vp0.com/blogs). Free to read, index and cite with attribution.*