Build a Live Translation Closed Captions Overlay on iOS

Live captions transcribe speech in real time and stay readable over anything. Here is how to build the overlay on iOS.

Lawrence Arya, Founder & CEO of VP0 · June 8, 2026 · 8 min read · Updated June 8, 2026

TL;DR

A live translation and closed-captions overlay is a caption bar that transcribes spoken audio in real time, optionally translates it, and stays readable over any content. On iOS it is built from the Speech framework for recognition and the Translation framework for translation, with an overlay that updates as partial results stream in. The hard parts are keeping up with the speaker, staying legible over video, and being honest that machine captions and translations make mistakes. A free VP0 live-caption overlay template gives an agent the caption bar, the streaming behavior, and the states to extend, while you wire recognition and translation.

What a live caption overlay actually does

A live translation and closed-captions overlay does three jobs at once: it listens to spoken audio, turns it into text in real time, and shows that text as a readable bar over whatever is on screen, optionally translated into another language. It is the band of words at the bottom of a video call, a live event, or a media player, keeping pace with the speaker. On iOS, the recognition comes from the Speech framework, the optional translation from the Translation framework, and the audio is captured through AVFoundation. The overlay is the part you design, and it has to do something subtle: update continuously without becoming a jittery mess, and stay legible over a busy 1,920 by 1,080 video.

Seeing it as a pipeline (audio, then recognition, then optional translation, then a readable overlay) sets the priorities. Recognition and translation are framework calls; the real design work is the overlay that presents a stream of changing text clearly.
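To make the pipeline shape concrete, here is a minimal sketch of the stage between recognition and the overlay. Every name in it (`Caption`, `CaptionPipeline`, `translate`) is hypothetical and not part of Apple's frameworks, and the translation step is modeled as a plain synchronous closure for brevity; a real translation call would likely be asynchronous.

```swift
import Foundation

// Hypothetical types for illustration only; not Apple API.
struct Caption: Equatable {
    let text: String
    let isTranslated: Bool   // lets the overlay show a "machine-translated" label
}

struct CaptionPipeline {
    // Optional translation stage. nil means captions stay in the spoken language.
    // The closure returns nil when it cannot translate the input.
    var translate: ((String) -> String?)? = nil

    // Takes recognized text and returns what the overlay should display.
    func process(recognized: String) -> Caption {
        let source = recognized.trimmingCharacters(in: .whitespacesAndNewlines)
        if let translate = translate, let translated = translate(source) {
            return Caption(text: translated, isTranslated: true)
        }
        // No translator configured, or translation failed: fall back to the source text.
        return Caption(text: source, isTranslated: false)
    }
}
```

Carrying `isTranslated` alongside the text is one way to keep the labeling decision in the overlay, where it can be shown consistently, rather than scattered through the recognition code.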

Streaming partial results is what makes it feel live

The detail that separates a live caption from a laggy one is partial results. A speech recognizer returns its best guess continuously, refining the words as it hears more, so a real-time caption shows that evolving text rather than waiting for a finished sentence. If you wait for the recognizer to finalize, the captions arrive a beat or two behind the speaker, which feels broken; if you stream the partials, the words appear as they are spoken and settle as the recognizer grows confident. The overlay has to handle that gracefully: the latest partial updates in place, completed phrases scroll up, and the text never flickers distractingly as it refines.

This is the core interaction, and it is easy to underestimate. A caption that only shows final results is technically correct and feels slow, so the streaming, in-place update is what makes the overlay read as genuinely live.
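The in-place update can be sketched as a small buffer that separates finished phrases from the current guess. In the Speech framework, a recognition request has a `shouldReportPartialResults` property and each result carries an `isFinal` flag; the `CaptionBuffer` type below is a hypothetical illustration of what to do with those callbacks, and it assumes each partial restates the whole phrase so far.

```swift
// Hypothetical buffer for illustration; feed it from the recognizer's result callback.
struct CaptionBuffer {
    private(set) var committed: [String] = []   // finalized phrases, oldest first
    private(set) var partial: String = ""       // the recognizer's current best guess

    mutating func update(text: String, isFinal: Bool) {
        if isFinal {
            // The phrase has settled: move it to history and clear the live line.
            if !text.isEmpty { committed.append(text) }
            partial = ""
        } else {
            // Replace rather than append, so revised words overwrite the old guess.
            partial = text
        }
    }

    // Finished phrases followed by the live line, ready for the overlay to render.
    var displayLines: [String] {
        partial.isEmpty ? committed : committed + [partial]
    }
}
```

Because the partial is replaced instead of appended, a revision such as "I scream" becoming "Ice cream is" rewrites the live line without duplicating text above it.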

Readability is the design problem

Captions are useless if they cannot be read, and they sit over unpredictable content (video, a camera feed, a presentation), so readability is the main design constraint. The standard answer is a scrim, a subtle dark gradient or panel behind the text, so the captions stay legible whether the background is a bright sky or a dark room. The type should respect the user’s Dynamic Type and accessibility text-size settings, because the people who rely on captions most often use larger text, and the caption bar should sit in a safe, consistent place rather than jumping around. Two to three lines is the usual window, with older text scrolling out of view. The same accessibility-first discipline drives screen-reader-optimized forms.

These choices are not polish; they are the function. A caption overlay that becomes unreadable over a bright background, or ignores the user’s text-size setting, fails exactly the users it exists to serve.
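The two-to-three-line window can be sketched as a pure function that wraps the caption text and keeps only the newest lines. The function name and the use of a character count as a stand-in for measured text width are assumptions made for illustration; a real overlay would measure rendered text, since larger Dynamic Type sizes fit fewer characters per line.

```swift
// Hypothetical helper: wraps text greedily by word and keeps the newest lines.
// Character count is a rough proxy for width, not a real layout measurement.
func visibleCaptionLines(for text: String, maxCharsPerLine: Int, maxLines: Int) -> [String] {
    var lines: [String] = []
    var current = ""
    for word in text.split(separator: " ") {
        if current.isEmpty {
            // A word longer than the limit still gets a line of its own.
            current = String(word)
        } else if current.count + 1 + word.count <= maxCharsPerLine {
            current += " \(word)"
        } else {
            lines.append(current)
            current = String(word)
        }
    }
    if !current.isEmpty { lines.append(current) }
    // Older lines scroll out of view; only the last `maxLines` remain.
    return Array(lines.suffix(maxLines))
}
```

Keeping the window logic separate from the view makes it easy to lower `maxCharsPerLine` when the user picks a larger text size, so the bar holds its position while showing fewer words.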

The approaches compared

There are three realistic levels of live captioning, and they differ in responsiveness and complexity.

| Approach | Responsiveness | Accuracy trade-off | Effort |
| --- | --- | --- | --- |
| Wait for final transcription | Laggy, captions arrive behind the speaker | Cleaner text, but out of sync | Low |
| Stream partial results | Real-time, updates as words refine | Shows refining text, keeps pace | Medium (the standard choice) |
| Stream plus on-device translation | Real-time in another language | Two layers of machine error to label | Medium to high |

Waiting for final results feels slow and is the wrong default for anything live. Streaming partial results is the standard for real-time captions. Adding translation layers a second machine step on top, which is useful for a multilingual audience but introduces a second source of error. The honest move is to label captions, and especially translations, as machine-generated rather than presenting them as a perfect transcript. A free VP0 live-caption overlay template starts you on the streaming version, with the caption bar, the scrim, the in-place updating behavior, and the states already shaped and exposed through a machine-readable source page, so an agent like Cursor or Claude Code extends a real overlay while you wire the Speech and Translation frameworks. The recognition side overlaps with a SwiftUI audio transcription template with Whisper, and the conversational angle with a language tutor voice chat clone.

Being honest about machine accuracy

Live captions and translations are machine-generated, and they make mistakes, so an honest overlay does not pretend otherwise. Speech recognition mishears, especially with names, accents, and noisy audio, and translation adds its own errors, so for accessibility and for trust the captions should be presented as an aid rather than a verbatim record. In practice that means showing a small indicator that captions are automatically generated, avoiding reliance on them for anything safety-critical without a human check, and making corrections easy where the app supports them. The latency also has to be honest: if recognition falls behind, the overlay should keep up with the latest speech rather than displaying a growing backlog of old text.

Framing the captions as a helpful, imperfect aid is the responsible posture, particularly because the users who depend on captions deserve to know they are reading a machine’s best guess, not a guaranteed transcript.
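The point about not displaying a growing backlog can be sketched as a queue that skips stale captions when the overlay falls behind. The `PendingCaption` and `CatchUpQueue` types, the `maxLag` threshold, and the use of plain seconds for timestamps are all assumptions for illustration, not part of any framework.

```swift
// Hypothetical types for illustration; timestamps are seconds on any monotonic clock.
struct PendingCaption: Equatable {
    let text: String
    let spokenAt: Double
}

struct CatchUpQueue {
    let maxLag: Double                        // how far behind the speaker we tolerate
    private(set) var pending: [PendingCaption] = []
    private(set) var droppedCount = 0         // lets the UI admit that text was skipped

    init(maxLag: Double) { self.maxLag = maxLag }

    mutating func enqueue(_ caption: PendingCaption) {
        pending.append(caption)
    }

    // Returns the next caption to show at time `now`, discarding stale ones first.
    // The newest caption is never discarded, so there is always something current to show.
    mutating func next(now: Double) -> PendingCaption? {
        while let first = pending.first, now - first.spokenAt > maxLag, pending.count > 1 {
            pending.removeFirst()
            droppedCount += 1
        }
        return pending.isEmpty ? nil : pending.removeFirst()
    }
}
```

Tracking `droppedCount` is a design choice in keeping with the honesty theme: the overlay can show a brief marker that some speech was skipped instead of silently hiding the gap.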

Key takeaways: a live caption overlay

It is a pipeline. Audio to recognition to optional translation to a readable overlay.
Stream partial results. Show the recognizer’s evolving guess so captions keep pace with the speaker.
Readability is the design problem. A scrim, Dynamic Type, and a stable position keep captions legible over any content.
Translation adds a second error layer. Useful for multilingual audiences, but label it as machine-generated.
Start from an overlay template. A free VP0 live-caption template gives an agent the caption bar and streaming behavior to wire recognition into.

What to choose

For a live captions and translation overlay, build it from a template that already handles the streaming updates and the readability, because the in-place partial-result behavior and the legibility over arbitrary content are the real work, not the framework calls. A free VP0 live-caption overlay template gives you the caption bar, the scrim, the updating behavior, and the states, so an agent extends a real overlay and you wire the Speech framework for recognition and the Translation framework where you need it, labeling the output as machine-generated. Waiting for final transcriptions is the one approach to avoid for anything live, since it always feels a step behind the speaker.

Frequently asked questions

How do I build a live captions overlay on iOS?

Build it as a pipeline: capture audio with AVFoundation, transcribe it in real time with the Speech framework, optionally translate with the Translation framework, and present the result as a readable caption bar. The key is to stream the recognizer’s partial results so captions appear as words are spoken and refine in place, rather than waiting for final sentences. Make the overlay legible over any content with a scrim and Dynamic Type, keep it in a stable position, and label the captions as machine-generated. A free live-caption template gives you the bar, the streaming behavior, and the states to start from.

How do I make live captions appear in real time?

Use the speech recognizer’s partial results. A recognizer returns an evolving best guess as it hears more, so display that updating text rather than waiting for it to finalize a sentence, which is what causes captions to lag behind the speaker. Update the latest partial in place, scroll completed phrases up, and avoid flicker as the text refines. Streaming the partials is what makes the overlay feel genuinely live; waiting for final results produces cleaner text but always arrives a beat or two behind the conversation.

Where can I get a live caption or translation overlay template?

The most useful option is a template built for the streaming overlay, not just a static text bar. A free VP0 live-caption overlay template provides the caption bar, the scrim for readability, the in-place updating behavior, and the states, with a machine-readable source page, so an agent like Cursor or Claude Code extends a real overlay. You then wire the Speech framework for recognition and the Translation framework where needed, since the template handles presentation and the frameworks handle recognition and translation. It is built for legibility over video and for live, refining text rather than a finished transcript.

How accurate are live captions and translations?

They are machine-generated and imperfect. Speech recognition mishears names, accents, and speech in noisy audio, and adding translation introduces a second layer of error, so live captions should be presented as a helpful aid rather than a verbatim record. An honest overlay indicates that captions are automatically generated, avoids relying on them for safety-critical information without a human, and makes corrections easy where supported. Being clear about the imperfection matters especially for the users who depend on captions, who deserve to know they are reading a best guess, not a guaranteed transcript.

How do I keep captions readable over video?

Put a scrim behind the text, a subtle dark gradient or panel, so the captions stay legible whether the background is bright or dark, and respect the user’s Dynamic Type and accessibility text-size settings, since caption users often use larger text. Keep the caption bar in a stable, safe position rather than letting it jump around, and show two to three lines with older text scrolling out. Readability is the main design constraint for a caption overlay, because captions that cannot be read over the content fail the people who rely on them.

#ios #captions #live-translation #accessibility #speech
