AudioDesc User Guide
Audio Description Generator
1. Getting Started
Installing
Open the DMG file and drag AudioDesc to your Applications folder. The first time you launch it, macOS will ask you to confirm that you want to open it, since it wasn't downloaded from the App Store. That's normal. Click Open.
Setting Up Your API Key
AudioDesc uses Google's Gemini AI to look at video frames and write descriptions. You'll need a free Gemini API key to use it.
- Go to Google AI Studio and sign in with a Google account
- Create an API key
- When AudioDesc launches the first time, it'll ask for that key
- It gets stored in your macOS Keychain, so you only have to do this once
You can update the key anytime in Settings > API Key. Hit Test to make sure it's working before you run anything.
Choosing a Voice
AudioDesc uses the built-in macOS text-to-speech engine for the description audio. The default voice is Samantha, chosen because it sounds noticeably different from a real person talking, which is what you want. Viewers should be able to tell immediately that they're hearing a description, not the program audio.
To change it, open Settings > Voice and pick from the dropdown. Use Test to hear it before committing. If you want better-sounding voices, download the enhanced versions at System Settings > Accessibility > Spoken Content > Manage Voices.
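If you want to audition a voice outside the app, the macOS `say` command can read a line aloud with a named voice. The sketch below only builds that command. It assumes `say` is a fair stand-in for hearing how a voice sounds; this guide doesn't say which speech API AudioDesc calls internally, and the helper name `build_say_command` is made up for illustration.

```python
import shlex

def build_say_command(text, voice="Samantha", rate_wpm=None):
    """Build a macOS `say` command for auditioning a voice.

    Assumption: `say` is used here only to preview a voice by ear.
    AudioDesc's own speech pipeline may work differently.
    """
    cmd = ["say", "-v", voice]
    if rate_wpm is not None:
        # `say -r` sets the speaking rate in words per minute.
        cmd += ["-r", str(rate_wpm)]
    cmd.append(text)
    return cmd

cmd = build_say_command("A slide titled Budget Overview appears.")
print(shlex.join(cmd))  # paste the printed line into Terminal on a Mac
```

On a Mac, running `say -v '?'` in Terminal lists the voices that are installed, including any enhanced ones you've downloaded.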
2. Basic Workflow
- Select a video — Click Browse and pick a video file. Almost any format works. Or pick a folder to process multiple videos.
- Choose a mode — Standard, Extended, or Both (details in the next section).
- Select export types — Check which output files you want.
- Choose an output location — "Create subfolder" is checked by default. Keeps things organized, especially for batch jobs.
- Click Analyze — The app pulls frames from the video, sends them to Gemini, and builds a list of suggested descriptions. This takes a few minutes depending on video length.
- Review descriptions — A window opens showing every suggested description with a thumbnail of the frame it came from. Edit, keep, or reject each one.
- Click Save & Continue to Mix — Generates the TTS audio and produces your output files.
3. Understanding the Modes
Standard Audio Description (WCAG 1.2.5, Level AA)
Descriptions go into natural pauses in the existing audio. Video length doesn't change. Because they have to fit inside whatever silence gaps exist, standard descriptions are shorter and more to the point. If there's no pause long enough nearby, that frame gets skipped.
Extended Audio Description (WCAG 1.2.7, Level AAA)
The original audio actually pauses at natural break points, the description plays, then the audio picks back up. No gap length constraint, so descriptions can be as thorough as they need to be. Available as audio-only (MP3) or as a video with freeze-frames at each pause.
Both
Runs both modes in one pass. The review screen shows standard and extended descriptions side by side so you can edit them independently. Standard and extended versions of the same description can say different things.
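Because extended mode pauses the program audio for every description, the extended output runs longer than the source. A rough way to estimate how much longer is to convert the total word count to speaking time. This is a back-of-the-envelope sketch, not the app's actual calculation: it assumes the 140 words-per-minute figure from the TTS WPM Estimate setting and ignores any padding the app may add around each pause.

```python
def extended_length_seconds(original_seconds, description_word_counts, wpm=140):
    """Estimate the length of an extended-mode output.

    Assumptions: speech runs at a steady `wpm`, and no extra padding
    is added before or after each description.
    """
    total_words = sum(description_word_counts)
    added_seconds = total_words * 60 / wpm
    return original_seconds + added_seconds

# A 30-minute video with three descriptions of 20, 35, and 50 words.
print(extended_length_seconds(1800, [20, 35, 50]))  # 1845.0
```

Here 105 words at 140 words per minute adds 45 seconds to the original 30 minutes.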
4. The Review Screen
After analysis finishes, a review window opens with every suggested description. Here's what you're looking at:
- Thumbnail — Preview of the frame being described. Click it to see a larger version.
- Info column — Timestamp, available gap length, content category (Slide, Graphic, Speaker, etc.), and word limit for standard mode.
- Standard description — The shortened version written to fit in the silence gap. Shows "No gap" if no qualifying pause was found nearby. Only shows in Standard or Both mode.
- Extended description — The full description. Only shows in Extended or Both mode.
- Reject button — Labeled "Reject" followed by the timestamp (e.g., "Reject 4:32"). Removes that description from the output. Click again to bring it back.
Click into any description text box to edit it directly. Your edits are saved when you click Save & Continue to Mix.
5. Output Files
| File | Description |
|---|---|
| *_audiodesc_standard.mp4 | Original video with descriptions mixed into silence gaps. Same length as the original. |
| *_audiodesc_extended.mp3 | Audio-only with the original audio paused at insertion points for full descriptions. |
| *_audiodesc_extended.mp4 | Video with freeze-frames at each description point. Longer than the original. |
| *_descriptions_only.mp3 | Full-length silent track with only the description audio at the correct timestamps. Useful for QA: play alongside the original to hear where descriptions land. |
| *_audiodesc_log.json | Detailed stats: frames analyzed, descriptions generated, API calls made, estimated cost, and why frames were skipped. |
After mixing, buttons appear at the bottom of the main window to open each output file in Finder.
6. Settings Reference
API & Model
| Setting | Default | Description |
|---|---|---|
| API Key | — | Your Google Gemini API key. Stored in macOS Keychain. Click Save after entering it, then Test to confirm it works. |
| Gemini Model | gemini-3-flash-preview | The AI model used to analyze frames. Leave this alone unless Google releases a newer model or you want to experiment with a different one. |
Voice
| Setting | Default | Description |
|---|---|---|
| TTS Voice | Samantha | The macOS voice used for descriptions. Intentionally sounds different from a real person so viewers can distinguish descriptions from program audio. Download enhanced voices in System Settings for better quality. |
Silence Detection
| Setting | Default | Description |
|---|---|---|
| Threshold (dB) | -30 | How quiet audio has to get before the app counts it as silence. More negative = more sensitive. If you're getting fewer descriptions than you expect, try -40. Council chambers with HVAC hum, crowd noise, or any background audio often need a lower number than the default. |
Gaps & Timing
| Setting | Default | Description |
|---|---|---|
| Standard Min Gap (ms) | 2000 | Minimum silence length (in milliseconds) needed to insert a standard description. Anything shorter gets skipped. 2000ms = 2 seconds. |
| Extended Min Gap (ms) | 400 | Minimum silence needed to find an insertion point for extended mode. Just needs a brief natural pause between sentences since the video pauses for the full description. |
| TTS WPM Estimate | 140 | Estimated speaking rate of the TTS voice in words per minute. Used to figure out how many words fit in a given gap. If descriptions are getting cut off, try lowering this number. |
| API Call Delay (s) | 0.5 | Minimum wait between Gemini API calls. Prevents hitting rate limits. Leave this alone unless you're consistently seeing rate limit errors. |
| Frame Interval — under 1hr | 3 | How often (in seconds) to pull a frame from the video for analysis. Lower = more thorough but more API calls and higher cost. |
| Frame Interval — over 1hr | 5 | Same thing for longer videos. A higher interval keeps costs reasonable on two- and three-hour recordings. |
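The Standard Min Gap and TTS WPM Estimate settings work together: a gap has to be long enough to qualify, and its length then sets a word budget. The arithmetic can be sketched like this. It's an illustration under assumptions, not the app's code: it assumes the word count is rounded down and that no time is reserved for padding at either end of the gap.

```python
def qualifies(gap_ms, min_gap_ms=2000):
    """A gap shorter than the minimum is skipped."""
    return gap_ms >= min_gap_ms

def word_limit(gap_ms, wpm=140):
    """Whole words that fit in a gap at the estimated speaking rate.

    Assumption: rounded down, with no padding inside the gap.
    """
    return gap_ms * wpm // 60000  # 60000 ms per minute

print(qualifies(1500))             # False
print(word_limit(4000))            # 9
print(word_limit(4000, wpm=120))   # 8
```

The last two lines show why lowering the WPM estimate helps when descriptions are getting cut off: the same 4-second gap is budgeted for fewer words, so the spoken audio is more likely to finish in time.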
Dedup & Categories
| Setting | Default | Description |
|---|---|---|
| Hash Distance | 4 | How visually similar two frames have to be to count as duplicates. Lower = stricter. Visually identical frames within the time window below get filtered before any API calls happen. |
| Hash Dedup Window (s) | 10 | Only treat frames as duplicates if they're within this many seconds of each other. Protects slides that use the same template at different points in the video — they'd look identical to the app but contain different content. |
| Text Similarity | 0.80 | If a new description is more than 80% similar to a previous one, it gets skipped automatically. Catches paraphrases of the same scene. |
| Lower Third Min Gap (ms) | 5000 | Minimum silence needed to describe a name or title overlay. Set higher than standard because lower thirds are supplementary and shouldn't interrupt normal flow. |
| Lower Third Suppress (s) | 600 | Time window (10 minutes) before a similar lower third can be described again. Prevents the same station ID or meeting title graphic from getting described every few minutes. |
| Consolidation Window (s) | 4 | If multiple descriptions of the same category land within this many seconds of each other, they get merged into one. Helps with animated openers that get split into separate frames. |
| Speaker Suppress (s) | 90 | Minimum time between speaker or scene descriptions. Keeps the app from generating a new "person at podium" description every 30 seconds during a long presentation. |
| Max Speaker Descriptions | 8 | Hard cap on speaker-type descriptions before the app stops generating new ones. Resets when a genuinely different speaker or scene appears. |
| Unknown Frame Interval (s) | 15 | How often to keep frames the visual classifier can't confidently categorize. Kept conservative at 15 seconds to avoid accidentally skipping slides that look ambiguous to the classifier. |
| Max Scene Descriptions | 20 | Cap on non-slide descriptions (Speaker, Graphic, Wide Shot, Other) per video. Slides, title cards, and lower thirds are never affected by this limit. |
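To make the first three settings in this table more concrete, here is one way the two duplicate checks could work. This is a sketch under assumptions, not the app's implementation: it assumes each frame gets an integer perceptual hash that is compared by counting differing bits (Hamming distance), and it uses Python's `difflib.SequenceMatcher` as a stand-in for whatever text similarity measure the app actually uses. All function names are hypothetical.

```python
from difflib import SequenceMatcher

def hamming_distance(hash_a, hash_b):
    """Number of bits that differ between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")

def is_duplicate_frame(hash_a, time_a, hash_b, time_b,
                       max_distance=4, window_s=10):
    """Frames count as duplicates only if they look alike AND are close in time."""
    close_in_time = abs(time_a - time_b) <= window_s
    close_in_look = hamming_distance(hash_a, hash_b) <= max_distance
    return close_in_time and close_in_look

def is_repeat_description(new_text, previous_text, threshold=0.80):
    """Skip a description that is more than `threshold` similar to an earlier one."""
    ratio = SequenceMatcher(None, new_text.lower(), previous_text.lower()).ratio()
    return ratio > threshold

# Near-identical frames 3 seconds apart: filtered before any API call.
print(is_duplicate_frame(0xFF00, 12.0, 0xFF01, 15.0))   # True
# The same template two minutes later: kept, because it is outside the window.
print(is_duplicate_frame(0xFF00, 12.0, 0xFF00, 132.0))  # False
```

The second call is the case the Hash Dedup Window protects: two slides that share a template hash the same but may carry different content.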
Video Open / Close
| Setting | Default | Description |
|---|---|---|
| Include Open / Close | unchecked | Adds a spoken description during the video's intro or outro. The original audio ducks down while it plays. |
| Duration (s) | 0 | How long the intro or outro segment is, in seconds. Set this to match the actual length of your opening or closing animation. |
| Text | — | What gets read during the intro or outro. Something like "Animated opening for the City Council broadcast" is enough. |
7. Batch Processing
To process multiple videos, click Browse and select a folder instead of a single file.
- Analyze: Processes one video at a time. After each analysis, the review window opens so you can check descriptions before mixing. Once you click Save & Continue to Mix, the next video starts automatically.
- Process (No Review): Runs through all videos without stopping. Good for high-volume nights when you trust the defaults.
With "Create subfolder" checked, each video gets its own output folder (e.g., meeting_2026_audiodesc/). Keeps things from piling up.
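If you script anything around a batch run (copying results, checking that everything finished), it helps to know where the files should land. The sketch below builds the expected paths from the naming patterns in the Output Files section. It assumes the `*` in those patterns is the video's file name without its extension, which matches the `meeting_2026_audiodesc/` example above but isn't stated outright. The function name is hypothetical, and which files actually exist depends on the export types you checked.

```python
from pathlib import Path

# Suffixes taken from the Output Files table in this guide.
SUFFIXES = [
    "_audiodesc_standard.mp4",
    "_audiodesc_extended.mp3",
    "_audiodesc_extended.mp4",
    "_descriptions_only.mp3",
    "_audiodesc_log.json",
]

def expected_outputs(video_path, output_dir, create_subfolder=True):
    """List the output paths a run could produce for one video.

    Assumption: the `*` in each pattern is the video's stem.
    """
    stem = Path(video_path).stem
    base = Path(output_dir)
    if create_subfolder:
        base = base / f"{stem}_audiodesc"
    return [base / f"{stem}{suffix}" for suffix in SUFFIXES]

for path in expected_outputs("meeting_2026.mp4", "Exports"):
    print(path)
```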
8. Troubleshooting
API key not found
If you see "No Gemini API key" when the app opens, go to Settings, paste in your key, click Save, then Test. After that, the key is stored in the macOS Keychain and the app shouldn't ask for it again.
No descriptions generated
If the app finishes and reports zero descriptions, the video may not have any visual content that needs describing. A talking-head interview with no slides, graphics, or text on screen is a legitimate case where audio description isn't needed — the audio already tells the viewer everything that's happening.
If you think descriptions are missing when they shouldn't be, try these in order:
- Lower the Silence Threshold to -40 dB. If the room has any background noise, the default -30 may not be finding enough silence gaps.
- Reduce the Frame Interval to 2 seconds to catch more frames.
- Run in Both mode. Some content only gets described in extended mode when no silence gap is available for standard.
App appears unresponsive during analysis
When the app is sending frames to Gemini for analysis, it can go quiet between progress updates. It's working. The status bar and progress bar show what's happening, and the log updates every 10 frames. Each frame takes a few seconds to come back.
If the progress counter hasn't moved in more than 5 minutes, press Cancel once. That triggers a graceful stop that saves everything processed so far. If nothing happens after 30 seconds, press Cancel a second time to force-stop.
Cost higher than expected
Cost is directly tied to how many frames get sent to Gemini. A few things that help:
- Increase the Frame Interval to 8-10 seconds instead of the default 3-5
- Increase Unknown Frame Interval to thin out ambiguous frames before they reach the API
- The log shows a cost estimate before the API calls start — cancel early if it's higher than you expected
- The log file written after each run (*_audiodesc_log.json) has exact API call counts and the cost for that run
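To see how much the Frame Interval matters, you can estimate how many frames get pulled before any filtering. This is a rough upper bound under assumptions, not the app's own estimate: dedup and the Unknown Frame Interval remove frames before they reach the API, so the real call count should be lower. The guide also doesn't say which interval applies at exactly one hour; the sketch assumes the longer one.

```python
import math

def frames_sampled(duration_s, interval_short_s=3, interval_long_s=5):
    """Frames pulled from a video before dedup or any other filtering.

    Assumption: videos of exactly one hour use the longer interval.
    """
    interval = interval_short_s if duration_s < 3600 else interval_long_s
    return math.ceil(duration_s / interval)

print(frames_sampled(45 * 60))                         # 900
print(frames_sampled(2 * 3600))                        # 1440
print(frames_sampled(2 * 3600, interval_long_s=10))    # 720
```

Doubling the interval on a two-hour recording halves the number of candidate frames, which is why it's the first setting to try when cost is the problem.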
Slides missing from descriptions
If slides are getting skipped, a few things to check:
- Make sure Hash Dedup Window is at least 10 seconds. Lower values can accidentally treat two different slides with the same template as duplicates.
- Reduce the Frame Interval — slides that are only on screen briefly can get missed at longer intervals.
- Run in Both mode to make sure slides get captured even when no silence gap is available for standard mode.
9. Tips
The default settings were tuned for council and board meeting videos — a mostly static camera, a mix of talking heads and slides, predictable structure. That's where they work best.
Community events are different. Cameras move around, speakers wander, and there's a lot more visual noise. If you're processing community event footage and getting flooded with repetitive speaker descriptions, try bumping Speaker Suppress up to 180 seconds and lowering Max Scene Descriptions to around 10.
The descriptions-only MP3 is underrated for QA. Drop it into your audio editor alongside the original video and you can hear exactly where every description lands without having to generate the full mixed output first.
Check Google AI Studio occasionally to keep an eye on your API usage. A two-hour council meeting with slides typically runs under $2.00 after the default optimizations. If you see a run that costs significantly more than that, the log file will show you exactly where the calls went.
10. Get in Touch
AudioDesc is built and maintained by Brian Paris at Sounds and Colors. If you run into something this guide doesn't cover, find a bug, or just want to share how it's working at your station, send a note.
Real feedback from people actually using this in production is how the app gets better. Don't hesitate to reach out.