PhotoDesc User Guide
Audio-described videos from your photos
1. Getting Started
Installing
Open the DMG file and drag PhotoDesc to your Applications folder. On first launch, macOS will ask you to confirm that you want to open it, since it wasn't downloaded from the App Store. That's normal; click Open. PhotoDesc runs on Apple Silicon (M-series) Macs.
Setting Up Your API Key
PhotoDesc uses Google's Gemini AI to look at each photo and write a description, and (optionally) to narrate it. You'll need a free Gemini API key.
- Go to Google AI Studio and sign in with a Google account
- Create an API key
- When PhotoDesc launches the first time, it'll ask for that key. You can also click Skip for now and add it later. You can import photos without a key; you just need one to generate descriptions or AI narration.
- The key is stored in your macOS Keychain, so you only enter it once
You can update the key anytime in Settings > API Key. Click Save, then Test to confirm it works. The dot in the top-right corner shows the key's status: ● API ✓ when it's connected, ● API ✗ if there's a problem.
Choosing a Voice
PhotoDesc can narrate with Google's natural Gemini voices (default: Kore) or with the built-in macOS voices (default: Samantha), which are free and work offline. Pick the engine and voice in Settings > Voice.
2. Basic Workflow
- Add photos — Click + Click here to add images in the middle of the window (or + Add more images at the bottom once you have some). Pick one photo or several. A project folder is created automatically the first time you add images.
- Generate descriptions — Click Generate All Text to have the AI describe every photo, or use the + chip on a single row to describe just that one. (The button becomes Generate Missing Text once some rows already have descriptions.)
- Review and edit — Each photo shows up as a row with its description. Read it, edit the wording, re-crop the frame, change the voice, whatever you need (see the next section).
- Pick a voice (and optionally music) — Set the voice in Settings; choose a background-music track from the Music menu at the bottom if you want one.
- Choose the frame shape — Set the Aspect ratio (16:9, 9:16, 1:1, 4:5) and Frame rate on the project line above the photos.
- Choose how to export — Flip the Export toggle to Per image or Sequence (details below).
- Click Create Videos — PhotoDesc generates the narration and renders your video(s) into the project folder. Use In… instead if you want them somewhere else.
Each project is saved in its own folder under ~/Documents/PhotoDesc/. Edits, thumbnails, and the narration you've paid for are all kept there, so you can quit and reopen without losing anything or re-generating audio. Use File > Open to reopen a project, or Save As… to move/rename its folder.
3. Two Ways to Export
The Export toggle at the bottom decides what you get when you click Create Videos.
Per Image
Each photo becomes its own short video, with the image on screen while its description is read. Good for posting one at a time, or when each photo stands on its own.
Combined Sequence
All the approved photos play back to back as a single video, each shown while its narration plays, with optional dissolves between them. (Sequence is only available when you have more than one photo.)
4. The Review Screen
Every photo you add becomes a row. Here's what's on it:
- Thumbnail — A preview of the photo. Click Crop / Frame to reposition or zoom how it sits in the chosen aspect ratio.
- Description — The text the AI wrote. Click into it and edit freely; your changes are saved automatically.
- Text chip — A + generates a description if the row doesn't have one; a ↻ reopens it for a guided rewrite (you can add a note like "focus on the dog in the foreground" and the photo, current text, and your note are sent together).
- Voice chip — A + generates the narration audio; a ↻ regenerates it (after editing the text, for example).
- ▶ / ⏹ — Play or stop the generated narration for that row.
- Read by — Shows which voice the row will use.
- Reject — Leaves the photo out of the output. Click again to bring it back (it reads Rejected while excluded).
- Create — Renders just that one photo to a video, without doing the whole set.
- ⠯ handle — Drag to reorder photos (matters for Sequence export).
Generating audio costs money (see Settings > cost note); editing text and re-cropping are free. Nothing is final until you click Create Videos.
5. Voices and Background Music
Set the narration voice in Settings > Voice: choose from a range of natural Gemini voices or the free macOS voices. You can also override the voice on an individual photo.
To add music, pick a track from the Music menu at the bottom of the window (a few are built in, and + Add your own… lets you import a file). The track is looped to the length of the video, faded in and out, and ducked: it drops down whenever the narration is speaking and comes back up in the gaps, so the words always stay clear. The ▶ next to the menu previews the track; the Volume slider in Settings sets how loud it sits under the voice (default 22%).
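
The looping, fading, and ducking described above can be pictured as a gain curve over time. The sketch below is illustrative only and is not PhotoDesc's actual implementation: the function names, the duck level of 0.3, and the 1-second fade are assumptions made for the example. Only the 22% default volume comes from this guide.

```python
def loop_position(t, track_length):
    """Playback position (seconds) inside a track that loops for the whole video."""
    return t % track_length


def music_gain(t, narration_spans, video_length,
               volume=0.22, duck_factor=0.3, fade=1.0):
    """Return the music gain (0.0 to 1.0) at time t seconds.

    narration_spans: list of (start, end) times when the voice is speaking.
    volume: base music level (0.22 matches the guide's 22% default).
    duck_factor and fade are hypothetical values chosen for illustration.
    """
    gain = volume
    # Duck: lower the music while any narration span covers t.
    if any(start <= t < end for start, end in narration_spans):
        gain *= duck_factor
    # Fade in at the start of the video and out at the end.
    fade_in = min(1.0, t / fade) if fade > 0 else 1.0
    fade_out = min(1.0, (video_length - t) / fade) if fade > 0 else 1.0
    return gain * max(0.0, min(fade_in, fade_out))
```

In this sketch the duck is an instant step. A real mixer would normally ramp the level down and back up over a fraction of a second so the change isn't audible as a jump.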
6. Output Files
Everything for a project lives in its folder under ~/Documents/PhotoDesc/ (or wherever you exported with In…).
| File | Description |
|---|---|
| *_described.mp4 | Per-image export. One video per photo, the image shown while its description is read. |
| *_sequence.mp4 | Combined-sequence export. All photos in one video, back to back. |
| thumbs/ | Cached thumbnails for the review grid. |
| tts_clips/ | The generated narration audio. Kept and reused, so reopening a project never re-charges you for audio you already made. |
| *_photodesc.json | The project file. Open it (or its folder) to pick up exactly where you left off. |
| *_photodesc_log.json | A run log: images, voice/model used, and an estimated cost for the run. |
After a render, an Open Folder button appears at the bottom so you can jump straight to the files in Finder.
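
If you'd rather check a project folder from a script, the file-name patterns in the table are enough to sort its contents. This is a minimal sketch, not a PhotoDesc feature: the function name `list_outputs` and the dictionary keys are made up for the example, and it reads only file names, not the contents of the project or log files.

```python
from pathlib import Path


def list_outputs(project_dir):
    """Group a project folder's files by the name patterns listed in the guide."""
    folder = Path(project_dir).expanduser()
    patterns = {
        "per_image": "*_described.mp4",   # one video per photo
        "sequence": "*_sequence.mp4",     # combined-sequence video
        "project": "*_photodesc.json",    # project file
        "log": "*_photodesc_log.json",    # run log
    }
    return {kind: sorted(p.name for p in folder.glob(pattern))
            for kind, pattern in patterns.items()}
```

For example, `list_outputs("~/Documents/PhotoDesc/MyTrip")` would report the videos in a project folder named MyTrip (a hypothetical name).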
7. Settings Reference
API & Model
| Setting | Default | Description |
|---|---|---|
| API Key | — | Your Google Gemini API key. Stored in macOS Keychain. Click Save, then Test to confirm it works. |
| Gemini Model | gemini-3-flash-preview | The AI model used to describe photos. Leave this alone unless Google releases a newer one. |
Voice
| Setting | Default | Description |
|---|---|---|
| TTS engine | Gemini | Gemini for natural AI narration (uses the API), or macOS for the built-in system voices (free, offline). |
| Gemini voice | Kore | Which Gemini voice narrates. Test plays a bundled sample at no cost. |
| macOS voice | Samantha | Which system voice narrates when the engine is set to macOS. Download higher-quality voices in System Settings > Accessibility > Spoken Content > Manage Voices. |
| Gemini TTS model | gemini-3.1-flash-tts-preview | The model used to synthesize Gemini narration. |
Slideshow
| Setting | Default | Description |
|---|---|---|
| Blank image hold (s) | 5 | How long a photo with no description stays on screen. |
| Randomize | off | Varies each blank hold by roughly −2 to +3 seconds so a slideshow doesn't feel mechanical. |
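
To make the Randomize arithmetic concrete, here is a small sketch of how a blank hold could be computed. It assumes the offset is drawn uniformly from the −2 to +3 second range quoted above. The guide says only "roughly", so the exact distribution and the function name `blank_hold` are assumptions.

```python
import random


def blank_hold(base=5.0, randomize=False, rng=random):
    """Seconds a photo with no description stays on screen.

    base: the "Blank image hold (s)" setting (default 5).
    The -2..+3 second range is the guide's approximate figure; drawing
    uniformly from it is an assumption made for this example.
    """
    if not randomize:
        return base
    return base + rng.uniform(-2.0, 3.0)
```

With the default 5-second hold, a randomized blank photo would stay on screen for somewhere between about 3 and 8 seconds.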
Background Music
| Setting | Default | Description |
|---|---|---|
| Track | (None) | The music bed. A few tracks are built in; + Add your own… imports a file. Looped, faded, and ducked under the narration. |
| Volume | 22% | How loud the music sits under the voice. |
Framing
| Setting | Default | Description |
|---|---|---|
| Default fit | Fit | Fit shows the whole photo with filler around it; Fill crops to fill the frame. You can re-crop any photo individually with Crop / Frame. |
| Default edge filler | Blur | What fills the space around a "Fit" photo: a blurred copy of the image or a solid color. |
| Filler color (hex) | — | The solid color used when edge filler is set to color (e.g. #000000). |
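
The difference between Fit and Fill comes down to which scale factor is used. The sketch below shows the standard calculation. It is not taken from PhotoDesc's code, the function name `scaled_size` is hypothetical, and the 1920×1080 frame in the usage note is an assumed example size, since this guide doesn't state the output resolution.

```python
def scaled_size(photo_w, photo_h, frame_w, frame_h, mode="fit"):
    """Return the photo's (width, height) after scaling it into the frame.

    "fit" keeps the whole photo visible (filler covers the leftover space);
    "fill" covers the whole frame (whatever overflows is cropped).
    """
    if mode not in ("fit", "fill"):
        raise ValueError("mode must be 'fit' or 'fill'")
    scale_w = frame_w / photo_w
    scale_h = frame_h / photo_h
    # Fit uses the smaller scale so nothing is cut off;
    # Fill uses the larger scale so no empty space remains.
    scale = min(scale_w, scale_h) if mode == "fit" else max(scale_w, scale_h)
    return round(photo_w * scale), round(photo_h * scale)
```

A 4000×3000 photo in a 1920×1080 frame scales to 1440×1080 with Fit (leaving 240 pixels of filler on each side) and to 1920×1440 with Fill (cropping 360 pixels of height in total).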
Dissolves & Transitions
| Setting | Default | Description |
|---|---|---|
| Fade in at start (ms) | 500 (off) | Fades up from black at the very beginning. Turn it on with the bottom-bar Dissolves checkboxes; the length is set here. |
| Fade out at end (ms) | 500 (off) | Fades down to black at the very end. |
| Sequence between | Cut | What happens between photos in a combined sequence: Cut (no transition), Fade to black, or Cross-dissolve. |
| Transition length (ms) | 500 | How long each fade-to-black or cross-dissolve takes. |
Output
| Setting | Default | Description |
|---|---|---|
| Aspect ratio | 16:9 | The shape of the video. 16:9 for widescreen, 9:16 for vertical/social, 1:1 square, 4:5 portrait. Set on the project line above the photos. |
| Frame rate | 29.97 | 29.97 / 59.94 are broadcast-safe; 30 / 60 are fine for web & social. |
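
The labels 29.97 and 59.94 are rounded forms of the NTSC rates 30000/1001 and 60000/1001 frames per second. The sketch below uses those exact fractions to show how many frames a clip of a given length contains. Whether PhotoDesc uses the exact fractions internally is an assumption, and `frame_count` is a name invented for the example.

```python
from fractions import Fraction

# Exact rates behind the rounded "29.97" and "59.94" labels.
FRAME_RATES = {
    "29.97": Fraction(30000, 1001),
    "59.94": Fraction(60000, 1001),
    "30": Fraction(30),
    "60": Fraction(60),
}


def frame_count(seconds, rate="29.97"):
    """Nearest whole number of frames in `seconds` of video at the given rate."""
    return round(Fraction(str(seconds)) * FRAME_RATES[rate])
```

A 60-second video is 1800 frames at 30 fps but 1798 frames at 29.97, which is why the two settings aren't interchangeable for broadcast delivery.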
Accessibility
| Setting | Default | Description |
|---|---|---|
| Status sounds | on | Plays a short cue at key moments, useful for screen-reader users. |
8. Troubleshooting
API key not found
If the top-right dot shows ● API ✗, open Settings, paste in your key, click Save, then Test. The key is stored in macOS Keychain after that.
No description generated for a photo
Make sure you have a working API key (the dot should read ● API ✓), then click the + chip on that row, or Generate Missing Text to fill in any that are blank. If a description came out wrong, click the ↻ chip and add a short note about what to focus on.
I don't hear music in the video
Pick a track from the Music menu before you click Create Videos. Selecting music only affects videos rendered afterward, and the volume slider must be above 0%. The music ducks low while the voice is speaking, so it's most audible in the gaps and at the start and end.
A project folder warning appeared
If you move or delete a project's folder in Finder while it's open, PhotoDesc shows a banner because it can no longer save there. Start a New Project, or reopen the folder if you can put it back.
Render was cancelled or interrupted
Press Cancel once for a graceful stop that keeps everything done so far; the narration you already generated is cached, so picking back up doesn't re-charge you.
9. Tips
Edit the descriptions to taste before generating audio. The AI gives you a solid first draft, but a quick human pass makes the narration feel intentional. Text edits are free; only the audio costs anything.
Use blank (no-description) photos as title cards or pauses, and turn on Randomize so a long slideshow breathes naturally instead of marching at a fixed beat.
For social media, set the aspect ratio to 9:16 (vertical) or 1:1 (square) before you render. It reframes every photo to fit, and you can fine-tune any individual crop with Crop / Frame.
Keep an eye on usage at Google AI Studio. The run log saved with each project lists an estimated cost if you want to track it.
10. Get in Touch
PhotoDesc is built and maintained by Brian Paris at Sounds and Colors. If you run into something this guide doesn't cover, find a bug, or just want to share how you're using it, send a note.
Real feedback from people actually using this is how the app gets better. Don't hesitate to reach out.