PhotoDesc User Guide

Audio-described videos from your photos

1. Getting Started

Installing

Open the DMG file and drag PhotoDesc to your Applications folder. On first launch, macOS will ask you to confirm that you want to open it, since it wasn't downloaded from the App Store. That's normal; click Open. PhotoDesc runs on Apple Silicon (M-series) Macs.

Setting Up Your API Key

PhotoDesc uses Google's Gemini AI to look at each photo and write a description, and (optionally) to narrate it. You'll need a free Gemini API key.

  1. Go to Google AI Studio and sign in with a Google account
  2. Create an API key
  3. When PhotoDesc launches the first time, it'll ask for that key. You can also click Skip for now and add it later. You can import photos without a key; you just need one to generate descriptions or AI narration.
  4. The key is stored in your macOS Keychain, so you only enter it once

You can update the key anytime in Settings > API Key. Click Save, then Test to confirm it works. The dot in the top-right corner shows the key's status: ● API ✓ when it's connected, ● API ✗ if there's a problem.
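If you'd like to check a key outside the app, here is a minimal Python sketch. It assumes Google's public Gemini REST endpoint for listing models and sends the key in the x-goog-api-key header. This guide doesn't say what the Test button does internally, so treat this as one plausible check rather than the app's own. The function names are made up for illustration.

```python
import urllib.error
import urllib.request

# Public Gemini REST endpoint that lists available models.
# Used here only as a cheap "does Google accept this key?" call (an assumption).
MODELS_URL = "https://generativelanguage.googleapis.com/v1beta/models"


def build_request(api_key: str) -> urllib.request.Request:
    """Build a GET request that carries the key in a header, not in the URL."""
    return urllib.request.Request(
        MODELS_URL, headers={"x-goog-api-key": api_key.strip()}
    )


def key_works(api_key: str, timeout: float = 10.0) -> bool:
    """Return True if Google answers 200, False on an HTTP or network error."""
    try:
        with urllib.request.urlopen(build_request(api_key), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False
```

Call key_works("your-key-here") from a Python prompt. A False result can mean a bad key or simply no network connection, so check both.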

Choosing a Voice

PhotoDesc can narrate with Google's natural Gemini voices (default: Kore) or with the built-in macOS voices (default: Samantha), which are free and work offline. Pick the engine and voice in Settings > Voice.

Auditioning is free: the Test button next to the voice plays a bundled sample. It never makes an API call, so trying out voices costs nothing.

2. Basic Workflow

  1. Add photos — Click + Click here to add images in the middle of the window (or + Add more images at the bottom once you have some). Pick one photo or several. A project folder is created automatically the first time you add images.
  2. Generate descriptions — Click Generate All Text to have the AI describe every photo, or use the + chip on a single row to describe just that one. (The button becomes Generate Missing Text once some rows already have descriptions.)
  3. Review and edit — Each photo shows up as a row with its description. Read it, edit the wording, re-crop the frame, change the voice, whatever you need (see the next section).
  4. Pick a voice (and optionally music) — Set the voice in Settings; choose a background-music track from the Music menu at the bottom if you want one.
  5. Choose the frame shape — Set the Aspect ratio (16:9, 9:16, 1:1, 4:5) and Frame rate on the project line above the photos.
  6. Choose how to export — Flip the Export toggle to Per image or Sequence (details below).
  7. Click Create Videos — PhotoDesc generates the narration and renders your video(s) into the project folder. Use In… instead if you want them somewhere else.

Your work is saved automatically. Each project lives in its own folder under ~/Documents/PhotoDesc/. Edits, thumbnails, and the narration you've paid for are all kept there, so you can quit and reopen without losing anything or re-generating audio. Use File > Open to reopen a project, or Save As… to move/rename its folder.

3. Two Ways to Export

The Export toggle at the bottom decides what you get when you click Create Videos.

Per Image

Each photo becomes its own short video, with the image on screen while its description is read. Good for posting one at a time, or when each photo stands on its own.

Combined Sequence

All the approved photos play back to back as a single video, each shown while its narration plays, with optional dissolves between them. (Sequence is only available when you have more than one photo.)

Photos without a description still count. A photo you leave blank shows on screen silently for a few seconds (set in Settings), so you can use title images or breathers between described shots. Turn on Randomize to vary each blank hold a little so a slideshow doesn't feel mechanical.

4. The Review Screen

Every photo you add becomes a row. Here's what's on it:

Generating audio costs money (see Settings > cost note); editing text and re-cropping are free. Nothing is final until you click Create Videos.

5. Voices and Background Music

Set the narration voice in Settings > Voice: choose from a range of natural Gemini voices or the free macOS voices. You can also override the voice on an individual photo.

To add music, pick a track from the Music menu at the bottom of the window (a few are built in, and + Add your own… lets you import a file). The track is looped to the length of the video, faded in and out, and ducked: it drops down whenever the narration is speaking and comes back up in the gaps, so the words always stay clear. The control next to the menu previews the track; the Volume slider in Settings sets how loud it sits under the voice (default 22%).
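To picture what looping, fading, and ducking mean together, here is a rough Python sketch of a music level over time. Only the 22% default volume comes from this guide. The duck amount (35% of the set volume), the one-second fades, and the instant drop with no ramp are assumptions chosen for illustration, not PhotoDesc's actual values.

```python
import math


def loops_needed(video_seconds: float, track_seconds: float) -> int:
    """How many times the track must play to cover the whole video."""
    return math.ceil(video_seconds / track_seconds)


def music_gain(t, total, speech, volume=0.22, duck=0.35, fade=1.0):
    """Music level (0..1) at time t seconds in a video `total` seconds long.

    speech: list of (start, end) times when narration is playing.
    duck and fade are illustrative assumptions, not the app's real settings.
    """
    # Fade in over the first `fade` seconds and out over the last `fade` seconds.
    edge = max(0.0, min(1.0, t / fade, (total - t) / fade))
    # Drop the level while any narration interval covers t.
    speaking = any(start <= t < end for start, end in speech)
    return volume * (duck if speaking else 1.0) * edge
```

With narration from 2 s to 8 s in a 20-second video, the level sits at 0.22 in the gaps and drops to about 0.077 while the voice is speaking.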

6. Output Files

Everything for a project lives in its folder under ~/Documents/PhotoDesc/ (or wherever you exported with In…).

File                   Description
*_described.mp4        Per-image export. One video per photo, the image shown while its description is read.
*_sequence.mp4         Combined-sequence export. All photos in one video, back to back.
thumbs/                Cached thumbnails for the review grid.
tts_clips/             The generated narration audio. Kept and reused, so reopening a project never re-charges you for audio you already made.
*_photodesc.json       The project file. Open it (or its folder) to pick up exactly where you left off.
*_photodesc_log.json   A run log: images, voice/model used, and an estimated cost for the run.
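If you keep many projects, a small script can sort a project folder's contents using the file-name patterns listed above. This is a sketch in Python; the function name and the example folder name are hypothetical, and it only looks at file names, not at what is inside the files.

```python
from pathlib import Path

# File-name patterns taken from the table above.
PATTERNS = {
    "per_image": "*_described.mp4",
    "sequence": "*_sequence.mp4",
    "project": "*_photodesc.json",
    "run_log": "*_photodesc_log.json",
}


def find_outputs(project_dir):
    """Return {kind: sorted file names} for one project folder."""
    folder = Path(project_dir).expanduser()
    return {
        kind: sorted(p.name for p in folder.glob(pattern))
        for kind, pattern in PATTERNS.items()
    }
```

For example, find_outputs("~/Documents/PhotoDesc/My Trip") would list the videos and project files in a project folder named My Trip, if you have one.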

After a render, an Open Folder button appears at the bottom so you can jump straight to the files in Finder.

7. Settings Reference

API & Model

Setting        Default                  Description
API Key                                 Your Google Gemini API key. Stored in macOS Keychain. Click Save, then Test to confirm it works.
Gemini Model   gemini-3-flash-preview   The AI model used to describe photos. Leave this alone unless Google releases a newer one.

Voice

Setting            Default                        Description
TTS engine         Gemini                         Gemini for natural AI narration (uses the API), or macOS for the built-in system voices (free, offline).
Gemini voice       Kore                           Which Gemini voice narrates. Test plays a bundled sample at no cost.
macOS voice        Samantha                       Which system voice narrates when the engine is set to macOS. Download higher-quality voices in System Settings > Accessibility > Spoken Content > Manage Voices.
Gemini TTS model   gemini-3.1-flash-tts-preview   The model used to synthesize Gemini narration.

Slideshow

Setting                Default   Description
Blank image hold (s)   5         How long a photo with no description stays on screen.
Randomize              off       Varies each blank hold by roughly −2 to +3 seconds so a slideshow doesn't feel mechanical.
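As a rough picture of what Randomize does to a blank photo's hold time, here is a short Python sketch. The 5-second default and the −2 to +3 second range come from the table above. The even (uniform) spread and the half-second minimum are assumptions; the guide doesn't say how the app picks the offset.

```python
import random


def blank_hold(base=5.0, randomize=False, rng=random):
    """Seconds a photo with no description stays on screen."""
    if not randomize:
        return base
    # Shift the hold by -2 to +3 seconds (uniform spread is an assumption),
    # and never go below half a second (also an assumption).
    return max(0.5, base + rng.uniform(-2.0, 3.0))
```

With the default 5-second hold, a randomized blank photo stays up somewhere between 3 and 8 seconds.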

Background Music

Setting   Default   Description
Track     (None)    The music bed. A few tracks are built in; + Add your own… imports a file. Looped, faded, and ducked under the narration.
Volume    22%       How loud the music sits under the voice.

Framing

Setting               Default   Description
Default fit           Fit       Fit shows the whole photo with filler around it; Fill crops to fill the frame. You can re-crop any photo individually with Crop / Frame.
Default edge filler   Blur      What fills the space around a "Fit" photo: a blurred copy of the image or a solid color.
Filler color (hex)              The solid color used when edge filler is set to color (e.g. #000000).
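The difference between Fit and Fill comes down to which side of the photo is scaled to match the frame. Here is a sketch in Python of that arithmetic. The 1920×1080 frame size in the example is an assumption (this guide doesn't state the output resolution), and the function name is hypothetical.

```python
def scaled_size(photo_w, photo_h, frame_w, frame_h, mode="fit"):
    """Return the photo's (width, height) in pixels after scaling to the frame.

    fit:  the whole photo is visible; leftover frame area gets the edge filler.
    fill: the frame is fully covered; whatever overflows is cropped away.
    """
    sx = frame_w / photo_w
    sy = frame_h / photo_h
    # Fit uses the smaller scale factor, Fill the larger one.
    scale = min(sx, sy) if mode == "fit" else max(sx, sy)
    return round(photo_w * scale), round(photo_h * scale)


# A 4000x3000 photo in an assumed 1920x1080 (16:9) frame:
print(scaled_size(4000, 3000, 1920, 1080, "fit"))   # (1440, 1080)
print(scaled_size(4000, 3000, 1920, 1080, "fill"))  # (1920, 1440)
```

In the Fit case the photo is 1440 pixels wide, leaving 240 pixels of filler on each side. In the Fill case it is 1440 pixels tall, so 180 pixels are cropped from the top and from the bottom.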

Dissolves & Transitions

Setting                  Default     Description
Fade in at start (ms)    500 (off)   Fades up from black at the very beginning. Turn it on with the bottom-bar Dissolves checkboxes; the length is set here.
Fade out at end (ms)     500 (off)   Fades down to black at the very end.
Sequence between         Cut         What happens between photos in a combined sequence: Cut (no transition), Fade to black, or Cross-dissolve.
Transition length (ms)   500         How long each fade-to-black or cross-dissolve takes.
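If you need to estimate how long a combined sequence will run, the transition type can matter. The Python sketch below assumes that a cross-dissolve overlaps two neighboring photos for the length of the transition, while a cut or fade to black leaves the total unchanged. That is how cross-dissolves commonly work in video editors, but this guide doesn't confirm that PhotoDesc renders them this way, so treat the result as an estimate.

```python
def sequence_length(clip_seconds, between="cut", transition_ms=500):
    """Estimated running time of a combined sequence, in seconds.

    clip_seconds: how long each photo is on screen by itself.
    Assumption: each cross-dissolve overlaps two neighbors by transition_ms.
    """
    total = sum(clip_seconds)
    if between == "cross-dissolve" and len(clip_seconds) > 1:
        total -= (len(clip_seconds) - 1) * transition_ms / 1000
    return total
```

Three photos lasting 8, 5, and 10 seconds run 23 seconds with cuts, and about 22 seconds with the default 500 ms cross-dissolve under this assumption.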

Output

Setting        Default   Description
Aspect ratio   16:9      The shape of the video. 16:9 for widescreen, 9:16 for vertical/social, 1:1 square, 4:5 portrait. Set on the project line above the photos.
Frame rate     29.97     29.97 / 59.94 are broadcast-safe; 30 / 60 are fine for web & social.
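The broadcast rates are not round numbers: in the video standards they come from, 29.97 is shorthand for 30000/1001 frames per second and 59.94 for 60000/1001. The Python sketch below shows how that changes the frame count for a clip. It assumes PhotoDesc uses these exact fractions, which this guide doesn't state.

```python
from fractions import Fraction

# Exact values behind the menu labels (standard broadcast definitions).
RATES = {
    "29.97": Fraction(30000, 1001),
    "59.94": Fraction(60000, 1001),
    "30": Fraction(30),
    "60": Fraction(60),
}


def frame_count(seconds, rate):
    """Number of whole frames in a clip of the given length."""
    return round(Fraction(seconds) * RATES[rate])


print(frame_count(60, "29.97"))  # 1798
print(frame_count(60, "30"))     # 1800
```

A one-minute video has two fewer frames at 29.97 than at 30, which is why the two settings aren't interchangeable for broadcast delivery.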

Accessibility

Setting         Default   Description
Status sounds   on        Plays a short cue at key moments, useful for screen-reader users.

A note on cost: the app is free. You bring your own Gemini API key and pay Google only for what you generate, and almost all of that is the AI narration. Describing a photo costs a fraction of a cent, and narration runs about a tenth of a cent per second of speech, so a typical project comes in well under a dollar. The macOS voices are completely free. Each row's audio is generated once and reused, so editing and re-rendering don't re-charge you. The run log lists an estimate after every render.
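
To see how those numbers add up, here is a back-of-the-envelope estimate in Python. The narration rate (a tenth of a cent per second) is the figure quoted above. The per-photo description cost of 0.2 cents is a placeholder standing in for "a fraction of a cent", not a published price, and Google's actual rates may change.

```python
NARRATION_USD_PER_SECOND = 0.001   # "about a tenth of a cent per second"
DESCRIPTION_USD_PER_PHOTO = 0.002  # placeholder for "a fraction of a cent"


def estimate_cost(photos, seconds_per_photo, gemini_voice=True):
    """Rough cost in US dollars for describing and narrating a project."""
    describe = photos * DESCRIPTION_USD_PER_PHOTO
    # macOS voices are free, so narration only costs money with Gemini voices.
    narrate = (
        photos * seconds_per_photo * NARRATION_USD_PER_SECOND
        if gemini_voice
        else 0.0
    )
    return round(describe + narrate, 4)


print(estimate_cost(20, 12))                      # 0.28
print(estimate_cost(20, 12, gemini_voice=False))  # 0.04
```

Twenty photos with about 12 seconds of narration each come to roughly 28 cents under these assumptions. The run log's estimate is the better number to trust for a real project.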

8. Troubleshooting

API key not found

If the top-right dot shows ● API ✗, open Settings, paste in your key, click Save, then Test. The key is stored in macOS Keychain after that.

No description generated for a photo

Make sure you have a working API key (the dot should read ● API ✓), then click the + chip on that row, or Generate Missing Text to fill in any that are blank. If a description came out wrong, click the chip and add a short note about what to focus on.

I don't hear music in the video

Pick a track from the Music menu before you click Create Videos. Selecting music only affects videos rendered afterward, and the volume slider must be above 0%. The music ducks low while the voice is speaking, so it's most audible in the gaps and at the start and end.

A project folder warning appeared

If you move or delete a project's folder in Finder while it's open, PhotoDesc shows a banner because it can no longer save there. Start a New Project, or reopen the folder if you can put it back.

Render was cancelled or interrupted

Press Cancel once for a graceful stop that keeps everything done so far; the narration you already generated is cached, so picking back up doesn't re-charge you.

9. Tips

Edit the descriptions to taste before generating audio. The AI gives you a solid first draft, but a quick human pass makes the narration feel intentional. Text edits are free; only the audio costs anything.

Use blank (no-description) photos as title cards or pauses, and turn on Randomize so a long slideshow breathes naturally instead of marching at a fixed beat.

For social media, set the aspect ratio to 9:16 (vertical) or 1:1 (square) before you render. It reframes every photo to fit, and you can fine-tune any individual crop with Crop / Frame.

Keep an eye on usage at Google AI Studio. The run log saved with each project lists an estimated cost if you want to track it.

10. Get in Touch

PhotoDesc is built and maintained by Brian Paris at Sounds and Colors. If you run into something this guide doesn't cover, find a bug, or just want to share how you're using it, send a note.

Real feedback from people actually using this is how the app gets better. Don't hesitate to reach out.