Case study

PodForge

A self-hosted podcast pipeline that turns a topic into a finished, published episode. Claude writes the transcript to a per-feed brief; a Mac service renders it with a local Kokoro voice, mixes jingles and sound effects and normalises loudness; a Raspberry Pi writes the MP3 and rebuilds the feed. One command wraps the whole run and is also exposed as an MCP server, and a calibration loop keeps episode lengths honest.

Last updated: 3 Jul 2026

Role: Architect and engineer (AI-paired)
Year: 2026
Client: Personal applied-AI build
Outcome: Live, multi-feed, self-hosted

What it does

PodForge turns a topic into a finished, published podcast episode, without a studio, a microphone, or a cloud bill.

You give it a subject, a document, or just the thread of a conversation. Claude writes a real transcript to the brief of whichever feed you are publishing to. A service on my Mac renders that script to audio with a local text-to-speech voice, mixes in the intro jingle and any sound effects, and normalises the loudness so every episode lands at the same level. A Raspberry Pi then writes the MP3 and rebuilds the RSS feed, and the episode appears in an ordinary podcast app a few seconds later.

The whole run sits behind one command, podcast publish. That same toolchain is also exposed as an MCP server, so I can drive it straight from a conversation: write the script here, render and publish there, report the duration and the feed URL back.

Example output: a published episode of “The Big Why” on the children’s feed. The audio is rendered on-device with local voices, the mix carries an intro jingle and sound effects played in the gaps, and the script was written to a calibrated length target rather than a guessed one. The Pi rebuilds the feed and the episode is delivered through a signed URL.

It is genuinely multi-feed. The same engine runs a focused, two-host study feed for my MBA revision and a gentler, more wondering children’s feed for my eight-year-old, each with its own cast, tone, jingles and rules. Adding another feed is one command and a small config file.

The problem

I wanted my own podcasts, on my own terms, for an audience of two.

One of them is me. I am working through an MBA, and the most useful thing I can do with a dense week of course material is turn it into something I can listen to on a walk: a focused, two-host episode that holds the argument together rather than reading a summary at me. The other is my eight-year-old son, who asks the kind of relentless, wonderful, gloriously silly questions a curious child asks — why is the sky blue, where does the sea end — and deserves a show that takes them seriously, in a voice and a world built for him.

The closest off-the-shelf answer is Google’s NotebookLM: hand it your sources, get back an AI “audio overview”. It is genuinely clever. But it writes the show its way, picks the voices, and keeps the result inside someone else’s web app; and the moment you want natural-sounding speech of your own, you are renting a hosted text-to-speech voice — an ElevenLabs subscription, or one like it — on top of a hosting bill. The writing lives in their product rather than in a conversation, the voices and the content sit on their servers, and the marginal cost of one more episode is a line on an invoice.

For something this personal and this small, that is the wrong shape. The marginal cost of one more episode should be close to nothing. The content should stay on hardware I own. And the interesting part, the writing, should sit where the writing is best done, in a conversation with Claude, not bolted onto a generic generator that turns any pasted text into the same flat read.

The children’s feed carries a second motive worth naming. He already has a Toniebox — the lovely, screen-free audio box where a figurine plays a fixed, bought-in story. PodForge’s children’s feed is the personalised counterpart to it: the episodes answer his questions, not a stranger’s, and because each feed is just ordinary podcast RSS, the show lands inside the Apple ecosystem he already lives in, where Screen Time and parental controls apply natively. A bedtime show written for him, that the platform’s own guardrails already police.

So the build question was narrow. Could I make a pipeline that writes a genuinely good script to a specific show’s brief, renders it to natural audio locally, publishes a real RSS feed I can subscribe to in any app, and costs nothing per episode to run? PodForge is that pipeline, in effect a Claude-first alternative to NotebookLM with no hosted-voice subscription behind it, built end to end, with the awkward parts of audio left in rather than smoothed over.

Architecture

One path from a topic to a published feed. Claude writes the transcript to the per-feed config; the Mac service renders, mixes and normalises with a local model; the Pi writes the MP3 and rebuilds the feed.xml; delivery is a Cloudflare Tunnel with admin access and signed URLs for listeners. A single CLI, also an MCP server, wraps render and publish, and the measured duration feeds back into the next script’s length target.

The shape is a clean split between the part that should reason and the parts that should be deterministic.

The writing is reasoning, and it sits with Claude, guided by a Skill. The Skill is instructions, not code: it loads the target feed’s configuration and its running feedback log, applies the tone profile and the cast, structures any named segments, and writes the script. It treats the most recent feedback on a feed as an instruction rather than a suggestion, which is how a show develops a voice over a run of episodes.

Everything after the script is machinery. The Mac runs a small local service as a background agent. It renders the transcript with Kokoro, a text-to-speech model that runs on the Apple Silicon GPU several times faster than real time, so a ten-minute episode renders in about a minute and a half and never leaves the machine. It then mixes any sound effects, wraps the intro and outro jingles, and normalises the loudness. The finished MP3 is handed to a service on the Raspberry Pi, which stores it and rebuilds the RSS for that feed. Delivery runs over a Cloudflare Tunnel: the admin surface sits behind Access, and each listener gets their own signed URL, so a feed can be shared with one person without being public.
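
The per-listener signed URL can be pictured with a small sketch. This is not PodForge's actual scheme, which the text does not describe; it assumes an HMAC-SHA256 signature over the feed path and a listener name, and the secret, the parameter names and the example.invalid host are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"replace-me"  # hypothetical server-side secret, never sent to listeners


def _signature(path: str, listener: str, secret: bytes) -> str:
    """HMAC-SHA256 over the feed path and the listener name, as hex."""
    return hmac.new(secret, f"{path}|{listener}".encode(), hashlib.sha256).hexdigest()


def sign_feed_url(base_url: str, listener: str, secret: bytes = SECRET) -> str:
    """Return base_url with the listener name and its signature appended."""
    sig = _signature(urlparse(base_url).path, listener, secret)
    return f"{base_url}?{urlencode({'listener': listener, 'sig': sig})}"


def verify_feed_url(url: str, secret: bytes = SECRET) -> bool:
    """Accept the URL only if its signature matches its path and listener."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    listener = params.get("listener", [""])[0]
    sig = params.get("sig", [""])[0]
    expected = _signature(parsed.path, listener, secret)
    # Constant-time comparison, on bytes so odd characters cannot raise.
    return hmac.compare_digest(sig.encode(), expected.encode())
```

A URL signed for one listener fails verification if the listener name or the path is altered, which is what lets a feed be shared with one person without being public.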

A measured calibration file closes the loop. Rather than assume a fixed words-per-minute rate, the system records the real rendered duration of every episode and uses it to set the word-count target for the next one, per feed, so the length estimate sharpens with use.
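
A minimal sketch of that loop, assuming the calibration file boils down to a list of (word count, rendered seconds) pairs per feed. The class name, the method names and the 150 words-per-minute starting guess are hypothetical.

```python
from dataclasses import dataclass, field

DEFAULT_WPM = 150.0  # hypothetical starting guess, used only before any measurement


@dataclass
class FeedCalibration:
    """Running record of (word_count, rendered_seconds) for one feed."""
    samples: list[tuple[int, float]] = field(default_factory=list)

    def record(self, word_count: int, rendered_seconds: float) -> None:
        """Store the measured duration of a finished episode."""
        self.samples.append((word_count, rendered_seconds))

    def words_per_minute(self) -> float:
        """Average measured speaking rate across all recorded episodes."""
        total_words = sum(words for words, _ in self.samples)
        total_seconds = sum(seconds for _, seconds in self.samples)
        if total_seconds <= 0:
            return DEFAULT_WPM
        return total_words / (total_seconds / 60.0)

    def word_target(self, target_minutes: float) -> int:
        """Word count the next script should aim for to hit target_minutes."""
        return round(self.words_per_minute() * target_minutes)
```

Each published episode calls record once, and the next script asks word_target for its length, so the estimate follows the measured rate rather than a fixed assumption.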

Control where it counts

The difference between a text reader and a listenable podcast is in the details of the speech, so the script carries a small set of controls that the renderer honours.

Pronunciation is handled, not hoped for. Acronyms and proper nouns that a model tends to mangle are wrapped with a phonetic override that forces the right reading on every occurrence, drawn from a dictionary that grows whenever the voice reveals a new mispronunciation. The script can also place a deliberate pause to land a point, slow a span it wants the listener to sit with, and tag a sound effect that the renderer resolves to real audio at build time. Those effects are placed to play in the gaps between lines rather than under the voice, so they add texture without fighting the words.
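
One way to picture those controls is the sketch below. The text does not give PodForge's real tag syntax, so the bracketed [pause ...] and [sfx: ...] forms, the dictionary entry and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical pronunciation dictionary: written form -> phonetic respelling.
PRONUNCIATIONS = {"MBA": "em bee ay"}

# Hypothetical control tags: [pause 0.8] and [sfx: splash].
TAG = re.compile(r"\[(pause|sfx)[: ]\s*([^\]]+)\]")


def apply_pronunciations(text: str, table: dict[str, str] = PRONUNCIATIONS) -> str:
    """Replace every whole-word occurrence of a listed term with its respelling."""
    if not table:
        return text
    terms = sorted(table, key=len, reverse=True)  # longest first, so longer terms win
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)


def parse_line(line: str) -> list[tuple[str, str]]:
    """Split one script line into ordered ('speech' | 'pause' | 'sfx', value) events."""
    events = []
    pos = 0
    for m in TAG.finditer(line):
        speech = line[pos:m.start()].strip()
        if speech:
            events.append(("speech", apply_pronunciations(speech)))
        events.append((m.group(1), m.group(2).strip()))
        pos = m.end()
    tail = line[pos:].strip()
    if tail:
        events.append(("speech", apply_pronunciations(tail)))
    return events
```

Because an effect becomes its own event in the sequence rather than an overlay on a speech event, it naturally plays in a gap between spoken spans.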

The unglamorous rules matter just as much. The dialogue is kept free of any markup, because the voice would read the symbols out loud. When an episode title promises a number of facts, the script is checked to contain exactly that many. Numbers are made to agree with the things they count. None of this is clever, and all of it is the difference between something you would actually listen to and something you would switch off.
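
Two of these rules are mechanical enough to sketch as checks. The version below is an illustration under assumptions, not the project's own linter: the set of flagged characters, the small number-word table and the idea that facts arrive as a list are all hypothetical.

```python
import re
from typing import Optional

# Characters a voice would read aloud if they reached the spoken text.
MARKUP = re.compile(r"[*_#`~<>|]")

# Small, illustrative table of number words; digits are handled separately.
NUMBER_WORDS = {"three": 3, "five": 5, "seven": 7, "ten": 10}
COUNT = re.compile(
    r"\b(\d+|" + "|".join(NUMBER_WORDS) + r")\s+(?:\w+\s+)?facts?\b",
    re.IGNORECASE,
)


def lint_dialogue(lines: list[str]) -> list[str]:
    """Return one message per spoken line that contains a markup-like symbol."""
    return [
        f"line {i}: markup character {m.group(0)!r}"
        for i, text in enumerate(lines, start=1)
        if (m := MARKUP.search(text))
    ]


def promised_count(title: str) -> Optional[int]:
    """Return the number of facts a title promises, or None if it promises none."""
    m = COUNT.search(title)
    if m is None:
        return None
    token = m.group(1).lower()
    return int(token) if token.isdigit() else NUMBER_WORDS[token]


def check_fact_count(title: str, facts: list[str]) -> bool:
    """True when the script delivers exactly the number of facts the title promises."""
    promised = promised_count(title)
    return promised is None or promised == len(facts)
```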

A decision the build made

PodForge began as one thing, a study-podcast tool for revision, with a single feed and a single register. The obvious thing to do with a tool that works is to add features to that one show.

The better decision was to generalise it instead. The study feed and the children’s feed want genuinely different things: a different cast, a different pace, different vocabulary, a different attitude to jokes and sound effects, and in the children’s case a hard rule that the hosts are fictional characters and no real person is ever impersonated. Rather than fork the tool or pile both shows’ needs into one prompt, the build pulled everything show-specific out into a per-feed configuration and a per-feed feedback log, and left the engine generic. Adding a feed is now a command and a small file, and a change to one show cannot leak into another.
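
The split between a generic engine and per-feed configuration might look roughly like this. The field names, the registry and the brief format are hypothetical; only the ideas (cast, tone, a fictional-hosts rule, the latest feedback treated as an instruction) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedConfig:
    """Everything show-specific lives here; the engine itself stays generic."""
    slug: str
    title: str
    cast: tuple[str, ...]
    tone: str
    target_minutes: float = 10.0
    hosts_must_be_fictional: bool = False


FEEDS: dict[str, FeedConfig] = {}


def register_feed(cfg: FeedConfig) -> None:
    """Add a feed; refuse to overwrite an existing one."""
    if cfg.slug in FEEDS:
        raise ValueError(f"feed already registered: {cfg.slug}")
    FEEDS[cfg.slug] = cfg


def build_brief(cfg: FeedConfig, feedback_log: list[str]) -> str:
    """Assemble the writing brief for one feed from its own config and feedback only."""
    lines = [
        f"Show: {cfg.title}",
        f"Cast: {', '.join(cfg.cast)}",
        f"Tone: {cfg.tone}",
        f"Target length: {cfg.target_minutes:g} minutes",
    ]
    if cfg.hosts_must_be_fictional:
        lines.append("Rule: hosts are fictional characters; never impersonate a real person.")
    if feedback_log:
        # The most recent feedback is passed on as an instruction, not a suggestion.
        lines.append(f"Instruction from latest feedback: {feedback_log[-1]}")
    return "\n".join(lines)
```

Since build_brief reads only the config and log it is handed, a change to one show has no path into another show's brief.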

That carried a cost worth naming. It meant building a scope discipline that did not exist before: every change now has to be classified as tool-wide or feed-scoped before it is made, because the same code path serves every show. The reward is that the pipeline is no longer a single podcast with my preferences baked in. It is a small podcast platform that happens to run two feeds today and could run ten tomorrow.

The other decision worth stating plainly is the one not to spend money. A hosted voice — the ElevenLabs-style subscription a tool like NotebookLM nudges you towards — would likely be a notch more natural than a local one. But local rendering keeps the marginal cost of an episode at zero, works offline, and keeps every voice and every script on hardware I own, and the calibration loop is exactly the instrument that would tell me whether a paid voice ever earned its place. For a personal tool meant to run for years, that was the right trade, and the architecture is drawn so the voice is a swappable part, not a rewrite.

Honest by design

The guardrails are not a note at the end of an episode. They are rules the generator has to respect before anything is rendered.

The hosts are fictional, always. A feed can register a named persona as a character, but the system will not put words in the mouth of a real person the listener knows, and on the children’s feed that is a hard rule rather than a preference. The voices are a fixed, registered cast, so an invented guest cannot quietly collide with an existing one. The output is written in British English by default, pitched at the reading level the feed asks for, and a children’s episode is held to gentler pacing and simpler analogies than a study one.
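
A pre-render gate for the cast rules could be as small as the sketch below. It assumes the script's speaker labels, the feed's registered cast and a list of real people to protect are available as plain collections; all names in it are hypothetical.

```python
def check_cast(script_speakers: list[str], registered: set[str], real_people: set[str]) -> list[str]:
    """Return the problems that should block rendering; an empty list means the cast is clean."""
    protected = {name.casefold() for name in real_people}
    problems = []
    for name in dict.fromkeys(script_speakers):  # unique speakers, first-seen order
        if name.casefold() in protected:
            problems.append(f"{name}: real person, cannot be voiced")
        elif name not in registered:
            problems.append(f"{name}: not in the registered cast")
    return problems
```

Running a check like this before any audio is generated is what makes the rule a constraint on the generator rather than a note added afterwards.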

For a tool that produces audio for my own family, that posture is the product, not the packaging. The point of owning the whole pipeline is that these rules live in code and config I control, not in the terms of service of a generator I rent.

Open work

The honest list of what is not finished.

The calibration loop is good and getting better, but it is still an average rate per feed, not a model of how pacing, jingles and sound effects each bend the final duration. Episodes with heavy effects are the ones it estimates least well, and that is the next thing to sharpen.

Sound-effect sourcing leans on a couple of free libraries and a local cache. It is reliable for common cues and thin for specific ones, and the resolver is built to fall back rather than fail, which is the right behaviour but not yet a rich one.

Artwork is per-feed rather than per-episode, because the publishing host is locked down enough that deploying a new cover for every episode is not yet automated. It is a plumbing problem, not a design one.

PodForge is a deliberately small answer to a question I actually had: can one person run their own podcasts properly, for almost nothing, without handing the writing, the voices or the content to someone else. The answer it argues for is the one I would bring to a larger build. Put the reasoning where reasoning is best done and keep everything else deterministic. Measure the thing you would otherwise guess. Pull what varies out into configuration so the engine stays generic. And own the rules that matter, rather than renting them. The subject here is a family podcast; the discipline travels.

Updates

18 June 2026

A better voice: from a dialogue model to per-speaker cloning

The local voice was the one part of the pipeline I had left rough. A push to make it more expressive turned into a useful lesson about choosing the right kind of model, and landed on a setup that is both more natural and more consistent than where it started.

The case study above describes the renderer as Kokoro, a small, fast, reliable local voice. It still is the dependable fallback, but the default has now moved on, and the path there is worth recording because it is a tidy example of letting evidence, not enthusiasm, choose the tool.

The motivation was expressiveness. Kokoro reads cleanly but a little flatly, and I wanted the study feed in particular to sound more like a conversation. The obvious candidate was Dia, an open dialogue model that generates two speakers in one pass with real warmth and timing. On paper it was perfect. In practice it had a fault that no amount of tuning fixed: it re-invents the speaker voices on every generation call, and because a long episode is rendered as many short calls, the voices drifted from one passage to the next and the joins between calls were audible. This is not a quirk of my setup; it is a known, open limitation of the model itself. I spent a full pass on segmentation, pacing, silence handling and reference-audio cloning before accepting that the model simply could not hold a steady voice across a whole episode.

So the first decision was to stop. I reverted the default to Kokoro, which renders a whole script in a single call and therefore cannot drift or seam, and wrote the alternatives up properly rather than keep pushing a tool that was the wrong shape for the job.

The answer turned out to be a different open model, Qwen3-TTS. The key difference is structural: it renders one fixed voice per turn rather than improvising a dialogue, so a cloned voice stays identical from the first line to the last. I clone each feed's existing British voices once into a small library, and the renderer maps every speaker to their voice from the same per-feed config the rest of the system already uses. The result is more natural than Kokoro, perfectly consistent unlike Dia, runs entirely on the Mac, and renders an episode in well under a minute. It is now the default for every feed, with Kokoro kept as the reliable fallback and the voice still a swappable part, exactly as the architecture was drawn to allow.

The lesson is the one the case study already argues for, sharpened by a real failure. The expressive-looking tool was not the right tool; the right tool was the one whose structure matched the constraint, a steady voice held across many calls. Measuring what you would otherwise guess, and keeping the voice a swappable part rather than a rewrite, is what made the change a single afternoon instead of a rebuild.

3 July 2026

Buying a better voice without buying a bill: a tiered, budget-capped renderer

The case study argued the calibration loop was the instrument that would tell me whether a paid voice ever earned its place. This is that instrument used in earnest: a free-first tiering that makes Google's high-quality voices the default, while a hard monthly cap keeps the marginal cost of an episode where it belongs, close to nothing.

The case study makes a promise: keep the voice a swappable part, and let the calibration loop decide whether a paid voice is ever worth it. This update is that promise kept, in both directions.

I gave Google's cloud text-to-speech a fair trial and it is clearly better than the local voices, so it is now the default. But I did not want better to quietly become a bill, which is the very shape I criticised NotebookLM for. So the change is not really "switch to a cloud voice". It is a small routing layer that sits under PodForge, and under anything else that needs speech, and treats cost as a first-class constraint.

The rule is free-first. There is a tier ladder: the local on-device voices, which are free and unlimited; Google's voices inside their monthly free allowance; Google's voices as paid, but only while a hard monthly cap of ten pounds is not breached; and a gated premium tier that never runs without me asking. A usage ledger tracks characters and spend against that cap, and if a job would push spend over the line, the router does not overspend: it drops back to the free local voice and says so. At my volume the whole thing sits inside the free allowance anyway, so in practice the quality went up and the marginal cost stayed at nothing. The cap guards a future I have not reached, not a bill I am paying.
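
The routing rule can be sketched as a pure function over a ledger. Only the ten-pound cap comes from the text; the free allowance, the paid rate, the tier names and the function names below are hypothetical placeholders, and the premium tier's pricing is deliberately not modelled.

```python
from dataclasses import dataclass

MONTHLY_CAP_GBP = 10.0             # the hard cap named in the text
FREE_CHARS_PER_MONTH = 1_000_000   # hypothetical free allowance
GBP_PER_MILLION_CHARS = 12.0       # hypothetical paid rate


@dataclass
class Ledger:
    chars_used: int = 0      # cloud characters rendered this month
    spend_gbp: float = 0.0   # paid spend this month


@dataclass(frozen=True)
class Decision:
    tier: str
    cost_gbp: float
    reason: str


def route(job_chars: int, ledger: Ledger, premium_requested: bool = False) -> Decision:
    """Choose one tier for a whole job without mutating the ledger."""
    if premium_requested:
        # Gated tier: reachable only on explicit request; pricing not modelled here.
        return Decision("premium", 0.0, "explicitly requested")
    total = ledger.chars_used + job_chars
    if total <= FREE_CHARS_PER_MONTH:
        return Decision("cloud-free", 0.0, "inside the monthly free allowance")
    # Only the characters above the free allowance are billable.
    billable = total - max(ledger.chars_used, FREE_CHARS_PER_MONTH)
    cost = billable / 1_000_000 * GBP_PER_MILLION_CHARS
    if ledger.spend_gbp + cost <= MONTHLY_CAP_GBP:
        return Decision("cloud-paid", cost, "over the free allowance but under the cap")
    return Decision("local", 0.0, "would breach the monthly cap; fell back to the local voice")


def commit(ledger: Ledger, decision: Decision, job_chars: int) -> None:
    """Record a finished cloud job against the ledger; local jobs cost nothing."""
    if decision.tier in ("cloud-free", "cloud-paid"):
        ledger.chars_used += job_chars
        ledger.spend_gbp += decision.cost_gbp
```

Keeping route free of side effects means the decision, including its reason string, can be reported before any money is spent, and commit runs only once the render has succeeded.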

The nicer surprise was multi-voice. Voices now live in named casts, and each speaker carries both a cloud voice and a local one, so the engine is chosen once for a whole script and an episode never shifts voice halfway through. Because the cloud palette is far larger than my small local set, the children's feed can finally give every guest character its own distinct voice, which the local setup could not do without two characters colliding on one. The study feed keeps its two hosts; the children's feed grew a proper ensemble.
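
A sketch of a named cast in which each speaker carries both voices, with the engine chosen once per script. The speaker names and voice ids are placeholders, not the real cast or real voice identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    cloud: str  # voice id on the cloud engine
    local: str  # voice id on the on-device engine


# Hypothetical cast; the local palette is smaller, so two speakers share local-1.
KIDS_CAST = {
    "Host": Voice(cloud="cloud-a", local="local-1"),
    "Co-host": Voice(cloud="cloud-b", local="local-2"),
    "Guest owl": Voice(cloud="cloud-c", local="local-1"),
}


def resolve_voices(cast: dict[str, Voice], speakers: list[str], engine: str) -> dict[str, str]:
    """Map every speaker to a voice id on one engine, chosen once for the whole script."""
    if engine not in ("cloud", "local"):
        raise ValueError(f"unknown engine: {engine}")
    return {name: getattr(cast[name], engine) for name in speakers}


def collisions(assignment: dict[str, str]) -> dict[str, list[str]]:
    """Return the voice ids shared by more than one speaker."""
    by_voice: dict[str, list[str]] = {}
    for speaker, voice_id in assignment.items():
        by_voice.setdefault(voice_id, []).append(speaker)
    return {voice: names for voice, names in by_voice.items() if len(names) > 1}
```

With this placeholder cast, the cloud assignment has no collisions while the local one puts two characters on the same voice, which is the limitation the larger cloud palette removes.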

And the honest part, because the open-work section demands one. The measured calibration loop is trained on the local voices, and the cloud voice speaks about a seventh faster. The first full children's episode rendered on the new default came in a little under seventeen minutes against a twenty-minute target, purely because the words were spoken quicker. Nothing broke, but the length model is now engine-specific, and I logged a separate cloud speaking rate rather than let episodes land short and pretend the estimate was fine. Measure the thing you would otherwise guess, again.
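
An engine-specific length model can be as simple as keying the measured rate by feed and engine. The rates below are illustrative round numbers chosen so that the cloud rate is a seventh faster than the local one; they are not the measured values, and the names are hypothetical.

```python
# Hypothetical measured speaking rates in words per minute, keyed by (feed, engine).
RATES_WPM = {
    ("kids", "local"): 140.0,
    ("kids", "cloud"): 160.0,  # 140 * 8 / 7, i.e. a seventh faster
}


def word_target(feed: str, engine: str, target_minutes: float) -> int:
    """Word count to aim for, using the rate measured for this feed on this engine."""
    return round(RATES_WPM[(feed, engine)] * target_minutes)


def predicted_minutes(feed: str, engine: str, word_count: int) -> float:
    """Expected rendered length of a script of word_count words."""
    return word_count / RATES_WPM[(feed, engine)]
```

With these numbers, a script written to the local rate for a ten-minute target is 1,400 words, and the same script spoken by the faster engine comes in at 8.75 minutes, so the word target has to be looked up per engine rather than per feed alone.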