Production App · Gen AI

Story Magic

An interactive storytelling platform for children that synthesizes text, custom illustrations, and narrated audio in parallel to create a low-latency, magical user experience.

Multimodal AI · Product Design · UX Engineering · Streaming Architecture

Background

Bedtime stories are a ritual in our house. My daughter was at the age where she'd ask for the same story twice in a row, then immediately want a new one. I wanted to see if I could build something that gave her a unique story every time, one that felt made just for her.

The technical challenge was making three AI services (text, image, voice) feel like one thing to a child. The product goal was simpler: it had to feel magical.

How it works

The flow is designed to be frictionless. It starts with the user entering a few keywords—like "silly dinosaur pizza." Before any AI API is called, a strict content filter quietly screens the input, removing inappropriate words or falling back to a generic story if every keyword is filtered out. This ensures the app never shows a confusing error to a child.
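A minimal sketch of that screening step, with a hypothetical blocklist and fallback (the names `screenKeywords`, `BLOCKLIST`, and `FALLBACK_KEYWORDS` are illustrative, not the app's actual identifiers):

```typescript
// Illustrative blocklist entries only — the real filter would be far larger.
const BLOCKLIST = new Set(["scary", "blood"]);

// Known-safe keywords used when everything the child typed was filtered out.
const FALLBACK_KEYWORDS = ["friendly", "dragon", "picnic"];

function screenKeywords(raw: string): string[] {
  const words = raw
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w.length > 0 && !BLOCKLIST.has(w));
  // Never surface an error: if every keyword was removed,
  // silently substitute a generic, known-safe prompt instead.
  return words.length > 0 ? words : FALLBACK_KEYWORDS;
}
```

The key design choice is that the filter's failure mode is substitution, not rejection — from the child's point of view, a story always appears.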

Requests go to GPT-4 (for the story) and DALL-E 3 (for the illustration) in parallel. The experience blocks only on the story: as soon as the text arrives from GPT-4, it is displayed in the UI and simultaneously routed to ElevenLabs for voice synthesis. Because text and image generation run in parallel and the audio streams back in chunks, narration starts playing before the full audio file has even been generated.
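The fan-out can be sketched as follows, assuming hypothetical wrappers around each API (the stand-in bodies here just return placeholders; in the real app each would make the corresponding network call):

```typescript
type StoryResult = { text: string; imageUrl?: string };

async function generateStory(keywords: string[]): Promise<string> {
  return `Once upon a time: ${keywords.join(" ")}`; // stand-in for the GPT-4 call
}

async function generateImage(keywords: string[]): Promise<string> {
  return `https://example.invalid/${keywords[0]}.png`; // stand-in for DALL-E 3
}

async function synthesizeAudio(text: string): Promise<void> {
  // stand-in for the streaming ElevenLabs request
}

async function createExperience(keywords: string[]): Promise<StoryResult> {
  const imagePromise = generateImage(keywords); // fired immediately, awaited later
  const text = await generateStory(keywords);   // only the story blocks the UI
  void synthesizeAudio(text);                   // audio starts as soon as text exists
  return { text, imageUrl: await imagePromise };
}
```

The point of the structure: the image promise is created before the story is awaited, so both requests are in flight at once, and the audio request never waits on the image.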

The final story can then be saved locally to the browser, ready to be replayed or shared with a grandparent.
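A sketch of the local save step. The storage backend is injected behind a minimal interface so the same functions work against the browser's `localStorage` or any Map-like stand-in (the names `saveStory` and `loadStory` are hypothetical):

```typescript
// Subset of the browser Storage interface the save path needs.
interface StoryStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

function saveStory(store: StoryStore, id: string, story: { text: string }): void {
  store.setItem(`story:${id}`, JSON.stringify(story));
}

function loadStory(store: StoryStore, id: string): { text: string } | null {
  const raw = store.getItem(`story:${id}`);
  return raw ? JSON.parse(raw) : null;
}
```

In the browser you would pass `window.localStorage` directly, since it already satisfies this interface.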

The interesting parts

Streaming audio without waiting. Getting ElevenLabs audio to start playing before the full file was generated took some work. Children don't have patience for progress bars — if they have to wait 10 seconds for narration to load, the moment is gone. Streaming incrementally solved that: the story starts reading out loud within a couple of seconds of the request.
Three AI APIs, one coherent experience. Each service has its own latency profile — GPT-4 is fast, DALL-E 3 is slower, ElevenLabs is somewhere in between. The app runs text and image generation in parallel where it can, and starts streaming audio as soon as the story text is ready, without waiting for the image. The result feels fast even though there's a lot happening.
Content safety by design, not afterthought. The filtering had to be permissive enough to keep stories fun — kids say weird things — but strict enough that nothing inappropriate made it into a story for a three-year-old. The fallback (generate a generic story if all keywords are filtered) meant the app never broke or showed an error to the user.
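The incremental-playback idea above can be sketched like this, assuming the TTS response arrives as an async stream of audio chunks (`playAsItStreams` and the `enqueue` callback are hypothetical names; in the browser, `enqueue` would append to a Web Audio buffer queue):

```typescript
async function playAsItStreams(
  chunks: AsyncIterable<Uint8Array>,
  enqueue: (chunk: Uint8Array) => void, // hand each chunk to the player immediately
): Promise<number> {
  let total = 0;
  for await (const chunk of chunks) {
    enqueue(chunk); // playback can begin here, long before the stream ends
    total += chunk.length;
  }
  return total; // total bytes received once the stream completes
}
```

The first `enqueue` call happens as soon as the first chunk arrives, which is what lets narration start within a couple of seconds instead of after the whole file is synthesized.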

What I took from it

Building for a child is a useful constraint. A child can't read instructions and won't wait for loading screens, and the "user" has strong opinions but no vocabulary to articulate them. If she didn't immediately engage, something was wrong.

The hardest part was making three APIs with different failure modes and latencies feel like one product. I ended up parallelising where I could, streaming audio to kill wait time, and building fallbacks for each service independently.
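The per-service fallbacks can be captured in one small combinator — a sketch, assuming each API wrapper returns a promise (`withFallback` is a hypothetical helper, not the app's actual code):

```typescript
// Wrap any service call so it degrades independently: a failure in one
// API returns a safe default instead of taking down the whole experience.
async function withFallback<T>(call: () => Promise<T>, fallback: T): Promise<T> {
  try {
    return await call();
  } catch {
    return fallback; // degrade quietly; never surface an error to the child
  }
}
```

Each service then gets its own default — e.g. a placeholder illustration if the image call fails — so the story, image, and narration fail (or succeed) independently.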
