An interactive storytelling platform for children that synthesizes text, custom illustrations, and narrated audio in parallel to create a low-latency, magical user experience.
Bedtime stories are a ritual in our house. My daughter was at the age where she'd ask for the same story twice in a row, then immediately want a new one. I wanted to see if I could build something that gave her a unique story every time, one that felt made just for her.
The technical challenge was making three AI services (text, image, voice) feel like one thing to a child. The product goal was simpler: it had to feel magical.
The flow is designed to be entirely frictionless. It starts with the user entering a few keywords—like "silly dinosaur pizza." Before touching any AI API, a strict content filter screens the input, silently removing inappropriate words or falling back to a generic story if necessary. This ensures the app never throws a confusing error to a child.
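A minimal sketch of that pre-API filter — the blocklist entries, fallback prompt, and function names here are illustrative, not the production values:

```typescript
// Hypothetical blocklist; the real list would be far larger.
const BLOCKLIST = new Set(["scary", "violent"]);

// Generic seed used when filtering removes every keyword.
const FALLBACK_PROMPT = "a friendly animal adventure";

function sanitizePrompt(raw: string): string {
  // Split on whitespace and silently drop any blocklisted word.
  const kept = raw
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word && !BLOCKLIST.has(word));

  // If nothing survives, fall back to a safe generic prompt
  // instead of surfacing an error to the child.
  return kept.length > 0 ? kept.join(" ") : FALLBACK_PROMPT;
}
```

The key design choice is that the filter never rejects input outright — it either quietly repairs the prompt or swaps in a safe default.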
Requests go to GPT-4 (for the story) and DALL-E 3 (for the illustration) in parallel. The UI waits only on the story: as soon as the text arrives from GPT-4, it is displayed immediately and simultaneously routed to ElevenLabs for voice synthesis. Because the audio streams back in chunks, narration starts playing before the full file has even been generated.
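The orchestration can be sketched as below — the API clients are stubbed stand-ins (in the real app they would call GPT-4, DALL-E 3, and ElevenLabs), so the structure, not the calls, is the point:

```typescript
type StoryResult = { text: string };

async function generateStory(prompt: string): Promise<StoryResult> {
  return { text: `Once upon a time, ${prompt}...` }; // stand-in for GPT-4
}

async function generateImage(prompt: string): Promise<string> {
  // stand-in for DALL-E 3
  return `https://example.invalid/image-for/${encodeURIComponent(prompt)}`;
}

async function startNarration(text: string): Promise<void> {
  // stand-in for streaming ElevenLabs audio chunks to an <audio> element
}

async function runPipeline(prompt: string, onText: (t: string) => void) {
  // Fire both requests immediately; the image promise is NOT awaited yet.
  const imagePromise = generateImage(prompt);
  const story = await generateStory(prompt);

  onText(story.text); // show the text the moment it arrives
  const narration = startNarration(story.text); // begin TTS without blocking

  // The image and audio resolve whenever they are ready.
  const imageUrl = await imagePromise;
  await narration;
  return { story, imageUrl };
}
```

The critical path is the story alone; the illustration and narration ride along without ever blocking the first thing the child sees.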
The final story can then be saved locally to the browser, ready to be replayed or shared with a grandparent.
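Saving can be as simple as serializing the finished story into browser storage. The record shape and key scheme below are assumptions for illustration; accepting a Storage-like object keeps the code testable outside the browser (in the app you would pass `window.localStorage`):

```typescript
interface SavedStory {
  title: string;
  text: string;
  imageUrl: string;
  savedAt: number; // epoch milliseconds, doubles as the key
}

// Structural subset of the browser's Storage interface.
type StorageLike = {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
};

function saveStory(storage: StorageLike, story: SavedStory): void {
  // Hypothetical key scheme: one entry per save timestamp.
  storage.setItem(`story:${story.savedAt}`, JSON.stringify(story));
}

function loadStory(storage: StorageLike, savedAt: number): SavedStory | null {
  const raw = storage.getItem(`story:${savedAt}`);
  return raw ? (JSON.parse(raw) as SavedStory) : null;
}
```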
Building for a child is a useful constraint. A child can't read instructions and won't wait for loading screens, and the "user" has strong opinions but no vocabulary to articulate them. If she didn't immediately engage, something was wrong.
The hardest part was making three APIs with different failure modes and latencies feel like one product. I ended up parallelising where I could, streaming audio to kill wait time, and building fallbacks for each service independently.
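One way to express "fallbacks for each service independently" is a small wrapper that races a service call against a timeout and degrades to a safe default on any failure. This is a sketch, not the app's actual code; the timeout values and fallbacks would differ per service (a stock illustration for the image, text-only mode for the audio):

```typescript
async function withFallback<T>(
  primary: () => Promise<T>,
  fallback: T,
  timeoutMs: number,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("service timed out")), timeoutMs);
  });
  try {
    // Whichever settles first wins: the service, or the timeout.
    return await Promise.race([primary(), timeout]);
  } catch {
    return fallback; // degrade gracefully instead of surfacing an error
  } finally {
    clearTimeout(timer); // don't leave the timer pending after a fast success
  }
}
```

Wrapping each of the three services this way means a slow or failing illustration never blocks the story, and a TTS outage never blanks the screen.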