Showroom by Speechbox

Katie Nguyen, Developer Relations Engineer

AI Automation, Generative Media, ADK Framework, Multimodal AI

New ADK and MCP frameworks streamline Gen Media workflows.
AI agents maintain character consistency across images, videos, and audio.
Multimodal LLMs can now evaluate and self-correct AI-generated content.

At a recent Google Cloud session, Developer Relations Engineer Katie Nguyen demonstrated how to build Generative Media agents that automate complex creative tasks, from character design to full story production, using Google's Agent Development Kit (ADK) and the Model Context Protocol (MCP).

The session, titled "Automating Creativity: Building Gen Media Agents with ADK and MCP," focused on agentic workflows for media creation. Nguyen highlighted their main benefit: consistency. Traditional methods require meticulous manual tracking of assets, whereas agents powered by Gemini have memory, so they can reference previously created assets when building a narrative. As a result, characters, settings, and themes stay consistent across images, videos, music, and narration.
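The role memory plays in consistency can be illustrated with a small sketch. It is not the code shown in the session: the `Asset` and `AssetMemory` names, the file path, and the prompt layout are all hypothetical, and a real agent would keep this state in its own session or memory service rather than in a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    kind: str         # e.g. "image", "video", "audio"
    subject: str      # the character or setting the asset depicts
    uri: str          # where the generated file lives (hypothetical path)
    description: str  # short note on what the asset shows


@dataclass
class AssetMemory:
    """Hypothetical stand-in for an agent's memory of generated assets."""

    assets: list[Asset] = field(default_factory=list)

    def remember(self, asset: Asset) -> None:
        # Record every asset as soon as it is generated.
        self.assets.append(asset)

    def references_for(self, subject: str) -> list[Asset]:
        # Look up everything already created for a given subject.
        return [a for a in self.assets if a.subject == subject]

    def build_prompt(self, subject: str, scene: str) -> str:
        # Attach earlier assets to the next prompt so the model can
        # keep the subject's appearance consistent across scenes.
        refs = self.references_for(subject)
        ref_lines = "\n".join(
            f"- {a.kind}: {a.uri} ({a.description})" for a in refs
        ) or "- (none yet)"
        return (
            f"Scene: {scene}\n"
            f"Keep {subject} consistent with these references:\n{ref_lines}"
        )
```

Each new generation request then carries the earlier assets with it, which is the tracking a human editor would otherwise have to do by hand.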

Key Moment

Lulu's AI-generated debut

Nguyen gave a live demonstration, building an agent to tell the story of "Lulu the shih tzu." Starting from a simple text prompt, the agent used Nano Banana 2 to generate Lulu's initial image, crafted a three-scene storyline, animated each scene into video with Veo, and layered in narration from Gemini's new 3.1 text-to-speech model and background music from Lyria. A walkthrough of the code showed how ADK is used to define tools and configure parameters such as resolution and aspect ratio, while MCP servers provide access to Google Cloud's Gen Media models. A key takeaway was that the agent automated the entire video editing step, combining the separate media elements into a finished video.
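A tool with configurable resolution and aspect ratio might look roughly like the sketch below. This is plain Python under stated assumptions, not the session's code and not an ADK or MCP API: the function names, the allowed values, and the payload shape are hypothetical. In a real agent a function like this would be registered as a tool, and the payload would be forwarded to a media model through an MCP server.

```python
# Assumed option sets; the values a real model accepts may differ.
ALLOWED_ASPECT_RATIOS = {"1:1", "16:9", "9:16"}
ALLOWED_RESOLUTIONS = {"720p", "1080p"}


def build_video_request(
    prompt: str,
    aspect_ratio: str = "16:9",
    resolution: str = "720p",
) -> dict:
    """Validate generation parameters and return a request payload.

    Hypothetical helper: the payload shape is illustrative only.
    """
    if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect ratio: {aspect_ratio}")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    return {
        "prompt": prompt,
        "aspect_ratio": aspect_ratio,
        "resolution": resolution,
    }


def plan_story(character: str, scenes: list[str]) -> list[dict]:
    # One request per scene, all sharing the same default parameters
    # so the clips can be stitched together without rescaling.
    return [build_video_request(f"{character}: {scene}") for scene in scenes]
```

Validating parameters inside the tool keeps a malformed model-generated argument from reaching the media backend.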

Key Moment

ADK's powerful controls

Beyond initial generation, the session explored more advanced agentic capabilities. Nguyen showed that agents can revise their own output: a natural-language request to make the background music louder in a final video was carried out immediately. She also introduced "agentic skills," which let developers package complex instructions (such as expressive audio tags for voice direction) into reusable modules, making agents more robust and efficient. Perhaps most notable was the discussion of using LLMs as judges to evaluate multimodal content. An agent can compare generated media against the original prompt, check details such as character adherence or audio timing, and regenerate content that does not match the creative vision.
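The article does not say how the agent applied the volume change. One plausible reading, offered only as an assumption, is that "louder" is translated into a decibel gain on the music track before it is remixed with the narration. The sketch below shows that arithmetic on raw sample lists; the function names and the choice of decibels are hypothetical.

```python
def db_to_gain(db: float) -> float:
    # Convert a decibel change to a linear amplitude multiplier.
    return 10 ** (db / 20)


def mix(
    narration: list[float],
    music: list[float],
    music_gain_db: float = 0.0,
) -> list[float]:
    """Mix two mono tracks of samples in the range [-1.0, 1.0].

    The music track is scaled by the requested gain; the shorter
    track is treated as silence once it ends.
    """
    gain = db_to_gain(music_gain_db)
    length = max(len(narration), len(music))
    mixed = []
    for i in range(length):
        n = narration[i] if i < len(narration) else 0.0
        m = music[i] * gain if i < len(music) else 0.0
        # Clip so the summed signal stays within the valid range.
        mixed.append(max(-1.0, min(1.0, n + m)))
    return mixed
```

Under this reading, a request for louder music becomes a call such as `mix(narration, music, music_gain_db=6.0)`, where +6 dB roughly doubles the music's amplitude.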

Key Moment

Fixing audio with words

This approach to generative media production could make creative work more accessible, allowing users, even those without extensive creative backgrounds, to rapidly ideate, produce, and refine multimedia content. By abstracting away the complexities of prompting and technical execution, agents built with ADK and MCP point toward a more automated form of storytelling.

Key Moment

LLMs evaluate images

“Gemini's really awesome at multimodality, so we're able to kind of analyze those images even with Gemini, and then fact-check a lot of these questions to make sure that everything's aligned.”
- Katie Nguyen, Developer Relations Engineer
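The evaluate-and-regenerate pattern described in the session can be sketched as a simple loop. This is an illustration under stated assumptions, not the demonstrated implementation: `generate_until_approved`, `fake_generate`, and `fake_judge` are hypothetical names, and the two stubs stand in for calls to a media model and to a multimodal judge such as Gemini.

```python
from typing import Callable


def generate_until_approved(
    prompt: str,
    generate: Callable[[str], str],
    judge: Callable[[str, str], tuple[bool, str]],
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Generate, have a judge check the result, and retry with feedback.

    Returns the last candidate and the number of attempts used.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback = ""
    candidate = ""
    for attempt in range(1, max_attempts + 1):
        # Fold the judge's feedback into the next generation request.
        full_prompt = prompt if not feedback else f"{prompt}\nFix: {feedback}"
        candidate = generate(full_prompt)
        approved, feedback = judge(prompt, candidate)
        if approved:
            return candidate, attempt
    return candidate, max_attempts


# Stubs standing in for real model calls (hypothetical behaviour).
def fake_generate(prompt: str) -> str:
    return "a shih tzu at the park" if "Fix:" in prompt else "a poodle at the park"


def fake_judge(prompt: str, candidate: str) -> tuple[bool, str]:
    if "shih tzu" in candidate:
        return True, ""
    return False, "the character should be a shih tzu"
```

Capping the number of attempts keeps the loop from running indefinitely when the judge never approves a candidate.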

Google Cloud Engineers Unveil AI Agents That Automate Creative Media Production
