Showroom by Speechbox

NVIDIA and Baseten Unveil Next-Gen AI Inference Capabilities on Google Cloud

Jason DavenportGoogle Cloud
AI InfrastructureGPU AccelerationCloud AIModel DeploymentScalable AIEnterprise AIDeep LearningOptimized Inference

At a recent conference, NVIDIA and Baseten leaders detailed their strategic partnership with Google Cloud, focusing on groundbreaking advancements in AI inference. The collaboration promises to deliver unparalleled speed, reliability, and scalability for AI applications, leveraging next-generation hardware and sophisticated software optimizations.

The discussion kicked off with a spotlight on NVIDIA's latest hardware innovations, set to significantly enhance AI inference and training on Google Cloud. Jay Raj from NVIDIA announced that Google Cloud will be among the first providers to offer Vera Rubin, NVIDIA's next-generation hardware, later this year. Additionally, Blackwell GPUs, specifically the RTX Pro 6000 with an impressive 96GB of VRAM, will enable the deployment of multiple models on a single GPU, marking a substantial leap in efficiency. Philip Kylie of Baseten emphasized that inference is crucial for delivering on the promise of AI applications, ensuring low-latency, high-reliability user experiences at hypergrowth scale.

Key Moment
Inference is Key

Baseten, a key partner, detailed their approach to inference at scale, processing billions of inferences daily. They highlighted their close collaboration with NVIDIA on both hardware and software, including early adoption of NVIDIA Dynamo and Blackwell GPUs. A standout feature is Baseten's multi-region deployment capability on Google Cloud, which unifies compute resources globally to minimize latency and maximize accessibility. Furthermore, Baseten provided day-zero support for Google's Gemma 4 model, praising its multimodality for image inputs and its wide range of sizes (from 2 billion to 30 billion parameters), making it ideal for fine-tuning task-specific enterprise solutions like KYC and document extraction.

Key Moment
Day Zero Support

Optimizing LLM inference is a core focus, with NVIDIA recommending TensorRT LLM, an open-source SDK that provides peak performance on NVIDIA hardware with just a few lines of code. This, combined with NVIDIA's NVFP4 precision format and Blackwell GPUs, delivers unparalleled speed. During a live demo, Philip showcased Baseten's platform, demonstrating seamless Gemma 4 deployment from Hugging Face to production on an L4 GPU, complete with an OpenAI-compatible API endpoint. The demo also highlighted the platform's robust auto-scaling capabilities, ensuring steady response times and meeting critical SLAs even under fluctuating traffic. Baseten also offers cost-efficient, token-based APIs for models like NeMo Triton, and advised on selecting GPUs to optimize Total Cost of Ownership (TCO) rather than just upfront cost.

Key Moment
AI Auto-Scaling Magic

The conversation extended to the role of Google Kubernetes Engine (GKE) in powering complex, multi-model agentic workflows. GKE's low-latency communication between models is vital for compound AI systems, saving dozens of milliseconds per turn and significantly reducing overall latency. Philip also introduced his book, "Inference Engineering," which argues that inference is a holistic challenge encompassing everything from CUDA to distributed systems, demanding tight latency and high uptime. Both Jay and Philip expressed excitement about engaging with developers, understanding their pain points in agentic workflows, and fostering the adoption of open models like Gemma 4 through community efforts and practical demos.

Key Moment
The Inference Bible

To me, what inference means is being able to actually deliver on the promise of AI applications.

- Jason Davenport, Google Cloud

More Articles