- Google Cloud to host NVIDIA's Vera Rubin and Blackwell GPUs.
- Baseten provides day-zero support for Google's Gemma 4 model.
- New optimizations like TensorRT LLM and NVFP4 boost LLM performance.
At a recent conference, speakers from NVIDIA and Baseten described their partnership with Google Cloud on AI inference. The collaboration aims to improve speed, reliability, and scalability for AI applications by pairing next-generation hardware with software optimizations.
The discussion opened with NVIDIA's latest hardware and what it means for AI inference and training on Google Cloud. Jay Raj from NVIDIA announced that Google Cloud will be among the first providers to offer Vera Rubin, NVIDIA's next-generation hardware, later this year. In addition, Blackwell GPUs, specifically the RTX PRO 6000 with 96GB of VRAM, provide enough memory to deploy multiple models on a single GPU, which improves hardware utilization. Philip Kiely of Baseten emphasized that inference is what delivers on the promise of AI applications: low-latency, high-reliability user experiences at hypergrowth scale.
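To make the multi-model point concrete, here is a rough back-of-the-envelope sketch in Python. It estimates weight memory only, from parameter count and bits per parameter, and checks a set of models against a 96GB budget. The 30% reserve for KV cache and runtime overhead, the helper names, and the model mixes are illustrative assumptions, not figures from the talk.

```python
# Weights-only VRAM estimate. It ignores KV cache, activations, and runtime
# overhead, which is why a share of VRAM is held back (an assumed 30%).
# 4-bit formats also store scaling factors, so real usage is somewhat higher.


def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight size in GB (10^9 bytes)."""
    return params_billion * bits_per_param / 8


def fits_on_gpu(models, vram_gb=96.0, reserve_fraction=0.3):
    """models: list of (params_billion, bits_per_param) pairs.

    Returns (total_weight_gb, fits) where fits is True when the weights
    stay within the VRAM left after the reserve.
    """
    total_gb = sum(weight_gb(p, bits) for p, bits in models)
    budget_gb = vram_gb * (1.0 - reserve_fraction)
    return total_gb, total_gb <= budget_gb


if __name__ == "__main__":
    # Hypothetical mixes, using the 2B and 30B sizes as examples.
    mixes = {
        "30B + 2B at 16-bit": [(30, 16), (2, 16)],
        "two 30B at 16-bit": [(30, 16), (30, 16)],
        "four 30B at 4-bit": [(30, 4)] * 4,
    }
    for name, models in mixes.items():
        total, ok = fits_on_gpu(models)
        print(f"{name}: {total:.1f} GB of weights, fits={ok}")
```

Under these assumptions, lower-precision weights are what make room for several large models on one card; the actual headroom depends on context length, batch size, and the serving runtime.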
Baseten described its approach to inference at scale, where it processes billions of inferences daily. The company highlighted its close collaboration with NVIDIA on both hardware and software, including early adoption of NVIDIA Dynamo and Blackwell GPUs. Baseten's multi-region deployment capability on Google Cloud unifies compute resources globally to minimize latency and maximize accessibility. Baseten also provided day-zero support for Google's Gemma 4 model, praising its multimodal support for image inputs and its wide range of sizes (from 2 billion to 30 billion parameters), which makes it well suited to fine-tuning for task-specific enterprise use cases such as KYC and document extraction.
Optimizing LLM inference was a core focus. NVIDIA recommended TensorRT LLM, an open-source SDK designed to reach peak performance on NVIDIA hardware with a few lines of code. Combined with NVIDIA's 4-bit NVFP4 precision format and Blackwell GPUs, this stack was presented as delivering the highest inference speed. In a live demo, Philip showed Baseten's platform deploying Gemma 4 from Hugging Face to production on an L4 GPU, complete with an OpenAI-compatible API endpoint. The demo also showed the platform's auto-scaling, which keeps response times steady and helps meet SLAs as traffic fluctuates. Baseten also offers cost-efficient, token-based APIs for models like Nemotron, and advised selecting GPUs to optimize Total Cost of Ownership (TCO) rather than just upfront cost.
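The TCO advice can be illustrated with a small calculation: what matters is cost per token served, not the hourly price alone. The sketch below is a simplified model that assumes a single GPU with steady throughput; the prices, throughput figures, and function name are hypothetical and were not quoted in the talk.

```python
def cost_per_million_tokens(hourly_price_usd: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Serving cost in USD per one million generated tokens.

    utilization is the fraction of each hour the GPU spends doing useful work.
    """
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_price_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical GPUs: (hourly price in USD, sustained tokens per second).
    candidates = {
        "cheaper GPU": (1.00, 500),
        "pricier GPU": (4.00, 4000),
    }
    for name, (price, tps) in candidates.items():
        cost = cost_per_million_tokens(price, tps)
        print(f"{name}: ${cost:.4f} per million tokens")
```

With these made-up numbers, the GPU that costs four times as much per hour is about half the price per token because it serves eight times the throughput. A fuller TCO comparison would also account for utilization under real traffic, engineering effort, and latency targets.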
The conversation then turned to the role of Google Kubernetes Engine (GKE) in powering complex, multi-model agentic workflows. Low-latency communication between models on GKE matters for compound AI systems: it saves dozens of milliseconds per turn, and those savings add up across the many turns of an agentic workflow. Philip also introduced his book, "Inference Engineering," which argues that inference is a holistic challenge spanning everything from CUDA to distributed systems, with tight latency and high uptime requirements. Both Jay and Philip said they look forward to engaging with developers, understanding their pain points in agentic workflows, and encouraging adoption of open models like Gemma 4 through community efforts and practical demos.
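A short sketch shows why per-turn savings matter in a multi-turn workflow. It assumes a strictly sequential workflow in which every turn pays the model's compute time plus one network hop; the turn count and millisecond figures are hypothetical, chosen only to illustrate the arithmetic.

```python
def total_latency_ms(turns: int, model_ms: float, hop_ms: float) -> float:
    """End-to-end latency when each turn pays model time plus one network hop."""
    return turns * (model_ms + hop_ms)


if __name__ == "__main__":
    turns = 20        # hypothetical number of model calls in one agent run
    model_ms = 150    # hypothetical compute time per call
    slow_hop_ms = 40  # hypothetical hop between distant services
    fast_hop_ms = 5   # hypothetical hop between co-located services

    slow = total_latency_ms(turns, model_ms, slow_hop_ms)
    fast = total_latency_ms(turns, model_ms, fast_hop_ms)
    print(f"distant: {slow} ms, co-located: {fast} ms, saved: {slow - fast} ms")
```

Because the hop cost is paid on every turn, the saving scales linearly with the number of turns.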
“To me, what inference means is being able to actually deliver on the promise of AI applications.”
- Jason Davenport, Google Cloud




