NVIDIA AI Open Day Tech Talk Videos Now Available


LLM Training / Inference / CUDA Optimization Special Session
Download technical materials here:
https://scrm.nvidia.cn/mF/cms/none/FuceFYmFh5SGkhdaTzeC7N/e8WT66RnGUn6y5SwaaLx9F1
CUDA Optimization Core: Throughput • Latency

Session Overview
This session covers CUDA optimization techniques for GPUs: maximizing compute throughput and memory bandwidth utilization while minimizing latency. It traces how GPU hardware and the CUDA programming model have evolved together, along with the optimization principles that follow, to show how hardware architecture and algorithm design reinforce each other. Practical examples using high-performance libraries such as CUTLASS help developers accelerate AI training and inference in key scenarios (e.g., optimizing the DeepSeek V3/R1 LLMs) and get the most out of the GPU.
- GPU Computing & Programming Model Evolution: Balancing throughput and latency in asynchronous computing
- GPU Memory System Evolution: Techniques for maximizing bandwidth utilization and latency hiding
- CUDA Abstraction Evolution: From C++ templates to Python CUTLASS development
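As a rough illustration of the throughput and latency trade-off named in the topics above, Little's law (concurrency = throughput × latency) gives the amount of data that must be in flight to keep a memory system busy. The sketch below is not taken from the talk; the function names and the bandwidth, latency, and load-size figures are hypothetical placeholders, not measurements of any particular GPU.

```python
import math


def bytes_in_flight(bandwidth_gb_per_s: float, latency_ns: float) -> float:
    """Little's law: data in flight = throughput x latency.

    GB/s times ns is (1e9 B/s) * (1e-9 s), so the product is already in bytes.
    """
    return bandwidth_gb_per_s * latency_ns


def loads_in_flight(bandwidth_gb_per_s: float, latency_ns: float,
                    bytes_per_load: int) -> int:
    """Number of outstanding loads of a given size needed to cover the latency."""
    return math.ceil(bytes_in_flight(bandwidth_gb_per_s, latency_ns) / bytes_per_load)


if __name__ == "__main__":
    # Hypothetical figures chosen only for illustration.
    bw, lat, load = 2000.0, 500.0, 16
    print(f"bytes in flight: {bytes_in_flight(bw, lat):.0f}")
    print(f"{load}-byte loads in flight: {loads_in_flight(bw, lat, load)}")
```

The larger the bandwidth-latency product, the more independent memory requests a kernel has to keep outstanding, which is why asynchronous copies and deeper software pipelines matter for latency hiding.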
Watch Full Video:
https://space.bilibili.com/1320140761/lists/5626365?type=season
LLM Training Special Session
Large-scale MoE models such as DeepSeek-V3 are driving a new wave in AI and pose unprecedented challenges for existing training frameworks. This talk examines performance breakthroughs for fine-grained MoE models, covering optimizations in Megatron-Core, including:
- Efficient memory management
- Compute-communication overlap
- Low-precision quantization
- Parallelism strategy optimization

Key Topics:
- Megatron-Core MoE in 2025: Architecture, features, performance optimizations, and best practices for DeepSeek-V3
- FP8 Mixed-Precision Training: Methodology and performance analysis
- FSDP Architecture Design in Megatron-Core
Watch Full Video:
https://space.bilibili.com/1320140761/lists/5626365?type=season
LLM Inference Special Session
As LLMs demonstrate powerful capabilities across applications, deploying them efficiently and cost-effectively has become a key industry focus. This session explores the latest advances in LLM inference, including:
- TensorRT-LLM’s development roadmap
- PyTorch workflow best practices
- Collaborative optimizations with DeepSeek and the open-source community
Key Topics:
- TensorRT-LLM Product Strategy Update
- TensorRT-LLM × PyTorch: A new paradigm for high-efficiency LLM inference
- Pushing DeepSeek’s Limits with TensorRT-LLM: Joint optimization with Tencent
Watch Full Video:
https://space.bilibili.com/1320140761/lists/5626365?type=season