NVIDIA AI Open Day Tech Talk Videos Now Available


LLM Training / Inference / CUDA Optimization Special Session
Download technical materials here:
https://scrm.nvidia.cn/mF/cms/none/FuceFYmFh5SGkhdaTzeC7N/e8WT66RnGUn6y5SwaaLx9F1
CUDA Optimization Core: Throughput • Latency

Session Overview
This session focuses on CUDA optimization techniques for GPUs: maximizing compute performance and memory bandwidth utilization while minimizing latency. By exploring the co-evolution of GPU hardware and the CUDA programming model alongside core optimization principles, it demonstrates the synergy between hardware architecture and algorithm design. Practical examples built on high-performance frameworks like CUTLASS will help developers accelerate AI training and inference in key scenarios (e.g., DeepSeek V3/R1 LLM optimization) and unlock the GPU's full potential.
- GPU Computing & Programming Model Evolution: Balancing throughput and latency in asynchronous computing
- GPU Memory System Evolution: Techniques for maximizing bandwidth utilization and latency hiding
- CUDA Abstraction Evolution: From C++ templates to Python CUTLASS development
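As a toy illustration of the throughput-versus-bandwidth balance the session discusses (this sketch is not material from the talk, and the peak numbers are illustrative placeholders rather than real GPU specifications), a simple roofline estimate shows whether a kernel is compute-bound or memory-bound:

```python
# Minimal roofline-model sketch: given a kernel's FLOPs and bytes moved,
# estimate whether it is limited by compute throughput or memory bandwidth.
# Peak numbers are illustrative placeholders, not real GPU specs.

def roofline_bound(flops, bytes_moved, peak_tflops=100.0, peak_bw_tbs=2.0):
    """Return ('compute' or 'memory', attainable TFLOP/s)."""
    intensity = flops / bytes_moved            # arithmetic intensity, FLOPs/byte
    ridge = peak_tflops / peak_bw_tbs          # intensity where the two limits meet
    attainable = min(peak_tflops, intensity * peak_bw_tbs)
    bound = "compute" if intensity >= ridge else "memory"
    return bound, attainable

# A large GEMM (high arithmetic intensity) vs. a vector add (low intensity):
print(roofline_bound(flops=2e12, bytes_moved=1e9))   # GEMM-like: compute-bound
print(roofline_bound(flops=2e6, bytes_moved=3e6))    # vector-add-like: memory-bound
```

Kernels on the memory-bound side of the ridge point benefit from the bandwidth-maximization and latency-hiding techniques covered in the session, while compute-bound kernels benefit from Tensor Core utilization via frameworks like CUTLASS.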
Watch Full Video:
https://space.bilibili.com/1320140761/lists/5626365?type=season
LLM Training Special Session
Large-scale MoE models like DeepSeek-V3 are driving a new wave in AI, presenting unprecedented challenges to existing frameworks. This talk dives into performance breakthroughs for fine-grained MoE models, covering innovative optimizations in Megatron-Core, including:
- Memory-efficient management
- Compute-communication overlap
- Low-precision quantization
- Parallel strategy optimization
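To make "fine-grained MoE" concrete, here is a minimal top-k expert-routing sketch in plain Python. This is an illustration only, not Megatron-Core's implementation; the expert count and k value are arbitrary, and real frameworks perform this with batched GPU kernels plus all-to-all communication:

```python
# Toy top-k router for a fine-grained MoE layer: each token is scored
# against every expert and dispatched to its k highest-scoring experts,
# with gate weights normalized over only the selected experts.
import math

def route_tokens(scores, k=2):
    """scores: per-token expert logits -> per-token list of (expert, weight)."""
    routed = []
    for logits in scores:
        topk = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
        # Softmax over only the selected experts' logits (a common MoE choice).
        exps = [math.exp(logits[e]) for e in topk]
        total = sum(exps)
        routed.append([(e, w / total) for e, w in zip(topk, exps)])
    return routed

picks = route_tokens([[0.1, 2.0, 0.5, 1.0]], k=2)
# The single token is dispatched to experts 1 and 3; its two gate
# weights sum to 1.
```

The memory-management and compute-communication-overlap optimizations listed above target exactly this dispatch step, where token-to-expert traffic dominates at scale.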

Key Topics:
- Megatron Core MoE in 2025: Architecture, features, performance optimizations, and best practices for DeepSeek-V3
- FP8 Mixed-Precision Training: Methodology and performance analysis
- FSDP Architecture Design in Megatron-Core
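As background for the FP8 talk, a generic per-tensor scaling sketch (not NVIDIA's Transformer Engine recipe) shows the core idea: pick a scale from the tensor's absolute maximum so values fit the narrow FP8 range before casting:

```python
# Per-tensor scaling sketch for FP8 (E4M3): choose a scale so the tensor's
# largest magnitude maps to FP8's max finite value, clamp into range, then
# round-trip back to the original scale.
FP8_E4M3_MAX = 448.0  # maximum finite value representable in E4M3

def fp8_scale_roundtrip(xs):
    amax = max(abs(x) for x in xs) or 1.0
    scale = FP8_E4M3_MAX / amax          # amplify values into FP8's range
    scaled = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x * scale)) for x in xs]
    # A real cast would also round each value to the E4M3 grid; omitted here.
    return [s / scale for s in scaled], scale

vals, scale = fp8_scale_roundtrip([0.5, -2.0, 1.25])
```

In practice the amax is tracked across iterations (e.g., with a history window) rather than recomputed per tensor per step, which is part of what the methodology portion of the talk covers.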
Watch Full Video:
https://space.bilibili.com/1320140761/lists/5626365?type=season
LLM Inference Special Session
As LLMs demonstrate powerful capabilities across applications, deploying them efficiently and cost-effectively has become a key industry focus. This session explores the latest advances in LLM inference, including:
- TensorRT-LLM’s development roadmap
- PyTorch workflow best practices
- Collaborative optimizations with DeepSeek and the open-source community
Key Topics:
- TensorRT-LLM Product Strategy Update
- TensorRT-LLM × PyTorch: A new paradigm for high-efficiency LLM inference
- Pushing DeepSeek’s Limits with TensorRT-LLM: Joint optimization with Tencent
Watch Full Video:
https://space.bilibili.com/1320140761/lists/5626365?type=season
Trending Headlines
- CGGE Represents China’s Blender Community at BCON Austin 2026, the Official Blender Conference in the United States
- 2026 Global Mobile App Market Report: Upgrading the Existing User Base, AI-Driven Growth, and a New Cycle of High-Quality Industry Growth
- March 2026 Global Mobile Game Revenue Rankings: Chinese Publishers Lead the World as Market Grows Steadily
- PC Games to Mobile: To Port or Not to Port?
- Gamesforum Releases “2026 Top Mobile Game Challenges Report”: AI Reshapes Growth Logic, Refined Operations Become the Key to Breakthrough
- IDC 2025H2 Cloud Gaming Report Released: Tencent Cloud Leads in Usage Volume, Secures Dual Leadership Position in China and Globally
- Marvel Cuts 8% of Staff: Disney’s Global Cost Restructuring Reshapes Hollywood and the Content Industry
- March Chinese Animation Chronicle: “Xian Ni” Dominates by a Landslide, Cultivation Donghua Take Center Stage, Market Enters a New Era of “Multiple Powers Rising”