[Paper] Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

Posted on 2026-05-17 | In Tech, End2End

Abstract: This blog post offers an overview of the NVIDIA paper “Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail,” presenting a vision-language-action (VLA) model that integrates structured Chain of Causation (CoC) reasoning with trajectory planning to address long-tail safety-critical scenarios. It details the principles of causally grounded reasoning aligned with driving decisions; the modular architecture built on the Cosmos-Reason backbone, with efficient multi-camera tokenization and a diffusion-based action expert that produces feasible trajectories in real time; the hybrid CoC dataset construction via human-in-the-loop and auto-labeling; and the multi-stage training that combines supervised fine-tuning for reasoning elicitation with GRPO-based RL post-training to optimize reasoning quality, reasoning-action consistency, and trajectory performance. Experiments demonstrate notable gains in planning accuracy and collision reduction in both open-loop and closed-loop settings, highlighting a practical path toward interpretable and robust Level 4 autonomy.
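As a rough illustration of the GRPO step mentioned above, the sketch below computes group-relative advantages: each sampled rollout is scored against the mean and standard deviation of its own group instead of a learned value baseline. The reward values are made up for the example, the function name is hypothetical, and the use of the population standard deviation is an assumption; the paper's actual reward terms (reasoning quality, consistency, trajectory) are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per rollout sampled for the same
    prompt. The result is (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of the same driving scene.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group average receive a positive advantage and are reinforced; those below average are pushed down, and average rollouts contribute nothing.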

[Paper] UniAD: Planning-oriented Autonomous Driving

Posted on 2025-12-28 | In Tech, End2End

Abstract: This blog post offers an overview of the 2023 paper “UniAD: Planning-oriented Autonomous Driving” by Yihan Hu and colleagues, proposing a unified end-to-end framework that coordinates perception, prediction, and planning tasks around the ultimate goal of safe ego-vehicle planning. It details the query-based design connecting modules in BEV space: TrackFormer for joint detection and tracking, MapFormer for online mapping, MotionFormer for multi-agent multimodal trajectory prediction with agent-map and goal interactions, OccFormer for instance-aware occupancy forecasting, and a simple attention-based planner leveraging ego queries and occupancy for collision avoidance. Implementation uses two-stage training with shared matching, and experiments on nuScenes validate the planning-oriented philosophy through strong joint and modular results across all tasks.

[Paper] BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework

Posted on 2025-08-31 | In Tech, Perception

Abstract: This blog post offers an overview of the 2022 paper “BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework” by Tingting Liang and colleagues, introducing a disentangled fusion approach where camera and LiDAR streams independently extract features into a shared BEV space before dynamic fusion. It explains the principles of modality-independent processing to handle LiDAR malfunctions; the camera stream based on adapted Lift-Splat-Shoot with Dual-Swin backbone, view projection, and BEV encoder; the LiDAR stream using popular detectors like PointPillars or CenterPoint; the lightweight dynamic fusion module with channel-spatial fusion and adaptive selection; and experimental highlights on nuScenes, demonstrating superior performance under normal (69.2% mAP) and robust settings (+15.7% to 28.9% mAP) while maintaining strong generalization.

[Paper] BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Posted on 2025-08-03 | In Tech, Perception

Abstract: This blog post offers an overview of the 2022 paper “BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers” by Zhiqi Li and colleagues at Shanghai AI Laboratory, introducing a Transformer-based encoder for unified BEV features in autonomous driving without explicit depth reliance. It explains the principles of grid-shaped BEV queries with spatial cross-attention for multi-view aggregation and temporal self-attention for recursive history fusion; the CNN-plus-encoder structure with task heads for detection and segmentation; implementation details like deformable sampling and ego alignment; and experimental highlights on nuScenes, including state-of-the-art NDS (56.9%), improved velocity estimation, and occlusion handling.

[Paper] BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird-Eye-View

Posted on 2025-08-02 | In Tech, Perception

Abstract: This blog post offers an overview of the 2021 paper “BEVDet”, introducing a modular paradigm for unified 3D object detection in bird’s-eye-view (BEV) space from multi-camera inputs in autonomous driving. It details the four-stage pipeline: an image-view encoder for feature extraction; a view transformer (leveraging LSS) for implicit depth-based projection to BEV; a BEV encoder for spatial refinement; and a task head for predictions.

[Paper] LSS: Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D

Posted on 2025-01-19 | In Tech, Perception

Abstract: This blog post provides an accessible explanation of the 2020 paper “Lift, Splat, Shoot”, focusing on an end-to-end architecture for generating bird’s-eye-view (BEV) representations from multi-camera images in autonomous driving. It highlights the framework’s three core steps: “Lift,” which implicitly unprojects 2D images into 3D frustums using latent depth distributions; “Splat,” which fuses these into a rasterized BEV grid via pillar pooling; and “Shoot,” which enables interpretable motion planning by evaluating template trajectories on a learned cost map.
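To make the “Lift” step a little more concrete, here is a minimal sketch for a single pixel: a softmax over depth logits gives a distribution across discrete depth bins, and the pixel's context vector is scaled by each bin's probability to place a feature at every point along the camera ray. The function names, the three depth bins, and the two-channel context vector are illustrative assumptions; real models do this for every pixel of every camera with many more bins and channels.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(depth_logits, context):
    """Outer product of a depth distribution and a context vector.

    Returns a D x C nested list: one C-dimensional feature for each of
    the D depth bins along this pixel's ray.
    """
    alpha = softmax(depth_logits)
    return [[a * c for c in context] for a in alpha]


# Hypothetical pixel: 3 depth bins, 2 context channels.
frustum = lift_pixel([2.0, 0.5, -1.0], [1.0, -2.0])
```

Because the depth weights sum to one, summing the frustum features over depth recovers the original context vector; the network only decides how to spread it along the ray.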

[Paper] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Posted on 2024-12-21 | In Tech, Detection

Abstract: This blog post provides an overview of the 2021 paper “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows” by Ze Liu and colleagues at Microsoft Research Asia, proposing a versatile Transformer backbone for vision tasks by addressing scale variations and high-resolution challenges. It details the core principles of hierarchical representations and shifted-window attention for linear complexity; the multi-stage structure with patch merging and alternating W-MSA/SW-MSA blocks; implementation aspects like relative position bias and efficient cyclic shifts; and key experimental findings on superior performance in classification (ImageNet), detection (COCO), and segmentation (ADE20K), plus appendix insights on variants and ablation studies confirming the shifted-window efficacy.
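The shifted-window idea can be sketched on a toy token grid: regular windows tile the grid, while the shifted configuration first rolls the grid cyclically so that the same tiling groups tokens across the old window borders. The helper names and the 4×4 grid with 2×2 windows are assumptions for illustration, and the attention mask that the real model applies to windows containing wrapped-around tokens is omitted here.

```python
def cyclic_shift(grid, shift):
    """Roll a 2D grid up and left by `shift` positions, wrapping around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(i + shift) % h][(j + shift) % w] for j in range(w)]
            for i in range(h)]


def window_partition(grid, m):
    """Split a 2D grid into non-overlapping m x m windows (flattened)."""
    h, w = len(grid), len(grid[0])
    windows = []
    for top in range(0, h, m):
        for left in range(0, w, m):
            windows.append([grid[i][j]
                            for i in range(top, top + m)
                            for j in range(left, left + m)])
    return windows


# Toy 4x4 grid of token ids, window size 2, shift of half a window.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
regular = window_partition(grid, 2)
shifted = window_partition(cyclic_shift(grid, 1), 2)
```

In the regular layout, tokens 5 and 10 sit in different windows; after the shift they share one, which is how information crosses window boundaries while attention cost stays linear in the number of tokens.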

[Paper] ViT: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

Posted on 2024-11-30 | In Tech, Detection

Abstract: This blog post provides an overview of the 2021 ICLR paper “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” by Alexey Dosovitskiy and team at Google Brain, adapting the Transformer to vision by treating images as sequences of patches. It covers the principles of patch embedding and global attention; the encoder-only structure with variants like ViT-Base/Large/Huge; implementation aspects including pre-training on large datasets and fine-tuning; and key experimental insights on scaling effects, where large-scale pre-training outperforms CNN inductive biases on benchmarks like ImageNet, plus appendix details on hyperparameters and visualizations.
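The “image as a sequence of patches” idea comes down to simple arithmetic, sketched below. The 224×224 RGB input is an assumed (commonly used) resolution, not something stated in the abstract, and the function name is hypothetical.

```python
def patch_sequence_shape(height, width, channels, patch):
    """Return (number of patches, flattened patch dimension).

    Each non-overlapping patch x patch block becomes one token whose raw
    dimension is patch * patch * channels, before the linear projection.
    """
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    num_patches = (height // patch) * (width // patch)
    patch_dim = patch * patch * channels
    return num_patches, patch_dim


# Assumed input: 224x224 RGB image with 16x16 patches.
num_patches, patch_dim = patch_sequence_shape(224, 224, 3, 16)
sequence_length = num_patches + 1  # plus the prepended class token
```

For this assumed input the encoder sees 196 patch tokens of raw dimension 768, plus one class token.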


[Paper] DETR: End-to-End Object Detection with Transformers

Posted on 2024-06-02 | In Tech, Detection

Abstract: This blog post offers an overview of the 2020 paper “End-to-End Object Detection with Transformers” by Nicolas Carion and colleagues at Facebook AI, reimagining object detection as a direct set prediction task to eliminate hand-crafted components like anchors or NMS. It explains the core principles of bipartite matching loss for unique assignments and Transformer-based relational modeling; the CNN-backbone plus encoder-decoder structure with learned object queries; implementation details such as training schedules, positional encodings, and extensions to panoptic segmentation; and key experimental insights on COCO, including AP competitive with Faster R-CNN, stronger large-object performance, and ablation studies highlighting the matching loss’s importance.
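The bipartite matching behind the set-prediction loss can be illustrated on a tiny example. DETR solves the assignment with the Hungarian algorithm; the sketch below swaps in a brute-force search over permutations, which gives the same optimum but is only practical for a handful of boxes. The cost values and function name are made up for illustration.

```python
from itertools import permutations


def min_cost_matching(cost):
    """Find the one-to-one assignment with the lowest total cost.

    cost[p][g] is the matching cost between prediction p and ground-truth
    object g, with at least as many predictions as ground truths.
    Returns ({gt_index: pred_index}, [unmatched prediction indices]).
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(perm))
        if total < best_total:
            best_perm, best_total = perm, total
    matches = {g: p for g, p in enumerate(best_perm)}
    unmatched = [p for p in range(n_pred) if p not in best_perm]
    return matches, unmatched


# Hypothetical costs: 3 predictions (rows) vs 2 ground-truth boxes (columns).
cost = [[0.9, 0.2],
        [0.1, 0.8],
        [0.5, 0.6]]
matches, unmatched = min_cost_matching(cost)
```

Each ground-truth box gets exactly one prediction, and the leftover prediction is trained toward the “no object” class, which is why no NMS is needed to remove duplicates.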

[Paper] PointPillars: Fast Encoders for Object Detection from Point Clouds

Posted on 2023-09-30 | In Tech, Perception

Abstract: This blog post offers an overview of the 2019 paper “PointPillars: Fast Encoders for Object Detection from Point Clouds” by Alex H. Lang and colleagues, presenting a pillar-based encoder that learns features from vertical columns of LiDAR points using simplified PointNets, enabling efficient 2D convolutional detection. It covers the principles of converting sparse point clouds into a dense pseudo-image via pillar feature encoding (with point decorations); the 2D backbone with top-down and upsampling paths; the SSD detection head for oriented 3D boxes; key implementation details including data augmentation and loss design; and experimental results on KITTI, achieving state-of-the-art BEV and 3D performance at 62 Hz (up to 105 Hz), significantly outperforming prior methods in both speed and accuracy.
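A minimal sketch of the pillar grouping and point decoration step: points are binned into vertical columns on an x-y grid, and each point is augmented with its offset from the pillar's mean point and from the pillar's x-y center, giving nine values per point. The cell size, sample points, and function name are assumptions for the example; the sampling or padding to a fixed number of points per pillar and the simplified PointNet that follows are left out.

```python
from collections import defaultdict


def pillarize(points, cell=1.0):
    """Group (x, y, z, reflectance) points into pillars and decorate them.

    Returns {(ix, iy): [9-tuple, ...]} where each tuple is
    (x, y, z, r, x - mean_x, y - mean_y, z - mean_z, x - center_x, y - center_y).
    """
    pillars = defaultdict(list)
    for x, y, z, r in points:
        pillars[(int(x // cell), int(y // cell))].append((x, y, z, r))

    decorated = {}
    for (ix, iy), pts in pillars.items():
        n = len(pts)
        mx = sum(p[0] for p in pts) / n
        my = sum(p[1] for p in pts) / n
        mz = sum(p[2] for p in pts) / n
        cx, cy = (ix + 0.5) * cell, (iy + 0.5) * cell  # pillar x-y center
        decorated[(ix, iy)] = [
            (x, y, z, r, x - mx, y - my, z - mz, x - cx, y - cy)
            for x, y, z, r in pts
        ]
    return decorated


# Hypothetical LiDAR points: (x, y, z, reflectance).
points = [(0.2, 0.4, 1.0, 0.5), (0.6, 0.8, 3.0, 0.1), (2.5, 0.5, 0.0, 0.9)]
decorated = pillarize(points, cell=1.0)
```

Only non-empty pillars are stored, which is what keeps the encoder cheap on sparse point clouds before the features are scattered back into a dense pseudo-image.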


Chiangbin Li

Dreams don't work unless you DO
