Toqi Tahamid Sarker

My First Conference Experience at CVPR 2024

CVPR 2024 in Seattle was my first conference experience. On the first day, I attended four workshops: Autonomous Driving, Visual Localization and Mapping, Neural Architecture Search, and Computer Vision in the Wild. Here are the highlights from each.

Highlight from Autonomous Driving Workshop

I attended the Autonomous Driving workshop where NVIDIA researchers presented Hydra-MDP, an innovative framework for end-to-end multimodal planning. It won first place and the innovation award at the E2E Driving at Scale Challenge at CVPR 2024. Hydra-MDP combines multiple sensory inputs like LiDAR and camera data to build a comprehensive understanding of the driving environment and make informed decisions in real-time.

Key points of Hydra-MDP

  • Uses multi-teacher knowledge distillation by combining human and rule-based planners
  • Employs multimodal and multi-target planning for diverse driving conditions
  • Integrates LiDAR and camera inputs for enhanced environmental perception
  • Leverages extensive simulations and model ensembling to boost performance
  • Outperformed state-of-the-art planners on the nuPlan benchmark
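The multi-teacher idea above can be sketched in a few lines. This is a toy illustration of blending scores from several teacher planners into distillation targets; the teacher names, scores, and weights are my own invention, not Hydra-MDP's actual formulation.

```python
# Hypothetical sketch of multi-teacher distillation for planning: each
# candidate trajectory gets a target score blended from several teachers.

def distillation_targets(candidates, teachers, weights):
    """Blend per-candidate scores from several teacher planners.

    candidates: list of trajectory ids
    teachers:   dict teacher_name -> {candidate_id: score in [0, 1]}
    weights:    dict teacher_name -> blend weight (sums to 1)
    """
    targets = {}
    for c in candidates:
        targets[c] = sum(weights[t] * teachers[t][c] for t in teachers)
    return targets

# A rule-based teacher rewards collision-free candidates; a human-like
# teacher rewards similarity to a logged demonstration.
teachers = {
    "rule_based": {"traj_a": 1.0, "traj_b": 0.2},
    "human_like": {"traj_a": 0.6, "traj_b": 0.9},
}
weights = {"rule_based": 0.5, "human_like": 0.5}

targets = distillation_targets(["traj_a", "traj_b"], teachers, weights)
# The student network would then regress these blended targets.
```

The appeal of the scheme is that the student absorbs both safety signals (rule-based) and naturalness signals (human) in one training pass.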

I also discovered Video-LDM, which applies Latent Diffusion Models to high-resolution video generation. The team extended image-based LDMs with temporal layers and achieved state-of-the-art performance in driving video simulation. This approach also enables efficient text-to-video generation, including personalized content creation.

NVIDIA additionally showcased fVDB, a GPU-optimized framework for deep learning on large-scale 3D data. It offers a complete set of differentiable primitives for common 3D learning tasks such as convolution, pooling, attention, ray tracing, and meshing, and it integrates fully with PyTorch.

Highlight from Visual Localization and Mapping

This workshop introduced me to Meta's Project Aria, an initiative focused on developing augmented reality (AR) technologies through diverse real-world data collection. The team built AR glasses that capture video, audio, location, and eye-tracking information simultaneously. They use this data to create comprehensive datasets that advance AR research. At the workshop they released three datasets: Ego-Exo4D, Nymeria, and HOT3D.

I picked up the term Egocentric AI here for the first time. It refers to AI systems that process and interpret data from a first-person perspective, typically through wearable devices or cameras that see the world from the user's point of view. The team also released SceneScript, a novel method that reconstructs environments and represents physical space layouts using language. To measure progress on Egocentric Foundation Models, they proposed the EFM3D benchmark with two tasks: 3D bounding box detection and surface regression.
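To make the layout-as-language idea concrete, here is a toy parser that turns a structured command string back into geometry. The command names and fields are entirely my own illustration, not Meta's actual SceneScript vocabulary.

```python
# Illustrative sketch of representing a room layout as text commands and
# decoding them back into structured geometry.

def parse_scene(script):
    """Parse lines like 'wall, x0=0, y0=0, x1=4, y1=0, height=2.5'."""
    entities = []
    for line in script.strip().splitlines():
        kind, *fields = [p.strip() for p in line.split(",")]
        params = {}
        for f in fields:
            key, value = f.split("=")
            params[key.strip()] = float(value)
        entities.append({"type": kind, **params})
    return entities

script = """
wall, x0=0, y0=0, x1=4, y1=0, height=2.5
door, x0=1, y0=0, x1=2, y1=0, height=2.0
"""
scene = parse_scene(script)
# scene now holds one wall and one door, each with numeric geometry.
```

The point is that a language model can emit such token sequences autoregressively, and the layout is then recovered losslessly from the text.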

Highlight from Neural Architecture Search

Neural Architecture Search (NAS) aims to automate the design of neural network architectures. Algorithms such as the original reinforcement-learning-based NAS, ENAS, and DARTS have shown real promise in discovering effective architectures. However, early NAS methods demanded enormous resources, requiring up to 800 GPUs and 28 days to complete a single search. This drove researchers toward more efficient approaches. The search space typically focuses on CNN blocks using a micro search space approach, and researchers have introduced restricted search spaces to make the process more manageable. Data augmentation policies like Cutout and AutoAugment enhance the training process.

Several strategies emerged to tackle this inefficiency.

Weight sharing. Multiple architectures share parameters, removing the need to train each candidate from scratch.

One-Shot NAS. A single large model trains on all possible architectures, then the best subnetwork gets selected.

Zero-Shot NAS. Architectures get evaluated without any training at all, dramatically speeding up the search.
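The weight-sharing bookkeeping behind one-shot NAS can be sketched very simply. In this toy version (everything here is illustrative: the ops, the shared-weight table, and the proxy score standing in for validation accuracy), every candidate is a tuple of per-layer choices, and all candidates read from one shared parameter table instead of training from scratch.

```python
# Minimal sketch of weight sharing in one-shot NAS.
import itertools
import random

random.seed(0)

N_LAYERS = 3
OPS = ["conv3x3", "conv5x5", "skip"]

# One shared parameter per (layer, op) -- "trained" once, reused by every
# candidate subnetwork that selects that op at that layer.
shared = {(l, op): random.random() for l in range(N_LAYERS) for op in OPS}

def proxy_score(arch):
    """Stand-in for validation accuracy: sum of the shared weights it uses."""
    return sum(shared[(l, op)] for l, op in enumerate(arch))

# Enumerate the whole (tiny) search space and pick the best subnetwork,
# mirroring the selection step after supernet training.
search_space = list(itertools.product(OPS, repeat=N_LAYERS))
best = max(search_space, key=proxy_score)
```

In a real system the shared table would hold actual layer weights and the score would come from evaluating the subnetwork on held-out data, but the cost structure is the same: one training run amortized over the whole search space.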

The field developed benchmarks like NAS-Bench-201 to standardize comparisons between methods. Balancing search time against architecture accuracy remains a key challenge.

Researchers use various metrics to evaluate and guide the search process, including:

  • FLOPs (floating point operations) to measure computational complexity
  • Gradient-based metrics like gradient norm (e.g., SNIP) and Jacobian covariance
  • Gradient-free calculations
  • Other metrics like Logdet, NN-Mass, and NN-Degree
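FLOPs is the most mechanical of these metrics, and it is worth seeing how cheap it is to compute. Below is a standard back-of-the-envelope counter for a 2D convolution (the formula is the usual multiply-accumulate count; the specific layer sizes are just an example):

```python
def conv2d_flops(h, w, c_in, c_out, k, stride=1):
    """FLOPs for a k x k conv with 'same' padding: two ops (multiply + add)
    per multiply-accumulate, times the number of output elements."""
    out_h, out_w = h // stride, w // stride
    return 2 * out_h * out_w * c_out * (k * k * c_in)

# e.g. a 3x3 conv, 64 -> 64 channels, on a 32x32 feature map
flops = conv2d_flops(32, 32, 64, 64, 3)
```

Because it needs no training or even a forward pass, FLOPs is often used as a hard constraint on the search space rather than as the objective itself.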

The workshop also emphasized the importance of seeds and reproducibility in NAS experiments.
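That reproducibility point is easy to act on. A minimal seed-setting sketch with the standard library is below; in a real NAS experiment you would also seed NumPy and PyTorch (`torch.manual_seed`, plus the cuDNN determinism flags).

```python
import os
import random

def set_seed(seed: int) -> None:
    """Seed Python's RNG and the hash seed for repeatable runs."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(42)
a = [random.random() for _ in range(3)]
set_seed(42)
b = [random.random() for _ in range(3)]
# Same seed, same draws -- a prerequisite for comparable NAS results.
```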

Highlight from Computer Vision in the Wild

The Computer Vision in the Wild workshop showcased several advanced multimodal large language models, demonstrating significant progress in combining vision and language understanding. The session highlighted how generalist foundation models have evolved to handle diverse tasks and inputs.

The team introduced LLaVA 1.5, an improved version of the original LLaVA that serves as a large language and vision assistant with enhanced image understanding capabilities. They also presented ViP-LLaVA, an extension of LLaVA that can process complex image prompts including masks, bounding boxes, and contours.
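The key trick in ViP-LLaVA is that the visual prompt is drawn directly into the pixels rather than passed as coordinates. A toy version of that "burn the mark into the image" step, with the image as a plain list-of-lists grid (my own simplification, not the paper's pipeline):

```python
# Toy illustration of a visual prompt: draw a box outline directly onto a
# tiny grayscale image so the model sees the mark as pixels.

def draw_box(img, top, left, bottom, right, value=255):
    for x in range(left, right + 1):   # horizontal edges
        img[top][x] = value
        img[bottom][x] = value
    for y in range(top, bottom + 1):   # vertical edges
        img[y][left] = value
        img[y][right] = value
    return img

img = [[0] * 8 for _ in range(8)]
img = draw_box(img, 2, 2, 5, 6)
# The box outline is now part of the image; the interior is untouched.
```

Because the mark lives in pixel space, the same mechanism handles arbitrary prompts like arrows, scribbles, or contours without changing the model interface.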

Figure: Comparison of spatial reasoning capabilities across GPT-4V, LLaVA-1.5-13B, ViP-LLaVA-7B, and ViP-LLaVA-13B.

Beyond these models, they introduced Yo'LLaVA as a personalized Large Multimodal Model. This model tackles the novel task of personalizing LMMs so they can hold conversations about a specific subject.

Figure: Yo'LLaVA enables personalized multimodal conversations by learning user-specific concepts from a few training images.

They also discussed Matryoshka multimodal models, which address limitations in spatial understanding that plagued earlier CLIP models. The session covered influential papers like ODISE and BLIP-2 and illustrated the rapid progress from BLIP-2 to LLaVA. They pointed to the Open VLM leaderboard as their benchmarking tool for these models.
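The Matryoshka idea, as I understood it, is to represent one image with a nested family of visual-token counts so the model can trade detail for cost. A toy sketch of that nesting via average pooling over a token grid (the grid values and sizes here are illustrative):

```python
# Sketch of nested visual-token granularities: average-pool one token grid
# at several factors to get coarser, cheaper representations of the image.

def pool_tokens(grid, factor):
    """Average-pool a square token grid (list of lists of floats)."""
    n = len(grid)
    m = n // factor
    pooled = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            block = [grid[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            pooled[i][j] = sum(block) / len(block)
    return pooled

tokens = [[float(i * 4 + j) for j in range(4)] for i in range(4)]  # 16 tokens
coarse = pool_tokens(tokens, 2)    # 4 tokens
coarsest = pool_tokens(tokens, 4)  # 1 token
```

At inference time one picks the granularity that fits the latency budget, since every coarser grid is derived from the same underlying tokens.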

Figure: The rapid evolution of multimodal LLMs from 2022 to early 2024, showing publicly available and unavailable models.

Main Conference

At the main conference, I found these papers particularly interesting, though many more remain on my reading list.

  1. Specularity Factorization for Low-Light Enhancement
  2. Point Transformer V3: Simpler, Faster, Stronger
  3. Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences
  4. Seeing the World through Your Eyes
  5. BioCLIP: A Vision Foundation Model for the Tree of Life
  6. MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation
You can find all papers from the main conference at this link.

Posted 20th June 2024
