Toqi Tahamid Sarker

My First Conference Experience at CVPR 2024

CVPR 2024 in Seattle was my first conference experience. On the first day, I attended four workshops: Autonomous Driving, Visual Localization and Mapping, Neural Architecture Search, and Computer Vision in the Wild. Here are the highlights from each.

Highlight from Autonomous Driving Workshop

I attended the Autonomous Driving workshop where NVIDIA researchers presented Hydra-MDP, an innovative framework for end-to-end multimodal planning. It won first place and the innovation award at the E2E Driving at Scale Challenge at CVPR 2024. Hydra-MDP combines multiple sensory inputs like LiDAR and camera data to build a comprehensive understanding of the driving environment and make informed decisions in real-time.

Key points of Hydra-MDP

  • Uses multi-teacher knowledge distillation by combining human and rule-based planners
  • Employs multimodal and multi-target planning for diverse driving conditions
  • Integrates LiDAR and camera inputs for enhanced environmental perception
  • Leverages extensive simulations and model ensembling to boost performance
  • Outperformed state-of-the-art planners on the nuPlan benchmark
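The multi-teacher idea above can be sketched in a few lines. This is a toy illustration of blending scores from several teacher planners into distillation targets; the teacher names, scores, and weights are my own invention, not Hydra-MDP's actual formulation.

```python
# Hypothetical sketch of multi-teacher distillation for planning: each
# candidate trajectory gets a target score blended from several teachers.

def distillation_targets(candidates, teachers, weights):
    """Blend per-candidate scores from several teacher planners.

    candidates: list of trajectory ids
    teachers:   dict teacher_name -> {candidate_id: score in [0, 1]}
    weights:    dict teacher_name -> blend weight (sums to 1)
    """
    targets = {}
    for c in candidates:
        targets[c] = sum(weights[t] * teachers[t][c] for t in teachers)
    return targets

# A rule-based teacher rewards collision-free candidates; a human-like
# teacher rewards similarity to a logged demonstration.
teachers = {
    "rule_based": {"traj_a": 1.0, "traj_b": 0.2},
    "human_like": {"traj_a": 0.6, "traj_b": 0.9},
}
weights = {"rule_based": 0.5, "human_like": 0.5}

targets = distillation_targets(["traj_a", "traj_b"], teachers, weights)
# The student network would then regress these blended targets.
```

The appeal of the scheme is that the student absorbs both safety signals (rule-based) and naturalness signals (human) in one training pass.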

I also discovered Video-LDM, which applies Latent Diffusion Models to high-resolution video generation. The team extended image-based LDMs with temporal layers and achieved state-of-the-art performance in driving video simulation. This approach also enables efficient text-to-video generation, including personalized content creation.

NVIDIA additionally showcased fVDB, a GPU-optimized framework for deep learning on large-scale 3D data. It offers a complete set of differentiable primitives for common 3D learning tasks such as convolution, pooling, attention, ray tracing, and meshing, and it integrates fully with PyTorch.

Highlight from Visual Localization and Mapping

This workshop introduced me to Meta's Project Aria, an initiative focused on developing augmented reality (AR) technologies through diverse real-world data collection. The team built AR glasses that capture video, audio, location, and eye-tracking information simultaneously. They use this data to create comprehensive datasets that advance AR research. At the workshop they released three datasets: Ego-Exo4D, Nymeria, and HOT3D.

I picked up the term Egocentric AI here for the first time. It refers to AI systems that process and interpret data from a first-person perspective, typically through wearable devices or cameras that see the world from the user's point of view. The team also released SceneScript, a novel method that reconstructs environments and represents physical space layouts using language. To measure progress on Egocentric Foundation Models, they proposed the EFM3D benchmark with two tasks: 3D bounding box detection and surface regression.
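To make the layout-as-language idea concrete, here is a toy parser that turns a structured command string back into geometry. The command names and fields are entirely my own illustration, not Meta's actual SceneScript vocabulary.

```python
# Illustrative sketch of representing a room layout as text commands and
# decoding them back into structured geometry.

def parse_scene(script):
    """Parse lines like 'wall, x0=0, y0=0, x1=4, y1=0, height=2.5'."""
    entities = []
    for line in script.strip().splitlines():
        kind, *fields = [p.strip() for p in line.split(",")]
        params = {}
        for f in fields:
            key, value = f.split("=")
            params[key.strip()] = float(value)
        entities.append({"type": kind, **params})
    return entities

script = """
wall, x0=0, y0=0, x1=4, y1=0, height=2.5
door, x0=1, y0=0, x1=2, y1=0, height=2.0
"""
scene = parse_scene(script)
# scene now holds one wall and one door, each with numeric geometry.
```

The point is that a language model can emit such token sequences autoregressively, and the layout is then recovered losslessly from the text.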

Highlight from Neural Architecture Search

Neural Architecture Search (NAS) aims to automate the design of neural network architectures. Algorithms such as the original reinforcement-learning-based NAS, ENAS, and DARTS have shown real promise in discovering effective architectures. However, early NAS methods demanded enormous resources, requiring up to 800 GPUs and 28 days to complete a single search. This drove researchers toward more efficient approaches. The search space typically focuses on CNN blocks using a micro search space approach, and researchers have introduced restricted search spaces to make the process more manageable. Data augmentation policies like Cutout and AutoAugment enhance the training process.

Several strategies emerged to tackle this inefficiency.

Weight sharing. Multiple architectures share parameters, removing the need to train each candidate from scratch.

One-Shot NAS. A single large model trains on all possible architectures, then the best subnetwork gets selected.

Zero-Shot NAS. Architectures get evaluated without any training at all, dramatically speeding up the search.
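The weight-sharing bookkeeping behind one-shot NAS can be sketched very simply. In this toy version (everything here is illustrative: the ops, the shared-weight table, and the proxy score standing in for validation accuracy), every candidate is a tuple of per-layer choices, and all candidates read from one shared parameter table instead of training from scratch.

```python
# Minimal sketch of weight sharing in one-shot NAS.
import itertools
import random

random.seed(0)

N_LAYERS = 3
OPS = ["conv3x3", "conv5x5", "skip"]

# One shared parameter per (layer, op) -- "trained" once, reused by every
# candidate subnetwork that selects that op at that layer.
shared = {(l, op): random.random() for l in range(N_LAYERS) for op in OPS}

def proxy_score(arch):
    """Stand-in for validation accuracy: sum of the shared weights it uses."""
    return sum(shared[(l, op)] for l, op in enumerate(arch))

# Enumerate the whole (tiny) search space and pick the best subnetwork,
# mirroring the selection step after supernet training.
search_space = list(itertools.product(OPS, repeat=N_LAYERS))
best = max(search_space, key=proxy_score)
```

In a real system the shared table would hold actual layer weights and the score would come from evaluating the subnetwork on held-out data, but the cost structure is the same: one training run amortized over the whole search space.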

The field developed benchmarks like NAS-Bench-201 to standardize comparisons between methods. Balancing search time against architecture accuracy remains a key challenge.

Researchers use various metrics to evaluate and guide the search process, including:

  • FLOPs (floating point operations) to measure computational complexity
  • Gradient-based metrics like gradient norm (e.g., SNIP) and Jacobian covariance
  • Gradient-free calculations
  • Other metrics like Logdet, NN-Mass, and NN-Degree
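FLOPs is the most mechanical of these metrics, and it is worth seeing how cheap it is to compute. Below is a standard back-of-the-envelope counter for a 2D convolution (the formula is the usual multiply-accumulate count; the specific layer sizes are just an example):

```python
def conv2d_flops(h, w, c_in, c_out, k, stride=1):
    """FLOPs for a k x k conv with 'same' padding: two ops (multiply + add)
    per multiply-accumulate, times the number of output elements."""
    out_h, out_w = h // stride, w // stride
    return 2 * out_h * out_w * c_out * (k * k * c_in)

# e.g. a 3x3 conv, 64 -> 64 channels, on a 32x32 feature map
flops = conv2d_flops(32, 32, 64, 64, 3)
```

Because it needs no training or even a forward pass, FLOPs is often used as a hard constraint on the search space rather than as the objective itself.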

The workshop also emphasized the importance of seeds and reproducibility in NAS experiments.
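That reproducibility point is easy to act on. A minimal seed-setting sketch with the standard library is below; in a real NAS experiment you would also seed NumPy and PyTorch (`torch.manual_seed`, plus the cuDNN determinism flags).

```python
import os
import random

def set_seed(seed: int) -> None:
    """Seed Python's RNG and the hash seed for repeatable runs."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(42)
a = [random.random() for _ in range(3)]
set_seed(42)
b = [random.random() for _ in range(3)]
# Same seed, same draws -- a prerequisite for comparable NAS results.
```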

Highlight from Computer Vision in the Wild

The Computer Vision in the Wild workshop showcased several advanced multimodal large language models, demonstrating significant progress in combining vision and language understanding. The session highlighted how generalist foundation models have evolved to handle diverse tasks and inputs.

The team introduced LLaVA 1.5, an improved version of the original LLaVA that serves as a large language and vision assistant with enhanced image understanding capabilities. They also presented ViP-LLaVA, an extension of LLaVA that can process complex image prompts including masks, bounding boxes, and contours.
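The key trick in ViP-LLaVA is that the visual prompt is drawn directly into the pixels rather than passed as coordinates. A toy version of that "burn the mark into the image" step, with the image as a plain list-of-lists grid (my own simplification, not the paper's pipeline):

```python
# Toy illustration of a visual prompt: draw a box outline directly onto a
# tiny grayscale image so the model sees the mark as pixels.

def draw_box(img, top, left, bottom, right, value=255):
    for x in range(left, right + 1):   # horizontal edges
        img[top][x] = value
        img[bottom][x] = value
    for y in range(top, bottom + 1):   # vertical edges
        img[y][left] = value
        img[y][right] = value
    return img

img = [[0] * 8 for _ in range(8)]
img = draw_box(img, 2, 2, 5, 6)
# The box outline is now part of the image; the interior is untouched.
```

Because the mark lives in pixel space, the same mechanism handles arbitrary prompts like arrows, scribbles, or contours without changing the model interface.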

Figure: Comparison of spatial reasoning capabilities across GPT-4V, LLaVA-1.5-13B, ViP-LLaVA-7B, and ViP-LLaVA-13B.

Beyond these models, they introduced Yo'LLaVA as a personalized Large Multimodal Model. This model tackles the novel task of personalizing LMMs so they can hold conversations about a specific subject.

Figure: Yo'LLaVA enables personalized multimodal conversations by learning user-specific concepts from a few training images.

They also discussed Matryoshka multimodal models, which address limitations in spatial understanding that plagued earlier CLIP models. The session covered influential papers like ODISE and BLIP-2 and illustrated the rapid progress from BLIP-2 to LLaVA. They pointed to the Open VLM leaderboard as their benchmarking tool for these models.
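The Matryoshka idea, as I understood it, is to represent one image with a nested family of visual-token counts so the model can trade detail for cost. A toy sketch of that nesting via average pooling over a token grid (the grid values and sizes here are illustrative):

```python
# Sketch of nested visual-token granularities: average-pool one token grid
# at several factors to get coarser, cheaper representations of the image.

def pool_tokens(grid, factor):
    """Average-pool a square token grid (list of lists of floats)."""
    n = len(grid)
    m = n // factor
    pooled = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            block = [grid[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            pooled[i][j] = sum(block) / len(block)
    return pooled

tokens = [[float(i * 4 + j) for j in range(4)] for i in range(4)]  # 16 tokens
coarse = pool_tokens(tokens, 2)    # 4 tokens
coarsest = pool_tokens(tokens, 4)  # 1 token
```

At inference time one picks the granularity that fits the latency budget, since every coarser grid is derived from the same underlying tokens.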

Figure: The rapid evolution of multimodal LLMs from 2022 to early 2024, showing publicly available and unavailable models.

Main Conference

At the main conference, I found these papers particularly interesting, though many more remain on my reading list.

  1. Specularity Factorization for Low-Light Enhancement
  2. Point Transformer V3: Simpler, Faster, Stronger
  3. Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences
  4. Seeing the World through Your Eyes
  5. BioCLIP: A Vision Foundation Model for the Tree of Life
  6. MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation
You can find all papers from the main conference at this link.

Posted 20th June 2024
