$ whoami
Robotics Perception & GPU Inference Engineer
M.S. in Robotics from UC Riverside, working at the intersection of SLAM, 3D reconstruction, and GPU systems. I take research ideas all the way to shipped, production code — from drone-based 3D reconstruction to custom CUDA kernels for real-time perception and LLM inference.
I graduated with a Master's degree in Robotics from the University of California, Riverside, where I specialized in computer vision, SLAM (Simultaneous Localization and Mapping), and autonomous systems. My thesis, Celesta, is a fully differentiable optimization framework that integrates distributed bundle adjustment with Leiden-based graph partitioning for scalable, GPU-accelerated visual SLAM, implemented with NVIDIA Thrust.
Previously, I worked as a Data Scientist at Jio Platforms Ltd., where I developed and shipped computer vision solutions for drone-based tower reconstruction using SLAM. I contributed to video analytics for surveillance and visual document understanding, delivering production-ready pipelines from prototype to deployment.
Lately I've gone deep on the GPU layer underneath perception and ML systems: writing custom CUDA kernels for LLM inference (fused attention, KV-cache compression, quantized GEMM on Llama 3.1 8B) and profiling every optimization with Nsight Compute. My toolkit spans C++, Python, CUDA, and ROS across NVIDIA platforms, from datacenter GPUs to Jetson edge devices.
I care about writing maintainable, performant code and bringing research ideas into deployed systems — and I'm always chasing the next hard problem in robotics, perception, and GPU computing.
Custom CUDA kernels for LLM inference: fused FlashAttention-style attention, INT4 KV-cache compression, and W4A16 quantized matmul on Llama 3.1 8B. Measured results: 1.91× over PyTorch SDPA, 6.97× over fp16 cuBLAS, and 51% lower peak VRAM end-to-end, with every step attributed to a specific Nsight Compute metric.
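For readers new to the idea, the INT4 KV-cache compression mentioned above can be illustrated with a minimal NumPy sketch of group-wise 4-bit quantization. This is not the project's CUDA kernel: the asymmetric min/max scheme, the group size of 32, and the function names `quantize_int4` and `dequantize_int4` are all assumptions made for illustration.

```python
import numpy as np

def quantize_int4(x, group_size=32):
    """Asymmetric per-group 4-bit quantization (illustrative assumption).

    Requires x.size to be divisible by group_size, and group_size to be even.
    Returns packed codes (two 4-bit values per byte) plus per-group scale and offset.
    """
    groups = x.astype(np.float32).reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    # 4 bits give 16 levels (0..15); guard against constant groups.
    scale = np.maximum(hi - lo, 1e-8) / 15.0
    codes = np.clip(np.rint((groups - lo) / scale), 0, 15).astype(np.uint8)
    # Pack even-indexed codes into the low nibble, odd-indexed into the high nibble.
    packed = codes[:, 0::2] | (codes[:, 1::2] << 4)
    return packed, scale, lo

def dequantize_int4(packed, scale, lo, shape):
    """Unpack the nibbles and map codes back to approximate float values."""
    codes = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    codes[:, 0::2] = packed & 0x0F
    codes[:, 1::2] = packed >> 4
    return (codes.astype(np.float32) * scale + lo).reshape(shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kv = rng.standard_normal((4, 64)).astype(np.float32)  # stand-in for a KV tile
    packed, scale, lo = quantize_int4(kv)
    restored = dequantize_int4(packed, scale, lo, kv.shape)
    print("bytes fp32:", kv.nbytes, "bytes packed:", packed.nbytes)
    print("max abs error:", float(np.abs(restored - kv).max()))
```

The packed codes take half a byte per value, and the per-group scale and offset add a small overhead on top. The rounding error in each group is bounded by half that group's scale, which is the usual trade-off such a scheme makes between memory and accuracy.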
A dense 3D reconstruction engine, written from scratch, that builds scenes from image collections. It is a direct continuation of the production reconstruction work I shipped at Jio, with the same co-author. The engine is private; the public Open3D visualizer (linked) walks through reconstructed scenes, including a ~4.2M-point capture of Yatra Garden.
Dockerized demo of Celesta — distributed, GPU-accelerated bundle adjustment built on DABA with Leiden graph partitioning for better load balancing across GPUs. Validated on BAL "Ladybug" (1,723 cameras, 678K measurements); custom CUDA kernels with NCCL + MPI sync. My M.S. thesis.
GPU-accelerated bundle adjustment for structure-from-motion and SLAM — a CUDA-parallelized nonlinear least-squares solver hitting 10× over CPU on the Washington BAL dataset (RTX 4090). The direct predecessor to Celesta.
A loosely-coupled 15-state Extended Kalman Filter fusing GNSS position fixes with inertial measurements on the KITTI raw dataset. Position RMS error is 2.26 m with fusion versus 1693 m for IMU-only dead reckoning, and a GPS-dropout demo shows graceful degradation and recovery. Implemented in NumPy/SciPy.
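As a rough illustration of the loosely-coupled fusion described above, here is a minimal NumPy sketch of the measurement update for a single GNSS position fix. It is not the project's code: the state layout, the function name `gnss_position_update`, and the noise values are assumptions made for illustration, and the IMU propagation step is omitted.

```python
import numpy as np

# Hypothetical 15-state layout (an assumption, not taken from the project):
# [0:3] position, [3:6] velocity, [6:9] attitude, [9:12] accel bias, [12:15] gyro bias
N_STATES = 15

def gnss_position_update(x, P, z_pos, sigma_pos):
    """Kalman measurement update for a GNSS position fix.

    x: (15,) state estimate, P: (15, 15) covariance,
    z_pos: (3,) measured position, sigma_pos: position noise std in metres.
    """
    # The receiver observes position only, so H selects the first three states.
    H = np.zeros((3, N_STATES))
    H[:, 0:3] = np.eye(3)
    R = (sigma_pos ** 2) * np.eye(3)

    y = z_pos - H @ x                   # innovation
    S = H @ P @ H.T + R                 # innovation covariance
    K = np.linalg.solve(S, H @ P).T     # K = P H^T S^-1 (P and S are symmetric)

    x_new = x + K @ y
    # Joseph form keeps the covariance symmetric and positive semi-definite.
    I_KH = np.eye(N_STATES) - K @ H
    P_new = I_KH @ P @ I_KH.T + K @ R @ K.T
    return x_new, P_new

if __name__ == "__main__":
    x = np.zeros(N_STATES)
    P = np.eye(N_STATES)
    P[:3, :3] *= 100.0                  # large position uncertainty after drift
    z = np.array([1.0, 2.0, 3.0])       # made-up GNSS fix
    x, P = gnss_position_update(x, P, z, sigma_pos=1.0)
    print("position estimate:", x[:3])
    print("position variance:", np.diag(P)[:3])
```

In a loosely-coupled design the filter consumes the receiver's position solution rather than raw pseudoranges, which is why the measurement matrix here is just a selector on the position states.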
A computer-vision agent that analyzes photos end-to-end — facial analysis, aesthetic scoring, CLIP/BLIP semantic tagging, DINOv2 + FAISS similarity search, and DETR/ViT scene understanding.
Vehicle-to-vehicle communication for intelligent collaborative driving: autonomous vehicles coordinating as multiple agents in CARLA simulation.
A compact Vision Transformer trained on Fashion-MNIST, with training and inference separated from visualization, plus optional Nsight profiling hooks for GPU timeline analysis.
Bayesian drift modeling and analysis for semiconductor applications. Jupyter-based workflows for inference and visualization.
A fun project implementing a simple neural network in CUDA for MNIST digit classification, written in C++ with GPU acceleration.
A ROS playground for forward/inverse kinematics, open- and closed-loop control, and 2D path planning, visualized in Gazebo. From UC Riverside's EE283A (Foundations of Robotics).
A fully differentiable optimization framework that integrates Distributed Accelerated Bundle Adjustment (DABA) with the Leiden algorithm for improved graph partitioning in visual SLAM. Implemented with NVIDIA Thrust, Celesta achieves scalable, GPU-accelerated bundle adjustment with balanced workloads and better convergence than Louvain-based partitioning. Master's thesis, UC Riverside.
Download thesis (PDF)
Long-form notes on the systems I build: first-principles write-ups, not blog filler.
The full book behind my LLM Inference Kernels project: what attention actually computes, the GPU mental model, decode attention naive → fast, KV-cache compression, and the cross-cutting workflow. Its central idea: most kernels aren't bandwidth- or compute-bound; they're dependency-chain-bound.
Read the write-up
A short field report: trading occupancy for speed, why an INT8 KV cache shrank memory but didn't move latency, and the discipline of predicting the direction and magnitude of every change before running it.
Read the write-up
ROS
MATLAB®
PyTorch
OpenCV
TensorFlow
I'm always interested in new opportunities and exciting problems in robotics, perception, and AI. Whether you have a question or just want to say hi, feel free to reach out.