multimodality

True intelligence emerges when neural networks can perceive and reason across modalities (vision, language & action), autonomously invent novel tasks, learn to solve them, and adapt to unfamiliar, real-world environments. My current interests include:

  • Multimodal Learning
    • Vision/Video–Language integration (VLMs, MLLMs, VLAMs) & representation learning.
  • Computer Vision
    • 3D-aware perception to improve compositional/spatial reasoning, few/zero-shot learning, video understanding & long-horizon prediction.
  • Reinforcement and Open-ended Learning
    • Enabling agents to self-generate tasks and steadily build competence across changing environments; world-model-based RL.

I believe progress here could meaningfully accelerate scientific discovery, embodied AI, and healthcare at scale, with impact that could surpass current aspirations for AGI/ASI. I'm open to research collaborations and internships!

📌 Check out the Multimodal/VLMs Research Hub. I started it because I thought a community-driven hub for multimodal researchers would be great. Contributions and suggestions are welcome!

✍️ I enjoy jotting down my thoughts and keeping an organized “second brain”, unlike my primary one :) You can find some interesting stuff in the brain dump section above.

Outside of work, you’ll find me taking random pictures, reading, trekking, or playing and watching a variety of sports (football, cricket, MMA, & esports).