Multimodal Research Hub - Vision-Language Models (VLMs)

A living resource for Vision-Language Models & multimodal learning

Yash Thube

Pathological Truth Bias in Vision-Language Models

MATS, a behavioral audit for vision-language models, identifies systematic failures in spatial consistency and suggests repair paths through activation patching.
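As a rough, self-contained illustration of what activation patching looks like in practice (a toy PyTorch model, not code from MATS): cache a hidden activation from a "clean" run, then overwrite the same activation during a "corrupted" run and check how much of the clean behavior is restored.

```python
import torch
import torch.nn as nn

# Toy two-layer model standing in for a much larger VLM.
class ToyModel(nn.Module):
    def __init__(self, d=16):
        super().__init__()
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, 2)  # two "answer" logits

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

model = ToyModel()
clean, corrupted = torch.randn(1, 16), torch.randn(1, 16)

# 1) Cache the clean activation at layer1.
cache = {}
def save_hook(module, inputs, output):
    cache["layer1"] = output.detach()

handle = model.layer1.register_forward_hook(save_hook)
clean_logits = model(clean)
handle.remove()

# 2) Re-run on the corrupted input, patching in the cached clean activation.
def patch_hook(module, inputs, output):
    return cache["layer1"]  # returned value replaces the layer's output

handle = model.layer1.register_forward_hook(patch_hook)
patched_logits = model(corrupted)
handle.remove()

corrupted_logits = model(corrupted)

# 3) If patching layer1 moves the corrupted logits back toward the clean ones,
#    that activation site carries the behavior under study.
print(clean_logits, corrupted_logits, patched_logits)
```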

Yash Thube (2025)

SSL-Vision

Implementation of four self-supervised learning (SSL) algorithms (SimCLR, MoCo-v2, BYOL, and DINO) on the CIFAR-10 dataset using a ResNet-18 backbone. All scripts are designed to run on the Google Colab free tier (single GPU, ~12 GB RAM).
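For flavor, here is a minimal NT-Xent (SimCLR-style) contrastive loss, the core objective behind the first of the four algorithms. This is an illustrative sketch, not the repository's code; the batch size, embedding dimension, and temperature are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """SimCLR-style NT-Xent loss for two augmented views of the same batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)          # (2N, D)
    sim = z @ z.t() / temperature           # cosine similarities
    n = z1.size(0)
    sim.fill_diagonal_(float("-inf"))       # never treat a sample as its own positive
    # The positive for sample i is its other augmented view: i + N (or i - N).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Usage: projection-head embeddings from a ResNet-18 encoder applied to
# two augmentations of the same CIFAR-10 batch.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent(z1, z2))
```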

Yash Thube

Task-Aware Segment Anything with LoRA

A PyTorch pipeline that uses a hypernetwork to generate task-specific LoRA adapters for Meta's Segment Anything Model from natural-language prompts, performs targeted segmentation on COCO instances, and benchmarks mIoU via pycocotools. Code is available in the GitHub repo.
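A hedged sketch of the central idea, with hypothetical module names and shapes rather than the project's actual code: a hypernetwork maps a prompt embedding to low-rank LoRA factors that modulate a frozen linear layer, as one might wrap a projection inside SAM's image encoder.

```python
import torch
import torch.nn as nn

class LoRAHyperLinear(nn.Module):
    """Wrap a frozen linear layer with prompt-conditioned LoRA factors (sketch)."""
    def __init__(self, base: nn.Linear, prompt_dim=512, rank=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained weights frozen
        self.d_in, self.d_out, self.rank = base.in_features, base.out_features, rank
        # Hypernetwork: prompt embedding -> flattened A (r x d_in) and B (d_out x r)
        self.hyper = nn.Linear(prompt_dim, rank * self.d_in + self.d_out * rank)

    def forward(self, x, prompt_emb):
        flat = self.hyper(prompt_emb)
        A = flat[: self.rank * self.d_in].view(self.rank, self.d_in)
        B = flat[self.rank * self.d_in :].view(self.d_out, self.rank)
        delta = x @ A.t() @ B.t()            # low-rank, task-specific update
        return self.base(x) + delta

# Usage with placeholder shapes: a CLIP-style text embedding conditions the adapter.
layer = LoRAHyperLinear(nn.Linear(256, 256), prompt_dim=512, rank=4)
x, prompt = torch.randn(2, 196, 256), torch.randn(512)
print(layer(x, prompt).shape)  # torch.Size([2, 196, 256])
```

Only the hypernetwork is trained here; the wrapped layer stays frozen, which is what makes the adapters cheap to specialize per prompt.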

Yash Thube