CLIP Implementation
This repository implements a CLIP-like model that aligns image and text representations in a shared embedding space, built with PyTorch and Hugging Face Transformers. A Vision Transformer (ViT) serves as the image encoder and DistilRoBERTa as the text encoder; the model is trained on the Flickr30k dataset.
Blog: Understanding CLIP
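
The sketch below shows how such a dual-encoder might be wired together: each backbone's [CLS] output is projected into a shared embedding space, and a symmetric contrastive loss pulls matching image-caption pairs together. This is a minimal illustration under stated assumptions, not the repository's exact code; the checkpoint names, projection dimension, and the `clip_loss` helper are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, ViTModel


class CLIPLike(nn.Module):
    """ViT image encoder + DistilRoBERTa text encoder with linear projections."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Pretrained backbones (checkpoint names are illustrative assumptions).
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text_encoder = AutoModel.from_pretrained("distilroberta-base")
        # Project both modalities into a shared embedding space.
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        # Learnable temperature, initialised to log(1/0.07) as in the CLIP paper.
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

    def forward(self, pixel_values, input_ids, attention_mask):
        # Take each backbone's [CLS] token as the sequence-level representation.
        img = self.image_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
        txt = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state[:, 0]
        img = F.normalize(self.image_proj(img), dim=-1)
        txt = F.normalize(self.text_proj(txt), dim=-1)
        # Temperature-scaled pairwise cosine similarities, shape (batch, batch).
        return self.logit_scale.exp() * img @ txt.t()


def clip_loss(logits: torch.Tensor) -> torch.Tensor:
    """Symmetric cross-entropy over the image->text and text->image directions."""
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

During training, each batch of Flickr30k image-caption pairs yields a (batch, batch) similarity matrix whose diagonal entries correspond to the true pairs; minimizing `clip_loss` raises those diagonal similarities relative to all mismatched combinations.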