This project implements a transformer-based model that estimates crowd density in images. Self-attention lets the model combine local detail with global scene context, which supports accurate crowd counting even in dense, complex scenes. The design prioritizes efficiency, reducing computational cost while maintaining high accuracy. The model is trained on large-scale datasets and generalizes well across varied crowd scenarios, making it robust for real-world applications.
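
The description above does not spell out the architecture, so the following is only a minimal sketch of the two ideas it names: self-attention over image patches, and counting from a density map. It assumes single-head scaled dot-product attention with identity query/key/value projections (a real model would learn these), and it assumes the common density-estimation convention that the predicted count is the sum of the density map. The function names `softmax`, `self_attention`, and `count_from_density` are hypothetical and are not taken from this project's code.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    tokens: list of n feature vectors (one per image patch), each of length d.
    Assumption: identity Q/K/V projections, to keep the sketch short.
    Every output vector is a weighted mix of ALL input vectors, which is how
    a patch can draw on global context as well as its own local features.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in tokens:
        # Similarity of this patch to every patch, including itself.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Weighted average of the value vectors (here, the tokens themselves).
        mixed = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        out.append(mixed)
    return out


def count_from_density(density_map):
    """Estimated crowd count: the sum of all density values (assumed convention)."""
    return sum(sum(row) for row in density_map)


# Toy inputs, made up for illustration only.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)  # same shape as tokens: 3 vectors of length 2

density = [[0.5, 0.25], [0.25, 1.0]]
print(count_from_density(density))  # 2.0
```

The cost of this plain attention grows quadratically with the number of patches, which is why efficiency-oriented designs typically restrict or approximate it; which strategy this project uses is not stated here.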