Visual impairment significantly restricts independent mobility and comprehension of the environment. While traditional tools such as white canes and guide dogs offer basic support, they cannot provide detailed, contextual scene descriptions. This project presents an AI-powered Accessibility Assistant designed to transform visual environments into human-like verbal guidance. Initially, the study evaluated classification-based vision-language models such as CLIP, which proved inadequate because they cannot generate descriptive, safety-oriented captions. To address these limitations, the project proposes a generative pipeline that uses Bootstrapping Language-Image Pre-training (BLIP) for scene understanding, integrated with LLaMA for linguistic refinement. The system builds on a Vision Transformer (ViT) and a Multimodal Encoder-Decoder (MED) architecture to ensure accurate image-text alignment. Quantitative evaluations using BLEU, ROUGE, and METEOR metrics show that although lexical overlap may decrease during linguistic refinement, semantic clarity and readability improve significantly. The resulting system provides real-time, audio-based spatial awareness, enabling safer and more independent navigation for visually impaired users.
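The pipeline described above can be illustrated with a minimal sketch: BLIP produces a raw caption, a LLaMA-based step rewrites it as safety-oriented guidance, and gTTS renders the result as audio. The model checkpoints, the refine_with_llama helper, and the file names are illustrative assumptions, not the project's exact configuration.

```python
# Minimal sketch of the caption -> refinement -> speech pipeline.
# Checkpoints, prompts, and file names are illustrative placeholders.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from gtts import gTTS

# 1. Scene understanding: BLIP (ViT encoder + MED decoder) captions the frame.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("street_scene.jpg").convert("RGB")  # placeholder input frame
inputs = processor(images=image, return_tensors="pt")
output_ids = blip.generate(**inputs, max_new_tokens=40)
raw_caption = processor.decode(output_ids[0], skip_special_tokens=True)

# 2. Linguistic refinement: the raw caption is rewritten as safety-focused
#    spoken guidance. refine_with_llama is a hypothetical wrapper around
#    whichever LLaMA checkpoint or API the deployment uses.
def refine_with_llama(caption: str) -> str:
    prompt = (
        "Rewrite this image caption as short, safety-focused spoken guidance "
        f"for a visually impaired pedestrian: '{caption}'"
    )
    # e.g. return llama_generate(prompt)  # actual call depends on deployment
    return prompt  # stand-in so the sketch runs end to end

guidance = refine_with_llama(raw_caption)

# 3. Audio output: gTTS converts the refined guidance to speech.
gTTS(text=guidance, lang="en").save("guidance.mp3")
```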
Tools: Python, BLIP, CLIP, LLaMA, Google Text-to-Speech (gTTS), OpenCV, Matplotlib, Seaborn
Department: Department of Mathematics
Poster