

Poster

Semantic and Expressive Variations in Image Captions Across Languages

Andre Ye · Sebastin Santy · Jena D. Hwang · Amy X Zhang · Ranjay Krishna


Abstract:

Most vision-language models today are trained primarily on English image-text pairs, with non-English pairs often filtered out. Evidence from cross-cultural psychology suggests that this approach biases models against perceptual modes exhibited by people who speak other (non-English) languages. We investigate semantic and expressive variation in image captions across different languages, analyzing both human-annotated datasets and model-produced captions. By analyzing captions across seven languages (English, French, German, Russian, Chinese, Japanese, Korean) in high-quality image captioning datasets (Crossmodal and Visual Genome), we find that multilingual caption sets tend to provide richer visual descriptions than monolingual (including English-only) ones; multilingual sets contain 46.0% more objects, 66.1% more relationships, and 66.8% more attributes. We observe the same results with multilingual captions produced by LLaVA and the Google Vertex API: for example, compared to monolingual captions, they cover 21.9% more objects, 18.8% more relations, and 20.1% more attributes. These results suggest that, across a large number of samples, different languages bias people and models to focus on different visual concepts. Finally, we show that models trained on image-text data in one language perform distinctly better on that language's test set. Our work points towards the potential value of training vision models on multilingual data sources to widen the range of descriptive information those models are exposed to.
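The coverage comparison described above can be illustrated with a small sketch. This is not the authors' pipeline; it assumes captions have already been translated into English and uses spaCy noun-chunk heads as a crude proxy for "object" mentions (the paper's actual concept extraction, and its handling of relationships and attributes, is not specified here). The function names, example captions, and the choice of spaCy are illustrative assumptions.

    # Minimal sketch: compare how many distinct object mentions a multilingual
    # caption set covers versus a monolingual one, under the assumptions above.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

    def object_mentions(captions):
        """Return the set of lemmatized noun-chunk heads found in the captions."""
        objects = set()
        for doc in nlp.pipe(captions):
            for chunk in doc.noun_chunks:
                objects.add(chunk.root.lemma_.lower())
        return objects

    def coverage_gain(multilingual_captions, monolingual_captions):
        """Percentage of additional distinct objects in the multilingual set."""
        multi = object_mentions(multilingual_captions)
        mono = object_mentions(monolingual_captions)
        if not mono:
            return float("inf")
        return 100.0 * (len(multi) - len(mono)) / len(mono)

    # Toy usage with hypothetical captions for a single image:
    english_only = ["A man rides a bicycle down the street."]
    multilingual = english_only + [
        "A man in a red jacket rides past a bakery.",      # e.g. translated from French
        "A cyclist passes parked cars near a crosswalk.",  # e.g. translated from German
    ]
    print(f"Extra objects covered: {coverage_gain(multilingual, english_only):.1f}%")

Aggregating such per-image gains over a large dataset is one way to arrive at coverage statistics of the kind reported in the abstract, though the exact extraction and aggregation choices there may differ.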
