From Pixels to Voice: A Simple and Efficient End-to-End Spoken Image Description Approach via Vision Codec Language Models

Chung Tran (tran.quang_chung.tq9@naist.ac.jp), Sakriani Sakti (ssakti@is.naist.jp)

Paper | Slide | Poster
Personal Website | Lab Website

Spoken Image Description Samples

Image | Generated Audio | ASR Transcription
a brown dog is running across the green fence
a boy wearing a helmet is playing on a steep bord
two dogs running in grass
thre children playing in the sand
two young boys playing outside
a boy in man jumps ar riding a yellow kepacord
two girls playing sock
a blond holding girl standing on top of a bed
clamber or climbing a neste