From Pixels to Voice

Image	Generated Audio	ASR Transcription
		a brown dog is running across the green fence
		a boy wearing a helmet is playing on a steep bord
		two dogs running in grass
		thre children playing in the sand
		two young boys playing outside
		a boy in man jumps ar riding a yellow kepacord
		two girls playing sock
		a blond holding girl standing on top of a bed
		clamber or climbing a neste

From Pixels to Voice: A Simple and Efficient End-to-End Spoken Image Description Approach via Vision Codec Language Models