Grounding Semantic Roles in Images

30 October 2018
Images of everyday scenes can be interpreted and described in many ways, depending on the perceiver and the context in which the image is presented, where the context may be natural language data or a visual sequence. Interpreting a (visual) scene involves determining who did what to whom, and this may require joint processing or reasoning over multiple, possibly extra-linguistic, information sources (e.g., text and images).

To facilitate the joint processing over multiple sources, it is desirable to induce representations of texts and visual scenes which do encode this kind of information, and in, essentially, a congruent and generic way. In this talk I will present our approach towards this goal: We address the task of visual semantic role labeling (vSRL), and learn frame--semantic representations of images. Our model renders candidate participants as image regions of objects, and is trained towards grounding roles in the regions which depict the corresponding participant. I will present experimental results which demonstrate that we can train a vSRL model without reliance on prohibitive image-based role annotations, by utilizing noisy data which we extract automatically from image captions using a linguistic SRL system. Furthermore, the frame--semantic visual representations which our model induces yield overall better results on supervised visual verb sense disambiguation compared to previous work.