Clippers Tuesday: Micha Elsner on Saccadic Models for Referring Expression Generation

“Saccadic models for referring expression generation”

Referring expression generation (REG) is the task of describing an object in a scene so that an observer can pick it out. Many experimental results show that REG is constrained by the sequential nature of human vision: the human eye cannot take in the whole image at once, but must move from place to place (saccade) to see more parts of the image clearly. Yet current neural network models for computer vision begin precisely by analyzing the entire image at once; thus, they cannot be used directly as models of the human REG algorithm. A recent model for computer vision (Mnih et al., 2014) has a limited field of vision and makes saccades around the image; I propose to adapt this model to the REG task and use it as a psycholinguistic model of human processing. I will present some background literature, a pilot model architecture, and results on some contrived tasks with synthetic data. I will discuss possible ways forward for the model and hope to get some interesting feedback from the group.