Ankit Khare

Graduation Semester and Year




Document Type


Degree Name

Master of Science in Computer Science


Department
Computer Science and Engineering

First Advisor

Manfred Huber


Several combinations of visual and semantic attention have been explored in pursuit of better image captioning architectures. In this work we introduce a novel combination of word-level semantic context with image feature-level visual context, which provides a more holistic overall context for image caption generation. This approach does not require training any explicit network structure, using any external resource for training semantic attributes, or supervision during any training step. The proposed architecture addresses the significance of learning to find context at three levels to achieve a better trade-off, as well as a balance, between the two lines of attentiveness (word-level and image feature-level). The structure of the visual information is very different from the structure of the captions to be generated: encoded visual information is unlikely to contain the level of structural information needed for correctly generating the textual description in the subsequent decoding phase. Attention mechanisms aim at reconciling the two modalities of language and vision but often fail to find a balance between them. Our approach establishes this balance: the encoder-decoder pipeline learns to pay balanced attention to the two modalities, so captions neither drift toward the language model irrespective of the visual content of the image nor toward the image objects regardless of the saliency observed in the generated sentence history. We demonstrate how the encoder's convolutional feature space, attended in a top-down fashion and conditioned in parallel over the entire n-gram word space, can provide maximal context for sophisticated language generation. Effective architectural variations that produce hybrid attention mechanisms steer a model toward better utilization of rich image features while generating final captions. The impact of this mechanism is demonstrated through extensive analysis on the MS-COCO dataset.
The proposed system surpasses state-of-the-art results, illustrating how this context-based architectural design opens up new ways of addressing context and the overall task of image captioning.
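The core idea of blending image feature-level visual attention with word-level semantic attention can be sketched as follows. This is a minimal illustrative example, not the thesis's actual model: the function name `hybrid_context`, the projection matrices `Wv`/`Ww`, and the fixed balance weight `beta` are all assumptions for exposition (in a trained system these parameters would be learned end to end, and the balance would typically be predicted per step).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def hybrid_context(img_feats, word_embs, hidden, Wv, Ww, beta=0.5):
    """Blend visual and word-level semantic attention (illustrative only).

    img_feats: (R, d) region features from the CNN encoder
    word_embs: (T, d) embeddings of the generated word history (n-gram space)
    hidden:    (d,)   current decoder hidden state (top-down query)
    Wv, Ww:    (d, d) projection matrices -- random here, learned in practice
    beta:      hypothetical fixed weight balancing the two modalities
    """
    vis_scores = img_feats @ (Wv @ hidden)      # (R,) attention logits over regions
    sem_scores = word_embs @ (Ww @ hidden)      # (T,) attention logits over word history
    vis_ctx = softmax(vis_scores) @ img_feats   # (d,) attended visual context
    sem_ctx = softmax(sem_scores) @ word_embs   # (d,) attended semantic context
    # A balanced combination keeps the decoder from drifting toward either
    # the language model alone or the image objects alone.
    return beta * vis_ctx + (1 - beta) * sem_ctx

rng = np.random.default_rng(0)
d = 8
ctx = hybrid_context(rng.normal(size=(5, d)),   # 5 image regions
                     rng.normal(size=(3, d)),   # 3 previously generated words
                     rng.normal(size=d),
                     rng.normal(size=(d, d)),
                     rng.normal(size=(d, d)))
print(ctx.shape)
```

The resulting context vector has the same dimensionality as the inputs and would feed the decoder at each generation step.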


Image captioning, Encoder-decoder architecture, Attention mechanisms, Deep learning


Computer Sciences | Physical Sciences and Mathematics


Degree granted by The University of Texas at Arlington