Graduation Semester and Year
2019
Language
English
Document Type
Thesis
Degree Name
Master of Science in Computer Science
Department
Computer Science and Engineering
First Advisor
Manfred Huber
Abstract
Several combinations of visual and semantic attention have been explored in pursuit of better image captioning architectures. In this work we introduce a novel combination of word-level semantic context with image feature-level visual context, which provides a more holistic overall context for image caption generation. This approach does not require training any explicit network structure, using any external resource for training semantic attributes, or supervision during any training step. The proposed architecture highlights the significance of learning to find context at three levels, achieving a better trade-off and balance between the two lines of attentiveness (word-level and image feature-level). The structure of the visual information is very different from the structure of the captions to be generated; encoded visual information is therefore unlikely to contain the full structural information needed to correctly generate the textual description in the subsequent decoding phase. Attention mechanisms aim to reconcile the two modalities of language and vision but often fail to find a balance between them. Our approach establishes this balance: the encoder-decoder pipeline learns to pay balanced attention to the two modalities, so the generated captions drift neither towards the language model irrespective of the image's visual content, nor towards the image objects regardless of the saliency observed in the generated sentence history. We demonstrate how the encoder's convolutional feature space, attended in a top-down fashion and conditioned in parallel over the entire n-gram word space, can provide maximal context for sophisticated language generation. Effective architectural variations that produce hybrid attention mechanisms steer the model towards better utilization of rich image features while generating the final captions. The impact of this mechanism is demonstrated through extensive analysis on the MS-COCO dataset. The proposed system outperforms state-of-the-art methods, illustrating how this context-based architectural design opens up new ways of addressing context and the overall task of image captioning.
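The hybrid attention idea outlined above can be sketched in a few lines of code. The sketch below is a minimal illustration in PyTorch; all module names, dimensions, and the exact fusion of the two contexts are assumptions for illustration, not the thesis's actual implementation. It computes top-down visual attention over CNN region features in parallel with semantic attention over the generated word history, yielding one context vector per modality for the decoder to balance.

# Minimal sketch of a hybrid (visual + word-level semantic) attention step.
# All sizes and the scoring scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridAttention(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, attn_dim=512):
        super().__init__()
        # Projections for top-down visual attention over CNN regions
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.vis_score = nn.Linear(attn_dim, 1)
        # Projections for semantic attention over the generated word history
        self.word_proj = nn.Linear(embed_dim, attn_dim)
        self.sem_score = nn.Linear(attn_dim, 1)

    def forward(self, feats, words, hidden):
        # feats:  (B, R, feat_dim)  -- R spatial regions from the CNN encoder
        # words:  (B, T, embed_dim) -- embeddings of the words generated so far
        # hidden: (B, hidden_dim)   -- current decoder state (top-down signal)
        h = self.hid_proj(hidden).unsqueeze(1)                 # (B, 1, attn_dim)
        # Visual attention: score each image region against the decoder state
        v_logits = self.vis_score(torch.tanh(self.feat_proj(feats) + h)).squeeze(-1)
        v_alpha = F.softmax(v_logits, dim=1)                   # (B, R)
        v_ctx = (v_alpha.unsqueeze(-1) * feats).sum(dim=1)     # (B, feat_dim)
        # Semantic attention: score each past word against the decoder state
        s_logits = self.sem_score(torch.tanh(self.word_proj(words) + h)).squeeze(-1)
        s_alpha = F.softmax(s_logits, dim=1)                   # (B, T)
        s_ctx = (s_alpha.unsqueeze(-1) * words).sum(dim=1)     # (B, embed_dim)
        return v_ctx, s_ctx

In an actual decoding step, v_ctx and s_ctx would be combined with the current word embedding and fed to the recurrent cell, letting the model weigh the two modalities at every time step.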
Keywords
Image captioning, Encoder-decoder architecture, Attention mechanisms, Deep learning
Disciplines
Computer Sciences | Physical Sciences and Mathematics
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Recommended Citation
Khare, Ankit, "ULTRA-CONTEXT: MAXIMIZING THE CONTEXT FOR BETTER IMAGE CAPTION GENERATION" (2019). Computer Science and Engineering Theses. 496.
https://mavmatrix.uta.edu/cse_theses/496
Comments
Degree granted by The University of Texas at Arlington