Graduation Semester and Year

2017

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Manfred Huber

Abstract

Although recent work on neural architectures has shown promising results on tasks such as image recognition, object detection, and playing Atari games, learning a mapping from a visual space to a language space, or vice versa, remains challenging in problems such as image/video captioning and question answering. Furthermore, transferring knowledge between seen and unseen classes in a setting such as zero-shot learning is challenging because a model must make predictions for novel test data belonging to classes for which no examples were seen during training. To address these issues, this dissertation first introduces a novel memory-based attention model for video description. Attention-based models have shown promising results for image captioning. However, they are not able to model the higher-order interactions involved in problems such as video description/captioning, where the relationship between parts of the video and the concepts being depicted is complex. The proposed model uses memories of past attention when reasoning about where to attend at the current time step. Second, this dissertation introduces an end-to-end deep neural network model for attribute-based zero-shot learning with layer-specific regularization that encourages the higher, class-level layers to generalize beyond the training classes. This architecture enables the model to 'transfer' knowledge learned from seen training images to a set of novel, unseen test images.

Keywords

Video captioning, Attention model, Deep learning, Transfer learning, Imposing structure, Differentiable memory

Disciplines

Computer Sciences | Physical Sciences and Mathematics

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Comments

Degree granted by The University of Texas at Arlington

Recommended Citation

Fakoor, Rasool, "Neural Image and Video Understanding" (2017). Computer Science and Engineering Dissertations. 307.
https://mavmatrix.uta.edu/cse_dissertations/307

