Saif Sayed

ORCID Identifier(s)


Graduation Semester and Year




Document Type


Degree Name

Doctor of Philosophy in Computer Science


Computer Science and Engineering

First Advisor

Vassilis Athitsos


Automatic understanding of human behavior has several applications in medicine and surveillance. Analysing human actions can enable cognitive assessment of children by measuring their hyperactivity and response inhibition, which can give physicians a better understanding of their cognitive state. Automatic and non-invasive assessment of cognitive disorders will increase the affordability and reach of these detection methods and can prove life-changing for a child's development. Human activity can also be analysed in common settings such as cooking in a kitchen, where information about human-object interaction can give priors on the underlying activity being performed. In the first section, we focus on cognitive assessment. Specifically, we introduce a new dataset towards the development of an automated system for the Activate Test of Embodied Cognition (ATEC), a measurement that evaluates cognitive skills through physical activity. Evaluating cognitive skills through physical activity requires subjects to perform a wide variety of tasks with varying levels of complexity. To make the system affordable and reachable to a larger population, we created an automated system that can score these human activities as accurately as an expert. To this end, we developed an activity recognition system for one of the most challenging tasks in ATEC, called Cross-Your-Body, which can evaluate attention, response inhibition, rhythm and co-ordination, task switching, and working memory. We created and annotated a dataset that enabled us to train vision-based activity segmentation models. First, we developed a very accurate system that takes a trimmed video as input, where every video contains only one action, and predicts the human activity by tracking human pose features. Second, we improved the system to create an end-to-end method that can track multiple activities in an untrimmed video, which enabled the generation of scores that map directly to an expert's scores with high inter-rater reliability.
In the second section, we study action segmentation in instructional videos under timestamp supervision. In the action segmentation domain, the goal is to temporally divide the input video into a set of sequential actions. In the fully supervised setting, training labels are given for every frame, while in weakly supervised settings the labels are given at the video level as a sequence of actions. While weakly supervised labels reduce the annotation time for labeling videos, test performance falls well short of the fully supervised setting. To alleviate this problem, timestamp supervision adds, in addition to the sequence of actions, a single frame number for each action, which places significant constraints on when each activity may happen. We study timestamp supervision under several scenarios. First, we created a new approach that uses human-object interaction (HOI) as a source of information beyond the existing optical flow and RGB information. The system creates new pseudo-ground truth by expanding the timestamp annotations using information from an off-the-shelf pre-trained HOI detector, which requires no additional HOI-related annotations. We also changed the temporal modelling system from a temporal-convolution-based model to a transformer-based one, which further improved performance. Second, to enable research on HOI and multi-view action segmentation, we created a first-of-its-kind dataset called (3+1)Rec, which has 1799 long, high-quality videos, comprising three third-person views and one egocentric view for each dish the subject prepares in a kitchen environment.
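The three supervision levels described above can be illustrated with a minimal sketch. The action names, the choice of the middle frame as the annotated timestamp, and the function names are all hypothetical and are not taken from the dissertation. In particular, `naive_expand` is a simple midpoint baseline for turning timestamps into frame-wise pseudo-labels; it is not the HOI-based expansion the abstract describes.

```python
from itertools import groupby

def transcript(frame_labels):
    """Weak supervision: the ordered sequence of actions, with no timing."""
    return [label for label, _ in groupby(frame_labels)]

def timestamps(frame_labels):
    """Timestamp supervision: one (frame_index, label) pair per action segment.

    Assumption: the middle frame of each segment is annotated; in practice
    an annotator could mark any frame inside the segment.
    """
    stamps, start = [], 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        stamps.append((start + (length - 1) // 2, label))
        start += length
    return stamps

def naive_expand(stamps, num_frames):
    """Naive pseudo-ground truth: place each boundary halfway between
    consecutive timestamps and fill the frames on either side."""
    labels = []
    for i, (frame, label) in enumerate(stamps):
        if i + 1 < len(stamps):
            end = (frame + stamps[i + 1][0]) // 2 + 1
        else:
            end = num_frames
        labels.extend([label] * (end - len(labels)))
    return labels

# Full supervision: one label per frame (hypothetical 12-frame video).
frames = ["crack_egg"] * 4 + ["stir"] * 6 + ["pour"] * 2
weak = transcript(frames)
stamps = timestamps(frames)
pseudo = naive_expand(stamps, len(frames))
```

In this toy case the midpoint rule assigns frame 9 to the wrong action, which is the kind of boundary error that an additional cue, such as the HOI information mentioned above, is meant to reduce.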


Action segmentation, Computer vision, Cognitive assessment


Computer Sciences | Physical Sciences and Mathematics


Degree granted by The University of Texas at Arlington