ORCID Identifier(s)

ORCID 0000-0002-0426-9029

Graduation Semester and Year

Summer 2025

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Yingying Zhu

Second Advisor

Jia Rao

Third Advisor

Vassilis Athitsos

Fourth Advisor

Junzhou Huang

Abstract

Multi-modal learning has gained significant attention in deep learning for its ability to integrate and process information from multiple modalities, such as text, images, and videos. By leveraging complementary information from different modalities, it enables a more comprehensive understanding of complex data across a variety of tasks. In parallel, graph learning, a prominent paradigm that models structured data as graphs, captures both local and global dependencies, providing a natural framework for representing intricate interactions and contextual relationships. When combined with multi-modal learning, graph-based approaches have the potential to enhance feature representation and reasoning by effectively fusing heterogeneous data, leading to more robust performance in tasks that demand nuanced understanding, such as diagnostic interpretation. This dissertation advances multi-modal graph learning in both general and medical domains through four interconnected contributions: (1) a novel multi-modal alignment framework for video domain adaptation, (2) graph-based reasoning architectures for medical visual question answering (VQA), (3) an LLM-powered pipeline for scalable medical VQA dataset construction, and (4) a foundation vision-language model tailored for transparent chest X-ray analysis.

First, addressing the limitations of unimodal approaches in unsupervised video domain adaptation (DA), I propose a dual-alignment framework that unifies the RGB and optical flow modalities. By integrating frame-level and region-level graph representations, the model learns spatiotemporally invariant features, achieving state-of-the-art cross-domain performance on action recognition tasks. This work underscores the critical role of multi-modal interaction in general-domain video understanding.

Second, to bridge the gap between conventional single-image VQA and real-world medical diagnostics, I introduce Difference Visual Question Answering (Difference VQA), a novel task requiring comparative analysis of paired medical images. I develop a graph learning architecture that explicitly models anatomical relationships and embeds radiological expert knowledge as structured semantic graphs. This framework, together with a novel dataset I constructed to support the proposed task, enables comparative reasoning over pairs of X-ray images and textual questions, establishing foundational benchmarks for multi-modal analysis in radiological contexts.

Third, to overcome the rigidity of rule-based medical dataset creation, I leverage large language models (LLMs) to automate the extraction of diverse, clinically relevant information from textual reports. This approach achieves 77% diagnostic consistency (vs. 15% for rule-based methods with the same keyword coverage), enabling comprehensive and scalable training for multimodal large language models (MLLMs).

Building on these advances, I propose CLARITY, a novel MLLM for chest X-ray diagnosis that prioritizes reasoning and uncertainty awareness. Fine-tuned on chest X-ray data, CLARITY aims to provide a robust diagnostic reasoning vision-language model that supports in-depth medical analysis and benefits the research community.

Collectively, these studies demonstrate the transformative potential of multi-modal graph learning in unifying vision and language, paving the way for innovative applications in both general video analysis and medical diagnosis.

Keywords

Multimodal learning, Graph learning, Visual question answering, Computer vision, Medical imaging analysis, Large language models, Vision language models

Disciplines

Computer Sciences

Available for download on Saturday, August 01, 2026