Graduation Semester and Year
Spring 2026
Language
English
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Department
Computer Science and Engineering
First Advisor
Junzhou Huang
Abstract
I present my work on building multimodal, guideline-aligned agentic systems that enable AI agents to solve complex real-world tasks. My research approaches this goal from two perspectives: (1) Instruction-Aware Embedding Models for flexible and universal embedding tasks, and (2) Guideline-Driven LLM Agents that leverage domain-specific guidelines to perform expert-level tasks. These components address embedding and generation, respectively, and together lay the foundation for a hybrid agent capable of tackling challenging real-world applications.
From the embedding perspective, I first address the instruction-following capabilities of embedding models. While Large Language Models (LLMs) excel at instruction following, they are designed primarily for generation rather than embedding, and existing embedding models typically encode samples into static representations without accepting instructions. Many applications, however, require instruction-aware embeddings. For instance, Composed Image Retrieval (CIR) finds a target image given a source image and a textual modification, which demands understanding how the modification transforms the source image. To develop instruction-aware embedding models, I propose using Multimodal LLMs (MLLMs) as embedding models and introduce a novel two-stage training strategy: in the first stage, the model learns to align image-caption pairs; in the second stage, I derive triplet data using Chain-of-Thought (CoT) prompting and apply instruction contrastive tuning so the model learns to follow instructions. Although MLLM-based embedding models can accept instructions, their large size and slow inference make them unsuitable for real-time retrieval systems. To address this limitation, I propose a dual-stream distillation strategy that transfers instruction-following capabilities from MLLMs to more efficient CLIP-based models.
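To make the second training stage concrete, the following is a minimal sketch of instruction contrastive tuning under common assumptions: a PyTorch model exposing a hypothetical encode() method that maps images (optionally conditioned on instruction text) to embeddings, trained with a CLIP-style symmetric InfoNCE loss over the CoT-derived triplets. The actual architecture and loss used in the dissertation are not specified in this abstract.

```python
# Sketch of instruction contrastive tuning (stage two).
# `model.encode` and the batch fields are hypothetical placeholders
# for the MLLM-based embedder and the CoT-derived triplet data.
import torch
import torch.nn.functional as F

def instruction_contrastive_loss(model, batch, temperature=0.07):
    # Query: the source image conditioned on the textual modification.
    q = model.encode(images=batch["source"], text=batch["instruction"])
    # Positive: the target image the modified query should retrieve.
    t = model.encode(images=batch["target"])
    q = F.normalize(q, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = q @ t.T / temperature  # (B, B) query-target similarities
    labels = torch.arange(q.size(0), device=q.device)
    # Symmetric CLIP-style loss; other in-batch targets act as negatives.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))
```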
From the generation perspective, I focus on enabling LLMs to follow complex guidelines. Unlike general instructions, domain-specific guidelines are typically long and structured and encode substantial prior knowledge. Existing LLMs struggle to follow these guidelines when performing expert annotation tasks, for example, annotating diseases according to a medical guideline. I propose a guideline-driven learning paradigm that enables LLMs to learn primarily from domain-specific guidelines while requiring only few-shot examples for optimization. I introduce three abstract structures to model existing guidelines, allowing each guideline to be represented as a CoT prompt. To discover the prompt that best represents a guideline, I develop a tree-search algorithm inspired by Monte Carlo Tree Search (MCTS): it initializes with a summary prompt derived from the guideline, then leverages a Retrieval-Augmented System (RAS) to iteratively optimize both the content and the structure of the prompt.
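The search loop can be pictured with a short, illustrative MCTS-style sketch. Here, rewrite (e.g., an LLM call guided by the RAS) and evaluate (scoring a candidate prompt on the few-shot examples), the branching factor, and the UCB constant are all assumptions standing in for the dissertation's actual algorithm.

```python
# Illustrative MCTS-style search over prompt revisions.
# `rewrite` and `evaluate` are hypothetical stand-ins for the
# RAS-guided prompt editor and the few-shot scoring function.
import math

class Node:
    def __init__(self, prompt, parent=None):
        self.prompt, self.parent = prompt, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

BRANCH = 3  # assumed max revisions explored per node

def search(root_prompt, rewrite, evaluate, iters=50):
    root = Node(root_prompt)  # summary prompt derived from the guideline
    for _ in range(iters):
        # Selection: descend by UCB until a node that can still branch.
        node = root
        while len(node.children) >= BRANCH:
            node = max(node.children, key=ucb)
        # Expansion: propose a revised prompt from the current one.
        child = Node(rewrite(node.prompt), parent=node)
        node.children.append(child)
        # Evaluation: score the candidate on the few-shot examples.
        score = evaluate(child.prompt)
        # Backpropagation: push the score back up to the root.
        cur = child
        while cur is not None:
            cur.visits += 1
            cur.value += score
            cur = cur.parent
    best = max(root.children, key=lambda n: n.value / max(n.visits, 1))
    return best.prompt
```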
Finally, I propose a hybrid agentic framework for universal clustering that integrates the instruction-aware embedding models and guideline-driven LLM agents described above. This framework enables diverse image clustering tasks without task-specific training, demonstrating the synergy between embedding and generation components in solving complex real-world problems.
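As an illustration of how the two components might compose, here is a hypothetical pipeline sketch assuming scikit-learn for k-means: an instruction-aware embedder (embed) projects images under a user-given clustering criterion, and a guideline-driven agent (describe_cluster) summarizes each group. Both callables are placeholders, not the dissertation's API.

```python
# Hypothetical composition of the two components for clustering.
# `embed` and `describe_cluster` are illustrative placeholders.
import numpy as np
from sklearn.cluster import KMeans

def universal_cluster(images, criterion, embed, describe_cluster, k=5):
    """Cluster `images` by `criterion` (e.g., "group by artistic style")."""
    # Instruction-aware embeddings: the criterion steers the space.
    X = np.stack([embed(img, instruction=criterion) for img in images])
    assignments = KMeans(n_clusters=k, n_init="auto").fit_predict(X)
    # Guideline-driven agent summarizes each discovered cluster.
    groups = {c: [img for img, a in zip(images, assignments) if a == c]
              for c in range(k)}
    return {c: describe_cluster(imgs, criterion)
            for c, imgs in groups.items()}
```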
Keywords
Instruction-aware models, Multimodal learning, Guideline-driven learning, Computer vision, LLM agents
Disciplines
Artificial Intelligence and Robotics | Computer Sciences
License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Recommended Citation
Zhong, Wenliang, "TOWARDS MULTIMODAL GUIDELINE-ALIGNED AGENTIC SYSTEMS" (2026). Computer Science and Engineering Dissertations - Archive. 434.
https://mavmatrix.uta.edu/cse_dissertations/434