Graduation Semester and Year
Spring 2026
Language
English
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Department
Computer Science and Engineering
First Advisor
Junzhou Huang
Abstract
I present my work on building multimodal, guideline-aligned agentic systems that enable AI agents to solve complex real-world tasks. My research approaches this goal from two perspectives: (1) Instruction-Aware Embedding Models for flexible and universal embedding tasks, and (2) Guideline-Driven LLM Agents that leverage domain-specific guidelines to perform expert-level tasks. These components address embedding and generation, respectively, and together lay the foundation for a hybrid agent capable of tackling challenging real-world applications.
From the embedding perspective, I first address the instruction-following capabilities of embedding models. While Large Language Models (LLMs) excel at instruction following, they are designed primarily for generation rather than embedding, and existing embedding models typically encode samples into static representations without accepting instructions. Many applications, however, require instruction-aware embeddings. For instance, Composed Image Retrieval (CIR) finds a target image given a source image and a textual modification, which demands understanding how the modification transforms the source image. To develop instruction-aware embedding models, I propose using Multimodal LLMs (MLLMs) as embedding models and introduce a novel two-stage training strategy: in the first stage, the model learns to align image-caption pairs; in the second stage, I derive triplet data using Chain-of-Thought (CoT) prompting and apply instruction contrastive tuning so the model learns to follow instructions. Although MLLM-based embedding models can accept instructions, their large size and slow inference make them unsuitable for real-time retrieval systems. To address this limitation, I propose a dual-stream distillation strategy that transfers instruction-following capabilities from MLLMs to more efficient CLIP-based models.
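To make the second training stage concrete, the following is a minimal sketch of instruction contrastive tuning under common assumptions: a PyTorch model exposing a hypothetical encode() method that maps images (optionally conditioned on instruction text) to embeddings, trained with a CLIP-style symmetric InfoNCE loss over the CoT-derived triplets. The actual architecture and loss used in the dissertation are not specified in this abstract.

```python
# Sketch of instruction contrastive tuning (stage two).
# `model.encode` and the batch fields are hypothetical placeholders
# for the MLLM-based embedder and the CoT-derived triplet data.
import torch
import torch.nn.functional as F

def instruction_contrastive_loss(model, batch, temperature=0.07):
    # Query: the source image conditioned on the textual modification.
    q = model.encode(images=batch["source"], text=batch["instruction"])
    # Positive: the target image the modified query should retrieve.
    t = model.encode(images=batch["target"])
    q = F.normalize(q, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = q @ t.T / temperature  # (B, B) query-target similarities
    labels = torch.arange(q.size(0), device=q.device)
    # Symmetric CLIP-style loss; other in-batch targets act as negatives.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))
```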
From the generation perspective, I focus on enabling LLMs to follow complex guidelines. Unlike general instructions, domain-specific guidelines are typically long and structured and encode substantial prior knowledge. Existing LLMs struggle to follow these guidelines when performing expert annotation tasks, for example, annotating diseases according to a medical guideline. I propose a guideline-driven learning paradigm that enables LLMs to learn primarily from domain-specific guidelines while requiring only few-shot examples for optimization. I introduce three abstract structures to model existing guidelines, allowing each guideline to be represented as a CoT prompt. To discover the prompt that best represents a guideline, I develop a tree-search algorithm inspired by Monte Carlo Tree Search (MCTS): it initializes with a summary prompt derived from the guideline, then leverages a Retrieval-Augmented System (RAS) to iteratively optimize both the content and the structure of the prompt.
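The search loop can be pictured with a short, illustrative MCTS-style sketch. Here, rewrite (e.g., an LLM call guided by the RAS) and evaluate (scoring a candidate prompt on the few-shot examples), the branching factor, and the UCB constant are all assumptions standing in for the dissertation's actual algorithm.

```python
# Illustrative MCTS-style search over prompt revisions.
# `rewrite` and `evaluate` are hypothetical stand-ins for the
# RAS-guided prompt editor and the few-shot scoring function.
import math

class Node:
    def __init__(self, prompt, parent=None):
        self.prompt, self.parent = prompt, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

BRANCH = 3  # assumed max revisions explored per node

def search(root_prompt, rewrite, evaluate, iters=50):
    root = Node(root_prompt)  # summary prompt derived from the guideline
    for _ in range(iters):
        # Selection: descend by UCB until a node that can still branch.
        node = root
        while len(node.children) >= BRANCH:
            node = max(node.children, key=ucb)
        # Expansion: propose a revised prompt from the current one.
        child = Node(rewrite(node.prompt), parent=node)
        node.children.append(child)
        # Evaluation: score the candidate on the few-shot examples.
        score = evaluate(child.prompt)
        # Backpropagation: push the score back up to the root.
        cur = child
        while cur is not None:
            cur.visits += 1
            cur.value += score
            cur = cur.parent
    best = max(root.children, key=lambda n: n.value / max(n.visits, 1))
    return best.prompt
```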
Finally, I propose a hybrid agentic framework for universal clustering that integrates the instruction-aware embedding models and guideline-driven LLM agents described above. This framework enables diverse image clustering tasks without task-specific training, demonstrating the synergy between embedding and generation components in solving complex real-world problems.
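As an illustration of how the two components might compose, here is a hypothetical pipeline sketch assuming scikit-learn for k-means: an instruction-aware embedder (embed) projects images under a user-given clustering criterion, and a guideline-driven agent (describe_cluster) summarizes each group. Both callables are placeholders, not the dissertation's API.

```python
# Hypothetical composition of the two components for clustering.
# `embed` and `describe_cluster` are illustrative placeholders.
import numpy as np
from sklearn.cluster import KMeans

def universal_cluster(images, criterion, embed, describe_cluster, k=5):
    """Cluster `images` by `criterion` (e.g., "group by artistic style")."""
    # Instruction-aware embeddings: the criterion steers the space.
    X = np.stack([embed(img, instruction=criterion) for img in images])
    assignments = KMeans(n_clusters=k, n_init="auto").fit_predict(X)
    # Guideline-driven agent summarizes each discovered cluster.
    groups = {c: [img for img, a in zip(images, assignments) if a == c]
              for c in range(k)}
    return {c: describe_cluster(imgs, criterion)
            for c, imgs in groups.items()}
```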
Keywords
Instruction-aware models, Multimodal learning, Guideline-driven learning, Computer vision, LLM agents
Disciplines
Artificial Intelligence and Robotics | Computer Sciences
License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Recommended Citation
Zhong, Wenliang, "TOWARDS MULTIMODAL GUIDELINE-ALIGNED AGENTIC SYSTEMS" (2026). Computer Science and Engineering Dissertations - Archive. 434.
https://mavmatrix.uta.edu/cse_dissertations/434