ORCID Identifier(s)

0000-0001-7095-4385

Graduation Semester and Year

Spring 2026

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Junzhou Huang

Abstract

I present my work on building multimodal, guideline-aligned agentic systems that enable AI agents to solve complex real-world tasks. My research addresses two complementary perspectives: (1) Instruction-Aware Embedding Models for flexible and universal embedding tasks, and (2) Guideline-Driven LLM Agents that leverage domain-specific guidelines to perform expert-level tasks. These components address embedding and generation tasks, respectively, and lay the foundation for a hybrid agent capable of tackling challenging real-world applications.

From the embedding perspective, I first address the instruction-following capabilities of embedding models. While Large Language Models (LLMs) excel at instruction following, they are primarily designed for generation rather than embedding. Existing embedding models typically encode samples into static representations without accepting instructions, yet many applications require instruction-aware embeddings. For instance, Composed Image Retrieval (CIR) involves finding a target image based on a source image and a textual modification, which requires understanding how the modification transforms the source image. To develop instruction-aware embedding models, I propose using Multimodal LLMs (MLLMs) as embedding models and introduce a novel two-stage training strategy. In the first stage, the model learns to align image-caption pairs. In the second stage, I derive triplet data using Chain-of-Thought (CoT) prompting and apply instruction contrastive tuning to enable the model to follow instructions. While MLLM-based embedding models can accept instructions, their large size and slow inference make them unsuitable for real-time retrieval systems. To address this limitation, I propose a dual-stream distillation strategy that transfers instruction-following capabilities from MLLMs to more efficient CLIP-based models.
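The instruction contrastive tuning step above can be sketched with a standard InfoNCE-style objective, where the query is the fused (source image + instruction) embedding, the positive is the target-image embedding, and other items in the batch act as negatives. This is a minimal pure-Python illustration of the loss, not the dissertation's actual training code; the embeddings and the temperature value are placeholders.

```python
import math

def dot(u, v):
    """Inner product of two embedding vectors (plain lists of floats)."""
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(query, candidates, pos_index, temperature=0.07):
    """InfoNCE contrastive loss for one query.

    query      -- fused (source image + instruction) embedding
    candidates -- target-image embeddings; candidates[pos_index] is the
                  positive, the rest serve as in-batch negatives
    """
    logits = [dot(query, c) / temperature for c in candidates]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[pos_index] / sum(exps))
```

Minimizing this loss pulls the instruction-conditioned query toward the correct target image while pushing it away from the negatives in the batch.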

From the generation perspective, I focus on enabling LLMs to follow complex guidelines. Unlike general instructions, domain-specific guidelines are usually long, structured, and rich in prior knowledge. Existing LLMs can hardly follow these guidelines to perform expert annotation tasks, such as following a medical guideline to annotate diseases. I propose a guideline-driven learning paradigm that enables LLMs to learn primarily from domain-specific guidelines while requiring only few-shot examples for optimization. I introduce three abstract structures to model existing guidelines, allowing them to be represented in a CoT prompt format. To discover the prompt that best represents a guideline, I develop a tree-search algorithm inspired by Monte Carlo Tree Search (MCTS). The algorithm initializes with a summary prompt derived from the guideline, then leverages a Retrieval-Augmented System (RAS) to iteratively optimize both the content and structure of the prompt.
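The MCTS-inspired prompt search can be sketched as follows. This is a simplified, hypothetical skeleton, not the dissertation's algorithm: `expand` stands in for the RAS-guided rewriting of a prompt's content and structure, and `evaluate` stands in for scoring a candidate prompt on the few-shot examples; both are assumptions supplied by the caller.

```python
import math
import random

class PromptNode:
    """One candidate prompt in the search tree."""
    def __init__(self, prompt, parent=None):
        self.prompt = prompt
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # running sum of few-shot scores

def ucb(child, parent, c=1.4):
    """UCB1: trade off exploiting high-scoring prompts vs. exploring new ones."""
    if child.visits == 0:
        return float("inf")
    return (child.value / child.visits
            + c * math.sqrt(math.log(parent.visits) / child.visits))

def tree_search(root_prompt, expand, evaluate, iterations=30, seed=0):
    """MCTS-style search over prompt variants (simplified sketch).

    expand(prompt)   -- returns candidate rewrites (stands in for the RAS step)
    evaluate(prompt) -- returns a score on the few-shot examples
    """
    random.seed(seed)
    root = PromptNode(root_prompt)
    for _ in range(iterations):
        # Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node))
        # Expansion: generate rewrites of the leaf's prompt.
        for variant in expand(node.prompt):
            node.children.append(PromptNode(variant, parent=node))
        leaf = random.choice(node.children) if node.children else node
        # Simulation: score the candidate prompt on the few-shot examples.
        score = evaluate(leaf.prompt)
        # Backpropagation: update statistics up to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += score
            leaf = leaf.parent
    # Return the visited prompt with the best mean score.
    best, stack = root, [root]
    while stack:
        n = stack.pop()
        stack.extend(n.children)
        if n.visits and n.value / n.visits > best.value / best.visits:
            best = n
    return best.prompt
```

In this sketch the "simulation" step is simply an evaluation of the candidate prompt, since a prompt can be scored directly on held-out few-shot examples rather than by a random rollout.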

Finally, I propose a hybrid agentic framework for universal clustering that integrates instruction-aware embedding models and guideline-driven LLM agents. This framework enables diverse image clustering tasks without task-specific training, demonstrating the synergy between embedding and generation components in solving complex, real-world problems.
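The embedding side of such a clustering pipeline can be illustrated with plain k-means over instruction-conditioned embeddings. This is only a toy sketch: in the full framework the vectors would come from the instruction-aware encoder and the guideline-driven LLM agent would choose the clustering criterion and interpret the clusters; here the embeddings are plain lists of floats and both agent roles are out of scope.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means over embedding vectors (lists of floats).

    Returns (centers, groups), where groups[i] holds the points
    assigned to centers[i].
    """
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment: each point goes to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centers[j])))
            groups[i].append(p)
        # Update: move each center to the mean of its group.
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, groups
```

Swapping the task instruction changes the embeddings, and hence the clustering, without any task-specific training, which is the synergy the framework exploits.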

Keywords

Instruction-aware models, Multimodal learning, Guideline-driven learning, Computer vision, LLM agents

Disciplines

Artificial Intelligence and Robotics | Computer Sciences
