Graduation Semester and Year
Spring 2026
Language
English
Document Type
Thesis
Degree Name
Master of Science in Computer Science
Department
Computer Science and Engineering
First Advisor
Kenny Q. Zhu
Second Advisor
Vassilis Athitsos
Third Advisor
Shirin Nilizadeh
Abstract
Universal Sound Separation (USS) -- the task of disentangling arbitrary sound sources from a single-channel acoustic mixture -- remains an open challenge due to the ill-posed nature of the problem and the distributional gap between synthetic training data and real-world recordings. This thesis addresses three distinct bottlenecks in the USS pipeline: training data realism, inference strategy, and conditioning richness. We first present two knowledge-guided approaches to sound source separation. The first is a distance-aware mixing strategy that leverages Large Language Models (LLMs) to assign plausible loudness relationships between audio sources during training data synthesis. By querying an LLM about the natural acoustic distance between sound events, we generate Mixture of Mixtures (MoMs) that better approximate real-world acoustic scenes. Human evaluation shows that models trained with this strategy are preferred over randomly-trained baselines in up to 75% of comparisons on three real-world benchmark categories. The second is a co-occurrence conditioning framework that injects information about non-target sounds present in a mixture into the encoder of AudioSep via FiLM modulation, complementing the standard target conditioning. We propose a CLAP-based estimation procedure that approximates co-occurrence embeddings at inference time from only the mixture and the target text, matching the practical setting of USS; an exploratory evaluation shows improved separation of five of six USS benchmarks. We then introduce Chain-of-Inference (CoI), a training-free multi-step inference framework motivated by the human auditory system's sensitivity to sudden changes in the acoustic scene and structurally analogous to Chain-of-Thought prompting in language models. 
CoI iteratively re-introduces a proportion of the original mixture -- governed by cosine similarity between the current output and the input -- progressively decomposing the separation problem into easier sub-problems. Without any additional training, CoI consistently improves AudioSep across all five evaluated tasks and SAM-Audio on four of five. An interactive online demonstration system is released alongside this work, allowing users to experience the perceptual improvements on arbitrary audio. Taken together, these contributions show that USS performance can be improved from two distinct angles: incorporating external knowledge -- LLM commonsense priors and contrastive audio-text embeddings -- to improve training data and conditioning, and exploiting underutilised capacity already present in frozen models through principled inference-time refinement.
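The CoI re-mixing step described above can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the thesis's implementation: the function name `coi_step` and the specific weighting `alpha = 1 - similarity` are assumptions; the thesis defines the exact proportion schedule.

```python
import numpy as np

def coi_step(mixture: np.ndarray, estimate: np.ndarray) -> np.ndarray:
    """One hypothetical Chain-of-Inference refinement step.

    Re-introduces a proportion of the original mixture into the current
    separator output, with the proportion governed by the cosine
    similarity between the current output and the input mixture.
    """
    # Cosine similarity between the current output and the input mixture
    # (small epsilon guards against division by zero on silent signals).
    sim = np.dot(mixture, estimate) / (
        np.linalg.norm(mixture) * np.linalg.norm(estimate) + 1e-8
    )
    # Illustrative choice (an assumption, not the thesis's schedule):
    # the more the estimate already resembles the mixture, the less
    # mixture is re-injected.
    alpha = 1.0 - sim
    return estimate + alpha * mixture
```

In use, the output of `coi_step` would be fed back through the frozen separation model, repeating until the estimate stabilises -- no retraining of the separator is involved.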
Keywords
Universal sound separation, Data synthesis, Chain-of-inference
Disciplines
Artificial Intelligence and Robotics | Signal Processing
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
Park, Wonjun, "Universal Sound Separation: Distance-Aware Mixture Simulation, Co-occurrence Conditioning, and Chain-of-Inference" (2026). Computer Science and Engineering Theses-Archive. 542.
https://mavmatrix.uta.edu/cse_theses/542