Graduation Semester and Year
2014
Language
English
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Department
Computer Science and Engineering
First Advisor
Vassilis Athitsos
Abstract
This thesis investigates the problem of similarity search in large and noisy sequence databases. A key application domain of interest in this work is the very challenging Query-By-Humming (QBH) problem, according to which, given a hummed part of a song, we would like to identify the closest matches in a large music repository. The problem of selecting the most appropriate, based on each specific query, distance measure out of a pool of measures for classification in time series data is also investigated. In addition, searching time series databases via examples, which may be either time series or models, is also part of this work. Moreover, motivated by the nature of music notes in the noisy QBH application, we explore the problem ofinterval-based sequence matching in a variety of application domains.More specifically, this thesis makes contributions both by defining novel similarity measures that are used to identify the best database matches, and by proposing methods to improve efficiency. Referring to the similarity measures, the thesis contributes a method, named SMBGT (shorthand for Subsequence Matching with Bounded Gaps and Tolerances), for finding the closest subsequence matches from a large database to a given query. This measure is applied to the noisy QBH application domain, where the music pieces are represented as 2-dimensional time series. Since efficiency should be a key characteristic when searching large time series databases,ISMBGT is also proposed, which performs indexing on top of the SMBGT method in a filter-and-refine manner. Applying ISMBGT on synthetic and real query sets for QBH shows the usefulness of our indexing scheme.A second contribution of this thesis is the "Hum-a-song" QBH system, which allows the user to select among a variety of distance measures and music representations in order to retrieve the closest songs to a hummed query. Apart from that, SMBGT is also exploited to perform genre classification of music pieces, which, among others, can be beneficial in the area of assistive environments. For example, music systems with therapeutic and educational functionalities need to be adapted based on the music preferences of the end-users, such as children, patients, or disabled. Beingable to identify the genres of songs that t to each category of end-users is imperative in such settings as it can assist effectively in medical treatments and learning tasks.Moreover, we address the problem of selecting the most appropriate distance measure out of a pool of measures, which depends on each specific query, for classification of time series. The proposed approach is, to the best of our knowledge, the first one to deal with this problem. The reason for the importance of such a problem is given through an example. Assuming that there are two available distance measures performing whole sequence matching, there may be cases where the first one may correctly classify a given query, while it may fail to classify another query, which is correctly classified by the second distance measure. Thus, it would behighly desirable to be able to choose the "best" measure for each query. The classification accuracy of the proposed method compared to three state-of-the-art distance measures is significantly competitive on actual data.Furthermore, due to the fact that performing classification through the use of distance measures can be computationally expensive due to their quadratic complexity, we have explored the use of models, and specifically Hidden Markov Models (HMMs). HMMs are widely known and have been applied to a variety of domains, such as biology, speech recognition, and music retrieval, since they are able to model the underlying structure of sequences determining the relationships between their observations. Thus, in order to perform classification, we present a way of representing each class of a time series database with an HMM, and then perform searching basedon the constructed models. The proposed indexing framework MTSI (shorthand for Model-based Time Series Indexing) works in a filter-and-refine manner, and is shown to be both effective and efficient compared to a variety of distance-based measures on a large number of time series databases. Additionally, we have exploited HMMs so as to extract knowledge on what has happened in the past or to recognize what is happening in the present. Specifically, instead of searching a database of time series by providing a time series corresponding, e.g., to a specific activity, for finding thetime series that are similar to that query, we can provide a model representing this activity. Comparing HMMs and several distance measures for such searches shows that our approach is particularly meaningful with excellent accuracy results.Finally, the problem of interval-based whole sequence matching has received limited attention, although it appears in many application domains, such as music informatics, sign language, medicine, geo-informatics, cognitive science, linguistics. Another contribution is a novel distance measure for event-interval sequences, calledIBSM (abbreviation of Interval-Based Sequence Matching), which captures the similarity between event intervals, which are characterized by a label and a duration. Thorough experimental evaluation of IBSM on a variety of datasets demonstrates state-of-the-art performance.
Disciplines
Computer Sciences | Physical Sciences and Mathematics
License
This work is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 4.0 International License.
Recommended Citation
Kotsifakos, Alexios, "Online Efficient And Effective Search In Large And Noisy Sequence Databases" (2014). Computer Science and Engineering Dissertations. 212.
https://mavmatrix.uta.edu/cse_dissertations/212
Comments
Degree granted by The University of Texas at Arlington