Degree Name

Doctor of Philosophy in Computer Science


Computer Science and Engineering

First Advisor

Gautam Das


Feature engineering and feature selection are two important aspects of the data science pipeline. Owing to advances in data collection techniques in recent years, huge amounts of data are becoming available across industries. Consequently, the importance of data science for business analytics is increasing. Various tools and techniques are being developed to help data scientists complete their tasks efficiently. Feature engineering and selection are among the main data science tasks that require human involvement. These pre-processing steps prepare the data in the format required by various machine learning algorithms to accomplish predictive tasks. The aim of this dissertation is twofold: first, to develop an effective framework that assists data scientists with the feature engineering task, and second, to develop a new measure for selecting these features efficiently. The term attribute is used to denote a feature in tabular data. In this dissertation, a semi-automated, “human-in-the-loop” framework for attribute design is developed that assists human analysts in transforming raw attributes into effective derived attributes for classification problems. The proposed framework is optimization-guided and fully agnostic to the underlying classification model. An algebra with various operators (arithmetic, relational, and logical) for transforming raw attributes into derived attributes is presented, and two technical problems are solved: (a) the top-k buckets design problem, which aims to present human analysts with k buckets, each containing promising choices of raw attributes that they can focus on without having to look at all raw attributes; and (b) the top-l snippets generation problem, which iteratively aids human analysts with the top-l derived attributes involving an attribute.
For the former problem, an effective exact bottom-up algorithm with pruning is presented, as well as random walk based heuristic algorithms that are intuitive and work well in practice. For the latter, a greedy heuristic algorithm is presented that is scalable and effective. Rigorous evaluations are conducted on six real-world datasets to show that the proposed framework generates effective derived attributes compared with fully manual or fully automated methods. Next, iFE, a demonstration of the semi-automated, “human-in-the-loop” attribute design framework, is presented. iFE is a desktop application that enables a human analyst to find interpretable derived attributes much more quickly than a fully manual method. The system first finds k buckets, each containing promising choices of raw attributes that the analyst can focus on without having to look at all raw attributes. To achieve this, iFE implements a random walk based heuristic algorithm that is intuitive and works well in practice. In the next step, the system iteratively helps the analyst generate the top-l derived attributes within a bucket using arithmetic, relational, and logical operators. The user interface of the system guides the analyst to the final derived attributes in a small number of iterations, which saves time and effort and boosts the analyst's productivity. Finally, a new measure is proposed for efficient approximate mutual information based feature selection. Feature selection is an important step in the data science pipeline, and it is critical to develop efficient algorithms for this step. Mutual Information (MI) is one of the important measures used for feature selection: attributes are sorted in descending order of MI score, and the top-k attributes are retained. The goal of this work is to develop a new measure, Attribute Average Conflict (Aac), that effectively approximates the top-k attributes without actually calculating MI.
The proposed method uses the database concept of approximate functional dependency to quantify the MI rank of attributes, which to our knowledge has not been studied before. The effectiveness of the proposed measure is demonstrated with a Monte Carlo simulation. Extensive experiments are performed using high-dimensional synthetic and real datasets with millions of records. Experimental results show that the proposed method achieves perfect accuracy in selecting the top-k attributes, yet is significantly more efficient than state-of-the-art baselines, including exact methods for mutual information based feature selection as well as adaptive random-sampling based approaches. An analysis of the upper and lower bounds of the proposed measure is provided, and it is shown that tighter bounds can be derived by using the marginal frequencies of attributes in specific arrangements. The bounds on the proposed measure can be used to select the top-k attributes in a single pass without a full scan of the dataset. Experimental evaluation on real datasets is conducted to show the accuracy and effectiveness of this approach.
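
For orientation, the exact baseline that the abstract describes (score each attribute by its mutual information with the class label, sort in descending order, and retain the top k) can be sketched as below. This is a minimal illustration only and does not show the dissertation's Aac measure, which the abstract does not define. The function names `mutual_information` and `top_k_by_mi`, the discrete-valued columns, and the toy data are all assumptions made for the example.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y), in bits, of two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # joint counts of (x, y) pairs
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    # Sum p(x, y) * log2( p(x, y) / (p(x) * p(y)) ) over observed pairs.
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def top_k_by_mi(columns, target, k):
    """Return the names of the k attributes with the highest MI against target."""
    scores = {
        name: mutual_information(values, target)
        for name, values in columns.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data, made up purely for illustration.
columns = {
    "a": [0, 0, 1, 1],  # identical to the target
    "b": [0, 1, 0, 1],  # independent of the target
    "c": [0, 0, 0, 1],  # partially informative
}
target = [0, 0, 1, 1]

ranked = top_k_by_mi(columns, target, k=2)
print(ranked)
```

Computing every score this way requires a full pass over all records for each attribute, which is the cost that an approximate ranking measure aims to avoid.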


Human-in-the-loop, Feature engineering, Feature selection


Computer Sciences | Physical Sciences and Mathematics


Degree granted by The University of Texas at Arlington