Zhongwei Li

Graduation Semester and Year




Document Type


Degree Name

Master of Science in Computer Science


Computer Science and Engineering

First Advisor

Hao Che

Second Advisor

(Jeff) Yu Lei


Performance evaluation and resource provisioning are two of the most critical factors for designers of distributed systems in modern warehouse-scale data centers. The ever-increasing volume of data in recent years has pushed many businesses to move their computing tasks to the cloud, which offers benefits including low system management and maintenance costs and better scalability. As a result, most of the recently emerging prominent workloads are data-intensive, calling for scaling out the workload to a large number of servers for parallel processing. This raises two questions: what factors affect system scaling performance, and how can tasks be scheduled efficiently onto distributed computing resources? This dissertation introduces a new performance model to address the former question and an effective hierarchical job scheduler for the latter. The major contribution of this dissertation is a new performance modeling approach designed for data-intensive applications, which consists of two phases: 1) the In-Proportion and Scale-Out-induced scaling model (IPSO), and 2) the Unified Scaling model for Big data Analytics (USBA). The first model builds on traditional performance models, including both Amdahl's and Gustafson's laws. We demonstrate why these classic models are inadequate in today's parallel computing environment and how the IPSO model may fill the gap. In the second phase, we extend IPSO to today's multi-stage workloads; the resulting model can be readily adopted for modeling data analytics applications running on the Spark platform. Both models are supported by our evaluations on well-known benchmarks and by evidence from other publications. To the best of our knowledge, IPSO is the first variation of the classic Amdahl's model that can be directly applied to modern data-intensive applications.
A lightweight tool is also developed at the end of this research, which can be used to generate IPSO inputs or as a Spark application log analyzer. The tool is developed as an open-source project and is accessible in a public repository. The second contribution of this dissertation is Pigeon, a job scheduler we propose for modern data centers. Pigeon is a distributed, hierarchical job scheduler based on a two-layer design. It relieves the service pressure on the widely adopted centralized data center scheduler by quickly dispatching incoming tasks to selected nodes known as masters, and then ensures efficient task execution by enforcing its unique queuing mechanism on these masters. Pigeon can minimize the chance of head-of-line blocking for short jobs and avoid starvation for long jobs, and in our evaluations it outperforms Sparrow (a distributed scheduler) and Eagle (a hybrid scheduler). Pigeon is also an open-source tool that can be accessed from a public repository. This dissertation is presented in an article-based format and includes three research papers. The first chapter introduces the contents of the dissertation. The second chapter reports our performance evaluation model (IPSO). The third chapter reports IPSO's extended model for multi-stage workloads (USBA). The fourth chapter reports our work on the Pigeon scheduler. Finally, the fifth chapter concludes the work and outlines plans for future research.
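For readers unfamiliar with the two classic laws the abstract names as the starting point for IPSO, they can be stated in a few lines. The sketch below is illustrative only: it shows the textbook forms of Amdahl's and Gustafson's laws, not the IPSO or USBA models themselves, and the function names and the parallel fraction of 0.95 are assumptions chosen for the example.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law (fixed problem size).

    p is the fraction of the work that can be parallelized,
    n is the number of workers. The serial fraction (1 - p)
    bounds the speedup by 1 / (1 - p) no matter how large n is.
    """
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(p: float, n: int) -> float:
    """Gustafson's law (problem size scaled with the workers).

    The parallel part grows with n while the serial part stays
    fixed, so the scaled speedup grows linearly in n.
    """
    return (1.0 - p) + p * n


if __name__ == "__main__":
    # Hypothetical parallel fraction, chosen only for illustration.
    p = 0.95
    for n in (1, 8, 64, 512):
        print(n, round(amdahl_speedup(p, n), 2), round(gustafson_speedup(p, n), 2))
```

Neither formula has a term for costs that grow as more servers are added, which is the kind of scale-out-induced effect the IPSO model is described as capturing.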


Data-intensive, Big data, Performance modeling, Resource provisioning, Job scheduler, HPC, Distributed systems


Computer Sciences | Physical Sciences and Mathematics


Degree granted by The University of Texas at Arlington