ORCID Identifier(s)

0000-0002-4907-492X

Graduation Semester and Year

Spring 2024

Language

English

Document Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science and Engineering

First Advisor

Hong Jiang

Abstract

The widening performance gap between computation and I/O creates enormous data management challenges for simulation-based scientific discovery. Data reduction is deemed a promising technique to bridge this gap by reducing the amount of data migrated to persistent storage. However, its performance is still far from what production applications demand. To this end, we propose a new methodology that aggressively reduces data, even at the cost of substantial information loss, and re-computes data at the original accuracy on demand. As a result, our scheme creates the illusion of a fast, large storage medium that still makes high-accuracy data available. We further design an adaptive load-aware data reduction strategy that monitors I/O overhead at runtime and dynamically adjusts the reduction ratio.
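To make the adaptive strategy concrete, the sketch below shows one way such a runtime feedback loop could look: the reduction ratio is raised when measured I/O overhead exceeds a budget and relaxed when there is headroom. This is a minimal illustration, not the thesis implementation; the helper measure_io_overhead (simulated here), the overhead budget, and the proportional update rule are all assumptions introduced for this example.

    import random

    # Hypothetical tuning knobs; not taken from the thesis.
    OVERHEAD_TARGET = 0.10            # fraction of runtime allowed for I/O
    RATIO_MIN, RATIO_MAX = 1.0, 64.0  # permitted reduction ratios
    GAIN = 4.0                        # how aggressively to react to drift

    def measure_io_overhead() -> float:
        """Placeholder for runtime monitoring; simply simulates a
        fluctuating I/O-time fraction between 5% and 30%."""
        return random.uniform(0.05, 0.30)

    def adjust_ratio(ratio: float, overhead: float) -> float:
        """Raise the reduction ratio when I/O overhead exceeds the
        budget; lower it (recovering accuracy) when there is headroom."""
        ratio += GAIN * (overhead - OVERHEAD_TARGET) * ratio
        return max(RATIO_MIN, min(RATIO_MAX, ratio))

    ratio = 4.0
    for step in range(10):
        overhead = measure_io_overhead()
        ratio = adjust_ratio(ratio, overhead)
        print(f"step {step}: overhead={overhead:.2f} -> ratio {ratio:.1f}x")

Clamping the ratio keeps the controller from either disabling reduction entirely or discarding more accuracy than re-computation can recover within the time budget.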

We verify the efficacy of data reduction and re-computation through adaptive mesh refinement (AMR), a popular numerical technique for solving partial differential equations. We evaluate data reduction and selective data re-computation on Titan, using a real application in FLASH and mini-applications in Chombo. To clearly demonstrate the benefits of re-computation, we compare it with state-of-the-art data reduction methods including SZ, ZFP, FPC, and deduplication; it is superior in reduction ratio as well as in both write and read speeds, particularly when only a small fraction of the data (e.g., 1%) needs to be retrieved. Data reduction and re-computation achieves a compression ratio up to 6X higher than lossy compressors and up to 16X higher than lossless compressors. In addition, it saves around 6X of the total time for the evaluated applications. Our results confirm that data reduction and selective data re-computation can 1) narrow the performance gap between I/O and compute by aggressively reducing AMR levels and, more importantly, 2) efficiently recover the target accuracy for AMR through re-computation.
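The following sketch illustrates the core idea of reducing AMR levels and selectively re-computing them, under simplifying assumptions: coarsen stands in for dropping the finest level before data is written, and recompute stands in for rebuilding it at read time. A real AMR code would re-run its refinement solver rather than the simple repetition used here; all names are hypothetical.

    import numpy as np

    def coarsen(fine: np.ndarray, factor: int = 2) -> np.ndarray:
        """Aggressive reduction step: drop the finest AMR level by
        block-averaging, so only the coarse array reaches storage."""
        n = fine.shape[0] // factor
        return fine.reshape(n, factor, n, factor).mean(axis=(1, 3))

    def recompute(coarse: np.ndarray, factor: int = 2) -> np.ndarray:
        """Stand-in for selective re-computation: rebuild the fine level
        from the stored coarse level. A real AMR code would re-run its
        refinement/solver here; repetition only marks the idea."""
        return np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)

    fine = np.random.rand(8, 8)           # pretend finest-level data
    stored = coarsen(fine)                # 4x fewer values hit storage
    needed = recompute(stored)[0:2, 0:2]  # recover only the region read
    print(stored.shape, needed.shape)

Because only the requested region is re-computed, the scheme pays the recovery cost exactly where data is read back, which is why small retrievals (e.g., 1%) benefit the most.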

We also evaluate load-aware and adaptive data reduction, which are designed to optimize the performance of applications and HPC systems and the utilization of resources on HPC platforms. We apply a machine learning method, LSTM, to verify load-aware data reduction by predicting I/O congestion from a two-month I/O trace, exploiting the repeatability of I/O activities on the HPC platform. In addition, we evaluate adaptive data reduction by sending probe packets and measuring their write rates as feedback, monitoring the complex I/O environment so that runtime data reduction meets the performance goal while making full use of resources. Our results demonstrate that adaptive load-aware data reduction can successfully mitigate application performance degradation caused by I/O congestion and optimize resource utilization via runtime reduction adjustment on HPC platforms.
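A rough sketch of the probe-based monitoring appears below, assuming a single probe write to a local temporary file stands in for probing the shared parallel file system. The probe size, the 200 MiB/s threshold, and the two-level ratio policy are invented for illustration and are not the thesis's parameters.

    import os
    import tempfile
    import time

    PROBE_BYTES = 4 * 1024 * 1024  # hypothetical 4 MiB probe payload

    def probe_write_rate(path: str) -> float:
        """Write a probe file and report the achieved rate in MiB/s,
        standing in for probing the shared parallel file system."""
        payload = os.urandom(PROBE_BYTES)
        start = time.perf_counter()
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
        return PROBE_BYTES / (1024 * 1024) / elapsed

    with tempfile.TemporaryDirectory() as d:
        rate = probe_write_rate(os.path.join(d, "probe.bin"))
        # Toy policy: compress harder when the probed rate falls below
        # a made-up 200 MiB/s threshold (a congested file system).
        ratio = 16.0 if rate < 200 else 4.0
        print(f"probed write rate {rate:.0f} MiB/s -> ratio {ratio}x")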

Keywords

High performance computing, Data reduction, AMR

Disciplines

Computer and Systems Architecture | Data Storage Systems
