Author

Fan Ni

ORCID Identifier(s)

0000-0002-9302-8445

Graduation Semester and Year

2019

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Engineering

Department

Computer Science and Engineering

First Advisor

Song Jiang

Abstract

Data deduplication has been widely used in various storage systems to save storage space, I/O bandwidth, and network traffic. However, existing deduplication techniques are inadequate because they introduce significant computation and I/O costs. First, to detect duplicates, the input data (files) are usually partitioned into small chunks in a chunking process. This process can be very time consuming when the content-defined chunking (CDC) method is adopted to detect duplicates among modified files, because the chunk boundaries are determined by examining the data content byte by byte. Second, for each chunk generated in the chunking process, a collision-resistant hash function is applied to produce a hash value (fingerprint). Chunks with the same fingerprint are deemed to have the same content, and only one copy of the data is stored on disk. This fingerprinting process of computing a collision-resistant hash value for every chunk is compute-intensive. Together, the chunking and fingerprinting processes in existing deduplication systems impose a heavy computation burden and degrade overall system performance. Third, in addition to the cost of chunking and fingerprinting, a deduplication system incurs extra I/O overhead for persisting and retrieving its metadata, which can significantly offset its advantage of saving I/O bandwidth. A deduplication system therefore demands efficient computation and I/O operations.

In this dissertation, we made several efforts to reduce the computation and I/O overheads in deduplication systems. First, two efforts were made to accelerate the chunking process in CDC-based deduplication. We designed a new parallel CDC algorithm that can be deployed on SIMD platforms to fully exploit their instruction-level parallelism without compromising the deduplication ratio. Further, we designed a highly efficient CDC chunking method that removes the speed barrier imposed by existing byte-by-byte chunk boundary detection by exploiting the duplication history. Second, we identified an opportunity to use fast non-collision-resistant hash functions for efficient deduplication of journaled data in a journaling file system, achieving much higher file access performance without compromising data correctness or reliability. Third, to avoid the performance degradation caused by frequent writes of small metadata in primary deduplication systems, we proposed to opportunistically compress fixed-size data blocks to make room for embedding the metadata. With the proposed method, explicit metadata writes on the critical path can be avoided in most cases, significantly improving I/O efficiency.
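
To illustrate the chunking and fingerprinting steps discussed in the abstract, the following is a minimal, self-contained C sketch of content-defined chunking using a Gear-style rolling hash. It is an illustrative example only, not the algorithm proposed in the dissertation; the gear-table seeding, mask value, and minimum/maximum chunk sizes are assumptions chosen for brevity.

/*
 * Minimal sketch of content-defined chunking (CDC) with a Gear-style
 * rolling hash. Illustrative only; parameters below are assumptions,
 * not the dissertation's proposed design.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MIN_CHUNK  2048      /* skip boundary checks below this size */
#define MAX_CHUNK  65536     /* force a cut to bound chunk size      */
#define CDC_MASK   0x1FFF    /* 13 bits -> ~8 KB average chunk size  */

static uint64_t gear[256];

/* Fill the gear table with pseudo-random values (assumed seeding). */
static void init_gear(void)
{
    uint64_t v = 0x9E3779B97F4A7C15ULL;
    for (int i = 0; i < 256; i++) {
        v ^= v >> 12; v ^= v << 25; v ^= v >> 27;   /* xorshift step */
        gear[i] = v * 0x2545F4914F6CDD1DULL;
    }
}

/* Return the length of the next chunk starting at data[0..len). */
static size_t next_chunk(const uint8_t *data, size_t len)
{
    uint64_t h = 0;
    if (len <= MIN_CHUNK)
        return len;
    for (size_t i = 0; i < len && i < MAX_CHUNK; i++) {
        h = (h << 1) + gear[data[i]];            /* roll the hash forward */
        if (i >= MIN_CHUNK && (h & CDC_MASK) == 0)
            return i + 1;                        /* content-defined boundary */
    }
    return len < MAX_CHUNK ? len : MAX_CHUNK;    /* forced boundary */
}

int main(void)
{
    init_gear();
    /* Toy input: in a real system this would be file data. */
    size_t len = 1 << 20;
    uint8_t *buf = malloc(len);
    if (!buf)
        return 1;
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)(i * 131 + (i >> 7));

    size_t off = 0;
    while (off < len) {
        size_t n = next_chunk(buf + off, len - off);
        /* In a deduplication system, a collision-resistant hash (e.g.,
         * SHA-1 or SHA-256) of buf[off..off+n) would be computed here
         * and looked up in a fingerprint index to detect duplicates. */
        printf("chunk at %zu, length %zu\n", off, n);
        off += n;
    }
    free(buf);
    return 0;
}

Because a small edit only shifts bytes locally, most chunk boundaries after the edited region reappear at the same content positions, which is what lets CDC detect duplicates among modified files.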

Keywords

Deduplication, Chunking, Fingerprinting, Content-defined chunking

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington
