ORCID Identifier(s)

0000-0002-2806-9312

Graduation Semester and Year

2018

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Hong Jiang

Abstract

With the development of the Internet and information technology, a large amount of unstructured data is generated and stored in various storage systems. Data reduction techniques such as compression and deduplication have become an effective way to address the combined challenges of explosive growth in data volume and lagging growth in network bandwidth, increasing the space and bandwidth efficiency of various storage systems. However, we have found that existing deduplication systems cannot effectively process compressed data and image data, because they analyze only the hash value of the bitstream to detect redundant data. We have also found that deduplication is hard to integrate into resource-constrained solid-state drives (SSDs) because of their internal structure, even though it is worthwhile: deduplication not only expands their logical capacity but also extends their lifetime by reducing the number of program/erase (P/E) operations. Motivated by these problems, this thesis focuses on building a versatile deduplication system to address them.
For deduplicating compressed data, we propose the Z-dedup approach, which leverages existing invariant metadata within compressed packages, such as the original file's length and checksum, to help detect and eliminate potential duplicate files across all compressed packages. Moreover, for the more complicated solid compression mode, Z-dedup injects such metadata into the solid compressed packages so that their internal contents can be analyzed by our versatile deduplication system.
For deduplicating image data, we propose the WM-dedup approach, which injects invariant chunking and content description information in the form of a steganographic watermark to help identify and remove perceptually redundant image data. This is a lossy deduplication scheme that may tolerate some information loss, while super-resolution and inpainting techniques can help recover perceptually equivalent image data, enabling image deduplication in our versatile deduplication system.
For efficiently integrating deduplication into SSDs, we propose SES-dedup, which bypasses the data scrambler module to enable low-cost ECC-based data deduplication. Specifically, we propose two design solutions, one on the host side and the other on the device side, to enable ECC-based deduplication. With this approach, we can effectively exploit the SSD's built-in ECC module to calculate the hash values of stored data for deduplication in our versatile deduplication system.
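The metadata-based matching that the abstract attributes to Z-dedup can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. It assumes ordinary (non-solid) ZIP packages, where each member's original length and CRC-32 are stored in the archive metadata, and it reads them through the standard zipfile module. The helper names make_zip and candidate_duplicates and the sample data are hypothetical. Matching (length, CRC-32) pairs only mark candidates, since CRC-32 can collide.

```python
import hashlib
import io
import zipfile
from collections import defaultdict


def make_zip(members, level):
    """Build an in-memory ZIP archive from {name: bytes} at a given deflate level."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED, compresslevel=level) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()


def candidate_duplicates(packages):
    """Group archive members by (original length, CRC-32) read from ZIP metadata.

    No member is decompressed. Groups with more than one entry are only
    candidates: CRC-32 can collide, so a real system would verify them.
    """
    groups = defaultdict(list)
    for pkg_name, blob in packages.items():
        with zipfile.ZipFile(io.BytesIO(blob)) as zf:
            for info in zf.infolist():
                if info.is_dir():
                    continue
                groups[(info.file_size, info.CRC)].append((pkg_name, info.filename))
    return {key: found for key, found in groups.items() if len(found) > 1}


# Hypothetical sample data: the same file stored in two different packages.
report = b"quarterly report " * 500
packages = {
    "a.zip": make_zip({"report.txt": report, "notes.txt": b"only in a"}, level=1),
    "b.zip": make_zip({"copy/report.txt": report}, level=9),
}

# The package bitstreams hash differently, so hashing whole packages finds no match.
digests = {name: hashlib.sha256(blob).hexdigest() for name, blob in packages.items()}

# The per-member metadata still exposes the shared file.
duplicates = candidate_duplicates(packages)
for (size, crc), found in duplicates.items():
    print(f"{size} bytes, crc32={crc:08x}: {found}")
```

In this sketch the two packages have different SHA-256 digests, yet the shared file appears as a single candidate group because its length and checksum do not depend on the member name or compression level. Solid archives, which the abstract handles by injecting such metadata, are outside the scope of this example.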

Keywords

Data deduplication, Storage system

Disciplines

Computer Sciences | Physical Sciences and Mathematics

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Comments

Degree granted by The University of Texas at Arlington

Recommended Citation

Yan, Zhichao, "Building a Versatile Deduplication System" (2018). Computer Science and Engineering Dissertations. 302.
https://mavmatrix.uta.edu/cse_dissertations/302


