Graduation Semester and Year
2018
Language
English
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Department
Computer Science and Engineering
First Advisor
Hong Jiang
Abstract
With the development of the Internet and information technology, a large amount of unstructured data is generated and stored in various storage systems. Data reduction techniques such as compression and deduplication have become an effective way to cope with the combined challenges of explosive growth in data volume and lagging growth in network bandwidth, improving the space and bandwidth efficiency of storage systems. However, we have found that existing deduplication systems cannot effectively process compressed data and image data, because they detect redundant data only by analyzing hash values of the bitstream. We have also found that deduplication is hard to integrate into resource-constrained solid-state drives (SSDs) because of their internal structure, even though it is worthwhile: deduplication can not only expand an SSD's logical capacity but also extend its lifetime by reducing the number of program/erase (P/E) operations. Motivated by these problems, this thesis focuses on building a versatile deduplication system to address them. For deduplicating compressed data, we propose the Z-dedup approach, which leverages invariant metadata already present within compressed packages, such as the original file's length and checksum, to help detect and eliminate potentially duplicated files across all compressed packages. Moreover, for the more complicated solid compression mode, Z-dedup injects such metadata into the solid compressed packages so that their internal contents can be analyzed by our versatile deduplication system. For deduplicating image data, we propose the WM-dedup approach, which injects invariant chunking and content description information in the form of a steganographic watermark to help identify and remove perceptually redundant image data. 
This is a lossy deduplication scheme that tolerates some information loss, while super-resolution and inpainting techniques help recover perceptually equivalent image data, enabling image deduplication in our versatile deduplication system. For efficiently integrating deduplication into SSDs, we propose SES-dedup, which bypasses the data scrambler module to enable low-cost ECC-based data deduplication. Specifically, we propose two design solutions, one on the host side and the other on the device side, to enable ECC-based deduplication. With this approach, we can effectively exploit the SSD's built-in ECC module to calculate the hash values of stored data for deduplication in our versatile deduplication system.
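The idea behind Z-dedup, matching files by the length and checksum metadata stored inside compressed packages, can be illustrated with a minimal sketch. This is not the dissertation's implementation: it assumes ZIP archives, uses the uncompressed size and CRC-32 that the ZIP format records for each member, and the helper names `index_zip_members`, `duplicate_candidates`, and `make_zip` are hypothetical. A match on size and CRC-32 only marks a candidate duplicate; a real system would confirm it with a stronger check.

```python
import io
import zipfile
from collections import defaultdict


def index_zip_members(archives):
    """Group members of several ZIP archives by (uncompressed size, CRC-32).

    `archives` maps an archive name to its raw bytes (an assumption made
    here so the example needs no files on disk).
    """
    groups = defaultdict(list)
    for archive_name, data in archives.items():
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if info.is_dir():
                    continue
                # Both fields come from the archive's own metadata, so no
                # member has to be decompressed to build the index.
                groups[(info.file_size, info.CRC)].append(
                    (archive_name, info.filename)
                )
    return groups


def duplicate_candidates(groups):
    """Keep only the metadata keys shared by more than one member."""
    return {key: members for key, members in groups.items() if len(members) > 1}


def make_zip(files, level):
    """Build an in-memory ZIP archive at the given DEFLATE level."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED, compresslevel=level) as zf:
        for filename, content in files.items():
            zf.writestr(filename, content)
    return buf.getvalue()


if __name__ == "__main__":
    shared = b"hello dedup " * 1000
    a = make_zip({"a/report.txt": shared, "a/only.txt": b"unique A"}, 1)
    b = make_zip({"b/copy.txt": shared, "b/other.txt": b"unique B!"}, 9)
    for key, members in duplicate_candidates(
        index_zip_members({"a.zip": a, "b.zip": b})
    ).items():
        print(key, members)
```

The two archives are built at different compression levels, so their compressed payloads need not match, yet the shared file is still detected because the size and CRC-32 describe the original, uncompressed content.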
Keywords
Data deduplication, Storage system
Disciplines
Computer Sciences | Physical Sciences and Mathematics
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Recommended Citation
Yan, Zhichao, "Building a Versatile Deduplication System" (2018). Computer Science and Engineering Dissertations. 302.
https://mavmatrix.uta.edu/cse_dissertations/302
Comments
Degree granted by The University of Texas at Arlington