Zhichao Yan

Degree Name

Doctor of Philosophy in Computer Science


Department

Computer Science and Engineering

First Advisor

Hong Jiang


With the development of the Internet and information technology, vast amounts of unstructured data are generated and stored in various storage systems. Data reduction techniques such as compression and deduplication have become an effective way to address the combined challenges of explosive growth in data volume and lagging growth in network bandwidth, improving the space and bandwidth efficiency of storage systems. However, we have found that existing deduplication systems cannot effectively process compressed data and image data, because they detect redundancy only by hashing the raw bitstream. At the same time, we found that it is hard to integrate deduplication into resource-constrained solid-state drives (SSDs) due to their internal structure, although doing so is worthwhile: deduplication not only expands their logical capacity but also extends their lifetime by reducing program and erase (P/E) operations. Motivated by these problems, this thesis focuses on building a versatile deduplication system to address these issues.

With respect to deduplicating compressed data, we propose the Z-dedup approach, which leverages the invariable metadata already present in compressed packages, such as the original file's length and checksum, to detect and eliminate potential duplicate files across all compressed packages. Moreover, for the more complicated solid compression mode, Z-dedup injects such metadata into the solid compressed packages so that their internal contents can be analyzed by our versatile deduplication system.

With respect to deduplicating image data, we propose the WM-dedup approach, which injects invariable chunking and content-description information, in the form of a steganographic watermark, to identify and remove perceptually redundant image data. This is a lossy deduplication scheme that tolerates some information loss, since super-resolution and inpainting techniques can recover perceptually equivalent image data, enabling image deduplication in our versatile deduplication system.

With respect to efficiently integrating deduplication into SSDs, we propose SES-dedup, which bypasses the data scrambler module to enable low-cost ECC-based data deduplication. Specifically, we propose two design solutions, one on the host side and the other on the device side, to enable ECC-based deduplication. With this approach, we can effectively exploit the SSD's built-in ECC module to compute the hash values of stored data for data deduplication in our versatile deduplication system.
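The core idea behind Z-dedup, comparing the invariable metadata that archive formats already record (original length and checksum) rather than hashing the compressed bitstream, can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the helper names are hypothetical, and it uses ZIP archives, whose central directory stores each member's CRC-32 and uncompressed size.

```python
import io
import zipfile

def package_fingerprints(zip_bytes):
    """Read each member's (CRC-32, uncompressed size) pair from the ZIP
    central directory -- invariable metadata available without
    decompressing any payload."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {info.filename: (info.CRC, info.file_size)
                for info in zf.infolist()}

def make_zip(name, payload, level):
    """Build an in-memory ZIP holding one file, at a given deflate level."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED,
                         compresslevel=level) as zf:
        zf.writestr(name, payload)
    return buf.getvalue()

# Two packages built with different compression levels: the stored
# bitstreams typically differ, so hashing the compressed bytes would
# miss the redundancy ...
a = make_zip("report.txt", b"identical content " * 1000, level=1)
b = make_zip("report.txt", b"identical content " * 1000, level=9)

# ... but the metadata fingerprints still match, flagging the internal
# file as a duplicate candidate.
print(package_fingerprints(a) == package_fingerprints(b))  # True
```

Because the fingerprint comes from metadata rather than the payload, no decompression is needed to shortlist duplicate candidates; a byte-level comparison of the decompressed contents can then confirm a match.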


Data deduplication, Storage system


Computer Sciences | Physical Sciences and Mathematics


Degree granted by The University of Texas at Arlington