Graduation Semester and Year

2019

Language

English

Document Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science and Engineering

First Advisor

Song Jiang

Abstract

The amount of data produced and consumed grows every day, and as a result storage systems can hold a large amount of redundant data. Storing and accessing this duplicate data needlessly consumes disk space and I/O bandwidth. Deduplication techniques are widely deployed to remove the redundancy, and solutions that work at the block level in particular have proven effective. These solutions make efficient use of disk space and write bandwidth by avoiding duplicate writes to storage. However, such a design may not improve read performance, which is critical for many modern applications. The Linux kernel implements an in-memory cache of pages, called the page cache, to improve I/O performance by minimizing disk accesses. The page cache holds pages originating from regular file systems and is indexed by file and by offset within the file. Because of this indexing, deduplication information is currently not available to the page cache. The kernel therefore cannot keep a read request from going to disk when the requested offset is not in the page cache, even if the requested data duplicates data at another offset that is already cached. Consequently, the overall I/O performance of applications running on these systems can be compromised. To address this issue, we propose a lightweight scheme called Dual-Dedup that efficiently coordinates deduplication information with the page cache. It discloses the redundancy knowledge detected by the block-level deduplication layer to the page cache, which can then prevent unnecessary read requests. Results from extensive experiments show that Dual-Dedup significantly improves read performance. On FIO tests with 25% duplicate data, our system improves read throughput by 34% compared with Linux EXT4.
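The idea described in the abstract can be illustrated with a toy user-space model. This is only a sketch under stated assumptions, not the Dual-Dedup kernel implementation: the class names `ToyDisk` and `ToyPageCache`, the `block_map` dictionary standing in for the deduplication layer's mapping, and the absence of eviction are all hypothetical simplifications introduced for illustration. The model compares a cache indexed only by (file, offset) with one that also consults the physical block number before going to disk.

```python
class ToyDisk:
    """Hypothetical disk: physical block number -> data, counting reads."""

    def __init__(self, blocks):
        self.blocks = blocks
        self.reads = 0

    def read(self, pbn):
        self.reads += 1
        return self.blocks[pbn]


class ToyPageCache:
    """Toy cache indexed by (file, offset); no eviction is modeled."""

    def __init__(self, disk, block_map, dedup_aware):
        self.disk = disk
        # Assumed stand-in for the dedup layer: (file, offset) -> physical block.
        self.block_map = block_map
        self.dedup_aware = dedup_aware
        self.pages = {}      # (file, offset) -> cached data
        self.by_block = {}   # physical block -> cached data (secondary index)

    def read(self, file, offset):
        key = (file, offset)
        if key in self.pages:
            return self.pages[key]          # ordinary cache hit
        pbn = self.block_map[key]
        if self.dedup_aware and pbn in self.by_block:
            data = self.by_block[pbn]       # duplicate already cached, skip disk
        else:
            data = self.disk.read(pbn)      # miss goes to disk
            self.by_block[pbn] = data
        self.pages[key] = data
        return data


def count_disk_reads(dedup_aware):
    """Read three offsets, two of which share one physical block."""
    disk = ToyDisk({7: b"shared", 9: b"unique"})
    block_map = {("a", 0): 7, ("b", 0): 9, ("b", 4096): 7}
    cache = ToyPageCache(disk, block_map, dedup_aware)
    for file, offset in block_map:
        cache.read(file, offset)
    return disk.reads


if __name__ == "__main__":
    print("disk reads without dedup knowledge:", count_disk_reads(False))
    print("disk reads with dedup knowledge:", count_disk_reads(True))
```

In this model the read of offset 4096 in file `b` is served from memory once the cache knows it maps to the same physical block as offset 0 in file `a`, which is the kind of disk access the abstract says the page cache cannot avoid on its own.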

Keywords

Page cache, Storage system, Operating system, Performance, Linux kernel

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington
