Graduation Semester and Year

2018

Language

English

Document Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science and Engineering

First Advisor

David Levine

Abstract

Monitoring the health of data center clusters is an integral part of operating any industrial facility. ATLAS is one of the High Energy Physics experiments at the Large Hadron Collider (LHC) at CERN. ATLAS DDM (Distributed Data Management) is the system that manages data transfers, staging, deletions, and experimental data on the LHC grid. Currently, the DDM system relies on Rucio software, with cloud-based object storage and NoSQL solutions. In the current system, fetching and analyzing the transfer, staging, and deletion metrics of a specific site for any regional center is a cumbersome process. To ease this problem, this thesis designs a web-based cluster health monitoring framework that monitors the health of the sites at the Tier 2 facility in the Southwest region of the US. A large volume of data flows in and out of each of these sites. If the transfer or deletion rate of files falls below a user-defined threshold at any source or destination site, the data center monitor is alerted automatically. This thesis also analyzes the failures that have occurred between any two performing sites. A machine learning algorithm learns the transfer and deletion patterns from the existing data and detects the sites that may fail due to diminishing transfers or deletions of files.

Keywords

Health monitoring, Clusters, Visualization, Machine learning, Failure, Analysis

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington
