Graduation Semester and Year
2018
Language
English
Document Type
Thesis
Degree Name
Master of Science in Computer Science
Department
Computer Science and Engineering
First Advisor
David Levine
Abstract
Monitoring the health of data center clusters is an integral part of operating any industrial facility. ATLAS is one of the High Energy Physics experiments at the Large Hadron Collider (LHC) at CERN. ATLAS DDM (Distributed Data Management) is a system that manages data transfer, staging, deletions and experimental data on the LHC grid. Currently, the DDM system relies on Rucio software, with cloud-based object storage and NoSQL solutions. In the current system, fetching and analyzing the transfer, staging and deletion metrics of a specific site for any regional center is a cumbersome process. To ease this problem, this thesis designs a web-based cluster health monitoring framework that monitors the health of the sites at the Tier 2 facility in the Southwest region of the US. A large volume of data flows in and out of each of these sites. If the transfer / deletion rate of files falls below the user-defined threshold at any source or destination site, the data center monitor is alerted automatically. This thesis also analyzes the failures that have happened between any two performing sites. A machine learning algorithm learns the pattern of transfer / deletion from the existing data and detects the sites that may fail due to diminishing transfer / deletion of files.
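The threshold alert described above can be illustrated with a minimal sketch. This is not the thesis implementation: the `SiteRate` record, the function names, the site labels, and the files-per-hour unit are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class SiteRate:
    """One rate reading for a site (hypothetical structure, not from the thesis)."""
    site: str              # hypothetical site label
    activity: str          # "transfer" or "deletion"
    files_per_hour: float  # assumed unit for the rate


def sites_below_threshold(readings, threshold):
    """Return the readings whose rate is strictly below the user-defined threshold."""
    return [r for r in readings if r.files_per_hour < threshold]


def format_alert(reading, threshold):
    """Build a plain-text alert message for one low reading."""
    return (f"ALERT: {reading.activity} rate at {reading.site} is "
            f"{reading.files_per_hour:g} files/h, below threshold {threshold:g}")


if __name__ == "__main__":
    # Made-up readings, used only to exercise the check.
    readings = [
        SiteRate("SITE_A", "transfer", 120.0),
        SiteRate("SITE_B", "transfer", 35.0),
        SiteRate("SITE_A", "deletion", 50.0),
    ]
    for low in sites_below_threshold(readings, threshold=50.0):
        print(format_alert(low, 50.0))
```

A reading exactly at the threshold does not raise an alert here; whether the boundary should count as a failure is a design choice the abstract does not specify.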
Keywords
Health monitoring, Clusters, Visualization, Machine learning, Failure, Analysis
Disciplines
Computer Sciences | Physical Sciences and Mathematics
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Recommended Citation
Balasubramanian, Meenakshi, "HEALTH MONITORING OF ATLAS DATA CENTER CLUSTERS AND FAILURE ANALYSIS" (2018). Computer Science and Engineering Theses. 414.
https://mavmatrix.uta.edu/cse_theses/414
Comments
Degree granted by The University of Texas at Arlington