Graduation Semester and Year




Document Type


Degree Name

Master of Science in Computer Science


Computer Science and Engineering

First Advisor

David Levine


Monitoring the health of data center clusters is an integral part of any industrial facility. ATLAS is one of the High Energy Physics experiments at the Large Hadron Collider (LHC) at CERN. ATLAS DDM (Distributed Data Management) is a system that manages data transfers, staging, deletions, and experimental data on the LHC grid. Currently, the DDM system relies on the Rucio software, with cloud-based object storage and NoSQL solutions. In the current system, fetching and analyzing the transfer, staging, and deletion metrics of a specific site for any regional center is a cumbersome process. In this thesis, a web-based cluster health monitoring framework is designed to monitor the health of the sites at the Tier 2 facility in the Southwest region of the US, which eases this process. A large volume of data flows in and out of each of these sites. If the transfer or deletion rate of files falls below the user-defined threshold at any source or destination site, the data center monitor is alerted automatically. This thesis also analyzes the failures that have occurred between any two performing sites. A machine learning algorithm learns the pattern of transfers and deletions from the existing data and detects the sites that may fail due to a diminishing rate of file transfers or deletions.
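The threshold-based alerting described in the abstract can be illustrated with a minimal sketch. This is not the implementation from the thesis: the `SiteRate` record, the `find_alerts` function, the site names, the files-per-hour unit, and the threshold values are all hypothetical and chosen only to show the idea of comparing a per-site rate against a user-defined threshold.

```python
from dataclasses import dataclass


@dataclass
class SiteRate:
    """One observed rate for a site (hypothetical record layout)."""
    site: str              # hypothetical site name
    activity: str          # "transfer" or "deletion"
    files_per_hour: float  # assumed unit, for illustration only


def find_alerts(rates, thresholds):
    """Return one message per rate that is below its activity's threshold."""
    alerts = []
    for r in rates:
        limit = thresholds.get(r.activity)
        # Activities without a configured threshold are skipped.
        if limit is not None and r.files_per_hour < limit:
            alerts.append(
                f"{r.site}: {r.activity} rate {r.files_per_hour:.1f} files/h "
                f"is below threshold {limit:.1f}"
            )
    return alerts


# Illustrative values only; not taken from the thesis.
thresholds = {"transfer": 100.0, "deletion": 50.0}
rates = [
    SiteRate("SITE_A", "transfer", 250.0),
    SiteRate("SITE_B", "transfer", 40.0),
    SiteRate("SITE_B", "deletion", 75.0),
]
alerts = find_alerts(rates, thresholds)
for message in alerts:
    print(message)
```

In a real deployment the rates would presumably come from the DDM metrics store and the alert would be delivered to the data center monitor rather than printed; both of those steps are outside the scope of this sketch.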


Health monitoring, Clusters, Visualization, Machine learning, Failure, Analysis


Computer Sciences | Physical Sciences and Mathematics


Degree granted by The University of Texas at Arlington