Machine learning for ultraviolet spectral prediction
Degree granted by The University of Texas at Arlington
Abstract
Machine learning has found wide application in materials science, including dielectric polymers, superconducting materials, and drug property prediction. The use of data analytics and machine learning methods to predict vacuum ultraviolet (VUV) spectra by encoding molecular structure is gaining interest because high-quality VUV spectral prediction would enable the study of new molecules without costly wet-lab measurements. This dissertation studies feature representations for molecular structures that enhance the prediction of VUV spectra via machine learning models. Both interpretable machine learning and deep learning are studied. Chapter 1 provides an overview of VUV/UV spectra retrieval, and Chapter 2 reviews relevant machine learning models and conventional techniques in molecular analysis from the existing literature. Chapter 3 presents the primary contribution of this dissertation: a new set of features that captures molecular characteristics potentially important for accurate VUV spectral prediction. These new features are combined with features derived from the literature, and prediction performance is compared across a variety of machine learning models. Findings demonstrate improvements in accuracy, highlight important features, compare the computational effort of different methods, and identify directions for future work. Chapter 4 takes a closer look at two of the deep learning methods studied in Chapter 3, namely graph-based and molformer methods. Because deep learning embeds feature engineering within the algorithm, the existing forms of these methods cannot take advantage of the features studied in Chapter 3. To leverage the success of these features in VUV spectral prediction, a complementary structure is developed with the deep learning architecture. In addition, the graph-based method is improved by introducing a new edge feature that specifically identifies aromatic cycles. Findings show increased prediction accuracy with the complementary structure, which indicates a potentially generalizable benefit for deep learning. Finally, Chapter 5 provides closing remarks on future research. This dissertation contributes to the application of machine learning in predicting VUV spectra, providing interpretable models and facilitating molecular analysis in the domain of cheminformatics.
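The abstract does not define the aromatic-cycle edge feature, so the following is only a minimal sketch of what such a feature could look like, not the dissertation's implementation. It assumes each bond arrives already labelled with a type ("single", "double", "triple", or "aromatic"), and it flags a bond when it lies on a cycle made up entirely of aromatic bonds. The function names `aromatic_cycle_flags` and `edge_features`, the bond-list format, and the one-hot layout are all hypothetical.

```python
from collections import defaultdict


def aromatic_cycle_flags(bonds):
    """Return one 0/1 flag per bond: 1 if it lies on a cycle of aromatic bonds.

    `bonds` is a list of (atom_i, atom_j, bond_type) tuples (assumed format).
    """
    # Adjacency over aromatic bonds only; each entry remembers its bond index.
    adj = defaultdict(list)
    for idx, (a, b, kind) in enumerate(bonds):
        if kind == "aromatic":
            adj[a].append((b, idx))
            adj[b].append((a, idx))

    def connected_without(start, goal, skipped):
        # Depth-first search that never traverses the bond with index `skipped`.
        stack, seen = [start], {start}
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            for nxt, idx in adj[node]:
                if idx != skipped and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    flags = []
    for idx, (a, b, kind) in enumerate(bonds):
        # A bond is on a cycle exactly when its endpoints stay connected
        # after the bond itself is removed.
        on_cycle = kind == "aromatic" and connected_without(a, b, idx)
        flags.append(1 if on_cycle else 0)
    return flags


def edge_features(bonds):
    """One-hot bond type followed by the aromatic-cycle flag (assumed layout)."""
    kinds = ("single", "double", "triple", "aromatic")
    flags = aromatic_cycle_flags(bonds)
    return [
        [1 if kind == k else 0 for k in kinds] + [flag]
        for (_, _, kind), flag in zip(bonds, flags)
    ]


# Toluene: atoms 0-5 form the aromatic ring, atom 6 is the methyl carbon.
toluene = [
    (0, 1, "aromatic"), (1, 2, "aromatic"), (2, 3, "aromatic"),
    (3, 4, "aromatic"), (4, 5, "aromatic"), (5, 0, "aromatic"),
    (0, 6, "single"),
]
features = edge_features(toluene)
```

In a graph neural network, vectors like these would typically be attached to the edges of the molecular graph alongside the node features. How the dissertation encodes and consumes its edge feature is described in Chapter 4, not here.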