Document Type
Article
Source Publication Title
BCB 2022
Abstract
Obtaining informative representations of gene expression is crucial for downstream regulatory tasks such as promoter prediction and transcription factor binding site prediction. However, supervised learning with insufficient labeled genomes limits the generalization capability of a robust predictive model. Recently, researchers have modeled DNA sequences with self-supervised training and transferred the pre-trained genome representations to various downstream tasks. Instead of directly transferring masked language modeling to DNA sequence learning, we incorporate prior knowledge into genome language modeling representations. We propose a novel Motif-oriented DNA (MoDNA) pre-training framework, which is self-supervised by design and can be fine-tuned for different downstream tasks. MoDNA effectively learns semantic-level genome representations from enormous amounts of unlabelled genome data and is more computationally efficient than previous methods. We pre-train MoDNA on human genome data and fine-tune it on downstream tasks. Extensive experimental results on promoter prediction and transcription factor binding site prediction demonstrate the state-of-the-art performance of MoDNA.
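To make the pre-training idea in the abstract concrete, the following is a minimal sketch of masked-token self-supervised training on k-mer tokenized DNA with a small Transformer encoder. Every name, hyperparameter, and architecture choice here (k-mer size, TinyDnaEncoder, masking ratio) is an illustrative assumption, not the authors' MoDNA implementation, which additionally uses motif-oriented objectives.

# Minimal masked-LM pre-training sketch for DNA k-mers (assumed setup, not MoDNA itself).
import itertools
import random
import torch
import torch.nn as nn

K = 3                                                        # k-mer size (assumption)
VOCAB = ["".join(p) for p in itertools.product("ACGT", repeat=K)]
TOK2ID = {t: i + 2 for i, t in enumerate(VOCAB)}             # 0 = [PAD], 1 = [MASK]
PAD_ID, MASK_ID = 0, 1
VOCAB_SIZE = len(VOCAB) + 2

def kmer_tokenize(seq: str, k: int = K) -> list:
    """Split a DNA string into overlapping k-mers and map them to token ids."""
    return [TOK2ID[seq[i:i + k]] for i in range(len(seq) - k + 1)]

class TinyDnaEncoder(nn.Module):
    """A small Transformer encoder with a masked-token prediction head."""
    def __init__(self, d_model: int = 64, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model, padding_idx=PAD_ID)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(self.embed(ids)))

def mask_tokens(ids: torch.Tensor, p: float = 0.15):
    """Replace a random fraction of tokens with [MASK]; unmasked labels are ignored."""
    labels = ids.clone()
    mask = torch.rand(ids.shape) < p
    labels[~mask] = -100                                     # ignored by cross_entropy
    masked = ids.clone()
    masked[mask] = MASK_ID
    return masked, labels

# Toy pre-training step on one random DNA sequence.
seq = "".join(random.choice("ACGT") for _ in range(128))
ids = torch.tensor([kmer_tokenize(seq)])
model = TinyDnaEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
masked_ids, labels = mask_tokens(ids)
logits = model(masked_ids)
loss = nn.functional.cross_entropy(logits.view(-1, VOCAB_SIZE), labels.view(-1))
loss.backward()
opt.step()
print(f"masked-LM loss: {loss.item():.3f}")

After pre-training in this fashion, the encoder weights would be reused and fine-tuned with a task-specific head for downstream tasks such as promoter or transcription factor binding site prediction, as the abstract describes.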
Publication Date
8-10-2022
Language
English
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
An, Weizhi; Guo, Yuzhi; Bian, Yatao; Ma, Hehuan; Yang, Jinyu; Li, Chunyuan; and Huang, Junzhou, "MoDNA: Motif-Oriented Pre-training For DNA Language Model" (2022). Association for Computing Machinery Open Access Agreement Publications. 7.
https://mavmatrix.uta.edu/utalibraries_acmoapubs/7