Abanish Singh

Graduation Semester and Year




Document Type


Degree Name

Doctor of Philosophy in Computer Science


Computer Science and Engineering

First Advisor

Nikola Stojanovic


The genetic code consists of long chains of deoxyribonucleic acid (DNA) present in every cell of a living organism. These chains contain both functional and non-functional DNA sequences, and their proportion in the mix varies widely along the tree of life. Generally, more complex organisms tend to feature large amounts of "junk" DNA, whose importance is still the subject of debate in scientific circles. The functional sequences include coding sequences (genes) and various types of signals, mostly, but not exclusively, controlling the regulation of coding sequences, i.e. activating and deactivating the expression of genes during the developmental stage, in response to external stimuli, or during housekeeping activities in a cell or organism. Such expression leads to the production of various ribonucleic acids (RNAs), of which the most common is messenger RNA (mRNA), which serves as a template for chains of amino acids, or polypeptides. The polypeptides themselves fold and group into proteins, providing structural components and functionalities to the living cells and tissues. Regulatory signals in DNA tend to act as parts of complex networks, whose structure and dynamics have been the subject of biomolecular studies for many decades. Recently, especially after the sequencing of several major eukaryotic genomes was completed, these studies have become increasingly computational. The applied techniques focus on sequence features, such as periodicity, motif over-representation, phylogenetic conservation, and sequence or structural homology, or on experimental data about binding effects, patterns of gene co-expression, and, more recently, epigenetic information.

Over the last several years, the search for functional elements in human and other genomes by exploiting motif over-representation has become increasingly popular. Although there has been some success in this field, the existing tools are still neither sensitive nor specific enough, usually suffering from the detection of a large number of false positive signals. Given the properties of genomic sequences, some of which we analyze in this document, this is not unexpected, but one can still find interesting signals worthy of further computational and laboratory investigation.

In this thesis we present several algorithms for DNA sequence analysis, and in particular for the identification and characterization of short motifs. We start by presenting an efficient algorithm to find significant variable motifs shared within target sequences, generally taken from the upstream regions of co-expressed genes. Various filtering techniques have been applied to this problem in the past, but in our view it is important to generate complete data, upon which separate selection criteria can be applied, depending on the nature of the sites one wants to locate. Though we primarily intended to develop software to locate significant motifs based on their over-representation in the given DNA sequences, we also attempted to elucidate why such software often fails to locate the real elements. We have thus performed a study of the repetitive structure and distribution of short motifs in human genomic sequences. In most mammalian species about half of the genome consists of known or readily recognizable repeated elements, and we demonstrate that, in addition to these repeats, human genomic sequences feature many short motifs which are significantly over-represented, and that their frequency varies only slightly between random repeat-masked sequences and regions located immediately upstream of the known genes.

Recent studies have established the existence of evolutionary (and thus presumably functional) constraint on only about 5% of the human genome. If half of the genome consists of known repeated sequences, that leaves an open question about the source of the remaining 45%, which we postulate mostly originated from ancient transpositional or other duplication activity. The original copies could have become so broken over time that they can no longer be recognized as such, giving rise to seemingly unique sequences which nevertheless share large numbers of greatly over-represented short motifs. We have developed an algorithm, and written software, that efficiently associates these motifs and reconstructs the consensus sequences of possible ancient broken repeats. We have found a significant number of new large repeated sequences, in addition to the previously characterized transposable elements and other duplications in the human genome, and we have built their consensus sequences and attempted to characterize them. We believe that, in view of a recently proposed model postulating that transposable elements have been a significant source of transcriptional regulatory signals, further study of broken genomic repeats would be very useful.

The software implementing our methods has been made available in the public domain, and we have also developed a web server to enable on-line access to our tools by other investigators.
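
To make the notion of motif over-representation concrete, here is a minimal Python sketch. It is not the algorithm developed in the thesis, which handles variable motifs and is not described in enough detail here to reproduce. The sketch counts exact k-mers only and compares each observed count with the count expected if bases occurred independently at their overall frequencies. The function names, the independence background model, and the `min_ratio` threshold are all assumptions made for illustration.

```python
from collections import Counter
from math import prod

DNA = set("ACGT")

def kmer_counts(seqs, k):
    """Count every exact k-mer over A/C/G/T; windows with other symbols (e.g. N) are skipped."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if set(kmer) <= DNA:
                counts[kmer] += 1
    return counts

def overrepresented(seqs, k, min_ratio=2.0):
    """Return (kmer, observed, observed/expected) for k-mers at or above min_ratio.

    Assumption: the expected count comes from a simple independent-base
    background estimated from the same sequences. Real genomic background
    models are usually richer (e.g. Markov chains).
    """
    counts = kmer_counts(seqs, k)
    base = Counter(c for s in seqs for c in s.upper() if c in DNA)
    total = sum(base.values())
    freq = {b: base[b] / total for b in DNA}
    windows = sum(counts.values())  # number of valid k-mer windows scanned
    hits = []
    for kmer, observed in counts.items():
        expected = windows * prod(freq[b] for b in kmer)
        ratio = observed / expected
        if ratio >= min_ratio:
            hits.append((kmer, observed, ratio))
    return sorted(hits, key=lambda t: -t[2])

if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    toy = ["ACGTACGTACGT", "TTACGTTT"]
    for kmer, observed, ratio in overrepresented(toy, 4):
        print(kmer, observed, round(ratio, 1))
```

On real upstream regions a ranking of this kind tends to return many hits, which is consistent with the abstract's remark that tools based on over-representation report large numbers of false positives.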


Computer Sciences | Physical Sciences and Mathematics


Degree granted by The University of Texas at Arlington