Long Term Objectives

The recent recognition of Bioinformatics as an important and necessary area of scientific research and development, together with the advent of multiple organismal sequencing projects, demands innovative approaches in basic research. While development of software for meaningful large-scale analytical projects is necessary, the deposition of the results of Bioinformatics analyses in appropriate databases, thereby creating new research and development tools, is increasingly important. My Bioinformatic research agenda has two components: the analysis of life forms that require an RNA replication intermediate in their lifecycle, and the assessment and development of tools necessary for studying large amounts of sequence information in a biologically meaningful manner.
Specific Analytical Research

Mapping of All Genomic Retroid Agents: Prototype Human Genome
 

The Retroid Agents (e.g., HIV, Hepatitis B virus and retrotransposons) encode a gene for the reverse transcriptase protein (RT) and are the interface between RNA and DNA replication systems. My goal is to identify, classify and map all Retroid Agents of all complete genomes, using the Human Genome as a prototype. In addition, hidden Markov modeling of the sequences of the enzymatic core of these agents is ongoing, with the aim of generating more robust multiple alignments for relative rate determination.

A Functional Genomics Challenge: the Transcription/Replication Complex of the Order Mononegavirales

Rabies, measles and Ebola viruses, which belong to the order Mononegavirales, are true RNA-based life forms that have no DNA stage. To date, little has been learned from either laboratory experiments or in silico studies about the distribution of functions within the replication/transcription complex of this order, or about its actual structure. The size of the complex is beyond the limits of current structure determination methods (i.e., X-ray crystallography or NMR spectroscopy). The second goal of my research agenda is to bring the complexity of this problem to the attention of the Bioinformatic community to accelerate progress in understanding the lifecycle of these medically important viruses. Creation of robust multiple sequence alignments, generated by the hidden Markov process, for the three proteins of the replication/transcription complex is underway. Multiple alignments are necessary as data for methods predicting both inter- and intra-protein contacts and functional constraints. A variety of software for predicting potential regions or residues of protein:protein interaction is being evaluated. My strategy utilizes a variety of approaches (e.g., compensatory mutation analyses and prediction of protein disorder) that should converge, indicating specific residues and/or regions of the three proteins of the replication/transcription complex that are potentially involved in protein:protein interactions. Arrangements have been made with various laboratories to test these functional genomic predictions, once the studies are complete, among the proteins involved in replication/transcription for VSV, rabies, Sendai and measles viruses.
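As an illustration of the kind of compensatory mutation analysis mentioned above, the following is a minimal sketch that scores covariation between two columns of a multiple alignment using mutual information. Mutual information is only one common choice for this purpose and is an assumption here, not necessarily the method used in this work; the function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column(alignment, i):
    """Return the residues found at position i of every aligned sequence."""
    return [seq[i] for seq in alignment]


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    High values mean the residues at the two positions vary together,
    which is one possible signature of compensatory mutation.
    """
    n = len(alignment)
    col_i, col_j = column(alignment, i), column(alignment, j)
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment (hypothetical): columns 0 and 1 change together, column 2 is invariant.
toy_alignment = ["AKL", "AKL", "GEL", "GEL"]
covarying_score = mutual_information(toy_alignment, 0, 1)
invariant_score = mutual_information(toy_alignment, 0, 2)
```

For inter-protein contacts, one possible arrangement is to join the aligned sequences of two proteins from the same virus into a single row, so that columns from different proteins can be compared in the same way. Real alignments would also need handling of gaps and a correction for small sample sizes, which this sketch omits.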
 

As results from proteomic studies begin to accumulate, there will be an abundance of sequence information but little structural information on higher-order protein complexes. A combination of physical-chemical approaches and functional genomic predictions will be necessary to elucidate the interactions in protein complexes. My evaluation and utilization of methods for predicting potential protein:protein contacts will be of great value in the effort to understand higher-order protein interactions.
 

Specific Technical Research 

Development of Automation for Analyzing Genomic Retroid Data: the Genome Parsing Suite 
The 1st goal of my technical research is the development of software necessary for identifying, classifying and mapping all Retroid Agents of a given genome. The Genome Parsing Suite (GPS) is the prototype software used in our pilot studies to demonstrate the validity of taking such a global approach to identifying all Retroid Agents in a given genome. This generic software applies well-established, robust methods in novel ways to access and categorize large amounts of sequence data. The basic GPS strategy is to 1) identify all potential RT genes; 2) filter the data for probable functional RTs; 3) compare the "new" RT to a library of "known" RTs, thereby predicting Retroid type and genomic length; 4) annotate the "new" agent's genome as to Retroid gene complement; and 5) survey the surrounding host genome to classify the probable role of the agent. Although still in development, this software is being used in collaborations with Stanford University's Saccharomyces genome project and Northeastern University's icefish genome project.
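The filtering and comparison steps of the strategy above (steps 2 and 3) can be sketched as follows. This is an illustrative outline only and is not the GPS code: the record fields, the score threshold, the stop-codon criterion for "probable functional", and the library entries with their lengths are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RTHit:
    """One candidate RT gene found in a genome (all fields are illustrative)."""
    chrom: str
    start: int
    end: int
    score: float        # similarity score from some search tool (assumed)
    stop_codons: int    # in-frame stop codons inside the RT region
    best_match: str     # name of the closest "known" RT in the library
    retroid_type: Optional[str] = None
    expected_length: Optional[int] = None


# Hypothetical library: known RT name -> (Retroid type, placeholder genome length in bp)
KNOWN_RTS = {
    "HERV-K_RT": ("retrovirus", 9500),
    "L1_RT": ("retroposon", 6000),
}


def filter_probable_functional(hits, min_score=50.0):
    """Step 2: keep hits that score well and contain no in-frame stop codons."""
    return [h for h in hits if h.score >= min_score and h.stop_codons == 0]


def classify(hit, library=KNOWN_RTS):
    """Step 3: predict Retroid type and genome length from the closest known RT."""
    hit.retroid_type, hit.expected_length = library.get(hit.best_match, (None, None))
    return hit


def run_pipeline(hits):
    """Chain steps 2 and 3; steps 1, 4 and 5 would wrap real search and annotation tools."""
    return [classify(h) for h in filter_probable_functional(hits)]


# Usage with made-up candidates: only the first one passes the filter.
candidates = [
    RTHit("chr1", 100, 900, 80.0, 0, "L1_RT"),
    RTHit("chr2", 5, 700, 90.0, 2, "HERV-K_RT"),
    RTHit("chrX", 10, 800, 20.0, 0, "L1_RT"),
]
classified = run_pipeline(candidates)
```

Keeping each step as a separate function mirrors the numbered strategy, so that any one stage can be replaced by a different method without touching the others.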
 

Creating the Prototype Browsable Database for the Retroid Agents

In this age of information technology it is important to deposit newly generated in silico data into a framework that creates new research and development tools. The 2nd goal is to design and implement a meaningful prototype browsable database of all existing and new data relevant to identifying and classifying all Retroid Agents. These data will be available for display through the UCSC Genome Browser.

Constructing Evolutionary HMMs for Distantly Related Sequences

The 3rd goal is to construct evolutionary hidden Markov models (HMMs) that incorporate information on variable and invariant regions of a protein so that better multiple alignments can be generated for phylogenetic reconstruction and for determining protein-protein interactions.
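One simple way to obtain the kind of information on variable and invariant regions referred to above is to measure the Shannon entropy of each alignment column. The sketch below is an assumption about how such labels could be derived, not a description of the evolutionary HMMs themselves; the function names, the threshold parameter and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(alignment, i):
    """Shannon entropy (in bits) of the residues at alignment position i."""
    counts = Counter(seq[i] for seq in alignment)
    n = len(alignment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def label_columns(alignment, max_entropy=0.0):
    """Label each column 'invariant' (entropy <= max_entropy) or 'variable'."""
    length = len(alignment[0])
    return [
        "invariant" if column_entropy(alignment, i) <= max_entropy else "variable"
        for i in range(length)
    ]


# Toy alignment (hypothetical): only the last column is identical in every sequence.
labels = label_columns(["AKL", "AKL", "GEL", "GEL"])
```

Raising max_entropy above zero would relax the definition from strictly invariant to well conserved, which may be more useful for distantly related sequences.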

A wide variety of hypotheses can be addressed as Retroid genomes are identified and classified. For example, the distribution of active versus inactive agents can be examined on a genome-wide and per-chromosome basis. The frequency of insertional mutagenesis and of regulation of host genes by Retroid Agents can also be ascertained.
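Once agents have been classified, the per-chromosome question raised above reduces to a simple tally. The sketch below assumes each agent has already been reduced to a (chromosome, is_active) pair; that input format and the function name are hypothetical.

```python
from collections import defaultdict


def tally_by_chromosome(agents):
    """Count active and inactive agents per chromosome.

    agents: iterable of (chromosome, is_active) pairs.
    """
    counts = defaultdict(lambda: {"active": 0, "inactive": 0})
    for chrom, is_active in agents:
        counts[chrom]["active" if is_active else "inactive"] += 1
    return dict(counts)


# Usage with made-up records.
example = tally_by_chromosome([("chr1", True), ("chr1", False), ("chrX", False)])
```

Summing the per-chromosome counts gives the genome-wide totals.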
 
Background 

Bioinformatics is the interplay between empirically derived biological knowledge and the availability of various types of data (e.g., sequence, structure, etc.) on the one hand, and the development of strategies and methods for the analysis and interpretation of these data sources on the other. Bioinformatic research includes the development and testing of software tools necessary to generate new knowledge from primary source information deposited in databases and the literature. In the broadest sense, Bioinformatics is the generation of new knowledge from existing data. All of my research is conducted in silico, i.e., within the computer environment. Many of the results of my analyses are functional genomic predictions that have been or can be tested by laboratory experimentation. For example, in 1987 I predicted the correct order of the two domains of the RNA-dependent-DNA-polymerase (RdDp): the reverse transcriptase (RT) domain is in the amino portion while the ribonuclease H (RH) function resides in the carboxyl portion. This prediction was experimentally verified in 1988. This work was one of the earliest studies to predict function from comparison of distantly related protein sequences and then have that prediction experimentally verified, contrary to earlier published experimental studies.

Given the pace at which biological information is deposited into international databases, it is imperative that 21st century scientists train in both discovery-based and hypothesis-based research that fully utilizes these growing data resources. Inadequate, even faulty, Bioinformatic analysis will severely limit what can be learned from existing data sources. My laboratory has been engaged in Bioinformatic analysis for over a decade. My in silico research is dedicated to the Bioinformatic analyses of RNA-based life forms, i.e., all RNA viruses, and other genetic agents that require an RNA intermediate in their lifecycle. My research is the interplay between empirically derived biological data sources, Bioinformatic tools, and human decision making in the creation of new knowledge about the evolution, structure, and function of RNA-based life forms and their impact on DNA genomes. The analytical portion of my research agenda has two major components: the analysis of all genetic information encoding the RT, and the elucidation of the protein relationships and interactions of the transcription/replication complex of the order Mononegavirales (e.g., measles, rabies and Ebola) using Bioinformatic tools. The major emphasis of my research agenda is the description of a new strategy for the identification, classification and mapping of all RT-encoding sequences in all genomes.

Retroid Agents were once considered to be "junk" DNA, but they have now been well studied for over a decade and the phylogenetic relationships among them are well documented. Retroid Agents are co-evolving with host genomes, some through mutualism. The relationships among and between Retroid and host genomes are complex. Retroid Agents are ubiquitous in Eukaryotes. They appear to have evolved as both endogenous and exogenous genetic information, maintaining a stable enzymatic core of genes while occasionally acquiring unique genes from their hosts. These discrete RNA-dependent genomes are involved in regulation of general host cell genes and reproduction, as well as DNA repair and the disease process. As data accumulate from organismal genome projects, the variety of known roles of Retroid Agents, and of the underlying molecular mechanisms responsible for their action, expands. In the last two years alone over three hundred publications have described the richness of the cellular niches that these agents occupy in a wide variety of Eukaryotes. The impact of the co-evolution of Retroid Agents and DNA-based life forms is just beginning to be revealed. Given this long-term association between these two types of genetic information systems, Retroid Agents can also be used as evolutionary markers for their hosts. My interest in Retroid Agent evolution is focused on the impact that these agents have on regulation and disease processes in humans, given that they comprise approximately 20% of the Human Genome.

Based upon available data, the Human Genome contains only two types of Retroid Agents: retroviruses (exogenous and endogenous) and retroposons. While the pathogenicity of HIV and HTLV is well known, only recently has an MMLV-like endogenous virus been found in association with some human breast cancers. Recent data indicate that other human endogenous retroviruses (HERVs) are associated with a wide variety of human diseases: testicular tumors, insulin dependent diabetes mellitus, and multiple sclerosis. One of the most fascinating examples suggests that three different alleles of endogenous HRES-1, a human T-cell lymphotropic type I virus, confer susceptibility to systemic lupus erythematosus (SLE) and autoreactivity. The data suggest that allelic type 1 of HRES-1 may confer protection against SLE.

The first evidence that HERVs play a role in host cell regulation came from the amylase gene cluster. Other HERVs have been demonstrated to regulate the HLA-DRB gene, the leptin gene (obese), the HHLA 2 and 3 genes of unknown function, the apolipoprotein CI gene, and the endothelin B receptor of placental cells. HERVs are also involved in the regulation of human reproduction. Two very specific roles of HERVs in reproduction reveal the depth of co-evolution between the host and the virus. HERV expression is involved in human sperm-egg binding and fusion. It has been demonstrated that HERV-W encodes the adhesion protein, syncytin. This virally encoded protein adheres to the trophoblast forming the syncytiotrophoblast that binds to the endometrium in the early stage of human placental development. To date no human gene has been discovered that encodes this function.

As data accumulate to support the importance of HERV involvement in human reproduction, development and disease, the evidence for the co-evolution of the distantly related retroposons with the Human Genome is also mounting. To date, retroposon insertion into genes has been shown to cause several human diseases. Hemophilia A is caused by the insertion of a long interspersed nuclear element (LINE) into the factor VIII gene. As first suggested in 1993, LINE insertion into the dystrophin gene is responsible for Duchenne muscular dystrophy. Fukuyama-type congenital muscular dystrophy, a common autosomal recessive disorder in the Japanese population, is caused by insertion of a truncated LINE into the fukutin gene. Interestingly, in the cases of Hemophilia A and Duchenne muscular dystrophy, the insertional mutagenesis has occurred very recently, since the mutation is absent in the genes of the parents of those afflicted. Two different X-linked disorders are also caused by LINE insertional mutagenesis. Alport Syndrome-Diffuse Leiomyomatosis presents as hematuria, progressive renal failure, hearing loss and ocular abnormalities in males, while in females it is milder; in this disorder the type IV collagen gene is disrupted by the presence of a LINE. Chronic Granulomatous Disease is a severe congenital immunodeficiency syndrome caused by defects in NADPH oxidase in phagocytic leukocytes. In this case, a new exon is created by a LINE insertion into the human X-linked CYBB gene that encodes the gp91-phox subunit of the oxidase. At least two cases of LINE regulation of host genes have been demonstrated to date: the thymidylate synthetase and Apo A genes.

The reports on the number and distribution of retroposons in the Human Genome estimate that they make up approximately 16% of total information. There are no analyses of the functionality or roles retroposons play in the Human Genome in either report on sequencing the Human Genome; however, the importance of cataloguing the evolutionary events that these agents have left on genomic landscapes is clearly stated. It is abundantly evident, as first suggested by Barbara McClintock, that transposable elements play a major role in the evolution and development of Eukaryotes. The full extent to which HERVs and retroposons play critical roles in human evolution, development and disease will only be known when all such agents are identified, mapped and evaluated as to probable function. The pilot studies and experimental design I have developed clearly show that such a global approach to this problem is feasible. The creation of new research and development tools of this nature will accelerate the rate at which we understand the pathways of complexity that have evolved between host and Retroid genomes, and the roles they play in the human disease process.

Retroid Agent Distribution On Human Chromosomes For Each Genome Iteration

PowerPoint presentation: Presentation



Copyright © 2004, The McClure Lab