The recent recognition of Bioinformatics as an important
and necessary area of scientific research and development, and the
advent of multiple organismal sequencing projects demand innovative
approaches in basic research. While development of software for
meaningful large-scale analytical projects is necessary, the deposition
of the results of Bioinformatics analyses in appropriate databases,
thereby creating new research and development tools, is increasingly
important. My Bioinformatic research agenda has two components: the
analysis of life-forms that require an RNA replication intermediate in
their lifecycle; and the assessment and development of tools necessary
for studying large amounts of sequence information in a biologically
meaningful manner.
Specific Analytical Research
Mapping of All Genomic Retroid Agents: Prototype
Human Genome
The Retroid Agents (e.g., HIV, Hepatitis B and
retrotransposons) encode a gene for the reverse transcriptase protein
(RT) and are the interface between RNA and DNA replication systems. My
goal is to identify, classify and map all Retroid Agents of all complete
genomes using the Human Genome as a prototype. In addition, hidden
Markov modeling of the sequences of the enzymatic core of these agents
is ongoing, with the aim of generating more robust multiple alignments
for relative rate determination.
A Functional Genomics Challenge: the
Transcription/Replication Complex of the Order Mononegavirales
Rabies, measles and Ebola viruses, which belong to the order
Mononegavirales, are true RNA-based life forms that have no DNA stage.
To date, little has been learned about the distribution
of functions within or the actual structure of the
replication/transcription complex of this order from either laboratory
experiments or in silico studies. The size of the complex is beyond the
limits of current structure determination methods (i.e., X-ray
crystallography or NMR spectroscopy). The second goal of my research
agenda is to bring the complexity of this problem to the attention of
the Bioinformatic community to accelerate progress in understanding the
lifecycle of these medically important viruses. Creation of robust
multiple sequence alignments, generated by the hidden Markov process,
for the three proteins of the replication/transcription complex is
underway. Multiple alignments are necessary as data for methods
predicting both inter- and intra-protein contacts and functional
constraints. A variety of software for predicting potential regions or
residues of protein:protein interaction is being evaluated. My strategy
utilizes a variety of approaches, e.g., compensatory mutation analyses,
prediction of protein disorder, etc., that should converge, indicating
specific residues and/or regions of the three proteins of the
replication/transcription complex that are potentially involved in
protein:protein interactions. Arrangements have been made with various
laboratories to test these functional genomic predictions, once these
studies are complete, among the proteins involved in
replication/transcription for VSV, rabies, Sendai and measles
viruses.
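As an illustration of the compensatory mutation idea mentioned above, the following minimal Python sketch scores pairs of alignment columns from two proteins by mutual information. Mutual information is a common, simple stand-in for covariation analysis and is an assumption here; it is not necessarily the method under evaluation. The function names are hypothetical, the two alignments are assumed to list the same viruses in the same row order, and gaps are treated as ordinary symbols.

```python
import math
from collections import Counter


def column(alignment, i):
    """Return column i of an alignment given as a list of equal-length strings."""
    return [seq[i] for seq in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equal-length alignment columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in pab.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((pa[a] / n) * (pb[b] / n)))
    return mi


def top_covarying_pairs(aln_x, aln_y, k=5):
    """Rank column pairs (i in protein X, j in protein Y) by mutual information.

    Row r of aln_x and row r of aln_y must come from the same virus.
    """
    scores = []
    for i in range(len(aln_x[0])):
        for j in range(len(aln_y[0])):
            mi = mutual_information(column(aln_x, i), column(aln_y, j))
            scores.append((mi, i, j))
    # Highest score first; ties broken by column indices for stable output.
    scores.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scores[:k]


if __name__ == "__main__":
    # Toy data: column 1 of X changes in step with column 0 of Y.
    aln_x = ["AK", "AK", "AE", "AE"]
    aln_y = ["D", "D", "R", "R"]
    for mi, i, j in top_covarying_pairs(aln_x, aln_y):
        print(f"X column {i} / Y column {j}: {mi:.2f} bits")
```

High-scoring column pairs are only candidates for contact. Shared phylogeny alone can inflate such scores, which is one reason to look for agreement with independent predictors such as protein disorder.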
As results from proteomic studies begin to accumulate,
there will be an abundance of sequence information but little structural
information on higher-order protein complexes. A combination of
physical chemical approaches and functional genomic predictions will be
necessary to elucidate the interaction in protein complexes. My
evaluation and utilization of methods for predicting potential
protein:protein contacts will be of great value in the effort to
understand higher order protein interactions.
Specific Technical Research
Development of Automation for Analyzing Genomic Retroid
Data: the Genome Parsing Suite
The 1st goal of my technical research is the development of software
necessary for identifying, classifying and mapping all Retroid Agents of
a given genome. The Genome Parsing Suite (GPS) is the prototype software
used in our pilot studies to demonstrate the validity of taking such a
global approach to identifying all Retroid Agents in a given genome.
This generic software applies well-established, robust methods in novel
ways to access and categorize large amounts of sequence data. The basic
GPS strategy is to 1) identify all potential RT genes; 2) filter the
data for probable functional RTs; 3) compare the "new" RT to a library
of "known" RTs, thereby predicting Retroid type and genomic length; 4)
annotate the "new" agent's genome as to Retroid gene complement; and
5) survey the surrounding host genome to classify the probable role
of the agent. Although still in development, this software is being used
in collaborations with Stanford University's Saccharomyces genome
project, and Northeastern University's Ice fish genome project.
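The first three steps of the strategy above can be sketched in Python as follows. This is a toy outline under stated assumptions, not the GPS implementation: every function name is hypothetical, step 1 is approximated by a search for the conserved YXDD motif of the RT active site, step 2 by rejecting windows that contain a stop codon, and step 3 by position-wise identity against a small library of reference windows. Steps 4 and 5 are omitted because they depend on genome annotation that is not described here.

```python
import re
from dataclasses import dataclass

# Assumption: the YXDD motif is used as a crude detector of candidate RTs.
RT_MOTIF = re.compile(r"Y.DD")


@dataclass
class Candidate:
    seq_id: str
    position: int  # 0-based index of the motif in the protein sequence
    window: str    # residues surrounding the motif


def find_candidates(proteins, flank=10):
    """Step 1: identify all potential RT genes (here, any YXDD motif hit)."""
    hits = []
    for seq_id, seq in proteins.items():
        for m in RT_MOTIF.finditer(seq):
            start = max(0, m.start() - flank)
            hits.append(Candidate(seq_id, m.start(), seq[start:m.end() + flank]))
    return hits


def is_probably_functional(cand):
    """Step 2: filter; a window interrupted by a stop codon ('*') is rejected."""
    return "*" not in cand.window


def identity(a, b):
    """Fraction of identical positions over the shorter of two strings."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def classify(cand, library):
    """Step 3: return the Retroid type whose reference window matches best."""
    return max(library, key=lambda name: identity(cand.window, library[name]))


def run_pipeline(proteins, library, flank=10):
    """Run steps 1 to 3 and return (sequence id, motif position, predicted type)."""
    results = []
    for cand in find_candidates(proteins, flank):
        if is_probably_functional(cand):
            results.append((cand.seq_id, cand.position, classify(cand, library)))
    return results


if __name__ == "__main__":
    # Invented toy sequences; '*' marks a stop codon in a translated frame.
    proteins = {"orf1": "AAAKLYMDDILVGG", "orf2": "CC*YVDDQQ"}
    library = {"retrovirus": "AKLYMDDILV", "LINE": "GGGYADDTTT"}
    print(run_pipeline(proteins, library, flank=3))
```

In practice each stage would be replaced by a more robust method, such as profile or HMM searches for step 1 and alignment-based comparison for step 3. The point of the sketch is only the staged structure, in which each step narrows the data passed to the next.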
Creating the Prototype Browsable Database for the
Retroid Agents
In this age of information technology it is important to
deposit newly generated in silico data into a framework that creates new
research and development tools. The 2nd goal is to design and implement
a meaningful prototype browsable database of all existing and new data
relevant to identifying and classifying all Retroid Agents. These data will
be available for display through the UCSC Genome Browser.
Constructing Evolutionary HMMs for Distantly Related
Sequences
The 3rd goal is to construct evolutionary hidden Markov
models (HMMs) that incorporate information on variable and invariant
regions of a protein so that better multiple alignments can be generated
for phylogenetic reconstruction, and protein-protein interaction
determination.
A wide variety of hypotheses can be addressed as Retroid
genomes are identified and classified. For example, the issue of the
distribution of active versus inactive agents can be addressed on a
genome-wide and per-chromosome basis. The frequency of insertional
mutagenesis and regulation of host genes by Retroid Agents can also be
ascertained.
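Once agents have been mapped and classified, the active versus inactive question reduces to a simple tally. The Python sketch below assumes, purely for illustration, that each mapped agent is available as a (chromosome, is_active) pair; the record layout and function names are hypothetical.

```python
from collections import Counter, defaultdict


def tally_by_chromosome(agents):
    """Count active and inactive agents per chromosome.

    agents: iterable of (chromosome, is_active) pairs.
    """
    per_chrom = defaultdict(Counter)
    for chrom, is_active in agents:
        per_chrom[chrom]["active" if is_active else "inactive"] += 1
    return per_chrom


def genome_wide(per_chrom):
    """Sum the per-chromosome counts into one genome-wide tally."""
    total = Counter()
    for counts in per_chrom.values():
        total.update(counts)
    return total


def active_fraction(counts):
    """Fraction of agents classified as active (0.0 if there are none)."""
    n = counts["active"] + counts["inactive"]
    return counts["active"] / n if n else 0.0


if __name__ == "__main__":
    # Invented example records, not real data.
    agents = [("chr1", True), ("chr1", False), ("chr1", False), ("chrX", True)]
    per_chrom = tally_by_chromosome(agents)
    for chrom in sorted(per_chrom):
        print(chrom, dict(per_chrom[chrom]), active_fraction(per_chrom[chrom]))
    print("genome", dict(genome_wide(per_chrom)))
```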
Bioinformatics is the interplay between empirically
derived biological knowledge and the availability of various types of
data (e.g., sequence, structure, etc.) on the one hand, and the
development of strategies and methods for the analysis and
interpretation of these data sources, on the other. Bioinformatic
research includes the development and testing of software tools
necessary to generate new knowledge from primary source information
deposited in databases and the literature. In the broadest sense,
Bioinformatics is the generation of new knowledge from existing data.
All of my research is conducted in silico, i.e., within the computer
environment. Many of the results of my analyses are functional genomic
predictions that have been or can be tested by laboratory
experimentation. For example, in 1987 I predicted the correct order of
the two domains of the RNA-dependent-DNA-polymerase (RdDp): the reverse
transcriptase (RT) domain is in the amino portion while the ribonuclease
H (RH) function resides in the carboxyl portion. This prediction was
experimentally verified to be correct in 1988. This work was one of the
earliest studies to predict function from comparison of distantly
related protein sequences that was subsequently experimentally verified
to be correct, contrary to earlier published experimental
studies.
Given the pace at which biological information is
deposited into international databases it is imperative that 21st
century scientists train in both discovery and hypothesis-based research
that fully utilizes these growing data resources. Inadequate, even
faulty, Bioinformatic analysis will severely limit what can be learned
from existing data sources. My laboratory has been engaged in
Bioinformatic analysis for over a decade. My in silico research is
dedicated to the Bioinformatic analyses of RNA-based life forms, i.e.,
all RNA viruses, and other genetic agents that require an
RNA-intermediate in their life-cycle. My research is the interplay
between empirically derived biological data sources, Bioinformatic
tools, and human decision making in the creation of new knowledge about
the evolution, structure, and function of RNA-based life forms and their
impact on DNA genomes. The analytical portion of my research agenda has
two major components: the analysis of all genetic information encoding
the RT, and the elucidation of the protein relationships and
interactions of the transcription/replication complex of the order
Mononegavirales (e.g., measles, rabies and Ebola) using Bioinformatic
tools. The major emphasis of my research agenda is the description of a
new strategy for the identification, classification and mapping of all
RT-encoding sequences in all genomes.
Retroid Agents were once considered to be "junk" DNA, but they have been
well studied for over a decade now, and the phylogenetic relationships
among them are well documented. Retroid Agents are co-evolving with host genomes,
some through mutualism. The relationships among and between Retroid and
host genomes are complex. Retroid Agents are ubiquitous in Eukaryotes.
They appear to have evolved as both endogenous and exogenous genetic
information maintaining a stable enzymatic core of genes while
occasionally acquiring unique genes from their hosts. These discrete
RNA-dependent genomes are involved in regulation of general host cell
genes and reproduction, as well as DNA repair, and the disease process.
As data accumulate from organismal genome projects, the variety of roles
and the underlying molecular mechanisms responsible for Retroid Agent
action expand. In the last two years alone over three hundred
publications have described the richness of the cellular niches that
these agents occupy in a wide variety of Eukaryotes. The impact of the
co-evolution of Retroid Agents and DNA-based life forms is just
beginning to be revealed. Given this long-term association between these
two types of genetic information systems, Retroid Agents can also be
used as evolutionary markers for their hosts. My interest in Retroid
Agent evolution is focused on the impact that these agents have on
the regulation and disease processes in humans given that they comprise
approximately 20% of the Human Genome.
Based upon available data the Human Genome contains only
two types of Retroid Agents: retroviruses (exogenous and endogenous)
and retroposons. While the pathogenicity of HIV and HTLV is well known,
only recently has an MMTV-like endogenous virus been found in
association with some human breast cancers. Recent data indicate that
other human endogenous retroviruses (HERVs) are associated with a wide
variety of human diseases: testicular tumors, insulin dependent diabetes
mellitus, and multiple sclerosis. One of the most fascinating examples
suggests that three different alleles of endogenous HRES-1, a human
T-cell lymphotropic type I virus, confer susceptibility to systemic
lupus erythematosus (SLE) and autoreactivity. The data suggest that
allelic type 1 of HRES-1 may confer protection against SLE.
The first evidence that HERVs play a role in host cell
regulation came from the amylase gene cluster. Other HERVs
have been demonstrated to regulate the HLA-DRB gene, the leptin gene,
obese, the HHLA 2 and 3 genes of unknown function, the apolipoprotein
CI gene, and the endothelin B receptor of placental cells. HERVs are
also involved in the regulation of human reproduction. Two very specific
roles of HERVs in reproduction reveal the depth of co-evolution between
the host and the virus. HERV expression is involved in human sperm-egg
binding and fusion. It has been demonstrated that HERV-W encodes the
adhesion protein, syncytin. This virally encoded protein adheres to the
trophoblast forming the syncytiotrophoblast that binds to the
endometrium in the early stage of human placental development. To date
no human gene has been discovered that encodes this function.
As data accumulate to support the importance of HERV
involvement in human reproduction, development and disease, the evidence
for the co-evolution of the distantly related retroposons with the
Human Genome is also mounting. To date, retroposon insertion into genes
has been shown to cause several human diseases. Hemophilia A is
caused by the insertion of a long interspersed nuclear element, or
LINE, into the factor VIII gene. LINE insertion into the dystrophin
gene, first suggested in 1993, is responsible for Duchenne muscular dystrophy.
Fukuyama-type congenital muscular dystrophy, a common autosomal
recessive disorder in the Japanese population, is caused by insertion of
a truncated LINE into the fukutin gene. Interestingly, in the cases of
Hemophilia A and Duchenne muscular dystrophy, the insertional
mutagenesis has occurred very recently since the mutation is absent in
the genes of the parents of those afflicted. Two different X-linked
disorders are also caused by LINE insertional mutagenesis. Alport
Syndrome-Diffuse Leiomyomatosis presents as hematuria, progressive renal
failure, hearing loss and ocular abnormalities in males while in
females it is milder. The type IV collagen gene is disrupted by the
presence of a LINE. Chronic Granulomatous Disease is a severe congenital
immunodeficiency syndrome caused by defects in NADPH oxidase in
phagocytic leukocytes. In this case, a new exon is created by a LINE
insertion into human X-linked CYBB gene that encodes the gp91-phox
subunit of the oxidase. At least two cases of LINE regulation of host
genes have been demonstrated to date: the thymidylate synthetase and Apo
A genes.
The reports on the number and distribution of
retroposons in the Human Genome estimate that they make up approximately
16% of total information. There are no analyses of the functionality or
roles retroposons play in the Human Genome in either report on
sequencing the Human Genome; however, the importance of cataloguing the
evolutionary events that these agents have left on genomic landscapes
is clearly stated. It is abundantly evident, as first suggested by
Barbara McClintock, that transposable elements play a major role in the
evolution and development of Eukaryotes. The full extent to which HERVs
and retroposons play critical roles in human evolution, development and
disease will only be known when all such agents are identified, mapped
and evaluated as to probable function. The pilot studies and
experimental design I have developed clearly show that such a global
approach to this problem is feasible. The creation of new research and
development tools of this nature will accelerate the rate at which we
understand the pathways of complexity that have evolved between host
and Retroid genomes, and the roles they play in the human disease
process.
Retroid Agent Distribution On Human Chromosomes For Each Genome Iteration
Power point presentation: Presentation