Joe Hallock
COM 302TS
University of Washington
Final Project - Print Version
December 15, 2002
Home Page
This web-based essay focuses on an array of
issues to help people become familiar with bioinformatics and how
it’s changing complex biological research. Simply put, bioinformatics
is a blend of biology, information, and mathematics to analyze complex
biological issues. Mr. F. C. Kohli, the former Deputy Chairman of
Tata Consultancy Services (India), said that bioinformatics can
be broadly defined as the interface between Life Sciences and Computational
Sciences (BioInformatics, page 1). There’s a correlation
between advances in technology and the increased pace at which
data is being produced. The need for faster, more accurate
analysis is the foundation of bioinformatics. With computing power,
and the right software tools, scientists have increased the rate
at which new discoveries are made and questions answered.
Before you start reading the essay I’d
like to go over some common biological information so you don’t
feel like you’re in the dark. That said, an organism’s
hereditary and functional information is stored as DNA, RNA, and
proteins. All these are linear chains composed of smaller molecules.
Each of these chains is assembled from a fixed alphabet of
well-understood chemicals. For example, DNA is composed of four
deoxyribonucleotides (adenine, thymine, cytosine, and guanine) or,
if you remember 8th-grade biology class, A, T, C, and G.
Because these chains are built from defined components, they can be represented
as sequences of symbols. These sequences can then be
compared to find similarities that suggest the molecules are related
by form or function (1). After that you may be feeling a little overwhelmed.
Don’t worry, I plan to cover the basics of bioinformatics
and won’t spend much time on words that have more than 4 syllables.
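To make the idea of comparing symbol sequences concrete, here is a minimal Python sketch. It assumes two DNA sequences of equal length and simply counts the positions where the letters agree; the function names and the example sequences are made up for illustration and do not come from any tool discussed in this essay.

```python
DNA_ALPHABET = set("ATCG")


def is_dna(sequence: str) -> bool:
    """Return True if every symbol belongs to the four-letter DNA alphabet."""
    return set(sequence.upper()) <= DNA_ALPHABET


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    # Two invented sequences that differ at two of seven positions.
    print(round(percent_identity("GATTACA", "GACTATA"), 1))
```

Real comparison tools also handle sequences of different lengths and allow gaps, which this position-by-position count does not.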
Navigation
Using the tabs at the top of this page, you’ll
be able to navigate the different sections of this site. The first
tab, “What,” explores the question What is Bioinformatics?
This section describes the past, present, and future possibilities
of bioinformatics. The second tab explores Where bioinformatics
is used and how different institutions cope with their bioinformatics
issues. The next tab focuses on Why bioinformatics is important.
Next I explore the Technology that’s used to gather, compute, and
store biological information. The Glossary tab will help you with
technical terms found throughout this essay. And finally the Credits
tab displays parenthetical references and my contact information.
Past
In the late 1960s Margaret O. Dayhoff
gathered all the available sequence data she could find to create
the first bioinformatics database. The data she gathered came from
the first 10 years of protein sequences (including the first ever
reported sequence, that of bovine insulin, in 1956) and the first nucleic
acid sequence, that of yeast alanine tRNA. In 1972 the Protein Data Bank,
which stores 3-dimensional protein structures, was released with
10 entries. This database has since grown to over 10,000 entries. In the late
1970s and early 1980s databases were spread out all over
the world, and it wasn’t until 1986 that a large collection
of protein databases was consolidated into a formal database called
SWISS-PROT. The United States government was aware of the importance
of biomedical research, but it wasn’t until November 4th,
1988 that Senator Claude Pepper and President Ronald Reagan passed
legislation for the establishment of the National Center for Biotechnology
Information (NCBI). This was to be a division of the National Library
of Medicine (NLM) which is located at the National Institutes of
Health (NIH), the world's largest biomedical research facility.
NCBI is now the official agency for creating new technologies that
help control health and disease through molecular and genetic processes.
Databases multiplied quickly from the late 1950s
until today, but the tools to help researchers developed at a slower
pace. The first tools to search these databases looked for keyword
matches and short sequence words. BLAST (Basic Local Alignment Search
Tool) was (and still is) used to compare sequences. BLAST uses a
heuristic algorithm that seeks local, as opposed to global, alignments
and is therefore able to detect relationships among sequences that
share only isolated regions of similarity.
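The difference between local and global alignment is easier to see in code. The sketch below is not BLAST. It is the exact dynamic-programming method (usually called Smith-Waterman) that heuristics like BLAST are designed to approximate quickly. The scoring values (match, mismatch, gap) are arbitrary assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2,
                          mismatch: int = -1,
                          gap: int = -2) -> int:
    """Best local alignment score between a and b (Smith-Waterman).

    A cell is never allowed to drop below zero, so an alignment can start
    and stop anywhere. That is what makes the alignment local, not global.
    """
    previous_row = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current_row = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous_row[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current_row[j] = max(0,
                                 diagonal,               # align a[i-1] with b[j-1]
                                 previous_row[j] + gap,  # gap in b
                                 current_row[j - 1] + gap)  # gap in a
            best = max(best, current_row[j])
        previous_row = current_row
    return best


if __name__ == "__main__":
    # The two invented sequences share only the isolated region GATTACA.
    print(local_alignment_score("TTTTGATTACATTTT", "CCCCGATTACACCCC"))
```

The two example sequences are unrelated except for one shared stretch, yet the score still reflects that stretch in full, which is the behavior the paragraph above describes.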
Present
Since the early efforts, significant advances
have been made in automating the collection of sequence information.
The sequencers, computers, and rate at which data is being produced
have all increased in speed. Since 1988, the sequence information
generated by the human genome research has been stored so future
applications and experiments can proceed without repeating the work.
The volume of data stored is so massive that if compiled into books,
the data would fill 200 volumes, each containing 1000 pages.
Moreover, reading every book would require 26 years
of around-the-clock work to complete the set (this figure doesn’t
include the time it would take to understand the science beforehand).
In fact, if you sequenced every person on earth
and gathered the differences between them, the result would
contain 15 quadrillion entries (that’s 15,000,000,000,000,000) (3).
As you can see, the present challenges facing
bioinformaticians are to improve database design, develop better
software for database access and manipulation, and devise data-entry
procedures to compensate for the varied computer procedures and
systems used in different laboratories (3). Like most industries, the
technology and procedures that run bioinformatics today are based
on the technology and procedures that were developed 20-30 years
ago. Steve Gardner, the author of The Evolution of Bioinformatics,
states, “Isolated, ad hoc systems based around CGI, Perl and
flat file based analysis systems may provide adequate interfaces
for individual databases, but they will not support the degree of
scalability, flexibility or integration that is required for modern
pharmaceutical research.” Today there are many labs that still
use paper to track their data. Over time huge collections of three-ring
binders develop, and analyzing the data in them becomes
very difficult. Today’s role for most bioinformatics companies
is to pull these labs away from their paper-based reporting
systems and move them to electronic, web-based systems. Adding, storing,
retrieving, and sharing information becomes easy once the data is
made electronic. Geospiza specializes in laboratory information
management. Their web-based Finch Server manages sequences, instruments,
and data files and will even assemble shorter sequences together into
contigs (a contig is a continuous sequence of DNA that has been assembled from
overlapping cloned DNA fragments; The Encyclopedia of Molecular
Biology, Blackwell, 1994).
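The definition of a contig given above can be sketched in a few lines of Python. This is only an illustration of joining two overlapping fragments. It is not how Finch Server or any real assembler works, and the function names, the minimum overlap of 3, and the example fragments are all assumptions made for the example.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` symbols exists.
    """
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_fragments(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two fragments into one continuous sequence if they overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep the shared region once, then append the rest of the right fragment.
    return left + right[size:]


if __name__ == "__main__":
    # Invented fragments: the end of the first matches the start of the second.
    print(merge_fragments("ATTAGACCTG", "CCTGCCGGAA"))
```

A production assembler also has to cope with sequencing errors, fragments read from either strand, and many thousands of fragments at once, none of which this sketch attempts.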
Future
The future of bioinformatics has predictable
outcomes. By this I mean we know that the speed of computers, the
quality of the data, and the volume of data produced will all
increase. The unpredictable part of
the future is that we don’t know what this will mean for humanity,
the economy, or medicine. It’s safe to say that most people
believe that the outcomes (for all three) will be positive.
I interviewed two people who currently work in
bioinformatics: Todd Smith, Ph.D. and Sandra Porter, Ph.D. I asked
them what they thought the future had to offer bioinformatics.
Todd Smith Ph.D. on the Future of Bioinformatics (8):
“It will continue to grow. Molecular biology, biochemistry and other
fields are becoming increasingly information driven. There are
many efforts to try and organize knowledge and make it easier
to access, but I think there will be few successes and many
failures. Just as the English language continues to grow and
change, so will our biological information. One significant
change I see is how researchers publish. The
paper-based journal articles will have decreasing value and
be replaced with web-based information resources. Since I
am describing an evolutionary process, I do not see anything
big changing the course - except for being wrong. What I cannot
predict is what the "hot" projects will be. Computers
will continue to increase in power and our need to ask increasingly
complex questions will drive the need for continual software
development.”
Sandra Porter Ph.D. on the Future of Bioinformatics
(9):
“Having used bioinformatics for eighteen years, I find it interesting
that the major bioinformatics tools I used as a graduate student,
FASTA (now BLAST), and MedLine (now PubMed) are still the
major bioinformatics tools used by most biologists. I would
guess that may be true ten years from now as well.
“Changes will most likely come
about in terms of computing speed, experimental scale and
meeting the needs of FDA for clinical diagnosis.
“In terms of sequence analysis,
a BLAST search comparing one gene against a database is quick,
but the speed drops as the complexity of the experiment increases.
Comparing many genes to a database is slow, genome to a database,
slower yet, and translating many genes into all six potential
protein sequences, and comparing them to a database is far
slower. If one were to do this work with a genome like yeast
and compare it to GenBank, it could require months of computing
time.
“The day will come when we will
use faster computers and be able to compare several genomes
to other genomes in a few minutes’ time.
“Another area where scale is driving
change is in data storage. People are doing experiments that
require larger quantities of data, for example microarray
experiments with 5000 genes tested in one experiment. The
tools for storing, processing, analyzing, and retrieving data
from these types of experiments are still in the making.
“Lastly, the area that will see
an enormous change is clinical diagnostics. The time will
come when everyone will have their genome sequenced or probed
in some way. The emergence of personalized medicine and designer
drugs will make it possible to get a personal prescription
for the types of foods to eat, drugs that are most effective,
and things to avoid. In order to do this, we will need informatics
to process and analyze this information.
“The FDA has very stringent requirements
for tracking information and ensuring privacy and security.
Few bioinformatics tools are robust and secure enough to meet
the FDA's requirements for electronic records. There are opportunities
for companies to work in this area since there are many patients
and physicians who wish to use genetic information in medicine.”
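Porter’s mention of translating genes into “all six potential protein sequences” refers to reading a piece of DNA in three offsets on the strand you have and three more on its complementary strand. Here is a minimal Python sketch of that idea using the standard genetic code. The function names are illustrative and not taken from BLAST or any other tool, and unknown codons are shown as "X" purely as a convention for this example.

```python
from itertools import product

# Standard genetic code, listed with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons into one-letter amino acids ('*' is stop)."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> list:
    """Three reading frames on the given strand, then three on the reverse."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


if __name__ == "__main__":
    for frame in six_frame_translations("ATGGCCTAA"):
        print(frame)
```

Every input sequence turns into six protein sequences, each of which then has to be searched against the database, which is why this kind of comparison is so much slower than a single-gene search.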
Simply stated, the availability of cash will
be the motivating factor in the future of bioinformatics. You may
think to yourself that my previous statement is completely obvious…after
re-reading it I thought the same thing. In more specific terms,
economics will drive science. Money in the form of federal grants,
venture capital infusions, and profit from the sale of software
will push for new technology, the next big prescription drug, and
the ability to search a few petabytes (a petabyte is roughly 1,000,000 gigabytes)
in an afternoon. As large pharmaceutical companies and research
institutes demand larger data storage options, more powerful computational
technologies, faster sequencers, and robust software solutions,
companies like IBM, HP, Applied Biosystems, and Geospiza will be
there waiting to serve them. These companies will supply whatever
is needed…as long as the lab can pay.
Commercial
The majority of bioinformatics technology is
used in pharmaceutical companies and core labs. Pharmaceutical companies
tend to push for fast action in their sequencing labs so that new drugs
can be discovered, tested, and brought to market ahead of their competitors.
The Research and Development (R&D) of new drugs (which is also
the point at which competitive advantage can be gained most easily)
has moved from the creation of data on a small number
of lead compounds to the manipulation and analysis of data to identify
new targets (2). Gardner states that how effectively and efficiently
a large pharmaceutical company moves toward this
new culture is critical to its long-term profitability. Moreover,
he states that the effective integration and use of information
will become the single biggest differentiator of pharmaceutical
R&D competitive advantage in the next decade.
Core labs are the next service industry in biotechnology. As universities,
research institutions and small biotech companies conduct their
research, the costs can become colossal. With sequencing machines
costing as much as $250,000 and other lab costs (like the employment
costs of senior scientists and lab technicians) increasing annually,
a small lab’s ability to compete is directly related to its
capacity to spend money. Those costs and the related administrative
costs, such as lab space and materials, add up quickly, and all of them
must be paid before any data can be created. As you can
see, it becomes very difficult for a small lab to function.
One solution is to set up one large lab and charge these small companies
and universities a fee to sequence their samples for them, thereby
changing the lab from a massive operational cost to an affordable
research expense.
Academic / Not for Profit
Bioinformatics has become an integral part
of universities and not-for-profit organizations. Using
computers to store and analyze data saves universities money and doubles as an
educational tool for students. The University of Washington was
an early adopter (and creator) of bioinformatics technologies. Software
like Phrap, Phred, and Consed all emerged from the University of
Washington and are used world-wide for testing the quality of DNA
sequence data. (For more information on Phrap, Phred and Consed,
click on the glossary tab at the top of the page or visit http://www.phrap.org.)
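As a small illustration of what “testing the quality” of sequence data means, the quality scores associated with Phred are defined as Q = -10 × log10(P), where P is the estimated probability that a base was called incorrectly. The Python sketch below only converts between the two numbers. The function names are made up for this example and are not part of the Phred software.

```python
import math


def phred_to_error_probability(quality: float) -> float:
    """Estimated chance that a base call with this quality score is wrong."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability: float) -> float:
    """Quality score for a given estimated error probability (0 < p <= 1)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be greater than 0 and at most 1")
    return -10 * math.log10(probability)


if __name__ == "__main__":
    # A score of 20 means about 1 wrong call in 100; 30 means about 1 in 1,000.
    for quality in (10, 20, 30):
        print(quality, phred_to_error_probability(quality))
```

Higher scores mean more trustworthy base calls, so a lab can use a cutoff score to decide which parts of a read are good enough to keep.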
Not-for-profit organizations and government-funded labs also take
advantage of the cost (and time) savings that bioinformatics technology
offers. The Institute for Genomic Research (TIGR) is a not-for-profit
research institute whose primary research interests are in structural,
functional and comparative analysis of genomes and gene products
from a wide variety of organisms including viruses, eubacteria (both
pathogens and non-pathogens), archaea (the so-called third domain
of life), and eukaryotes (plants, animals, fungi and protists such
as the malarial parasite) (6).
Research
Over the last couple of sections you’ve seen what
bioinformatics is and where it’s used. In this section I’ll
go into some detail as to why bioinformatics is important. The ability
to improve research is the fundamental goal of a bioinformatician.
So far bioinformatics has done this by increasing the speed of the
analysis, retaining the quality of the data, and making it easier
to do science instead of busy work. One could say that the goal
of biology, in the area of genomics, is to develop a quantitative
understanding of how living things are built from the genome that
encodes them. This goal is going to take a lot of human thinking
and computational processing to realize. Andrew Leonard, the Senior
Systems Engineer at Geospiza, told me once that a large amount of
information isn’t necessarily the goal; it's the meaning in
that information that's important. Bioinformatics is helping researchers
find the meaning in their information and allowing them to do better
research.
Economic Benefits
The economic benefits are clear but sometimes
hard to calculate. Setting up a lab is expensive, but so is the
technology behind bioinformatics. Rack-mounted computers, high-throughput
sequencers, and the technical staff to run the operation are all
expensive. The savings appear only if the lab is managed well over
a period of time. How long that period lasts depends directly
on the lab’s ability to create income and become profitable.
Businesspeople around the world (especially in
the Northwest United States) have a vision of the potential of bioinformatics.
According to data cited by Chemical & Engineering News [Feb.
7, 2000], bioinformatics sales have a market potential of $2.5 billion
by 2005 (10). As I stated earlier, the Northwest (Seattle in particular)
has become a hotbed of biotechnology companies. As many as 80 biotechnology
firms are in Seattle today and, unlike the dot-coms (with which they
compete for funding), biotechnology firms may not be known for losing
money. WaBio.com reported that in 1998 7,100 people were employed
by biotechnology firms; it estimates that this number will triple
by 2005 and that indirect employment should double that figure.
Even Bill Gates has put his hand in the bucket. He sits on the board
of ICOS Corporation and states that, “Biotechnology is a booming
business, and a large part of its promise depends on its use of
bioinformatics.”
Humanitarian Benefits
Humanitarian benefits are the most important. How
will bioinformatics help the human race as we grow in population
and our environment becomes more and more complicated by artificial
chemicals and pollutants? I went back to Sandra Porter and asked
her what she would tell the government of a developing country to
persuade them to use bioinformatics.
“I would say that many scientists
in developing countries are already aware that bioinformatics
is important. About three years ago one of my students, an immigrant
from Somalia, asked me how hard it would be to find genes in
GenBank related to drought resistance in plants. We found plant
drought resistance genes in five minutes. With current technology,
scientists could clone those genes and put them into plants
in a matter of months, if the legal issues could be worked out.
“If I showed leaders how
to find this information, I don't think they would need any
more convincing.
“If they were still unsure, I would
show them recent examples of genomics studies with the malaria
parasite and the bacterium, Mycobacterium tuberculosis. In
both of these cases, bioinformatics has recently been used
to identify potential drug targets for treating these diseases.”
In this section I’ll gloss over some of the
technology used in bioinformatics. I’ve broken this section
into 2 parts: Sequencing Instruments and Computing Instruments.
Sequencing Instruments
Applied Biosystems is a leader in the sequencer instrument
field. Applied Biosystems’ 3730 DNA Analyzer (pictured below)
represents the next generation of high-throughput sequencing and
fragment analysis platforms.
Beckman Coulter, Inc. is another company that specializes
in laboratory instruments. Their sequencing product (the CEQ 2000XL
– pictured below) is a fully-automated system that achieves
high-resolution DNA analysis.
Computing Instruments
Computing power is one of the most important factors
when doing large experiments. The completion of the human genome
sequence (formally announced on June 26, 2000) involved more than
500 million trillion calculations during the process of assembling
the sequences alone. This was the biggest exercise in the history
of computational biology (3). Understanding the importance of processing
power, IBM announced in December of 1999 that it was going to
fund a $100 million exploratory research initiative to build
a supercomputer 500 times more powerful than the world’s
fastest existing computer and 2 million times faster than today’s
fastest desktop PC (3).
Apple is getting into the picture as well. Their
new line of servers (the Xserve – pictured below) is one of
the fastest tools for performing BLAST searches (21 times faster
than IBM’s current comparable server). BLAST searches account for
70% of all computing time in the biotechnology industry (14) and
Apple is going to be there when companies decide to upgrade.