An Introduction to Bioinformatics

Joe Hallock
COM 302TS
University of Washington
Final Project - Print Version
December 15, 2002

Home Page

This web-based essay focuses on an array of issues to help people become familiar with bioinformatics and how it’s changing complex biological research. Simply put, bioinformatics is a blend of biology, information, and mathematics used to analyze complex biological problems. Mr. F. C. Kohli, the former Deputy Chairman of Tata Consultancy Services (India), said that bioinformatics can be broadly defined as the interface between Life Sciences and Computational Sciences. (BioInformatics, page 1) There’s a correlation between advances in technology and the increased pace at which data is being produced. The need for faster, more accurate analysis is the foundation of bioinformatics. With computing power and the right software tools, scientists have increased the rate at which new discoveries are made and questions are answered.

Before you start reading the essay I’d like to go over some common biological information so you don’t feel like you’re in the dark. An organism’s hereditary and functional information is stored as DNA, RNA, and proteins. All of these are linear chains composed of smaller molecules. Each of these chains is assembled from a fixed alphabet of well-understood chemicals. For example, DNA is composed of four deoxyribonucleotides (adenine, thymine, cytosine, and guanine) or, if you remember from biology class in 8th grade, A, T, C, and G. Because these chains are built from defined components, they can be represented as sequences of symbols. Furthermore, these sequences can then be compared to find similarities that suggest the molecules are related by form or function.(1) After that you may be feeling a little overwhelmed. Don’t worry, I plan to cover the basics of bioinformatics and won’t spend much time on words that have more than 4 syllables.
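To make the idea of "sequences of symbols" a little more concrete, here is a minimal Python sketch. It treats a DNA strand as a plain string over the letters A, C, G, and T and counts how many positions two equal-length strands share. The function names (is_dna, percent_identity) and the example strands are made up for illustration; real comparison tools also handle gaps and unequal lengths, which this sketch does not.

```python
DNA_ALPHABET = set("ACGT")  # the four-letter alphabet described above


def is_dna(sequence: str) -> bool:
    """Return True if every symbol in the string is A, C, G, or T."""
    return set(sequence.upper()) <= DNA_ALPHABET


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two equal-length sequences match.

    Illustrative only: no gaps are allowed, so the sequences must
    already be the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    first = "GATTACA"
    second = "GACTACA"  # differs from the first strand at one position
    print(is_dna(first), is_dna(second))
    print(round(percent_identity(first, second), 1))
```

A high percentage hints that two sequences may be related, which is the intuition behind the more sophisticated comparisons discussed later in this essay.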

Navigation

Using the tabs at the top of this page, you’ll be able to navigate the different sections of this site. The first tab, “What,” explores the question What is Bioinformatics? This section describes the past, present, and future possibilities of bioinformatics. The second tab explores Where bioinformatics is used and how different institutions cope with their bioinformatics issues. The next tab focuses on Why bioinformatics is important. Next I explore the Technology that’s used to gather, compute, and store biological information. The Glossary tab will help you with technical terms found throughout this essay. And finally, the Credits tab displays parenthetical references and my contact information.

Past

In the late 1960’s Margaret O. Dayhoff gathered all the available sequence data she could find to create the first bioinformatics database. The data she gathered came from the first 10 years of protein sequences (including the first ever reported sequence in 1956 of bovine insulin) and the first nucleic acid sequence of yeast alanine tRNA. In 1972 the Protein DataBank, which holds 3-dimensional protein structures, was released with 10 entries. This database has grown to over 10,000 entries. In the late 70’s and early 80’s databases were spread out all over the world, and it wasn’t until 1986 that a large collection of protein databases was consolidated into a formal database called SWISS-PROT. The United States government was aware of the importance of biomedical research, but it wasn’t until November 4th, 1988 that legislation sponsored by Senator Claude Pepper and signed by President Ronald Reagan established the National Center for Biotechnology Information (NCBI). This was to be a division of the National Library of Medicine (NLM), which is located at the National Institutes of Health (NIH), the world's largest biomedical research facility. NCBI is now the official agency for creating new technologies that help control health and disease through molecular and genetic processes.

Databases multiplied quickly from the late 1950s until today, but the tools to help researchers developed at a slower pace. The first tools to search these databases looked for keyword matches and short sequence words. BLAST (Basic Local Alignment Search Tool) was (and still is) used to compare sequences. BLAST uses a heuristic algorithm which seeks local as opposed to global alignments and is therefore able to detect relationships among sequences which share only isolated regions of similarity.
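The difference between local and global alignment is easier to see in code. The sketch below is not BLAST itself (BLAST relies on heuristics to stay fast); it is the classic Smith-Waterman dynamic-programming approach to local alignment, shown only to illustrate what "local" means. The function name and the scoring values (+2 for a match, -1 for a mismatch or gap) are assumptions chosen for this example.

```python
def local_alignment_score(seq_a: str, seq_b: str,
                          match: int = 2, mismatch: int = -1,
                          gap: int = -1) -> int:
    """Best local alignment score between two sequences (Smith-Waterman).

    Scores are never allowed to drop below zero, so a strong region of
    similarity is not penalized by unrelated sequence on either side.
    """
    cols = len(seq_b) + 1
    previous_row = [0] * cols
    best = 0
    for i in range(1, len(seq_a) + 1):
        current_row = [0] * cols
        for j in range(1, cols):
            pair_score = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            diagonal = previous_row[j - 1] + pair_score  # align the two symbols
            up = previous_row[j] + gap                   # gap in seq_b
            left = current_row[j - 1] + gap              # gap in seq_a
            current_row[j] = max(0, diagonal, up, left)
            best = max(best, current_row[j])
        previous_row = current_row
    return best


if __name__ == "__main__":
    # The two strands share only the short region "ACGT" in the middle.
    print(local_alignment_score("TTTTACGTTTTT", "GGGGACGTGGGG"))
```

Because the score resets to zero wherever the similarity runs out, the shared middle region scores exactly as well as it would if the unrelated flanking sequence were not there. A global alignment, by contrast, would be charged for every mismatched position along the full length.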

Present

Since the early efforts, significant advances have been made in automating the collection of sequence information. The sequencers, computers, and rate at which data is being produced have all increased in speed. Since 1988, the sequence information generated by human genome research has been stored so future applications and experiments can proceed without repeating the work. The volume of data stored is so massive that if compiled into books, the data would translate to 200 volumes, each containing 1000 pages. Moreover, if one were to read each book it would require 26 years of around-the-clock work to complete the set (this figure doesn’t include the time it would take to understand the science beforehand). In fact, if you sequenced every person on earth and gathered the differences between each person, the result would contain 15 quadrillion entries (that’s 15,000,000,000,000,000).(3)

As you can see, the present challenges facing bioinformaticians are to improve database design, develop better software for database access and manipulation, and devise data-entry procedures to compensate for the varied computer procedures and systems used in different laboratories.(3) Like most industries, the technology and procedures that run bioinformatics today are based on the technology and procedures that were developed 20-30 years ago. Steve Gardner, the author of The Evolution of Bioinformatics, states, “Isolated, ad hoc systems based around CGI, Perl and flat file based analysis systems may provide adequate interfaces for individual databases, but they will not support the degree of scalability, flexibility or integration that is required for modern pharmaceutical research.” Today there are many labs that still use paper to track their data. Over time huge collections of three-ring binders develop, and taking that data and analyzing it becomes very difficult. Today’s role for most bioinformatics companies is to try to pull these labs away from their paper-based reporting systems and move them to electronic, web-based systems. Adding, storing, retrieving, and sharing information becomes easy once the data is made electronic. Geospiza specializes in laboratory information management. Their web-based Finch Server manages sequences, instruments, and data files and will even assemble shorter sequences together into contigs (a contig is a continuous sequence of DNA that has been assembled from overlapping cloned DNA fragments; The Encyclopedia of Molecular Biology, Blackwell, 1994).
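The contig definition above (a continuous sequence assembled from overlapping fragments) can be sketched in a few lines of Python. This is a toy greedy approach written for illustration only. It is not how the Finch Server or any production assembler works; real tools also weigh base quality and tolerate sequencing errors. The function names, the min_overlap parameter, and the example fragments are all hypothetical.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if the overlap is shorter than `min_overlap`.
    """
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest exact overlap.

    Stops when no remaining pair overlaps by at least `min_overlap`
    symbols, and returns whatever contigs are left.
    """
    contigs = list(fragments)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
    return contigs


if __name__ == "__main__":
    reads = ["ATGCGT", "CGTTAC", "TACGGA"]  # made-up overlapping fragments
    print(greedy_assemble(reads))
```

Each merge keeps the overlapping region only once, so three short reads that overlap end to end collapse into a single longer contig, while fragments with nothing in common are left as separate contigs.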

Future

The future of bioinformatics has predictable outcomes. By this I mean we know that the speed of computers, the quality of the data, and the volume of data produced will all increase. The unpredictable part of the future is that we don’t know what this will mean for humanity, the economy, or medicine. It’s safe to say that most people believe that the outcomes (for all three) will be positive.

I interviewed two people who currently work in bioinformatics: Todd Smith, Ph.D. and Sandra Porter, Ph.D. I asked them what they thought the future had to offer bioinformatics.

Todd Smith Ph.D. on the Future of Bioinformatics (8):

It will continue to grow. Molecular biology, biochemistry and other fields are becoming increasingly information driven. There are many efforts to try and organize knowledge and make it easier to access, but I think there will be few successes and many failures. Just as the English language continues to grow and change, so will our biological information. One significant change I see is how researchers publish.

The paper-based journal articles will have decreasing value and be replaced with web-based information resources. Since I am describing an evolutionary process, I do not see anything big changing the course - except for being wrong. What I cannot predict is what the "hot" projects will be. Computers will continue to increase in power and our need to ask increasingly complex questions will drive the need for continual software development.

Sandra Porter Ph.D. on the Future of Bioinformatics (9):

Having used bioinformatics for eighteen years, I find it interesting that the major bioinformatics tools I used as a graduate student, FASTA (now BLAST), and MedLine (now PubMed) are still the major bioinformatics tools used by most biologists. I would guess that may be true ten years from now as well.

Changes will most likely come about in terms of computing speed, experimental scale, and meeting the needs of the FDA for clinical diagnosis.

In terms of sequence analysis, a BLAST search comparing one gene against a database is quick, but the speed drops as the complexity of the experiment increases. Comparing many genes to a database is slow; comparing a genome to a database is slower yet; and translating many genes into all six potential protein sequences and comparing them to a database is far slower. If one were to do this work with a genome like yeast and compare it to GenBank, it could require months of computing time.

The day will come when we will use faster computers and be able to compare several genomes to other genomes in a few minutes’ time.

Another area where scale is driving change is in data storage. People are doing experiments that require larger quantities of data, for example microarray experiments with 5000 genes tested in one experiment. The tools for storing, processing, analyzing, and retrieving data from these types of experiments are still in the making.

Lastly, the area that will see an enormous change is clinical diagnostics. The time will come when everyone will have their genome sequenced or probed in some way. The emergence of personalized medicine and designer drugs will make it possible to get a personal prescription for the types of foods to eat, drugs that are most effective, and things to avoid. In order to do this, we will need informatics to process and analyze this information.

The FDA has very stringent requirements for tracking information and ensuring privacy and security. Few bioinformatics tools are robust and secure enough to meet the FDA's requirements for electronic records. There are opportunities for companies to work in this area since there are many patients and physicians who wish to use genetic information in medicine.

Simply stated, the availability of cash will be the motivating factor in the future of bioinformatics. You may think to yourself that my previous statement is completely obvious…after re-reading it I thought the same thing. In more specific terms, economics will drive science. Money in the form of Federal Grants, Venture Capitalist infusions, and profit from the sale of software will push for new technology, the next big prescription drug, and the ability to search a few petabytes (one petabyte is about 1,000,000 gigabytes) in an afternoon. As large pharmaceutical companies and research institutes demand larger data storage options, more powerful computational technologies, faster sequencers, and robust software solutions, companies like IBM, HP, Applied Biosystems, and Geospiza will be there waiting to serve them. These companies will supply whatever is needed…as long as the lab can pay.

Commercial

The majority of bioinformatics technology is used in pharmaceutical companies and core labs. Pharmaceutical companies tend to push for fast action in their sequencing labs so that new drugs can be tested, discovered, and brought to market before their competitors’ drugs. The Research and Development (R&D) of new drugs (which is also the point at which competitive advantage can be gained most easily) has moved from the creation of data on a small number of lead compounds to the manipulation and analysis of data to identify new targets.(2) Gardner states that the methods and efficiency with which a large pharmaceutical company moves toward this new culture are critical to its long-term profitability. Moreover, he states that the effective integration and use of information will become the single biggest differentiator of pharmaceutical R&D competitive advantage in the next decade.

Core labs are the next service industry in biotechnology. As universities, research institutions and small biotech companies conduct their research, the costs can become colossal. With sequencing machines costing as much as $250,000 and other lab costs (like the employment costs of senior scientists and lab technicians) increasing annually, the ability of a small lab to compete is directly related to its capacity to spend money. Those costs and the related administrative costs, such as lab space and materials, add up quickly. All of these costs have to be paid before any data can be created. As you can see, it becomes very difficult for a small lab to function. One solution is to set up one large lab and charge these small companies and universities a fee to do their sequencing for them, thereby changing the lab from a massive operational cost to an affordable research expense.

Academic / Not for Profit

Bioinformatics has become an integral part of universities and not-for-profit organizations. Using computers to store and analyze data saves universities money and doubles as an educational tool for students. The University of Washington was an early adopter (and creator) of bioinformatics technologies. Software like Phrap, Phred, and Consed all emerged from the University of Washington and are used world-wide for testing the quality of DNA sequence data. (For more information on Phrap, Phred and Consed click on the glossary tab at the top of the page or visit http://www.phrap.org.)

Not-for-profit organizations and government-funded labs also take advantage of the cost (and time) savings that bioinformatics technology offers. The Institute for Genomic Research (TIGR) is a not-for-profit research institute whose primary research interests are in structural, functional and comparative analysis of genomes and gene products from a wide variety of organisms including viruses, eubacteria (both pathogens and non-pathogens), archaea (the so-called third domain of life), and eukaryotes (plants, animals, fungi and protists such as the malarial parasite).(6)

Research

Over the last couple of sections you’ve seen what bioinformatics is and where it’s used. In this section I’ll go into some detail as to why bioinformatics is important. The ability to improve research is the fundamental goal of a bioinformatician. So far bioinformatics has done this by increasing the speed of the analysis, retaining the quality of the data, and making it easier to do science instead of busy work. One could say that the goal of biology, in the area of genomics, is to develop a quantitative understanding of how living things are built from the genome that encodes them. This goal is going to take a lot of human thinking and computational processing to realize. Andrew Leonard, the Senior Systems Engineer at Geospiza, told me once that a large amount of information isn’t necessarily the goal; it's the meaning in that information that's important. Bioinformatics is helping researchers find the meaning in their information and allowing them to do better research.

Economic Benefits

The economic benefits are clear but sometimes hard to calculate. Setting up a lab is expensive, and so is the technology behind bioinformatics. Rack-mounted computers, high-throughput sequencers, and the technical staff to run the operation are all expensive. The savings appear if the lab is managed well over a period of time. The length of that period depends directly on the lab’s ability to create income and become profitable.

Businesspeople around the world (especially in the Northwest United States) have a vision of the potential of bioinformatics. According to data cited by Chemical & Engineering News [Feb. 7, 2000], bioinformatics sales have a market potential of $2.5 billion by 2005.(10) As I stated earlier, the Northwest (Seattle in particular) has become a hotbed of biotechnology companies. As many as 80 biotechnology firms are in Seattle today, and unlike the dot-coms (with which they compete for funding), biotechnology firms may not be known for losing money. WaBio.com reported that in 1998, 7,100 people were employed by biotechnology firms; it estimates that this number will triple by 2005 and that indirect employment should double that figure. Even Bill Gates has put his hand in the bucket. He sits on the board of ICOS Corporation and states that “Biotechnology is a booming business, and a large part of its promise depends on its use of bioinformatics.”

Humanitarian Benefits

Humanitarian benefits are the most important. How will bioinformatics help the human race as we grow in population and our environment becomes more and more complicated by artificial chemicals and pollutants? I went back to Sandra Porter and asked her what she would tell the government of a developing country to persuade them to use bioinformatics.

I would say that many scientists in developing countries are already aware that bioinformatics is important. About three years ago one of my students, an immigrant from Somalia, asked me how hard it would be to find genes in GenBank related to drought resistance in plants. We found plant drought resistance genes in five minutes. With current technology, scientists could clone those genes and put them into plants in a matter of months, if the legal issues could be worked out.

If I showed leaders how to find this information, I don't think they would need any more convincing.

If they were still unsure, I would show them recent examples of genomics studies with the malaria parasite and the bacterium, Mycobacterium tuberculosis. In both of these cases, bioinformatics has recently been used to identify potential drug targets for treating these diseases.

The Technology behind Bioinformatics

In this section I’ll gloss over some of the technology used in bioinformatics. I’ve broken this section into two parts: Sequencing Instruments and Computing Instruments.

Sequencing Instruments

Applied Biosystems is a leader in the sequencer instrument field. Applied Biosystems’ 3730 DNA Analyzer (pictured below) represents the next generation of high-throughput sequencing and fragment analysis platforms.

Beckman Coulter, Inc. is another company that specializes in laboratory instruments. Their sequencing product (the CEQ 2000XL, pictured below) is a fully-automated system that achieves high-resolution DNA analysis.

Computing Instruments

Computing power is one of the most important factors when doing large experiments. The completion of the human genome sequence (formally announced on June 26, 2000) involved more than 500 million trillion calculations during the process of assembling the sequences alone. This was the biggest exercise in the history of computational biology.(3) Understanding the importance of processing power, IBM announced in December of 1999 that they were going to fund a $100 million exploratory research initiative to build a supercomputer that would be 500 times more powerful than the world’s fastest existing computer and 2 million times faster than today’s fastest desktop PC.(3)

Apple is getting into the picture as well. Their new line of servers (the Xserve, pictured below) is one of the fastest tools for performing BLAST searches (21 times faster than IBM’s current comparable server). BLAST searches account for 70% of all computing time in the biotechnology industry (14), and Apple is going to be there when companies decide to upgrade.

1.	BioInformatics: http://www.ewh.ieee.org/r10/bombay/news4/Bioinformatics.htm November-19-2002 (Date of Access).
2.	The Evolution of Bioinformatics: http://www.bitsjournal.com/sgard.html November-04-2002 (Date of Access).
3.	Bioinformatics – An aid for biological research: http://ebioarticles.ebioinfogen.com/bioinfo.htm December-14-2002 (Date of Access).
4.	Bioinformatics Birth and Mission: http://bear.cba.ufl.edu/teets/projects/ISM6222F102/farrelas/History.html December-14-2002 (Date of Access).
5.	BLAST Basic Overview: http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html December-14-2002 (Date of Access).
6.	About TIGR and Contact Information: http://www.tigr.org/about/ December-15-2002 (Date of Access).
7.	Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources : http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html November-04-2002 (Date of Access).
8.	Smith Ph.D., Todd. Personal Interview. December-13-2002.
9.	Porter Ph.D., Sandra. Personal Interview. December-13-2002.
10.	Wagner, Tracey. "Inside the Double Helix." Northwest Science and Technology March-2002: 17 - 24.
11.	XServe Performance Testing: http://www.apple.com/xserve/performance.html December-14-2002 (Date of Access).
12.	Geospiza Finch Suite: http://www.geospiza.com/products/finch-suite/index.htm December-15-2002 (Date of Access).