Wednesday, June 13, 2007

DNA Microarray

DNA Microarray - A technology that is reshaping molecular biology

It is widely believed that thousands of genes and their products (i.e., RNA and proteins) in a given living organism function in a complicated and orchestrated way that creates the mystery of life. However, traditional methods in molecular biology generally work on a "one gene in one experiment" basis, which means that the throughput is very limited and the "whole picture" of gene function is hard to obtain. In the past several years, a new technology, called DNA microarray, has attracted tremendous interest among biologists. This technology promises to monitor the whole genome on a single chip so that researchers can have a better picture of the interactions among thousands of genes simultaneously.

Terminologies that have been used in the literature to describe this technology include, but are not limited to: biochip, DNA chip, DNA microarray, and gene array. Affymetrix, Inc. owns a registered trademark, GeneChip, which refers to its high-density, oligonucleotide-based DNA arrays. However, in some articles that have appeared in professional journals, popular magazines, and on the WWW, the term "gene chip(s)" has been used as a general term for microarray technology. Affymetrix strongly opposes such usage of the term "gene chip(s)". More recently, I have come to prefer the term "genome chip", indicating that this technology is meant to monitor the whole genome on a single chip. The term would also cover the increasingly important and feasible protein chip technology.

Base-pairing (i.e., A-T and G-C for DNA; A-U and G-C for RNA), or hybridization, is the underlying principle of DNA microarray.
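The pairing rule can be sketched in a few lines of Python. This is only an illustration: the function names `reverse_complement` and `can_hybridize` are made up for this post, and real hybridization tolerates mismatches and depends on temperature and salt, which the sketch ignores.

```python
# Watson-Crick pairing rules for DNA as stated above (A-T and G-C).
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the strand that base-pairs with `seq`, written 5' to 3'."""
    return "".join(DNA_PAIRS[base] for base in reversed(seq.upper()))


def can_hybridize(probe: str, target: str) -> bool:
    """True if `target` contains a stretch perfectly complementary to `probe`.

    Assumption: only exact, full-length matches count. This is a deliberate
    simplification of what happens on a real chip.
    """
    return reverse_complement(probe) in target.upper()
```

For instance, `can_hybridize("AAGC", "TTGCTTAA")` is true because the target contains `GCTT`, the reverse complement of the probe.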

An array is an orderly arrangement of samples. It provides a medium for matching known and unknown DNA samples based on base-pairing rules and for automating the process of identifying the unknowns. An array experiment can make use of common assay systems such as microplates or standard blotting membranes, and the array can be created by hand or by robotics that deposit the samples. In general, arrays are described as macroarrays or microarrays, the difference being the size of the sample spots. Macroarrays contain sample spots of about 300 microns or larger and can be easily imaged by existing gel and blot scanners. The sample spots in a microarray are typically less than 200 microns in diameter, and these arrays usually contain thousands of spots. Microarrays require specialized robotics and imaging equipment that generally are not commercially available as a complete system.

DNA microarrays, or DNA chips, are fabricated by high-speed robotics, generally on glass but sometimes on nylon substrates. Probes with known identity are used to determine complementary binding, thus allowing massively parallel gene expression and gene discovery studies. An experiment with a single DNA chip can provide researchers with information on thousands of genes simultaneously - a dramatic increase in throughput. There are two major application forms for the DNA microarray technology:

1) Identification of sequence (gene / gene mutation) and

2) Determination of expression level (abundance) of genes.

There are two variants of the DNA microarray technology, in terms of the property of the arrayed DNA sequence with known identity:

Format I: probe cDNA (500~5,000 bases long) is immobilized to a solid surface such as glass using robot spotting and exposed to a set of targets either separately or in a mixture. This method, "traditionally" called DNA microarray, is widely considered to have been developed at Stanford University. A recent article by R. Ekins and F.W. Chu seems to provide some generally forgotten facts.

Format II: an array of oligonucleotide (20~80-mer oligos) or peptide nucleic acid (PNA) probes is synthesized either in situ (on-chip) or by conventional synthesis followed by on-chip immobilization. The array is exposed to labeled sample DNA and hybridized, and the identity/abundance of complementary sequences is determined.

The microarray (DNA chip) technology is having a significant impact on genomics studies. Many fields, including drug discovery and toxicological research, will certainly benefit from the use of DNA microarray technology.

Design of a DNA Microarray System

There are several steps in the design and implementation of a DNA microarray experiment, and many strategies have been investigated at each of them: 1) DNA types; 2) Chip fabrication; 3) Sample preparation; 4) Assay; 5) Readout; and 6) Software (informatics).

Table 1. Steps in the design and implementation of a DNA microarray experiment

  • Probe (cDNA/oligo with known identity): Small oligos, (whole organism on a chip?)
  • Chip fabrication (putting probes on the chip): Photolithography, pipette, drop-touch, piezoelectric (ink-jet), electric, ...
  • Target (fluorescently labeled sample): RNA, (mRNA==>)
  • Assay: Hybridization, long, short, ligase, base addition, electric, MS, electrophoresis, fluocytometry, PCR-DIRECT, TaqMan, ...
  • Readout: Fluorescence, probeless (conductance, MS, electrophoresis), electronic, ...
  • Informatics: Robotics control, image processing, DBMS, WWW, bioinformatics, data mining and visualization

There are so many options and combinations, as can be seen from the number of companies involved in this business. It seems too early to judge who will be the winner(s) in this game. The forecast is further complicated by recent fights among companies over intellectual property issues.

Applications of DNA Microarray Technology

Disease diagnosis

Many "microfluidics" devices fall in this category. Although they are not the "traditional" gene chip or microarray, I decided to list related links at this site because of their close connection and integration to the gene chip (microarray) technology.

Drug discovery: Pharmacogenomics

Why do some drugs work better in some patients than in others? And why may some drugs even be highly toxic to certain patients? My favorite definition (modified): Pharmacogenomics is the hybridization of functional genomics and molecular pharmacology. The goal of pharmacogenomics is to find correlations between therapeutic responses to drugs and the genetic profiles of patients.

Toxicological research: Toxicogenomics

Have you seen anybody using this terminology? Now let's try to give it a definition: Toxicogenomics is the hybridization of functional genomics and molecular toxicology. The goal of toxicogenomics is to find correlations between toxic responses to toxicants and changes in the genetic profiles of the objects exposed to such toxicants.


The objective of pharmacogenomics is ultimately to target drugs specifically to those patients with a genetic make-up (genotype) such that they will have close to a 100% response with no side effects. The real long-term potential for pharmacogenomics is to stratify diseases by mechanism and develop therapies, or even preventative approaches, based on genetic risk factors. More immediately, pharmacogenomics can be used to improve the clinical development process.

Anticipated Benefits of Pharmacogenomics:

  • More powerful Medicines

Pharmaceutical companies will be able to create drugs based on the proteins, enzymes, and RNA molecules associated with genes and diseases. This will facilitate drug discovery and allow drug makers to produce a therapy more targeted to specific diseases. This accuracy not only will maximize therapeutic effects but also will decrease damage to nearby healthy cells.

  • Better, Safer Drugs the First Time

Instead of the standard trial-and-error method of matching patients with the right drugs, doctors will be able to analyze a patient’s genetic profile and prescribe the best available drug therapy from the beginning. Not only will this take the guesswork out of finding the right drug, it will speed up recovery time and increase safety as the likelihood of adverse reactions is reduced.

  • More Accurate Methods of Determining Appropriate Drug dosages

Current methods of basing dosages on weight and age will be replaced with dosages based on a person’s genetics: how well the body processes the medicine and the time it takes to metabolize it. This will maximize the therapy’s value and decrease the likelihood of overdose.

  • Advanced screening for Disease

Knowing one’s genetic code will allow a person to make adequate lifestyle and environmental changes at an early age so as to avoid or lessen the severity of a genetic disease. Likewise, advance knowledge of a particular disease susceptibility will allow careful monitoring, and treatments can be introduced at the most appropriate stage to maximize their effect.

  • Better vaccines

Vaccines made of genetic material, either DNA or RNA, promise all the benefits of existing vaccines without all the risks. They will activate the immune system but will be unable to cause infections. They will be inexpensive, stable, easy to store and capable of being engineered to carry several strains of a pathogen at once.

  • Improvements in the Drug Discovery and Approval Process

Pharmaceutical companies will be able to discover potential therapies more easily using genome targets. Previously failed drug candidates may be revived as they are matched with the niche population they serve. The drug approval process should be facilitated as trials are targeted at specific genetic population groups, providing greater degrees of success. The cost and risk of clinical trials will be reduced by targeting only those persons capable of responding to a drug.

  • Decrease in the Overall Cost of Health Care

Decreases in the number of adverse drug reactions, the number of failed drug trials, the time it takes to get a drug approved, the length of time patients are on medication, the number of medications patients must take to find an effective therapy, the effects of a disease on the body (through early detection), and an increase in the range of possible drug targets will promote a net decrease in the cost of health care.

Why Pharmaceutical Companies Require Bioinformatics

Modern bioinformatics teams play a critical role in creating a framework that can support the needs of information-based R&D. Such a team acts as an agent of change: it identifies and evaluates new technologies, advises on their potential and integration, trains internal staff, and helps to present and interpret data for users. One of the goals of bioinformatics is to make sense of the Human Genome Project, that is, of what the genome is and what it does. Without bioinformatics it would be impossible to find answers in the vast sea of data that is being generated.

The availability of the genome sequence is just the beginning. Scientists want to understand genes, their function, and the role they play in the prevention, diagnosis and treatment of disease. The ultimate goal is to identify patterns in the information that can be used to develop more therapeutic drugs.

Scientists are now trying to understand proteomics, i.e., the study of proteins, their functions and interactions, and the rules for figuring out the relationship between gene sequence and protein function. Drug makers believe that proteomic understanding will lead to new therapies that will revolutionize the way disease is diagnosed and treated. With the help of bioinformatics, they are trying to develop more effective therapeutics: drugs that work more quickly, are safer and less toxic, and have better bioavailability. The greatest bottleneck is the discovery of gene function. Drug discovery companies depend on bioinformatics companies to help them filter the huge number of genes that are associated with diseases and to find those that would be good drug targets.

The catch is that a common language is needed so that pieces of data can be expressed in terms of that language. Computational tools are then needed to interpret these data, i.e., bioinformatics provides the necessary tools to extract knowledge from masses of raw information. This wealth of information is a challenge to pharma and biotech companies, but the abundance of data also has a great advantage: better-targeted drug treatments will be possible. The global pharmaceutical industry is worth more than $150 billion per year.

Proteomics and Genomics


The dream of having the complete genome sequence is now a reality. The complete sequences of several genomes, including the human one, are known. However, the understanding of probably half a million human proteins encoded by an estimated 20,000 to 25,000 genes (earlier estimates ran as high as 80,000) is still a long way away, and the hard work to unravel the complexity of biological systems is yet to come.
A new fundamental concept called the proteome (PROTEin complement to a genOME) has recently emerged that should drastically help phenomics to unravel the biochemical and physiological mechanisms of complex multivariate diseases at the functional molecular level. A new discipline, proteomics, has been initiated that complements physical genomic research. Proteomics can be defined as the qualitative and quantitative comparison of proteomes under different conditions to further unravel biological processes.

In a broader sense, proteomics is defined as the analysis of complete complements of proteins. Proteomics includes not only the identification and quantification of proteins, but also the determination of their localization, modifications, interactions, activities, and, ultimately, their function. Initially encompassing just two-dimensional (2D) gel electrophoresis for protein separation and identification, proteomics now refers to any procedure that characterizes large sets of proteins. The explosive growth of this field is driven by multiple forces: genomics and its revelation of more and more new proteins; powerful protein technologies, such as newly developed mass-spectrometry approaches, global [yeast] two-hybrid techniques, and spin-offs from DNA arrays; and innovative computational tools and methods to process, analyze, and interpret prodigious amounts of data.

Types: Compared with genomics, proteomics is not yet fully differentiated. Presently only two divisions are prominent, namely functional and comparative proteomics.

Functional proteomics: Relating function to gene expression and protein-protein interactions is yielding large databases of interacting proteins. Extensive pathway maps of these interactions are being scored and deciphered by novel high-throughput technologies. However, traditional methods of screening have not been very successful in identifying protein-protein interaction inhibitors. The proteomic pipeline is under way to reveal, identify and understand the biological mechanisms that exist between proteins, protein folding, and how protein structure relates to function. This explosion in genomic and proteomic data, together with the exponential increase in known protein structures, should make it easier to develop highly specific, safer and more effective pharmaceuticals.

Comparative proteomics: This compares the proteomes of two different organisms for functional and structural studies. For example, the C. elegans proteome was used as an alignment template to assist in novel human gene identification. Among the available 18,452 C. elegans protein sequences, results indicate that at least 83% had human homologous genes, with 7,954 records of C. elegans proteins matching known human gene transcripts.



Genomics is operationally defined as investigations into the structure and function of very large numbers of genes undertaken in a simultaneous fashion. Genomics has its origin in the US Government sponsored Human Genome Project (HGP). Initiated in the mid-1980s, its initial intent was to map, sequence, and characterize all human chromosomes in order to facilitate more effective discovery of genes. Genomics encompasses various technologies used to discover and characterize genes, with a view to identifying those that cause or predispose to diseases. It includes new approaches to the understanding of gene expression and gene function, and to the selection and validation of genes leading to the design of efficacious and specific drugs. Genomics tools also find applications in the discovery of better ways to fight infectious diseases. Recent advances in genomics are bringing about a revolution in our understanding of the molecular mechanisms of disease, including the complex interplay of genetic and environmental factors. Genomics is also stimulating the discovery of breakthrough healthcare products by revealing thousands of new biological targets for the development of drugs, and by giving scientists innovative ways to design new drugs, vaccines and DNA diagnostics. Genomics-based therapeutics include "traditional" small chemical drugs, protein drugs, and potentially gene therapy.

Types: Genomics is broadly classified into functional, structural and comparative genomics and subdivided into biochemical genomics, physiological genomics, evolutionary genomics and phylogenomics.

Functional genomics: Functional genomics aims to discover the biological function of particular genes and to uncover how sets of genes and their products work together in health and disease. In its broadest definition, functional genomics encompasses many traditional molecular genetic and other biological approaches.

Structural genomics: Involves quick determination of the 3D structures of large numbers of proteins (or other complex biological molecules, such as nucleic acids), ultimately accounting for an organism’s entire proteome. As traditionally defined, the term structural genomics referred to the use of sequencing and mapping technologies, with the support of bioinformatics, to develop complete genome maps (genetic, physical, and transcript maps) and to elucidate the genomic sequences of different organisms, particularly humans. Now, however, the term is increasingly used to refer to high-throughput methods for determining protein structures.

Comparative genomics: Comparative studies of whole genomes help researchers understand what parts of the genome in one organism are similar to those in another, how the overall structure of genes and genomes has evolved, and how to interfere with these events in the model organism or in humans. Comparative genomics is also a critical enabling field for functional genomics, because it gives researchers an indication of which model organism is most appropriate for a particular study.

Biochemical genomics: An approach that identifies genes by the activities of their products, with respect to their involvement in metabolism.

Evolutionary genomics: Looking at how genes have been preserved through evolution, or how genes or their functions have diverged.

Phylogenomics: The study of the evolution of genes and gene families using DNA sequence information from organisms selected at major branch points along the phylogenetic continuum.

Physiological genomics: Covers “a wide variety of studies from human and from informative model systems with techniques linking genes and pathways to physiology, from prokaryotes to eukaryotes.”

Databases, techniques and software used

  • Databases
    • Genome sequences at the Entrez Genome and TIGR databases.
  • Analysis techniques
    • Basecalling: to convert fluorescence intensities from the sequencing experiment into four-letter sequence code.
    • Genome mapping and assembly: to organize the sequences of short fragments of raw DNA sequence data into a coherent whole.
    • Genome annotation: to connect functional information about the genome to specific sequence locations.
    • Genome comparison: to identify components of genome structure that differentiate one organism from another.
    • Microarray image analysis: to identify and quantify spots in raw microarray data.
    • Clustering analysis of microarray data: to identify genes that appear to be expressed as linked groups.

  • Tools/software
    • Basecalling: Phred
    • Genome mapping and assembly: Phrap, Staden package
    • Genome annotation: MAGPIE
    • Genome comparison: PipMaker, MUMmer
    • Microarray image analysis: CrazyQuant, SpotFinder, Array View
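To give a feel for what "clustering analysis of microarray data" means, here is a minimal sketch in plain Python. It is not how any of the tools listed above work: it simply groups genes whose expression profiles correlate strongly (Pearson correlation), using a greedy pass. The function names, the 0.9 threshold and the gene names in the usage note are all made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def cluster_by_correlation(profiles, threshold=0.9):
    """Greedy grouping: a gene joins the first existing cluster containing a
    gene whose profile correlates with it at or above `threshold`;
    otherwise it starts a new cluster. Clusters are never merged later.
    """
    clusters = []
    for gene, profile in profiles.items():
        for cluster in clusters:
            if any(pearson(profile, profiles[other]) >= threshold
                   for other in cluster):
                cluster.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters
```

With invented profiles such as `{"geneA": [1, 2, 3, 4], "geneB": [2, 4, 6, 8.5], "geneC": [4, 3, 2, 1], "geneD": [8, 6, 4.5, 2]}`, the two rising genes end up in one group and the two falling genes in another.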

Plasmid Mapping Primer Design

Restriction mapping is the process of obtaining structural information on a piece of DNA by the use of restriction enzymes. In software terms, it means generating graphical and text-based maps of restriction endonuclease cleavage sites in a DNA sequence.

A restriction map is a description of restriction endonuclease cleavage sites within a piece of DNA. Generating such a map is usually the first step in characterizing an unknown DNA, and a prerequisite to manipulating it for other purposes. Typically, restriction enzymes that cleave DNA infrequently (e.g. those with 6 bp recognition sites) and are relatively inexpensive are used to produce a map.


Restriction enzymes:
Restriction enzymes are enzymes that cut DNA at specific recognition sequences called "sites." The name "restriction enzyme" comes from the enzyme's function of restricting access to the cell by foreign DNA, such as that of an invading bacteriophage. A bacterium protects its own DNA from these restriction enzymes by having another enzyme present that modifies these sites by adding a methyl group.

For example, E. coli makes the restriction enzyme Eco RI and the methylating enzyme Eco RI methylase. The methylase modifies Eco RI sites in the bacterium's own genome to prevent it from being digested.
Restriction enzymes are endonucleases that recognize specific regions of DNA 4 to 8 bases long. For example, one restriction enzyme, Eco RI, recognizes the following six-base sequence:

                  5' . . . G-A-A-T-T-C . . .  3'
                  3' . . . C-T-T-A-A-G . . .  5'

A piece of DNA incubated with Eco RI in the proper buffer conditions will be cut wherever this sequence appears. As you can see, this site is palindromic; that is, reading the upper strand from 5' to 3' is the same as reading the lower strand from 5' to 3'. As a result, each strand of the DNA can self-anneal and the DNA forms a small cruciform structure:

Figure 1

Most restriction enzyme sites are palindromic. This structure may help the enzyme to recognize the sequence that it cuts.

There are hundreds of restriction enzymes that have been isolated and each one recognizes its own specific nucleotide sequence. Sites for each restriction enzyme are distributed randomly throughout a particular DNA stretch. Digestion of DNA by restriction enzymes is very reproducible; every time a specific piece of DNA is cut by a specific enzyme, the same pattern of digestion will occur. Restriction enzymes are commercially available and their use has made manipulating DNA very easy.
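The ideas in this section (a recognition site, its palindromic nature, and the reproducible fragment pattern) can be sketched in Python. This is a toy model on a single strand written 5' to 3'; the function names and the `cut_offset` parameter are inventions for this post, and the only enzyme-specific facts used are the Eco RI site GAATTC from above and its cut between G and A.

```python
ECORI_SITE = "GAATTC"   # recognition sequence shown above
ECORI_CUT_OFFSET = 1    # Eco RI cuts between G and A: G^AATTC

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def is_palindromic(site):
    """True if the site reads the same 5' to 3' on both strands."""
    return site == "".join(COMPLEMENT[base] for base in reversed(site))


def find_sites(dna, site):
    """0-based start positions of every occurrence of `site` in `dna`."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


def digest_linear(dna, site, cut_offset):
    """Fragment lengths, in order, after cutting a linear DNA at every site."""
    cuts = [pos + cut_offset for pos in find_sites(dna, site)]
    bounds = [0] + cuts + [len(dna)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]
```

Running `digest_linear` twice on the same sequence always gives the same list of fragment lengths, which is the software analogue of the reproducible digestion pattern described above.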

Restriction Mapping: PROCESS:

Restriction mapping involves digesting DNA with a series of restriction enzymes and then separating the resultant DNA fragments by agarose gel electrophoresis. The distance between restriction enzyme sites can be determined from the patterns of fragments that are produced by the restriction enzyme digestion. In this way, information about the structure of an unknown piece of DNA can be obtained. An example of how this works is shown below. You have isolated a clone in pBluescript (look at the bacterial transformation lab again to see its restriction map). You know how big the pBluescript portion of the plasmid is (3.0 kilobases) and what restriction enzyme sites are present in the plasmid (because you have its restriction map from the company that sold you the plasmid). You also know that the insert is 2.0 kb long and that it is inserted into the Eco RI site. Your task is to find out more information about the insert:

Figure 2

At this point, you would digest the plasmid with an enzyme that you know has a site in the pBluescript plasmid. For example, you know that there is only one Bam HI site in pBluescript, and it is in the multiple cloning site next to the Eco RI site (figure 2). If you digest this plasmid with Bam HI, there are two possibilities: 1) There are no Bam HI sites in the insert. If this is the case, when you run this digestion on a gel you will see only one DNA fragment, and it will be 5.0 kb long (3.0 kb of pBluescript DNA and 2.0 kb of insert DNA). 2) There is a Bam HI site in the insert. If this is the case, then the enzyme will cut the circular plasmid in two places, in the pBluescript part of the plasmid and in the insert. In this case, you will end up with two fragments of DNA. One will be pBluescript with some of the insert still attached and the other will be just insert. The sizes of the two fragments (determined by electrophoresis) will tell you where the site is.

These two possibilities are shown in figure 3:

Figure 3

In the second case, where there is a site in the insert, the gel might look like this:

Figure 4

In this case, we learn two pieces of information: 1) that there is a Bam HI site in the insert, and 2) where the site is in relation to one end of the insert. When the Bam HI digestion is separated on an agarose gel, the sizes of the two fragments can be determined. In the above gel, the fragments are 3.6 kb and 1.4 kb. Therefore, we know that the Bam HI site is 1.4 kb away from the right-hand side of the insert (figure 5). In this way, you have "mapped" the Bam HI site:

Figure 5

By testing the insert for the presence and location of sites of many different restriction enzymes, a "restriction map" of the clone is made. This will give us important structural information on the insert.
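The arithmetic behind the Bam HI example can be checked with a small Python sketch. The coordinate system is an assumption made only for illustration: the 5.0 kb plasmid is numbered 0 to 4999 bp, the insert occupies positions 0 to 2000, and the vector's Bam HI site is placed at position 2000, right next to the right-hand end of the insert. The function name `digest_circular` is made up.

```python
def digest_circular(plasmid_length, cut_positions):
    """Fragment sizes (largest first) from cutting a circular DNA.

    `cut_positions` are coordinates on the circle, in the same units as
    `plasmid_length` (base pairs here). One cut linearizes the circle,
    so it yields a single fragment of full length.
    """
    cuts = sorted(set(pos % plasmid_length for pos in cut_positions))
    if not cuts:
        return [plasmid_length]  # uncut circle
    sizes = [end - begin for begin, end in zip(cuts, cuts[1:])]
    # The last fragment wraps around the origin of the circle.
    sizes.append(plasmid_length - cuts[-1] + cuts[0])
    return sorted(sizes, reverse=True)


# Case 1: no Bam HI site in the insert, only the vector site is cut.
case_1 = digest_circular(5000, [2000])

# Case 2: a second site inside the insert, 1.4 kb from its right-hand end
# (position 600 in the hypothetical coordinates described above).
case_2 = digest_circular(5000, [2000, 600])
```

Under these assumptions `case_1` is a single 5000 bp fragment and `case_2` is 3600 bp plus 1400 bp, matching the 3.6 kb and 1.4 kb bands read from the gel.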

Uses of Restriction Mapping:

Restriction map information is important for many techniques used to manipulate DNA. One application is to cut a large piece of DNA into smaller fragments to allow it to be sequenced. Genes and cDNAs can be thousands of kilobases long (megabases - Mb); however, they can only be sequenced about 400 bases at a time. DNA must be chopped up into smaller pieces and subcloned to perform the sequencing. Also, restriction mapping is an easy way to compare DNA fragments without having any information about their nucleotide sequence. For example, you may isolate two clones for a gene that are 8 kb and 10 kb long. You know that they overlap, because the procedure you used to isolate them told you that they have sequences in common. A restriction map can tell you by how much they overlap.

On the basis of the restriction maps of each of these clones, it can be inferred how they overlap.

From the restriction map information, you can tell which parts of the two clones are identical and which parts are different. The parts of the clones that overlap are identical. If you were interested in the sequence of this gene, you would only have to sequence the area of overlap in one of the clones, greatly reducing the amount of sequencing that you would have to do.

Phylogenetic Analysis

A phylogenetic analysis of a family of related nucleic acid or protein sequences is a determination of how the family might have been derived during evolution. The evolutionary relationships among the sequences are depicted by placing the sequences as outer branches on a tree. The branching relationships on the inner part of the tree then reflect the degree to which different sequences are related. Two sequences that are very much alike will be located as neighboring outside branches and will be joined to a common branch beneath them. The object of phylogenetic analysis is to discover all of the branching relationships in the tree and the branch lengths.

Phylogenetic analysis of nucleic acid and protein sequences is presently, and will continue to be, an important area of sequence analysis. In addition to analyzing changes that have occurred in the evolution of different organisms, the evolution of a family of sequences may be studied. On the basis of the analysis, sequences that are the most closely related can be identified by their occupying neighboring branches on a tree. When a gene family is found in an organism or group of organisms, phylogenetic relationships among the genes can help to predict which ones might have an equivalent function. These functional predictions can then be tested by genetic experiments. Phylogenetic analysis may also be used to follow the changes occurring in a rapidly changing species, such as a virus. Analysis of the types of changes within a population can reveal, for example, whether or not a particular gene is under selection (McDonald and Kreitman 1991; Comeron and Kreitman 1998; Nielsen and Yang 1998), an important source of information in applications like epidemiology.

Phylogenetics is the study of evolutionary relationships; phylogenetic analysis is the means of estimating these relationships. The evolutionary history inferred from phylogenetic analysis is usually depicted as branching, treelike diagrams that represent an estimated pedigree of the inherited relationships among molecules, organisms or both.
Phylogenetics is sometimes called cladistics, because the word ‘clade’, a set of descendants from a single ancestor, is derived from the Greek word for branch. However, cladistics is strictly a particular method of hypothesizing about evolutionary relationships.

Cladistic analysis is performed by comparing multiple characteristics, or characters, at once: either multiple phenotypic characters or multiple base pairs or amino acids in a sequence.

Three basic assumptions in cladistics.

  1. Any group of organisms is related by descent from a common ancestor.
  2. There is a bifurcating pattern of cladogenesis.
  3. Change in characteristics occurs in lineages over time. This is a necessary condition for cladistics to work.

A clade is a monophyletic taxon. Clades are groups of organisms or genes that include the most recent common ancestor of all of their members and all of the descendants of that most recent common ancestor. Clade is derived from the Greek word ‘klados’, meaning branch or twig.

  1. Taxon: any named group of organisms, but not necessarily a clade.
  2. In some analyses, branch lengths correspond to divergence (e.g. mouse being slightly more closely related to fly than human is to fly).
  3. Node: a bifurcating branch point.
  4. Branch: defines the relationship between the taxa in terms of descent and ancestry.
  5. Topology: the branching pattern.
  6. Branch length: often represents the number of changes that have occurred in that branch.
  7. Root: the common ancestor of all taxa.
  8. Distance scale: a scale which represents the number of differences between sequences (e.g. 0.1 means 10% differences between two sequences).
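The "0.1 means 10% differences" reading of the distance scale corresponds to the simplest possible distance, the proportion of differing positions between two aligned sequences. Below is a minimal Python sketch under stated assumptions: the sequences are already aligned, have equal length, and contain no gaps. Real tree-building programs often apply corrections for multiple substitutions at the same site, which this ignores; the function name `p_distance` is just a label chosen here.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)
```

Two ten-base sequences that differ at exactly one position, for example `"ACGTACGTAC"` and `"ACGTACGTAA"`, give a distance of 0.1.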

Common Phylogenetic Tree Terminology

Phylogenetic trees diagram the evolutionary relationships between the taxa

The branch-length dimension of a tree can either have no scale (for ‘cladograms’), be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).

((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses

These say that B and C are more closely related to each other than either is to A,
and that A, B, and C form a clade that is a sister group to the clade composed of
D and E. If the tree has a time scale, then D and E are the most closely related.
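The nested-parentheses notation is easy to read by machine. The Python sketch below parses a string such as `((A,(B,C)),(D,E))` into nested tuples and lists the clades (the set of tips under each internal node). It handles only tip labels, not branch lengths or labels on internal nodes, and it assumes well-formed input; the function names are made up for this post.

```python
def parse_newick(text):
    """Parse nested parentheses into nested tuples of tip labels."""
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos                       # a tip label
        while pos < len(text) and text[pos] not in "(),;":
            pos += 1
        return text[start:pos]

    return parse_node()


def leaves(node):
    """All tip labels below a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def clades(node):
    """Tip sets of every internal node, root first."""
    if isinstance(node, str):
        return []
    result = [sorted(leaves(node))]
    for child in node:
        result.extend(clades(child))
    return result
```

For the tree above, `clades(parse_newick("((A,(B,C)),(D,E))"))` lists the whole tree, then the clade A, B, C, then B, C, then D, E, which is exactly the grouping described in words.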

Three types of trees

Tree Styles

Tree-drawing software offers the choice of drawing a tree diagram unrooted or in one of the rooted forms: Cladogram, Phenogram, Curvogram, Eurogram and Swoopogram. The styles are described as follows.

Rooted and Unrooted tree

Cladogram – Nodes are connected to other nodes and to tips by straight lines going directly from one to the other. This gives a V-shaped appearance.

Curvogram – Nodes are connected to other nodes and to tips by a curve, which is one fourth of an ellipse, starting out horizontally and then curving upwards to become vertical. John Rudd suggested this pattern.

Phenogram – Nodes are connected to other nodes and to tips by a horizontal and then a vertical line. This gives a particularly precise idea of horizontal levels.

Eurogram – So-called because it is a version of the cladogram diagram popular in Europe (name courtesy of David Maddison). Nodes are connected to other nodes and to tips by a diagonal line that goes outward and goes at most one-third of the way up to the next node, then turns sharply straight upwards and is vertical. Unfortunately it is nearly impossible to guarantee, when branch lengths are used, that the angles of divergence of lines are the same.

Swoopogram – This option (suggested by James Archie) connects two nodes or a node and a tip using two curves that are actually each one-quarter of an ellipse. The first part starts out vertical and then bends over to become horizontal. The second part, which is at least two-thirds of the total, starts out horizontal and then bends up to become vertical. The effect is that two lineages split apart gradually, then more rapidly, then both turn upwards.

Possible ways of drawing a tree:

Trees can be drawn in different ways. There are trees with unscaled branches and with scaled branches.

  1. Unscaled branches : the length is not proportional to the number of changes. Sometimes, the number of changes is indicated on the branches with numbers. The nodes represent divergence events on a time scale.
  2. Scaled branches : the length of the branch is proportional to the number of changes. The distance between two species is the sum of the lengths of all branches connecting them.
  3. It is also possible to draw these trees with or without a root. For rooted trees, the root is the common ancestor. For each species, there is a unique path that leads from the root to that species. The direction of each path corresponds to evolutionary time. An unrooted tree specifies the relationships among species and does not define the evolutionary path.

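For a tree with scaled branches, the rule that the distance between two species is the sum of the branch lengths connecting them can be sketched in a few lines of Python. The tree below, its internal node names (n1, n2, n3, root) and the change counts on each branch are all hypothetical; the topology simply mirrors the ((A,(B,C)),(D,E)) example.

```python
# Hypothetical rooted tree: each node maps to (parent, number of changes on
# the branch leading to that parent). The root has no entry.
PARENT = {
    "A": ("n1", 3),
    "B": ("n2", 1),
    "C": ("n2", 2),
    "n2": ("n1", 1),
    "n1": ("root", 2),
    "D": ("n3", 2),
    "E": ("n3", 2),
    "n3": ("root", 4),
}


def path_to_root(node):
    """Map the node and each of its ancestors to their distance from the node."""
    distances = {node: 0}
    total = 0
    while node in PARENT:
        parent, length = PARENT[node]
        total += length
        distances[parent] = total
        node = parent
    return distances


def tree_distance(tip_a, tip_b):
    """Sum of branch lengths on the unique path connecting two tips."""
    up_a = path_to_root(tip_a)
    up_b = path_to_root(tip_b)
    # The most recent common ancestor is the shared ancestor nearest to tip_a.
    common = min((n for n in up_a if n in up_b), key=lambda n: up_a[n])
    return up_a[common] + up_b[common]


print(tree_distance("B", "C"))  # 1 + 2 = 3
print(tree_distance("A", "B"))  # 3 + 1 + 1 = 5
print(tree_distance("A", "D"))  # 3 + 2 + 4 + 2 = 11
```

Because every species has a unique path to the root in a rooted tree, walking up from both tips until the paths meet is enough to find the branches that connect them.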


The objective of phylogenetic analysis is to discover all of the branching relationships in the tree and the branch lengths. Phylogenetic analysis of nucleic acid and protein sequences is, and will continue to be, an important area of sequence analysis. In addition to analysing changes that have occurred in the evolution of different organisms, the evolution of a family of sequences may be studied. In such an analysis, sequences that are closely related can be identified by their occupying neighbouring branches on a tree. When a gene family is found in an organism or group of organisms, phylogenetic relationships among the genes can help to predict which ones might have an equivalent function. Phylogenetic analysis may also be used to follow the changes occurring in a rapidly changing species, such as a virus. Analysis of the types of changes within a population can reveal, for example, whether or not a particular gene is under selection, an important source of information in applications like epidemiology. With the aid of sequences, it should be possible to find the genealogical ties between organisms. Experience shows that closely related organisms have similar sequences, while more distantly related organisms have more dissimilar sequences. One objective is to reconstruct the evolutionary relationships between species. Another objective is to estimate the time of divergence between two organisms since they last shared a common ancestor.

Human Genome Project


The Human Genome Project is a worldwide research effort initiated by the Department of Energy and the National Institutes of Health in 1987 as a multi-disciplinary effort to understand the basis of human heredity. This international collaboration is being carried out at several genome centers located in the United States, England, France and Japan. The focus of the Human Genome Project is the characterization of the human genome by determining the complete nucleotide sequence of our 24 different chromosomes, including the estimated 50,000 to 100,000 genes contained in human DNA. Project goals are to:

Ø Identify all the approximately 30,000 genes in human DNA

Ø Determine the sequence of the 3 billion chemical base pairs that make up human DNA

Ø Store this information in databases,

Ø Improve tools for data analysis,

Ø Transfer related technologies to the private sector, and

Ø Address the ethical, legal, and social issues (ELSI) that may arise from the project.

What is a genome, and why is it important?

Ø A genome is the entire DNA in an organism, including its genes. Genes carry information for making all the proteins required by all organisms. These proteins determine, among other things, how the organism looks, how well its body metabolizes food or fights infection, and sometimes even how it behaves.

Ø DNA is made up of four similar chemicals (called bases and abbreviated A,T, C, and G) that are repeated millions or billions of times throughout a genome.
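As a small illustration of the four-base alphabet, the Python sketch below counts how often each base occurs in a sequence. The function name and the 15-base fragment are invented for the example; a real genome would be read from a file and would be millions or billions of bases long.

```python
from collections import Counter


def base_composition(sequence: str) -> dict:
    """Count occurrences of each of the four DNA bases in a sequence."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in "ATCG"}


fragment = "ATGCGATACGCTTGA"  # hypothetical short fragment
composition = base_composition(fragment)
print(composition)  # {'A': 4, 'T': 4, 'C': 3, 'G': 4}

# Fraction of the fragment made up of G and C, a commonly reported summary.
gc_fraction = (composition["G"] + composition["C"]) / len(fragment)
print(round(gc_fraction, 3))
```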

Phrap: Genome mapping and assembly, to organize the sequences of short fragments of raw DNA sequence data into a coherent whole.

MAGPIE: Genome annotation, to connect functional information about the genome to specific sequence locations.

PipMaker: Genome comparison, to identify components of genome structure that differentiate one organism from another.

Array Viewer: Microarray image analysis, to identify and quantitate spots in raw microarray data.

Melanie Viewer: 2D-PAGE analysis, to analyze, visualize, and quantitate 2D-PAGE images.

Download bioinformatics Software

These resources are free, publicly available, multi-purpose tools for DNA sequence analysis. Links to resources for sequence presentation, manipulation tasks, and format conversion are found here.

BCM Search Launcher

Molecular biology-related search and analysis services organized by function; single point-of-entry for related searches (e.g., a single page for launching protein sequence searches using standard parameters).

BCM Search Launcher Sequence Utilities

Includes reverse complement, 6-frame translation, RepeatMasker, ReadSeq format conversion.
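To show what two of these utilities compute, here is a minimal Python sketch of reverse complement and 6-frame translation. It is not the code behind the BCM tools; the function names are made up, the standard genetic code is assumed, and the nine-base input is a toy example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the strand."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Translate three reading frames on each strand."""
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


example = "ATGGCCTAA"  # hypothetical input
print(reverse_complement(example))  # TTAGGCCAT
for frame, protein in six_frame_translation(example).items():
    print(frame, protein)
```

Frame +1 of the toy sequence reads ATG GCC TAA, which translates to M, A and a stop; the other five frames shift the start by one or two bases or read the opposite strand.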

Bioinformatics Toolkit

This Toolkit is a collection of a wide range of tools and links for sequence analysis, function, and structure prediction. This resource offers convenient web interfaces for many freely available tools.


Printing and shading of multiple alignment files.


DNAtools include predicting DNA curvature; plotting physicochemical, statistical, or locally computed parameters along DNA sequences; producing a 3-D model of a DNA sequence; searching an intron database.


This site provides several bioinformatics software tools packaged together for easy installation on MacOSX computers. The software includes NCBI tools, EMBOSS, ClustalW, Staden, T-Coffee and Primer3.


Diverse suite of tools for sequence analysis; many programs analogous to GCG; context-sensitive help for each tool.

HGNC: HUGO Gene Nomenclature Committee

The HGNC approves a unique gene name and symbol for each known human gene. The Human Gene Nomenclature Database (Genew) is searchable, and contains all approved symbols. For each symbol, if known, the database associates gene location, aliases, previous symbols and links out to sequence data.


NDB (Nucleic Acid Database) is a repository of three-dimensional structural information about nucleic acids.

OSU Bioinformatics and Computational Biology

The website of the Ohio State University Human Cancer Genetics Bioinformatics group. This site has many resources, including databases of promoters and transcription factors, software tools to predict potential P53 consensus binding sites and to predict first exon and promoter regions and a software toolkit for developing web-based applications to view genomic data.


pDRAW32 is a multi-function tool with features including: graphical displays useful for drawing plasmids, sequence analysis and editing, virtual agarose gel plots and homology plots.


Web server that automatically generates and annotates circular plasmid maps. The tool has a built-in set of features that can be displayed (i.e., RE sites, tags, ORFs, etc.), allows users to define custom features to display, contains a library of commonly used plasmids, and generates nice-looking images in a variety of output formats.


Sequence format conversion; includes GenBank, EMBL, GCG, FASTA, ASN.1, Phylip and others.


Sequence analysis tools on the web; includes nucleic acid, protein, PCR and alignment tools.

Software for MacOSX at Mek&

Freely available programs that run on the MacOSX platform. 4Peaks is a DNA sequence editor and visualization program able to read and write common trace file formats. iRNAi assists in the design of error-free oligos. EnzymeX is a tool to help determine which restriction enzymes to use and includes information for over 580 enzymes. LabAssistant is a task/time management system to help organize your experiments.

The Sequence Manipulation Suite 2

The Sequence Manipulation Suite is a set of tools for tasks such as sequence format conversion, sequence presentation, analysing sequence characteristics and shuffling or generating random sequences. It can be accessed over the web, or installed locally and run through a web browser.


VisCoSe (Visualization and Comparison of consensus Sequences) is a web based tool that takes a set of sequences (aligned or unaligned) and calculates a consensus sequence and the conservation rates for the sequence alignment, producing an easy to interpret visualization as output. One can also compare and visualize a set of consensus sequences generated from several sequence sets.
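The idea of a consensus sequence with per-position conservation rates can be sketched in Python as follows. This is a simple majority-rule illustration, not VisCoSe's actual method; the function name and the four aligned sequences are made up for the example.

```python
from collections import Counter


def consensus_and_conservation(aligned):
    """Return the majority-rule consensus and the conservation rate per column.

    Ties are broken in favour of the residue seen first in the column.
    """
    if len({len(sequence) for sequence in aligned}) != 1:
        raise ValueError("need at least one sequence, all of the same length")
    consensus = []
    conservation = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        consensus.append(residue)
        conservation.append(count / len(aligned))
    return "".join(consensus), conservation


# Hypothetical alignment of four sequences.
alignment = [
    "ACGTA",
    "ACGAA",
    "ACCTA",
    "TCGTA",
]
sequence, rates = consensus_and_conservation(alignment)
print(sequence)  # ACGTA
print(rates)     # [0.75, 1.0, 0.75, 0.75, 1.0]
```

Columns where every sequence agrees get a conservation rate of 1.0, while columns with one dissenting sequence out of four get 0.75.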


The Web Alignment Visualization Server (WAViS) provides various web tools to enhance the presentation of amino acid or nucleotide multiple sequence alignments.