This article was contributed by Daniel S. Lieber

Scanning over the sequence of the human genome, one cannot help but feel the overwhelming sense of mystery that the first discoverers of ancient civilizations must have felt when they came across tablet writings in languages that have long since been lost. A major difference is that the meaning of the 3 billion A’s, C’s, G’s, and T’s that make up the human genome has never been fully understood, though scientists are quickly moving toward a better understanding. To date, scientists have been able to understand the regions of the genome that make up the “code” for proteins. Proteins give cells structure and do much of the work of the cell, such as making cellular components and generating energy. Early research focused on understanding the code in DNA that reads out into protein, and much current work still focuses on understanding the function of proteins. But only around 2% of the human genome “codes” for proteins. What about the rest of our DNA?

A critical step toward understanding our own genome is to learn how genes are regulated, and how this regulation is coded in DNA. Studies in model organisms such as yeast, flies, and mice have shown that a wealth of information lies in the region of DNA preceding each gene: within these “promoters” lie binding sites for transcription factors, which are the proteins responsible for turning genes on and off. Binding of a transcription factor to a short DNA sequence within a gene’s promoter can activate or sometimes inhibit the gene, thereby regulating its expression level.

Although these promoter sequences are known to contain critical regulatory information, we do not know enough to fully interpret the DNA sequence of promoters. To completely understand how transcription factors regulate genes, we would need to know the binding sites and affinities of each transcription factor (TF), the effect of individual TFs on the expression of genes, and how the interplay between multiple TFs and binding sites affects the regulation of each gene. Each of these goals is quite difficult on its own, and a number of labs have focused on these questions in order to advance our understanding of the regulatory genome.

A critical study advancing our knowledge of transcription factor binding sites was published recently in the June 26 issue of Science (Badis et al., Science, 2009). In the study led by Martha Bulyk at Harvard Medical School, the DNA binding sites of over 100 mammalian transcription factors were mapped using a technology known as protein-binding microarrays (PBMs). The technology enabled researchers to find the short DNA sequences bound by each of an assortment of transcription factors. In addition to generating valuable data describing transcription factor binding preferences, the study yielded a number of new observations and general principles. The most important observation was that about half of the analyzed transcription factors bind to at least two distinct sequences. This finding was unexpected given the widely held view that transcription factors generally prefer a single recognition sequence. The study’s results shed light on the complexity of transcription factor binding and, by mapping potential binding sites of transcription factors, will greatly contribute to our understanding of the DNA code of promoter sequences.
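To make the idea of "mapping potential binding sites" concrete, here is a minimal Python sketch of how one might scan a promoter for a transcription factor that recognizes two distinct sequences. It is only an illustration under stated assumptions: the two recognition sequences, the promoter, and the function names are all invented for this example and do not come from either study. Real binding preferences are usually described with scores rather than exact matches.

```python
# Illustrative only: the motifs and promoter below are made up, not taken
# from Badis et al. Exact string matching is a simplification.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence read from the opposite DNA strand."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(promoter, motifs):
    """Find exact matches to any motif on either strand.

    Returns a sorted list of (position, strand, motif) tuples, where
    position is the 0-based start of the match on the given promoter.
    """
    hits = set()
    for motif in motifs:
        for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
            start = promoter.find(pattern)
            while start != -1:
                hits.add((start, strand, motif))
                # Step forward by one so overlapping matches are also found.
                start = promoter.find(pattern, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    # A hypothetical factor with two distinct recognition sequences.
    hypothetical_motifs = ["ACGTTG", "GGATTA"]
    hypothetical_promoter = "AAACGTTGCCTAATCCGG"
    for position, strand, motif in find_sites(hypothetical_promoter, hypothetical_motifs):
        print(position, strand, motif)
```

In this toy promoter the first sequence matches the forward strand and the second matches the reverse strand, which is why both strands have to be checked.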

While Bulyk’s approach discovers the DNA sequences bound by a particular transcription factor, a complementary approach was recently developed to find the transcription factors that bind a particular DNA sequence. A study led by Saeed Tavazoie of Princeton University described a method for experimentally discovering proteins that bind to specific DNA sequences, such as predicted transcription factor binding sites (Freckleton et al., PLoS Genetics, 2009). The method, nicknamed MaPS, takes advantage of a technique called phage display, in which protein fragments are expressed on the surface of a bacterial virus known as bacteriophage λ. To search for proteins that bind a specific fragment of DNA, the researchers “wash” the bacteriophage over that sequence of DNA and identify the bacteriophage that bind to it. Through this method, the group discovered the transcription factor that binds to a particular DNA sequence in yeast. In the future, this technique could be used to discover transcription factors binding to other DNA sequences in other organisms.

In many ways, Tavazoie’s MaPS technology is complementary to Bulyk’s protein-binding microarrays. Whereas the strength of PBMs is the ability to explore the sequences bound by a particular transcription factor, MaPS allows researchers to find the transcription factors that bind to a particular DNA sequence. Both studies describe successful techniques that will allow the scientific community to better understand the regulatory information encoded by our DNA.

Nevertheless, much work lies ahead. Regulatory information encoded in DNA is incredibly rich and likely to be more complex than we can even imagine. The challenge should not be understated: we are attempting in a matter of decades to crack the code that took nature billions of years to evolve. But with enough time, money, and effort, there is no doubt that the application of such technologies will lead us to a better understanding of how our own bodies function and maybe even to advances in medical treatment and diagnostics.

For background information on transcription factors, see:

Journal articles discussed in this post:

Diversity and complexity in DNA recognition by transcription factors
Badis et al., Science, 2009

Microarray profiling of phage-display selections for rapid mapping of transcription factor-DNA interactions
Freckleton et al., PLoS Genetics, 2009