DREAM6 Gene Expression Prediction Challenge

Submission is closed

 

 

7/26/2011: Scoring systems posted

A new document detailing how this challenge will be scored has been added. Download it from this link

7/11/2011: Note

The paper associated with this challenge has been accepted and the actual data is available for download. (The previous data was a sample of what to expect.)

 

Introduction

The level at which genes are transcribed is determined in large part by the DNA sequence upstream of the gene, known as the promoter region. Although promoters have been widely studied, we are still far from a quantitative and predictive understanding of how transcriptional regulation is encoded in them. One obstacle in the field is obtaining accurate measurements of the transcription driven by different promoters. To address this, an experimental system was designed to measure the transcription driven by different promoters, all of which are inserted into the same genomic location upstream of a reporter gene, a yellow fluorescent protein (YFP) gene. The challenge consists of predicting the promoter activity given a promoter sequence and a specific experimental condition. To study a set of promoters that share many elements of the regulatory program, and thus are suitable for computational learning, the data pertains to promoters of most of the ribosomal protein (RP) genes of yeast (S. cerevisiae) in rich medium (SCD).

Experimental system

We defined a promoter region as the sequence immediately upstream of the ribosomal protein coding region, extending upstream either to near the coding sequence of the previous gene or to 1200 base pairs, whichever is shorter. Each promoter was linked to a URA3 selection marker (1) and inserted into the same fixed location in the yeast genome (2) of a master strain that contained the YFP gene (see Fig. 1). In addition to 110 natural RP promoter strains, we constructed 33 strains with synthetically mutated RP promoters using similar methods (1,2).

The strains containing the different promoters were synchronized and grown, and their YFP fluorescence was measured in a plate reader. As a control for the experimental error, we had a red fluorescent protein (mCherry) driven by the same control promoter in all strains. The mCherry measurements as well as replicates of the tested promoters indicated that we can distinguish between any two promoters whose activities differ by as little as ~8%.

As a measure of transcription initiated by each promoter, we took the amount of YFP fluorescence produced during the exponential growth phase, divided by the integral of the OD during the same time period. This results in a measure, termed promoter activity, which represents the average rate of YFP production from each promoter, per cell per second, during the exponential phase.
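The definition above can be illustrated with a minimal sketch. It assumes that "the amount of YFP fluorescence produced" is the difference between the YFP readings at the end and at the start of the exponential window, and that the OD integral is approximated with the trapezoidal rule. Neither detail is stated in the challenge text, and the function names and readings below are hypothetical.

```python
def trapezoid_integral(times, values):
    """Integrate values over times with the trapezoidal rule."""
    total = 0.0
    for i in range(1, len(times)):
        total += 0.5 * (values[i] + values[i - 1]) * (times[i] - times[i - 1])
    return total


def promoter_activity(times, yfp, od, t_start, t_end):
    """YFP produced in [t_start, t_end] divided by the OD integral over the same window."""
    # Keep only the samples that fall inside the exponential-phase window.
    idx = [i for i, t in enumerate(times) if t_start <= t <= t_end]
    if len(idx) < 2:
        raise ValueError("need at least two time points inside the window")
    window_t = [times[i] for i in idx]
    window_od = [od[i] for i in idx]
    # Assumption: "amount produced" = last reading minus first reading in the window.
    yfp_produced = yfp[idx[-1]] - yfp[idx[0]]
    return yfp_produced / trapezoid_integral(window_t, window_od)


# Made-up plate-reader readings, for illustration only.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
od = [0.1, 0.2, 0.4, 0.8, 1.0]
yfp = [5.0, 10.0, 20.0, 40.0, 50.0]
activity = promoter_activity(times, yfp, od, t_start=1.0, t_end=3.0)
```

With these made-up readings, 30 units of YFP are produced in the window and the OD integral is 0.9, so the ratio is about 33.3. The units of the real measurements depend on the plate-reader calibration, which this sketch does not model.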


Figure 1. Overview of the experimental system. (A) Illustration of the master strain into which we integrated all the tested promoters. At a fixed chromosomal location, the master strain contains a gene that encodes for a red fluorescent protein (mCherry), followed by the promoter for TEF2, and a gene that encodes for a yellow fluorescent protein (YFP). Every tested promoter is integrated into this strain, together with a selection marker, between the TEF2 promoter and the YFP gene. (B) Strains with different promoters have highly similar growth rates. Shown is the growth of 71 different promoter strains, measured as optical density (OD). Measurements were obtained from a single 96-well plate, with glucose-rich media and a small number of cells from each strain inserted into each well at time zero. The exponential growth phase is indicated (vertical dashed gray lines). (C) Same as (B), but where the measurements correspond to mCherry intensity. Note the small variability in the intensity of mCherry, which is driven by the same control promoter across the different strains. (D) Same as (C), but where the measurements correspond to YFP intensity. Note the large variability in the intensity of YFP, which is driven by a different promoter in each strain.

The Challenge

The challenge consists of predicting the promoter activity driven by a given RP promoter sequence. Participants will be provided with a training set of 90 RP promoters for which both the promoter sequences and their activities are known, and a test set of 53 promoters for which only the promoter sequence is given. The test set is divided into two subsets. The first subset has 20 natural RP promoters. The second subset contains 33 promoters that are similar to natural RP promoters but have some mutations in their sequence. The goal is to predict the promoter activity of the promoters in the test set.

Note

The ultimate goal of the challenge is to develop a computational model that takes as input an RP promoter sequence, natural or synthetically mutated, and outputs its predicted transcription level. Any data available in the literature may be used to reach this goal.

The Data

Participants will be provided with a training set and a test set. The training set consists of three datasets, which contain sequence information and transcriptional activation rates. More specifically, the provided files are as follows:

  • DREAM6_ExPred_Promoters.fasta

contains the sequences of the 90 promoter regions that were tested in the experimental assay, in FASTA format. More explicitly, this file will be of the form

>Promoter_1
CAGAAACTGGAGGCCAAAGCGCTT…GGGCCGGGGTGTCTCAGACGACGC
>Promoter_2
TGTTTCCTCCCTAAAA…TACCGTTGCCAATAAGGGCACGTGT
………………
>Promoter_90
TGCACACGCGTTTGTGCCGTCATTG…GGCAGGGTCTATTTAACACGGGGTAGGTATCTTTCCTTCCAC 

where the length of the promoter regions is variable, ranging between 200 and 1200 bases. For each of these 90 promoters, the measured promoter activity is provided in the file

  •  DREAM6_ExPred_PromoterActivities.txt

The file will contain two tab separated columns as follows:

Promoter_1     0.961934
Promoter_2     0.995191
……                    …………
Promoter_89   1.864408
Promoter_90   0.806867
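As a rough illustration of how the two training files fit together, the sketch below parses a FASTA file and a two-column activity file and joins them on the promoter ID. The helper names are hypothetical, the sequences are short made-up placeholders (the real ones are 200 to 1200 bases long), and the two activity values are the ones shown in the example rows above. In practice the in-memory strings would be replaced by open file handles.

```python
import io


def read_fasta(handle):
    """Return {record_id: sequence} from a FASTA file-like object."""
    records = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: everything after '>' is taken as the record ID.
            current = line[1:].strip()
            records[current] = []
        elif current is not None:
            # Sequence line: may be wrapped over several lines.
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def read_activities(handle):
    """Return {promoter_id: activity} from a two-column, tab-separated file."""
    activities = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        activities[fields[0]] = float(fields[1])
    return activities


# Placeholder content standing in for the two training files.
fasta_text = ">Promoter_1\nACGT\nAC\n>Promoter_2\nTTGA\n"
activity_text = "Promoter_1\t0.961934\nPromoter_2\t0.995191\n"

sequences = read_fasta(io.StringIO(fasta_text))
activities = read_activities(io.StringIO(activity_text))

# Join on promoter ID, keeping only promoters present in both files.
training = [(pid, seq, activities[pid])
            for pid, seq in sequences.items() if pid in activities]
```

Each entry of `training` is an (ID, sequence, activity) triple, which is a convenient starting point for feature extraction.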

Also supplied are the two sequences that flank every promoter (see Fig.1). This information is contained in the file:

  • DREAM6_ExPred_FlankingSeqs.fasta

which includes the following rows:

>Sequence Before All Promoters
<SEQ_BEFORE>
>Sequence After All Promoters
<SEQ_AFTER>

Here, <SEQ_BEFORE> is the sequence in the yeast genome before the inserted promoter, which contains the loci (HIS3 locus-mCherry-TEF2 promoter-URA3). The sequence <SEQ_AFTER> is the sequence in the genome immediately after the inserted promoter, which contains the loci (YFP-NAT1-HIS3 locus).
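Because the flanking sequences are identical for every strain, one optional use for them is to place each promoter back in its genomic context before extracting features. The following is a minimal sketch of that idea. The function name, the `flank` parameter and the short strings are hypothetical, and whether flanking context helps prediction is not something the challenge text claims.

```python
def construct_context(seq_before, promoter, seq_after, flank=50):
    """Return the promoter with up to `flank` bases of fixed context on each side."""
    # Guard against flank == 0: seq_before[-0:] would return the whole string.
    upstream = seq_before[-flank:] if flank > 0 else ""
    downstream = seq_after[:flank] if flank > 0 else ""
    return upstream + promoter + downstream


# Made-up placeholders standing in for <SEQ_BEFORE>, a promoter and <SEQ_AFTER>.
seq_before = "AAAACCCC"
promoter = "GGTT"
seq_after = "TTTTGGGG"

with_context = construct_context(seq_before, promoter, seq_after, flank=3)
without_context = construct_context(seq_before, promoter, seq_after, flank=0)
```

Here `with_context` is the promoter preceded by the last three bases of the upstream sequence and followed by the first three bases of the downstream sequence.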

Finally, the test set includes the sequences of the promoters whose induced YFP fluorescence is requested. This file is named:

  • DREAM6_ExPred_TestSetPromoters.fasta

and is organized as follows:

>Natural_Promoter_Test_1
CAGAAACTGGAGGCCAAAGCGCTT…GGGCCGGGGTGTCTCAGACGACGC
>Natural_Promoter_Test_2
TGTTTCCTCCCTAAAA…TACCGTTGCCAATAAGGGCACGTGT
………………
>Natural_Promoter_Test_20
TGCACACGCGTTTGTGCCGTCATTG…GGCAGGGTCTATTTAACACGGGGTAGGTATCTTTCCTTCCAC
>Synthetic_Promoter_Test_1
AGGTAGGCGAAGTA…AGCTCGCACCACCTCGAGTCTTATTCCAT
>Synthetic_Promoter_Test_2
GTCGTTGTCGTACGGGCAACTAGG…AGTCTACCCTCCACCTGTAAAGAAA
………………
>Synthetic_Promoter_Test_33
TCAAGGATTGGG…TTTAGGGCGAGGTCCGT

Note that there are 20 naturally occurring promoters and 33 synthetically mutated promoters.

Note

The paper associated with the present data is under revision and has not been accepted yet. Until it is accepted, all the data provided for this challenge will be marked with the word ‘mock’ (e.g., DREAM6_ExPred_PromoterActivities_mock.txt): it has a similar format and similar characteristics to the actual data, but is not the real data. Once the paper is accepted, the actual data will be released; the paper will be published after the submission deadline.

(July 11, 2011) The paper associated with the present data has now been accepted. We are therefore providing the actual data in the file dream6_expred_real-data.zip at the bottom of this page. If you have not done so yet, please download the new data.

Submission of Predictions

Participants are required to submit a mandatory write-up file describing the models used to solve the challenge, and one additional file containing the predictions. For each promoter in the test set, please submit your prediction of the YFP induction rate. Submit your predictions in a two-column, tab-delimited format, as in the following example (where \tab stands for a tab character):

Natural_Promoter_Test_1 \tab 0.85
Natural_Promoter_Test_2 \tab 1.5
………………
Natural_Promoter_Test_20 \tab 2.4
Synthetic_Promoter_Test_1 \tab 0.73
Synthetic_Promoter_Test_2 \tab 0.95
………………
Synthetic_Promoter_Test_33 \tab 3.14

The first column contains the ID of each sequence in the test set, and the second column contains the predicted promoter activity.

The file should contain no header row. Submit your files as text files with the name

  • DREAM6_ExPred_Predictions_<TeamName>.txt

replacing <TeamName> with the name of the team under which you registered for the challenge.
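The format requirements above (two tab-delimited columns, no header row, one row per test promoter) can be checked mechanically before submitting. The sketch below is one way to do it, not an official validator. The helper names are hypothetical, the IDs are assumed to match the headers of the test-set FASTA file exactly, and the constant prediction of 1.0 is a placeholder rather than a meaningful model.

```python
import io

N_NATURAL = 20    # natural test promoters, per the challenge description
N_SYNTHETIC = 33  # synthetically mutated test promoters


def expected_ids():
    """IDs in the order used by the test-set FASTA file (assumed)."""
    natural = [f"Natural_Promoter_Test_{i}" for i in range(1, N_NATURAL + 1)]
    synthetic = [f"Synthetic_Promoter_Test_{i}" for i in range(1, N_SYNTHETIC + 1)]
    return natural + synthetic


def write_predictions(predictions, handle):
    """Write one 'ID<tab>value' row per test promoter, with no header row."""
    ids = expected_ids()
    missing = [pid for pid in ids if pid not in predictions]
    if missing:
        raise ValueError(f"missing predictions for: {missing[:3]}")
    for pid in ids:
        handle.write(f"{pid}\t{float(predictions[pid])}\n")


def predictions_filename(team_name):
    """File name pattern requested by the challenge."""
    return f"DREAM6_ExPred_Predictions_{team_name}.txt"


# Placeholder predictions: the same made-up value for every promoter.
predictions = {pid: 1.0 for pid in expected_ids()}
buffer = io.StringIO()
write_predictions(predictions, buffer)
lines = buffer.getvalue().splitlines()
```

To produce the actual file, the `io.StringIO` buffer would be replaced by something like `open(predictions_filename("MyTeam"), "w")`, where "MyTeam" is a stand-in for the registered team name.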

Write-up

Finally, we request that each participating team submit a short write-up (around two to three pages) explaining the methods used to arrive at their predictions. This write-up can contain pseudo-code describing the algorithm used, workflows for arriving at the predicted promoter activities, etc. Submit the write-up as the file

  • DREAM6_ExPred_Writeup_<TeamName>.ext

replacing <TeamName> with the name of your team and the file extension (ext) with your choice of doc or docx. The submission of this write-up is mandatory for participation in this challenge.

 

Scoring Metrics

To be announced

Credits

The data was kindly provided pre-publication by Eran Segal and his group at the Weizmann Institute of Science. The challenge was curated by Eran Segal, Eilon Sharon, Danny Zeevi, Pablo Meyer and Gustavo Stolovitzky.

References

1. Linshiz G, Yehezkel TB, Kaplan S, et al. Recursive construction of perfect DNA molecules from imperfect oligonucleotides. Molecular systems biology. 2008;4(191):191. Available at: http://www.ncbi.nlm.nih.gov/pubmed/18463615. 

2. Gietz RD, Schiestl RH. Microtiter plate transformation using the LiAc/SS carrier DNA/PEG method. Nature protocols. 2007;2(1):5-8. Available at: http://www.ncbi.nlm.nih.gov/pubmed/17401331.

Download

The link below will allow you to download a zipped file with the data for this challenge.