DREAM6 Alternative Splicing Challenge
Important Note
You may not publish the Gold Standard for the Alternative Splicing Challenge without the explicit permission of Gustavo Stolovitzky (gustavo AT us.ibm.com) and Mike Snyder (michael.snyder AT yale.edu).
Synopsis
The goal of the mRNA-seq alternative splicing challenge is to assess the accuracy of the reconstruction of alternatively spliced mRNA transcripts from Illumina short-read mRNA-seq. Reconstructed transcripts will be scored against Pacific Biosciences long-read mRNA-seq. The ensuing analysis of the transcriptomes from mandrill and rhinoceros fibroblasts and their derived induced pluripotent stem cells (iPSC), as well as the transcriptome of human embryonic stem cells (hESC), is an opportunity to discover novel biology and to investigate the species bias of different methods.
Background
RNA splicing is the process of combining different exons of one gene to produce mature mRNA transcripts. Alternative splicing, that is, the mechanism of assembling exons in different combinations, plays a key role in transcriptome diversity in mammals. Shuffling of exons makes it possible for the same gene to code for different proteins. Figure 1, adapted from [1], shows different possible patterns of alternative splicing.
Figure 1. Transcripts from a gene can undergo many different patterns of alternative splicing. Transcriptional initiation at different promoters (shown in two different shades of pink) generates alternative 5'-terminal exons that can be joined to a common 3' exon (shown in blue) downstream. Similarly, alternative 3' exons, with alternative polyadenylation sites, can be joined to a common upstream exon. Through the use of alternative 5' or 3' splice sites, exons can be extended or shortened in length. The most common pattern of alternative splicing is a cassette exon that can be included in the mRNA or skipped, inserting or deleting a portion of internal sequence. A special case of paired cassette exons show mutually exclusive splicing, where one exon or the other is included, but not both. Finally, the excision of an intron can be suppressed, to leave the retained intronic sequence in the mRNA that is exported to the cytoplasm. Many genes show multiple positions of alternative splicing, creating complex combinations of exons and alternative segments and a large family of encoded proteins. Figure and legend adapted from [1]
Correct splicing is essential for normal cell function. For example, abnormal splicing has been found in several types of cancer cells. In the recent past, microarray-based technologies were used to study RNA splicing, but they were limited by the huge number of potential exon combinations. The recent availability of high-throughput sequencing methods has provided new opportunities to address this question more thoroughly. The short reads obtained from typical high-throughput sequencing are first mapped to a reference genome (when one exists), and the mapped reads are then used to predict the possible underlying transcript isoforms. This prediction can be done with or without predefined exon information. Some studies have also tried to generate the transcriptomes of species whose genomic sequence is not yet available. In those projects, reads were first assembled by a de Bruijn graph assembler such as Velvet [3] or ABySS [4], and the assembled contigs were then used to investigate the whole transcriptome. Because of the short read length (101 bp) and the high similarity between different isoforms, identifying the different splice isoforms and the abundance of each remains a big challenge.
The Challenge
The challenge consists of using short-read RNA-seq data from Mandrill and Rhinoceros fibroblasts and their derived iPS cells, as well as hESC, to predict as many transcript isoforms (generated by alternative splicing) as possible. Since there is no reference genome sequence for either Mandrill or Rhinoceros, the mapping of reads to exons will have to be done by homology with other organisms whose genomes are known. This clearly adds further complexity to the problem.
Participants will use the provided data sets, consisting of Poly(A)-selected mRNA from Rhinoceros, Mandrill and human acquired using the Illumina HiSeq deep sequencing system, to build the underlying transcriptome. The sequences provided correspond to strand-specific paired-end reads, with read lengths of 101 nucleotides and an average distance between the read pairs of approximately 100 bp. We request the inclusion of all the identified transcript sequences and their expression levels, quantitated by the number of reads assigned to each isoform, together with a measure of confidence in the prediction.
The gold standard against which we will score the predictions will be created from selected target transcripts sequenced on a Pacific Biosciences SMRT sequencer, which can generate read lengths between 1 kb and 2 kb [2] and should cover a good fraction of each whole transcript.
Data
Eight data files will be provided in FASTQ format. These datasets will contain the raw data information for each of the transcriptomes studied.
FASTQ is a format for storing nucleotide sequences and their corresponding quality scores. Each sequence is represented by four lines as follows:
@Title line with record identifier
Sequence line
+Optional repeat of title line
Quality line or lines with ASCII characters (can be wrapped)
For example:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
The first ‘@’ title line holds a record identifier in free format and has no length limit.
The second line contains the sequence. The IUPAC single-letter code for (ambiguous) DNA is used. No white space such as tabs or spaces should be expected. The sequences will be 101 nucleotides in length and correspond to paired-end reads, with spacers between the read pairs of lengths between ~100 and ~200.
The third item signals the end of the sequence line and starts with ‘+’. The ‘+’ line can contain just this one character, or a repeat of the title line following the ‘@’.
The last line contains the sequence quality information. It uses a subset of the ASCII printable characters (at most ASCII 33–126 inclusive) with a simple offset mapping. The quality string must be equal in length to the sequence string.
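The four-line layout described above can be read with a few lines of Python. This is only a minimal sketch: the names FastqRecord, parse_fastq and decode_quality are made up for illustration, records are assumed to be unwrapped (exactly four lines each), and the quality offset of 33 is an assumption that should be checked against the actual data, since the text above does not state which offset is used.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    title: str     # text after the leading '@'
    sequence: str  # nucleotide string
    quality: str   # one ASCII character per nucleotide


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from unwrapped, four-line-per-record FASTQ text."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence = next(it)
            plus = next(it)
            quality = next(it)
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@"):
            raise ValueError(f"title line must start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"third line must start with '+': {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Map quality characters to integer scores (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in quality]
```

For the compressed files, the same function can be fed from the standard library, for example `gzip.open(path, "rt")`, which yields text lines one at a time.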
There will be four compressed (gz) datasets for Rhinoceros:
- FG_NWR_FB_1.fastq.gz
- FG_NWR_FB_2.fastq.gz
- FG_NWR_iPSC_1.fastq.gz
- FG_NWR_iPSC_2.fastq.gz
Each of these datasets contains the compressed mRNA-seq reads from Northern White Rhinoceros (NWR) fibroblast cells or from iPSC induced from those fibroblasts. The suffixes _1 and _2 refer to the two mates of the paired-end sequencing.
There will also be four datasets for Mandrill:
- FG_Drill_FB_1.fastq.gz
- FG_Drill_FB_2.fastq.gz
- FG_Drill_iPSC_1.fastq.gz
- FG_Drill_iPSC_2.fastq.gz
Each of these contains the compressed mRNA-seq reads from Mandrill fibroblast cells or from iPSC induced from those fibroblasts. As explained before, the suffixes _1 and _2 refer to the two mates of the paired-end sequencing.
Finally, there will be two datasets for human:
- FG_hg_HESC_1.fastq.gz
- FG_hg_HESC_2.fastq.gz
Each of these contains the compressed mRNA-seq reads from human embryonic stem cells. As explained before, the suffixes _1 and _2 refer to the two mates of the paired-end sequencing.
Submission of Predictions and Writeup
Participants are required to submit a writeup file describing the methods used to solve the challenge, and 4 additional files, 2 per cell type corresponding to each of the two organisms. For each organism, please submit your predictions of the maximal transcripts that you inferred. Submit your predictions in tabular form using the following format:
Confidence \tab Counts \tab CTCTGTAAGTCAGG…GTGACTCGAGCGGAT
The first column contains the confidence that the sequence reported in that line is really an isoform. The confidence should be a number between 1 and 1000, with 1 being a very low confidence that the isoform is present and 1000 meaning certainty that it is present. The second column corresponds to the number of counts corresponding to that isoform. Finally, the third column contains the sequence of the isoform. The sequence should be the longest sequence that you think is present. Subsequences contained in longer sequences will be discarded. The file should contain no header row. The maximum number of rows allowed will be 100,000.
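The rules above (three tab-separated columns, confidence between 1 and 1000, no contained subsequences, no header, at most 100,000 rows) can be checked before submission. The following is a hedged sketch, not an official validator: the function names are hypothetical, containment is tested as an exact same-strand substring match (the organizers may use a different criterion), and the quadratic search is meant for illustration only and would be slow on a full-size file.

```python
from typing import Iterable, List, Tuple

MAX_ROWS = 100_000  # row limit stated in the challenge description

# (confidence, read count, isoform sequence)
Prediction = Tuple[int, int, str]


def drop_contained(predictions: Iterable[Prediction]) -> List[Prediction]:
    """Keep only sequences that are not substrings of a longer kept sequence."""
    kept: List[Prediction] = []
    # Visit the longest sequences first so containers precede their substrings.
    for conf, count, seq in sorted(predictions, key=lambda p: len(p[2]), reverse=True):
        if not any(seq in other for _, _, other in kept):
            kept.append((conf, count, seq))
    return kept


def format_submission(predictions: Iterable[Prediction]) -> str:
    """Return the tab-separated table text, one isoform per row, no header."""
    rows = drop_contained(predictions)
    if len(rows) > MAX_ROWS:
        raise ValueError(f"{len(rows)} rows exceed the limit of {MAX_ROWS}")
    lines = []
    for conf, count, seq in rows:
        if not 1 <= conf <= 1000:
            raise ValueError(f"confidence {conf} is outside 1..1000")
        if count < 0:
            raise ValueError(f"read count {count} is negative")
        lines.append(f"{conf}\t{count}\t{seq}\n")
    return "".join(lines)
```

Raising an error when the row limit is exceeded, instead of silently truncating, leaves the choice of which isoforms to drop with the participant.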
Submit your files as text files with the name
- DREAM6_AltSplice_<Organism>_<CellType>_<TeamName>.txt
replacing <Organism> with Rhino, Mandrill, or Human, <CellType> with Fibroblasts, IPS, or HESC, and <TeamName> with the name of the team under which you registered for the challenge.
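As a small illustration of the naming rule, the helper below builds a file name from its three parts and rejects values outside the lists given above. The function name submission_filename is hypothetical, and the check does not enforce which cell types go with which organism.

```python
ORGANISMS = ("Rhino", "Mandrill", "Human")
CELL_TYPES = ("Fibroblasts", "IPS", "HESC")


def submission_filename(organism: str, cell_type: str, team_name: str) -> str:
    """Build a prediction file name following the pattern given in the text."""
    if organism not in ORGANISMS:
        raise ValueError(f"organism must be one of {ORGANISMS}")
    if cell_type not in CELL_TYPES:
        raise ValueError(f"cell type must be one of {CELL_TYPES}")
    if not team_name:
        raise ValueError("team name must not be empty")
    return f"DREAM6_AltSplice_{organism}_{cell_type}_{team_name}.txt"
```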
Write-up
Finally, we request that each participating team submit a short write-up (around two to three pages) explaining the methods used to arrive at their predictions. This write-up can contain pseudo-code describing the algorithm used, workflows for arriving at the predicted isoforms, etc. Submit the write-up as the file
- DREAM6_AltSplice_Writeup_<TeamName>.ext
replacing <TeamName> with the name of your team and the file extension (ext) with your choice of doc or docx. The submission of this writeup is mandatory for participation in this challenge.
Scoring Metrics
The submissions will be scored against a not-so-golden Gold Standard created from long-read mRNA-seq using Pacific Biosciences technology. The data are new, and there may be some unforeseen difficulties in the scoring. The basic idea is that once we have a list of True Positives from the long-read RNA-seq, we will scan the submitted files and determine matches between the predicted isoforms and the True Positives. This will allow us to estimate the recall of each submission (that is, the fraction of the True Positives contained in the submitted list). We will also attempt to quantify, by correlation, the agreement between the predicted abundance of the different isoforms and the same measure obtained from the Gold Standard.
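To make the two quantities concrete, here is a rough sketch of how recall and an abundance correlation could be computed. It is not the organizers' scoring code. Two assumptions are made purely for illustration: a True Positive counts as recovered when its sequence is an exact substring of some predicted isoform (the real matching is likely to be alignment-based and more tolerant), and the correlation is Pearson's (the text does not say which correlation will be used).

```python
import math
from typing import Sequence


def recall(true_positives: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of gold-standard sequences found inside some predicted isoform."""
    if not true_positives:
        raise ValueError("the list of True Positives is empty")
    hits = sum(1 for gold in true_positives if any(gold in p for p in predicted))
    return hits / len(true_positives)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between predicted and gold-standard abundances."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long series with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

Because only True Positives are available, this sketch estimates recall but says nothing about precision, which matches the description above.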
For suggestions about scoring this challenge, please contact the organizers directly, or post a comment in the discussion forum.
Credits
The identity of the lab that provided the RNA will be disclosed after the submission deadline. The challenge was conceived by Mike Snyder and Gustavo Stolovitzky. The sequencing was done by Fabian Grubert, who also curated the challenge along with Yong Cheng, Bobby Prill and Gustavo Stolovitzky.
References
[1] Qin Li, Ji-Ann Lee & Douglas L. Black, Neuronal regulation of alternative pre-mRNA splicing, Nature Reviews Neuroscience 2007 8, 819-831
[2] http://oelemento.wordpress.com/2011/01/03/a-closer-look-at-the-first-pacbio-sequence-dataset/
[3] Zerbino DR, Birney E., Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Res. 2008 May;18(5):821-9.
[4] Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I., ABySS: a parallel assembler for short read sequence data, Genome Res. 2009 Jun;19(6):1117-23.
Download
The link below provides the website, user ID and password from which the data can be downloaded by FTP. Please note that there are eight files of ~10 GB each. You will need to allot sufficient storage and time to download the files.
