Sage Bionetworks-DREAM Breast Cancer Prognosis Challenge


NEW: Final phase of the challenge has started!


1. As a reminder, October 15 is the deadline for submitting models for scoring and for determining Challenge winners (first using the METABRIC data and then, a little later this fall, using the Oslo-Val data). To make sure that no one misses this crucial deadline, we will accept models until 11:59 pm Pacific on October 15. Please don't miss this deadline!

2. To select the top model as assessed using METABRIC data, we will consider no more than 5 models from each individual or team. Shortly after the October 15 deadline, we will send out a message letting you know that, unless we hear otherwise from you, we will submit your 5 top-scoring models (as listed on the October 15 leaderboard) for the final METABRIC model assessment.

3. Please note that a key aspect of our judging procedure will be to confirm that your model code is readable and reusable (i.e., such that others could use it or combine it with their own code to build a new and potentially better model). 


The goal of the breast cancer prognosis Challenge is to assess the accuracy of computational models designed to predict breast cancer survival, based on clinical information about the patient's tumor as well as genome-wide molecular profiling data including gene expression and copy number profiles.


Molecular diagnostics for cancer therapeutic decision-making are among the most promising applications of genomic technology. Several diagnostic tests have gained regulatory approval in recent years. Molecular profiles have proved particularly powerful in adding prognostic information to standard clinical practice in breast cancer, using gene-expression-based diagnostic tests such as MammaPrint [1] and Oncotype Dx [2].

Based on initial promising clinical results, computational approaches to infer molecular predictors of cancer clinical phenotypes are one of the most active areas of research in both industrial and academic institutions, leading to a flood of published reports of signatures predictive of cancer phenotypes. Several trends have emerged through these numerous studies: 1) genes defining predictive signatures of the same phenotype often do not overlap across multiple studies; 2) predictive signatures reported by one group may not prove robust in other studies; 3) there is no consensus regarding the most accurate signatures or computational methods for inferring predictive signatures; 4) there is no consensus regarding the added value of incorporating molecular data in addition to or instead of traditionally used clinical covariates.

There is a critical need to objectively and systematically assess whether genomic data currently provide value above and beyond classic TNM staging and other clinical covariates. For instance, the UK’s NICE (National Institute for Health and Clinical Excellence) has issued initial guidance that the genomic prognostic signatures currently being marketed do not supplant clinical measures in a cost-effective manner. The emergence of datasets containing clinical measurements combined with genome-wide molecular profiles of large breast cancer patient cohorts now allows prognostic models to be systematically evaluated. Given the complexity of the data and the plethora of possible modeling approaches, we believe the most powerful way to determine the optimal use of genomic and clinical information in breast cancer prognosis is a community-based effort that evaluates the accuracy of many different modeling approaches on a common dataset and analytical platform, using a blind methodology to avoid the biases of self-assessment.

The Challenge

This Challenge will create a community-based effort to provide an unbiased assessment of models and methodologies for the prediction of breast cancer survival. A common dataset will be provided to all participants, with a validation dataset held out for model evaluation. A novel dataset will be generated at the end of the Challenge and used to provide a final, unbiased score for each model.

Resources provided:

  1. Full-time use of a large computing resource is being donated by Google Inc. for the duration of the Challenge. The availability of these resources to all participants will allow for a democratization of computational resources and will empower participants to apply their best ideas in a high performance compute environment. Google’s donation of a common compute space also allows all models, including computationally intensive ones, to be shared and re-run on a common platform, enabling transparency of the process, and future work to evolve and extend components of promising models, either in breast cancer prognosis or other applications.
  2. The training dataset will come from the METABRIC cohort of 2,000 breast cancer samples and will include detailed clinical annotations, survival data with a median time of 10 years, gene expression data, and copy number data [3].
  3. Additional breast cancer datasets, curated by Sage Bionetworks, will provide information on several thousand additional patients that Challenge participants can use in their model development.
  4. A web-based platform called Synapse will be provided by Sage Bionetworks: the Synapse platform will enable transparent, reproducible model building and analysis workflows, as well as the sharing of data, tools, and models with the Challenge community, the model evaluators and the publication reviewers.
  5. The validation dataset will be newly released and derive from 300 to 500 fresh frozen primary tumors with the same clinical annotations and survival data as the METABRIC cohort. Gene expression and copy number data will be generated for this cohort using the same molecular profiling platforms as were used to generate the METABRIC data. This will provide a truly novel validation set for the scoring of predictive models.
  6. For questions about the challenge please contact

Challenge timeline:

June-July 17th, 2012

Sign up for the Sage Bionetworks - DREAM Breast Cancer Prognosis Challenge. Registered participants will be notified by email about the initiation of the Challenge.

July 17th-October 15th, 2012

A live demo call was held on July 17th to help participants get started with the Challenge. A step-by-step guide is available here, with additional details about the Challenge available at Breast Cancer Challenge: Detailed Description. Data from 1,000 samples will be provided to participants for training models. An additional 500 samples will be used to provide real-time evaluation of all submitted models. The remaining 500 samples will be used for final scoring of all models (taking place after October 15th).

October 15th, 2012

Final submission of all models, to be scored against the 500 METABRIC samples not used in the previous phase. The deadline for submitting models for the Breast Cancer Prognosis Challenge is 5PM EST October 15th, and the best performers will be announced at the DREAM 7 Conference, taking place in San Francisco on November 12 to 16.

Late 2012

Final assessment of all models in newly generated data. For the new validation data set, molecular and clinical data on approximately 350 breast cancer samples (with archived fresh frozen tumor samples) is being provided by the group of Anne-Lise Borresen-Dale with the help of a donation from AVON. We are currently curating the clinical records of this patient cohort to harmonize with the current METABRIC dataset and working on generating the genomic profiling data for these samples. We aim to generate these data by the November 12 DREAM conference and announce initiation of the final evaluation to be performed on this data set. We will keep participants informed on progress in generating these data.


Starting in early July, all data for the Challenge will be accessible through Sage Bionetworks’ Synapse software platform and can be loaded into R objects via simple function calls through the Synapse R client. The data will comprise the following information:

Survival data 

  • Survival data is loaded into R as a Surv object as defined in the R survival package. This object is simply a 2-column matrix with sample names as row names and the following columns:
    • time – time from diagnosis to last follow-up.
    • status – whether the patient was alive at last follow-up time.

Feature data

  • Gene expression data.
    • Performed on the Illumina HT 12v3 platform.
    • Loaded as Bioconductor ExpressionSet object.
    • Data normalized as described in [3].
  • Copy number data.
    • Performed on the Affymetrix SNP 6.0 platform.
    • Loaded as Bioconductor ExpressionSet object.
    • Data normalized as described in [3].
  • Clinical covariates. For a detailed explanation of the clinical data and how it is currently used in breast cancer prognosis and treatment, see Breast Cancer Challenge clinical background.
    • Loaded as a data.frame object with the following features. Note that in the initial data release on 7/17/2012, factor data is encoded as characters. We are working on a new data release that encodes these variables as factors with pre-specified factor levels, and we expect to update the data in the coming weeks.
      Each entry gives the variable name, type, and description, followed by the factor levels where applicable.
      • age_at_diagnosis (numeric): age of patient at diagnosis of disease
      • group (factor): disease and treatment group
        • 1 = Lymph Node negative and have not received chemotherapy
        • 2 = ER positive, Lymph Node positive, have received hormone therapy but no chemotherapy
        • 3 = ER negative, Lymph Node positive, have received chemotherapy
        • 4 = all others
      • grade (integer): grade of disease (1, 2, 3)
      • size (integer): size of tumor in cm
      • lymph_nodes_positive (factor): lymph node assessment
        • positive
        • negative
      • histological_type (factor): tumor histology
        • IDC
        • ILC
      • ER_IHC_status (factor): ER status
        • pos
        • neg
        • null
      • cellularity (factor): tumor cellularity
        • low
        • moderate
        • high
        • undef
      • Pam50Subtype (factor): Pam50 breast cancer intrinsic classifier
        • LumA
        • LumB
        • Her2
        • Normal
        • Basal
        • NC
      • Treatment (factor): treatment received
        • HT/RT = hormone / radiation therapy
        • CT/HT/RT = chemo / hormone / radiation therapy
        • NONE = none
        • CT/RT = chemo / radiation therapy
        • HT = hormone therapy
        • RT = radiation therapy
        • CT/HT = chemo / hormone therapy
        • CT = chemotherapy
      • Site (factor): site of data collection
        • 1
        • 2
        • 3
        • 4
        • 5
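
Because the initial release stores factor variables as character strings, participants may want to map them onto a fixed set of levels themselves. The following is a minimal sketch of that idea in Python, offered as a language-neutral illustration only (the Challenge itself works with R data.frame objects). The `FACTOR_LEVELS` table and the `encode_factor` helper are hypothetical names; the levels are copied from the covariate list above, and their order is an assumption.

```python
# Hypothetical level table: levels copied from the covariate list,
# order assumed (the official release may order them differently).
FACTOR_LEVELS = {
    "ER_IHC_status": ["pos", "neg", "null"],
    "cellularity": ["low", "moderate", "high", "undef"],
    "Pam50Subtype": ["LumA", "LumB", "Her2", "Normal", "Basal", "NC"],
}


def encode_factor(variable, values):
    """Map character values to integer codes using pre-specified levels."""
    levels = FACTOR_LEVELS[variable]
    index = {level: code for code, level in enumerate(levels)}
    codes = []
    for value in values:
        if value not in index:
            # Fail loudly instead of silently creating a new level.
            raise ValueError(f"unexpected level {value!r} for {variable}")
        codes.append(index[value])
    return codes


print(encode_factor("Pam50Subtype", ["LumA", "Basal", "NC", "LumA"]))  # [0, 4, 5, 0]
```

Fixing the levels up front means that a level absent from one data split (for example, a subtype missing from a small validation set) still receives the same code everywhere.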


Submission of Predictions and Write-up

A primary goal of this Challenge is to promote transparent, reusable models that can be assessed and extended by the community. To this end, models built for this Challenge will be constructed using the R programming language and uploaded to a common platform (Synapse) provided by Sage Bionetworks. Models will be uploaded as R objects implementing a function called customPredict() that returns a vector of risk predictions when given a set of feature data as input. A validation script will run the customPredict() function of each submitted model, and the resulting predictions will be scored as described in the Scoring section below.
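
The contract described above (a model object exposing one prediction function, called by a validation script) can be sketched as follows. This is a Python analogue for illustration only: actual submissions are R objects implementing customPredict(), and the `ToyModel` class, its `custom_predict` method, the `run_validation` helper, and the toy risk rule are all hypothetical and not part of the Challenge code.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[int, float, str]]


class ToyModel:
    """Hypothetical model object; the rule below is not a real prognostic model."""

    def custom_predict(self, clinical_rows: List[Row]) -> List[float]:
        # Assumed toy rule: risk grows with tumor size and grade.
        return [float(row["size"]) + float(row["grade"]) for row in clinical_rows]


def run_validation(model, clinical_rows: List[Row]) -> List[float]:
    """Call the model and check that it returns one numeric risk per sample."""
    predictions = model.custom_predict(clinical_rows)
    if len(predictions) != len(clinical_rows):
        raise ValueError("expected one prediction per sample")
    return [float(p) for p in predictions]


samples = [
    {"size": 2, "grade": 1},
    {"size": 5, "grade": 3},
]
print(run_validation(ToyModel(), samples))  # [3.0, 8.0]
```

Keeping the interface this narrow (features in, one risk value per sample out) is what lets a single script re-run and compare every submitted model.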

The Challenge supports source code submissions, allowing the validation script to also reproduce the training of the model. At various times throughout the competition, the creator of the best-performing model on the leaderboard will be asked to write a short description of their approach, to be posted on the discussion forum.

All Phase 3 submissions must be accompanied by a write-up that includes a short description of the approach used in the final model.


Scoring

The Challenge models will be scored by calculating the concordance index between the predicted survival and the true survival information in the validation dataset (accounting for the censoring variable indicating whether the patient was alive at last follow-up).
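
To make the metric concrete, here is a minimal sketch of a pairwise concordance index that handles censoring. It is written in Python for illustration and is not the organizers' scoring code. It assumes that a higher predicted risk means shorter expected survival, that `events[i]` is True when a death was observed (that is, the patient was not alive at last follow-up), and that pairs with tied times are skipped. The function name and argument names are hypothetical.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs that the risk scores order correctly.

    times  -- follow-up time for each patient
    events -- True if death was observed, False if censored (alive at last follow-up)
    risks  -- predicted risk; higher is assumed to mean shorter survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored patient cannot serve as the earlier event in a pair.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [2, 4, 6, 8]
events = [True, False, True, True]
risks = [0.9, 0.1, 0.2, 0.3]
print(concordance_index(times, events, risks))  # 0.75
```

In this example, four pairs are comparable and three are ordered correctly. A score of 0.5 corresponds to random ordering and 1.0 to a perfect ranking of patients by survival.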

The final assessment of models and the determination of the best performer will be based on the concordance index of predictions on the test dataset in Phase 3 of the Challenge. In addition, other scoring metrics will be considered depending on the suggestions of the community throughout the Challenge.


The high-impact journal Science Translational Medicine (STM) has agreed to publish the results of the best-performing individual or team in Phase 3 as a "prize" for best performance, provided their score is better than that of a pre-defined baseline set of models. STM representatives agreed that having an evaluation committee re-run and compare all models in a transparent environment can serve the role of peer review (Challenge-assisted peer review), allowing the results from the winning individual or team to be published without additional review. Furthermore, the lead author of the best-performing submission will receive a speaking invitation at the DREAM 7 Conference, taking place in San Francisco on November 12 to 16.


This Challenge is fueled by the generous donation of clinical study data on 2,000 breast cancer patients obtained by Samuel Aparicio of the BC Cancer Research Centre, Carlos Caldas of Cancer Research UK, and Anne-Lise Borresen-Dale of Oslo University Hospital. The Challenge was organized by Adam Margolin, Erhan Bilal, Mike Kellen, Brian Bot, Brig Mecham, Erich Huang, Andrew Trister, Charles Ferte, Gustavo Stolovitzky, and Stephen Friend, who benefited from many discussions with Laura van ’t Veer.


[1] L. J. van ’t Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend, “Gene expression profiling predicts clinical outcome of breast cancer,” Nature, vol. 415, no. 6871, pp. 530–536, Jan. 2002.

[2] S. Paik, S. Shak, G. Tang, C. Kim, J. Baker, M. Cronin, F. L. Baehner, M. G. Walker, D. Watson, T. Park, W. Hiller, E. R. Fisher, D. L. Wickerham, J. Bryant, and N. Wolmark, “A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer,” N. Engl. J. Med., vol. 351, no. 27, pp. 2817–2826, Dec. 2004.

[3] C. Curtis, S. P. Shah, S.-F. Chin, G. Turashvili, O. M. Rueda, M. J. Dunning, D. Speed, A. G. Lynch, S. Samarajiwa, Y. Yuan, S. Gräf, G. Ha, G. Haffari, A. Bashashati, R. Russell, S. McKinney, METABRIC Group, A. Langerød, A. Green, E. Provenzano, G. Wishart, S. Pinder, P. Watson, F. Markowetz, L. Murphy, I. Ellis, A. Purushotham, A.-L. Børresen-Dale, J. D. Brenton, S. Tavaré, C. Caldas, and S. Aparicio, “The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups,” Nature, 2012.