DREAM6/FlowCAP2 Molecular Classification of Acute Myeloid Leukaemia Challenge
Important Notice
The datasets and gold standards for Challenge 4 (Molecular Classification of Acute Myeloid Leukaemia Challenge) were generously provided prior to its publication and cannot be used for external publication without explicit permission from Dr. Wade T. Rogers (rogersw AT mail.med.upenn.edu).
9/27/2011
Submission is closed
Note (July 17, 2011): We had originally requested that code be included at submission. Following some participants' suggestions, CODE IS NO LONGER REQUIRED AT SUBMISSION. However, pseudocode must be submitted.
Synopsis
The goal of this challenge is to diagnose Acute Myeloid Leukaemia from patient samples using flow cytometry data. For the uninitiated in flow cytometry, here is a good glossary of flow cytometry terms.
Background
Flow cytometry (FCM) has been widely used by immunologists and cancer biologists for more than 30 years as a biomedical research tool to distinguish different cell types in mixed populations, based on the expression of cellular markers. It has also become a widely used diagnostic tool for clinicians to identify abnormal cell populations associated with disease. In the last decade, advances in instrumentation and reagent technologies have enabled simultaneous single-cell measurement of tens of surface and intracellular markers, as well as tens of signaling molecules, positioning FCM to play an even bigger role in medicine and systems biology [1,2]. However, the rapid expansion of FCM applications has outpaced the functionality of the traditional analysis tools used to interpret FCM data, so that scientists are faced with the daunting prospect of manually identifying interesting cell populations in 20-dimensional data from a collection of millions of cells. For these reasons, a reliable automated approach to flow cytometric analysis is desirable. While there has been growing interest among the scientific community in developing such methods, guidance for end users about their appropriate use and application is scarce.
This DREAM6/FlowCAP challenge is a collaboration between the DREAM project and the FlowCAP project. FlowCAP shares the spirit of DREAM, and focuses on advancing the development of computational methods for the identification of cell populations of interest in flow cytometry data. Besides attending the DREAM conference in Barcelona on Oct 14th, 2011, DREAM participants in this challenge may also be interested in attending the FlowCAP summit where this and other flow cytometry specific challenges will be discussed. The FlowCAP summit will take place on Sep 22 and 23, 2011 at the NIH site. For more information on the FlowCAP project please visit: http://flowcap.flowsite.org/ or join the FlowCAP Google Group and mailing list at http://groups.google.com/group/flowcap.
The Challenge
Detailed understanding of the pathogenesis of complicated immunological diseases like leukaemia requires the measurement of multiple cell sub-types. Identification of these subtypes is only possible by measurement of multiple factors (e.g., proteins) at the single cell level. Flow cytometry is the primary tool for making such measurements.
The goal of this challenge is to find cell populations, i.e., homogeneous groups/clusters of cells, which can be used to discriminate between Acute Myeloid Leukaemia (AML) positive patients and healthy donors. The samples consist of 43 AML positive patients and 316 healthy donors. Samples from peripheral blood or bone marrow aspirate were collected over a one-year period. The samples were subsequently studied with flow cytometry to quantitate the expression of different protein markers. Since the flow cytometer could not measure all of the markers simultaneously, each patient’s sample was subdivided into 8 aliquots (“tubes”) and analyzed with different marker combinations, 5 markers per tube (see Fig. 1 and Table 1). Information on whether they are healthy or AML positive is provided for about half of the donors as a training set. The challenge is to determine the state of health of the other half, based only on the provided flow cytometry data.

Figure 1. Schematics of the process used to create the data provided in this challenge. Samples from 316 healthy donors and 43 AML patients were collected. For each sample, cells were immuno-stained using 8 tubes, containing different combinations of 5 fluorophore-conjugated antibodies. The stained samples were analyzed with flow cytometry. The resulting data is provided in this challenge.
The five markers for each tube correspond to different fluorophore-conjugated antibodies targeting specific proteins, and are shown in Table 1.
Table 1. Fluorophore-conjugated antibodies contained in each of the 8 tubes in which the samples were incubated.
| Tube | FL1 | FL2 | FL3 | FL4 | FL5 |
| Tube 1 | IgG1-FITC | IgG1-PE | CD45-ECD | IgG1-PC5 | IgG1-PC7 |
| Tube 2 | Kappa-FITC | Lambda-PE | CD45-ECD | CD19-PC5 | CD20-PC7 |
| Tube 3 | CD7-FITC | CD4-PE | CD45-ECD | CD8-PC5 | CD2-PC7 |
| Tube 4 | CD15-FITC | CD13-PE | CD45-ECD | CD16-PC5 | CD56-PC7 |
| Tube 5 | CD14-FITC | CD11c-PE | CD45-ECD | CD64-PC5 | CD33-PC7 |
| Tube 6 | HLA-DR-FITC | CD117-PE | CD45-ECD | CD34-PC5 | CD38-PC7 |
| Tube 7 | CD5-FITC | CD19-PE | CD45-ECD | CD3-PC5 | CD10-PC7 |
| Tube 8 | Non Specific | Non Specific | Non Specific | Non Specific | Non Specific |
For example, the fluorophore-conjugated antibody CD45-ECD permits the identification of cells expressing the CD45 antigen present in human biological samples. The expression of the CD45 molecule correlates with the stage of differentiation of the cells studied and is weak in the case of acute myeloid leukaemia, thus enabling malignant cells to be distinguished from normal ones [3]. Note that CD45-ECD is present in tubes 1 through 7. No other fluorophore-conjugated antibody is present in more than one tube, although CD19 is targeted in both tube 2 (CD19-PC5) and tube 7 (CD19-PE). The 8th tube is an isotype control tube, with non-specific-binding antibodies (i.e., mouse antibodies which are not supposed to bind to human cells) to measure the background noise of the assays.
The data
From the discussion of the previous section it follows that there are a total of 2872 flow cytometry assays (8 aliquots for 359 subjects, of which 316 are healthy and 43 AML positive).
The raw data for all the subjects and all the tubes is compressed in the file
- FCS.zip
which, when uncompressed, contains 2872 files, numbered 0001 through 2872 (the file naming is described below).
Each of these files is formatted according to the Flow Cytometry Standard (FCS) [4] (see also this link).
We also provide preprocessed data (transformed/compensated) in the compressed file
- CSV.zip
which, when uncompressed, contains 2872 files, numbered 0001 through 2872 (the file naming is described below).
Each of these files is formatted as comma separated values (CSV), and contains a header row that identifies each of the 7 columns. Each subsequent row is an event (a cell) detected by the flow cytometer. The following example shows the typical contents of a CSV file
The number of events per file ranges from 6,764 to 49,370. The meaning of the header row columns is as follows:
FS (forward scatter) is roughly proportional to cell size, and SS (side scatter) is a rough indicator of cellular granularity, membrane complexity, number of organelles, etc. The remaining five columns hold the fluorescence measurements for the five markers of the corresponding tube (FL1 through FL5 in Table 1). For more information on how CSV files are processed from FCS files see [5]. While participants can use the processed CSV files for their analyses, there may be merit in applying different preprocessing algorithms directly to the raw data (FCS files) to solve the challenge.
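As a minimal sketch of working with the preprocessed files, the snippet below reads one CSV file (a header row followed by one row per event) and computes a per-column mean. The function names are illustrative, and the header labels and values in the demo are made up for the example: the exact column labels in the real files should be taken from their header row.

```python
import csv
import io


def read_events(handle):
    """Read one preprocessed file: a header row, then one row per event."""
    reader = csv.reader(handle)
    header = [name.strip() for name in next(reader)]
    events = [[float(value) for value in row] for row in reader if row]
    return header, events


def column_means(header, events):
    """Return the mean of every column, keyed by its header label."""
    count = len(events)
    return {
        name: sum(row[i] for row in events) / count
        for i, name in enumerate(header)
    }


# Made-up stand-in for a real file: the labels and values are assumptions,
# not taken from the challenge data.
demo = io.StringIO(
    "FS,SS,FL1,FL2,FL3,FL4,FL5\n"
    "0.5,0.25,0.1,0.2,0.3,0.4,0.5\n"
    "1.5,0.75,0.3,0.4,0.5,0.6,0.7\n"
)
header, events = read_events(demo)
means = column_means(header, events)
```

For a real file, the same function can be called on an opened file handle, e.g. `with open(path, newline="") as handle: read_events(handle)`.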
The naming of the files (0001.ext, 0002.ext, …, 2872.ext) can be used to identify the corresponding subject number and tube number: each subject contributes 8 consecutive files, so that subject s corresponds to file numbers 8(s − 1) + 1 through 8s.
For example, the file numbers corresponding to subject 17 are 129, 130, …, 136. Likewise, file number 1539 corresponds to subject 193. The association between file number, tube number and subject number is given explicitly in the file
- DREAM6_AML_FilesTubesAndSubjects.CSV
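The two examples above are consistent with eight consecutive files per subject, which can be sketched as follows. The function names are illustrative, and the tube rule assumes that the files of a subject are stored in tube order (tube 1 first), which should be checked against DREAM6_AML_FilesTubesAndSubjects.CSV.

```python
TUBES_PER_SUBJECT = 8


def subject_of(file_number):
    """Subject number for a file number in 1..2872."""
    return (file_number - 1) // TUBES_PER_SUBJECT + 1


def tube_of(file_number):
    """Tube number for a file number.

    Assumption: files within a subject are ordered by tube number.
    """
    return (file_number - 1) % TUBES_PER_SUBJECT + 1


def files_of(subject):
    """The eight file numbers that belong to one subject."""
    first = (subject - 1) * TUBES_PER_SUBJECT + 1
    return list(range(first, first + TUBES_PER_SUBJECT))
```

With these definitions, `files_of(17)` gives 129 through 136 and `subject_of(1539)` gives 193, matching the examples in the text.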
As a training set, we give the state of health (AML or Normal) of subjects 1 through 179. These subjects comprise 23 of the 43 AML-affected subjects and 156 of the 316 healthy donors. This information is provided in the file:
- DREAM6_AML_TrainingSet.CSV
Submission of Predictions and Write-up
Participants are required to submit a list of subjects, from subject 180 through subject 359, ranked according to the confidence they assign to each subject being affected with AML. The order should go from the most reliable prediction that a subject is AML positive (first row) to the most reliable prediction that a subject is normal (last row). Use a two-column, tab-separated format, as in:
S \tab XYZ
where S is one of the subjects (180 through 359) and XYZ is a score between 0 and 1 that indicates the confidence level you assign to the prediction. E.g., XYZ = 1 if subject S is deemed to be AML positive with the highest confidence, and XYZ = 0 if S is deemed to be a normal donor with the highest confidence. If there are omitted subjects, we will consider them as appearing randomly ordered at the end of the list. Save your submission as a text file, and name it:
- DREAM6_AML_Predictions_TeamName.txt
where "TeamName" is the name of the team with which you registered for the challenge. Best performance will be assessed based on the accuracy of the results of this prediction.
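A minimal sketch of producing a file in this format is shown below. It assumes the predictions are held in a dictionary from subject number to score; the function name and the six-decimal formatting are choices made for this example and are not prescribed by the challenge.

```python
def format_submission(scores):
    """Build the submission text from {subject number: AML confidence}.

    Rows are ordered from the highest score (most confident AML) to the
    lowest (most confident normal); ties are broken by subject number.
    """
    for subject, score in scores.items():
        if not 180 <= subject <= 359:
            raise ValueError(f"subject {subject} is outside 180..359")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score {score} is outside [0, 1]")
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return "".join(f"{subject}\t{score:.6f}\n" for subject, score in ranked)


# Hypothetical usage, with "TeamName" replaced by the registered team name:
# with open("DREAM6_AML_Predictions_TeamName.txt", "w") as handle:
#     handle.write(format_submission(scores))
```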
Finally, we request that each participating team submit a short write-up with the documentation necessary to run their algorithm, explaining the methods used to arrive at their predictions. This write-up can contain pseudo-code describing the algorithm used, workflows for analyzing the data, etc. Name the write-up as
- DREAM6_AML_Writeup_TeamName.ext
replacing "TeamName" with the name of your team and the file extension (ext) with your choice of doc or docx. The submission of the write-up is mandatory for participation in the main challenge.
Scoring Metrics
Results will be scored using the area under the precision versus recall (PR) curve. Precision is the fraction of subjects predicted to be AML positive that are truly AML positive, and recall is the fraction of all AML-positive subjects that are correctly predicted as positive. Other metrics, such as the area under the receiver operating characteristic (ROC) curve, will also be evaluated. Teams will be ranked according to their overall performance based on the area under the PR and ROC curves. We will evaluate these predictions as discussed in [6].
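To make the two metrics concrete, the sketch below evaluates a ranked list of true labels (1 for AML, 0 for normal, in submitted order). It uses average precision as one common estimate of the area under the PR curve, and the fraction of correctly ordered (AML, normal) pairs as the area under the ROC curve. This is an illustration only: the function names are hypothetical, and the exact procedure discussed in [6] may differ, for example in how the PR curve is interpolated.

```python
def precision_recall_points(labels_ranked):
    """(precision, recall) after each cutoff k = 1..n of the ranked list."""
    total_pos = sum(labels_ranked)
    points = []
    true_pos = 0
    for k, label in enumerate(labels_ranked, start=1):
        true_pos += label
        points.append((true_pos / k, true_pos / total_pos))
    return points


def average_precision(labels_ranked):
    """Mean of the precision values at the ranks of the true positives."""
    total_pos = sum(labels_ranked)
    points = precision_recall_points(labels_ranked)
    hits = [p for (p, _), label in zip(points, labels_ranked) if label]
    return sum(hits) / total_pos


def auroc(labels_ranked):
    """Fraction of (positive, negative) pairs with the positive ranked first."""
    total_pos = sum(labels_ranked)
    total_neg = len(labels_ranked) - total_pos
    neg_seen = 0
    correct_pairs = 0
    for label in labels_ranked:
        if label:
            # Every negative not yet seen is ranked below this positive.
            correct_pairs += total_neg - neg_seen
        else:
            neg_seen += 1
    return correct_pairs / (total_pos * total_neg)
```

For the ranking `[1, 0, 1, 0]`, three of the four (AML, normal) pairs are correctly ordered, so `auroc` returns 0.75, while a perfect ranking such as `[1, 1, 0, 0]` scores 1.0 on both metrics.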
Credits
The challenge was conceived by the FlowCAP committee, and curated by Ryan Brinkman and Gustavo Stolovitzky along with Nima Aghaeepour, Julio Saez-Rodriguez and Pablo Meyer.
References
[1] Querec TD et al (2009). “Systems biology approach predicts immunogenicity of the yellow fever vaccine in humans”, Nature Immunology, 10, 116 – 125.
[2] Bendall SC et al (2011). “Single-Cell Mass Cytometry of Differential Immune and Drug Responses Across a Human Hematopoietic Continuum”, Science, 332, 687-96.
[3] Borowitz MJ, Guenther KL, Shults KE, Stelzer GT (1993). “Immunophenotyping of acute leukemia by flow cytometric analysis. Use of CD45 and right-angle light scatter to gate on leukemic blasts in three-color analysis”, Am. J. Clin. Pathol., 100, 534-540.
[4] Spidlen J, Shooshtari P, Kollmann TR, Brinkman RR (2011). “Flow cytometry data standards”, BMC Res Notes, 4:50.
[5] Parks DR, Roederer M, Moore WA (2006). “A new ‘Logicle’ display method avoids deceptive effects of logarithmic scaling for low signals and compensated data”, Cytometry A, 69(6):541-51.
[6] Stolovitzky G, Prill RJ, Califano A (2009). “Lessons from the DREAM2 Challenges”, in Stolovitzky G, Kahlem P, Califano A, Eds, Annals of the New York Academy of Sciences, 1158:159-95.
Download
The link below provides the DREAM6_AML_FilesTubesAndSubjects.CSV and DREAM6_AML_TrainingSet.CSV files, and a file indicating the website from which the CSV.zip and FCS.zip files can be downloaded via FTP. Please note that these two files are about 10 GB each, so you need to allot sufficient storage and time to download them.