DREAM6/FlowCAP2 Molecular Classification of Acute Myeloid Leukaemia Challenge

 

Important Notice

The datasets and gold standards for Challenge 4 (Molecular Classification of Acute Myeloid Leukaemia Challenge) were generously provided prior to its publication and cannot be used for external publication without explicit permission from Dr. Wade T. Rogers (rogersw AT mail.med.upenn.edu).

9/27/2011

 

Submission is closed

 

Note (July 17, 2011) 

 

We had originally requested that code be included at submission. Following suggestions from some participants, CODE IS NO LONGER REQUIRED AT SUBMISSION. However, pseudocode must be submitted.

 

 

Synopsis

The goal of this challenge is to diagnose Acute Myeloid Leukaemia from patient samples using flow cytometry data. For the uninitiated in flow cytometry, here is a good glossary of flow cytometry terms.

 

Background

Flow cytometry (FCM) has been widely used by immunologists and cancer biologists for more than 30 years as a biomedical research tool to distinguish different cell types in mixed populations, based on the expression of cellular markers. It has also become a widely used diagnostic tool for clinicians to identify abnormal cell populations associated with disease. In the last decade, advances in instrumentation and reagent technologies have enabled simultaneous single-cell measurement of tens of surface and intracellular markers, as well as tens of signaling molecules, positioning FCM to play an even bigger role in medicine and systems biology [1,2]. However, the rapid expansion of FCM applications has outpaced the functionality of the traditional analysis tools used to interpret FCM data, so that scientists are faced with the daunting prospect of manually identifying interesting cell populations in 20-dimensional data from a collection of millions of cells. For these reasons, a reliable automated approach to flow cytometric analysis is desirable. While there has been growing interest among the scientific community in developing such methods, guidance for end users about their appropriate use and application is scarce.

This DREAM6/FlowCAP challenge is a collaboration between the DREAM project and the FlowCAP project. FlowCAP shares the spirit of DREAM, and focuses on advancing the development of computational methods for the identification of cell populations of interest in flow cytometry data.  Besides attending the DREAM conference in Barcelona on Oct 14th, 2011, DREAM participants in this challenge may also be interested in attending the FlowCAP summit where this and other flow cytometry specific challenges will be discussed. The FlowCAP summit will take place on Sep 22 and 23, 2011 at the NIH site. For more information on the FlowCAP project please visit: http://flowcap.flowsite.org/ or join the FlowCAP Google Group and mailing list at http://groups.google.com/group/flowcap.

The Challenge

Detailed understanding of the pathogenesis of complicated immunological diseases like leukaemia requires the measurement of multiple cell sub-types. Identification of these subtypes is only possible by measurement of multiple factors (e.g., proteins) at the single cell level. Flow cytometry is the primary tool for making such measurements.

The goal of this challenge is to find cell populations, i.e., homogeneous groups/clusters of cells, which can be used to discriminate between Acute Myeloid Leukemia (AML) positive patients and healthy donors. The samples consist of 43 AML positive patients and 316 healthy donors. Samples from peripheral blood or bone marrow aspirate were collected over a one-year period. The samples were subsequently studied with flow cytometry to quantify the expression of different protein markers. Since the flow cytometer could not measure all of the markers simultaneously, each patient’s sample was subdivided into 8 aliquots (“tubes”) and analyzed with different marker combinations, 5 markers per tube (see Fig. 1 and Table 1). For about half of the donors, information on whether they are healthy or AML positive is provided as a training set. The challenge is to determine the state of health of the other half, based only on the provided flow cytometry data.

 


Figure 1. Schematics of the process used to create the data provided in this challenge. Samples from 316 healthy donors and 43 AML patients were collected. For each sample, cells were immuno-stained using 8 tubes, containing different combinations of 5 fluorophore-conjugated antibodies. The stained samples were analyzed with flow cytometry. The resulting data is provided in this challenge.

The five markers for each tube correspond to different fluorophore-conjugated antibodies targeting specific proteins, as shown in Table 1.

Table 1. Fluorophore-conjugated antibodies contained in each of the 8 tubes in which the samples were incubated.

 

        FL1           FL2           FL3           FL4           FL5
Tube 1  IgG1-FITC     IgG1-PE       CD45-ECD      IgG1-PC5      IgG1-PC7
Tube 2  Kappa-FITC    Lambda-PE     CD45-ECD      CD19-PC5      CD20-PC7
Tube 3  CD7-FITC      CD4-PE        CD45-ECD      CD8-PC5       CD2-PC7
Tube 4  CD15-FITC     CD13-PE       CD45-ECD      CD16-PC5      CD56-PC7
Tube 5  CD14-FITC     CD11c-PE      CD45-ECD      CD64-PC5      CD33-PC7
Tube 6  HLA-DR-FITC   CD117-PE      CD45-ECD      CD34-PC5      CD38-PC7
Tube 7  CD5-FITC      CD19-PE       CD45-ECD      CD3-PC5       CD10-PC7
Tube 8  Non Specific  Non Specific  Non Specific  Non Specific  Non Specific

 

 

For example, the fluorophore-conjugated antibody CD45-ECD permits the identification of cells expressing the CD45 antigen present in human biological samples. The expression of the CD45 molecule correlates with the stage of differentiation of the cells studied and is weak in the case of acute myeloid leukaemia, thus enabling malignant cells to be distinguished from normal ones [3]. Note that CD45-ECD is present in tubes 1 through 7. No other antibody is present in more than 1 tube. The 8th tube is an isotype control tube, with non-specific-binding antibodies (i.e., mouse antibodies which are not supposed to bind to human cells) to measure the background noise of the assays.

 

The data

From the discussion in the previous section it follows that there are a total of 2872 flow cytometry assays (8 aliquots for each of 359 subjects, of which 316 are healthy and 43 are AML positive).

The raw data for all the subjects and all the tubes is compressed in the file

  • FCS.zip

which, when uncompressed, contains 2872 files, named

0001.FCS
0002.FCS
...
2871.FCS
2872.FCS

Each of these files is formatted according to the Flow Cytometry Standard (FCS) [4] (see also this link).
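
For participants who want to work directly from the raw files, the following is a minimal sketch of how the fixed-width header at the start of an FCS file could be read. It assumes the header layout described in the FCS standard (a 6-character version string, 4 spaces, then six right-justified 8-character ASCII offsets) and uses a synthetic header built in the example itself, since the actual FCS version of the challenge files is not stated here. The function name parse_fcs_header and the offset values are illustrative only.

```python
def parse_fcs_header(header: bytes) -> dict:
    """Parse the first 58 bytes of an FCS file into a version and segment offsets.

    Assumed layout (per the FCS standard): bytes 0-5 version string,
    bytes 6-9 spaces, then six 8-byte ASCII integer fields.
    """
    if len(header) < 58:
        raise ValueError("an FCS header needs at least 58 bytes")
    version = header[0:6].decode("ascii")
    if not version.startswith("FCS"):
        raise ValueError("not an FCS file: " + version)
    names = ["text_start", "text_end", "data_start",
             "data_end", "analysis_start", "analysis_end"]
    parsed = {"version": version}
    for i, name in enumerate(names):
        field = header[10 + 8 * i: 18 + 8 * i].decode("ascii").strip()
        # A blank field is treated as 0 (segment absent).
        parsed[name] = int(field) if field else 0
    return parsed


# Synthetic header for illustration; the offsets are made-up values.
fake_header = (b"FCS3.0" + b"    " +
               "".join(f"{n:>8d}" for n in (58, 1023, 1024, 99999, 0, 0)).encode("ascii"))
parsed = parse_fcs_header(fake_header)
print(parsed)

# With a real file one would read the first 58 bytes, for example:
#     with open("0001.FCS", "rb") as handle:
#         parsed = parse_fcs_header(handle.read(58))
```

The offsets only locate the TEXT, DATA and ANALYSIS segments; decoding the events themselves additionally requires the keywords stored in the TEXT segment, which this sketch does not attempt.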

We also provide preprocessed data (transformed/compensated) in the compressed file

  • CSV.zip

which, when uncompressed, contains 2872 files, named

0001.CSV
0002.CSV
...
2871.CSV
2872.CSV

Each of these files is formatted as comma-separated values (CSV) and contains a header row that identifies each of the 7 columns. Each subsequent row is an event (a cell) detected by the flow cytometer. The following example shows the typical contents of a CSV file:

"FS Lin","SS Log","FL1 Log","FL2 Log","FL3 Log","FL4 Log","FL5 Log"
273,0.545,0.219,0.210,0.181,0.163,0.144
......
793,0.649,0.457,0.377,0.344,0.1889,0.149

The number of events per file ranges from 6,764 to 49,370. The columns named in the header row have the following meanings:

FS Lin: Forward Scatter in linear scale;
SS Log: Side Scatter in logarithmic scale;
FLi Log: Fluorescence intensity in Channel i in logarithmic scale.

SS is a rough indicator of cellular granularity, membrane complexity, number of organelles, etc. and FS is roughly proportional to cell size.  For more information on how CSV files are processed from FCS files see [5]. While participants can use the processed CSV files to do their analyses, there may be merit in trying different preprocessing algorithms directly from the raw data (FCS files) to solve the challenge.
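
As an illustration of the CSV layout described above, here is a minimal sketch that parses the header row and the events using only the Python standard library. The sample text reuses the two data rows shown in the example (without the elided rows), and the function name read_events is an assumption made for this sketch.

```python
import csv
import io

# Two events copied from the example above; a real file has thousands of rows.
SAMPLE = '''"FS Lin","SS Log","FL1 Log","FL2 Log","FL3 Log","FL4 Log","FL5 Log"
273,0.545,0.219,0.210,0.181,0.163,0.144
793,0.649,0.457,0.377,0.344,0.1889,0.149
'''


def read_events(handle):
    """Return (header, events), where events is a list of rows of floats."""
    reader = csv.reader(handle)
    header = next(reader)  # the 7 column names
    events = [[float(value) for value in row] for row in reader if row]
    return header, events


header, events = read_events(io.StringIO(SAMPLE))
print(header)
print(len(events), "events; first event:", events[0])

# With a real file one would pass an open file instead, for example:
#     with open("0001.CSV", newline="") as handle:
#         header, events = read_events(handle)
```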

The naming of the files (0001.ext, 0002.ext, …, 2872.ext) can be used to identify the corresponding subject number and tube number using the following simple rules:

Subject_Number = Integer_Part[ (File_Number -1)/ 8 ] + 1
Tube_Number = File_Number – 8 * (Subject_Number-1)

For example, the file numbers corresponding to subject 17 are 129, 130, …, 136. Likewise, file number 1539 corresponds to subject 193. The association between file number, tube number and subject number is given explicitly in the file

  • DREAM6_AML_FilesTubesAndSubjects.CSV
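
The two naming rules above can be sketched in Python as follows. The function names are chosen for this illustration only; the arithmetic follows the rules as stated, with integer division playing the role of Integer_Part.

```python
def subject_and_tube(file_number: int) -> tuple[int, int]:
    """Map a file number (1..2872) to (subject_number, tube_number)."""
    if not 1 <= file_number <= 2872:
        raise ValueError("file number must be between 1 and 2872")
    subject = (file_number - 1) // 8 + 1   # Integer_Part[(File_Number - 1) / 8] + 1
    tube = file_number - 8 * (subject - 1)
    return subject, tube


def file_numbers_for_subject(subject: int) -> list[int]:
    """Return the 8 file numbers (tubes 1..8) belonging to one subject."""
    if not 1 <= subject <= 359:
        raise ValueError("subject number must be between 1 and 359")
    first = 8 * (subject - 1) + 1
    return list(range(first, first + 8))


print(file_numbers_for_subject(17))   # files 129 through 136
print(subject_and_tube(1539))         # subject 193
```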

As a training set, we give the state of health (AML or Normal) of subjects 1 through 179. These subjects comprise 23 of the 43 AML-affected subjects and 156 of the 316 healthy donors. This information is provided in the file:

  • DREAM6_AML_TrainingSet.CSV

Submission of Predictions and Write-up

Participants are required to submit a list of subjects, from subject 180 through subject 359, ranked by the confidence that each subject is affected by AML. The order should go from the most confident prediction that a subject is AML positive (first row) to the most confident prediction that a subject is normal (last row). Use a two-column, tab-separated format, as in:

S \tab XYZ

where S is one of the subjects (180 through 359) and XYZ is a score between 0 and 1 that indicates the confidence level you assign to the prediction. For example, XYZ = 1 if subject S is deemed to be AML positive with the highest confidence, and XYZ = 0 if S is deemed to be a normal donor with the highest confidence. Omitted subjects will be treated as appearing in random order at the end of the list. Save your submission as a text file, and name it:

  • DREAM6_AML_Predictions_TeamName.txt

where "TeamName" is the name of the team with which you registered for the challenge. Best performance will be assessed based on the accuracy of the results of this prediction.
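
A minimal sketch of producing the ranked, tab-separated prediction list is shown below. The scores are made-up values for three subjects, the function name format_submission is hypothetical, and the four-decimal formatting is an arbitrary choice; only the two-column layout and the ordering from most to least confident AML prediction come from the description above.

```python
def format_submission(scores: dict[int, float]) -> str:
    """Return the submission text: one 'subject<TAB>score' row per subject,
    ordered from the highest AML confidence to the lowest."""
    for subject, score in scores.items():
        if not 180 <= subject <= 359:
            raise ValueError(f"subject {subject} is outside the test set (180..359)")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score {score} is outside [0, 1]")
    # Sort by descending score; ties are broken by subject number.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return "".join(f"{subject}\t{score:.4f}\n" for subject, score in ranked)


# Made-up scores for illustration only.
example_scores = {180: 0.12, 181: 0.97, 182: 0.50}
submission_text = format_submission(example_scores)
print(submission_text, end="")

# To save the file, replacing TeamName with the registered team name:
#     with open("DREAM6_AML_Predictions_TeamName.txt", "w") as handle:
#         handle.write(submission_text)
```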

Finally, we request that each participating team submit a short write-up containing the documentation necessary to run their algorithm and explaining the methods used to arrive at their predictions. This write-up can contain pseudocode describing the algorithm used, workflows for analyzing the data, etc. Name the write-up

  • DREAM6_AML_Writeup_TeamName.ext

replacing "TeamName" with the name of your team and the file extension (ext) with your choice of doc or docx. The submission of the write-up is mandatory for participation in the main challenge.

 

Scoring Metrics

Results will be scored using the area under the precision versus recall (PR) curve. Precision is the fraction of subjects predicted to be AML positive that are truly AML positive, and recall is the fraction of all AML-positive subjects that are correctly predicted as positive. Other metrics, such as the area under the receiver operating characteristic (ROC) curve, will also be evaluated. Teams will be ranked according to their overall performance based on the area under the PR and ROC curves. We will evaluate these predictions as discussed in [6].
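
To make the two metrics concrete, the following sketch computes them for a ranked list of true labels. It is not the official scoring code: the PR area is approximated here by average precision (the mean of the precision values at each true positive), the ROC area uses the trapezoidal rule, ties between scores are ignored, and the exact procedure of [6] may differ. The function name and the example ranking are assumptions made for illustration.

```python
def pr_and_roc_area(ranked_labels: list[bool]) -> tuple[float, float]:
    """ranked_labels[i] is True if the subject at rank i (most confident
    AML prediction first) is truly AML positive.

    Returns (average_precision, roc_area).
    """
    positives = sum(ranked_labels)
    negatives = len(ranked_labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative subject")

    tp = fp = 0
    precision_sum = 0.0
    roc_points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    for is_positive in ranked_labels:
        if is_positive:
            tp += 1
            precision_sum += tp / (tp + fp)  # precision at this recall level
        else:
            fp += 1
        roc_points.append((fp / negatives, tp / positives))

    average_precision = precision_sum / positives
    roc_area = sum((x1 - x0) * (y0 + y1) / 2.0
                   for (x0, y0), (x1, y1) in zip(roc_points, roc_points[1:]))
    return average_precision, roc_area


# Illustrative ranking: AML, normal, AML, normal.
ap, roc = pr_and_roc_area([True, False, True, False])
print(round(ap, 4), round(roc, 4))
```

Because only 20 of the 180 test subjects are AML positive, the PR-based measure is more sensitive than the ROC area to how the positives are placed near the top of the list.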

 

Credits

The challenge was conceived by the FlowCAP committee, and curated by Ryan Brinkman and Gustavo Stolovitzky along with Nima Aghaeepour, Julio Saez-Rodriguez and Pablo Meyer.

References

[1] Querec TD et al (2009). “Systems biology approach predicts immunogenicity of the yellow fever vaccine in humans”, Nature Immunology, 10, 116 – 125.

[2] Bendall SC et al (2011). “Single-Cell Mass Cytometry of Differential Immune and Drug Responses Across a Human Hematopoietic Continuum”, Science, 332, 687-96.

[3] Borowitz MJ, Guenther KL, Shults KE, Stelzer GT (1993). “Immunophenotyping of acute leukemia by flow cytometric analysis. Use of CD45 and right-angle light scatter to gate on leukemic blasts in three-color analysis”, Am. J. Clin. Pathol., 100, 534-540.

[4] Spidlen J, Shooshtari P, Kollmann TR, Brinkman RR (2011). “Flow cytometry data standards”, BMC Res Notes, 4:50.

[5] Parks DR, Roeder M, Moore WA (2006). “A new ‘Logicle’ display method avoids deceptive effects of logarithmic scaling for low signals and compensated data”, 69(6):541-51.

[6] Stolovitzky G, Prill RJ, Califano A (2009). “Lessons from the DREAM2 Challenges”, in Stolovitzky G, Kahlem P, Califano A, Eds, Annals of the New York Academy of Sciences, 1158:159-95.

 

Download

The link below provides the DREAM6_AML_FilesTubesAndSubjects.CSV and DREAM6_AML_TrainingSet.CSV files, and a file indicating the website from which the CSV.zip and FCS.zip files can be downloaded via FTP. Please note that these two archives are about 10 GB each, so allot sufficient storage and time for the download.

Download file (for login user):