Challenge 4: Pearson correlation
We are wondering why you chose Pearson correlation as an extra measure for challenge 4.
We understand that you want to differentiate between the perfect classification results of the top 8 performers.
But from a participant’s perspective, including an additional, previously unmentioned measure and leaving out two of the documented ones (area under PR and ROC curve) is irritating.
Additionally, Pearson correlation is not a good measure from our point of view, because
- it does not measure class separation
- it does not account for classification confidence (meaning proximity to 0/1 of predictions)
Let us give a short, illustrative example:
Assume the correct class labels would be
c=(1,1,1,0,0,0,0)
We have one group (g1) who submitted the following prediction
g_1 = (0.511,0.512,0.511,0.491,0.490,0.489,0.490)
and another group (g2) who submitted the following prediction
g_2 = (0.9,1,0.8,0,0.1,0.3,0.2).
In this example, both predictions yield a perfect classification. However, we would clearly favor the prediction of g2 over that of g1, because the separation of the two classes is better and g2 also seems to be more confident in its prediction.
Pearson correlation, however, is 0.966 for g2 and 0.998 for g1 and hence ranks the two the other way round.
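For readers who want to reproduce these two numbers, here is a minimal sketch in plain Python. The helper name pearson and the variable names are ours; the data are exactly the vectors c, g_1 and g_2 given above.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Class labels and the two predictions from the example above
c = [1, 1, 1, 0, 0, 0, 0]
g1 = [0.511, 0.512, 0.511, 0.491, 0.490, 0.489, 0.490]
g2 = [0.9, 1.0, 0.8, 0.0, 0.1, 0.3, 0.2]

r_g1 = pearson(g1, c)
r_g2 = pearson(g2, c)
print(f"g1: {r_g1:.3f}")  # g1: 0.998
print(f"g2: {r_g2:.3f}")  # g2: 0.966
```

The barely separated prediction g1 indeed gets the higher correlation.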
Spontaneously, we could think of at least three measures that are better suited for ranking perfect classification results:
1. the sum of the absolute deviations from the class labels: \sum_j | g_i[j] - c[j] |
2. the difference between the minimum value for the positives and the maximum value for the negatives, measuring class separation: \min_{j : c[j] = 1} g_i[j] - \max_{j : c[j] = 0} g_i[j]
3. the probability of a correct decision for all patients, i.e., the product of the predicted probabilities of the positives multiplied by the product of 1 minus the predicted probabilities of the negatives: \prod_{j : c[j] = 1} g_i[j] \times \prod_{j : c[j] = 0} (1 - g_i[j])
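The three measures can be sketched in a few lines of Python. The function names are ours and purely illustrative; the sketch assumes labels in {0, 1} and predictions in [0, 1], and uses the example vectors from above.

```python
import math

def abs_deviation(g, c):
    """Measure 1: sum of absolute deviations from the labels (lower is better)."""
    return sum(abs(p - y) for p, y in zip(g, c))

def separation_margin(g, c):
    """Measure 2: smallest positive score minus largest negative score (higher is better)."""
    pos = [p for p, y in zip(g, c) if y == 1]
    neg = [p for p, y in zip(g, c) if y == 0]
    return min(pos) - max(neg)

def joint_probability(g, c):
    """Measure 3: probability that all decisions are correct (higher is better)."""
    return math.prod(p if y == 1 else 1.0 - p for p, y in zip(g, c))

c = [1, 1, 1, 0, 0, 0, 0]
g1 = [0.511, 0.512, 0.511, 0.491, 0.490, 0.489, 0.490]
g2 = [0.9, 1.0, 0.8, 0.0, 0.1, 0.3, 0.2]

for name, g in (("g1", g1), ("g2", g2)):
    print(name,
          round(abs_deviation(g, c), 3),
          round(separation_margin(g, c), 3),
          round(joint_probability(g, c), 5))
```

On this example all three measures prefer g2: the summed deviation is 3.426 for g1 against 0.9 for g2, the margin is 0.02 against 0.5, and the joint probability is roughly 0.009 against 0.363.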
Although we understand if you do not want to modify the ranking after publication, we would be happy if you would consider including (some of) these measures in the table, so that others can choose their favorite measure for ranking the perfect classification results.
Comments
Disagreements?
Dear Gustavo et al.
Thanks a lot for keeping the discussion alive and for evaluating the alternative measures! Since you explicitly welcome disagreements, here's mine :-) :
In my humble opinion, the question is not what measure one should use to compare the numerical scores of different methods. The question is: does it make sense at all? In general, it might indeed. But in the challenge, it was clear that the numerical scores should not play a role (the AUC of ROC and PR only depend on the order of patients!).
As I wrote before, in our method the score does somehow reflect certainty, i.e. distance from a decision boundary. But by no means can the scores be interpreted directly. Any monotonic transformation changes the scores drastically, but leaves the classifier unaltered. In the same sense, a score of 0.5 can mean something completely different in two different methods. A direct comparison is, essentially, numerology, unless it is clear from the start that such a comparison will be performed. In that case, all teams would have had the chance to come up with proper probabilistic assignments.
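To make the point about monotonic transformations concrete, here is a small sketch. It is not taken from any team's submission: the scores are the g_2 vector from the first posting, the fourth power is an arbitrary strictly increasing map on [0, 1] chosen for illustration, and the helpers roc_auc and pearson are hypothetical names. Only the ROC area is computed; the PR area depends on the order of patients in the same way.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def roc_auc(scores, labels):
    """ROC area as the fraction of (positive, negative) pairs that are
    ranked correctly; ties count as half a correct pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 1.0, 0.8, 0.0, 0.1, 0.3, 0.2]
# Arbitrary strictly increasing transformation (an assumption for illustration)
warped = [s ** 4 for s in scores]

auc_before, auc_after = roc_auc(scores, labels), roc_auc(warped, labels)
r_before, r_after = pearson(scores, labels), pearson(warped, labels)
print(auc_before, auc_after)                 # the order is untouched
print(round(r_before, 3), round(r_after, 3))  # the correlation is not
```

Both AUC values are 1, while the Pearson correlation drops from about 0.966 to roughly 0.91, although the classifier itself has not changed.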
I still think that all 8 teams that submitted a perfect classifier should be listed as rank 1 without further "sub-ranking".
I will be happy to further discuss this in Barcelona. Who knows, maybe I can be convinced to change my mind (but I doubt it :-) )
Best regards and thanks again!
Pearson correlation
First of all let me thank you for participating in this challenge. This year we had 17 teams participating in this challenge. Everyone seems to have achieved excellent scores, and we hope to discuss the reason for this in the upcoming DREAM conference.
If we take the results at face value, there were 8 best performers who achieved a perfect AUC of 1 both in precision-recall and ROC. Because we needed to invite only two participants to present at the conference, we decided to create an ad-hoc score given by the Pearson correlation between the confidence levels you provided in your submission and the "perfect" confidence level (1 for AML patients and 0 for healthy donors). Some of you complained that this measure hadn't been announced earlier and that it was somewhat arbitrary. It seemed to us that if participants added their confidence measure in good faith, then the Pearson correlation measure is meaningful. I hope nonetheless that we all learned something from the exercise, and congratulations to the 8 best performers.
We thank the organizers for
We thank the organizers for taking part in this discussion. However, we do not agree that Pearson correlation is meaningful, even if the participants submitted "their confidence measure in good faith". As shown in our first posting, Pearson correlation does not reward confidence or proximity to the "perfect" 0/1 levels. And as Admire-LVQ pointed out, Pearson correlation could easily be tweaked without changing classification results. May we kindly ask the organizers to comment on these two points?
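A short sketch of the first point: Pearson correlation is invariant under any positive affine rescaling of the scores, so a prediction squeezed into a narrow band around 0.5 scores exactly as well as the original. The squeezing factor 0.01 and the helper name pearson are our own choices for illustration; the scores are the g_2 vector from our first posting.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

c = [1, 1, 1, 0, 0, 0, 0]
g2 = [0.9, 1.0, 0.8, 0.0, 0.1, 0.3, 0.2]

# Squeeze all scores into [0.495, 0.505]: far from the "perfect" 0/1 levels,
# but only a positive affine map of the original scores.
squeezed = [0.5 + 0.01 * (s - 0.5) for s in g2]

r_original = pearson(g2, c)
r_squeezed = pearson(squeezed, c)
print(round(r_original, 6), round(r_squeezed, 6))  # same value up to rounding
```

The squeezed scores are nowhere near 0 or 1, yet the correlation with the labels is unchanged, so the measure cannot be rewarding proximity to the 0/1 levels.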
Pearson V
I'd like to thank JKJG for this remark. A clarification: according to the rules, the areas under the ROC and PR curves were supposed to be used in the ranking. These do not depend on the absolute "certainties", but only on the order of patients according to the scores. That is why we did not put any effort into transforming, optimizing, or even interpreting them. Our scores do measure certainty somehow, but they cannot be interpreted as a probabilistic assignment directly.
In fact, we stopped training when a stable order of patients was found. Further training would have moved scores closer to 0/1, thus increasing the Pearson correlation with the crisp labels.
PS: this was meant as a short reply to "Pearson correlation IV", my apologies for misplacing.
Pearson correlation II
We would like to thank the organizers for providing the interesting AML data set, because the analysis task certainly helped in creating trust and advancing the methods applied.
At the same time, a transparent choice of ranking criterion from the very beginning is encouraged in upcoming challenges, because this is crucial for the communication about the generally very stimulating DREAM activities. When we write a research report on our DREAM participation, we need to justify why we received rank X-4, knowing that we found exactly the same AML candidates as team X. So the current decision taken by the DREAM organizers to use the debatable Pearson correlation criterion for post-hoc ranking is passed on to the next round of researchers, who might start wondering about the validity of DREAM challenges in general. In addition, those 7 perfect, yet downrated, teams might face slight frustration, while even the best team might not feel complete satisfaction because of the Pearson correlation compromise.
There is a biologically more important aspect about the classification task. Why are the received scorings different at all? Either because the applied method is imperfect or because regression would be more natural for handling the data rather than classification. Why should organisms constituted by complex chemical regulations, different developmental stages and variable environmental conditions algorithmically map to crystal clear states of AML and normal? What about similarity between different subtypes of AML or close relationships to other kinds of cancer diseases possibly reflected in loosely specific marker bindings? Intuitively, this variability should be expected, which would hence reverse the argument of binary assessment: Methods with clearly separated scores are excellent classifiers, but they might not properly respect the biology which is driving the data.
After all, a pronounced scoring ordering offers nice insights into potentially interesting patients even apart from their AML and normal cell tissue. DREAM has now collected nice scoring lists of 8 perfect classifiers from different domains of machine learning, and it would be absolutely fascinating to have a closer look at the identified boundary candidates of AML and normal patients. While providing more information about their health status, the consistency of orderings about the threshold might be used for the definition of upcoming evaluation criteria.
Feeling that the organizers have generally done great work in setting up the challenges, we are looking forward to the next round of DREAM-related events!
Pearson Correlation III
Let us first join "biolobe" and "JKJG" in thanking the organizers for the great job in setting up this highly interesting challenge - we have learnt a lot from it!
In my humble opinion, however, the additional ranking of the rank-1 results is both unnecessary and unjustified, for several reasons. The obvious one is, of course, that rules should not be changed in retrospect. From the announcements it was absolutely clear that the actual scores would not play a role in the challenge.
Furthermore, scores as produced by our method cannot be interpreted as "certainties" directly. I assume this applies to other contributions as well. An arbitrary nonlinear (monotonic) transformation of our scores would not have changed the ordering of patients and the quality of the classifier at all, but would have affected the "Pearson ranking", clearly.
Obviously, a score of, say, "0.5" in method X could mean something completely different from the same value in method Y. Only if scores are explicitly meant to be probabilistic assignments is it fair to compare numerical values between different solutions. But even then, there is no single best performance measure, as outlined by "JKJG" in the first comment of this thread.
In my opinion, the top 8 teams submitted solutions of equal quality. A perfect classifier (with AUC=1 in PR and ROC, for instance) is a perfect classifier, after all. A clarifying remark in the team ranking would be only fair.
Once more, thanks for a great challenge! We are already looking forward to the DREAM7 challenges in 2012!
Pearson correlation IV
First of all, we would like to thank biolobe and Admire-LVQ for joining the discussion on Pearson correlation as an additional performance measure. And we are certainly grateful to the DREAM organizers for putting so much effort into creating new and interesting challenges every year.
There is one point where we have a slightly different opinion than Admire-LVQ: It was stated in the original description of the challenge that the submitted scores should indicate confidence in the prediction. So judging submissions in terms of certainty seems fair to us.
However, Pearson correlation remains a sub-optimal measure for this purpose, and the remark of Admire-LVQ that a simple monotonic transformation of scores would have increased correlation holds for our method as well. And we agree that a perfect classifier is a perfect classifier is a perfect classifier.
It would be great if the organizers could join the discussion and explain the considerations that led to choosing Pearson correlation. Additionally, we would like to ask if the inclusion of additional scores in the table is something the organizers would take into consideration.
Unfortunately, we cannot participate in the DREAM conference this year, but we wish the organizers and all participants interesting and fruitful discussions!
Dear teams We appreciate the
Thank you
Dear Gustavo, thanks a lot for computing the additional measures. The differences in ranking between the different measures are indeed much smaller than we had expected - which is a good thing. We are happy to see for ourselves that the ranking has not been affected much by the - as you said, arbitrary - choice of Pearson correlation. Thanks also for pointing out the equivalence of measures 1 and 3 if predictions are "close to optimal". We are looking forward to DREAM7 next year!
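One way to see the equivalence of measures 1 and 3 for near-optimal predictions (this is our own short derivation, not necessarily the argument Gustavo gave; the symbol d_j is introduced here for convenience and assumes predictions in [0, 1]):

```latex
% d_j: absolute deviation of prediction j from its class label (our notation)
d_j = | g_i[j] - c[j] |, \qquad 0 \le d_j \le 1

% Measure 3 rewritten in terms of d_j:
% for positives g_i[j] = 1 - d_j, for negatives 1 - g_i[j] = 1 - d_j
\prod_{j : c[j] = 1} g_i[j] \times \prod_{j : c[j] = 0} (1 - g_i[j]) = \prod_j (1 - d_j)

% First-order expansion of the logarithm for small d_j
\log \prod_j (1 - d_j) = \sum_j \log(1 - d_j) \approx - \sum_j d_j
```

So when all deviations are small, a larger value of measure 3 corresponds to a smaller value of measure 1, and the two measures rank such predictions in approximately the same order.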