Friday, Oct 11, 2013

Cook’s consensus: standing on its last legs

This is a guest post by Shub Niggurath.

A bird reserve hires a fresh enthusiast and sets him to work on a census. The amateur knows there are three kinds of birds in the park, and he accompanies an experienced watcher. The watcher counts 6 magpies, 4 ravens and 2 starlings; the new hire gets 6 magpies, 3 ravens and 3 starlings. Great job, right?

No, and here's why. The new hire was poor at identification: he misidentified most of the birds, and his totals matched the expert's only by chance.

Looking only at the aggregates, one could be fooled into thinking the agreement between the birders is an impressive 92%. In truth, the bird-by-bird match is abysmal: 25%. This cannot be seen unless the raw data are examined.
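To make the arithmetic concrete, here is a minimal sketch in Python. The bird-by-bird allocation below is hypothetical (the example above only fixes the totals and the number of matches), but it is consistent with them: expert 6/4/2, new hire 6/3/3, and only 3 of the 12 birds matching.

```python
# Hypothetical confusion matrix consistent with the bird example.
# Rows: expert's identifications; columns: new hire's identifications.
# Categories, in order: magpie, raven, starling.
confusion = [
    [2, 2, 2],  # the 6 birds the expert called magpies
    [2, 1, 1],  # the 4 birds the expert called ravens
    [2, 0, 0],  # the 2 birds the expert called starlings
]

n = sum(sum(row) for row in confusion)                   # 12 birds
expert_totals = [sum(row) for row in confusion]          # [6, 4, 2]
newcomer_totals = [sum(col) for col in zip(*confusion)]  # [6, 3, 3]

# "Aggregate" agreement compares only the per-category totals.
aggregate = sum(min(e, c) for e, c in zip(expert_totals, newcomer_totals)) / n

# Per-observation agreement asks how many individual birds actually match.
per_bird = sum(confusion[i][i] for i in range(3)) / n

print(f"aggregate agreement: {aggregate:.0%}")  # 92%
print(f"per-bird agreement:  {per_bird:.0%}")   # 25%
```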

Now suppose that instead of three kinds of birds there were seven, and a thousand birds instead of twelve. This is exactly the situation with the Cook consensus paper.

The Cook paper attempts validation by comparing its own ratings with ratings from the papers' authors (see Table 4 in the paper). In characteristic fashion, Cook's group reports only that the authors found the same 97% that they did. But this agreement is solely in the totals – an entirely meaningless figure.

Turn back to the bird example. The new person is wrong often enough (9 of 12 instances) that one cannot be sure even his matches with the expert (3 of 12) aren't by chance. You can identify every bird wrongly and still match the expert's totals 100%. The per-observation concordance rate is what determines validity.

The implications of such error, i.e. of inter-observer agreement and reliability, can be quantified with the kappa statistic [1]. In the Cook group's data, kappa is 0.08 (p << 0.05). The Cook rating method is essentially completely unreliable: the paper authors' ratings matched Cook's for only 38% of abstracts. A kappa score of 0.8 is considered 'excellent'; a score below 0.2 indicates worthless output.
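For readers who want to check such numbers themselves, here is a minimal sketch of the unweighted (Cohen's) kappa calculation from a contingency table, assuming that is the variant used; the post does not state it explicitly. Applied to the hypothetical bird allocation above it gives a kappa of about -0.2, i.e. worse than chance. Reproducing the 0.08 figure would require the full author-versus-abstract table, which is not given here.

```python
def cohens_kappa(table):
    """Unweighted Cohen's kappa from a square contingency table
    (rows: rater 1, columns: rater 2)."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    p_observed = sum(table[i][i] for i in range(len(table))) / n
    p_chance = sum(r * c for r, c in zip(rows, cols)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# The hypothetical bird allocation from the sketch above.
print(cohens_kappa([[2, 2, 2], [2, 1, 1], [2, 0, 0]]))  # ≈ -0.2
```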

Faced with sustained questions about his paper, Cook has increasingly fallen back on the claim that his findings are validated by the author ratings (see here, for example). The reviewers of Richard Tol's second submission to Environmental Research Letters adopt the same line:

This paper does not mention or discuss the author self-ratings presented in the Cook et al paper whatsoever. These self-ratings, in fact, are among the strongest set of data presented in the paper and almost exactly mirror the reported ratings from the Cook author team.

The Cook authors do indeed present self-ratings by the papers' authors, and arrive at a 97.2% consensus figure from them.

In reality, the author ratings are the weakest link: they invalidate the conclusions of the paper. It is evident the reviewers have not looked at the data themselves; if they had, they would have seen through the trickery employed.

[1] Sim J, Wright C. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Physical Therapy 2005; 85(3): 257-268. A well-cited review that provides a good general description.


Reader Comments (137)

shub, I'm not sure what you intended to say with your last comment. You didn't respond to much of what I've posted, and what you did respond with is almost in line with what I've been saying all along. Just make two changes. First, instead of (or in addition to) saying the measurement error causes the "values [to] swing across the cutoffs," say the actual values do so but the measurement error causes some of those swings to not be picked up. Second, replace the words "'high', 'medium' and 'low'" with "'endorse', 'neutral' and 'reject'." Do that, and you're saying the same thing I said in my earlier example.

That is, two data sets with perfect accuracy can have low kappa scores if the noise in one data set is not the same as the noise in the other. That makes the test in this blog post uninformative at best. That makes the conclusions you drew from the test in this blog post unfounded.

Oct 16, 2013 at 4:35 PM | Unregistered Commenter Brandon Shollenberger
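To illustrate the claim in the comment above, that two unbiased raters can still produce a low kappa when one is noisier than the other, here is a rough simulation. The latent 'endorsement strength' scale, the cutoffs and the noise levels are all assumptions of mine, not anything taken from the thread or from Cook et al.

```python
import random

def categorize(x):
    # Three bins standing in for 'reject', 'neutral', 'endorse'.
    return 0 if x < -0.5 else (2 if x > 0.5 else 1)

def kappa(a, b, k=3):
    n = len(a)
    table = [[0] * k for _ in range(k)]
    for i, j in zip(a, b):
        table[i][j] += 1
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    po = sum(table[i][i] for i in range(k)) / n
    pe = sum(r * c for r, c in zip(rows, cols)) / n ** 2
    return (po - pe) / (1 - pe)

random.seed(0)
truth = [random.uniform(-1, 1) for _ in range(2000)]
rater1 = [categorize(t + random.gauss(0, 0.1)) for t in truth]  # low noise
rater2 = [categorize(t + random.gauss(0, 0.6)) for t in truth]  # high noise

# Both raters are unbiased, yet kappa comes out well short of "excellent"
# because values near the cutoffs get binned differently.
print(kappa(rater1, rater2))
```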

Brandon
This has been a waste of time. The Cook scores are nominal/categorical data. You've been thinking about the problem in terms of continuous variables. You need to pause and think about the issue.

Oct 16, 2013 at 5:55 PM | Registered Commenter shub

Brandon:

Yet I specifically said kappa would show disagreement regardless of the data's accuracy. Why would you assume I don't understand what kappa shows when I described exactly what you describe?

Because you wrote this:

A paper could explicitly endorse a position within its body while only implicitly endorsing that position within its abstract. In that case, an accurate self-rating would necessarily "disagree" with an accurate SkS rating. That's two examples of ways in which the kappa score would indicate disagreement whether or not the ratings were accurate.

It's the phrase "whether or not the ratings were accurate" that suggests confusion.

Whether or not the rankings are accurate isn't the issue, as it's not what kappa is testing. If the two data sets are both accurate, but really don't agree with each other, you appropriately won't get high values of kappa.

The chance of kappa "failing to reject" is low, which is a redeeming feature of this metric. The test used in Cook's paper, on the other hand, could still fail to reject, because it's a statistically weak test.

No. If there is an a priori reason to know the data has differences that will affect the kappa score, such as differing levels of noise, shub's test is meaningless

It would be a meaningless test only if there were no information regardless of the value of kappa. Since a high value of kappa would still validate the authors' conclusions, even in this case, this claim is still wrong. All these errors do is reduce the chance of agreement, which makes "failure to confirm" even less informative.

I agree with your criticism of shub's interpretation of kappa (if this weren't obvious already). He cannot or should not use this statistic to invalidate the conclusions of Cook's paper. On the other hand, had kappa been high, this would have been an important validation of the conclusions.

Since you seem to want to /poke him in the eye, I'd ask why he included the p value for kappa. That actually is meaningless as used here.

You say:

I decided to do a quick demonstration of why I say the test used in this post was meaningless

And I say 'this is an example of motivated reasoning".

The way I put it to students is, "statistics is meant to be informative and useful". It is always possible to apply statistical methods in a way that is uninformative or not useful, but that's why it's as much an art as it is a science. You would be better off thinking of examples where kappa is a useful metric, then try and see whether this scenario is inconsistent with that. I think the proper way to look at it is just to say "meh... it fails to confirm, we've not learned anything from this test."

You bring up thermometers: Regarding the use of kappa in objective measurements --- it is done. In this case you can model P(e), so the issues with kappa being "overly conservative" don't apply. Type II errors are much less frequent when you can more accurately model the noise (and the p values are even useful in this case).

Oct 16, 2013 at 6:17 PM | Unregistered Commenter Carrick

Brandon:

The same is true for the Cook et al data. A low kappa score for it doesn't invalidate anything about the paper. A low kappa score is expected because the rating sets measured different things. This is obvious as the authors even highlighted a difference which would lead to a low kappa score.

Another way of reading Cook's warnings is "this is really a terrible way to test the external validity of my data set, but we're going to use this really low power statistical test that has a high chance of failure to reject." My guess is this is the only test that was included in the paper, because this weak test was the only one he could find that failed to reject.

The failure to agree is a real issue for Cook's paper, and I'd be surprised if any more meaningful tests of agreement than Cook's craptastic test would do better.

Oct 16, 2013 at 6:28 PM | Unregistered Commenter Carrick

"He cannot or should not use this statistic to invalidate the conclusions of Cook's paper."

The poor correlation between Cook's ratings and authors' ratings, accepting that author ratings are the gold standard as per Cook, invalidates the paper. ("Nobody is more qualified to judge a paper's intent than the actual scientists who authored the paper" - Cook's own description). You can argue about kappa and its use - and as I noted before, the AGW crowd is generally uncomfortable with this test.

I use kappa statistics in my field and I am familiar with it. If we went about talking about kappa in this manner, several key papers and classification systems would have never appeared.

Oct 16, 2013 at 6:32 PM | Registered Commenter shub

shub:

The poor correlation between Cook's ratings and authors' ratings, accepting that author ratings are the gold standard as per Cook, invalidates the paper

Correlation is the wrong choice of words here, isn't it? You haven't looked at correlation, or at least if you have, you've not reported on it here.

Any measure that reliably demonstrated a lack of agreement would invalidate conclusions based on that comparison, not the paper.

Both ratings could be accurate, but measuring different things, in which case you should get low values of kappa. That doesn't invalidate the paper. If you could say that the authors' ratings were accurate and that the paper's ratings were meant to measure the same thing, then failure to reliably confirm agreement should be seen as invalidation of the data.

Two problems with this: first, you don't know how accurate the authors' ratings are (it's not a gold standard, regardless of anything Cook might say); and secondly, the test you are using for agreement doesn't reliably test for lack of agreement.


If you are working in an error where you can separately assess the reliability of kappa and quality control the categorical data, so you can say "apples to apples" each time, I agree it is (more) useful over a wide range of kappas. AFAIK, low values of kappa usually signal the need to repeat a test, and aren't used diagnostically other than that.

Without a more definitive example from you, it's not possible to respond further.

Oct 16, 2013 at 7:16 PM | Unregistered Commenter Carrick

* If you are working in an area

Oct 16, 2013 at 7:19 PM | Unregistered Commenter Carrick

There seems to be a three way split in informed opinion on using kappa here:

Shub says:

The poor correlation between Cook's ratings and authors' ratings, accepting that author ratings are the gold standard as per Cook, invalidates the paper.

Carrick says:

The way I would put it is, if you get strong agreement, the kappa statistic is informative. Low values of kappa can occur even when there is agreement between the two data sets, so low values of kappa are non-informative.

Paraphrase Brandon:

Don't even bother with kappa.

Hey, but Shub, this is the point where the debate has to take a breather, surely? "...accepting that author ratings are the gold standard as per Cook"

Surely it is clear that the "Gold standard" for Cook and his reviewers just means showing that the 97% number flops out the other end of both abstract and author ratings? End of.


Their criteria are pathetic and the referees have no standards either. That is all.

I think maybe we could agree that Cook et al delude themselves that they have a "Gold Standard". But surely the fact is that they (and the referees, as Carrick reminds us) really only have a "turd standard", i.e. they all think we only need to show that the 97% among those expressing a preference is roughly similar in both groups. This is the most interesting thing psychologically about the paper and its "peer review".

You guys have a scientific bone to chew on, but to my mind you miss the clear psychological and spin distortions.

I don't see how you all can shame them into being more interested in their "results".

Spin merchants never are.


BTW it seems to me now that if the kappa between abstract and author ratings was >0.8 it would be spooky. Not informative ;)

Oct 16, 2013 at 7:27 PM | Registered Commenter The Leopard In The Basement

"correlation" is not the word, I agree. "Lack of agreement" would be better.

My statement about invalidating the paper is based on the key (unstated?) assumption of Cook's group that abstracts can be analysed to assess this type of information (level of AGW consensus). In order for this to be true, the information has to be present in the abstract and it has to be the same as in the paper.

It is Cook's contention that author ratings are the ultimate, not mine. I agree with Tol. My own opinion is that author ratings are worthless data and should be thrown away.

I agree that there is no true gold standard. But that doesn't negate the invalidation claim, because the degree of agreement is so poor that it brings the classification system into question.

Brandon failed to grasp the pattern in the author-response discrepancy. About 40% of the abstracts Cook rated '2' ('explicit endorse') (n=218) are given author ratings that carry lesser, or even worse, opposite information (i.e., >2). If papers can explicitly 'endorse the consensus' in the abstract and become sceptic papers when the whole paper is read, the rating system is broken.

Oct 16, 2013 at 7:43 PM | Registered Commenter shub

shub:

In order for this to be true, the information has to be present in the abstract and it has to be the same as in the paper.

While this is true, you still haven't addressed the question of the accuracy of the authors' rankings of papers. It could be that Cook's rankings are more accurate than authors' rankings based on memory, for example. It could also be the case that the agreement between the authors' and Cook's rankings is stronger than one would infer from the kappa test, due to the inherently overly conservative nature of the kappa test.

You also haven't addressed the generally acknowledged issues with the overly conservative nature of the kappa statistic. I think a better approach would be to use multiple measures here. Maybe visit this site for some ideas.

A better test of the adequacy of the method of abstract ranking, IMO, would have been to use the same rankers for both full papers and abstracts, even if only a small portion of the full data set were ranked using full papers.

Oct 16, 2013 at 8:50 PM | Unregistered Commenter Carrick

shub, please don't put words in my mouth or thoughts in my mind:

Brandon
This has been a waste of time. The Cook scores are nominal/categorical data. You've been thinking about the problem in terms of continuous variables. You need to pause and think about the issue.

I have not been "thinking about this problem in terms of continuous variables." The closest I've come to even discussing such variables is when I used an example that had ~23 possible values (as everything was rounded to integers). There was no relevant difference between that and categorical data.

If you believe I've thought or said something, quote what I've said which indicates such. Otherwise, it appears you're just going off faulty telepathy.

Oct 16, 2013 at 9:26 PM | Unregistered Commenter Brandon Shollenberger

Seriously shub, you really ought to stop this:

Brandon failed to grasp the pattern in the author-response discrepancy. About 40% of the abstracts Cook rated '2' ('explicit endorse') (n=218) are given author ratings that carry lesser, or even worse, opposite information (i.e., >2). If papers can explicitly 'endorse the consensus' in the abstract and become sceptic papers when the whole paper is read, the rating system is broken.

I have not "failed to grasp" anything of the sort. I said I'm not interested in discussing the issue. That doesn't mean I'm unaware of it. That doesn't mean I disagree about it. All it means is I'm not interested in discussing it because it's a red herring.

It's tedious to have someone insist I'm wrong based upon me supposedly thinking or saying things I've never thought or said.

Oct 16, 2013 at 9:38 PM | Unregistered Commenter Brandon Shollenberger

The Leopard In The Basement, your paraphrase of my remarks misses a key caveat. I actually agree with Carrick. The difference between Carrick and me is that I acknowledge the fact that we knew in advance there would not be a high kappa score. Cook et al specifically pointed out differences in the data sets that would ensure such.

I don't think it's a bad test. I don't think the test is, in and of itself, meaningless. I only think the test is meaningless when we know what the results will be in advance. We don't need a test to tell us the data sets disagree when the authors said so in their paper.

Carrick:

It's the phrase "whether or not the ratings were accurate" that suggests confusion.

How do you figure that phrase indicates I think kappa scores measure accuracy? I gave examples of how the results of a kappa test were independent of the data's accuracy. I cannot see how discussing the independence of kappa values and accuracy could imply I think kappa values measure accuracy.

Whether or not the rankings are accurate isn't the issue, as it's not what kappa is testing.

Which was the point I was making when I said the kappa score would be low "whether or not the ratings were accurate." I was pointing out accuracy was not tied to the kappa score, the same thing you're saying here.

It would be a meaningless test only if there were no information regardless of the value of kappa. Since a high value of kappa would still validate the authors' conclusions, even in this case, this claim is still wrong. All these errors do is reduce the chance of agreement, which makes "failure to confirm" even less informative.

The authors stated a level of disagreement between their data sets that would ensure a low kappa score. One can show, mathematically, the information they provided made it impossible for a high kappa score to be found. If a "high value of kappa" could never be found, your hypothetical where one validates the authors' conclusions is impossible.

Since you seem to want to /poke him in the eye, I'd ask why he included the p value for kappa. That actually is meaningless as used here.

For what it's worth, I actually don't want to /poke him in the eye. I'm trying to just get this one point resolved. There are dozens of remarks I've wanted to make but decided not to because I'm trying to stay focused.

And I say 'this is an example of motivated reasoning".

Why? The test I performed was based on the data sets shub used. I was showing, via simplified mathematics, the level of disagreement in the two data sets was high enough to ensure the kappa score he found could never validate the authors' results.

Again, the data sets used by Cook et al were known in advance to have so much disagreement a high kappa score could never be found. The authors published this information. A low kappa score was a foregone conclusion. It was impossible for this test to validate the authors' results.

I've made this point multiple times. If you think it's untrue, I'd be open to hearing an explanation. However, you can't simply ignore it.

Oct 16, 2013 at 10:25 PM | Unregistered Commenter Brandon Shollenberger

Brandon, sounds like we agree on what kappa is testing and what it isn't so let's move on, if you don't mind.

Regarding this:

The authors stated a level of disagreement between their data sets that would ensure a low kappa score. One can show, mathematically, the information they provided made it impossible for a high kappa score to be found

If it is impossible to demonstrate agreement, even in principle, even with otherwise ideal, errorless measurements, that would mean they are not statistically inter-comparable results. That's a big deal.

I think it is true that Cook thinks they are statistically inter-comparable, otherwise he wouldn't have made the comparison associated with his Table 4. (At least he *shouldn't have*.) But you can't have your cake and eat it too--if Cook, the editors, and the reviewers can assume statistical inter-comparability, then other people must be allowed to do the same, at least to test whether the conclusions of the paper really hold up under the assumptions necessary to allow the comparison.

So if we start with the assumption that the data sets are inter-comparable, we should be able to use the kappa statistic or other tests to analyze the level of agreement of the two data sets. When we do so, we arrive at the conclusion that kappa is too small to confirm agreement, independently, I think, of your consideration that the small value of kappa is non-informative. That's a much weaker conclusion than the one shub is trying to draw, of course.

For what it's worth, I actually don't want to /poke him in the eye

Not sure I entirely believe you on that. :-P But that's okay, I can live with that.

The only points I wanted to resolve were the methodological issues associated with using kappa to test agreement in this paper, and whether there are better, more robust ways of testing for non-agreement that people could all accept as valid.

What you are suggesting indicates that the paper is so fundamentally flawed that this is not even possible. That seems like a rather pessimistic view, but it may well be right.

Why? The test I performed was based on the data sets shub used

Because you are assuming the conclusion, of course.

I can always concoct statistical tests that are uninformative (like including unknowns in P(e) in your bird comparison example), regardless of the data sets involved. That doesn't imply the converse, namely that with better "art", I can't concoct tests that can be used. That's why I suggested considering the converse problem: How would *you* go about making a "better" test for agreement (making whatever minimal assumptions are needed to allow the comparison to be made)?

If you don't think agreement is testable, the paper is smoked as far as that is concerned.

Oct 17, 2013 at 12:12 AM | Unregistered Commenter Carrick

Brandon, while I'm at it, can you quantify this a bit:

The authors stated a level of disagreement between their data sets that would ensure a low kappa score. One can show, mathematically, the information they provided made it impossible for a high kappa score to be found

* Quantitatively, what is the maximum value, according to you, that is possible?
* Precisely which piece or pieces of information, in your opinion, is it that makes it "impossible for a high kappa score to be found"?

Oct 17, 2013 at 12:33 AM | Unregistered Commenter Carrick

If Cook et al were mistaken, where are the thousands of papers publishing evidence against AGW, produced by all those thousands of climate sceptic scientists?

Oct 17, 2013 at 12:34 AM | Unregistered Commenter entropic man

carrick
You might have understood Brandon's point but I don't think he gets yours. He is trying to have his cake and eat it too - he wants to oppose kappa by saying the two data sets are not directly inter-comparable, and that Cook has managed to find a way to compare these un-comparable data sets.

Look at his unknowns example. Say there are two ratings (say, yes/no) and the second observer gives ratings in half the cases but gives an 'unknown' for the other half. You put an 'unknown' column against the first observer, enter a value of zero and perform the kappa. Of course, the second rater will fare poorly and the kappa will be low. The low kappa is the right conclusion, as the system (raters + rating scheme) did not provide good inter-rater reproducibility. You can't say: "well, I was right wherever I did take a swing". If the rating system is good, it will allow different observers to classify successfully.

On the other hand, what is Brandon's response:

Suppose the new person was not totally inept. Suppose instead of misidentifying every bird, they simply failed to identify half of them. For that half, they wrote "Unknown." We'd get a kappa score indicating a lot of disagreement in this case even though the new person was right on every bird they took a guess at.

For the unweighted kappa, 'mis-identification' vs 'non-identification' doesn't matter. They are both wrong. "If you leave out all my wrong answers and the questions I didn't know the answers to, I scored really high" doesn't get you a pass grade in school. Look at his 3x3 contingency table - the point is evident there as well.

Oct 17, 2013 at 1:16 AM | Registered Commenter shub
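For concreteness, here is a small numerical version of the 'Unknown' scenario being argued over, reusing the twelve-bird example. The split of correct identifications (3 magpies, 2 ravens, 1 starling) is my own hypothetical; the point is only that the unanswered half drags the unweighted kappa down even though no named bird was wrong.

```python
# Rows: first rater (magpie, raven, starling, unknown);
# columns: second rater. The first rater never answers "unknown",
# but the table must be square to compute kappa.
table = [
    [3, 0, 0, 3],  # 6 magpies: 3 matched, 3 marked "unknown"
    [0, 2, 0, 2],  # 4 ravens: 2 matched, 2 marked "unknown"
    [0, 0, 1, 1],  # 2 starlings: 1 matched, 1 marked "unknown"
    [0, 0, 0, 0],
]
n = sum(map(sum, table))
po = sum(table[i][i] for i in range(4)) / n
pe = sum(sum(table[i]) * sum(row[i] for row in table) for i in range(4)) / n ** 2
print((po - pe) / (1 - pe))  # ≈ 0.38: low, despite no wrong guesses
```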

I can propose a test to examine Brandon's hypothesis. The Brandon hypothesis is that abstract and paper ratings can differ because the information contained in the two is different, the paper containing more information than the abstract. How does one test this?

Give one group of observers just the abstracts and let them rate each from +3 to -3. Give another group the papers corresponding to those abstracts and let them assign the same ratings. The abstract rating precedes the paper rating.

If ratings for abstracts with high information remain the same, and abstracts with low information get assigned high-information ratings at the paper stage, Brandon's hypothesis would be supported. If any other significantly different pattern is observed, the hypothesis would not hold.

The null would be: No significant change in rating proportion toward those with less information will be observed when rating papers as compared to abstracts.

Oct 17, 2013 at 1:46 AM | Registered Commenter shub
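One way a test like the one proposed above could be run, sketched under assumptions of my own (each item gets a paired abstract rating and full-paper rating, collapsed to 'no position' versus 'takes a position'): an exact McNemar-style sign test on the discordant pairs. The counts in the example are invented for illustration.

```python
from math import comb

def exact_mcnemar_p(b, c):
    """Two-sided exact McNemar test on the discordant pairs:
    b = items rated 'no position' from the abstract but 'takes a position'
        from the full paper; c = the reverse transition."""
    n = b + c
    tail = sum(comb(n, k) for k in range(max(b, c), n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: 40 abstracts gained a position when the full paper
# was rated, while 8 lost one. A small p-value would reject the null of
# no systematic shift between abstract and paper ratings.
print(exact_mcnemar_p(40, 8))
```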

Carrick:

Brandon, sounds like we agree on what kappa is testing and what it isn't so let's move on, if you don't mind.

If when you say this, you mean to include the idea of moving on from the fact you claimed I didn't understand what kappa measures, even offering a quote which demonstrated I shared the same understanding as you... I do mind. I find it is usually pointless to have conversations where one person gets things totally wrong but refuses to acknowledge it. I think it's reasonable for me to expect a simple, "Oh, I misread what you wrote" or the like.

Quite frankly, I've had you, shub and Richard Tol all claim I've said or thought things I've never said or thought. Not a one of you has ever acknowledged having been wrong about any of it. It makes me inclined to walk away and write you guys off, not continue trying to have a discussion.

That's especially true given this behavior has been used to insult me here.

So if we start with the assumption that the data sets are inter-comparable, we should be able to use the kappa statistic or other tests to analyze the level of agreement of the two data sets.

This is a meaningless claim as nobody has disputed it. All anyone (namely me) disputes is whether or not the kappa test performed in this blog post could possibly be informative. I say it can't. Claiming two series are inter-comparable in no way implies kappa scores should be calculated. In this case, there is no reason to.

What you are suggesting indicates that the paper is so fundamentally flawed, that this is not even possible. That seems like a rather pessimistic view, but it may well be right.

This is another case of you claiming I say (or suggest) things that have no basis in what I've actually posted. I've never suggested the data sets aren't inter-comparable. You're massively exaggerating what I have said - that the kappa test is a bad test in this situation. Saying one form of comparison between two data sets is inappropriate in no way suggests I think the data sets are incomparable.

Because you are assuming the conclusion, of course.

You've provided no basis for this claim. You say "of course," but in reality, you've done nothing to support what you say. You're merely repeating an assertion over and over. If you want to make this assertion, show one of the assumptions/simplifications I used led to my results. If you can't, your assertion is baseless.

Brandon, while I'm at it, can you quantity this a bit:

No. There are already enough outstanding issues that ought to be resolved. Until we can agree to what simple sentences mean when I write them, I don't care to be forced into performing more calculations.

At the very least, I'd like everybody to be able to recognize I've understood what kappa measures all along, and I've never claimed the two data sets are not inter-comparable. If we can't agree to points that are that simple, there's no reason to continue posting.

Oct 17, 2013 at 5:21 AM | Unregistered Commenter Brandon Shollenberger

I want to point something out because I'm afraid it may have gotten lost in the rest of the conversation:

Way back in my first comment, I called into question whether or not time stamps were recorded for the Cook et al ratings. Richard Tol insisted they were (saying this was the Nth time he had told me). He later claimed John Cook told him this. I showed that was inconsistent with what he had said earlier. Five days later, he has still failed to address this inconsistency, and nobody else has questioned him on it.

This bugs me because it should be a simple matter to resolve. If John Cook told Richard Tol time stamps were recorded, when did it happen and why did he post indicating Cook had not told him this?

On a tangent, this is indicative of why I hate people deciding to randomly drop points of discussion. They always seem to do it when most convenient for them. It seems mostly to be a way of avoiding admitting mistakes.

Oct 17, 2013 at 5:37 AM | Unregistered Commenter Brandon Shollenberger

You want others to respond to questions you raise. Yet you won't answer to any of their points. You start your paragraphs with "nonsense" and other single word Realclimate style responses and holier-than-thou statements. Yet, you do not even understand that the kappa is not applicable directly to continuous data, a basic point about the test you are trying to criticize.

Kappa is a statistical test. If you want to show it is wrong, propose a hypothesis and verify it. Everything else is talk. Don't come with cooked-up examples that circularly prove their own assumptions. Take the Cook data and show that one-to-one correspondence is not required and that his results are hunky-dory.

Oct 17, 2013 at 9:56 AM | Registered Commenter shub

"At the very least, I'd like everybody to be able to recognize I've understood what kappa measures all along, ..."

No, you have not. I entirely agree with you that the two data sets are different. I differ in thinking the kappa test can be used and it is the inferential train that is different between how kappa is used in different settings. You, on the other hand, argued that two sets are different enough for kappa not to be used and Cook's sordid aggregate math is a good way of dealing with this problem. Secondly, as indicated by your tree ring and temperature example right at the beginning and further examples, you are thinking about this issue in terms of continuous variables when they are nominal variables.

The Cookean logic is insane: for every strange finding there is a different ad hoc explanation. How does the endorsement look? 97%, yay, high consensus. Why are there so many no-position papers? Because the science has already been settled. How does the author rating look? Yay, they got the same answer as us. Why are you not showing the actual correspondence? No, the two cannot be compared.

Oct 17, 2013 at 10:02 AM | Registered Commenter shub

Brandon:

If when you say this, you mean to include the idea of moving on from the fact you claimed I didn't understand what kappa measures, even offering a quote which demonstrated I shared the same understanding as you... I do mind.

Okay then mind. We'll rehash:


A paper could explicitly endorse a position within its body while only implicitly endorsing that position within its abstract. In that case, an accurate self-rating would necessarily "disagree" with an accurate SkS rating. That's two examples of ways in which the kappa score would indicate disagreement whether or not the ratings were accurate.

Whether or not the ratings were accurate is beside the point, but you made it a point. This is a confused argument even if you weren't really confused, as kappa measures only level of agreement not accuracy. The accuracy of the measures plays no roll here, both could be equally inaccurate and you might get a kappa of 1.

Also, this: "A paper could explicitly endorse a position within its body while only implicitly endorsing that position within its abstract. In that case, an accurate self-rating would necessarily "disagree" with an accurate SkS rating" would yield a low value of kappa correctly. It would be consistent with the statement that the two ratings do not agree because they measure different things.

Your comment is equivalent to saying "if I have a coffee cup full of hot coffee and a glass full of icy water the thermometer readings won't agree with each other". Well, yeah. That's how it's supposed to work. The temperatures are different so the thermometers don't agree.

You also seem to think that because Cook acknowledges that one is a coffee cup full of hot coffee and the other a glass full of icy water, that somehow makes the lack of agreement of the thermometers a non-issue. Of course it doesn't.

Failure to measure the same quantity (even if both are separately accurate) means a low value of kappa, which is what is expected when the data sets don't agree with each other because they aren't measuring the same quantities.


This is a meaningless claim as nobody has disputed it

"Meaningless" is the wrong word choice; in fact it is a word that is simultaneously both belittling and wrong. Is that really the effect you're going for?

Being agreed to (or undisputed) doesn't make the claim meaningless, rather it's one universally agreed to and hopefully then not meaningless at all. Otherwise we're all a bunch of gits.

This is another case of you claiming I say (or suggest) things that have no basis in what I've actually posted. I've never suggested the data sets aren't inter-comparable. You're massively exaggerating what I have said - that the kappa test is a bad test in this situation

I'm afraid you've massively misunderstood what I said.

Of course, I never said you claimed that the sets weren't inter-comparable. What I said is that what you claimed implies this, whether you realized it or not.

You claimed:

The authors stated a level of disagreement between their data sets that would ensure a low kappa score. One can show, mathematically, the information they provided made it impossible for a high kappa score to be found

The inability to get a high value of kappa, even in principle, between the two data sets would make them non-inter-comparable: it would mean that structurally they measure different things, which in turn means the methodology of the paper is so flawed as to disallow their comparison.

If you want to go further, you need to state the maximum value of kappa that you claim can be shown 'mathematically'; simply asserting it doesn't bolster your argument. Saying 'no' to this means an end of the conversation for me, because I'm not going to rehash the same contended points over and over like a scratched record.

Oct 17, 2013 at 2:13 PM | Unregistered Commenter Carrick

Brandon:

This bugs me because it should be a simple matter to resolve. If John Cook told Richard Tol time stamps were recorded, when did it happen and why did he post indicating Cook had not told him this?

If you think Tol is lying, write Cook yourself and find out the answer. If you don't think he's lying, be a man and accept him at his word.

On a tangent, this is indicative of why I hate people deciding to randomly drop points of discussion. They always seem to do it when most convenient for them. It seems mostly to be a way of avoiding admitting mistakes.

Well I'm sorry you hate it, perhaps there are strategies you could employ that would reduce people's inclination to bale on a thread.

On the same tangent, there are only so many times you can discuss the same topic before it becomes stale. Some issues just have to be dropped because they aren't ever going to be resolved.

Oct 17, 2013 at 2:28 PM | Unregistered Commenter Carrick

typo: * inclination to bail on a thread

shub:

You might have understood Brandon's point but I don't think he gets yours. He is trying to have his cake and eat it too - he wants to oppose kappa by saying the two data sets are not directly inter-comparable, and that Cook has managed to find a way to compare these un-comparable data sets.

That's my take too. If you can't get a high value of kappa due to methodological issues, then the sets aren't inter-comparable and that renders Cook's comparison invalid.

Measuring kappa in this case is still both appropriate and diagnostic, and, oddly, Cook's admission of flaws in his paper (which Brandon keeps bringing up as if it were an indulgence) tends to bolster your interpretation of a low value of kappa as invalidating Cook's paper.

I also don't think, if you do it properly, you'll get a low value of kappa other than as a result of the data sets not agreeing. Whether they could never agree is an issue I hadn't thought of, but clearly that doesn't bolster Cook's paper. At all.

Oct 17, 2013 at 2:44 PM | Unregistered Commenter Carrick

Carrick:

Whether or not the ratings were accurate is beside the point, but you made it a point. This is a confused argument even if you weren't really confused, as kappa measures only level of agreement not accuracy. The accuracy of the measures plays no roll here, both could be equally inaccurate and you might get a kappa of 1.
...
Your comment is equivalent to saying "if I have a coffee cup full of hot coffee and a glass full of icy water the thermometer readings won't agree with each other". Well, yeah. That's how it's supposed to work. The temperatures are different so the thermometers don't agree.

You claim accuracy "plays no roll [sic] here," yet the argument in this blog post requires kappa measure accuracy, not disagreement, to be true. Showing kappa doesn't measure accuracy was essential to showing the conclusions of the blog post were wrong.

Your claim that I misunderstood what kappa measures, and your current claim that my argument was confused, hinges on ignoring the reason I discussed accuracy vs. agreement. The point of the argument you call confused was, "Kappa doesn't measure accuracy so a low kappa score doesn't invalidate the authors' results." It's no different than what you've said.

In other words, I said the same thing you said, but when I said it, it was wrong.

Meaningless use the wrong word choice, in fact it is a word that is simultaneously both belittling and wrong. Is that really the effect you're going for?

Being agreed to (or undisputed) doesn't make the claim meaningless, rather it's one universally agreed to and hopefully then not meaningless at all. Otherwise we're all a bunch of gits.

This is a silly semantic point. I said your claim was meaningless as nobody had disputed it. As in, you raised a point that wasn't in dispute. It had as much meaning as randomly saying, "The sky is blue today." The statement may be true, and in a technical sense it may have meaning, but it is meaningless in the context of the discussion being had.

I'm afraid you've massively misunderstood what I said.

Of course, I never said you claimed that the sets weren't inter-comparable. What I said is that what you claimed implies this, whether you realized it or not.

You said what I've written suggests this. I responded by saying I never said or suggested such. Your response is to harp on the distinction between referring to what I said and referring to what I suggested, as though that distinction refutes my remarks. This ignores the fact I disputed both at the same time.

The inability to get a high value of kappa, even in principle, between the two data sets would make them non-inter-comparable: it would mean that structurally they measure different things, which in turn means the methodology of the paper is so flawed as to disallow their comparison.

I never said it was impossible for the data sets to have a high value in principle. I said, many times, that it was impossible given the knowledge we had about the data sets.

Well I'm sorry you hate it, perhaps there are strategies you could employ that would reduce people's inclination to bale on a thread.

Would one of these strategies be to not use words like "nonsense" and "meaningless," words you happily use but insult me for using? I ask because that's how you joined the discussion, and that suggests it's a strategy you support.

Personally, I doubt that'd work. I suspect the issue has little, if anything, to do with my tone. I think the issue is simply that people like to avoid discussing mistakes they make. That happens all the time. For example, Stephan Lewandowsky just broke off communication with me, basically saying, "Get lost" even though I used a perfectly polite tone the entire time. This is telling as he did this only after first attempting to refute the issues I raised. It was when he had no answer that he resorted to ending the discussion.

Oct 17, 2013 at 6:19 PM | Unregistered Commenter Brandon Shollenberger

shub:

Yet, you do not even understand that the kappa is not applicable directly to continuous data, a basic point about the test you are trying to criticize.

I understand what kappa can and cannot do just fine. I've never done anything to show otherwise. In each case I used kappa, the data was in discrete categories.

You start your paragraphs with "nonsense" and other single word Realclimate style responses and holier-than-thou statements.

If you want to complain about people using words like "nonsense" to start paragraphs, you shouldn't completely distort what they say. You shouldn't boldly claim people think or believe things they don't think or believe. You shouldn't insult them over and over by claiming they don't know what they're talking about. If you're content to behave the way you do, you have no room to complain when people behave like I do. To wit:

You want others to respond to questions you raise. Yet you won't answer to any of their points.

Bull. I've responded to many points people have raised. I refuse to address some points as I've explained they're irrelevant (a point you've never refuted), but there is an enormous difference between some points and all points.

You're grossly distorting what I've done to create an insulting portrayal of me, and there's no way one could possibly justify your depiction. That merits the description: Bull. Same for comments like:

You, on the other hand, argued that two sets are different enough for kappa not to be used and Cook's sordid aggregate math is a good way of dealing with this problem.

I have never argued Cook's methodology was good or appropriate. All I've said is the kappa test cannot possibly show his methodology was bad. You are flat-out making things up about me, and you're doing it while consistently refusing to quote my words as I've asked you to do.

And for the record:

Secondly, as indicated by your tree ring and temperature example right at the beginning and further examples, you are thinking about this issue in terms of continuous variables when they are nominal variables.

If one is talking about measured data sets, not the underlying physical phenomenon, the data is discrete. The limited precision of measurements ensures there is a finite sample space.

And not that it matters, but I used tree rings to highlight the nature of proxies in general, not to discuss an application of the kappa test.

Oct 17, 2013 at 6:22 PM | Unregistered Commenter Brandon Shollenberger

I'm going to say this in a separate comment so it is not lost or unclear:

If it is impossible to demonstrate agreement, even in principle, even with otherwise ideal, errorless measurements, that would mean they are not statistically inter-comparable results. That's a big deal.

You might have understood Brandon's point but I don't think he gets yours. He is trying to have his cake and eat it too - he wants to oppose kappa by saying the two data sets are not directly inter-comparable, and that Cook has managed to find a way to compare these un-comparable data sets.
That's my take too. If you can't get a high value of kappa due to methodological issues, then the sets aren't inter-comparable and that renders Cook's comparison invalid.

It was theoretically possible for the paper ratings and abstract ratings to match perfectly. It was a possibility we'd have no reason to expect, but it was possible, in theory. This theoretical possibility was ruled out only when the data were shown to rule it out. One example has been quoted several times:

Among self-rated papers not expressing a position on AGW in the abstract, 53.8% were self-rated as endorsing the consensus. Among respondents who authored a paper expressing a view on AGW, 96.4% endorsed the consensus.

Once 53.8% of the largest abstract rating category was found to disagree with the paper ratings, it was no longer possible for the abstract and paper ratings to match perfectly. It was no longer possible to get a kappa score of 1.

In fact, based on the numbers published in Cook et al, we'd expect that largest category to make up ~60% of the data. That means if we knew 53.8% of that category disagreed, we'd expect at least ~30% of the total data to show disagreement. 30% doesn't translate directly into a kappa score, but it should be obvious the kappa score would have to be well below 1. And that's based entirely upon what the authors openly stated and showed was in line with expectations.

An expected aspect of the data discussed by the authors would necessarily lead to a kappa score well below 1. That means doing a kappa test with a naive scale of 0 - 1 was inappropriate.

Oct 17, 2013 at 6:42 PM | Unregistered Commenter Brandon Shollenberger
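A back-of-the-envelope version of the bound argued for in the comment above, using only the percentages it quotes (roughly 60% of abstracts in the largest, no-position category, and 53.8% of those self-rated as endorsing): the observed-agreement term is then capped near 0.68, which caps kappa well below 1 for any chance-agreement term. The range of chance-agreement values tried below is my own assumption.

```python
# Figures quoted in the comment above.
no_position_share = 0.60   # share of abstracts in the largest category
disagreeing_share = 0.538  # share of those whose self-ratings disagree

p_o_max = 1 - no_position_share * disagreeing_share  # ≈ 0.68

# kappa = (p_o - p_e) / (1 - p_e); capping p_o caps kappa for any
# assumed chance-agreement term p_e.
for p_e in (0.2, 0.4, 0.6):
    print(f"p_e = {p_e:.1f} -> max kappa ≈ {(p_o_max - p_e) / (1 - p_e):.2f}")
```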

Brandon:

You claim accuracy "plays no roll [sic] here," yet the argument in this blog post requires kappa measure accuracy, not disagreement, to be true.

It's not just a claim that kappa doesn't measure accuracy but rather level of agreement, it's a fact. You can have kappa = 1 with two series, both with the same bias, and both equally inaccurate.

Showing kappa doesn't measure accuracy was essential to showing the conclusions of the blog post were wrong

Sorry, but no. That's not the issue at all.

The premise of this blog post is that Cook claims that the authors' rankings can be used as a "gold standard" to externally validate the reviewers' rankings.

If you accept the premise, then kappa is a valid measure of internal validity of the paper. A high value of kappa, without any question, would have been a validation of the paper, as it is a much stronger test than the one that was actually used.


The result of the test was a low value of kappa. The question is then one of interpretation of this value of kappa rather than your erroneous argument that the use of the kappa statistic was itself invalid.

To remind again, your statement "this post is nothing but a test applied in obviously wrong way" is itself invalid. If there are problems with the post (I agree there are), this is not one of them. Ironically, you also demonstrated how to compute kappa "in the wrong way" in your own criticism.

It's easy to assume from this that you are arguing from the seat of your pants and have no clue what kappa is or what the point of shub's post is. I don't assume that, of course, but it is what it looks like. Had I written a blog comment as inherently flawed as the one you did, especially with the needless rant at the end, you would have torn me a new a$$hole and sent me packing. Deservedly.

Oct 18, 2013 at 3:39 PM | Unregistered Commenter Carrick

This is a silly semantic point. I said your claim was meaningless as nobody had disputed it.

Again the belittling word choices.... The claim wasn't meaningless, it was meaningful, and pointing out that the right word choice is exactly the opposite of the one you made (a choice I assume was made for purposes of belittlement) isn't exactly silly. Finally, you state agreement with the "meaningless" statement, which makes it "a stated point of agreement." Hardly "meaningless". The irony is rich.

You said what I've written suggests this. I responded by saying I never said or suggested such. Your response is to harp on the distinction between referring to what I said and referring to what I suggested, as though that distinction refutes my remarks. This ignores the fact I disputed both at the same time.

First of all, I never said you intended to suggest what your words suggest. I'm not a mind reader and I try not to impose motivations on what other people write:

The authors stated a level of disagreement between their data sets that would ensure a low kappa score. One can show, mathematically, the information they provided made it impossible for a high kappa score to be found

The quoted statement is both clearly written and, in the corpus of statistical science, has an exact meaning, other than what "low" and "high" kappa values are, of course. I assume that by "high values of kappa" you mean values of kappa that validate the hypothesis of agreement between the two data sets, and by "low values of kappa", values that don't validate this hypothesis. [For the record, when I use the "high" and "low" qualifiers for kappa, this is what I mean.]

Now, again, you have claimed that you can prove mathematically that it is impossible to get a high kappa score. That is a very strong statement to make.

Noisy data isn't an example of where high values of kappa are mathematically impossible, because you can get high values of kappa by chance. And that can be proven mathematically too.

The only way you can mathematically limit kappa so that you can't even get "high values of kappa" is if there is a structural flaw in Cook's paper that precludes the possibility of high values of kappa, even in principle.

Oct 18, 2013 at 4:14 PM | Unregistered Commenter Carrick
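A quick Monte Carlo of the point about chance agreement, under assumptions of my own (three equally likely categories, two completely independent raters): for small samples a kappa above 0.4 does turn up by luck now and then, while for large samples it essentially never does.

```python
import random

def kappa(a, b, k=3):
    n = len(a)
    table = [[0] * k for _ in range(k)]
    for i, j in zip(a, b):
        table[i][j] += 1
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    po = sum(table[i][i] for i in range(k)) / n
    pe = sum(r * c for r, c in zip(rows, cols)) / n ** 2
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

random.seed(1)
for n in (10, 1000):
    hits = sum(
        kappa([random.randrange(3) for _ in range(n)],
              [random.randrange(3) for _ in range(n)]) >= 0.4
        for _ in range(5000)
    )
    print(n, hits / 5000)  # the share of chance "agreements" shrinks with n
```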

Keep a lid on it please gentlemen

Oct 18, 2013 at 4:16 PM | Registered Commenter Bishop Hill

Brandon:

Once 53.8% of the largest abstract rating category was found to disagree with the paper ratings, it was no longer possible for the abstract and paper ratings to match perfectly. It was no longer possible to get a kappa score of 1.

Technically, you don't need a kappa of 1 to indicate agreement. Anything over 0.4 would be viewed as tending to confirm the hypothesis of agreement of the data sets.

An expected aspect of the data discussed by the authors would necessarily lead to a kappa score well below 1.

But not necessarily one that fails to validate the hypothesis that the two data sets agree.

Oct 18, 2013 at 6:01 PM | Unregistered Commenter Carrick

Carrick, I think at the point a person starts taking steps to avoid moderation filters, it's inappropriate to continue responding to them. Before I quit though, I want to highlight something:

It's easy to assume from this that you are arguing from the seat of your pants and have no clue what kappa is or what the point of shub's post is. I don't assume that, of course, but it is what it looks like. Had I written a blog comment as inherently flawed as the one you did, especially with the needless rant at the end, you would have torn me a new a$$hole and sent me packing. Deservedly.

I've never ranted on this site, and if you had posted the same way I have here, I'd have responded far more graciously than you have. You're claiming I'd behave a certain way, but that's no better than you and shub repeatedly claiming to know what I think. Neither of you has ESP, and you can't see the future. You even acknowledge this notion when you say:

I'm not a mind reader

Yet you've repeatedly claimed to know what I think and what I'd do.

Regardless, since you apparently can't abide by simple moderation rules, and since the host has asked for this to end, I'm going to bow out now. Please try to remember you claim I create a hostile environment, yet you joined this discussion simply to attack me, and you're now the one who cursed and took steps to bypass moderation.

For the other readers, I'd like to point out Carrick has repeatedly agreed this post's conclusions are wrong. That means whatever you may think about me or my posts, the primary point I was trying to make from the start is agreed to by almost everybody here. The post hasn't been changed though, and I believe shub still defends it.

Oct 18, 2013 at 7:52 PM | Unregistered Commenter Brandon Shollenberger

No one's agreed that anything is wrong. If something is 'wrong', prove it. Just saying it won't fly. You are the one avoiding discussion of why you believe kappa is wrong.

Oct 18, 2013 at 9:32 PM | Registered Commenter shub

I agree to the most trivial possible observation that volunteers rated abstracts and authors rated papers. There is no further agreement with you on any single point whatsoever.

Nowhere that I can find is there any indication that authors read their own papers fully before rating them, and there is nothing in Cook's instructions asking them to do so. This is an assumption to begin with. You've built your entire kappa castle on it. How does it feel?

When arguments are conducted, one needs to fix at least a few points at a time so that others can be contested. I, or Carrick, or anyone else might be doing the same when there are points of agreement. When I agree with you on the differences between abstract and paper ratings, it falls in that category. While there is nothing facile about it, it is only an indication that I can contest other, larger issues while keeping smaller ones aside. Since you, on the other hand, are on an arm-waving tirade trying to catch anything that sticks in order to support your initial high rhetoric, no such agreement with you is meaningful. The other modus operandi consists of imagining or willing that the conversation be driven in the way you think it ought to be. It doesn't work that way either.

Oct 18, 2013 at 9:49 PM | Registered Commenter shub

Brandon:

Regardless, since you apparently can't abide by simple moderation rules, and since the host has asked for this to end, I'm going to bow out now. Please try to remember you claim I create a hostile environment, yet you joined this discussion simply to attack me, and you're now the one who cursed and took steps to bypass moderation.


Actually, I used a scatological term, not a curse word. I'm sure nobody has seen it before, nor, I would expect, would it require the dollar signs to "pass moderation". As far as I know, my comment abides by the Bishop's house rules; if it doesn't, I apologize.


Secondly, I didn't join this thread to attack you---is that not an assumption of motive on your part? Nor did I use the dollar signs to "pass moderation"... both a personal attack on your part and an assumption of my motivation to boot. Class act there.

Third, as far as I know I haven't assumed motive on your part.... nor even what you know or don't know. I do know that what you are saying is confused and at times even wrong, but I don't know whether that's an issue with communication or speaks to a more fundamental lack of understanding. I'll admit I've repeatedly pointed out the errors in your initial comment, but I had suggested we move on from this. It is you who insisted on playing this scratched record over and over.

As to whether what you've said amounted to rants, +1 on that last comment, as that qualifies in my book.

Oct 18, 2013 at 11:15 PM | Unregistered Commenter Carrick

shub, briefly I agree (partially) with Brandon here. I'll come back in the morning (hopefully) and summarize my comments, as I've pulled "wife aggro."

Oct 18, 2013 at 11:17 PM | Unregistered Commenter Carrick
