illuminating science

29/6/2008

The inner crowd or just statistics?

Filed under: — Joel @ 2:50 am

Synopsis: A recent journal article claims (literally) that second guessing ourselves gives better accuracy. I think this is wrong, and that what they’re seeing is just statistics. I’d like some opinions on this!

A recent paper in the journal Psychological Science (and much publicised by The Economist in an easier to digest article) makes some interesting claims. Basically the story goes that if you ask two people to estimate something (like the number of jelly beans in a jar, or the percentage of world airports in the USA), then averaging their two guesses gives you a better estimate, on average, than either guess alone. It’s called the wisdom of crowds; extend it to a hundred people and, up to a point, the group guesses get better.

Fair enough, I could believe this, I think (more specifically, the group will average towards the “group bias”, and you’re assuming this is a “good” guess. I digress…) But the new paper goes one step further: it suggests that even one person can improve their guess by making two guesses and averaging them, improving accuracy by 10 percent. The article’s authors suggest:

Although people assume that their first guess about a matter of fact exhausts the best information available to them, a forced second guess contributes additional information, such that the average of two guesses is better than either guess alone. This observed benefit of averaging multiple responses from the same person suggests that responses made by a subject are sampled from an internal probability distribution, rather than deterministically selected on the basis of all the knowledge a subject has.

Translation: we don’t come up with one best guess straight off; instead, each “guess” comes from a range of possible values our brain has computed. Furthermore, they note that a delay of three weeks between the first and second guess improves the average, presumably by making the guesses more “independent”.

But something about this bugged me, so I did a little computational experiment myself: Generate a random number (between 0.0 and 1000.0, fractions allowed) which is the “right” answer and generate two more random numbers which are my guesses (also between 0.0 and 1000.0). Then, look at the difference between the first guess and the “correct” answer, and between the average of my two guesses and the “correct” answer. Repeat.

So to be clear: I’m randomly choosing the correct answer, and then I’m randomly making two guesses with no information other than a lower and upper bound. This would be perfectly reasonable for questions like the airport percentage above (which was mentioned in the article) where you know the answer is between 0-100. I did 1200 tests (in Excel; it’s not enough for the averages to be absolutely constant, but it doesn’t change the qualitative results.)
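For anyone who wants to repeat this without Excel, here is a minimal Python sketch of the experiment as described above. It is not the original spreadsheet: the function name, the seed and the dictionary keys are illustrative assumptions, and only the standard library is used.

```python
import math
import random

def simulate(trials=1200, low=0.0, high=1000.0, seed=0):
    """Compare one random guess with the mean of two random guesses."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    abs_one = abs_avg = sq_one = sq_avg = 0.0
    for _ in range(trials):
        answer = rng.uniform(low, high)  # the randomly chosen "right" answer
        g1 = rng.uniform(low, high)      # first guess
        g2 = rng.uniform(low, high)      # second guess
        avg = (g1 + g2) / 2.0            # mean of the two guesses
        abs_one += abs(g1 - answer)
        abs_avg += abs(avg - answer)
        sq_one += (g1 - answer) ** 2
        sq_avg += (avg - answer) ** 2
    return {
        "mad_one": abs_one / trials,               # mean absolute deviation, guess 1
        "mad_avg": abs_avg / trials,               # mean absolute deviation, averaged guess
        "rmse_one": math.sqrt(sq_one / trials),    # RMSE, guess 1
        "rmse_avg": math.sqrt(sq_avg / trials),    # RMSE, averaged guess
    }

if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.0f}")
```

With only 1200 trials the numbers wobble from seed to seed, so raising `trials` is the easiest way to get stable values.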

The results: On average, the first guess was off by 330 (which I’m sure you could argue theoretically, too). But for the average (mean) of my two guesses, I was only off by 290, which is roughly 12% better than before! Here’s a summary:

Average “answer”: 492
First Guess average: 495
Average of two guesses average: 499
Deviation of Guess 1 (average): 332
Deviation of AvGuess (average): 293
Percentage change in deviation: -12%
RMSE Guess 1: 406
RMSE AvGuess: 360
Percentage change in RMSE: -11%

Notice the averages are in about the right place; my data is pretty random, and yet simply taking the average gets me closer.

I’ve also included there the root mean square errors (RMSE), which were discussed in the paper. [Disclaimer: I don’t really have a stats background. For the RMSE I took the difference between each guess and the corresponding correct answer, squared each difference and added the results, then averaged this sum and took the square root. I hope that’s right…] Unfortunately, I don’t know the exact data values and ranges used in the article, so I’m not quite sure how to compare them directly, but the percentage changes are what’s important, I think, and they fit nicely with the paper’s predictions (of between 5% and 15%).
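The RMSE recipe in the disclaimer can be written out as a short Python sketch. The function name `rmse` and the toy numbers are assumptions made for illustration; they are not taken from the paper or from the spreadsheet.

```python
import math

def rmse(guesses, answers):
    """Root mean square error between paired guesses and correct answers."""
    if len(guesses) != len(answers) or not guesses:
        raise ValueError("need two non-empty lists of equal length")
    # Difference for each pair, squared
    squared = [(g - a) ** 2 for g, a in zip(guesses, answers)]
    # Average the squares, then take the square root
    return math.sqrt(sum(squared) / len(squared))

# Toy check: errors of 3 and -4 give squares 9 and 16, whose mean is 12.5,
# so the result is sqrt(12.5).
print(rmse([13.0, 16.0], [10.0, 20.0]))
```

Because each error is squared before averaging, one large miss counts for more than several small ones, which matters for the fixed-answer trials further down.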

Obviously, I made some assumptions, in particular that no person had the faintest clue what the “right” answer should be. But that’s not ridiculous, and it shouldn’t be hard to redo the data with a normal distribution of guesses around the mean; I’d expect it to give the same result, but the conclusion is now trivial (average two deliberately normally distributed guesses, and you’re going to get closer to the mean, on average :) ).

I also did a couple of trials where I fixed the “correct answer” to the same thing for every person. If the correct answer was 0 (out of 1000), then, as you’d expect, a single guess and an averaged guess both produced the same average deviation, 500. Interestingly, however, their RMSEs were quite different, and the averaged guess again produced a 6% better guess by this metric. Same for a correct answer of 1000. Why? Averaging the two numbers produces a lower standard deviation around the average (500); the RMSE weights big differences more, so the values further away from 0 (or 1000, respectively) contribute much more, even allowing for those corresponding closer values. (Does that make sense?) So RMSE might not be a great measure here.

What about if we take 500 as the correct answer? Then the effect is even more pronounced: the averaged guess has a lower standard deviation and so sits closer to 500 on average, which makes the improvement much larger. 250 or 750 were similar, and I’d expect this to be true for the whole range.
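The fixed-answer trials can be sketched the same way. This is a hedged Python illustration rather than the original Excel work; `fixed_answer_errors`, the trial count and the seed are all assumed names and values.

```python
import math
import random

def fixed_answer_errors(answer, trials=100_000, seed=0):
    """Return (MAD one guess, MAD averaged, RMSE one guess, RMSE averaged)."""
    rng = random.Random(seed)  # seed is an assumption, for repeatability
    abs_one = abs_avg = sq_one = sq_avg = 0.0
    for _ in range(trials):
        g1 = rng.uniform(0.0, 1000.0)
        g2 = rng.uniform(0.0, 1000.0)
        avg = (g1 + g2) / 2.0
        abs_one += abs(g1 - answer)
        abs_avg += abs(avg - answer)
        sq_one += (g1 - answer) ** 2
        sq_avg += (avg - answer) ** 2
    return (abs_one / trials, abs_avg / trials,
            math.sqrt(sq_one / trials), math.sqrt(sq_avg / trials))

if __name__ == "__main__":
    # Same answer for every "person", tried at several points in the range
    for answer in (0.0, 250.0, 500.0, 750.0, 1000.0):
        mad1, mad2, rmse1, rmse2 = fixed_answer_errors(answer)
        print(f"answer={answer:6.0f}  MAD {mad1:4.0f} -> {mad2:4.0f}  "
              f"RMSE {rmse1:4.0f} -> {rmse2:4.0f}")
```

Running it for answers at the edges and in the middle of the range makes it easy to see where averaging helps only the RMSE and where it helps both metrics.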

In conclusion, I would argue that any benefit seen in this study is simply from statistics, not from some innate feature of our brain’s estimating abilities. The only thing they did get right is that the time delay probably allows the second guess to be more “random” which seems to be useful, on average!

Thoughts? If this makes sense, I’m going to write a rebuttal/similar, but I’d love some feedback first (particularly if I’m completely wrong and/or have missed something obvious!) The preprint article is missing a key figure, but I don’t think I’ve misunderstood or missed anything important (except for their actual data values). All comments welcome!

Addendum: I finally did some simple analytical calculations. When both guesses and the answer are chosen randomly and independently from 0 to 1000, the RMSE of one guess should be about 408, compared to about 354 for two averaged guesses, just like the simulations predict! I’ve also checked that the average absolute difference for one guess should be 1000/3, which looks right, but I haven’t yet slogged through the maths for the averaged-guess case (does anyone know a shortcut for analytical expectation values of absolute values?!)
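One way to reach RMSE figures like these is sketched below. It is a reconstruction under the stated assumptions (guesses and answer independent and uniform on the same interval), not necessarily the calculation that was actually done. The symbols $L$ (interval length, here 1000), $G_1$, $G_2$, $\bar{G}$ and $A$ are introduced only for this sketch.

```latex
% G_1, G_2, A independent and uniform on [0, L], so each has variance L^2/12
% and the same mean L/2; \bar{G} = (G_1 + G_2)/2 has variance L^2/24.
\begin{align*}
\mathbb{E}\big[(G_1 - A)^2\big] &= \operatorname{Var}(G_1) + \operatorname{Var}(A)
  = \frac{L^2}{12} + \frac{L^2}{12} = \frac{L^2}{6},
  &\text{RMSE} &= \frac{L}{\sqrt{6}} \approx 408.2 \\
\mathbb{E}\big[(\bar{G} - A)^2\big] &= \operatorname{Var}(\bar{G}) + \operatorname{Var}(A)
  = \frac{L^2}{24} + \frac{L^2}{12} = \frac{L^2}{8},
  &\text{RMSE} &= \frac{L}{\sqrt{8}} \approx 353.6
\end{align*}
% Ratio of the two RMSEs: \sqrt{6/8} = \sqrt{3}/2 \approx 0.866
```

The ratio does not depend on $L$, so under these assumptions the averaged guess always has an RMSE about 13% lower, whatever the range of the question.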

Addendum #2: I’ve now done the mean absolute difference case (I’m so slow sometimes…) and I predict 333 for the single guess or 291 for the averaged guess, right on the money! Thanks to Tim for useful comments about triangular probability distributions!
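For reference, the same two numbers can be reached without integrating against the triangular distribution directly. The sketch below is an alternative route under the same assumptions (answer and guesses independent and uniform), rescaled to the interval $[0, 1]$; it is not claimed to be the calculation used above, and the symbols $g$, $G$, $\bar{G}$ and $A$ are introduced here for illustration.

```latex
% Step 1: for a fixed guess g in [0,1] and an answer A uniform on [0,1]
\begin{align*}
\mathbb{E}\,|g - A| &= \int_0^g (g - a)\,da + \int_g^1 (a - g)\,da
  = \frac{g^2}{2} + \frac{(1-g)^2}{2} = g^2 - g + \frac{1}{2}
\end{align*}
% Step 2: average over the guess, which only needs its first two moments.
% Single guess:   E[G] = 1/2,       E[G^2] = 1/3
% Averaged guess: E[\bar{G}] = 1/2, E[\bar{G}^2] = Var + mean^2 = 1/24 + 1/4 = 7/24
\begin{align*}
\mathbb{E}\,|G - A| &= \frac{1}{3} - \frac{1}{2} + \frac{1}{2} = \frac{1}{3} \\
\mathbb{E}\,|\bar{G} - A| &= \frac{7}{24} - \frac{1}{2} + \frac{1}{2} = \frac{7}{24}
\end{align*}
```

Scaling back up by 1000 gives about 333.3 and 291.7, a reduction of 12.5%.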

Tim Says:

Joel, you might find the following wikipedia article interesting (it explains why we see so many Gaussian distributions “in the wild”):

http://en.wikipedia.org/wiki/Central_limit_theorem

Joel Says:

Thanks Tim! That was a good read. If I choose random numbers over an interval, however, that won’t be normally distributed, right? It’s simply going to be a flat distribution. What about the averages? That’s not clear to me, or for the differences from the “correct” answer in either case. Do you have more thoughts?

 
 
Tim Says:

If you have two independent random variables, the distribution of the sum will be the convolution of the two distributions of the single variables. If you convolve a uniform distribution with itself, you get a “triangular” distribution over an interval twice as long.

(For diagrams, the convolution of http://en.wikipedia.org/wiki/Image:Zeroorderhold.impulseresponse.svg with itself gives http://en.wikipedia.org/wiki/Image:Delayedfirstorderhold.impulseresponse.svg with an appropriate scaling factor to give an integral of 1)

If you then halve this distribution (i.e., take the average, since we’ve just added two variables), you’ll get the original interval again, but with the triangular distribution.

So the distribution of the average of the two uniformly distributed variables (same interval in both cases) is unimodal about halfway through the interval. You can look at this as being a “bad approximation” to a normal distribution, which the central limit theorem shows you get if you keep on convolving with uniform distributions ad infinitum.
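To see the triangular shape concretely, here is a small Python sketch of the convolution described above. It uses a discrete stand-in for the continuous case: guesses restricted to 0, 100, ..., 1000 is an assumption made purely to keep the example short, and `convolve` is a hand-rolled helper, not a library call.

```python
def convolve(p, q):
    """Discrete convolution of two lists of probabilities."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj  # probability that the two indices sum to i + j
    return out

n = 11                              # guesses limited to 0, 100, ..., 1000 (assumption)
uniform = [1.0 / n] * n             # flat distribution for a single guess
total = convolve(uniform, uniform)  # distribution of g1 + g2 over 0, 100, ..., 2000

# Halving the sum maps index k to an average of k * 50, back on 0..1000.
for k, prob in enumerate(total):
    print(f"{k * 50:5d} {'#' * round(prob * 200)}")
```

The printed bars rise linearly to a peak at 500 and fall off again, which is the triangular distribution over the original interval.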

The metric you’re using for calculating is essentially sum of squares (with a square root at the end); this will heavily penalise large differences between the correct answer and the guessed answer. When you take a uniformly distributed guess, you’re just as likely to be right next to the correct answer as bloody far away from it. If you take the average, you’re biasing the result towards the centre of the interval, which in turn minimises the likelihood of being far away from the correct answer.

This last argument is extremely unrigorous, but I think there might be something in it. What happens if you use a different error metric, like simply sum of differences?

Joel Says:

Ah, right! That makes sense, thanks for clarifying! In the table above, what I’m dodgily calling “Deviation of Guess 1” and “Deviation of AvGuess” is the average of the (absolute) differences (which I understand is the mean absolute deviation). No squaring involved, but I am taking the absolute value each time by necessity (else I’ll just get zero). And I still find that the average absolute separation is about 10% less, which I presume is a result of the more central distribution like you were saying.

I mean, all this isn’t to say that their arguments couldn’t be right, but they’d have to explain away these simple statistical effects first. Would you agree?

Tim Says:

Yes, I certainly don’t mean to suggest that the paper is based on bad science! However I think this should (if nothing else) illustrate why good statistical analysis is tough.

I’ll have to have a think about the different factors that could influence the results and whether they can be tested for. It’s a nice little thought experiment.

Joel Says:

*grin* Well, I think I am dubious about the science content of the paper, but I can’t think how you could necessarily separate out bad stats from exciting brain discoveries.

Like you say, though, it’s a great example to use! And if I can write a rebuttal paper, then a nice thing for my ego :)

Tim Says:

As an addendum, I read the paper as saying “this is one possible conclusion that does not contradict our results”.
