How should we compute post scores?

You use the lower bound if ranking an item with few votes higher than it really deserves is worse than ranking it lower than it deserves, the upper bound if it’s the opposite, and the middle if the two mistakes are about equally bad.

A poor-but-new answer coming to the top isn’t the disaster it would be on a shopping site, and if new answers sit low in the list, they may never get voted on at all. The center of the interval seems like the right place to me.

A z of sqrt(2) works out to being “wrong” about 16% of the time, with it underestimating how extreme the true rating is, which counterintuitively ends up making the center of the interval more extreme than it should be. If you use a z of 2, as recommended by Wikipedia, it’s wrong a little less than 5% of the time, and the constants, 2 and 4, are almost as nice as the 1 and 2 from your proposal.
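If anyone wants to double-check those error rates, here’s a quick sketch (plain Python, standard library only; the function name is mine) that recovers them from the normal CDF:

```python
from math import erf, sqrt

def two_sided_error(z: float) -> float:
    """Chance the true rating falls outside the +/- z interval,
    under the normal approximation behind the Wilson interval."""
    phi = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z
    return 2 * (1 - phi)

print(f"z = sqrt(2): wrong {two_sided_error(sqrt(2)):.1%} of the time")  # ~15.7%
print(f"z = 2:       wrong {two_sided_error(2):.1%} of the time")        # ~4.6%
```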

My gut says that a z of sqrt(2) will feel a little more fun, but also quirkier, than a z of 2. A z of 2 will keep items with few votes in the safe-but-boring center for a bit longer.

2 Likes

I’m absolutely happy with the z=2 value.

If no one else disagrees, then I’d say we have a consensus here:

Rank the questions on (upvotes+2)/(upvotes+downvotes+4).

It’s an easy formula (therefore transparent even to people with no mathematical knowledge), and at the same time has a scientific basis (the Wilson score interval).
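For concreteness, here’s a minimal sketch showing that the formula really is the center of the Wilson interval at z=2 (the function names are mine, purely illustrative):

```python
def simple_score(up: int, down: int) -> float:
    """The agreed ranking formula: (upvotes + 2) / (upvotes + downvotes + 4)."""
    return (up + 2) / (up + down + 4)

def wilson_center(up: int, down: int, z: float = 2.0) -> float:
    """Center of the Wilson score interval: (up + z^2/2) / (n + z^2).
    With z = 2 this reduces algebraically to simple_score."""
    n = up + down
    return (up + z * z / 2) / (n + z * z)

assert abs(simple_score(59, 4) - wilson_center(59, 4)) < 1e-12
```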

The question that remains is what to display on the answers. My suggestion would be to simply show the number of upvotes and downvotes separately.

Note: If we decide to give different weights to votes from users with different experience levels, then upvotes/downvotes here should refer to the sums of the weights of the corresponding votes.

4 Likes

Can someone provide a visual (text or a mockup) example of posts sorted using that Wilson method? I’m having a rough time figuring this one out :s

I suggest showing the numbers of upvotes and downvotes and using this computed score for ordering. In other words, don’t expose the score through the UI but expose the votes instead. If we show the score, some people are going to be confused when voting doesn’t change it by 1.

Exposing the votes also shows how much voting there’s been. A raw score doesn’t tell you whether it came from 3 voters or 30. The raw counts also expose controversy, which is useful for a reader trying to decide whether to use the recommendations in an answer.

1 Like

I would put the “2” and “4” in the database. That way another instance of Codidact or a community within an instance can choose different constants without changing the code.
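For example (just a sketch; the setting names are made up, not a schema proposal):

```python
# Hypothetical per-community settings loaded from the database;
# the defaults reproduce the Wilson center with z = 2.
DEFAULT_SCORING = {"up_bonus": 2, "total_bonus": 4}

def score(up: int, down: int, settings: dict = DEFAULT_SCORING) -> float:
    return (up + settings["up_bonus"]) / (up + down + settings["total_bonus"])
```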

2 Likes

I tried, but my little one hit “save” before it was ready. I then edited the post for a few hours, creating 39 examples in total, with a comparison of how the two methods rank those items. When I hit save, Discourse popped up a generic error message and lost all that work. I doubt there is a way to get it back.

I’ll summarize what you would have seen. The Wilson score is a prediction of how likely a vote is to be an upvote, based on the votes so far, so it comes out as a number between 0 and 1, exclusive. The ratio of upvotes to downvotes is the most important factor, but it will give a slightly higher score for 20:10 than for 10:5, because more data means more confidence.

The Stack Exchange ranking tends to put items with a lot of votes at the extremes, either at the top or the bottom of the list. This is most notable when one kind of vote is about double the other. SE treats 100:50 as ten times better than 10:5, while Wilson scoring treats them almost identically (66.23% versus 63.16%).
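You can reproduce that comparison directly (a sketch using the z=2 center formula from above):

```python
for up, down in [(10, 5), (20, 10), (100, 50)]:
    se = up - down                       # SE-style score
    wilson = (up + 2) / (up + down + 4)  # Wilson center, z = 2
    print(f"{up}:{down}  SE score {se}  Wilson {wilson:.4f}")

# 10:5   SE score 5   Wilson 0.6316
# 20:10  SE score 10  Wilson 0.6471
# 100:50 SE score 50  Wilson 0.6623
```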

2 Likes

Consider the following actual question on Mathematics.SE:

https://math.stackexchange.com/questions/382736/are-all-prime-numbers-finite

As of this writing, the non-deleted answers have the following scores, upvotes and downvotes:

| SE order | SE score | upvotes | downvotes | Wilson score (z=2, center) | Wilson order |
|---|---|---|---|---|---|
| 1 | 83 | 86 | 3 | 0.946 | 1 |
| 2 | 55 | 59 | 4 | 0.910 | 3 |
| 3 | 30 | 30 | 0 | 0.941 | 2 |
| 4 | 3 | 4 | 1 | 0.667 | 4 or 5 (tie) |
| 5 | 2 | 2 | 0 | 0.667 | 4 or 5 (tie) |
| 6 | 0 | 1 | 1 | 0.5 | 6 |
| 7 | -2 | 1 | 3 | 0.375 | 7 |

If we use the vote difference (the SE score) as a tie breaker, then only the second and third answers are exchanged. That’s despite the second answer scoring significantly higher than the third under SE scoring.
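Here’s a sketch of that rule applied to the table above (tuples are (upvotes, downvotes); ties in the Wilson score are broken by the SE vote difference):

```python
answers = [(86, 3), (59, 4), (30, 0), (4, 1), (2, 0), (1, 1), (1, 3)]

ranked = sorted(answers,
                key=lambda a: ((a[0] + 2) / (a[0] + a[1] + 4),  # Wilson center
                               a[0] - a[1]),                    # SE tie breaker
                reverse=True)
for up, down in ranked:
    print(f"{up:>3} up / {down} down -> {(up + 2) / (up + down + 4):.3f}")
```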

Another example:

https://math.stackexchange.com/questions/2505777/abusing-mathematical-notation-are-these-examples-of-abuse

| SE order | SE score | upvotes | downvotes | Wilson score (z=2, center) | Wilson order |
|---|---|---|---|---|---|
| 1 | 68 | 69 | 1 | 0.959 | 1 |
| 2 | 37 | 37 | 0 | 0.951 | 2 |
| 3 | 19 | 19 | 0 | 0.913 | 3 |
| 4 | 14 | 16 | 2 | 0.818 | 6 or 7 |
| 5 | 12 | 15 | 3 | 0.773 | 10 |
| 6 | 9 | 9 | 0 | 0.846 | 4 |
| 7 or 8 | 8 | 9 | 1 | 0.786 | 8 |
| 7 or 8 | 8 | 8 | 0 | 0.833 | 5 |
| 9 | 7 | 7 | 0 | 0.818 | 6 or 7 |
| 10 | 5 | 5 | 0 | 0.778 | 9 |
| 11 | 4 | 5 | 1 | 0.7 | 13 |
| 12 or 13 | 3 | 3 | 0 | 0.714 | 11 or 12 |
| 12 or 13 | 3 | 3 | 0 | 0.714 | 11 or 12 |
| 14 | 2 | 2 | 0 | 0.667 | 14 |
| 15 or 16 | 1 | 1 | 0 | 0.6 | 15 or 16 |
| 15 or 16 | 1 | 1 | 0 | 0.6 | 15 or 16 |

This example demonstrates quite nicely how the Wilson score favours uncontroversial answers over controversial ones. For example, the highly voted but controversial answer with SE score 12, in place 5 on SE, falls back to place 10 under Wilson scoring, beaten even by an uncontroversial answer with a score of 5.

Note also that for posts with many votes, an additional vote counts very little in the Wilson score. In the SE difference model, upvoting an answer with no votes yet has the same effect as upvoting an answer with 50 upvotes and 50 downvotes (the score goes from 0 to 1). In the Wilson score, the first upvote raises the score from 2/4=0.5 to 3/5=0.6, while in the 50/50 scenario, it only raises the score from 52/104=0.5 to 53/105≈0.505.
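Or, in code (a sketch using the agreed formula):

```python
def score(up: int, down: int) -> float:
    return (up + 2) / (up + down + 4)

# First upvote on a fresh post: a sizeable jump.
print(score(0, 0), "->", score(1, 0))      # 0.5 -> 0.6
# The same upvote on a 50/50 post barely moves the score.
print(score(50, 50), "->", score(51, 50))  # 0.5 -> ~0.5048 (53/105)
```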

8 Likes

Certainly seems like it has flaws, but is there a better possible method? Controversial answers being dropped lower even with only 1 downvote is quite scary.

2 Likes

Note however that the drops compared to the SE position were normally at most one position. The one exception is where two posts had the exact same vote counts both for upvotes and downvotes, so naturally when the post above them dropped, it dropped below both.

I think we should just use the simple formula that @celtschk provided:

The Wilson normalization favors posts that have no downvotes at all, and that just wouldn’t quite work: there are always a few fussy users, and under that system they would have too much influence over which posts are ranked more highly.

Also, the simple formula can be explained in words, and it’s versatile: it gives a normal sorting order, favors uncontroversial posts in the event of a tie, and so on.

1 Like

I have to admit I’m skeptical that this is anywhere near so large a problem. On Meta SE, which is rightly known for being fairly free with downvotes, I have 22 answers with a score of at least 10. 15 of these answers have no downvotes. Of the 7 that have downvotes, 1 has 11 downvotes, and the other 6 have 1 each. Half of these 1-downvote answers are singletons: they have no competitors and are not sorted. In one case, a single lower-score answer at 15/0 is sorted above mine at 23/1. Given that it’s a discussion on a hot-button issue, I think that’s a fairly reasonable outcome; it’s not as though either post is blatantly better or worse. In the other cases, other answers are also downvoted (or are much lower-voted, and therefore still sort low) and single downvotes make no particular difference to sorting.

Finally, the simple suggestion quoted, in the one case above where it makes a discernible difference, actually increases the discrepancy in scores and therefore makes it more sensitive to stray downvotes. I’m not enough of a statistician to be extremely confident of the reason, but I suspect it’s because it uses a looser threshold of statistical significance and is therefore willing to put more weight on what might potentially be noise. That is, mathematically, the simpler formula falls deeper into the trap you want to avoid, relative to the slightly more complex one ((upvotes+2)/(upvotes+downvotes+4)).
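To make that concrete with the 23/1 vs. 15/0 pair from my Meta SE answers above (a sketch; both formulas are as stated in this thread):

```python
def looser(up, down):   # (up + 1) / (n + 2), i.e. roughly z = sqrt(2)
    return (up + 1) / (up + down + 2)

def tighter(up, down):  # (up + 2) / (n + 4), i.e. z = 2
    return (up + 2) / (up + down + 4)

for up, down in [(23, 1), (15, 0)]:
    print(f"{up}/{down}: looser {looser(up, down):.3f}, tighter {tighter(up, down):.3f}")

# 23/1: looser 0.923, tighter 0.893
# 15/0: looser 0.941, tighter 0.895
# The looser constants widen the gap between the two posts;
# z = 2 puts them in a near tie.
```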

If you want, I can try to whip up a SEDE query so you can try various samples to see how large a problem this might be in practice. I don’t expect it to give us any trouble at all most of the time, and when there is a surprise, I expect it to be as relatively unsurprising as the case above.

The voting on SE and SO is all based on the number of votes, and the up/down rate carries little value.

So there is not much use in basing the order on the Wilson score. Most posts acquire unanimous votes, and the mean score of the votes (which is often extremely close to -1 or +1) is going to be largely meaningless.

In addition, most posts acquire only a handful of votes, and there is too much (atypical) randomness, which the Wilson score is not going to make up for.

This Wilson score assumes a large number of votes, both upvotes and downvotes (it uses a z-score, a continuous normal approximation of a discrete binomial distribution). So it might make the ordering only worse: a post with +4/-1 will tie with one at +2/-0; but what if that -1 is a mis-click, or just an unhappy OP or bystander?

1 Like

I’m not sure I understand your point. If the voting is unanimous, what scoring method could be better than Wilson-lower-bound or Wilson-center? That case is just irrelevant, since Wilson can’t be worse there than the SE style of up - down.

> In addition, most posts acquire only a handful of votes, and there is too much randomness, which the Wilson score is not going to make up for.

Statistically, Wilson is specifically designed to distinguish noise from signal by accepting a certain specified chance of being randomly wrong. I believe z=2 here is about a 5% chance.

The SE setup of up - down is not designed for any particular statistical standard and I don’t know its expected error rate, but it’s guaranteed to be higher, probably much higher. (It’s not actually possible, as far as I know, to get a lower error rate than Wilson simply by using a different algorithm.)

> This Wilson score assumes a large number of votes, both upvotes and downvotes (it uses a z-score, a continuous normal approximation of a discrete binomial distribution). So it might make the ordering only worse: a post with +4/-1 will tie with one at +2/-0; but what if that -1 is a mis-click, or just an unhappy OP or bystander?

Also, what if it isn’t? What are the actual rates of occurrence? (The Wilson score is not more reliant on large numbers of votes than any other scoring system, by the way. SE-style scoring is highly unreliable at low vote counts too.)

If the distribution is biased (which it might be), then we’d need to figure out how to compensate for that, or just accept that once in a while even the best algorithm is going to do slightly worse than a substantially inferior algorithm. But I don’t really see why a disgruntled OP should be factored in; if, as seems likely, we gate downvoting behind a privilege of some kind, most OPs won’t be able to downvote, and in any case, how do you distinguish “this correct answer to my question made me irrationally angry and I will therefore downvote for no good reason” from “this answer to my question did not help my actual case”?

Obviously, some topics (politics, parenting, tabs vs spaces) are likely to get votes more on opinions and tribal affiliation than reason and evidence. But whatever scoring method we choose can’t fix that. That has to be handled some other way.

I do not think that any computation would be better. But then, I did not say that there could be a better computation (in the sense of more precise; certainly there can be simpler computations). I said that the Wilson score is useless for the cases that we see at Stack Exchange (where mean vote scores are mostly very close to +1 or -1).

A better scoring system would change the way we gather data. Personally I would like a rating system (let people give a rating, instead of relying on the rate at which people vote one way or the other). But a simpler alternative could be to also allow blank votes (and apply Wilson’s method to that).

Basically, the problem is not the computation; the problem is data gathering. There are very few votes being cast, and the votes are cast at separate times.

I believe that this is a misconception. Or at least, you have to be precise here about what “being wrong” means. The Wilson score gives the boundaries of an interval estimate for the true rate of positive/negative scores. With z=2, the interval will be wrong at most 5% of the time.

But this probability/frequency of being wrong loses that meaning when one uses the z=2 center as a point estimate.

What I was referring to is that the Wilson score is only an approximation of a confidence interval (in a similar way to how a chi-squared test is not exact, unlike Fisher’s exact test). But anyway, that was a bit of a pedantic remark, and not so important.

What is important, however, is that for small vote counts these values are inaccurate rules of thumb. So my point is that we should not believe that we have any better statistical standard than the absent statistical standard of the SE setup.

2 Likes

Absolutely. I’d certainly support a better system to gather higher-quality votes, and I’d expect the best value to come from that, but I don’t have anything more than vague ideas at the moment myself.

Fair enough, although I’m not entirely sure how much worse that makes things. Is there a better point-estimate algorithm, or should we tighten up the constants for lower error rates, or what?

If you mean that we need to keep in mind that, with small populations, there’s inevitably less accuracy, I’d certainly agree. I don’t agree that that makes choice of algorithm completely irrelevant even at fairly low vote counts, although admittedly if we never had more than (say) n=5 there would be many more important things to consider.

Worth considering: should we attempt to display in the UI some sort of broad classification of score confidence, such as bronze/silver/gold symbols of some kind when the score meets certain thresholds of confidence?
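For instance, something like this rough sketch (the thresholds are placeholders I made up, not a proposal):

```python
from math import sqrt

def wilson_half_width(up: int, down: int, z: float = 2.0) -> float:
    """Half-width of the Wilson interval: a measure of score uncertainty."""
    n = up + down
    if n == 0:
        return 0.5
    p = up / n
    return (z / (1 + z * z / n)) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))

def confidence_badge(up: int, down: int) -> str:
    width = wilson_half_width(up, down)
    if width < 0.05:
        return "gold"
    if width < 0.10:
        return "silver"
    if width < 0.20:
        return "bronze"
    return "unrated"
```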

2 Likes

In theory, the idea of using an algorithm like (upvotes+1)/(upvotes+downvotes+2) is good specifically for sorting, for the reasons you mentioned. But it comes with complications. If we display the number next to each post, we’ll need a very visible link right there in the UI to tell people what that number means, since it’s not a trivially simple operation like (upvotes)-(downvotes). We’ll also need a nice title for it (e.g. Reddit assigns each post, if I’m not mistaken, a ‘score’ and a ‘percentage positive’, both of which are self-explanatory and don’t sound like technical terminology). But if we don’t display the number and only apply the algorithm in the back-end processing, people will wonder how the posts are sorted, and it may appear somewhat random.

Also, we need to make sure there is some number we’re showing users, and ideally it should be a nice round number. I’m in favor of allowing access to the numbers of upvotes and downvotes irrespective of reputation (although I think SE says it slows their databases); they can be hidden, such that you need to click some button to fetch the raw numbers.

1 Like

Really quite simple. We don’t need to show the algorithm-generated number. Show up and down votes (why SE hides those until you have a lot of rep is beyond me, but I digress). The only place this matters is when sorting answers. So where you can select the sort order (in SE this is “active”, “oldest”, “votes”):

  • instead of “votes” make it “weighted score”
  • mouseover of “weighted score” can give a very short description
  • Put a little “?”/“Help” thing next to “weighted score” and provide the full algorithm & explanation on the help page.
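In back-end terms that’s just a registry of sort keys, with “weighted score” in place of SE’s “votes” (a sketch; the field names are made up):

```python
# Maps each UI label to (key function, descending?).
SORT_ORDERS = {
    "active":         (lambda p: p["last_activity"], True),
    "oldest":         (lambda p: p["created_at"],    False),
    "weighted score": (lambda p: (p["up"] + 2) / (p["up"] + p["down"] + 4), True),
}

def sorted_answers(posts, order="weighted score"):
    key, descending = SORT_ORDERS[order]
    return sorted(posts, key=key, reverse=descending)
```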
4 Likes

It means that we should not trust the mathematical rigour of the underlying algorithm. It doesn’t give the subjective choices a good objective foundation, and we should be aware of that (the choices remain subjective).

Sure, you can make a post with 16+/2- rank equal to one with 7+/0- (and thus weigh downvotes much more heavily). But that has little to do with 0.182 being a reasonable boundary/center for a 95% confidence interval estimate of the downvote frequency for both these posts (or 0.818 for upvotes, as celtschk computed).

It is only a good estimate for independent votes with a constant frequency/probability of a downvote per vote. But most importantly, it answers a different question and is being somewhat abused as a ranking measure.


My answer to “How should we compute post scores?” is “we shouldn’t”.

But then I would have to say that I am not a fan of the scoring and gamification mechanism anyway. If I were forced to come up with a solution, I would make many more and larger changes.

  • I would make the voting much less visible (such that it doesn’t influence other votes, which is clearly happening), and one could only see the vote that one has given oneself.
  • Ideally, I would not show numbers but something like classes: gold/silver/bronze, as in wine tastings, where multiple answers/questions can win the prize (and unclassified when there are not enough votes, like IMDb does when there are insufficient ratings for a movie).
  • And I would base it on some more advanced algorithm (incorporating who votes), or else I would make it a promotion/degradation system where people vote for promotion or degradation in a similar way to how they vote for closing/reopening.
  • Also, we would need to get rid of the term ‘vote’. Pressing the +1 or -1 has little to do with ‘voting’.
2 Likes

SE doesn’t show everyone separate vote counts by default because apparently the query is expensive (https://meta.stackexchange.com/a/1007 and https://meta.stackexchange.com/a/69854). I’m worried that the system we’re looking at is going to be way more expensive and slow once we have a large repository of posts.

The schema will likely include total up, down and net/Wilson/etc in the posts table so that the votes table won’t be queried for a standard display page.
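In other words, something along these lines (a sketch; the column and function names are illustrative, not the actual Codidact schema):

```python
# On each vote, update cached columns on the post row, so that display
# pages never need to aggregate over the votes table.
def apply_vote(post: dict, is_upvote: bool) -> None:
    if is_upvote:
        post["up_count"] += 1
    else:
        post["down_count"] += 1
    post["net_score"] = post["up_count"] - post["down_count"]
    post["wilson_score"] = (post["up_count"] + 2) / (
        post["up_count"] + post["down_count"] + 4)
```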

1 Like