How should we compute post scores?

cellio · 29 November 2019 00:39

We’ve had a lot of discussions about reputation, voting, scores, displaying scores, collecting reactions in addition to votes… are we ready to start converging on what we want to do in our MVP?

These are things I understand from prior discussions (but haven’t hunted down links for):

We want to be able to sort answers by some sort of score (best stuff rises to the top).
We have some concerns about raw “upvotes minus downvotes” as that score, including how it reflects (or doesn’t reflect) controversy.
We have some concerns about whose votes count (or count how much).
Should votes be normalized in any way, or should we do nothing that we can’t explain in simple English? (Warning: I do not understand the math there. Link came from this answer on Community Building.)
We don’t want to give users a single reputation number (so scores don’t contribute to user status).
We are talking about some privileges that are gated by scores on questions and answers, so individual scores and not just ranking do matter at least a little.
We don’t know how scores should be presented to readers and whether that should vary based on who the reader is (e.g. anonymous visitor from Google vs. logged-in user vs. post owner).

Thoughts? How do we move forward on this question?

tuggyne · 29 November 2019 02:39

I would strongly recommend sorting by a Wilson-normalized total score, but I don’t think that specifically needs to be shown as such, especially as it is highly unlikely to be an integer. (Although if we show a score but don’t show the up/down breakdown, things might be confusing, which is also an argument for being as transparent as practical to everyone.)

If we have multiple weights for votes, I’d recommend keeping it as simple as possible: two vote weights, the larger one acquired at a fairly high (and ideally topic-specific) trust level and usable only with fairly restrictive daily limits. Much more complexity than that will make it almost impossible to reason about post quality and votes. I don’t know that that’s at all MVP, though it would also be difficult to manage if it’s not balanced into the system from fairly early on.

celtschk · 29 November 2019 07:12

I have no idea what a Wilson-normalized total score is, and I couldn’t find it on a web search. Can you please elaborate?

FWIW, I have a quite simple suggestion: Sort the answers on (upvotes+1)/(upvotes+downvotes+2).

This has the following desirable properties:

For answers which get only upvotes or only downvotes, the sort order is exactly the same as with the difference of upvotes and downvotes (the SE model).
For the same vote difference, uncontroversial answers get scored stronger than controversial answers. That is, answers voted unanimously positive go above answers with both positive and negative votes, and answers voted unanimously negative go below answers with both negative and positive votes.

For example, considering the vote difference of 1 (one more upvote than downvote), a post with only one upvote gets a score of 2/3≈0.67, while a post with 2 upvotes and a downvote only gets a score of 3/5=0.6. On the other hand, with a vote difference of −1, a post with only one downvote gets a score of 1/3≈0.33, while a post with two downvotes and an upvote gets a score of 2/5=0.4.
If there are many upvotes and many downvotes, the scoring is approximately the fraction of upvotes over total votes.
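The properties listed above can be checked with a few lines of Python. This is only a sketch; the function name `simple_score` is made up for illustration, and `Fraction` is used so the results stay exact.

```python
from fractions import Fraction


def simple_score(upvotes: int, downvotes: int) -> Fraction:
    """The proposed sort key: (upvotes + 1) / (upvotes + downvotes + 2)."""
    return Fraction(upvotes + 1, upvotes + downvotes + 2)


# Vote difference +1: the unanimous post beats the contested one.
print(simple_score(1, 0), "vs", simple_score(2, 1))
# Vote difference -1: the unanimous post sinks below the contested one.
print(simple_score(0, 1), "vs", simple_score(1, 2))
# Many votes: the score approaches the plain fraction of upvotes (here 0.7).
print(float(simple_score(700, 300)))
```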

tuggyne · 29 November 2019 07:32

It’s the statistical method mentioned in @cellio’s link:

celtschk · 29 November 2019 07:47

Sorry, didn’t notice that. But then, that’s a mighty complicated formula. And most of the complication comes from insisting on the lower bound of that interval, which was not justified in any way (other than “it works better than the two simple methods above”). Why not use the upper bound instead? Or the center?

When using the center instead of the lower bound, the function simplifies dramatically. More exactly, it reduces to (p+z²/(2n))/(1+z²/n), or after multiplying by n, (pn+z²/2)/(n+z²). Given that pn=upvotes, and n=upvotes+downvotes, that further simplifies to (upvotes+z²/2)/(upvotes+downvotes+z²). Which actually is basically my formula with an adjustable parameter added; my formula is recovered by setting z=sqrt(2).
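As a quick numerical check of that simplification, the sketch below computes the center of the Wilson interval from p, n and z, and compares it with the simplified vote-count form. The function names are illustrative, and the n = 0 case is handled by returning 0.5, which is an assumption of this sketch (it matches what the simplified form gives).

```python
import math


def wilson_center(upvotes: float, downvotes: float, z: float) -> float:
    """Center of the Wilson interval, written in terms of p and n."""
    n = upvotes + downvotes
    if n == 0:
        return 0.5  # assumption: no votes yet means "no information"
    p = upvotes / n
    return (p + z * z / (2 * n)) / (1 + z * z / n)


def simplified_center(upvotes: float, downvotes: float, z: float) -> float:
    """The same value after multiplying numerator and denominator by n."""
    return (upvotes + z * z / 2) / (upvotes + downvotes + z * z)


for up, down in [(1, 0), (2, 1), (30, 0), (59, 4)]:
    for z in (math.sqrt(2), 2.0):
        a = wilson_center(up, down, z)
        b = simplified_center(up, down, z)
        print(up, down, round(z, 3), round(a, 4), math.isclose(a, b))
```

With z = sqrt(2) the simplified form is (upvotes+1)/(upvotes+downvotes+2), up to floating-point rounding.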

HeapUnderflow · 30 November 2019 06:08

You use the lower bound if ranking an item with few votes higher than it really deserves is worse than ranking it lower than it deserves, the upper bound if it’s the opposite, and the middle if it’s about the same.

A poor-but-new answer coming to the top isn’t the disaster that it would be on a shopping site, and if they are lower in the list they may never get voted. The center of the interval seems like the right place to me.
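To make the trade-off between lower bound, center and upper bound concrete, here is a sketch that computes all three from the standard Wilson score interval formula. The function name, the default z = 2, and the handling of zero votes (returning the limiting values 0, 0.5 and 1) are assumptions of this sketch; the vote counts are invented.

```python
import math


def wilson_interval(upvotes: int, downvotes: int, z: float = 2.0):
    """Return (lower, center, upper) of the Wilson score interval."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0, 0.5, 1.0  # assumption: limiting values for no votes
    p = upvotes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half_width, center, center + half_width


new_post = wilson_interval(1, 0)    # a fresh answer with a single upvote
old_post = wilson_interval(30, 5)   # an established answer

print("new:", [round(x, 3) for x in new_post])
print("old:", [round(x, 3) for x in old_post])
```

Sorting by the lower bound puts the fresh answer far below the established one, sorting by the upper bound puts it above, and the center sits in between.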

A z of sqrt(2) works out to being “wrong” about 16% of the time, with it underestimating how extreme the true rating is, which counterintuitively ends up making the center of the interval more extreme than it should be. If you use a z of 2, as recommended by Wikipedia, it’s wrong a little less than 5%, and the constants, 2 and 4, are almost as nice as the 1 and 2 from your proposal.

My gut says that a z of sqrt(2) will feel a little more fun but more quirky than a z of 2. A z of 2 will keep items with few votes in the safe but boring center for a bit longer.
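The error rates quoted for the two z values can be reproduced from the normal distribution, assuming "wrong" means the true rate falls outside a two-sided interval of plus or minus z standard deviations. The function name below is illustrative.

```python
import math


def two_sided_miss_rate(z: float) -> float:
    """Probability that a standard normal value falls outside [-z, +z]."""
    return math.erfc(z / math.sqrt(2))


for label, z in (("sqrt(2)", math.sqrt(2)), ("2", 2.0)):
    print(f"z = {label}: {two_sided_miss_rate(z):.1%}")
```

This gives roughly 15.7% for z = sqrt(2) and 4.6% for z = 2, in line with the "about 16%" and "a little less than 5%" figures.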

celtschk · 30 November 2019 06:45

I’m absolutely happy with the z=2 value.

If no one else disagrees, then I’d say we have a consensus here:

Rank the answers on (upvotes+2)/(upvotes+downvotes+4).

It’s an easy formula (therefore transparent even to people with no mathematical knowledge), and at the same time has a scientific basis (the Wilson score interval).

The question that remains is what to display on the answers. My suggestion would be to simply show the number of upvotes and downvotes separately.

Note: If we decide on giving different weights on votes from differently experienced people, then upvotes/downvotes here should refer to the sum of the weights of the corresponding votes.
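A sketch of how weighted votes could feed the same formula, assuming each vote is stored as a direction plus a weight. The data layout, the function name and the weights (1 for a normal vote, 2 for a heavier one) are all hypothetical; nothing in this thread has fixed them.

```python
def weighted_center_score(votes, prior_up=2.0, prior_total=4.0):
    """votes: iterable of (is_upvote, weight) pairs.

    Upvotes and downvotes are replaced by the sums of their weights.
    """
    up = sum(weight for is_upvote, weight in votes if is_upvote)
    down = sum(weight for is_upvote, weight in votes if not is_upvote)
    return (up + prior_up) / (up + down + prior_total)


# Two normal upvotes, one heavier upvote (weight 2), one normal downvote.
votes = [(True, 1.0), (True, 1.0), (True, 2.0), (False, 1.0)]
print(weighted_center_score(votes))
```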

user1306322 · 30 November 2019 17:17

Can someone provide a visual (text or a mockup) example of posts sorted using that Wilson method? I’m having a rough time figuring this one out :s

cellio · 30 November 2019 23:15

I suggest showing the numbers of upvotes and downvotes and using this computed score for ordering. In other words, don’t expose the score through the UI but expose the votes instead. If we show score, some people are going to be confused when voting doesn’t change it by 1.

Exposing the votes also shows how much voting there’s been. A raw score doesn’t tell you whether it came from 3 voters or 30. The raw votes also expose controversial votes, which is useful for the reader trying to decide whether to use the recommendations in an answer.

manassehkatz · 30 November 2019 23:29

I would put the “2” and “4” in the database. That way another instance of Codidact or a community within an instance can choose different constants without changing the code.
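One way this could look in code, as a sketch only: the two constants live in a small settings record that would be loaded per instance or per community. The class name, field names and the idea of a per-community record are assumptions here, not an agreed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringSettings:
    """Hypothetical per-community settings row; defaults correspond to z = 2."""
    prior_upvotes: float = 2.0   # z^2 / 2
    prior_votes: float = 4.0     # z^2


def post_score(up: int, down: int, settings: ScoringSettings = ScoringSettings()) -> float:
    return (up + settings.prior_upvotes) / (up + down + settings.prior_votes)


default = ScoringSettings()
quirky = ScoringSettings(prior_upvotes=1.0, prior_votes=2.0)  # z = sqrt(2)

print(post_score(1, 0, default), post_score(1, 0, quirky))
```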

HeapUnderflow · 1 December 2019 00:19

I tried, but my little one hit “save” before it was ready. I then edited the post for a few hours, creating 39 examples total, with a comparison between how the two methods rank those items. When I hit save, Discourse popped up a generic error message and lost all that work. I doubt there is a way to get it back.

I’ll summarize what you would have seen. The Wilson score is a prediction of how likely a vote is to be an upvote, based on the votes so far, so it comes out as a number between 0 and 1, exclusive. The ratio of upvotes to downvotes is the most important factor, but it will give a little higher score for 20:10 than 10:5 because more data means more confidence.

The Stack Exchange ranking tends to put items with a lot of votes at the extremes, either at the top or the bottom of the list. This is most notable when there are about twice as many of one kind of vote as of the other. SE treats 100:50 as ten times better than 10:5, while Wilson scoring treats them almost identically (66.23% to 63.16%).
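The comparison in this summary can be reproduced with a short sketch, assuming "Wilson score" here means the z = 2 center form (upvotes+2)/(upvotes+downvotes+4) discussed earlier in the thread. Function names are illustrative.

```python
def se_score(up: int, down: int) -> int:
    """Stack Exchange style score: upvotes minus downvotes."""
    return up - down


def center_score(up: int, down: int) -> float:
    """Wilson center with z = 2."""
    return (up + 2) / (up + down + 4)


for up, down in ((10, 5), (20, 10), (100, 50)):
    print(f"{up}:{down}  SE = {se_score(up, down):>3}  "
          f"Wilson center = {center_score(up, down):.2%}")
```

All three rows have the same 2:1 ratio. The SE score grows in proportion to the vote count, while the Wilson center only creeps upward as the evidence accumulates.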

celtschk · 1 December 2019 06:13

Consider the following actual question on Mathematics.SE:

https://math.stackexchange.com/questions/382736/are-all-prime-numbers-finite

As of this writing, the non-deleted answers have the following scores, upvotes and downvotes:

SE order	SE score	upvotes	downvotes	Wilson score (z=2, center)	Wilson order
1	83	86	3	0.946	1
2	55	59	4	0.910	3
3	30	30	0	0.941	2
4	3	4	1	0.667	4 or 5 (tie)
5	2	2	0	0.667	4 or 5 (tie)
6	0	1	1	0.5	6
7	-2	1	3	0.375	7

If we use the vote difference (the SE score) as tie breaker, then only the second and third answers are exchanged. That’s even though, in the SE scoring, the second answer scores significantly higher than the third.
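The ordering for this first example can be reproduced with the sketch below, using the vote counts from the table and the vote difference as tie breaker. `Fraction` keeps the two 2/3 scores exactly equal so the tie breaker really decides between them; the function names are illustrative.

```python
from fractions import Fraction

# (upvotes, downvotes) in SE order, copied from the table above.
answers = [(86, 3), (59, 4), (30, 0), (4, 1), (2, 0), (1, 1), (1, 3)]


def center_score(up: int, down: int) -> Fraction:
    """Wilson center with z = 2, kept exact."""
    return Fraction(up + 2, up + down + 4)


def rank_key(votes):
    up, down = votes
    # Primary key: Wilson center. Tie breaker: SE-style vote difference.
    return (center_score(up, down), up - down)


ranked = sorted(answers, key=rank_key, reverse=True)
for position, (up, down) in enumerate(ranked, start=1):
    print(position, up, down, f"{float(center_score(up, down)):.3f}")
```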

Another example:

https://math.stackexchange.com/questions/2505777/abusing-mathematical-notation-are-these-examples-of-abuse

SE order	SE score	upvotes	downvotes	Wilson score (z=2, center)	Wilson order
1	68	69	1	0.959	1
2	37	37	0	0.951	2
3	19	19	0	0.913	3
4	14	16	2	0.818	6 or 7
5	12	15	3	0.773	10
6	9	9	0	0.846	4
7 or 8	8	9	1	0.786	8
7 or 8	8	8	0	0.833	5
9	7	7	0	0.818	6 or 7
10	5	5	0	0.778	9
11	4	5	1	0.7	13
12 or 13	3	3	0	0.714	11 or 12
12 or 13	3	3	0	0.714	11 or 12
14	2	2	0	0.667	14
15 or 16	1	1	0	0.6	15 or 16
15 or 16	1	1	0	0.6	15 or 16

This example demonstrates quite nicely how the Wilson score favours uncontroversial answers over controversial ones. For example, the highly voted but controversial answer with SE score 12, on place 5 on SE, falls back to place 10 in Wilson score, beaten even by an uncontroversial score 5 answer.

Note also that for posts with many votes, an additional vote counts very little in the Wilson score. In the SE difference model, upvoting an answer with no votes yet has the same effect as upvoting an answer with 50 upvotes and 50 downvotes (the score goes from 0 to 1). In the Wilson score, the first upvote raises the score from 2/4=0.5 to 3/5=0.6, while in the 50/50 scenario, it only raises the score from 52/104=0.5 to 53/105≈0.505.
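The diminishing effect of one more upvote is easy to tabulate. A small sketch, again using the z = 2 center form; `upvote_gain` is an illustrative name.

```python
def center_score(up: int, down: int) -> float:
    """Wilson center with z = 2."""
    return (up + 2) / (up + down + 4)


def upvote_gain(up: int, down: int) -> float:
    """How much one additional upvote raises the score."""
    return center_score(up + 1, down) - center_score(up, down)


for up, down in ((0, 0), (5, 5), (50, 50)):
    print(f"{up}/{down}: +{upvote_gain(up, down):.4f}")
```

In the SE difference model the gain is always exactly 1, whatever the existing vote counts are.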

Corsaka · 1 December 2019 11:41

Certainly seems like it has flaws, but is there a better possible method? Controversial answers being dropped lower even with only 1 downvote is quite scary.

celtschk · 1 December 2019 14:16

Note however that the drops compared to the SE position were normally at most one position. The one exception is where two posts had the exact same vote counts both for upvotes and downvotes, so naturally when the post above them dropped, it dropped below both.

ozewski · 1 December 2019 14:36

I think we should just use the simple formula that @celtschk provided:

The Wilson normalization is favoring posts that have no downvotes at all and that just wouldn’t quite work, since there’s always a few fussy users, and under that system, they would have too much influence over which posts are ranked more highly.

Also, the simple formula can be explained in words, and it’s versatile - a normal system of sorting, favoring uncontroversial posts in the event of a tie, etc.

tuggyne · 3 December 2019 04:48

I have to admit I’m skeptical that this is anywhere near so large a problem. On Meta SE, which is rightly known for being fairly free with downvotes, I have 22 answers with a score of at least 10. 15 of these answers have no downvotes. Of the 7 that have downvotes, 1 has 11 downvotes, and the other 6 have 1 each. Half of these 1-downvote answers are singletons: they have no competitors and are not sorted. In one case, a single lower-score answer at 15/0 is sorted above mine at 23/1. Given that it’s a discussion on a hot-button issue, I think that’s a fairly reasonable outcome; it’s not as though either post is blatantly better or worse. In the other cases, other answers are also downvoted (or are much lower-voted, and therefore still sort low) and single downvotes make no particular difference to sorting.

Finally, the simple suggestion quoted, in the one case above where it makes a discernible difference, actually increases the discrepancy in scores and therefore makes it more sensitive to stray downvotes. I’m not enough of a statistician to be extremely confident of the reason, but I do suspect it’s because it’s using a looser threshold of statistical significance, and is therefore willing to put more weight on what might potentially be noise. That is, mathematically the simpler formula is falling deeper into the trap you want to avoid, relative to the slightly more complex one ((upvotes+2)/(upvotes+downvotes+4)).
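For the 15/0 versus 23/1 case mentioned earlier, the two formulas can be compared directly. This is a sketch with illustrative function names; the vote counts are the ones quoted in this post.

```python
def simple_score(up: int, down: int) -> float:
    """(upvotes + 1) / (upvotes + downvotes + 2), i.e. z = sqrt(2)."""
    return (up + 1) / (up + down + 2)


def z2_score(up: int, down: int) -> float:
    """(upvotes + 2) / (upvotes + downvotes + 4), i.e. z = 2."""
    return (up + 2) / (up + down + 4)


for name, score in (("simple", simple_score), ("z = 2", z2_score)):
    unanimous, contested = score(15, 0), score(23, 1)
    print(f"{name}: 15/0 = {unanimous:.4f}, 23/1 = {contested:.4f}, "
          f"gap = {unanimous - contested:.4f}")
```

Both formulas sort 15/0 above 23/1, but the gap under the simple formula is roughly ten times larger, which is the increased sensitivity to a stray downvote described here.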

If you want, I can try to whip up a SEDE query so you can try various samples to see how large a problem this might be in practice. I don’t expect it to give us any trouble at all most of the time, and when there is a surprise, I expect it to be as relatively unsurprising as the case above.

MartijnWeterings · 3 December 2019 23:15

The voting on SE and SO is all based on number of votes and the rate +/- has little value.

So, there is not much use in basing order on the Wilson score. Most posts acquire unanimous votes and the mean score of the votes (which is often extremely close to -1 or +1) is gonna be very meaningless.

In addition, most posts acquire only a handful of votes and there is too much (atypical) randomness, which the Wilson score is not gonna make up for.

This Wilson score is assuming a large number of votes, in both up votes and down votes. (it uses a z-score which is a continuous normal distribution approximation of a discrete binomial distribution). So it might make the ordering only worse: a post with +4/-1 will tie with +2; but what if that -1 is a mis-click or just an unhappy OP or bystander?

tuggyne · 4 December 2019 00:13

I’m not sure I understand your point. If the voting is unanimous, what scoring method could be better than Wilson-lower-bound or Wilson-center? This is just irrelevant, since Wilson can’t be worse here than the SE style of up - down.

In addition, most posts acquire only a handful of votes and there is too much randomness, which the Wilson score is not gonna make up for.

Statistically, Wilson is specifically designed to distinguish noise from signal by accepting a certain specified chance of being randomly wrong. I believe z=2 here is about a 5% chance.

The SE setup of up - down is not designed for any particular statistical standard and I don’t know its expected error rate, but it’s guaranteed to be higher, probably much higher. (It’s not actually possible, as far as I know, to get a lower error rate than Wilson simply by using a different algorithm.)

This Wilson score is assuming a large number of votes, in both up votes and down votes. (it uses a z-score which is a continuous normal distribution approximation of a discrete binomial distribution). So it might make the ordering only worse: a post with +4/-1 will tie with +2; but what if that -1 is a mis-click or just an unhappy OP or bystander?

Also, what if it isn’t? What are the actual rates of occurrence? (The Wilson score is not more reliant on large numbers of votes than any other scoring system, by the way. SE-style scoring is highly unreliable at low vote counts too.)

If the distribution is biased (which it might be), then we’d need to figure out how to compensate for that, or just accept that once in a while even the best algorithm is going to do slightly worse than a substantially inferior algorithm. But I don’t really see why a disgruntled OP should be factored in; if, as seems likely, we gate downvoting behind a privilege of some kind, most OPs won’t be able to downvote, and in any case, how do you distinguish “this correct answer to my question made me irrationally angry and I will therefore downvote for no good reason” from “this answer to my question did not help my actual case”?

Obviously, some topics (politics, parenting, tabs vs spaces) are likely to get votes more on opinions and tribal affiliation than reason and evidence. But whatever scoring method we choose can’t fix that. That has to be handled some other way.

MartijnWeterings · 4 December 2019 00:32

I do not think that any computation would be better. But I did not say that there could be a better computation (in the sense of more precise, certainly there can be simpler computations). I said that the Wilson score is useless for the cases that we see at StackExchange (which is mostly average scores very close to 1 or -1).

Better scoring systems would be altering the way to gather data. Personally I would like a rating system (let people give a rate instead of relying on a rate of people). But a simpler alternative could be to also allow blank votes (and apply Wilson’s method to that).

Basically, the problem is not computation. Instead, the problem is data gathering. Very few votes are being cast, and the votes are also cast at separate times.

I believe that this is a misconception. Or at least, you have to be precise here what ‘being wrong’ means. The Wilson score gives boundaries for an interval-estimate for the true rate of positive/negative scores. When z=2 then the interval will be wrong at most 5% of the time.

But this probability/frequency of being wrong loses that meaning in the context that one uses the z=2 as a point-estimate.

What I was referring to is that the Wilson score is only an approximation of a confidence interval (in a similar way as a chi-squared test is not exact like the Fisher’s exact test is). But anyway, that was a bit of a pedantic remark, and not so important.

However, what is important is that for small values these values are inaccurate rules of thumb. So my point is that we should not believe that we are having any better particular statistical standard than the absent statistical standard with the SE setup.

tuggyne · 4 December 2019 01:10

Absolutely. I’d certainly support a better system to gather higher-quality votes, and I’d expect the best value to come from that, but I don’t have anything more than vague ideas at the moment myself.

Fair enough, although I’m not entirely sure how much worse that makes things. Is there a better point-estimate algorithm, or should we tighten up the constants for lower error rates, or what?

If you mean that we need to keep in mind that, with small populations, there’s inevitably less accuracy, I’d certainly agree. I don’t agree that that makes choice of algorithm completely irrelevant even at fairly low vote counts, although admittedly if we never had more than (say) n=5 there would be many more important things to consider.

Worth considering: should we attempt to display in the UI some sort of broad classification of score confidence, such as bronze/silver/gold symbols of some kind when the score meets certain thresholds of confidence?