How should we compute post scores?

In theory, the idea of using an algorithm like (upvotes+1)/(upvotes+downvotes+2) is good specifically for sorting, for the reasons you mentioned. But it comes with complications. If we display the number next to each post, we’ll need to have a very visible link right there in the UI to tell people what that number means, since it’s not a trivially simple operation like (upvotes)-(downvotes). We’ll also need to have a nice title for it (e.g. reddit assigns each post, if I’m not mistaken, a ‘score’ and ‘percentage positive’, both of which are self-explanatory and don’t sound like technical terminology). But if we don’t display the number and only apply the algorithm in the back-end processing, people will wonder how the posts are sorted, and it may appear somewhat random.
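To illustrate why the number would need explaining, here is a rough Python sketch comparing the plain net score with the ratio-style formula above. The function names and the three sample posts are made up for illustration only.

```python
def net_score(upvotes: int, downvotes: int) -> int:
    """Plain difference, the number people intuitively expect."""
    return upvotes - downvotes


def smoothed_ratio(upvotes: int, downvotes: int) -> float:
    """The (upvotes+1)/(upvotes+downvotes+2) formula discussed above."""
    return (upvotes + 1) / (upvotes + downvotes + 2)


# Hypothetical posts: name -> (upvotes, downvotes)
posts = {"A": (30, 15), "B": (12, 0), "C": (1, 0)}

by_net = sorted(posts, key=lambda name: net_score(*posts[name]), reverse=True)
by_ratio = sorted(posts, key=lambda name: smoothed_ratio(*posts[name]), reverse=True)

print("sorted by net score:", by_net)
print("sorted by smoothed ratio:", by_ratio)
```

The two orderings differ, which is exactly the situation where users would ask how the posts are being sorted.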

Also, we need to make sure we have some number which we’re showing users, and ideally it should be a nice round number. I’m in favor of allowing access to the number of upvotes and downvotes irrespective of reputation (although I think SE says it slows their databases); they can be hidden, such that you need to click some button to fetch those raw numbers.

1 Like

Really quite simple. We don’t need to show the algorithm-generated number. Show Up & Down votes (why SE hides that until you have a lot of rep is beyond me, but I digress). The only place this matters is when sorting answers. So where you can select the sort order - in SE this is: “active”, “oldest”, “votes”:

  • instead of “votes” make it “weighted score”
  • mouseover of “weighted score” can give a very short description
  • Put a little “?”/“Help” thing next to “weighted score” and provide the full algorithm & explanation on the help page.
4 Likes

It means that we should not trust the mathematical rigour of the underlying algorithm. It doesn’t give the subjective choices a good objective foundation and we should be aware of that (that the choices remain subjective).

Sure you can make a post with 16+/2- equal in order/ranking to 7+/0- (and thus weigh downvotes much more heavily). But that has little to do with 0.182 being a reasonable boundary/center for a 95% confidence interval estimate of a frequency for downvotes for both these posts (or 0.818 for upvotes as celtschk computed).
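The tie mentioned above can be checked with exact fractions. This is only a sketch, assuming the centered Wilson score with z=2, i.e. (up+2)/(up+down+4) as used elsewhere in this thread; the function name is made up.

```python
from fractions import Fraction


def wilson_center_z2(up: int, down: int) -> Fraction:
    """Centered Wilson score for z=2, kept as an exact fraction."""
    return Fraction(up + 2, up + down + 4)


a = wilson_center_z2(16, 2)  # the 16+/2- post
b = wilson_center_z2(7, 0)   # the 7+/0- post

print(a, b, a == b)
print(round(float(a), 3), round(float(1 - a), 3))
```

Both posts land on the same fraction, so the ranking treats them as equal even though one has two downvotes and the other has none.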

It is only a good estimate for independent votes with a constant frequency/probability of downvotes per vote. But most importantly, it is answering a different question and is sort of abused as a measure for a ranking.


My answer to “How should we compute post scores?” is “we shouldn’t”.

But then I would have to say that I am not a fan of the scoring and gamification mechanism anyway. If I were forced to come up with a solution, then I would make many more and much larger changes.

  • I would make the voting much less visible (such that it doesn’t influence other votes which is clearly happening) and one can only see the vote that one has given themselves.
  • Ideally, I would not show numbers but something like classes. Like gold, silver, bronze as in wine tastings, where multiple answers/questions can win the prize (and non-classified when there are not enough votes; like imdb does when there are insufficient ratings for movies).
  • and I would base it on some more advanced algorithm (incorporating who votes) or else I would make it like some promotion/degradation system where people vote for promotion/degradation in a similar way as they vote for closing/reopen.
  • also, we would need to get rid of the term ‘vote’. Pressing the +1 or -1 has little to do with ‘voting’.
2 Likes

SE doesn’t show everyone separate vote counts by default because apparently the query is expensive (https://meta.stackexchange.com/a/1007 and https://meta.stackexchange.com/a/69854). I’m worried that the system we’re looking at is going to be way more expensive and slow once we have a large repository of posts.

The schema will likely include total up, down and net/Wilson/etc in the posts table so that the votes table won’t be queried for a standard display page.

1 Like

Currently the SE site already has the total score separately in the question table (no query of the large votes table is necessary). As Manassehkatz says they would only need to add total number of votes (or up/down votes).

An interesting detail is that SE/SO is aware of such solution. They are either too lazy to follow up on it (although meta has bigger/deeper issues than SE/SO being lazy to respond to it) or they do not want to because they do not care to improve the user experience, or maybe they have other unofficial/non-stated reasons.

I can imagine that they want to change the current database as little as possible. Still, I suspect the fear that an extra column will break things is a bit exaggerated, is it not? Are there database experts who can explain this? Is it just a lazy system/database administrator who doesn’t want to do the work, or is there a good reason to keep changes to the database layout to a minimum and only make them when really necessary?

As an MVP, the builders of the back-end should focus on a votes table (to store information) and, for queries from typical use of the website, one or more columns in the posts table that contain aggregate scores from the votes table (from which the score/rank/order can be computed; at that point, when aggregate scores are stored, the database builders do not need to care about what algorithm is being used). The front-end builders need to keep in mind that there needs to be some space for a voting and score display, and that there will be some ordering of answers. But they, too, do not really need to know the actual algorithm as long as the voting/liking mechanism (up/down or something more advanced) is fixed.
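A minimal sketch of that split, using Python's built-in sqlite3 module. All table names, column names and the cast_vote helper are hypothetical, and the real schema may look quite different; the point is only that a display page reads the aggregate columns and never touches the votes table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
    id        INTEGER PRIMARY KEY,
    upvotes   INTEGER NOT NULL DEFAULT 0,   -- aggregate, kept in sync
    downvotes INTEGER NOT NULL DEFAULT 0    -- aggregate, kept in sync
);
CREATE TABLE votes (
    post_id INTEGER NOT NULL REFERENCES posts(id),
    user_id INTEGER NOT NULL,
    value   INTEGER NOT NULL CHECK (value IN (-1, 1)),
    PRIMARY KEY (post_id, user_id)          -- one vote per user per post
);
""")


def cast_vote(conn: sqlite3.Connection, post_id: int, user_id: int, value: int) -> None:
    """Record a vote and bump the matching aggregate in one transaction."""
    column = "upvotes" if value > 0 else "downvotes"
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO votes (post_id, user_id, value) VALUES (?, ?, ?)",
            (post_id, user_id, value),
        )
        conn.execute(
            f"UPDATE posts SET {column} = {column} + 1 WHERE id = ?",
            (post_id,),
        )


conn.execute("INSERT INTO posts (id) VALUES (1)")
cast_vote(conn, 1, 100, 1)
cast_vote(conn, 1, 101, 1)
cast_vote(conn, 1, 102, -1)

# A display page only needs this cheap lookup, not a scan of votes.
counts = conn.execute("SELECT upvotes, downvotes FROM posts WHERE id = 1").fetchone()
print(counts)
```

Whatever ranking formula is chosen later can be computed from these two counters, so the schema does not need to know about it.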

2 Likes

The original post about the problem is from June 2009. My guess is that at the time the database did not have the necessary fields but that in November 2009 when they rolled out an easy way to see those votes for 1,000 Rep. users they changed the database. However, as a proprietary system they had no need to actually change the schema. In fact, I would not expect them to broadcast such changes to the schema as there is no real reason to do so, and in fact does not matter. But what does matter, to me, is this concept of “New people can’t see everything reasonable.” There are some limits - new users arguably should not see Deleted posts and other information that can be truly distracting. New users certainly should have limits on actions. But viewing of basic, non-personal, non-confidential, non-controversial (arguably Deleted posts can be a bit controversial) information should be open to all.

I wonder whether they changed the database. In the datadump the necessary fields are not available. There is a field for ‘score’ but not a field for upvotes or downvotes.

I believe that the datadump gives a good picture of their actual database tables and schema. I would find it weird if they had censored those columns from the dump. They actually do censor some other columns (like who voted), but then the dump has an empty column; they do not seem to remove the columns.


Interesting detail about the data-dumps is that they are still published with a reference to the CC BY-SA 3.0 license (the dump that is available on the https://data.stackexchange.com/ site).

2 Likes

Late to the party here but I highly advocate using Steam’s game ranking algorithm for ranking posts instead of Wilson scores. The SteamDB algorithm is quite a bit simpler and more straightforward than Wilson’s, and I’ll bet it’s also less sensitive to an additional downvote.

Can you summarize? Since it’s simple and straightforward, can you explain it to us instead of linking to a long page with math? Thanks.

2 Likes

Of course! In my opinion the article I linked is a pretty good explanation but I’ll restate here anyway, especially since the original article has a lot of tangential discussion about game rankings on Steam.

Premises

  1. At 100% uncertainty (+0/-0, +1/-0, +0/-1), assume that a given post has a “true” rating of 50%.
  2. If a post A has 10x as many votes as post B, then our uncertainty about the “true” rating of A is half that of B. (These 10x and 0.5x multipliers can be adjusted as need be.)

A consequence of the following math is that a post with millions of upvotes but no downvotes has a “true” rating of 1 and likewise, millions of downvotes but no upvotes means a “true” rating of 0. This can be transformed into a more-familiar number if need be, but for plain ranking, it suffices to simply compare “true” ratings and see which is bigger.

Example

If a post has 9 upvotes and 1 downvote, then it has an “approval” rating of 90% - number of upvotes divided by total votes. (If the post had +1 / -9, then it has an approval rating of 10%.) As the total number of votes is 10, which is 10x that of a post with 1 vote, then the uncertainty is half that of a post with 1 vote. As the uncertainty of a post with 1 vote is 100%, the uncertainty of this post is 50%. We are 50% certain that +9/-1 = 90% represents the true rating of this post. We are 50% certain that 90% is the true rating and 50% certain that 50% is the true rating, so we can combine these like so - 0.5 * 0.9 + 0.5 * 0.5 = 0.7 - to conclude that we should consider +9/-1 to have a rating of 70% for the purposes of ranking. Conversely, +1/-9 has an adjusted rating of 30%.

Calculation

The uncertainty level is 1/2 to the power of the log base 10 of the total number of votes:

    uncertainty = 0.5 ** log_10(upvotes + downvotes)

The adjusted rating is then a combination of the default rating and the observed rating, weighted by the uncertainty:

    adjusted_rating = uncertainty * 0.5 + (1 - uncertainty) * (upvotes / total_votes)

(Posts with +0/-0 will just be assigned a rating of 0.5 to make the math easier. Ties can be broken by the sum of upvotes and downvotes.)
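Put together, the calculation above could be sketched in Python roughly like this. The function and parameter names are my own; the defaults of 10 and 0.5 are the adjustable multipliers from the premises, and 0.5 is the assumed default rating.

```python
import math


def steamdb_rating(
    upvotes: int,
    downvotes: int,
    vote_factor: float = 10.0,        # "10x as many votes ..."
    uncertainty_factor: float = 0.5,  # "... halves the uncertainty"
    prior: float = 0.5,               # rating assumed at 100% uncertainty
) -> float:
    """Adjusted rating as described in the Premises and Calculation sections."""
    total = upvotes + downvotes
    if total == 0:
        return prior  # special case: no votes at all
    uncertainty = uncertainty_factor ** math.log(total, vote_factor)
    observed = upvotes / total
    return uncertainty * prior + (1 - uncertainty) * observed


for up, down in [(9, 1), (1, 9), (1, 0), (0, 0)]:
    print(f"+{up}/-{down}: {steamdb_rating(up, down):.3f}")
```

For plain ranking it would be enough to sort by this value in descending order, breaking ties by the total number of votes as noted above.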

Comparison with Wilson Score

Piggybacking off of @celtschk’s prior work here (thanks for doing this!): How should we compute post scores?

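| SE order | SE score | Upvotes | Downvotes | Wilson score (z=2, center) | Wilson order | SteamDB score | SteamDB order |
|---|---|---|---|---|---|---|---|
| 1 | 83 | 86 | 3 | 0.946 | 1 | 0.846 | 1 |
| 2 | 55 | 59 | 4 | 0.910 | 3 | 0.811 | 3 |
| 3 | 30 | 30 | 0 | 0.941 | 2 | 0.820 | 2 |
| 4 | 3 | 4 | 1 | 0.667 | 4 or 5 (tie) | 0.615 | 4 |
| 5 | 2 | 2 | 0 | 0.667 | 4 or 5 (tie) | 0.594 | 5 |
| 6 | 0 | 1 | 1 | 0.5 | 6 | 0.500 | 6 |
| 7 | -2 | 1 | 3 | 0.375 | 7 | 0.415 | 7 |

| SE order | SE score | Upvotes | Downvotes | Wilson score (z=2, center) | Wilson order | SteamDB score | SteamDB order |
|---|---|---|---|---|---|---|---|
| 1 | 68 | 69 | 1 | 0.959 | 1 | 0.851 | 1 |
| 2 | 37 | 37 | 0 | 0.951 | 2 | 0.831 | 2 |
| 3 | 19 | 19 | 0 | 0.913 | 3 | 0.794 | 3 |
| 4 | 14 | 16 | 2 | 0.818 | 6 or 7 | 0.726 | 6 |
| 5 | 12 | 15 | 3 | 0.773 | 10 | 0.694 | 9 |
| 6 | 9 | 9 | 0 | 0.846 | 4 | 0.742 | 4 |
| 7 or 8 | 8 | 9 | 1 | 0.786 | 8 | 0.700 | 8 |
| 7 or 8 | 8 | 8 | 0 | 0.833 | 5 | 0.733 | 5 |
| 9 | 7 | 7 | 0 | 0.818 | 6 or 7 | 0.722 | 7 |
| 10 | 5 | 5 | 0 | 0.778 | 9 | 0.692 | 10 |
| 11 | 4 | 5 | 1 | 0.7 | 13 | 0.639 | 13 |
| 12 or 13 | 3 | 3 | 0 | 0.714 | 11 or 12 | 0.641 | 11 or 12 |
| 12 or 13 | 3 | 3 | 0 | 0.714 | 11 or 12 | 0.641 | 11 or 12 |
| 14 | 2 | 2 | 0 | 0.667 | 14 | 0.594 | 14 |
| 15 or 16 | 1 | 1 | 0 | 0.6 | 15 or 16 | 0.500 | 15 or 16 |
| 15 or 16 | 1 | 1 | 0 | 0.6 | 15 or 16 | 0.500 | 15 or 16 |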

What sticks out to me most is how close the Wilson score ranking and SteamDB ranking are to each other, with only a few places where they differ. In particular, when considering two answers where A scored +15/-3 and B scored +5/-0, the Wilson score preferred B and the SteamDB rating preferred A.
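The +15/-3 versus +5/-0 disagreement can be reproduced with a short sketch. It assumes the centered Wilson score with z=2 in its (up+2)/(up+down+4) form and the SteamDB-style formula with the 10x and 0.5x constants; the function names are mine.

```python
import math


def wilson_center_z2(up: int, down: int) -> float:
    """Centered Wilson score for z=2."""
    return (up + 2) / (up + down + 4)


def steamdb_rating(up: int, down: int) -> float:
    """SteamDB-style adjusted rating with the 10x / 0.5x constants."""
    total = up + down
    if total == 0:
        return 0.5
    uncertainty = 0.5 ** math.log10(total)
    return uncertainty * 0.5 + (1 - uncertainty) * (up / total)


answer_a = (15, 3)
answer_b = (5, 0)

for name, score in [("Wilson", wilson_center_z2), ("SteamDB", steamdb_rating)]:
    a, b = score(*answer_a), score(*answer_b)
    preferred = "A" if a > b else "B"
    print(f"{name}: A={a:.3f} B={b:.3f} -> prefers {preferred}")
```

The gap between the two answers is small under both formulas, so this is a near-tie that happens to fall on opposite sides.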

4 Likes

The centered Wilson score needs just addition and division (and a multiplication by 2, but that can be precomputed as it only involves a global parameter), and no special case.
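As a sketch of what that looks like in code, assuming the usual form of the interval centre, (up + z²/2) / (up + down + z²), with the z-dependent constants computed once up front. The constant and function names are made up for illustration.

```python
Z = 2.0                          # global parameter
Z_SQUARED = Z * Z                # precomputed once
HALF_Z_SQUARED = Z_SQUARED / 2   # precomputed once


def wilson_center(up: int, down: int) -> float:
    """Centre of the Wilson score interval; for z=2 this is (up+2)/(up+down+4)."""
    return (up + HALF_Z_SQUARED) / (up + down + Z_SQUARED)


print(wilson_center(0, 0))   # no votes: no special case needed
print(wilson_center(86, 3))
```

Per post, only two additions and one division remain, and the denominator is never zero because Z_SQUARED is positive.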

The Steam formula on the other hand uses a logarithm, and needs to handle the no-vote part as special case.

I wonder which metric you apply in order to consider the Steam formula simpler.

I think I’ve seen broad support for the idea that the default answer sort should give recent (1 month, configurable) answers a ranking boost. No idea how that could be incorporated into these ranking systems.

1 Like

That’s what the bonus term is for in the Specification. Basically, the bonus affects the sorting in the same way as the corresponding number of upvotes would.

2 Likes

@celtschk For what it’s worth, I will note that I said the SteamDB formula is considerably simpler than the original Wilson’s score formula, which you yourself described as “a mighty complicated formula”. Your adjustment to using the centered version that boils down to (up+2)/(up+down+4) is, indeed, simpler than the SteamDB formula, and I have no strong objections to it.

This is a strawman. Both the division and the logarithm blow up if the total number of votes is 0. If you hate the special-casing, feel free to simply add 1 like the author of the SteamDB ranking article did (or do max(total_votes, 1) to preserve the exact mathematical characteristics of the premises). I opted to special-case in order to make the formulas simpler, much like you did with Wilson’s by choosing z=2.
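For what it’s worth, the max(total_votes, 1) variant could look like the sketch below. Applying the clamp to both the logarithm and the division is my reading of the suggestion, and the function names are hypothetical.

```python
import math


def steamdb_special_cased(up: int, down: int) -> float:
    """Version with an explicit branch for +0/-0."""
    total = up + down
    if total == 0:
        return 0.5
    uncertainty = 0.5 ** math.log10(total)
    return uncertainty * 0.5 + (1 - uncertainty) * (up / total)


def steamdb_clamped(up: int, down: int) -> float:
    """Same formula without a branch: treat zero votes as one vote."""
    n = max(up + down, 1)
    uncertainty = 0.5 ** math.log10(n)  # n == 1 gives uncertainty 1.0
    return uncertainty * 0.5 + (1 - uncertainty) * (up / n)


print(steamdb_clamped(0, 0), steamdb_clamped(9, 1))
```

With zero votes the uncertainty is 1, so the observed-rating term is multiplied by zero and the result is the default rating; for every other input the two versions do the same arithmetic.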

Furthermore, I invite you to write an explanation of why your formula is sensible, like I did when answering Monica. Whichever ranking method we use, we’ll have to justify it to users, and we may want to pick the simpler explanation.

1 Like

Yes, but your first post started with:

And since we had decided to use the centered Wilson score, I naturally assumed that this “instead of” referred to that formula (since the other one was already decided against a long time ago).

But the denominator in the centered Wilson formula cannot be 0.

I didn’t choose z=2 to make the formulas simpler, I chose it because without a choice, I would not have had any numbers to calculate, and HeapUnderflow had noted earlier that z=2 was the value recommended by Wikipedia.

It is the center of the interval in which the true ratio will fall with high probability (95% for z=2). Thus it is highly likely to be a good approximation to the true value.

Honestly, from your explanation I didn’t get at all why that formula to define the uncertainty level is reasonable. It probably is (the people at Steam will have thought about it, after all), but it’s absolutely not clear to me why.

1 Like

I missed this since I was skimming the thread. When you arrive two months late to the party, there’s a lot of stuff to read… My apologies for not being clearer.

My one-liner would be: for every 10x increase in the number of votes a post has, we are twice as certain that the post’s score accurately represents its true quality.

If you mean the 10 and 0.5 constants, then those are basically completely arbitrary. We can tweak them as need be. Otherwise, the formula is a consequence of the premises.
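One way to see what the two constants do: raising 0.5 to the log base 10 of n is the same as raising n to the power log10(0.5), roughly -0.301, so the uncertainty is simply a power law in the number of votes. A small numeric check in Python, with made-up function names:

```python
import math


def uncertainty_log_form(total_votes: int, vote_factor: float = 10.0,
                         uncertainty_factor: float = 0.5) -> float:
    """Uncertainty as stated in the premises: factor ** log_base(total)."""
    return uncertainty_factor ** math.log(total_votes, vote_factor)


def uncertainty_power_form(total_votes: int, vote_factor: float = 10.0,
                           uncertainty_factor: float = 0.5) -> float:
    """Equivalent power-law form: total ** log_base(factor)."""
    exponent = math.log(uncertainty_factor, vote_factor)  # about -0.301 for 10 and 0.5
    return total_votes ** exponent


for n in (1, 10, 100, 1234):
    print(n, uncertainty_log_form(n), uncertainty_power_form(n))
```

Changing the constants only changes the exponent of that power law, i.e. how quickly the default rating of 50% is forgotten as votes accumulate.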

2 Likes

No problem.

And that’s what I don’t understand: Why should our certainty relate to the number of votes in this form?

Note that my one-liner does not contain any formula; it contains concepts. If you want to know why the concepts lead to the formula, you can look into Wikipedia. But the concept itself is clear: If we want a good approximation to the real value, we had better be in the region where the real value is most likely to be found. And being right in the middle of it is likely not a bad bet.

I don’t see the concept behind the Steam formula.

That’s because the premises, at least in the form you gave them, essentially are the formula.

I must admit I don’t understand your question. If I rephrase my one-liner to focus entirely on concepts, then it’s literally just “the more votes a post has, the more certain we are that we know its true value”, which is exactly the same thing as the centered Wilson score. It is true that you make no mention of the size of “the region where the real value is most likely to be found”, which is the ‘confidence’ part, but it shows up in the formula nonetheless. (And in your original one-liner, where you imply a 95% confidence interval.) The formula you advocate also incorporates total votes (dividing by up+down+4), which is why I’m confused by your remarks on the SteamDB formula incorporating the number of votes into the level of certainty.

2 Likes

Trouble is, the concepts behind the scoring rule have to be germane to justify it.

The center of the score interval is an estimator more typically derived as the posterior mean given a binomial likelihood & a symmetric beta prior. In any case the stochastic model of the data-generating process is the same, & the premises are these:

  1. We’re interested in estimating what would be the ratio of up-votes to down-votes from a hypothetical large population of voters. Is that so? If we see a handful of answers with single-figure up-votes & one with fifty, we’ll imagine before we’ve read them that the most up-voted must be head & shoulders above the others, especially if they were all posted around the same time: to refrain from voting is a choice, & the relative tallies of up-votes informative far beyond any effect on the precision of such estimates. (@MartijnWeterings has already made this point, but it’s worth reiterating.)

  2. This notional ratio of up-votes to down-votes is a constant. One obvious exception is when an answer attracting down-votes is heavily edited & starts to get up-votes; another is when a truthy answer attracts up-votes till someone points out an error, when it starts to get down-votes.

  3. Votes are cast independently, i.e. no-one takes into account current vote tallies on the answer they’re voting on, or other answers. I don’t suppose this is close to true.
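The equivalence mentioned before the list, between the centre of the score interval and a posterior mean under a symmetric beta prior, can be checked numerically. This sketch assumes a Beta(a, a) prior, whose posterior mean after the observed votes is (up + a) / (up + down + 2a); taking a = z²/2 then reproduces the interval centre. Function names are made up.

```python
def beta_posterior_mean(up: int, down: int, a: float) -> float:
    """Posterior mean of the up-vote probability under a Beta(a, a) prior."""
    return (up + a) / (up + down + 2 * a)


def wilson_center(up: int, down: int, z: float = 2.0) -> float:
    """Centre of the Wilson score interval."""
    return (up + z * z / 2) / (up + down + z * z)


z = 2.0
a = z * z / 2  # a = 2 for z = 2

for up, down in [(0, 0), (4, 1), (16, 2), (86, 3)]:
    print(up, down, beta_posterior_mean(up, down, a), wilson_center(up, down, z))
```

With a = 1 the same function gives (upvotes + 1) / (upvotes + downvotes + 2), so the choice between the two formulas in this thread amounts to a choice of prior strength.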

It’s not that I have a more suitable model up my sleeve, but that we need to investigate whether a candidate rule works sensibly enough in particular cases & to consider what kind of behaviour its use is going to promote. For instance, using (upvotes + 1) / (upvotes + downvotes + 2), the default, results in the score for solely up-voted answers having a preposterous sensitivity to single downvotes. If one answer’s got 10 up-votes, & another in the invidious top-of-the-page position has got 20, a single down-vote for the latter will reverse their rankings. (There should be a Leap-Frog badge for users who do this & get their own answer to the top of the page.)
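The leap-frog case is easy to reproduce. A small sketch with made-up function names; the second function is the (up+2)/(up+down+4) variant from earlier in the thread, included only for comparison.

```python
def default_score(up: int, down: int) -> float:
    """(upvotes + 1) / (upvotes + downvotes + 2)"""
    return (up + 1) / (up + down + 2)


def centered_z2_score(up: int, down: int) -> float:
    """(upvotes + 2) / (upvotes + downvotes + 4), for comparison."""
    return (up + 2) / (up + down + 4)


top = (20, 0)        # answer currently at the top of the page
rival = (10, 0)      # answer below it
top_hit = (20, 1)    # top answer after a single down-vote

print("before:", default_score(*top) > default_score(*rival))
print("after: ", default_score(*top_hit) > default_score(*rival))
print("z=2 variant after:", centered_z2_score(*top_hit) > centered_z2_score(*rival))
```

In this particular case the z=2 variant keeps the original order after the down-vote, though that says nothing about how it behaves for other vote counts.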

1 Like