MVP Proposal: Base Trust Level for a User on Quality + Consistency

Related: MVP Discussion: Reputation and MVP Proposal: User Trust and Reward System.


I propose that a user’s “reputation” be based on both 1) the average quality of their contributions and 2) the level of confidence in the user’s ability to maintain or improve their quality of contributions.


To start with, let’s consider a simpler situation: answers! Suppose there are four answers on the same question with the following scores:

A. +100 / -10
B. +20 / -1
C. +10 / -1
D. +6 / -0

If you consider merely the sum of upvotes and downvotes, as SE does, then these answers are ranked in the order I’ve listed them. However, if you consider the ratio of upvotes to downvotes (up / (down + 1), to avoid divide-by-zero errors), then they’re ranked like so: B (10), A (9.09), D (6), C (5). Going to extremes can be pretty illuminating - an answer that scores +1 / -0 will tie with an answer that scores +1000 / -999 under either ranking method. Does it make sense to say that both answers are equally good?
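To make the comparison concrete, here’s a quick sketch of both ranking methods applied to the four answers above:

```python
# Score each answer two ways: SE-style sum (up - down) and the
# smoothed ratio up / (down + 1) discussed above.
answers = {"A": (100, 10), "B": (20, 1), "C": (10, 1), "D": (6, 0)}

def se_sum(up, down):
    return up - down

def ratio(up, down):
    return up / (down + 1)

by_sum = sorted(answers, key=lambda a: se_sum(*answers[a]), reverse=True)
by_ratio = sorted(answers, key=lambda a: ratio(*answers[a]), reverse=True)

print(by_sum)    # ['A', 'B', 'C', 'D']
print(by_ratio)  # ['B', 'A', 'D', 'C']
```

Note that both methods score +1 / -0 and +1000 / -999 identically (a sum of 1, and a ratio of 1), which is the tie described above.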

This is where confidence comes in; it’s what drove the development of ranking methods like Wilson scores. I, however, much prefer SteamDB’s method for ranking games, which is simpler and more straightforward.

The gist of it is that we arbitrarily assume that if one answer has +100 / -10 and another answer has +1000 / -100, then we are twice as confident that the second answer has a score that reflects its true quality. That is, as more upvotes and downvotes come in, we are more certain over time that an answer’s current score is indicative of its future score.
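Here’s a sketch of what I understand SteamDB’s published rating formula to be - the assumption here is that it pulls the raw positive ratio toward 50%, and each tenfold increase in total votes halves that pull, which matches the “twice as confident” behavior described above:

```python
import math

def steamdb_rating(up, down):
    # Raw positive ratio, pulled toward 0.5; the size of the pull
    # halves for every tenfold increase in total vote count.
    total = up + down
    score = up / total
    return score - (score - 0.5) * 2 ** (-math.log10(total + 1))

# Same 10:1 ratio, but 10x the votes -> the rating sits closer to the raw 0.909.
print(round(steamdb_rating(100, 10), 3))
print(round(steamdb_rating(1000, 100), 3))
```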


As has been said many times, SE’s system of having one rep number determine everything is highly flawed, so an alternative approach is to have multiple numbers, in a sense, based separately on asking, answering, flagging, etc. If a user casts many flags and all but one or two are deemed valid, then it naturally follows that they should be trusted more when it comes to flagging.

This is where this kind of average + confidence method really shines. Say a new user casts three flags, and two are valid but one is rejected. If this were a user who had cast 300 flags with only two thirds of them good, that would be a problem. But this is a new user; we don’t know for sure whether this pattern will continue. They’re probably just learning, so we can be a little lenient at first.

Over time, however, as this user continues casting more and more flags, let’s say they get to 1000 valid and 10 invalid flags. At this point, we can be highly confident that they make good decisions and allow them some more power in their flagging. Or give them the privilege to validate flags from other users. That sort of thing.

Here are some ideas on how this could be applied to various categories of actions:

  • Asking: score is based on good vs bad questions; potential benefit for good performance could be the ability to bounty immediately upon posting; potential detriment for poor performance could be rate limiting.
  • Answering: score is based on good vs bad answers; potential benefit could be a boost in ranking among competing answers; potential detriment could be getting put in a review queue.
  • Commenting: score is based on helpful vs unhelpful comments; potential benefit could be ability to comment on other users’ posts; potential detriment could be rate limiting.
  • Flagging: score is based on valid vs invalid flags; potential benefit could be super flags; potential detriment could be losing the ability to flag others’ posts.

So on and so forth. On top of all these, there should be a general level of “reputation” whereby positive usage of the site in general can partially or wholly contribute towards a user getting privileges such as flagging. This general “reputation” can be calculated as a weighted combination of each of the category-specific “reputation” scores - a user that asks many great questions isn’t necessarily good at flagging, but their track record of positive contributions should count for something.


For each category of site actions, a user’s trust level should be based on both the quality and quantity of their contributions within that category; the latter tells us how confident we are about the former.


As a physicist, I instinctively feel that the confidence should involve a sqrt(number of votes) somewhere - that is, with 10x the votes, we should be sqrt(10) ≈ 3.16 times as confident. This is still just a vague rule of thumb, but perhaps a slightly less vague one (10x the votes means 10x the time spent reading, right? So the relative error shrinks by a factor of ~sqrt(10)).
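That rule of thumb matches the usual statistics: if each vote is an independent noisy sample of the answer’s true quality, the standard error of the mean shrinks like 1/sqrt(n), so confidence (its inverse) grows like sqrt(n). A toy sketch, with the baseline count chosen arbitrarily:

```python
import math

def relative_confidence(n_votes, baseline=10):
    # Confidence relative to a baseline vote count, growing like sqrt(n):
    # the standard error of a mean of n independent samples ~ 1/sqrt(n).
    return math.sqrt(n_votes / baseline)

print(relative_confidence(100))  # 10x the votes -> sqrt(10) ≈ 3.16x as confident
```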

Wait… Perhaps that means that the confidence should actually be related to the number of views of the post? By this square root dependence, perhaps?

When it comes to flagging, things are slightly more difficult - some flags are more borderline than others, so high uncertainty on a borderline flag may come down to a difference in subjective opinion, while high uncertainty on a flag that is more obviously right or wrong is much less subjective.


From the beginning I’ve wanted to replace a single reputation number with some sort of indicators for types of activities (like asking, answering, editing, curating). It’s useful to see at a glance that this person has a lot of positive contributions from answers and that one does a lot of good editing and so on. We haven’t developed that idea much so far.

We have already tackled controversial answers in another way: answer order is based on Wilson scores, not mere up-minus-down like on SE. If we do per-activity scoring for users we should be consistent; we shouldn’t have a different system for scores on the question page vs. contributions to a person’s “answer rep”, so to speak.
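For concreteness, here’s a sketch of the lower bound of the Wilson score interval (at 95% confidence, a common choice - the exact parameters are illustrative, not necessarily what the site uses). Unlike the raw sum or ratio, it breaks the +1 / -0 vs. +1000 / -999 tie, and in fact ranks the latter higher:

```python
import math

def wilson_lower_bound(up, down, z=1.96):
    # Lower bound of the Wilson score interval for the true upvote
    # fraction; z = 1.96 corresponds to ~95% confidence.
    n = up + down
    if n == 0:
        return 0.0
    p = up / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / (1 + z * z / n)

print(round(wilson_lower_bound(1, 0), 3))       # few votes -> low confidence
print(round(wilson_lower_bound(1000, 999), 3))  # many votes -> near the true ~0.5
```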

I think a lot of us want to come up with the optimal, complex, mathematically-beautiful solution to this problem. Bear in mind that we have to be able to explain it to users clearly, and most of them do not share our fascination with systems-tweaking.


Actually, it seems to me that you really want the log of the number of votes or views or whatever.

I don’t have a problem with scores for various activities being available somewhere. But there is nothing like a single rep number to foster competitiveness. At least on certain sites, this will be important. There has to be some simple thing you can aspire to be at the top of.


A total score can be as easy as a weighted sum of the individual scores (weighted because some activities are more important than others; a good answer should count more than a helpful flag).
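As a sketch of that weighted sum - the categories and weights here are invented purely for illustration, not a concrete proposal:

```python
# Hypothetical per-activity weights; a good answer counts more than a helpful flag.
WEIGHTS = {"answering": 3.0, "asking": 2.0, "commenting": 0.5, "flagging": 1.0}

def overall_reputation(category_scores):
    # Weighted sum of per-category scores; missing categories count as 0.
    return sum(WEIGHTS[cat] * category_scores.get(cat, 0.0) for cat in WEIGHTS)

print(overall_reputation({"answering": 0.9, "flagging": 0.8}))  # 3.0*0.9 + 1.0*0.8 = 3.5
```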

Actually, the centered Wilson score shines here: disregarding all the theory behind it, the effect is simply:

Calculate the ratio of upvotes to total votes pretending there were N additional votes of each type.
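In code, that “pretend there were N additional votes of each type” rule is just additive smoothing - a minimal sketch with N as a tunable pseudocount:

```python
def smoothed_ratio(up, down, pseudo=2):
    # Pretend there were `pseudo` extra upvotes and `pseudo` extra downvotes,
    # pulling small samples toward 0.5. pseudo=2 gives (up+2)/(up+down+4).
    return (up + pseudo) / (up + down + 2 * pseudo)

print(smoothed_ratio(1, 0))       # 3/5 = 0.6, pulled well toward 0.5
print(smoothed_ratio(1000, 999))  # ~0.5: the extra votes barely matter
```

With pseudo=2 this breaks the +1 / -0 vs. +1000 / -999 tie from earlier, since the pseudocounts dominate the tiny sample but are negligible against the large one.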


Despite my preference for the SteamDB method, I fully support using the centered Wilson score (proposed here for answers) - (positive+2)/(positive+negative+4). The only thing that really matters for this specific proposal is that the observed average quality of contributions be adjusted according to the number of contributions.

Further, if desired, we can go even simpler: define trust thresholds as a pair - (average quality, number of contributions) - and require that both be exceeded before a particular privilege is granted. That way we can pretty much abstract away whether we’re using a centered Wilson score or SteamDB’s ranking algorithm.
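A minimal sketch of that pair-of-thresholds idea - the privilege names and numbers here are purely hypothetical:

```python
# Hypothetical thresholds: (minimum average quality, minimum contribution count).
THRESHOLDS = {
    "bounty_on_post": (0.80, 10),
    "super_flag": (0.95, 100),
}

def has_privilege(privilege, avg_quality, n_contributions):
    # Grant the privilege only if BOTH quality and quantity thresholds are met;
    # how avg_quality is computed (Wilson, SteamDB, ...) is abstracted away.
    min_quality, min_count = THRESHOLDS[privilege]
    return avg_quality >= min_quality and n_contributions >= min_count

print(has_privilege("super_flag", 0.99, 1010))  # True
print(has_privilege("super_flag", 0.99, 3))     # False: too few contributions
```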