Related: MVP Discussion: Reputation and MVP Proposal: User Trust and Reward System.
Proposal
I propose that a user’s “reputation” be based on both 1) the average quality of their contributions and 2) our level of confidence that they will maintain or improve that quality.
Example
To start with, let’s consider a simpler situation: answers! Suppose there are four answers on the same question with the following scores:
A. +100 / -10
B. +20 / -1
C. +10 / -1
D. +6 / -0
If you consider merely the net score (upvotes minus downvotes), as SE does, then these answers are ranked in the order I’ve listed them. However, if you rank by the ratio of upvotes to downvotes (up / (down + 1), adding 1 to avoid dividing by zero), then they’re ranked like so: B (10), A (9.09), D (6), C (5). Going to extremes can be pretty illuminating: an answer that scores +1 / -0 will tie with an answer that scores +1000 / -999 using either ranking method. Does it make sense to say that both answers are equally good?
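Here’s a quick Python sketch of the two rankings for the four answers above (the helper names are just for illustration):

```python
# The four answers above, as (upvotes, downvotes).
answers = {"A": (100, 10), "B": (20, 1), "C": (10, 1), "D": (6, 0)}

def net_score(up, down):
    return up - down          # SE-style score: upvotes minus downvotes

def ratio_score(up, down):
    return up / (down + 1)    # +1 in the denominator avoids dividing by zero

print(sorted(answers, key=lambda a: net_score(*answers[a]), reverse=True))
# ['A', 'B', 'C', 'D']
print(sorted(answers, key=lambda a: ratio_score(*answers[a]), reverse=True))
# ['B', 'A', 'D', 'C']
```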
This is where confidence comes in; it’s what drove the development of ranking methods like Wilson scores. I, however, much prefer SteamDB’s method for ranking games, which is quite a bit simpler and more straightforward.
The gist of it is that we arbitrarily assume that if one answer has +100 / -10 and another answer has +1000 / -100, then we are twice as confident that the second answer has a score that reflects its true quality. That is, as more upvotes and downvotes come in, we are more certain over time that an answer’s current score is indicative of its future score.
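To make that concrete, here’s a rough sketch of that kind of adjustment. I’m assuming the commonly cited SteamDB formula, which pulls the raw positive ratio toward a neutral 0.5 by an amount that shrinks as votes accumulate; treat the exact constants as illustrative rather than gospel.

```python
import math

def adjusted_score(up, down):
    """Pull the raw positive ratio toward a neutral 0.5 when votes are scarce.

    Sketch of the SteamDB-style adjustment: the correction term shrinks by
    half every time the total vote count grows tenfold, which is the
    "ten times the votes, twice the confidence" idea described above.
    """
    total = up + down
    if total == 0:
        return 0.5                      # no votes yet: purely neutral
    average = up / total
    return average - (average - 0.5) * 2 ** (-math.log10(total + 1))

print(round(adjusted_score(1, 0), 2))       # 0.59  (one upvote barely moves the needle)
print(round(adjusted_score(1000, 999), 2))  # 0.5   (many votes, genuinely mixed signal)
print(round(adjusted_score(100, 10), 2))    # 0.81
print(round(adjusted_score(1000, 100), 2))  # 0.86  (same ratio, ten times the votes)
```

Note how the correction roughly halves between the last two calls: ten times the votes, about twice the confidence.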
Application
As has been said many times, SE’s system of having one rep number determine everything is highly flawed, so an alternative approach is to track several separate numbers based on asking, answering, flagging, etc. If a user casts many flags and all but one or two are deemed valid, then it naturally follows that they should be trusted more when it comes to flagging.
This is where using this kind of average + confidence method really shines. A new user casts three flags; two are valid but one is rejected. If this were a user who had cast 300 flags and only two-thirds of them were good, that would be a problem. But this is a new user, and we don’t know yet whether this pattern will continue. They’re probably just learning, so we can be a little lenient at first.
Over time, however, as this user continues casting more and more flags, let’s say they get to 1000 valid and 10 invalid flags. At this point, we can be highly confident that they make good decisions and allow them some more power in their flagging. Or give them the privilege to validate flags from other users. That sort of thing.
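To put some numbers on that, here’s a hypothetical way to split the same idea into a (quality, confidence) pair for a user’s flag record; the function name and the exact confidence curve are placeholders, not part of the proposal.

```python
import math

def flag_record(valid, invalid):
    """Hypothetical split of a flag record into a (quality, confidence) pair.

    quality    = fraction of this user's flags that were valid
    confidence = how much that fraction can be trusted; it creeps toward 1
                 as the total number of flags grows (same 2**(-log10) shape
                 as the answer-score sketch earlier).
    """
    total = valid + invalid
    quality = valid / total if total else 0.0
    confidence = 1 - 2 ** (-math.log10(total + 1))
    return round(quality, 2), round(confidence, 2)

print(flag_record(2, 1))       # (0.67, 0.34): decent rate, but we barely know them yet
print(flag_record(200, 100))   # (0.67, 0.82): same rate, now clearly an established pattern
print(flag_record(1000, 10))   # (0.99, 0.88): high quality *and* high confidence
```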
Here are some ideas on how this could be applied to various categories of actions:
- Asking: score is based on good vs bad questions; potential benefit for good performance could be the ability to bounty immediately upon posting; potential detriment for poor performance could be rate limiting.
- Answering: score is based on good vs bad answers; potential benefit could be a boost in ranking among competing answers; potential detriment could be getting put in a review queue.
- Commenting: score is based on helpful vs unhelpful comments; potential benefit could be ability to comment on other users’ posts; potential detriment could be rate limiting.
- Flagging: score is based on valid vs invalid flags; potential benefit for good performance could be super flags; potential detriment for poor performance could be losing the ability to flag others’ posts.
So on and so forth. On top of all these, there should be a general level of “reputation” whereby positive usage of the site overall can partially or wholly contribute towards a user earning privileges such as flagging. This general “reputation” can be calculated as a weighted combination of the category-specific “reputation” scores: a user who asks many great questions isn’t necessarily good at flagging, but their track record of positive contributions should count for something.
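As a minimal sketch of what that weighted combination might look like (the category names and weights below are placeholders, not a concrete proposal):

```python
# Hypothetical per-category weights; the names and numbers are placeholders.
CATEGORY_WEIGHTS = {"asking": 0.3, "answering": 0.3, "commenting": 0.2, "flagging": 0.2}

def general_reputation(category_scores):
    """Weighted average of whichever category scores the user has so far.

    Categories with no activity are simply left out, so a great asker still
    builds some general standing before they've ever flagged anything.
    """
    weight = sum(CATEGORY_WEIGHTS[c] for c in category_scores)
    if weight == 0:
        return 0.0
    return sum(CATEGORY_WEIGHTS[c] * s for c, s in category_scores.items()) / weight

print(general_reputation({"asking": 0.9, "answering": 0.8}))  # ≈ 0.85
```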
TL;DR:
For each category of site actions, a user’s trust level should be based on both the quality and quantity of their contributions within that category; the latter tells us how confident we are about the former.