If it’s not open source, who is going to develop these tools?
I think it’s theoretically possible for such tools to be open source, as long as they’re near-perfect: 99% of spammers and trolls aren’t going to take the time to develop something specific to get around our system.
Do you mean spam detection, or detecting copied/stolen copyrighted source code? Because fraud seems like a social-engineering issue rather than something you can just search the text of a post for.
Spam, voting fraud, troll detection – these are all areas where we don’t want to make public exactly how we detect issues. This is a good question: how do other open-source projects handle the sensitive parts of their codebases?
There are two separate systems to consider:

- a system to deter spammers (e.g. email and IP blocking), and
- a system for catching vote fraud and similar things (sockpuppets, targeted upvoting).
On SE, the exact algorithms for the latter are private, “because otherwise people could work around it, by violating the rules, but not the implementation”.
Sorry, the above is right: I meant voting fraud. I’ll update my post.
Not sure how other open-source projects handle this; I know that when Reddit was open source, the spam/troll detection was the only part that stayed closed source.
Perhaps the system can be based on a bunch of tweak parameters. If it’s made “complicated” enough, knowing the algorithm doesn’t help much without knowing what the dozen or so tweak parameters are set to. Those would live in the database and only be visible to the people running the site, and possibly mods. Letting the mods adjust the parameters could be useful for reacting quickly and tuning out specific threats as they pop up.
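As a rough sketch of that idea (the table name, parameter names, and values below are all invented for illustration), the code can ship with sensible defaults while the values actually in effect stay in the database:

```python
import sqlite3

# Hypothetical spam-filter parameters: the defaults are public in the
# code, but the live values are read from the database, where only the
# people running the site (and possibly mods) can see or change them.
DEFAULTS = {
    "max_links_per_post": 5,
    "min_account_age_minutes": 10,
    "similarity_threshold": 0.9,
}

def load_filter_params(db: sqlite3.Connection) -> dict:
    """Overlay mod-tuned values from a site_settings table onto the defaults."""
    params = dict(DEFAULTS)
    rows = db.execute(
        "SELECT name, value FROM site_settings WHERE category = 'spam_filter'"
    )
    for name, value in rows:
        if name in params:
            # Coerce the stored string to the type of the default value.
            params[name] = type(DEFAULTS[name])(value)
    return params
```

Knowing this code tells a spammer the shape of the checks, but not the thresholds a given deployment is actually running.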
Security through obscurity isn’t really security, so leaving the parameters hidden isn’t really going to help: spammers can use scripts to automate spam posts until they figure the parameters out.
It’s the same thing with leaving code hidden. It doesn’t truly make it more secure; it just makes penetration take more time (sometimes).
If increasing the difficulty of subversion is all that is required, then it is enough to do so.
Project director for Charcoal (including Smokey) here. SmokeDetector’s detection systems are highly tuned for spam; more to the point, they’re highly tuned to spam on SE. Every site that has a textbox gets a different variety of spam, so Smokey’s rules may completely miss the spam we get.
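For illustration only (these are not Smokey’s actual rules; the patterns, weights, and threshold are made up), a detector in this spirit is essentially a list of weighted patterns, and it’s that list that has to be tuned to each site’s particular spam:

```python
import re

# Hypothetical, site-specific rules. The point is that each site's spam
# looks different, so the rule list and weights must be tuned per site
# rather than copied wholesale from another site.
RULES = [
    (re.compile(r"\bfree\s+followers\b", re.I), 5),
    (re.compile(r"https?://\S+", re.I), 1),  # a bare link is only weak evidence
    (re.compile(r"\b(viagra|casino)\b", re.I), 10),
]
THRESHOLD = 8  # tunable per site

def spam_score(text: str) -> int:
    """Sum the weights of every rule that matches the post text."""
    return sum(weight for pattern, weight in RULES if pattern.search(text))

def looks_like_spam(text: str) -> bool:
    return spam_score(text) >= THRESHOLD
```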
In any case, spam isn’t a problem we need to worry about for MVP. MVP should be about getting up and running; bells and whistles like spam detection are v1.1 features: not too long after MVP, but they shouldn’t block the initial release.
@Olin and @mbomb007 have hit the nail on the head here. We don’t need security, we just need difficulty: spammers (and trolls) are lazy, and will give up if you start putting roadblocks in their way. So, yes, we can settle on our algorithm, then make its numerical parameters configurable site settings with sensible default values: that’s more than sufficient obscurity to be enough of a roadblock.
And I’m hoping that, with Codidact being both open-source (so we can write hooks for spam detection) and the primary instance well governed (ahem), we can have a much tighter integration with Charcoal than SE does. (OK, having the same person in charge helps too.)
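Hypothetically (none of this is Codidact’s actual API; the names are invented), such a hook could be as simple as a callback registry that an external tool like Smokey registers against:

```python
from typing import Callable

# Hypothetical hook registry: external tools (say, a Charcoal
# integration) register callbacks that run on every new post.
_post_hooks: list[Callable[[dict], None]] = []

def on_new_post(hook: Callable[[dict], None]) -> None:
    """Register a callback to be invoked with every newly created post."""
    _post_hooks.append(hook)

def publish_new_post(post: dict) -> None:
    for hook in _post_hooks:
        hook(post)  # e.g. forward the post to an external spam checker

# Usage: a third-party integration subscribes without touching core code.
on_new_post(lambda post: print("checking post", post.get("id")))
```

Because the hook points themselves are open source, an integration like this can live entirely outside the core codebase.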
Some basic rate limiting on the post functionality will most likely be more than enough in the beginning. Spam generally scales with the eyeballs a spammer can get, so it’s highly unlikely that we’ll have a significant spam problem from day one.
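As a sketch of what “basic” can mean here (the window size and limit are placeholder numbers, not recommendations), even a naive per-user sliding-window limiter on post creation goes a long way:

```python
import time
from collections import defaultdict, deque

# Naive per-user sliding-window rate limiter: at most MAX_POSTS posts
# per WINDOW_SECONDS. Both numbers are placeholders.
MAX_POSTS = 3
WINDOW_SECONDS = 600

_recent_posts: dict[str, deque] = defaultdict(deque)

def may_post(user_id: str) -> bool:
    """Return True and record the attempt if the user is under the limit."""
    now = time.time()
    timestamps = _recent_posts[user_id]
    # Drop timestamps that have fallen out of the window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) >= MAX_POSTS:
        return False
    timestamps.append(now)
    return True
```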
Furthermore, knowing our spam-detection system in detail doesn’t necessarily help a spammer if certain parameters and configuration values on our production deployment are not part of the code.
As the author/owner of Codidact, we could incorporate non-free modules without violating the license (licenses bind other people, not the copyright holder). But it might be worth considering whether an exception should be allowed for people who want to host their own instances.
Note, however, that every single contributor would have to agree to the non-free modules.
If we go for the special exception, its exact scope should be very carefully considered, or else we might undermine the very reason for selecting the AGPL in the first place.
I’m not so sure. If there are well-defined, easily accessible rules for how the filter works, what’s stopping someone from writing code that automates getting around it?
It’s the “script kiddie” problem: all it takes is one non-lazy spammer to create such a tool and distribute it, and then all the lazy ones have the power to make use of it.
There’s nothing to stop folks doing that on SE either, and their algorithm is secret. I don’t believe there’s anything to be gained by secrecy, as long as the variables are configurable.