If it’s not open source, who is going to develop these tools?
I think it’s theoretically possible for such tools to be open source, as long as they’re near-perfect: 99% of spammers and trolls aren’t going to take the time to develop something specific to get around our system.
Do you mean spam detection, or detecting copied/stolen copyrighted source code? Because fraud seems like a social-engineering issue rather than something you can just search the text of a post for.
Spam, voting fraud, troll detection – these are all areas where we don’t want to make public exactly how we detect issues. This is a good question: how do other open-source projects handle the sensitive parts of their codebases?
There are two separate systems to consider:

- a system to deter spammers (e.g. email and IP blocking), and
- a system for catching vote fraud and similar things (sockpuppets, targeted upvoting).
On SE, the exact algorithms for the latter are private, “because otherwise people could work around it, by violating the rules, but not the implementation”.
Sorry, the above is right: I meant voting fraud. I’ll update my post.
Not sure how other open-source projects handle this; I know that when Reddit was open source, the spam/troll detection was the only part that stayed closed source.
Perhaps the system can be based on a bunch of tweak parameters. If it’s made “complicated” enough, knowing the algorithm doesn’t help much without knowing what the dozen or so tweak parameters are set to. Those would live in the database and only be visible to the people running the site, and possibly mods. Letting the mods adjust the parameters could be useful for reacting quickly and tuning out specific threats as they pop up.
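As a rough sketch of that idea (the table name, parameter names, and values below are all invented for illustration), the code can ship with sensible defaults while the values actually in effect stay in the database:

```python
import sqlite3

# Hypothetical spam-filter parameters: the defaults are public in the
# code, but the live values are read from the database, where only the
# people running the site (and possibly mods) can see or change them.
DEFAULTS = {
    "max_links_per_post": 5,
    "min_account_age_minutes": 10,
    "similarity_threshold": 0.9,
}

def load_filter_params(db: sqlite3.Connection) -> dict:
    """Overlay mod-tuned values from a site_settings table onto the defaults."""
    params = dict(DEFAULTS)
    rows = db.execute(
        "SELECT name, value FROM site_settings WHERE category = 'spam_filter'"
    )
    for name, value in rows:
        if name in params:
            # Coerce the stored string to the type of the default value.
            params[name] = type(DEFAULTS[name])(value)
    return params
```

Knowing this code tells a spammer the shape of the checks, but not the thresholds a given deployment is actually running.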
Security through obscurity isn’t really security, so leaving the parameters hidden isn’t really going to help: spammers can use scripts to automate spam posts until they figure the parameters out.
It’s the same thing with leaving code hidden. It doesn’t truly make it more secure; it just makes penetration take more time (sometimes).
If increasing the difficulty of subversion is all that is required, then it is enough to do so.
Project director for Charcoal (including Smokey) here. SmokeDetector’s detection systems are highly tuned for spam; more to the point, they’re highly tuned to spam on SE. Every site that has a textbox gets a different variety of spam, so Smokey’s rules may completely miss the spam we get.
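For illustration only (these are not Smokey’s actual rules; the patterns, weights, and threshold are made up), a detector in this spirit is essentially a list of weighted patterns, and it’s that list that has to be tuned to each site’s particular spam:

```python
import re

# Hypothetical, site-specific rules. The point is that each site's spam
# looks different, so the rule list and weights must be tuned per site
# rather than copied wholesale from another site.
RULES = [
    (re.compile(r"\bfree\s+followers\b", re.I), 5),
    (re.compile(r"https?://\S+", re.I), 1),  # a bare link is only weak evidence
    (re.compile(r"\b(viagra|casino)\b", re.I), 10),
]
THRESHOLD = 8  # tunable per site

def spam_score(text: str) -> int:
    """Sum the weights of every rule that matches the post text."""
    return sum(weight for pattern, weight in RULES if pattern.search(text))

def looks_like_spam(text: str) -> bool:
    return spam_score(text) >= THRESHOLD
```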
In any case, spam isn’t a problem we need to worry about for MVP. MVP should be about getting up and running; bells and whistles like spam detection are v1.1 features: not too long after MVP, but they shouldn’t block the initial release.
@Olin and @mbomb007 have hit the nail on the head here. We don’t need security, we just need difficulty: spammers (and trolls) are lazy, and will give up if you start putting roadblocks in their way. So, yes, we can settle on our algorithm, then make its numerical parameters configurable site settings with sensible default values: that’s more than sufficient obscurity to be enough of a roadblock.
And I’m hoping that, with Codidact being both open-source (so we can write hooks for spam detection) and the primary instance well governed (ahem), we can have a much tighter integration with Charcoal than SE does. (OK, having the same person in charge helps too.)
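Hypothetically (none of this is Codidact’s actual API; the names are invented), such a hook could be as simple as a callback registry that an external tool like Smokey registers against:

```python
from typing import Callable

# Hypothetical hook registry: external tools (say, a Charcoal
# integration) register callbacks that run on every new post.
_post_hooks: list[Callable[[dict], None]] = []

def on_new_post(hook: Callable[[dict], None]) -> None:
    """Register a callback to be invoked with every newly created post."""
    _post_hooks.append(hook)

def publish_new_post(post: dict) -> None:
    for hook in _post_hooks:
        hook(post)  # e.g. forward the post to an external spam checker

# Usage: a third-party integration subscribes without touching core code.
on_new_post(lambda post: print("checking post", post.get("id")))
```

Because the hook points themselves are open source, an integration like this can live entirely outside the core codebase.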
Some basic rate limiting on the post functionality will most likely be more than enough in the beginning. Spam generally scales with the eyeballs a spammer can get, so it’s highly unlikely that we’ll have a significant spam problem from day one.
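As a sketch of what “basic” can mean here (the window size and limit are placeholder numbers, not recommendations), even a naive per-user sliding-window limiter on post creation goes a long way:

```python
import time
from collections import defaultdict, deque

# Naive per-user sliding-window rate limiter: at most MAX_POSTS posts
# per WINDOW_SECONDS. Both numbers are placeholders.
MAX_POSTS = 3
WINDOW_SECONDS = 600

_recent_posts: dict[str, deque] = defaultdict(deque)

def may_post(user_id: str) -> bool:
    """Return True and record the attempt if the user is under the limit."""
    now = time.time()
    timestamps = _recent_posts[user_id]
    # Drop timestamps that have fallen out of the window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) >= MAX_POSTS:
        return False
    timestamps.append(now)
    return True
```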
Furthermore, knowing our spam-detection system in detail doesn’t necessarily help a spammer if certain parameters and configuration values on our production deployment are not part of the code.
As the author/owner of Codidact, we could incorporate non-free modules without violating the license (licenses bind other people, not the copyright holder). But it might be worth considering whether an exception should be allowed for people who want to host their own instances.
Note, however, that every single contributor would have to agree to the non-free modules.
If we go for the special exception, its exact scope should be very carefully considered, or else we might undermine the very reason for selecting the AGPL in the first place.
I’m not so sure. If there are well-defined, easily accessible rules for how the filter works, what’s stopping someone from writing code that automates getting around it?
It’s the “script kiddie” problem: all it takes is one non-lazy spammer to create such a tool and distribute it, and then all the lazy ones have the power to make use of it.
There’s nothing to stop folks doing that on SE either, and their algorithm is secret. I don’t believe there’s anything to be gained by secrecy, as long as the variables are configurable.