Right. I usually don’t read such things either. That’s because I know Amazon (or whatever merchant) is trying to make money. You can’t do that if you get really obnoxious with your customers; at best you can get away with it for a short while. Any site that has been around long enough for me to recognize the name is therefore not likely to be much of a problem. Then there are also various consumer protection laws, and the protections I get by buying via credit card.
First, none of the above applies to a site that isn’t trying to make money and, in the end, doesn’t really need you. You come here to get something, which is an answer. We have rules and guidelines; if you don’t follow them, you don’t get the desired result.
All that said, I agree with you. Way too many users just barge in and blurt out their question. They don’t want to read anything. “i need u 2 hlp now urgnt!” That’s why I was trying to come up with ways to slow down the blurters, and make them read something first. If nothing else, there will be no excuse for “I didn’t know”.
So, what should the mechanism be? I think the first time you try to write a question or an answer, the system should give you an introduction. The question and answer introductions would probably be different. Keep it simple and to the point. It should fit on a reasonable page without scrolling. The user then has to click on something like “I have read and understood the rules” before the system continues to the post editor.
Now the problem becomes, how do you ensure the user actually did read the rules, not just clicked the button. There are some possibilities:
Instead of just an “I have read ...” button, follow with a quiz. We don't care how you know the rules, only that you do. If you don't get 4 out of 5 right, you get dumped back to the rules page.
Bury details necessary to proceed in the rules, like:
Take the number in the top left corner, add 3, and enter it into the box.
The magic number is 27. (This changes every time the rules page is reloaded).
You must click the blue monster icon on the next page to unlock it.
The post editor has an URGENT button. If you click it, your post gets delayed for 24 hours and will start with -2 score (will take 3 upvotes to release). This is all clearly spelled out in the rules.
Have a separate button for each rule. If each button is associated with only a sentence or two, you are more likely to read it than a whole page. You might even read the text inadvertently.
There is a lot of latitude in the spectrum from obnoxious to ineffective. I don’t know what the tradeoff should be, but we need something better than what SE does.
Except on https://puzzling.stackexchange.com/, I can’t see the complicated stuff like magic number formulas working. I think too many people would leave and never come back. Enough people are annoyed at “simple” CAPTCHAs. Make it more complex for no reason except “we want you to stare at the words next to this stuff for a minute” and you will lose a lot of people, including the experts we want to attract to be answerers.
That being said, we do need to do *something*. A separate button for each “rule” has the advantage of extending engagement with the page without making it obnoxious.
A variant of the mandatory quiz would be that you can post your question immediately, but then it will start in the closed state (with the specific close reason of “uninformed user” or similar) and will have to be confirmed by the community before it is eligible for answers. However, if you did the quiz before asking your question, your question will start in the open state. You can gain the status “informed” either by completing the quiz or by building a positive question record.
At the top of the question box, there would be a clearly visible text similar to:
Information: If you take our quiz before asking, your question will be immediately eligible for answers. Otherwise it first has to be confirmed by the community.
One might even have users go back to the uninformed status if their question record goes bad afterwards (as they obviously have forgotten the rules). This would be similar to how in Germany you may be required to take some extra theoretical driving lessons if you’re found repeatedly violating traffic laws.
Of course if re-taking the quiz doesn’t help either, harsher measures are appropriate.
SO/SE started off not planning to be a network, so it uses two databases per community (one for the main site, one for meta) plus two special databases: one for stackexchange.com and one for cross-site data. For SO Teams, they use multiple schemas within one database. They used Object Relational Mapping (ORM) but spent years getting rid of it, as they felt it was a mistake. As they’re aiming for performance, they use a lot of caching (for nearly everything except the questions and answers). While the aim is to grow at least as big as SE, we want to create our own thing, so while it may be useful to think of/model ourselves with SE levels of communities and data, we’re not aiming to recreate SE. We can, however, learn from what they did right and what they did wrong. To quote Skilvyz,
That is, we want to start with the Minimum Viable Product and learn from how it grows to implement what is best for the product, instead of forcing our views on it from the start and constricting it to be something worse than what it could be. In that vein, network features only become necessary once we have a network, and different potential usage scenarios suggest different possible database structure optimisations.
Requirements
So far, we’ve thought about the data licensing under which posts are created, and we are making the software open source. As we’re aiming for community ownership, we want a policy based on our Code of Conduct that sets reasonable limits on what a community can and cannot do. We also want a good amount of customisation to be possible, and an easy way for a community to leave with all their (non-PII) data.
It’s essential that sensitive data/PII be kept separate from the public data, so this should be done in a single separate database serving all communities. Authentication etc. shouldn’t rely on third-party login services and should be done by a separate module.
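As a very rough sketch of that separation (the table and column names here are my own invention, not an agreed schema): the shared PII/authentication database would hold e-mail addresses and credentials, while each community database stores only an opaque account id and public profile data.

```sql
-- Hypothetical sketch only; names are illustrative, not an agreed Codidact schema.

-- Single network-wide PII/auth database:
CREATE TABLE accounts (
    account_id    bigserial PRIMARY KEY,
    email         text NOT NULL UNIQUE,      -- PII, never copied to community databases
    password_hash text NOT NULL,             -- only the auth module touches this
    created_at    timestamptz NOT NULL DEFAULT now()
);

-- In each community's public database: no PII, just an opaque reference.
CREATE TABLE community_users (
    account_id   bigint PRIMARY KEY,          -- refers to accounts in the PII database
    display_name text NOT NULL,
    joined_at    timestamptz NOT NULL DEFAULT now()
);
```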
We would like to have some form of integration between sites and potentially even instances (in the long run), and some way for users to access network profiles, which could contain questions from sites they visit, notifications, etc. (similar to a user’s network profile on SE). On the other hand, few users want to be a user on every site, and a number actively don’t want this. So, we want an easy way for users to join new sites without automatically joining them to all sites.
There will be a lot of data and a lot of tables, as virtually everything (e.g. votes, edits, etc.) will need to be permanently saved in detail. Storage and RAM are cheap, so this should be fine from a cost perspective. In order to have good performance, we need to ensure that the right data is in the right place for easiest access. Having said that, it might turn out that we’ll find a better way of storing everything after MVP, so the database structure will likely change at some point. The most common data access will be viewing question pages, so this needs to be heavily optimised, and we want to keep the number of queries per page to a reasonable minimum, although it will be more than one per page. A stored procedure (sproc) will help with this, as it has less overhead. Having separate tables for questions and answers may also help with performance: while PostgreSQL has methods to help with the performance of large tables, these methods can add extra complexity to the code, and we want to keep things as simple as possible for as long as possible.
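To make the “one round trip per question page” idea concrete, here is a hedged sketch of what such a stored function could look like in PostgreSQL; the table and function names are assumptions for illustration, not the actual design.

```sql
-- Hypothetical sketch: fetch a question and its answers in a single call.
CREATE FUNCTION fetch_question_page(q_id bigint)
RETURNS TABLE (post_id bigint, is_question boolean, body text, score int) AS $$
    SELECT id, true, body, score FROM questions WHERE id = q_id
    UNION ALL
    SELECT id, false, body, score FROM answers WHERE question_id = q_id;
$$ LANGUAGE sql STABLE;

-- Usage: one query from the application instead of several.
-- SELECT * FROM fetch_question_page(42);
```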
Options
1 instance per community
1 database per community
1 schema per community
a lot of people have different definitions of ‘schema’, so what this option means can be hard to pin down
here, something like ‘multiple copies of the database design/structure (containing different data) but all contained within a single database’
each table has a ‘community’ field, so there is 1 database for all communities combined (a sketch of this and the schema-per-community option follows the list below)
multiple databases with multiple communities in each
In addition, there will be a separate database for PII etc.
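For concreteness, here is a hedged sketch (with made-up table names) of how the schema-per-community option and the ‘community’ field option could look in PostgreSQL; the real structure would of course be decided later.

```sql
-- Option: one schema per community within a single database.
CREATE SCHEMA cooking;
CREATE SCHEMA writing;
CREATE TABLE cooking.questions (id bigserial PRIMARY KEY, title text, body text);
CREATE TABLE writing.questions (id bigserial PRIMARY KEY, title text, body text);
-- A query picks its community via the schema, e.g. through the search path:
SET search_path TO cooking;
SELECT id, title FROM questions;

-- Option: one shared set of tables with a 'community' field.
CREATE TABLE all_questions (
    id           bigserial PRIMARY KEY,
    community_id int NOT NULL,   -- which community the question belongs to
    title        text,
    body         text
);
SELECT id, title FROM all_questions WHERE community_id = 1;
```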
Advantages and disadvantages
Each community having its own instance is a possibility, as this allows for a lot of flexibility for each community. It also allows for easy spreading over multiple servers in multiple locations, and we wouldn’t need to worry about how to deal with communities ‘splitting off’. However, that is only because under this option every community has effectively already split off and everything is very separated. This might end up happening naturally to some extent in the future (e.g. SO, Meta SE and the SE Network are all fairly separated, and we want communities to be free to make this decision themselves).
Having a single database per community allows for better optimisation of non-network queries, which are going to be the most frequent queries by far. Cross-site queries would have to be done by middleware/APIs, although these will be much rarer than regular (‘within-site’) queries. This is nevertheless within the constraints of Codidact, although outside the constraints of the Codidact ‘core’ repository. Middleware would also be required for user profile management and question migrations. The extra database for PII can be used throughout the network, and any changes to the database design or high-level moderation can be scripted. In particular, this allows for canary releases to minimise problems caused by new software versions and limits the damage caused by potential ‘disaster events’.
Using a single schema per community makes the communities less isolated than a single database per community and makes cross-community queries easier to implement. Although regular queries are still fast, PostgreSQL would struggle to cope with the sheer number of tables at SE scale, making these cross-community queries potentially slow at scale. It’s also less secure and would need to be able to cope with huge numbers of concurrent connections. As a result, this method would be difficult to scale well.
A ‘community’ field in each table would require fewer tables and allow for easier network usage and moderation, but it would also be difficult to implement, would run into issues with concurrent connections and the volume of queries, and would simply get too big at scale, meaning that either difficult-to-implement methods for dealing with massive databases would have to be used, or we’d have to split the database anyway. While it would create a more ‘linked’ network, allowing for easier migrations and easier network moderation, this would also lead to corresponding issues, such as making it harder to perform maintenance on individual communities and harder for individual communities to start their own instance.
Having multiple databases with multiple communities in each would scale much better than having a single database for everything, but as cross-site queries would still require work to be done across databases, this solution is no easier than having a single community per database.
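To illustrate what “work across databases” might mean at the database level (this is my own illustration, not a decision from the discussion above), PostgreSQL’s postgres_fdw extension lets one database expose and query tables that live in another; the server, schema and table names below are made up.

```sql
-- Hypothetical sketch: reading another community's questions via postgres_fdw.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER writing_db
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'db2.example.org', dbname 'writing');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER writing_db
    OPTIONS (user 'readonly', password 'secret');

-- Make the remote questions table available locally, then query it normally.
CREATE SCHEMA remote_writing;
IMPORT FOREIGN SCHEMA public LIMIT TO (questions)
    FROM SERVER writing_db INTO remote_writing;
SELECT id, title FROM remote_writing.questions WHERE score > 10;
```

Whether this kind of work happens in the database like this or entirely in middleware is exactly the trade-off being weighed between these options.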
Sorry to comment at this late stage, but perhaps there’s some use-case for a materialized view – e.g. of joined data – instead of denormalising tables.
Apparently, however, they cannot be updated incrementally, e.g. to update a specific record of a view.
So you could think of mimicking them with tables – but conceptually distinguish the permanent/normal tables from the temporary/current tables.
I’d suspect that data is read more often than it’s written, so writing twice, to INSERT into a data table and to UPDATE a corresponding view table, might have relatively little effect on performance, compared to the benefit of improved SELECT performance (from a table that’s the moral equivalent of a materialised join).
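As a hedged sketch of both ideas (table and view names are illustrative only): a materialised view gives the fast SELECT but can only be refreshed wholesale, whereas an ordinary summary table can be updated record by record alongside each write.

```sql
-- Materialised view of joined data: fast to read, but PostgreSQL can only
-- rebuild it as a whole, not update individual records.
CREATE MATERIALIZED VIEW question_summaries AS
SELECT q.id, q.title, q.score, count(a.id) AS answer_count
FROM questions q
LEFT JOIN answers a ON a.question_id = q.id
GROUP BY q.id, q.title, q.score;

REFRESH MATERIALIZED VIEW question_summaries;

-- The "mimic it with a table" alternative: write twice, read cheaply.
CREATE TABLE question_summary_table (
    id           bigint PRIMARY KEY,
    title        text,
    score        int,
    answer_count int NOT NULL DEFAULT 0
);
-- On inserting an answer, also bump the corresponding summary row:
UPDATE question_summary_table
SET answer_count = answer_count + 1
WHERE id = 42;  -- hypothetical question id
```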