How do we separate distinct communities?

Marc.2377 · 12 December 2019 17:09

@ArtOfCode: For the (IMO rare) instances where we need to query cross-database, there are facilities to allow that.

How big of a trouble do you have in mind? Just a matter of repeated/multiple queries vs just one (unified), or something more limiting?

Unfortunately it appears that “schema” is a bit of an overloaded word and means something different in MySQL - but, assuming it carries the same meaning in Postgres as it does in SQL Server: a schema is analogous to a namespace in an application, while a database is similar to an application instance itself.
I prefer to have separate databases where application instances are entirely (or almost entirely, save for very few corner cases) independent; while reserving schema separation for concepts that pertain to each application instance (like we do with namespaces at the application level).

The advantages of one DB per site are that backup and general maintainance operations can be much faster (and follow different schedules), and taking down specific clusters without affecting the entire “network” is possible. In terms of runtime performance, it very much depends on the engine and I’m not familiar enough with Postgres to make any authoritative assertions, but in general, smaller DB sizes allows more optimal working of internal query optimizers, optimizing memory usage and also CPU usage to some degree, and it’s sometimes the case, though not always, that replication works better too. Optimal working doesn’t necessarily translate to better, perceptible performance, but it most certainly translates to cost savings.

Some reading material:

The bold part: Not entirely sure about that, as it’s not the approach I’ve primarily had in mind… each site would, as I envision them, be a separate web application that runs on its own context. This differs from the current SE software topology (see Discord) but for it’s own, relevant set of reasons. Anyway, this right here reminds me we gotta talk, myself and devs, about this specifically. Will ping contributors over the chat.

No, unfortunately it’s much, much earlier than that, believe me. Some of our communities should one day hit impressive db size stats, and when the day comes we’ll be thankful we didn’t make the wrong decision to support that scenario.

[Edit] @manassehkatz, I’ll have to get back to your message later - tonight.

manassehkatz · 12 December 2019 17:36

Not that much of an issue with AWS (or similar). DB, no matter what the size, should not be locked or down for any significant amount of time during backup.

Well, that is consistent with “separate DB”. But I envisioned a single instance being a single web application, no matter how many communities are inside it. It just makes a lot of things easier, when in conjunction with a single database. That doesn’t mean limit to one physical server - there are a number of ways to have multiple web servers, but having all communities within a Codidact instance use one set of server(s) means that any scaling does not concern itself with the relative size of each community - as they grow or shrink (due to many different factors), the system as a whole should adapt (or be modified as needed), independent of each specific community’s traffic, storage, etc.

JackDouglas · 13 December 2019 19:52

I’m generally aligned with manassehkatz’s post, but on the particular question of partitioning I don’t think you will need it or benefit from it:

partitioning adds significant complexity. it is worth it if it solves an important problem of course
the important problem partitioning solves isn’t ‘breaking down large tables for performance’. Large tables (ie billions or trillions of rows) are not necessarily slow: any more than looking up someone’s number in a large telephone directory isn’t necessarily much slower than doing so in a small directory. With this sort of application all access will be begin with some sort of index scan. The problem partitioning does solve is fast archiving of old data — dropping a partition is much quicker and cheaper than a big delete.

HeapUnderflow · 14 December 2019 19:16

Separate databases make it harder to do cross-posts, which may be desirable.

cbrumbaugh · 14 December 2019 21:53

It seems to me that storing the comments (for example) from multiple sites in a single table instead of in a different DB means that we would need to add a site table and then an foreign key back to that site table.

I don’t see a single INT column and foreign key making that huge of a difference to performance, and especially not right away. It’s might also make development go faster, because with multiple DBs there is always the chance that table structure drifts. There is also a time cost when you have to make changes to multiple DB’s instead of just one.

I have done small systems where it was done with many similar (but not perfectly exact DB) and in my experience, a single system with a foreign key to the site makes more sense.

appgurueu · 15 December 2019 12:58

I am also in favor of a single database. I don’t see why we should have separated, distinct communities. Choosing SE as an example, I see lots of intersections across communities like Worldbuilding, writing, etc or code review, SO and the like. I’m therefore for somehow grouping together similar content, but based on a hierarchical organization. If one is interested in writing, they would have to look for topics about writing, much like SE tags. Possibly “tags” (which would have hierarchical structures, not like on SE, where tags are just a huge set of words) would also allow to somehow “score” them, sort of rating “how much” something got to do with a certain topic…

manassehkatz · 15 December 2019 15:16

@appgurueu I think we need separate distinct communities. As others have mentioned, simply using tags to separate doesn’t work well - for new users in general and for everyone with controversial topics (e.g., religion). That being said, the database should all be together, which will allow cross-posting (if we allow that), migration between communities, consolidated reporting, etc. to all be much simpler than with separate databases.

HeapUnderflow · 15 December 2019 16:06

The only downside I can see to using the same DB is that it makes migrating a community from one instance to another a la MathOverflow harder.

manassehkatz · 15 December 2019 16:12

Actually, it doesn’t make moving an entire community harder, at least not in any big way, because when you make a move like that, except in a very controlled environment which multiple instances would not be in our context, you can’t simply move a bunch of DB files around, you export a bunch of data and then import a bunch of data. The only extra piece would be selecting on the community ID in the source instance (trivial) and changing the community instance on-the-fly during import. Of course, if the destination instance already has any data at all then all ID #s would get changed in the import process anyway, so community ID as one of them is truly no big deal.

lavendercake · 18 January 2020 13:06

My 2 cents: databases at SO’s scale tend to push SQL Server to its limits (of course, they do have a lot of data). If Codidact had one database for all data, it does help development upfront and make data management easy but when there’s a lot of data it can cause problems.

I’d vote for decentralizing bulky parts of the database and make use of cross server queries or make use of a middleware that makes working with the data easier. For instance, a separate database for outgoing data like emails or notifications, or a database for user-uploaded content.

A single database can handle the load from questions, answers, and users for all the sub-sites.

DoctorJones · 29 January 2020 10:14

Here are my 2 cents on the database discussion.

Using schemas to seperate communities seems like an abuse of schemas. Using schemas is supposed to separate different tables into different logical groups. Tables can have the same name across different schemas, but they are not treated as the same table, i.e. they can, and IMO should have different column definitions. If they should have the same column definitions, they should be within the same schema (i.e. be within the same table), to enforce that they don’t diverge.

In my opinion, grouping multiple comminities into a single database only gives disadvantages. The approach doesn’t scale. By sharing the same database, you cannot perform maintenance on individual communities. You have to take them all down at the same time, and if you encounter problems during maintenance, you’ve caused maximum inconvenience to everyone, instead of just one community. If (or when) we have DB related performance problems, you won’t be able to insulate your smaller communities from the problems, again because they’re on a shared DB. By going down this route, you make everything else less straightforward, and you place all of your eggs in one basket.

What’s the gain for placing multiple communities in a single database? What advantage does it provide? If the answer is “it makes reporting easier”, I’d say we’re putting the cart before the horse. Reporting should be achieved by establishing a data warehouse, that acts as an aggregate source.

In my opinion, every database should be identical. They should all have the same tables, and the same schemas. This allows you to use diff tooling (SQL compare), and ORM tooling that expects schemas to be used in the way they’re intended (i.e. to group different tables, not to have multiple copies of identical tables). It also allows us to move databases freely to different servers, or even different regions, e.g. it might make sense to host European language sites on European servers. You couldn’t do this if you were using one mega database.

Finally, having a single mega database just seems dangerous to me. If anything bad were to happen to that database, or that server, then you’ve lost everything, or affected everyone. I’d feel much more comfortable with having each community insulated from the rest. What if the data was lost, what if we have a data breach, what if a security exploit is discovered that exposes data, etc. By splitting the communities up, you’re greatly limiting the impact any of those disaster scenarios can have to just one community, instead of all communities.

celtschk · 29 January 2020 12:56

One negative effect of separate databases on SE is that you have different user numbers on different sites.

DoctorJones · 29 January 2020 13:13

~~Some people see that as a positive effect.~~

Sorry, I misread that as “different user names”. What do you mean by “user numbers”? Do you mean user id?

celtschk · 29 January 2020 14:53

Yes, I mean the user Id. Given that the accounts are publicly linked, you cannot even take it as a privacy feature; there’s simply nothing positive about it.

DoctorJones · 29 January 2020 15:06

User id is just an implementation detail, there’s no need for any users to know or care about it, it’s literally something that only developers should be concerned about.

Assuming that we’re going to use some kind of single sign on / oauth provider, it would be easy to associate accounts that sign in via the same gmail account, etc. User id doesn’t even need to be exposed in the site url, so I don’t see the importance.

manassehkatz · 29 January 2020 15:12

Not going to rehash everything. We’ve (not you @DoctorJones but a bunch of other people over the last few months) gone through this multiple times and I thought we had a final decision to have one database. There are multiple reasons: moderation (often helpful to see posts by a user on multiple communities), migration between communities, reporting, user profile page (showing total questions or rep. or whatever from all communities), backup/restore, etc. The data breach concern doesn’t change as the PII is all together in one place (single set of user records/credentials/etc. for all communities in an instance). We are not, realistically, talking about a single instance that will be so large that treating it all as one database will be a practical issue, and there are definite advantages (in addition to the above, not having to worry about one community having a different version of the schema than another community) to having one database for all.

Olin · 29 January 2020 15:51

I’m not sure how relevant this is because some of this discussion is past my expertise. However, I did find it necessary to know my user ID on SE.

You can pick anything for your SE account name. That’s good, because you aren’t forced to pick something not used before, like Smith97, Smith98, etc. The downside is that a user name doesn’t uniquely identify an account. That causes some ambiguity when @<username> is used, but not a big deal. The scope of @<username> is limited enough from context that there are only a small number of aliases, if any.

However, I found this a problem when running searches looking for something I had previously written. The reliable way to limit that was “user:4512” on EE.SE, especially when I wasn’t logged in.

So, if you allow arbitrary user names and therefore rely on an internal user ID for uniqueness, make sure that this user ID is somehow available, like on the profile page of the user or something.

DoctorJones · 29 January 2020 16:36

It seems dubious to handwave this away. You’re effectively saying “The data breach concern doesn’t change if we don’t follow best practice, and recycle credentials for each environment”.

My point was that this cannot be mitigated if we use a single database for all communities. It absolutely can be mitigated if we follow best practice, and keep all environments isolated from each other as they should be, including different credentials for each environment.

There are other ways to achieve all of these, that don’t involve using one large database. If you go down this route, and make these features by relying on everything being in one database, when you finally do need/have to split communities into their own databases, you won’t be able to without breaking these network wide features.

You’re painting yourselves into a corner that will be very difficult to get out from.

Sorry for being late to the party, but I was asked specifically for my input on this issue.

manassehkatz · 29 January 2020 17:34

Perhaps part of the reason for our difference of opinion has to with communities vs. sites vs. instances. In my understanding, Communities are part of a single Instance of Codidact. In some ways that makes them no different from multiple Chat Rooms in chat system or multiple Topic Categories in a forum. Yes, there will be some positive action required to say “I want to be a part of Writing” vs. “I want to be a part of Worldbuilding” etc. But in reality you start by saying “I want to be a part of Writing” and the system responds with “OK, set up an account (email, confirmation, blah blah blah” and now you are part of the Codidact system and have privileges of “Trust Level 0 on Writing” and “read only, just like the people who haven’t registered, for everything else”. Then you post some Q&A, build up to Trust Level 2 and then say “I want to join Worldbuilding”. The system says “OK, you’re already a member of Codidact so let’s see: You have TL2, so we’ll get you in to Worldbuilding starting at TL1”. It is all one system. So one database, one set of credentials, etc.

Marc.2377 · 30 January 2020 21:43

Indeed I specifically requested Matthew’s input on this.

(Thanks for sharing your insights, and we’ll get back to it soon.)