How do we separate distinct communities?

Here’s a single off-topic message from me on this thread. If it goes too far, we’ll split it off into a separate topic.

This is a topic where you and I have disagreed since our earlier discussions, and more so as time passes.

You see, members joining this project sometimes have an emotional attachment to their “communities”, so to speak - a personal motivation that drives their efforts when it comes to a particular endeavor. I know Monica had deep involvement with the Writing and Judaism communities, to name just two. I imagine you had your own.

As for me, my “main” community was Stack Overflow. As I stated before and will reinforce now, that’s what I’m aiming to replace in the long term. That’s the new home I’m looking to build.
(addendum: I am/was *very* fond of Unix.SE, Workplace.SE, EE.SE and others too. Mainly SO, though.)

I understand you said, then and now, that we can’t realistically hope to compete with / displace SO. And I’ll continue to disagree.

Even if you end up being right in the end, there’ll be no harm done by pursuing higher requirement standards from the get-go. You know, “shoot for the moon and if you miss…” and all.

:slightly_smiling_face:

6 Likes

Thank you for the explanation, manassehkatz - I can see your line of reasoning now.

As you read the following, please keep in mind the spirit in which it is intended. I’m trying to help steer the project in the right direction, and I would be remiss if I didn’t mention my concerns. I hope this input is helpful.

You say that each community in the network will be analogous to a chat room, but I’m not sure if that is the appropriate paradigm to have in mind. Correct me if I’m wrong, but I believe the intention of this project is to replace Stack Overflow / Stack Exchange, so in an ideal world, we would become just as big as they are. That’s the paradigm we should be using, and the goal we should aim for.

If we become a victim of our own success, then on the current plan we will have painted ourselves into a corner: we cannot split the database apart and scale up without a major rewrite of the backend, because all of the site network features rely on everything being in one database. In this worst-case scenario, we will have serious performance issues on every site on the network, because the database will be the bottleneck, and we won’t be able to fix it without breaking or rewriting all of the site network features.

Have we considered how many communities we might have in the network, and what that would mean for the database schema? Let’s say we’re 10 years down the road and we’ve got 100+ tables that each site needs. Let’s say we have as many communities as Stack Exchange currently has, so that’s 170+. That means we’ve got 17,000 tables in a single database. Have we done any research into how PostgreSQL scales to such a large number of tables? From some initial searches, it doesn’t look like an advisable approach.

I don’t know much about PostgreSQL, but I can think of lots of problems this would cause with MS SQL Server.

  • Execution plan caching would not be able to cope with the volume of frequently used queries. I assume the “view question” query would be the most popular, but there would be a version of that query for each community (170+). That means lots of other queries that should have cached plans won’t be able to keep them.
  • Concurrent connections would be an issue: more connections means more RAM usage, and RAM will already be at a premium from all of the cached execution plans. With so many users, from so many sites, all hitting the database, we’d need to support a large number of connections.

Think about how some of the queries would scale as we add more communities. A query that involves just 3 or 4 tables from each community becomes unwieldy very quickly as the number of communities increases (170 * 3 to 170 * 4 for a modest query, which is between 510 and 680 tables in a single query). Imagine how that would scale for a more complex query.
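To make that concrete, here’s a rough sketch (assuming hypothetical per-community schemas and table names, nothing that has actually been designed) of what a single network-wide query could look like under a schema-per-community layout. Each community adds another arm to the statement, and because every arm names different tables, a single cached plan can’t cover them either.

```sql
-- Hypothetical sketch: a "recent posts across the network" query under a
-- schema-per-community design. Schema and table names are placeholders.
SELECT 'writing' AS community, q.id, q.title, q.created_at
  FROM writing.questions q
UNION ALL
SELECT 'programming', q.id, q.title, q.created_at
  FROM programming.questions q
UNION ALL
SELECT 'superuser', q.id, q.title, q.created_at
  FROM superuser.questions q
-- ... one more arm for each of the remaining 170+ communities ...
ORDER BY created_at DESC
LIMIT 50;
```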

Other questions that need answering

  • Should we limit how many communities we can support? What would that limit be?
  • Are we going to support communities self-hosting the software, with their own DB?
  • How could we include them in the network if we’re relying on everything being in 1 DB?
  • Would self-hosted communities be able to be part of the network of sites?

At the very least, we need to make a prototype to see if this approach is viable. I don’t know for sure, but my experience tells me this is a bad idea, and we’re setting ourselves up for failure.

1 Like

As an educated guess, I agree with you. However, this discussion has become so big (rightly so) that I think we need to let some numbers do the talking. Maybe the two of us (and everybody who agrees) are actually panicking way too much, who knows.

…and just now, DoctorJones posted something in the same direction I was going.
Except they assume new tables for every community, whereas I think the idea was to have them all in the same tables?

However, even in that case, it doesn’t seem like a smart decision to just go for the “everything burrito” (everything in one DB) unless somebody can show, modeled on SE’s numbers and growth, how long we would be able to sustain storage with acceptable performance.

1 Like

I admit I don’t know much about databases, but I wonder why we would need separate tables for separate communities. All communities can have the same types of posts (questions, answers, discussions, blogs), so my (probably naive) assumption would be that you could just add a single field to each table saying which community the entry belongs to.

That makes migration of a post to another community nothing more than the update of a single field.
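For illustration, here’s a minimal sketch of that approach (table and column names are made up for the example, not a proposed schema):

```sql
-- Hypothetical single-schema layout: every post row carries a community_id,
-- so "migration" of a post to another community is a one-field UPDATE.
CREATE TABLE communities (
    id   bigserial PRIMARY KEY,
    name text NOT NULL UNIQUE
);

CREATE TABLE posts (
    id           bigserial PRIMARY KEY,
    community_id bigint NOT NULL REFERENCES communities (id),
    post_type    text   NOT NULL,  -- 'question', 'answer', 'discussion', 'blog'
    title        text,
    body         text   NOT NULL,
    created_at   timestamptz NOT NULL DEFAULT now()
);

-- Most queries filter by community, so index on community_id first.
CREATE INDEX posts_community_created_idx ON posts (community_id, created_at DESC);

-- Moving post 123456 to community 42:
UPDATE posts SET community_id = 42 WHERE id = 123456;
```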

2 Likes

Everything is so fragmented that I cannot find where I read it (perhaps Discord), but I was under the impression that we were going with the proposal of having a schema per community.
i.e.

Writing.Questions vs Programming.Questions vs SuperUser.Questions.
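That is, roughly speaking (schema and column names purely illustrative, not anything decided):

```sql
-- Hypothetical schema-per-community layout: each community gets its own
-- schema containing a full copy of the table set.
CREATE SCHEMA writing;
CREATE SCHEMA programming;

CREATE TABLE writing.questions (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title      text NOT NULL,
    body       text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE programming.questions (LIKE writing.questions INCLUDING ALL);

-- Every structural change has to be repeated once per schema:
ALTER TABLE writing.questions     ADD COLUMN closed_at timestamptz;
ALTER TABLE programming.questions ADD COLUMN closed_at timestamptz;
-- ... and so on for every other community ...
```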

If that’s not the case, we can ignore the concerns about having a very large number of tables.

We seriously need a decision log wiki (or something equivalent), with references to where these things were discussed. If we already have that, could someone please point me at it? :slight_smile:

One of the things I have been hoping for (down the road) is integration between (among) instances using a common API. This would allow users to see all their communities – the ones on codidact.com and the ones on programmers.co.uk and the ones on fan-communities.com and so on – in a single site list. This list would be grouped by instance, so it’d be really clear when you’re on whose, which matters because instances might have different policies. If we do things right with the API, we could even support navigating to/from TopAnswers communities this way.

I said the list would be grouped by instance, but for instances hosted by the same organization with the same rules, we would include an option to mingle the communities from the instances involved. This means that to the user, our replacement for SO would be just another community on our list, even though at the back end it’s a different instance.

By taking this approach, we gain the simplicity of a single instance for a bunch of smaller communities while preserving our ability to expand. I think of this as just-in-time federation: don’t do a lot of separation we don’t need now, but keep the door open by planning for expansion now.

Tagging @Marc.2377 and @ArtOfCode to make sure they see this, because we haven’t discussed it yet.

2 Likes

I’m not qualified to weigh in on the database architecture discussion, but can discuss the user experience.

I remember being rather confused when first joining SE about what was SE-wide and what was per-site (like EE, Photography, Physics, etc.). I thought that once I log in, I’m “in”. Having to “join” individual sites and then ending up with separate profiles didn’t make any sense to me. And then there was SO, which was apparently special somehow.

I can see some point to being able to say something different in your profile per site, but that’s pretty low priority. It’s not what I would consider MVP in your parlance.

It also never made sense to me that I had to explicitly “join” a site to participate. I’m already logged in, so the system should know all it needs to know about me. If I follow a HNQ link or otherwise stumble upon something interesting, why can’t I comment or answer on the spot? I understand I start with 0 rep, so there needs to be some per-site information stored about me. But it was unintuitive that I had to stop what I was doing and “join” for that to happen. I remember this stopped me a few times from adding information that was missing from the existing answers; I thought to myself, screw this, and moved on.

If there is a good technical reason for this, that’s fine, but then it should be explained better up front. Perhaps all that would have been needed was something you read when first creating an account. That would explain the overall structure of the whole system, the individual sites, what is per-system and per-site, what you need to do to move around, etc.

3 Likes

It’s not a technical reason (AFAIK) but a human one. On SE, moderators have access to PII for all users on their sites. Not all users trust all 500 moderators. Users have deleted accounts specifically for this reason. I’ve seen it. I’ve done it. It’s rare but it happens.

On Codidact we are not planning to give moderators this access. Instance admins will have it, but they would have it anyway.

There is a second reason: some people don’t want to be publicly seen as being associated with certain sites. This led to SE implementing pseudo-hidden communities (not really hidden), but that still does nothing for Google results. Some people don’t want a search on their names to turn up user profiles on certain sites. I don’t know how important this consideration is for us, at least until we have good SEO juice.

All that said, we can and should make it as easy as possible for a signed-in user on one site to say “yeah, add me here” on another. The front-end team is talking about flows for login/signup, so that when you try to do something that requires an account, you get a dialogue right there prompting you to do it and then returning you to what you were doing. We haven’t discussed the “network association” case specifically yet, but hooking it in there is straightforward from a UX perspective.

3 Likes

Yeah, maybe you want to use your real name on Physics, Photography, and Electrical Engineering, but not on The Workplace where your boss might see you. You should still have a system-wide account, but site-specific accounts can override some information displayed for that site. Again, I don’t think this needs to be MVP.

I also don’t want to make the implementation difficult. Choose what you need to. But, if it ends up being unintuitive (like SE), there needs to be a document explaining this all clearly that is presented to you when you create your master account.

1 Like

Sorry for all the confusion - that’s what happens when a bunch of random people on the internet get together to solve the world’s problems :slight_smile:

I see 3 very different ways of handling the data and if I’ve got this wrong, including wrong terminology, let me know. (I am an “expert” compared to my customers, but here I am just “ordinary”). Examples using 100 tables needed to “run a community” and 170 communities.

  • One database, one schema: 1 database, 101 tables (100 tables with community_id all over the place plus one communities table).
    • Advantages: Few tables, easy queries (including migration, user activity across communities, system-wide reporting, etc.), all data for all communities stays in sync when structural changes are made (which will be needed despite our best efforts), easy backup/restore.
    • Disadvantages: Tables may get “too big”, corruption of data (malicious or software errors) likely to affect all communities simultaneously.
  • One database, multiple schemas: 1 database, 170 schemas, 100 tables per schema, 17,000 tables per database. (Though likely plus 1 small schema for management of the system-as-a-whole-across-all-communities.)
    • Advantages: Easy backup/restore, centralized storage, moderate level of difficulty for cross-community queries/migration/etc., queries within a single community are fast.
    • Disadvantages: 17,000 tables! (I 100% agree that PostgreSQL would not likely do this well), structural changes to tables have to be performed across 170 separate schemas.
  • Multiple databases: 170 databases, 1 schema per database, 100 tables per schema. (Though likely plus 1 small database/schema for management of the system-as-a-whole-across-all-communities.)
    • Advantages: Highest level of protection between communities. Queries within a single community are fast.
    • Disadvantages: Cross-community queries require connections to multiple databases, backup/restore (system management in general) much more complex, structural changes to tables have to be performed across 170 separate databases.

I am clearly in favor of “one database, one schema, 101 tables”. I realize that this could be a problem if we get to 170 communities in one instance (aka the primary Codidact instance, which is the main point of this discussion). I actually tend to think there will be fragmentation before then, but who knows.

The flip side is we might have the same traffic but more communities - e.g., a Community for each major language family (Python, PHP, etc.) instead of just a giant StackOverflow-like instance for general programming questions. We really don’t know yet.

I do think that PostgreSQL can handle a moderate scale of combined community traffic just fine. According to Google, via Quora (!), SO had 18,179,078 questions in September 2019. If we take the total Q&A world that we host as 2x SO (SO being clearly the largest of the SE network), that’s 36 million questions after many years (though we may get there quickly if, as planned, we import a lot of existing content). 36 million questions at 3 answers per question (a number made up out of thin air) = 144 million nominal posts. That is, IMHO, not an issue at all for an appropriately sized (enough RAM, etc.) PostgreSQL instance, as long as it is designed (indexed) well to support the common queries.

Throw in 10 comments per post (made that up too), and that’s 1.4 billion records in the largest table (except for audit trails), which starts to creep up on the “is it too big?” scale. Though if your largest community is 1/2 that size, the issue really doesn’t change much - i.e., it is almost the same for that one community as for all communities together. (If your largest community is 1/170th the size, then it is a whole different ball game - 1.4 billion records / 170 communities = ~8 million records, which is “nothing”.)
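If that largest table ever does become a worry, one mitigation worth noting (purely a sketch of a standard PostgreSQL feature, not a design decision) is declarative partitioning by community_id. The application still sees the single-schema design; only storage and indexes are split underneath.

```sql
-- Hypothetical sketch: hash-partitioning the comments table by community_id.
-- Column names are illustrative. Note that the primary key of a partitioned
-- table must include the partition column.
CREATE TABLE comments (
    id           bigserial,
    community_id bigint      NOT NULL,
    post_id      bigint      NOT NULL,
    body         text        NOT NULL,
    created_at   timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (community_id, id)
) PARTITION BY HASH (community_id);

-- e.g. 16 partitions; rows are routed automatically based on community_id.
CREATE TABLE comments_p0 PARTITION OF comments
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);
-- ... comments_p1 through comments_p15 defined the same way ...
```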

3 Likes

Well, there’s a middle ground between 1 and 3: Allow multiple databases, but also allow multiple communities per database. That way one can start out with one database, and then when that database starts to grow too big, the databases can be split based on the actual sizes at that time.

So to fit it into your table, the scheme would be something like:

  • Multiple databases with multiple communities each: 1 master database, 3 content databases. 1 schema per database. 101 tables per content database. The master database contains all user data, a list of communities, and the information about which community lives in which database. The content databases contain the data for a subset of communities. For example, it might turn out that the first content database contains the 5 biggest communities, the second content database contains the next 50 communities, and the third contains the remaining 115 communities. Or whatever distribution is more advantageous. (A rough sketch of the master database follows after this list.)
    • Advantages: We don’t need a new database for every small community; schemas have to be kept consistent over only a few databases rather than many, and generally every complexity that scales with the number of databases will be much lower than with one database per community. Moreover, the complexity of several databases can be deferred until database size really becomes an issue. At the same time, unlike with only one database total, we don’t have any scaling issues (unless a single community gets large enough that even a database containing only that one runs into issues).

    • Disadvantages: We still have cross-database complexities to deal with.
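To illustrate the idea (all names made up; this is only a sketch of what the master database might hold): it is mostly a routing table telling the application which content database hosts which community.

```sql
-- Hypothetical master-database tables for the "multiple communities per
-- content database" scheme.
CREATE TABLE content_databases (
    id   serial PRIMARY KEY,
    name text   NOT NULL UNIQUE,          -- e.g. 'content_1'
    host text   NOT NULL,
    port int    NOT NULL DEFAULT 5432
);

CREATE TABLE communities (
    id                  bigserial PRIMARY KEY,
    slug                text NOT NULL UNIQUE,  -- e.g. 'writing'
    content_database_id int  NOT NULL REFERENCES content_databases (id)
);

-- Routing lookup the application would perform before touching community data:
SELECT d.host, d.port, d.name
  FROM communities c
  JOIN content_databases d ON d.id = c.content_database_id
 WHERE c.slug = 'writing';
```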

3 Likes

But for that, the SE-style partial separation is no good. I already argued on Discord, back before we had this forum, that we should provide true multiple identities per account (that is, identities where only the user and the site admins can know that they are the same user; since the shared account is known to the system, using this feature for sockpuppeting can be prevented by the system). But each identity should be allowed to span multiple communities.
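To sketch what I mean (table names are only illustrative): the system privately knows which account owns which identity, so sockpuppet checks remain possible, while each identity can be attached to any number of communities.

```sql
-- Hypothetical sketch of "one account, several identities, each identity
-- usable on several communities". The account_id link is never shown publicly.
CREATE TABLE accounts (
    id    bigserial PRIMARY KEY,
    email text NOT NULL UNIQUE
);

CREATE TABLE identities (
    id           bigserial PRIMARY KEY,
    account_id   bigint NOT NULL REFERENCES accounts (id),  -- private link
    display_name text   NOT NULL
);

CREATE TABLE identity_communities (
    identity_id  bigint NOT NULL REFERENCES identities (id),
    community_id bigint NOT NULL,
    PRIMARY KEY (identity_id, community_id)
);

-- Sockpuppet check: does account 123 already have an identity on community 7?
SELECT i.id
  FROM identities i
  JOIN identity_communities ic ON ic.identity_id = i.id
 WHERE i.account_id = 123
   AND ic.community_id = 7;
```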

1 Like

Agree with the advantages regarding database size, backup/restore, server management, etc. - it becomes a “happy medium”.

But the disadvantage regarding cross-database complexity actually becomes worse than with the other schemes, because now the code has to support both multiple databases and multiple schemas within a database.

2 Likes

We’re still discussing this mostly on a schema level without considering what we actually need and what our constraints are.

Use Cases

99% of use cases are community-based and have nothing to do with cross-community anything.

What do we really have?

  • Post Migration
  • “Network” User Profiles
  • Admin Stuff
  • HNQ

What should (for thousands of reasons) not be cross-community anyways?

  • The user identity (e.g. the login) should be separate anyways (and already is)
  • Everything community moderators do
    • Putting that on “network level” is a GDPR lawsuit waiting to happen.
    • Since that is not allowed, proper technical decoupling is also advised.
  • The Core Q&A functionality

Remark: All the fancy stuff Monica envisions screams to decouple communities from each other anyways.

Technical

Moving to the technical perspective I totally agree with @DoctorJones that there’s simply no good reason to bet on a horse we can’t properly verify to be viable.

We should also keep in mind that everything you have to take into consideration when you need to scale - clustering issues, sharding, separation of read and write instances, code refactoring to reduce lock times, fine-tuning indexes and whatnot - happens significantly later if we separate the databases.

Small databases keep code complexity for the core job of Codidact lower.

Flexibility

One of the major points I see for having separate databases is being able to pilot whatever feature for whatever community exactly there. Testing things on smaller communities before rolling them out to the whole network is of major architectural value. From a flexibility standpoint, there are so many cases where it’s great to have a live instance that doesn’t affect the rest of the network that I personally would not give that up easily.

1 Like

Where do you see multiple schemas inside a database in my version?

I don’t see that in your version. It was a reference to:

This one may not be “big” in some ways. But I look at my own profile and I also look at other people’s profiles. There are ways around it with a split-schema/split-database - essentially caching the info. But that adds more complexity to the code. In one schema/database, the user profile page can be built very easily with a handful of queries.
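As a rough illustration (assuming a single-schema posts table with community_id, post_type and author_id columns; these names are made up, not an agreed schema), the per-community activity panel of a profile page is a single query:

```sql
-- Hypothetical profile-page query: one user's activity per community.
SELECT c.name AS community,
       count(*) FILTER (WHERE p.post_type = 'question') AS questions,
       count(*) FILTER (WHERE p.post_type = 'answer')   AS answers
  FROM posts p
  JOIN communities c ON c.id = p.community_id
 WHERE p.author_id = 123
 GROUP BY c.name
 ORDER BY c.name;
```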

Admittedly, migration, admin and HNQ (which we may not ever have, though I see value *if done right*, which is not the case in SomeOther systems) are less frequently used (HNQ is displayed everywhere, but that can reasonably be stored in one place).

Really? My understanding is that within an instance, any user has one identity. They may choose to have multiple public profiles - e.g., different display name, “about me” text, etc. - on each community (or “all the same”, user’s choice). But they have one set of login credentials (changeable) and one user ID # and one (changeable) email address across all communities in the instance. Some community-level user information then determines what to display, how to display it, trust level within that community, etc.

That may be true. But the flip side is: Fix a bug on one and you now have to fix it on 169 others separately.

I see that on the user interface side of things, which is easy with any configuration we are discussing. But I would argue that structural changes (adding fields to tables, etc.) are much easier if done in the sequence of:

  • Local testing by individual developers
  • One separate instance that is exclusively for development - same type of web server and database server as production (e.g., if production is AWS EC2 + RDS + S3 then so is this dev. system) but smaller to keep costs down, and access limited to developers only.
  • Move to production.

Regarding identity: my mistake. I advocate having identity separated completely, i.e. an identity server with its own DB (which has completely different usage patterns and scaling requirements from the Q&A functionality) that handles login et cetera.

That should, IMHO, not be within the Q&A database at all. User profiles can be replicated as often as needed. However, user identity and user profiles shouldn’t be mixed up.
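A minimal sketch of the split I mean (names are illustrative, and the two table sets would live in different databases): the identity side knows about credentials only, and the Q&A side only ever stores profiles keyed by an opaque user id issued by the identity service.

```sql
-- Hypothetical identity database: credentials only.
CREATE TABLE users (
    id            bigserial PRIMARY KEY,
    email         text NOT NULL UNIQUE,
    password_hash text NOT NULL
);

-- Hypothetical Q&A database: no email or password here; user_id is an opaque
-- value handed over by the identity service.
CREATE TABLE community_profiles (
    user_id      bigint NOT NULL,
    community_id bigint NOT NULL,
    display_name text   NOT NULL,
    about_me     text,
    PRIMARY KEY (user_id, community_id)
);
```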

Regarding testing.

I do not mean functional testing. I mean testing in the sense of figuring out whether stuff works the way the feature was envisioned at all. If done properly, a certain percentage of features will never be rolled out to the network. Thus, you only need to change the database back for one community.

2 Likes

As I have stated before, I don’t really care whether the authentication database is part of the same database server instance or totally separate. My only concerns (for which your concept works perfectly) are:

  • It should not rely on a 3rd-party service (Facebook, Google, etc.)
  • It should be one server/database for all communities within a Codidact instance.

I just saw a reference somewhere along the lines of “changing a database back never actually works”. That is debatable, of course. But my preference still stands - do enough testing and then roll out the feature. If you end up getting rid of the feature, fine - another code change and (if needed) another database change.

I suspect that most of the “hmm, will users like & use this feature that the devs thought was great?” cases that end up with “not!” will be more on the UI side and not so much about actual database structure changes.

2 Likes

I’ll offer counter-arguments to some specific bits here, as my time allows:

Which is to be done by middleware, probably via API(s) (not necessarily web/REST APIs); plus, these operations should be rare enough compared to the volume of day-to-day data querying from the actual core application(s).

backup/restore (system management in general) much more complex

Not quite, no. Unless one insists on doing it manually all the time, without any sort of scripting or automation.

structural changes to tables have to be performed across 170 separate databases

… and can/should be automated, too.

Authentication and *basic* authorization are to be done via a separate module.

A middleware implementation will allow for management, reporting, data querying (a la SEDE) and similar needs. And it is to reside outside the core repository.

…Which is, in fact, why I named the repository core in the first place. Not everything goes in core; the “Core Q&A” implementation does.

I’ve implemented proofs of concept using both approaches, and my honest assessment is that managing a SiteId / CommunityId FK column (as demonstrated in this minimalist diagram) and its related constraints is a painful technical complexity in itself. The additional constraint checking imposes an additional tax on the DB engine as well.
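To give a flavour of the kind of constraint complexity I mean (this is an illustrative reduction, not the actual diagram): keeping an answer and its parent question in the same community forces composite keys and composite foreign keys onto every child table.

```sql
-- Hypothetical illustration: the community_id column has to be repeated on
-- every table, and foreign keys become composite so that an answer can only
-- reference a question from the same community.
CREATE TABLE questions (
    id           bigint NOT NULL,
    community_id bigint NOT NULL,
    title        text   NOT NULL,
    PRIMARY KEY (community_id, id)
);

CREATE TABLE answers (
    id           bigint NOT NULL,
    community_id bigint NOT NULL,
    question_id  bigint NOT NULL,
    body         text   NOT NULL,
    PRIMARY KEY (community_id, id),
    -- composite FK: the referenced question must be in the same community
    FOREIGN KEY (community_id, question_id)
        REFERENCES questions (community_id, id)
);
```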

There will be complexity with the separate DB approach too, implementing the middleware, background jobs and whatever else. As an integrations specialist, I certainly know that. It just happens to appear, to me, that such complexity makes a lot more sense here.

I can try and compile/enumerate all of the reasons why I strongly believe that to be true (not today, though - still a little bit swamped with managing other parts of the project - but in 1-3 days, likely¹), but maybe the point made by some of you fellows in the above posts about the vast majority of the data being site-specific is already a strong enough argument.

Oh, and I’m currently exchanging ideas and experiences with our fellow advisor @sklivvz, and it’s been most insightful so far. I’ll share whatever pieces of that conversation he allows me to with you folks here :slight_smile:


¹: hopefully by that time I can have a couple more diagrams to share - architecture and infrastructure - not to mention we have a talk scheduled for Sunday which will perhaps allow for substantial improvements to technical documentation.

1 Like