How do we separate distinct communities?

Marco (@sklivvz) gave me the greenlight to share the contents of our conversation here. I’ll paste what I found most relevant, as food for thought.
Note, much of what is shared here is already available in the form of public talks, which can be found on YouTube. For those who want to delve deeper and can find the time, I recommend watching them.

You are correct, Stack Exchange uses one db for each community (in fact two, one for meta and one for main). The reason for this is mostly historical – the site was not built to be a network. This approach has a lot of advantages but also some disadvantages. It’s super easy and safe to use: once the correct connection string is loaded for the request, everything that happens will never interfere with the other sites. It’s also great for canary deployments: you can very easily run db migrations and deploy new code on only some test sites rather than the whole network. (…)
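
To make the “correct connection string per request” idea concrete, here is a minimal sketch, assuming a router class and a host-to-connection-string map loaded from a central database; the names are invented for illustration, and this is not SE’s actual code.

```csharp
using System;
using System.Collections.Concurrent;

public class SiteRouter
{
    // Host name -> connection string, loaded once from the central "sites" db.
    private readonly ConcurrentDictionary<string, string> _connectionStrings;

    public SiteRouter(ConcurrentDictionary<string, string> connectionStrings)
        => _connectionStrings = connectionStrings;

    public string ResolveConnectionString(string host)
    {
        // Every query in this request then runs against the site's own
        // database, so it can never touch another site's data.
        if (_connectionStrings.TryGetValue(host, out var cs))
            return cs;
        throw new InvalidOperationException($"Unknown site: {host}");
    }
}
```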

On the other hand some other stuff becomes pretty hard and in fact characterizes how the network works.
For example, there’s one copy of your user on every site and you have to “join the community” to create a user. Almost everything you see on a site applies only to that site: reputation, moderation powers, reputation levels, help center, etc. Which is great if you want this, and you basically have a lot of data that is different across sites (imagine having a “SiteId” column in basically every database table!), but not so good if you want it to be the same.

User syncing is a complex matter on Stack Exchange; in fact any information passing between sites is challenging and goes through API calls or background jobs. Do you want to ban a user on all sites? It’s a batch that calls every site to ban the user there. Do you want to migrate a question? It’s a background job.
On the other hand, if in your design you want to share information across sites (e.g. do you want to have a single user with a single rep everywhere?) then perhaps a multisite db can be better.
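
As a sketch of what such a “ban on all sites” batch looks like with one database per site (the Users/IsBanned/AccountId names are assumptions for illustration, not SE’s schema):

```csharp
using System.Collections.Generic;
using Dapper;
using Microsoft.Data.SqlClient;

public static class NetworkBan
{
    public static void BanEverywhere(IEnumerable<string> siteConnectionStrings,
                                     string accountId)
    {
        // No single UPDATE is possible: each site's users live in that
        // site's own database, so the batch visits every site in turn.
        foreach (var cs in siteConnectionStrings)
        {
            using var conn = new SqlConnection(cs);
            conn.Execute(
                "UPDATE Users SET IsBanned = 1 WHERE AccountId = @accountId",
                new { accountId });
        }
    }
}
```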

In other words: the best solution depends on what you want to get out of it.

(…) there are in fact two special dbs which are also used: one is the stackexchange.com website (…) and the other is the “sites” database, which contains cross-site data (like, for example, the ~300 database connection strings).

Scalability-wise, of course, one db per site is much, much better, as it has scalability built in. At the moment SO runs on one db cluster and SE on another. This was achieved effortlessly, just by changing the connection strings.

Finally, for Teams I think they used different schemas to achieve something similar to different dbs, whilst keeping open the possibility of making cross-site queries (which is super hard with sharded sites, and which SE essentially achieves via Elasticsearch).

3 Likes

I’m here if you need to ask any more questions. In general I strongly advise you to build the smallest possible site that can sustain a community, not a grand design. We don’t know what will happen in the future, so it’s impossible to make the correct trade-offs right now. So don’t. Do the smallest possible amount of work, make the smallest possible number of decisions, and then use facts to guide you.

6 Likes

Quite, and so I’m surprised you consider something else.

PostgreSQL supports multiple schemas and multiple catalogs (i.e. databases). From the application software’s POV the two forms of isolation are close to logically equivalent at run-time (queries against one site are isolated from the others), with one practical difference: a single connection can query across schemas but not across catalogs. Beyond that they differ mostly from a deployment+devops POV (e.g. you back up a whole catalog).
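
A quick illustration of that run-time difference, with hypothetical “math” and “physics” schemas inside one database (the connection details and table names are made up):

```csharp
using Npgsql;

// One connection to one catalog (database)...
using var conn = new NpgsqlConnection(
    "Host=localhost;Database=codidact;Username=app;Password=secret");
conn.Open();

// ...can join across schemas, because schemas are just namespaces:
using var cmd = new NpgsqlCommand(
    @"SELECT (SELECT count(*) FROM math.posts)    AS math_posts,
             (SELECT count(*) FROM physics.posts) AS physics_posts", conn);
using var reader = cmd.ExecuteReader();

// ...but it cannot reference tables in another catalog; this fails:
//   SELECT count(*) FROM other_database.public.posts;
// You would need a second connection (or foreign data wrappers).
```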

For cases where you don’t want this, I’d suggest still using separate databases for separate sites but expanding what you mean by “a site”.

E.g. do you want to discuss Maths and Physics on one site? C# and Python? Well, do that on one site then, one database, but use a more powerful or more visible version of tagging (e.g. more like Discourse’s than SE’s) to differentiate types of topic.

Ditto if you want Meta on the same site, and/or a Sandbox, or Code and Code Reviews, etc.

And the “featured” feature:

https://meta.stackexchange.com/q/342538/139866

SE evidently relies on caching some of the queried/displayed data, my guess would be both within a site (e.g. the first page of active questions) and across sites (e.g. HNQ).

If it’s meant to be “open” then I guess the second/next use case might be two communities running two separate instances (therefore two databases) but wanting to share a bit of something-or-other (i.e. some amount of interop). That’s instead of Codidact the Terrible Tyrant fiercely running a walled garden of sites on a jealously-guarded single instance (if I’ve got the right terminology).

SE relies on caching for almost everything except actual questions and answers. In that case there’s a clusterf…k of caches in play. Some bits come from APIs (like meta featured posts), some from the network (like meta.se posts), and some from RSS feeds, so they all have their own timeouts. Plus the actual HTML block (or its data, I don’t remember) is also cached. That tiny box appears on a lot of pages, so everything is fine-tuned for perf.
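
Purely illustrative (none of these names or timeouts are SE’s): the shape of such a setup is a handful of cache entries, each with its own timeout, including one for the rendered block itself.

```csharp
using System;
using System.Collections.Concurrent;

public class TtlCache
{
    private readonly ConcurrentDictionary<string, (DateTime Expires, object Value)> _entries = new();

    // Return the cached value if it is still fresh; otherwise re-fetch it.
    public T GetOrAdd<T>(string key, TimeSpan ttl, Func<T> fetch)
    {
        if (_entries.TryGetValue(key, out var e) && e.Expires > DateTime.UtcNow)
            return (T)e.Value;
        var value = fetch();
        _entries[key] = (DateTime.UtcNow.Add(ttl), value!);
        return value;
    }
}

// Usage: each source gets its own timeout, and the rendered block is
// cached as well (all keys, timeouts and fetchers here are invented):
//   var featured = cache.GetOrAdd("meta-featured", TimeSpan.FromMinutes(5),  FetchFeaturedViaApi);
//   var rss      = cache.GetOrAdd("blog-rss",      TimeSpan.FromMinutes(15), FetchBlogRss);
//   var html     = cache.GetOrAdd("sidebar-html",  TimeSpan.FromMinutes(1),  RenderSidebar);
```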

Olin, in fact I don’t see any actual technical constraint for that, no. But I see other factors.

Different rules, help topics, site (community) scopes, possibly different privacy policies… Not to mention that I, for one, enjoy the distinct separation in terms of picking which communities I participate in and which ones I’m not interested in at all.

These days, “joining” a new community on SE is as simple as clicking two buttons: one at the top of every page, and the other on the confirmation page. Wait 1-2 seconds and you have joined the community. Wait another ~10-15 seconds for the association bonus to apply.

We can of course optimize these even further. But I would feel, personally, that this sort of distinction is desirable.

That being said, I wouldn’t actually oppose the idea of having one, “universal” account, like on Reddit. It works, too.

One page I rely on heavily is the Stack Exchange user profile page listing all my sites. I use that as a starting point to go to the specific site I want.

That page would be essentially useless if I were auto-member of all sites.

Note that this individual community membership question is independent of the question of a network-wide user account. You can have a single user account and still have a flag/table/whatever telling whether you are member of a specific community.

(Just a heads up: don’t click these, the second link is full of Chinese ads :x)

I would love to be able to see a combined “Hot Questions” list from all the Q&A communities I frequent on every one of them. And, of course, inbox notifications, so I can have just one tab open and have the total sum of all notifications from all sites available.

Fixed. :slight_smile:

True. I would definitely not want to be an auto-member of every community within an instance. If anything I said suggested that, sorry I didn’t mean it.

So what I am really suggesting is the “all in one database” method of simulating what SE has come up with - i.e., 2 clicks to join a new community but all with the same (only registered & confirmed once) account. I really don’t care whether it takes 2 seconds to complete or takes 1 minute until the Association Bonus/Automatic TL1/etc. shows up - my concerns about one-vs.-many databases are about other issues.

1 Like

(This is a quote from @Marc.2377, but he is quoting @sklivvz)

These are the kinds of concerns I have with getting started. We know we want a network of communities, but with shared users (register/confirm just once). We know that user syncing, migrating questions, banning users (in some fashion) across all communities, etc. will all be issues. Why start off with a system where we have to build a lot of extra stuff just to make all that work?

I’m surprised that would be so hard. Is managing a CommunityId FK fundamentally any different from picking a connection string? I would think you’d end up with things like ReadHomePageQuestions(UserId, CommunityId), which would use those two keys to figure out what to do - e.g., use UserId (or None for a non-logged-in user) to figure out the Trust Level, to determine whether to show Deleted/Closed/etc., and CommunityId to determine which community. Only at the very back, when talking to the database, do you turn that into a query (ORM-based, so pretty clean either way) that either hits the “big database” with CommunityId in the query or the “community database” (using a specific connection string). Really just not a big deal.
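
To sketch the claim (the schema, table and column names are made up; ReadHomePageQuestions is the hypothetical function named above), the two designs share all their logic and differ only in the final query:

```csharp
using System.Collections.Generic;
using Dapper;
using Microsoft.Data.SqlClient;

public record Question(int Id, string Title);

public static class HomePage
{
    // Option A: single database, CommunityId appears in the query.
    public static IEnumerable<Question> ReadHomePageQuestions(
        string sharedConnectionString, int communityId, bool showDeleted)
    {
        using var conn = new SqlConnection(sharedConnectionString);
        return conn.Query<Question>(
            @"SELECT TOP 50 Id, Title FROM Questions
              WHERE CommunityId = @communityId
                AND (@showDeleted = 1 OR IsDeleted = 0)
              ORDER BY LastActivity DESC",
            new { communityId, showDeleted });
    }

    // Option B: one database per community, CommunityId picks the connection.
    public static IEnumerable<Question> ReadHomePageQuestions(
        IReadOnlyDictionary<int, string> connectionStrings, int communityId, bool showDeleted)
    {
        using var conn = new SqlConnection(connectionStrings[communityId]);
        return conn.Query<Question>(
            @"SELECT TOP 50 Id, Title FROM Questions
              WHERE @showDeleted = 1 OR IsDeleted = 0
              ORDER BY LastActivity DESC",
            new { showDeleted });
    }
}
```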

I can’t speak to the performance issue. That, to me, is the one big unknown. But we know that:

  • Storage is the same either way
  • Servers keep getting bigger & better
  • We need one main user/authentication database either way (i.e., regardless of whether the Q&A data is split)
  • We will have multiple communities in the primary instance and need to support that from day one

IMHO, writing all the extra code to handle the edge-case but important pieces, like migration, multi-community user maintenance (bans, deletes, personal data retrieval requests, etc.), user profile pages showing all member communities, etc., will be much easier when using a single database. I am not worried about speed - frequent stuff can be cached, less frequent stuff batched, etc. But I am concerned about the extra work of writing all that code.

1 Like

We know we want a network of communities, but with shared users (register/confirm just once). We know that user syncing, migrating questions, banning users (in some fashion) across all communities, etc. will all be issues. Why start off with a system where we have to build a lot of extra stuff just to make all that work?

Because we might know what we want, but we do not actually know whether we are going to have an actual network of communities. It is certainly our wish, but we have absolutely zero idea of what kind of “network” we will be able to build.

By assuming we know stuff, all we can obtain is forcing a pre-made idea from the start, instead of building something small and easy to adapt to whatever actual, not imaginary, challenge we will face in the future.

The shape of the current Stack Exchange network has nothing to do with what Codidact can achieve. Once you have half a dozen healthy communities, then you will know how to make this choice properly.

Is managing a CommunityId FK fundamentally any different from picking a connection string?

Yes, and I am sure that if you know a bit about database performance and management and decide to spend some time thinking about the issue, you will get to the same conclusion.

ORM based so pretty clean either way

ORMs used as a matter of course are terrible, and Stack Overflow took 10 years to get rid of the one they mistakenly used in the beginning. You might inquire as to why we built Dapper.

I am not worried about speed

You’d better not be, because a single database for all the sites is clearly the worst-performing design. :slight_smile:

1 Like

I base this on already having Writing, and on key people from a number of other SE sites having expressed interest in joining Codidact, combined with my understanding that the Codidact “organization” plans to have one “primary instance” for any communities that we support. I actually doubt we’ll end up with 170 communities any time soon (someone mentioned that number based on SE), but I could easily see us having between 5 and 20 within a short time after launch.

100% agree. We are dealing with a lot of unknowns, and the closest comparison we have is SE, since most of us are “refugees” from SE. But the sky is the limit and we just don’t know.

Like anything else, they have to be used judiciously. That includes (as with any significantly sized database) setting up appropriate tables and indexes. It also, specific to ORMs, means making sure that queries that seem easy/fast are not being turned into significantly suboptimal queries due to ORM limitations. TANSTAAFL.
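
The classic trap being alluded to is the N+1 query problem; here is a sketch with invented types, not any particular ORM’s API: code that reads cleanly issues one query per row, so a 50-question page costs 51 round trips instead of 1.

```csharp
using System;
using System.Collections.Generic;

public record Question(int Id, string Title, int AuthorId);
public record User(int Id, string DisplayName);

public interface IQuestionRepo { IEnumerable<Question> Top50(); }
public interface IUserRepo     { User GetById(int id); }

public static class HomePageRenderer
{
    public static void Render(IQuestionRepo questions, IUserRepo users)
    {
        foreach (var q in questions.Top50())          // 1 query
        {
            var author = users.GetById(q.AuthorId);   // +1 query per row!
            Console.WriteLine($"{q.Title} by {author.DisplayName}");
        }
    }
}

// The fix is making the join explicit, e.g. as one Dapper query:
//   SELECT q.Id, q.Title, u.DisplayName
//   FROM Questions q JOIN Users u ON u.Id = q.AuthorId
//   ORDER BY q.LastActivity DESC
```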

I am concerned about speed for all the routine stuff - Q&A page loads, etc. That stuff needs to be optimized (both at the database-structure level, which is what we’re talking about, and in how the queries are done, data is cached, pages are designed, etc.).

What I meant was “I don’t care if the cross-community stuff runs slowly, as it does in some cases for SE because of the design”. But I do care about the work involved in developing that code.

2 Likes

Thank you for helping us sort out this question. Having your SE experience available here is great!

I have a few questions, possibly arising from insufficient knowledge on my end and if so please forgive any naivete on my part:

  • You’re suggesting one DB per community, right? Do you just mean DB, or do you mean we should also have one “application stack” (not sure if that’s the right term) per community?

  • There’s additional administrative complexity; are there standard tools or practices for dealing with it? I assume that pushing schema updates to a bunch of databases is a solved problem. What about backups? Replication?

  • What are the implications for deployment? Does each DB run on its own server instance to avoid resource contention (memory, ports, etc)? Or does everything run on one server instance in walled gardens of some sort (containers? something else?)?

Thanks.

1 Like

How many communities do you need to support six months down the line? I.e. how many will be big and healthy? My guess is as good as yours, but Codidact communities will have two main problems they need to overcome in order to survive:

  • You are competing against SE for user numbers and user engagement. At the start, most of the visits and engagement on the sites will come from people who are also active on SE.
  • All new communities have a high chance of effectively dying within six months simply because they are small.

The first challenge means that unless you can effectively “move” communities out of the small SE sites over here, the number of users in many cases will be way too low. Remember what happened with the Physics/PhysicsOverflow split: even though some of the top 20 contributors moved, you need hundreds of users to make a site work effectively, or the site simply becomes inactive.

The second challenge is simply a “tax” on new sites. Not all of them will work even in the best of cases.

Given these two parameters, my personal guess is that six months down the line there will definitely be fewer than 10 active sites, with perhaps one or two that “matter”.

Now: if I am correct, then network features are basically unnecessary until you actually have a network and not very few sparse sites. You can simply go on with completely separate systems for now and then add the solution of your choosing when things become clear.

Let me give you a few scenarios to explain:

  • the SE 2.0 scenario: you get a big hit and then a plethora of small unrelated sites
  • the internationalization scenario: you get a big hit and add more languages
  • the specialization scenario: you get a big hit and shard off subcommunities based on tags
  • the SE 1.0/SO Enterprise scenario: you get a big hit and companies adopt your software and deploy different topic clones
  • the “teams” scenario: you don’t get a big hit and want to monetize by offering a private option, too

All of these are clearly best served by different database strategies. The choice of which scenario to adopt is only very partially yours to make, as it mostly depends on what happens as emergent behavior.

They are indeed all solved problems: for db migrations I can suggest a tool I wrote, badgie migrator, which is built similarly to SE’s own migrator and already has features to support N databases. For backups and replication, the standard database tools already handle these scenarios.
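
Not badgie migrator’s actual interface, but a hand-rolled sketch of why N-database migrations are tractable: apply each script once per database and record what ran. Canary deployment then just means passing a subset of the connection strings.

```csharp
using System.Collections.Generic;
using Dapper;
using Microsoft.Data.SqlClient;

public static class Migrator
{
    public static void MigrateAll(IEnumerable<string> connectionStrings,
                                  IReadOnlyDictionary<string, string> scriptsByName)
    {
        foreach (var cs in connectionStrings)
        {
            using var conn = new SqlConnection(cs);
            // Track applied migrations per database.
            conn.Execute(@"IF OBJECT_ID('MigrationHistory') IS NULL
                           CREATE TABLE MigrationHistory (Name nvarchar(200) PRIMARY KEY)");
            foreach (var (name, sql) in scriptsByName)
            {
                var done = conn.ExecuteScalar<int>(
                    "SELECT COUNT(*) FROM MigrationHistory WHERE Name = @name", new { name });
                if (done > 0) continue;   // already applied to this db
                conn.Execute(sql);
                conn.Execute("INSERT INTO MigrationHistory (Name) VALUES (@name)", new { name });
            }
        }
    }
}

// Canary: call MigrateAll with only the test sites' connection strings
// first, verify, then run it against the rest of the network.
```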

At deployment time, the big advantage of not having a sites network built in is that you can choose to deploy to a single site, multiple sites, or the whole set of sites. This is super useful for debugging; it’s what SE does with meta.se and meta.so, and it is a standard practice for high-load scenarios known as canary release.
Regarding “where” to run these sites, you have essentially 4 options I can think of (a compact sketch follows the list):

  1. The SO Enterprise option: each site and meta site gets a dedicated database server and lives in its own database instance (same server, different instance for meta/main; different servers for different sites)
  2. The SE Network option: each site and meta live in their own database instances within the same database server (or very few servers)
  3. The Teams option: each site and meta live within the same database server and instance but are separated into different database schemas
  4. The All-in-one option: all sites live in the same database server, instance and schema, and are differentiated by a SiteId on most tables.
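
One compact way to see the difference is to ask where “which site?” lives in each option (all connection strings and names below are invented):

```csharp
// 1. SO Enterprise: the site is the server.
var opt1 = "Server=db-math.example.com;Database=Main;...";

// 2. SE Network: the site is the database on a shared server.
var opt2 = "Server=db01.example.com;Database=Math;...";

// 3. Teams: the site is the schema inside a shared database.
var opt3Sql = "SELECT * FROM math.Posts";

// 4. All-in-one: the site is a column value discriminating rows.
var opt4Sql = "SELECT * FROM Posts WHERE SiteId = @siteId";
```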

Speaking of contention, there are two broad categories: contention at the server level (bandwidth, memory, CPU power, tempdb…) and contention at the schema level (locks). It’s easy to see that solutions 1, 2 and 3 minimize contention on locks and 4 does not. Solution 1 furthermore minimizes contention on server resources. Solution 2 can easily move a big site out to its own server in case of contention.

In the cloud, perhaps solution 1 is the simplest of all. In the case of solution 4, at the scale of the SE network excluding SO, your database would have to be big iron (hundreds of cores and terabytes of memory), which sort of rules out both scaling and cloud hosting.

My personal advice, from the beginning, has been to create a distributed network if you have the skills. This allows sites to run cooperatively on small clouds, has network capabilities built in, and has very nice properties which I assume you want by design, like resiliency to corporate takeover.

6 Likes

I couldn’t agree more.

Adding in the obvious ease of canary releases, to me personally the advantages seem overwhelming.

1 Like

Please explain. I am really confused as to what this has to do with “one database on one server vs. multiple databases on one server vs. multiple databases on multiple servers”.

A distributed system provides a natural solution for hosting a network of sites: it solves a superset of the problem by its own nature. In a distributed system, at the architectural level, each site (or site shard) is fully autonomous and only speaks to other sites (or shards) via a documented API. There will be different API implementations, some hosting multiple sites or shards and others being single-site. In the end, as long as they implement the distributed API, they are part of the distributed network, and thus we have a natural sites network no matter what implementation choice we make.
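
A hypothetical sketch of such a documented API (the interface and types are invented for illustration, not a Codidact design):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// The contract every shard implements, regardless of how it stores data.
public record RemotePost(string ShardId, string PostId, string Title, string Url);

public interface IShardApi
{
    // Identity of this shard within the network.
    string ShardId { get; }

    // Network-wide features are composed purely from calls like these.
    Task<IReadOnlyList<RemotePost>> GetHotQuestionsAsync(int limit);
    Task<bool> IsAccountBannedAsync(string accountId);
}

// A single-site deployment, a multi-site instance, or a large shared host
// can all sit behind this same contract; callers cannot tell which it is.
```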

1 Like

That is a good explanation of a distributed system communicating via API. I see advantages and disadvantages to that, which is in the nature of our general discussion.

But what I am asking is about:

That’s the part I’m confused about. Please clarify.

If the community and the contents are distributed in different shards owned by different people and entities, then it is impossible for someone to “buy” or “control” the whole network. If, instead, we have a centralized system, no matter how good the initial intentions, it is always possible that the central authority will turn against the community, or monetize the site with spam, or gate the content. With a distributed system, providing a bad service would simply mean that users go to another, nicer shard to contribute and do their business.

Interesting. I’m not sure that would really work in a practical way, for a bunch of reasons. My thoughts on the same basic problem (aka “Codidact primary-instance administration goes bad the way SE did”) are that communities would leave by starting up a new instance and then migrating their data (their data being covered by license agreements, etc., as discussed elsewhere). But the primary instance’s databases (actually, any Codidact instance’s) would all be controlled by a single entity; i.e., the split between databases within an instance would be technical, not legal/administrative. Or to put it another way, I would expect all servers (web/API, database, etc.) for a single instance of Codidact to be within one AWS account (or similar entity).

1 Like