This is so true. It is waaay easier to remove history than to try to backfill it. (The latter is sometimes/often impossible!) I learned this painfully in a prior project.
Oh no, by no means is what I posted intended to be anywhere near complete; I wanted to show the approach I had in mind for pages/posts and community separation. I can readily name several tables that are missing, which I didn’t include because they weren’t relevant (UserTagSubscription, UserSiteSubscription, Badges, UserBadgeMapping, FeaturedPosts, CommunityAds, etc.). I only included the tables I wanted to talk about, plus the tables they reference with foreign keys. This is currently written in MySQL, but we want PgSQL too. I think translating this to ORM code would not be too difficult; it’s mostly to get a rough idea across. I think discussing this early is useful, and writing SQL queries is the most unambiguous way I know of to communicate SQL schema ideas.
I second the voices that say that it’s far too early for database schemas when we haven’t settled on what we want to build yet. So I didn’t do any detailed review. But one thing is obviously wrong: password information does not belong in the Users table. Authentication data belongs in table(s) of its own which may even be stored separately. Oh, and the salt is part of the hash. Please leave credential management to the experts.
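To make that concrete, here is a minimal sketch of what “table(s) of its own” could look like; the table and column names are hypothetical, not a proposed design:

CREATE TABLE UserCredentials (
    UserId       BIGINT NOT NULL,       -- FK to Users.Id
    PasswordHash VARCHAR(255) NOT NULL, -- e.g. bcrypt output, which embeds the salt
    CreatedOnUTC DATETIME NOT NULL,
    PRIMARY KEY (UserId),
    FOREIGN KEY (UserId) REFERENCES Users (Id)
);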
Just noticed something: Having questions, answers, and comments all lumped together in one Posts table makes basic querying a lot simpler, but what about pagination? When you have a large, popular post and you want to grab the question and no more than X answers, it seems like this system is going to get in the way…
SELECT
Posts.*
FROM Posts
WHERE
Posts.Root = 123456 AND
Posts.PostTypeId IN (1, 2) -- questions, answers
ORDER BY Posts.PostTypeId -- question first, then answers
LIMIT 11; -- one question, ten answers
Yeah, that would work. But now you’re not grabbing all the comments in the same query.
Also, it gets more complicated if you want to get the second or third page rather than the first. Heck, it’s more complicated than that sample query already, if we want to sort by votes rather than by post time!
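For instance, a rough sketch against the same sample data (Score is borrowed from the revised query later in this thread): fetching the second page of answers sorted by votes already forces the question into its own query:

SELECT Posts.*
FROM Posts
WHERE
    Posts.Root = 123456 AND
    Posts.PostTypeId = 2 -- answers only; the question now needs a separate query
ORDER BY Posts.Score DESC -- by votes
LIMIT 10 OFFSET 10; -- second page of ten answers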
It just feels like the plan to put everything together where we can grab it all in one simple query, while appealing in its simplicity, is a bit too simplistic to handle even basic real-world usage for something like an SE system. We’re either going to end up with a complex mess of multiple sub-selects, or running multiple queries. Probably the best solution would be to build a sproc that does the multiple queries inside the DB, so we don’t incur the overhead of multiple network round trips, and processes everything into JSON for easy deserialization on the server.
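A rough illustration of the idea, assuming PostgreSQL since that’s where we want to end up (where a function plays the sproc role), and assuming a hypothetical PostTypeId of 3 for comments; a MySQL stored procedure would look different:

CREATE FUNCTION get_question_page(root_id bigint, answer_limit int)
RETURNS json AS $$
  SELECT json_build_object(
    'question', (SELECT row_to_json(q) FROM Posts q
                 WHERE q.Id = root_id AND q.PostTypeId = 1),
    'answers',  (SELECT json_agg(a) FROM (
                     SELECT * FROM Posts
                     WHERE Root = root_id AND PostTypeId = 2
                     ORDER BY Score DESC
                     LIMIT answer_limit
                 ) a),
    'comments', (SELECT json_agg(c) FROM Posts c
                 WHERE c.Root = root_id AND c.PostTypeId = 3)
  );
$$ LANGUAGE sql STABLE;

One round trip returns the question, a capped list of answers, and the comments as a single JSON document the server can deserialize directly.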
Keep in mind that “one query per page display” is pie-in-the-sky, not-going-to-happen. There will be multiple queries, and that’s OK. The key is to keep them to a reasonable minimum. So if you have to read user name/avatar for all the posts on the page and you can’t get it in the initial query (you probably can, but this is a hypothetical example), then you gather all the user IDs you need from the results of the Q&A&comment query and make one query, instead of a query each time through the display-a-post loop. Etc. Don’t expect it to be easy or simple. But it can be done, and it can be very fast from the user’s perspective, which is what matters.
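That batched lookup boils down to something like this (a sketch; AvatarUrl and the literal IDs are stand-ins for whatever the post query actually returned):

SELECT Id, Username, AvatarUrl
FROM Users
WHERE Id IN (101, 102, 103); -- the distinct UserIds collected from the post rows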
I think the distinction between sub-sites, both technical and visible, should just be a big site-tag, first in the tag list. That would make migrations easier and provide a sense of tight connection between all the sites on the network.
Users with a high enough rep level on both the source site and the target site should be able to insta-migrate posts by changing the site-tag. Kinda like the gold tag badge.
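Under that model, a migration could be as small as a single update (purely hypothetical column names, since no site-tag exists in the draft schema yet):

UPDATE Posts
SET SiteTagId = @TargetSiteTagId -- hypothetical: swapping the site-tag moves the post
WHERE Id = @QuestionId OR Root = @QuestionId; -- the question and everything under it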
I don’t think it makes sense to separate data structures at the database level for the MVP if the plan is to create a single logical “network” first. It would make sense if the goal were to create the main network of sites plus separate instances controlled by other groups, where security and legal concerns about cross-access would arise.
Yes, pagination complicates things, but I don’t see how separating the comments into another table helps. The described query covers >90% of questions: very few questions need pagination. Giving up that gain in exchange for nothing does not seem like a good idea. Revised view query:
SELECT post.Id, post.TypeId, post.Body, post.Score, post.ChildCount, post.ParentId, post.UserId, u.Username, post.CreatedOnUTC, post.ModifiedOnUTC
FROM Posts post
JOIN Users u ON u.Id = post.UserId -- "u", because "user" is a reserved word in PostgreSQL
WHERE post.PageId = @PageId AND NOT post.Deleted
ORDER BY post.TypeId ASC -- question and answers first, comments last
LIMIT 50; -- max 50 posts per page

SELECT page.Title, page.Views, page.Tag1Id, page.Tag2Id, page.Tag3Id, page.Tag4Id, page.Tag5Id
FROM Pages page
WHERE page.Id = @PageId;
Three changes:
- added ChildCount column to posts, to know if all comments or answers were fetched for a post
- added order by clause to put question and answers at the top, comments at the bottom
- added limit clause to chop off everything after the first 50 posts
This lets us keep one query for nearly all questions; for the few that need more work, we can run additional queries.
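For example, fetching the rest of the comments under a post whose ChildCount says we didn’t get them all could look like this (a sketch, reusing the columns above):

SELECT post.Id, post.TypeId, post.Body, post.Score, post.UserId, post.CreatedOnUTC
FROM Posts post
WHERE post.ParentId = @PostId AND NOT post.Deleted
ORDER BY post.CreatedOnUTC ASC
LIMIT 50 OFFSET 50; -- e.g. the second batch of 50 children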
This thread is going off-topic. The discussion is no longer about the original question (“how to separate”) but about the whole database design, which is not within the scope of this thread.
I’m for the first option proposed (“don’t”). I’m against having > 200 sub-sites and would instead favor a decent (hierarchical) category/tag system for questions. Opinions?
Unless I’m very much mistaken, Codidact is intended as a platform for communities to create Q&A sites. I don’t believe they’re all subsites owned by us; I think we’re providing the software for them to host their websites.
Actually, it will work in two ways:
- Any group of people that wants to can set up their own Codidact instance.
- Each Codidact instance will have one or more “communities” or “topical sites” within it. If all goes well, the primary Codidact instance (i.e., the one that will be run by the same people developing Codidact) will include several communities.
My proposal:
- Every subsite/community will query a separate database (not just schema). This is for both isolation and performance reasons. “But deployment will be a nightmare”, I guess some of you might worry. Fear not; there are ways to deal with it, either with ORM-managed migrations or even plain old-school upgrade scripts (my style!).
- Each database will, almost certainly, have a few schemas for separating different aspects of the implementation (i.e. core Q&A, integrations, metadata that doesn’t fit anywhere else).
- The database structure is obviously still not defined, but I’ll try to push in that direction during the next couple of weeks (despite having a planned trip between Dec. 26 and Jan. 5).
- On a single table for “posts” vs. separate tables for “questions” and “answers”: in principle, separate tables are better if we plan to have many millions of records, even if there’s little to no logical difference between their structure, because of performance (not only querying, but also indexing and other maintenance tasks). But I understand that PostgreSQL might have effective ways to work around that particular concern at the DB engine level (not involving external caching strategies or anything), the simplest of which appears to be “partitioning”; see the sketch after this list. Given that, if there’s indeed only a marginal difference, if any, between the two types of posts, a single table appears preferable. We’d benefit from hearing what @JackDouglas might have to say about this! (Other than that, we’ll all discuss it over Discord prior to settling on one approach vs. the other.)
- ASP.NET Core has plenty of existing solutions for authentication and authorization, some of which are quite good, mature, and customizable. The one I’m most familiar with (in .NET, but it has a Core port) uses a separate DB context (at the ORM level) by default, so it should be trivial (and recommended) to move it to a separate DB, around which a different and stricter set of rules might apply. Perhaps some sort of authentication service or “module” can be designed, but it doesn’t necessarily need to exist from the very beginning. We’ll evaluate that.
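As promised above, a minimal sketch of PostgreSQL declarative partitioning (available since version 10; primary keys on partitioned tables need 11+). The split by post type is hypothetical, just to show the mechanism:

CREATE TABLE Posts (
    Id         bigint NOT NULL,
    PostTypeId smallint NOT NULL, -- e.g. 1 = question, 2 = answer
    Body       text,
    PRIMARY KEY (Id, PostTypeId)  -- the PK must include the partition key
) PARTITION BY LIST (PostTypeId);

CREATE TABLE Posts_Questions PARTITION OF Posts FOR VALUES IN (1);
CREATE TABLE Posts_Answers   PARTITION OF Posts FOR VALUES IN (2);

Queries still target Posts; the engine stores and maintains the partitions separately, which is what addresses the indexing/maintenance concern.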
If each site is a separate database, it will be years before any site has millions of rows in any of the key tables.
I don’t know exactly how partitions work in Postgres, but I’ve worked with them in Vertica, a Postgres derivative, and yes they make a difference. Also remember that you can use views to build, essentially, virtual tables that pull only the columns you care about from other tables.
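For example (a sketch; the view name and column choice are made up):

CREATE VIEW QuestionSummaries AS
SELECT Id, Score, UserId, CreatedOnUTC
FROM Posts
WHERE PostTypeId = 1; -- only questions, and only the columns we care about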
Nice. ^^ The longer, the better in this case.
About building views to make queries work faster (specifically, materialized views in the case of Postgres): that would fall under the other “maintenance tasks” I mentioned. From what I read, while they improve read query performance substantially, the same is not true for write performance, and while I’m not sure whether they can (or even must) be refreshed outside of the usual write operations, if they do, that certainly takes time as well. So, in any case, smaller DBs and table sizes are, as pretty much always, best.
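For what it’s worth, in Postgres the refresh is indeed a separate, explicit operation; a minimal sketch (made-up view name):

CREATE MATERIALIZED VIEW PostScores AS
SELECT Id, Score FROM Posts WHERE NOT Deleted;

REFRESH MATERIALIZED VIEW PostScores; -- must be re-run periodically; rebuilding takes time on large tables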
The main concern I’d have about using separate databases for separate communities is reporting. It’s the same problem we have using SEDE - you can’t effectively create complex cross-site queries, you have to create a single-site query and run it on every site instead.
If we want to include “community health” reports at a later stage, we’ll have trouble doing that effectively if communities are separate databases.
What are the DB performance tradeoffs of (a) separate databases vs (b) separate schemas within one database? Since in the latter you’d still only be querying and writing stuff from one site, I don’t understand where the performance hit would come from until we hit the “DB is just too darn big” stage, which I would expect to not be before tens or hundreds of terabytes? We’re not going to hit that.
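To make the reporting difference concrete: with separate schemas in one database, a cross-site report stays a single query (hypothetical schema names):

SELECT 'electronics' AS site, COUNT(*) AS posts FROM electronics.Posts
UNION ALL
SELECT 'cooking', COUNT(*) FROM cooking.Posts;

With separate databases, you’d have to run a per-site query against each one and merge the results in the application, or wire up something like foreign data wrappers.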
I strongly disagree. I’ve done both types of systems. Managing this as separate databases is a recipe for disaster. Advantages of each method:
- Multiple Databases:
  - Query speed increased in your largest communities, provided your largest communities’ databases are each on a separate server.
- Single Database (each item below means “simpler queries” and/or “faster” unless otherwise noted):
  - Schema changes
  - Consolidated reporting
  - Backup/restore
  - HNQ
  - User profile management (at many levels, from authentication to history viewing to an “all my communities” view)
  - High-level moderation (e.g., user “bans”)
  - Migration of questions between communities
  - Split or merge of communities (relatively unlikely, but could happen)
As @cellio noted, the query-speed advantage of multiple databases only really matters once you get to huge databases. In addition, if we need more “isolation” for a particular community, then the better solution may be for that community to spin up its own instance of Codidact.