How do we separate distinct communities?

ArtOfCode · 17 November 2019 22:55

It’s been mentioned several times across threads on here that an instance of Codidact should support multiple communities. Technically, how do we do this? Our options as I see them are as follows:

Don’t. One instance of Codidact is one community. Each new community needs a new instance.
Have a community field in every table in the database. Scope all database queries to the appropriate community (i.e. SELECT users.* FROM codidact.users WHERE users.community = 'Programmers';).
Use separate schema for each community. Database queries need not be scoped, but database context must be set to the correct schema on each request.
?

bertieb · 17 November 2019 23:17

Related but different: Decentralized website

manassehkatz · 17 November 2019 23:17

Actually, not exactly this, but close. My initial thoughts are:

Primary User information (username, hashed password, security stuff) - per instance
User Profile - per instance and per community (I originally thought just “per instance” but have been informed quite clearly that many people have different profile “per community” while others (me) do not)
Question - marked for a specific community
Answer - linked to a Question, which determines the community
Comment - linked to a Q which determines the community, or to an A which links to a Q, which determines the community
Voting History, Flags, Moderator Actions, etc. - all linked to a Q to a community or to an A to a Q to a community

This makes certain things really simple:

Any search of Q&A in a community (which is the only way most people would be able to search, and includes home page of a community, searching for text, searching for questions matching a tag, etc. and also searching for all of a User’s community activity to show in their profile) is a fairly simple join.
Any migration of a question is literally just a single field in a single record - as atomic as you can get in a complex system (plus some history logging, etc., of course)
Any complex data queries (e.g., analysis of Q/A ratio or tag usage or whatever) within a community are a simple join. Any such data queries of all communities together (i.e., “management reports”) are equally simple.
Any user-specific issues can be handled with a single action - e.g., “I’m out of here, delete all the content I submitted” (subject to license issues, etc. which is not the question right now) or more likely “this user is a spammer/hater/etc. and all content is to be deleted based on moderator consensus” can be handled quickly & easily because everything is in one database.
Adding an existing user to a new community is super easy - just add the profile record (which can default to “point to master record”) and basically nothing else as the user is already in the database.

As alternatives:

One instance per community - moderation, full user delete, migration - all kinds of things become far more complex.
Separate schema for each community - not as bad as “instance per community” but makes “global” queries and migrations more complex than one database.

cellio · 18 November 2019 02:19

I don’t know the technical merits of different schemas vs a community column in every table, but for what it’s worth, SO Teams does separate schemas as far as I know. Whether that’s to provide additional guardrails for data privacy or because it has other technical merits, I cannot say.

manassehkatz · 18 November 2019 02:23

For security reasons, I would consider each “SO Teams” customer/account/system/whatever as a separate instance of the entire system, and therefore a separate schema is appropriate. You wouldn’t want to have any reasonable possibility of an SO Team system getting mixed up with the public SO or with the SO Team of your competitor (e.g., if Ford and GM both have SO Teams then SO had better make sure they are as separate as can be).

However, in the case of Topical Sites with a Codidact Instance, there is no need for such separation.

cellio · 18 November 2019 02:25

Right, if it’s just a security concern, it doesn’t apply to us. I don’t know if there are non-security reasons for them to have done that, so I just wanted to supply that bit of info to the people who know enough to evaluate it.

ArtOfCode · 18 November 2019 12:18

I seem to recall seeing a blog post from Nick saying that separating schema was for security reasons rather than anything else. I can’t imagine it’s particularly good for performance…

rodolphito · 19 November 2019 10:38

I am still a proponent of the pages/posts distinction. A page belongs to a Site, a post belongs to a page. This is the schema I have in mind:

CREATE TABLE codidact.Users (
    Id BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    Username VARCHAR(255) NOT NULL,
    Email VARCHAR(255) NOT NULL,  -- https://stackoverflow.com/questions/7717573
    PasswordHash VARCHAR(255) NOT NULL,
    PasswordSalt VARCHAR(255) NOT NULL,
    Reputation INT(11) NOT NULL DEFAULT 1,
    CreatedOnUTC TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ModifiedOnUTC TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

CREATE TABLE codidact.Sites (
    Id BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    Title VARCHAR(255) NOT NULL,
    -- can have other things here like banner, description, help page text, etc.
    CreatedOnUTC TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ModifiedOnUTC TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

CREATE TABLE codidact.Tags (
    Id BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    Deleted BIT NOT NULL,  -- deleted tags cannot be recreated
    SiteId BIGINT NOT NULL,
    Title VARCHAR(255) NOT NULL,
    Body TEXT NOT NULL,
    CreatedOnUTC TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ModifiedOnUTC TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP

    INDEX TagsOnSiteId (SiteId),
    INDEX TagsOnTitle (Title),
    CONSTRAINT FKTagsSites FOREIGN KEY (SiteId) REFERENCES Sites (Id)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

CREATE TABLE codidact.Pages (
    Id BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    Title VARCHAR(255) NOT NULL,
    Views BIGINT NOT NULL DEFAULT 0,
    Deleted BIT NOT NULL,
    SiteId BIGINT NOT NULL,
    Tag1Id BIGINT NOT NULL,  -- must have at least one tag
    Tag2Id BIGINT,
    Tag3Id BIGINT,
    Tag4Id BIGINT,
    Tag5Id BIGINT,
    CreatedOnUTC TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ModifiedOnUTC TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
    
    INDEX PagesOnSiteId (SiteId),
    INDEX PagesOnTag1Id (Tag1Id),
    INDEX PagesOnTag2Id (Tag2Id),
    INDEX PagesOnTag3Id (Tag3Id),
    INDEX PagesOnTag4Id (Tag4Id),
    INDEX PagesOnTag5Id (Tag5Id),
    CONSTRAINT FKPagesSites FOREIGN KEY (SiteId) REFERENCES Sites (Id),
    CONSTRAINT FKPagesTag1s FOREIGN KEY (Tag1Id) REFERENCES Tags (Id),
    CONSTRAINT FKPagesTag2s FOREIGN KEY (Tag2Id) REFERENCES Tags (Id),
    CONSTRAINT FKPagesTag3s FOREIGN KEY (Tag3Id) REFERENCES Tags (Id),
    CONSTRAINT FKPagesTag4s FOREIGN KEY (Tag4Id) REFERENCES Tags (Id),
    CONSTRAINT FKPagesTag5s FOREIGN KEY (Tag5Id) REFERENCES Tags (Id)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

CREATE TABLE codidact.Posts (
    Id BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    TypeId TINYINT NOT NULL,  -- ENUM (Question, Answer, Comment)
    Deleted BIT NOT NULL,
    Body TEXT NOT NULL,
    Score INT(11) NOT NULL DEFAULT 0,  -- cached upvotes - downvotes, updated with pgsql atomic increment
    ParentId BIGINT NOT NULL,
    PageId BIGINT NOT NULL,
    UserId BIGINT NOT NULL,
    CreatedOnUTC TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ModifiedOnUTC TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
    
    INDEX PostsOnTypeId (Score),
    INDEX PostsOnTypeId (TypeId),
    INDEX PostsOnPageId (PageId),
    INDEX PostsOnUserId (UserId),
    CONSTRAINT FKPostsParents FOREIGN KEY (ParentId) REFERENCES Posts (Id),
    CONSTRAINT FKPostsPages FOREIGN KEY (PageId) REFERENCES Pages (Id),
    CONSTRAINT FKPostsUsers FOREIGN KEY (UserId) REFERENCES Users (Id)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

CREATE TABLE codidact.Events (
    Id BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    TypeId TINYINT NOT NULL,  -- ENUM (HitHNQ, Rename, Migrated, Edit, Upvote, Downvote, Star, Deleted, FlagInappropriate, FlagSpam, FlagNotAnswer, etc)
    Body TEXT,  -- stores post body for edits, new title for renames, or source and destination for migrations
    Reason TEXT,  -- for edits, flags, renames, and migrations
    PostId BIGINT NOT NULL,
    PageId BIGINT NOT NULL,
    UserId BIGINT NOT NULL,
    CreatedOnUTC TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP

    INDEX EventsOnType (TypeId),
    INDEX EventsOnPost (PostId),
    INDEX EventsOnPage (PageId),
    INDEX EventsOnUser (UserId),
    CONSTRAINT FKEventsPosts FOREIGN KEY (PostId) REFERENCES Posts (Id),
    CONSTRAINT FKEventsPages FOREIGN KEY (PageId) REFERENCES Pages (Id),
    CONSTRAINT FKEventsUsers FOREIGN KEY (UserId) REFERENCES Users (Id)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

CREATE TABLE codidact.Alerts (
    Id BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    TypeId TINYINT NOT NULL,  -- ENUM (Upvoted, Downvoted, RepliedTo, Edit, Deleted, Mentioned, etc)
    PostId BIGINT,  -- sometimes there's no relevant post (NULL)
    UserId BIGINT NOT NULL,
    Seen BIT NOT NULL,
    CreatedOnUTC TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP

    INDEX AlertsOnUser (UserId),
    CONSTRAINT FKAlertsPosts FOREIGN KEY (PostId) REFERENCES Posts (Id),
    CONSTRAINT FKAlertsUsers FOREIGN KEY (UserId) REFERENCES Users (Id)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

example queries:

-- View @PageId
SELECT post.Id,post.TypeId,post.Body,post.Score,post.ParentId,post.UserId,user.Username,post.CreatedOnUTC,post.ModifiedOnUTC
    FROM Posts post
    JOIN Users user ON user.Id = post.UserId
    WHERE post.PageId = @PageId AND NOT post.Deleted
SELECT page.Title,page.Views,page.Tag1Id,page.Tag2Id,page.Tag3Id,page.Tag4Id,page.Tag5Id
    FROM Pages page
    WHERE page.Id = @PageId

-- List newest 50 questions
SELECT page.Title,page.Views,page.CreatedOnUTC,post.Score,post.UserId,user.Username,user.Reputation
    FROM Pages page
    JOIN Posts post ON post.PageId = page.Id
    JOIN Users user ON user.Id = post.UserId
    WHERE post.TypeId = @Question
    ORDER BY page.CreatedOnUTC
    LIMIT 50

-- List 50 top voted questions in @TagId
SELECT page.Title,page.Views,page.CreatedOnUTC,post.Score,post.UserId,user.Username,user.Reputation
    FROM Pages page
    JOIN Posts post ON post.PageId = page.Id
    JOIN Users user ON user.Id = post.UserId
    WHERE post.TypeId = @Question AND (
        page.Tag1Id = @TagId OR
        page.Tag2Id = @TagId OR
        page.Tag3Id = @TagId OR
        page.Tag4Id = @TagId OR
        page.Tag5Id = @TagId )
    ORDER BY post.Score
    LIMIT 50

Sorry for the code dump, it’s just more productive and efficient here for me to show, not tell. Is there anything that this doesn’t cover?
(Thanks @manassehkatz @ArtOfCode and @MasonWheeler for discord discussion that lead to this.)

rodolphito · 19 November 2019 11:23

As a side note, can we get rid of the “join this community to participate” stuff? I’d much rather have my action go through when I attempt it (say, cast a vote, post an answer, write a comment) than be asked to join and then do whatever I was going to do. It just seems like unnecessary friction for no gain whatsoever.

luap42 · 19 November 2019 11:48

@rodolphito

This are some ideas/comments to your table proposal (not complete!)

I wouldn’t call it deleted, though. What about Blacklisted?

I would add a key DeletedOnUTC TIMESTAMP key.

Also I think we need a vote table:

codidact.Votes

PostId, BIGINT, refers to the post that is voted upon (NOT NULL)
UserId, BIGINT, refers to the user casting the vote
TargetUserId, BIGINT, refers to the user receiving the vote
VoteTypeId, TINYINT, refers to the vote type (Accept, Upvote, Downvote, SuggestedEditAcceted, Closure, Reopen, Delete, Undelete, …)
CastOnUTC, TIMESTAMP, when the vote was cast
InvalidatedOnUTC, TIMESTAMP, when the vote was invalidated, NULL if it wasn’t

rodolphito · 19 November 2019 12:27

Awesome, keep it coming.

Believe it or not, I actually named it Blacklisted first, but then renamed to deleted to stick to the naming scheme of other tables. I’m fine with either, don’t mind.

Votes etc are covered by the Events table. I had them as separate tables but I realized that they could easily be combined since they had mostly the same stuff. Please check if the Events table suits your need, I feel it has a very similar spirit to what you proposed here.

Deleted would be better as an event type IMO. Combined with the Deleted field, you have quick access to its current status, as well as a log entry showing when it was deleted. Event model is better here because you can track several delete/undelete cycles.

celtschk · 19 November 2019 14:46

What I don’t like about your proposal is that you put security critical data (like authentication data) and public data (like posts) into the same database. I think the application should strictly separate sensitive data from public data at the deepest level possible (indeed, as I wrote earlier on Discord, I’d even like to have them handled by different processes, so even arbitrary-code-execution attacks have a hard time getting to the sensitive data, as the password for the sensitive database is in another process).

rodolphito · 19 November 2019 14:56

Oh sure, I don’t mind changing that at all. Good point. Seems a bit overkill, and has to be done correctly in order for it to be actually safer. Meaning: you can’t use a loopback port to communicate the two processes, it has to be a named pipe/unix socket, because otherwise a third application can sniff the port and then you have even worse security than we started with. So yeah, swap PasswordHash/Salt fields for a PasswordId field, then have an auth service hooked up via a named pipe and throw auth queries at it like “is hunter2 valid for password id 1248”? (Salt is stored in the other process, as such the hash must happen there too.)

rodolphito · 19 November 2019 14:58

Along this same line of thinking, we can also have an email process to avoid leaking email addresses. Queries (over a named pipe) would look like “send Welcome to Codidact! to user id 1492”.

celtschk · 19 November 2019 15:06

The email could (should) be handled by the same process as the passwords; one important use of the email is for password reset mails.

cellio · 19 November 2019 15:30

I haven’t done a careful review but don’t want to lose track of one thing I noticed: you have pages having tags, but it sounds like those should go with questions. There are other types of pages on a site besides questions – tag wikis, help topics, and some others. Those won’t have tags.

manassehkatz · 19 November 2019 16:54

Needs to be done carefully. Arguably it may be better to salt & hash (obviously then storing the salt in the first location, but salt can actually be as simple as the autoincrement user id) before sending the (hashed) password for verification. That way the plaintext password is still sent down only one path (as it has to be with any system) and not two paths (browser to web server, web server to authentication server). Remember, salts done right do NOT need to be secret - see https://stackoverflow.com/questions/213380/the-necessity-of-hiding-the-salt-for-a-hash

manassehkatz · 19 November 2019 17:22

All:

I am a bit concerned about the way the diving into database minute details is going. There seems to be an emphasis on “let’s solve this little problem” without thinking about big picture real-world database structure. I don’t know if I’m a true “expert” (I’m certainly not from an academic point of view) but I’ve been doing this for well over 30 years. I’ve learned a lot along the way. Some of what I have done on other projects applies here, some does not. This project has the possibility/hope/dream of scaling up far larger than (broadly speaking) all but one project that I have worked on, but that project (can’t divulge the details) has taught me a lot (and reinforced a lot of what I already knew). I will, hopefully, provide a more complete database description in a day or two, but a few key things to consider:

An ORM may be very beneficial. I don’t know what ORMs are available for C#/ASP.NET. But an ORM can hide a ton of complexity, make database migrations easier & more reliable and basically prevent a ton of “stupid mistakes” that we (myself included) naturally fall into when writing SQL code. There are drawbacks - an ORM can also hide complexity in a way that additional queries are generated that hurt performance, so profiling, testing, etc. become potentially even more important than without an ORM. But the benefits can be significant too. Using an ORM is something we would have to have consensus on (both “to use one” and if so, “which one”), and it would not exclude the use of pure SQL code for trouble-shooting, database cleanup, one-off fixes, etc.
History is critical. This is obviously the case for post edits. But it actually comes into play in a number of different ways - e.g., knowing the specifics of when tags were added/dropped from a question; every vote (item/who/when/up-or-down); etc. Everything (except as specifically noted for legal/PII/etc. reasons) gets logged/saved forever.
Summary/latest information gets cached by saving to records in the database, but that doesn’t eliminate other queries in some situations - i.e., “one query for everything in a question” is great but not guaranteed. For example (as discussed), the current number of up/down votes for each q/a/comment gets saved in the main post record along with the latest revision of the text and the metadata. But the details are always available elsewhere in the database for rebuild or for appropriate queries - e.g., in this Forum, I can see who “liked” my posts - in SE I can’t though I presume moderators can - but at some level the detail will be needed (e.g., to consider removal of votes based on an sock puppets/voting rings/requested-removal-by-user/etc.). Similarly, history data of edits/tag changes/etc. won’t normally be needed but will be needed on demand.
There will be a lot more tables and fields added. Guaranteed. I don’t know what they are. But it will happen. We’ll add “kudos” or “rep” and need to track the latest (linked to user record) and history (somewhere). We’ll add info. about posts that came from SE or elsewhere. We’ll add moderation history details. Etc. So saying "OK, this is our fantastic database and it has 13 tables and the posts table only has 17 fields with 218 bytes (+ the text fields) is fooling ourselves. Storage and RAM are dirt cheap. Access is fast. The key is having the right data in the right place (e.g., summary where it makes sense like total votes) accessed the right way (there is nothing wrong with joins provided there are good indexes and foreign keys and the joined tables are either small (sit entirely cached in DB server RAM) or well designed (e.g., history records for a post, all easily and quickly linked if you have an index on post ID + time).

I’ll have plenty more to say when I (ha ha ha) have some more time.

rodolphito · 19 November 2019 17:53

Yes, I was ignoring other pages, that’s a good point. I was thinking of pages as “that thing with a question on it that you view”. Maybe there is a clearer name, or maybe we can just add a specific post type for static pages and just leave nulls in the tag ids on their row.

Excellent, this is what I’m going for. A complete database description would be great. Also, please note that the database description I provided stores full history in the Events table—every edit, title change, flag, upvote, downvote, star, etc. with timestamps and origin user. I eagerly await improvement.

For sure, but why bring this up? Does an ORM make it easier or something? The database will grow organically just like the code will, I don’t see this is a downside. Any definition we come up with will “suffer” from this.

manassehkatz · 19 November 2019 17:56

An ORM can make it easier. Depends on the ORM. I’m currently working with a large (by my standards) Django project - the ORM makes database changes trivial. However, not every ORM actually handles database changes so well, so YMMV.

My main point though is that I got the impression, perhaps incorrectly, that the various database samples being thrown around were “here it is, almost ready for production” and I think there is a long ways to go…and there will be changes even after we actually do start coding…and even after we have a production system.