Proposal - Schema

This is where I got to on the schema, please read my convention post first.

Users

This table will hold the global user records for an Codidact Instance. A user should only have one email to login with, that would be stored here. I am leaving the password storage and hashing to the experts.

Email (Text 255) UNIQUE
IsEmailVerified (Boolean default FALSE)
IsGlobalSuspended (Boolean default FALSE)
GlobalSuspensionEndDate (DATE)

Sites
Each of the individual sites inside of a Codidact installation

Name (Text) UNIQUE
Tagline (Text) UNIQUE
URL (Text) UNIQUE

TrustLevel
Name for each trust level and an explanation of each that a user should get when they get to that level.

Name (Text) UNIQUE
Explanation (Text) UNIQUE

UsersSites
A user may want to join one site and not another, and having a join table between users and sites allows a user to be suspended or gain/lose privileges on a single site. I listed the privilege booleans in the order of lowest to highest, the privileges after IsModerator arenā€™t listed because we probably are going to take any of moderatorā€™s lower privileges away without demodding them.

SitesId (bigserial)
DisplayName (Text)
Bio (Text)
PersonalWebsite (Text)
Github (Text)
Twitter (Text)
SuspensionEndDate (Date)
IsSuspended (Boolean default FALSE)
TrustLevelId (bigserial)
IsAbleToAsk (Boolean default TRUE)
IsAbleToAnswer (Boolean default TRUE)
IsAbleToComment (Boolean default FALSE)
IsAbleToSuggestEdits (Boolean default FALSE)
IsAbleToFlag (Boolean default FALSE)
IsAbleToUpvote (Boolean default FALSE)
IsAbleToDownvote (Boolean default FALSE)
IsAbleToCloseVote (Boolean default FALSE)
IsAbleToEdit (Boolean default FALSE)
IsAbleToReview (Boolean default FALSE)
IsAbleToProtectPosts (Boolean default FALSE)
IsAbleToLockPosts (Boolean default FALSE)
IsAbleToSoftDeletePosts (Boolean default FALSE)
IsModerator (Boolean default FALSE)
IsDeleted (Boolean Default FALSE)
UNIQUE CONSTRAINT (UsersId,SitesId)

UsersSuspensions
This table is for recording the specific privilege suspensions. The idea is that we will have an scheduled task running nightly that once the suspension is over will set the correct flag on the UsersSites table.

UsersId (bigserial) UNIQUE
QuestionBanEndDate (date)
AnswerBanEndDate (date)
CommentBanEndDate (date)
SuggestedEditBanEndDate (date)
FlagBanEndDate (date)
UpvoteBanEndDate (date)
DownvoteBanEndDate (date)
EditBanEndDate (date)
ReviewBanEndDate (date)
ProtectBanEndDate (date)
LockPostsBanEndDate (date)
SoftDeleteBanEndDate (date)

Posts
I thought about splitting into a Answers table and and QuestionsTable but doing it in the same table lets comments have a PostsId instead of a QuestionsId and a AnswersId. Meta posts are denoted by the IsMeta column. Still unsure on community wiki/FAQ posts.

SitesId (bigserial)
Title (Text Default NULL)
Body (Text)
Upvotes (smallint)
Downvote (smallint)
Score (smallint)
IsAccepted (Boolean Default FALSE)
ParentId (bigserial Default NULL)
IsMeta (Boolean Default FALSE)
IsDeleted (Boolean Default FALSE)
UNIQUE CONSTRAINT (SitesId,Title)

Comments
Table for the comments on posts, both questions and answers.

PostsId (bigserial)
ParentCommentId (bigserial DEFAULT NULL)
Body (Text)
Upvotes (smallint DEFAULT 0)
Downvote (smallint DEFAULT 0)
Score (smallint DEFAULT 0)
IsDeleted (Boolean Default FALSE)

CommentsVotes

CommentsId (bigserial)
UsersId (bigserial)
VoteTypesId (bigserial)
UNIQUE CONSTRAINT (CommentsId,UsersId)

Tags

SitesId (bigserial)
Name (Text NOT NULL)
Description (Text)
Usages (smallint)
UNIQUE CONSTRAINT (SitesId ,Name)

PostsTags

PostsId (bigserial)
TagsId (bigserial)
UNIQUE CONSTRAINT (PostsId,TagsId)

VoteTypes
Table for the vote types, upvote/downvote.

Name (Text) UNIQUE

PostVotes
The reason for this table is so that votes by spammers/serial voters can be undone.

PostsId (bigserial)
VoteTypesId (bigserial)
UsersId (bigserial)
UNIQUE CONSTRAINT (PostsId,UsersId)

1 Like

Little risky, arenā€™t these, given that thereā€™s an SO post that would be bumping the limits?

I assume UsersSuspensions is intended to be site-specific, so should have a SitesId.

2 Likes

These names feel awkward. What about CanAsk, CanSuggestEdits instead? While it doesnā€™t technically fit with the proposed naming conventions, we could update those to allow other predicate names as well.

3 Likes
  • Users - maybe GravatarHash because itā€™s used frequently

  • TrustLevel - You mentioned ā€œNameā€, is that a human-readable name, and does the schema support localisation?

  • Posts - datePosted is important, so is dateEdited

    The isMeta boolean might be generalised to an enum or string (e.g. Discourse supports several ā€œcategoriesā€ of post).

    You might want a additional, denormal ā€œsummaryā€ of the body (e.g. when displaying a list of several posts)

    You might want a denormal ā€œisTheActiveAnswerā€ boolean otherwise it may be expensive to select the ā€œActiveā€ (i.e. sort by most recently-edited answer) view

    Iā€™m not sure you want ā€œScoreā€ as well as ā€œUpvotesā€ and ā€œDownvoteā€ (also the latter is singular and the former is plural?)

    There might be a benefit to splitting Posts into Topics and Answers ā€“ or the Topicā€™s header might be a subclass of post (or however you compose that, ā€œis-aā€ versus ā€œhas-aā€) ā€“ e.g. on SE a topic must have a title and at least one tag and may have one accepted answer, and the topicā€™s OP might have topic-specific privileges (e.g. ā€œcan commentā€), whereas answers cannot.

  • VoteTypes - I didnā€™t understand this (which doesnā€™t mean itā€™s wrong)

  • UsersSites Iā€™m not sure privileges belong with sites. In future privileges might be associated with something other than a site, e.g. you might be able to edit specific posts or a tags. Is this a time to consider that?

Perhaps you could have Posts (for the question body and/or answer bodies) and Topics (in a separate table). Then every post has a topic-id, and comments have a post-id, and topics have a title and tags (and/or tags are associated with topics not with posts). The topicā€™s question (as opposed to its answers ā€“ ā€œwhich of these posts is the question?ā€) might even be identified by its date, i.e. the question is the ā€œfirstā€ post associated with a topic, though you could use a boolean instead for that with some (slightly complicated) uniqueness constraint on it.

In addition to my comments below, almost every table will need History (aka Audit Trail aka Logging). In some cases this will be obvious (Edit history of Q&A which can be viewed by anyone) and in some cases behind the scenes/moderator only (to research various types of bad behavior). In most cases this will be ā€œanother table same as the original with a prefix in the table name, FK to the original table, a copy of all fields from the original table plus a few extra fields to indicate who/when the change was madeā€.

I notice here and elsewhere many text fields that are 255 or length-not-specified. I think that in many cases a maximum length, often less than 255, is beneficial as it can help with display/UI related issues. For example, a Username at 255 characters would be a bit unwieldy given all the places it needs to go, but the limit canā€™t be too short as it would prevent things like adding ā€œReinstate Monicaā€, so SEā€™s arbitrary (and any value will be arbitrary) 40 characters for DisplayName is too limiting. I will suggest some limits but they are arbitrary.

Add:

  • DisplayName (Text 100)
  • Bio (Text)
  • Personal Website (Text 255)
  • Github (Text 100)
  • Twitter (Text 100) Only if it is also in UserSites. But I would be thrilled if Twitter were not an inherent part of Codidact as I see it repeatedly used in awful ways (on all sides of the political and other spectrums). People can always stick it in their Bio.
  • Location (Text 100) optional
  • IsFromStackExchange (Boolean default FALSE)
  • StackExchangeId (Bigint)

Most of those fields are in order to have a ā€œmasterā€ user definition used when a user connects with a new community or in an existing community sets ā€œIsSameAsInstanceā€.

As far as IsFromStackExchange and StackExchangeId, not sure how the whole import process will be handled, but I believe the plan is to have some form of ā€œImport a lot of stuff and then let users claim their stuffā€ - and this is for that purpose, though the details may dictate a different way of storing this.

Suggest considering calling this Communities. It fits with the theme of the whole system (a community of people rather than just a site visited by a bunch of anonymous people) and avoids ambiguity between the entire system (Instance) and individual Communities. On the other hand Community vs Communities can confuse things (as opposed to Site vs Sites and Instance vs Instances where you just add an ā€˜sā€™) and Sites is a lot shorter than Communities.

Changes:

  • Name (Text 40) UNIQUE - this needs to be fairly short for a lot of display, link, etc. reasons.
  • Tagline (255) UNIQUE - needs a limit, though arbitrary
  • URL (100) UNIQUE - needs a limit, though arbitrary - but a lot of reasons it shouldnā€™t be very long

Add:

  • HelpText (Text)
  • FAQ (Text) - or might be a bigger structure, TBD
  • IsActive (Boolean default TRUE) - could be FALSE while a site is under development or if ever disbanded
  • IsBeta (Boolean default FALSE) - aka Area 51

plus a ton of other fields that will be used to determine lots of settings relating to voting, ā€œpointsā€, trust levels, etc.

Change:

  • Name (Text 40) UNIQUE

We might want to make the TrustLevel names community-specific, in which case this would have Community FK as well (with None = primary Instance, and if there is no match for a Community, defaults to using the Instance Name for each TrustLevel).

Changes:

  • DisplayName (Text 100)
  • PersonalWebsite (Text 255)
  • Github (Text 100)
  • Twitter (Text 100) Consider removing

Add:

  • IsSameAsInstance (Boolean default TRUE)
  • StackExchangeValidated (Date)
  • StackExchangeLastImported (Date)

Add (unless it is all in another table somewhere):

  • StartDate for each of the included EndDate fields.
  • Reason - not sure what form this will take - might be FK to a ā€œReasonsā€ table or might be text, moderator, etc. But need to store somewhere why/when the action was taken.

Add:

  • UsersId (bigserial) - probably the most important thing of all, except for Title & Body!
  • IsQuestion (Boolean default TRUE)
  • IsAnswer (Boolean default FALSE)
  • IsBlog (Boolean default FALSE)
  • IsCanonical (Boolean default FALSE)

or alternatively have a PostType.

  • IsClosed (Boolean default FALSE) - need to also figure out how to link to Closed info (who voted, reasons, etc.)
  • NetVotes (Bigint)

Change:

  • Title (Text 255 default NULL) - Note that NULL is only OK for Answers. Q, Blog, Canonical will always need a title.
  • Score (some sort of decimal) - this will be the calculated/weighted score used internally for ranking as discussed elsewhere (with the simplest variant being the same as NetVotes)
  • As noted by others, all vote counts should be Bigint, not Smallint.
  • Downvote should be Downvotes (plural for consistency)

Add:

  • UsersId (bigserial) - probably the most important thing of all, except for Body!
  • NetVotes (Bigint)

Change:

  • Downvote should be Downvotes (plural for consistency)
  • Score (some sort of decimal) - this will be the calculated/weighted score used internally for ranking as discussed elsewhere (with the simplest variant being the same as NetVotes) Not sure if needed on Comments - if not needed then remove it and just use NetVotes to match the meaning on Posts.

Add:

  • TagWiki (Text default NULL) - The Description is relatively short (though 255 may be too short so not setting a limit) and should be plain text. TagWiki can be full Markdown - could be as much as a Canonical post about the Tag
  • IsActive (Boolean default TRUE)
  • TagsSynonymId (bigserial FK to Tags default NULL) - Use this to indicate a synonym - simply deleting synonyms doesnā€™t work because users will search and either not find and be frustrated or not find and decide to add in an endless cycle. This will point them to the ā€œapprovedā€ tag.

Change:

  • Name (Text 100 NOT NULL)
  • Usages (Bigint)
2 Likes

Is that how people usually do that, may I ask ā€“ you donā€™t just leave existing data unaltered in the table and add a new record with a newer date to the same table to supercede the old record?

1 Like

@cwellsx Excellent question. It could be done as a new record in the same table. As far as total stored data, it would be the same. But there are some functional differences that make a separate History/Audit/Log table superior:

  • Any links based on the original Id (e.g., PostsId or UsersId) do not change - e.g., you donā€™t have to change all the Answers, Comments, Votes to match the new Question Id.
  • The main tables stay (relatively) small. For pretty much everything except massive historical data analysis, that means almost all queries that relate to more than one record in the main table donā€™t even have to look at the history. For example, main Questions page for a community - select most recently updated records from Posts based on Community + IsQuestion - e.g., 50 records out of 1,000,000 rather than 50 out of 10,000,000 (or whatever). When you go to History of a Question then you go to that 10,000,000 record HistoryPosts table but are querying on one indexed PostsId field to get the 10 matching records so that is very fast.

I wasnā€™t thinking of changing the question-id ā€¦ instead a new version of the question would have the same id and a new date.
Other records which use the question-id as a foreign key are unchanged.
The ID isnā€™t unique ā€“ itā€™s ID-plus-date (a compound key) which is now unique.

ā€¦ assuming an average of 10 edits.

If you did keep everything in one table that makes update much simpler, itā€™s purely an insert ā€“ without locking the table or the record while you re-create the duplicate record?

And I think you can archive older records into an aux table ā€“ immediately or more likely later, an overnight job, with the aux tableā€™s having column just like the original table.

I think thereā€™s a theoretical elegance to not/never deleting data, but I just donā€™t know how developers actually tend to implement this in practice.

Seems like there is too little global information per user. Maybe someone wants to customize their bio or have a different display name per site, but usually not. At the least, the per-user-per-site data should be auto-filled from the default global per-user data. Iā€™d be pissed if I had to enter the same data separately each site.

I donā€™t see the point of per-user-per-site customizations for MVP at all.

You also donā€™t seem to have a userā€™s real name anywhere. Thatā€™s not the same as the ā€œdisplay nameā€.

Almost by definition within MySQL, PostgreSQL and similar RDBMS, there should be a unique single-field Id in each table. Using Id+something-else leads to problems. Trust me - been there, done that, too many times. Wonā€™t do that again.

That lock is a single record lock, not the table. It is trivial and normal. The lock is only to update changes in the primary record. The history record is an insert into another table, which is done without any locks.

Canā€™t do it that way. What if the Question gets edited 10 times within an hour? Which might be an Edit War or might be simply a lot of people making helpful changes - and the history is needed to know which one it is. The only practical way to do this is to archive at the same time as the update is done.

Iā€™m a developer. This is how I implement it in practice. Not the only way. But I believe a pretty common way.

Closing this, because a newer version is available here: