Schema Proposal Round 3

Changes this iteration,

  • Singularized the table names and changed Users to Member because User is a reserved word.
  • Added reference tables to replace the CanUpVote etc columns and for the PostTypes.
  • Changed how the TrustLevel and the Privileges+Suspension of privileges are recorded.

Member
This table will hold the global user records for an Codidact Instance. A user should only have one email to login with, that would be stored here. I am leaving the password storage and hashing to the experts.

DisplayName (Text 100)
Bio (Text)
Personal Website (Text 255)
Github (Text 100)
Twitter (Text 100)
Email (Text 255) UNQIUE
GravatarHash (Text)
Location (Text 100)
IsFromStackExchange (Boolean default FALSE)
StackExchangeId (Bigint)
StackExchangeValidated (Date)
StackExchangeLastImported (Date)
IsEmailVerified (Boolean default FALSE)
IsSuspended (Boolean default FALSE)
SuspensionEndDate (DATE)

Community
Table for each of the individual sites inside of a Codidact installation

Name (Text 40) UNQIUE
Tagline (Text 100) UNQIUE
URL (Text 255) UNQIUE
HelpText (Text)
FAQ (Text)
Status (ENUM)

TrustLevel
This needs to be specific to a site for when we get different language sites and it needs to be localized.

Name (Text 100) UNQIUE
Explanation (Text) UNQIUE

TrustLevelCommunity

CommunityId (bigserial)
IsSameAsInstance (Boolean default TRUE)
Name (Text 100) UNQIUE
Explanation (Text) UNQIUE

MemberCommunity

A user may want to join one site and not another, and having a join table between users and sites allows a user to be suspended or gain/lose privileges on a single site. I listed the privilege booleans in the order of lowest to highest, the privileges after IsModerator aren’t listed because we probably are going to take any of moderator’s lower privileges away without demodding them.

CommunityId (bigserial)
DisplayName (Text 100)
Bio (Text)
Personal Website (Text 255)
Github (Text 100)
Twitter (Text 100)
Email (Text 255) UNQIUE
GravatarHash (Text)
Location (Text 100)
IsFromStackExchange (Boolean default FALSE)
StackExchangeId (Bigint)
IsSameAsInstance (Boolean default TRUE)
StackExchangeValidated (Date)
StackExchangeLastImported (Date)
IsSuspended (Boolean default FALSE)
SuspensionEndDate (Date)
TrustLevelCommunityId (bigserial)
IsModerator (Boolean default FALSE)
IsDeleted (Boolean Default FALSE)
UNIQUE CONSTRAINT (MemberId,CommunityId)

Privilege
Table for privileges

Name (Text 100)
Description (Text)

MemberCommunityPrivilege
For recording which members have which privilege in a community. If a member has a privilege suspended, then that is also recorded here, and a nightly task will auto undo the suspension once the PrivelegeSuspensionEndDate has passed.

CommunityId (bigserial)
MemberId (bigserial)
PrivilegeId (bigserial)
IsSuspended (boolean DEFAULT FALSE)
PrivilegeSuspensionStartDate (Date)
PrivelegeSuspensionEndDate (Date)
UNIQUE CONSTRAINT (CommunityId,MemberId,PrivilegeId)

PostType
Records the type of post, question/answer/blog etc

Name (Text 100)
Description (Text)

Post
I thought about splitting into a Answers table and and QuestionsTable but doing it in the same table lets comments have a PostId instead of a QuestionsId and a AnswersId. Meta posts are denoted by the IsMeta column. Type of post is determined by the PostTypeId

MemberId (bigserial)
CommunityId (bigserial)
Title (Text 255 Default NULL)
Body (Text)
Upvotes (bigint)
Downvotes (bigint)
NetVotes (bigint)
Score (decimal)
IsAccepted (Boolean Default FALSE)
PostTypeId (bigserial)
IsClosed (Boolean default FALSE)
IsProtected (Boolean default FALSE)
ParentId (bigserial Default NULL)
IsMeta (Boolean Default FALSE)
IsDeleted (Boolean Default FALSE)
UNIQUE CONSTRAINT (Community,Title)

Comment
Table for the comments on posts, both questions and answers.

MemberId (bigserial)
PostId (bigserial)
ParentCommentId (bigserial DEFAULT NULL)
Body (Text)
Upvotes (bigint DEFAULT 0)
Downvotes (bigint) DEFAULT 0)
NetVotes (bigint) DEFAULT 0
Score (decimal DEFAULT 0)
IsDeleted (Boolean Default FALSE)
DeletedAt (Date)

CommentVotes

CommentId (bigserial)
MemberId (bigserial)
VoteTypesId (bigserial)
UNIQUE CONSTRAINT (CommentId,MemberId)

Tag
Table for all of the tags

CommunityId (bigserial)
Name (Text 100 NOT NULL)
Description (Text)
TagWiki (Text default NULL)
IsActive (Boolean default TRUE)
TagSynonymId (bigserial FK to Tag default NULL)
Usages (bigint)
UNIQUE CONSTRAINT (CommunityId,Name)

PostTag

PostId (bigserial)
TagId (bigserial)
UNIQUE CONSTRAINT (PostId,TagId)

VoteType
Table for the vote types, upvote/downvote.

Name (Text) UNIQUE

PostVote
The reason for this table is so that votes by spammers/serial voters can be undone.

PostId (bigserial)
VoteTypesId (bigserial)
MemberId (bigserial)
UNIQUE CONSTRAINT (PostId,MemberId)

1 Like
  • c/UNQIUE/UNIQUE
  • I have a hunch that FKs are actually considered bigint rather than bigserial since they are not themselves autincremented. That being said, in this Schema Proposal mode, bigserial is an indication that a field is really an FK.
  • Any FK that is not a field named “TableName” + “Id” should have a comment explaining it for clarity. For example, Post ParentId is an FK into Post. Actually, now I see you did that for some like TagSynonymId.
  • VoteType needs a value smallint field. (“Up”, 1), (“Down”, -1)
2 Likes

Sorry I didn’t notice this before, but there seems to be no field for a member’s real name. I’m assuming the “display name” is intended to be something short, and for some people, have nothing to do with their real name.

1 Like

Just a small itch, the IsMeta on the Posts table, wouldn’t it be better to have MetaQuestion and MetaAnswer as posts type than a boolean on the post itself (which allow extending types of posts to blog and announcement type for example as well as meta) ?.

2 Likes

I think that was intentional. Since this is an open to anyone system with no costs or employment or taxes etc., there is no functional need for real name.

  • As with SE, some (like me) will put their real name in their display name.
  • Some will use a pseudonym and others a totally fake name.
  • Some will put their real name in their profile text.
  • Some will have it indirectly via their website.
  • Some nothing at all for good reasons (especially in Workplace, Academia and IPS).

We have no business need for real name. We have no way to verify real name, especially on a free global system. Therefore no real name field.

2 Likes

One thing I notices make me worrying: There’s an IsEmailVerified field. This looks like something that doesn’t belong into the general database, but into the (separate!) accounts database, together with most other sensitive user data (if the email field here is meant as login email, as opposed to an optional display email, it also doesn’t belong into the main database).

As I wrote earlier (I don’t remember exactly where, but I think it was in the early Discord discussions), I’d actually prefer to have all the login handling to be done by a separate process. But definitely it should be a separate database (with a separate and different database password).

Whether email address & hashed password go in the same DB or different, I’ll leave to the security experts. I’m actually not so sure it is really a problem, but “whatever”. But “IsEmailVerified” is useful to have in the primary Member table as it is effectively a “user status” of importance for a variety of reasons and is not in-and-of-itself a security item. For example, having that lets admins know how many people have come in and registered accounts but never verified them, and therefore not used them. As opposed to users who registered & verified but never actually “did anything” - which is a different type of problem.

1 Like

IsMeta determines whether it shows on the main site or the meta site, answer/question/blog/announcement go in the PostType table and can show on either main or meta.

1 Like

Can we seperate the profile social media links to another table too? As in github, twitter, personal web. Cause tomorrow you get linkedbook or something and need a new column
I just want to minimize future schema changes

3 Likes

Real name isn’t a necessary field to have. Users who want to have it displayed on their profile can include it in their about me text, but having a separate field for it opens us up to some issues surrounding GDPR, so I’d rather not.

3 Likes

WRT a separate accounts database: I’d recommend against it. My personal experience recommends against it - it makes things harder for questionable gain, but if you’d rather hear the experts on it, there’s a question about it on Security.SE, which concludes much the same. Most ORMs aren’t set up to deal with multiple databases and multiple connections, which (among other issues, like the lack of JOINs, and the problem of how to connect the code-native data structures) makes the programming significantly more difficult.

The security gains are also minimal: in a well-set-up application in a good environment, there’ll be safeguards against common attacks like SQLi, XSS, and CSRF that do not rely on individual developers to remember them. The likelihood of an attacker gaining raw read-output access to the database is small.

What risk there is can also be mitigated in an easier manner: see the top answer on the question I linked. By setting up two database users with different column access to the same database, you can protect password (and other PII, if you wish) access without needing a separate database at all. The answer talks about MySQL, but I imagine a similar concept is possible in Postgres.

3 Likes

You don’t, but it’s something users might like.

I do that on SE too. If that’s how you intend users make their real identity obvious, then you have to make sure that the display has a sufficiently long maximum, and that the maximum length is properly shown in all cases where it is displayed.

On this forum, there is a separate display name and real name. I used something short for the display name, and my real name in the real name field.

Either way works, but you do need to realize that some people will want to be clear about who they are, just as other want to hide their identity.

That’s rather awkward.

Not if it’s totally optional. Looking back I wasn’t clear, but I only meant this as something you can fill out if you want to, just like on this forum.

1 Like

How would you feel about a pronouns field? Also something users might like, also totally optional, could also go in about me, could also open us up to GDPR.

That’s a misconception. If you have the field, you are obligated to recognise it individually and to take steps to protect the PII in the database, whether it’s public or not. If it’s part of a general field, you need only take minimal precautions.

2 Likes

Please no. People who want to share that info can put it in the bio. We’re not even providing a place to enter professional credentials, which would be way more relevant to evaluating someone’s posts, so having a field for pronouns seems like a misplaced priority to me.

4 Likes

That’s my thinking, too - I was applying much the same train of thought to the necessity of a field for real name vs. just putting it in the bio.

2 Likes

VoteType needs a value smallint field. (“Up”, 1), (“Down”, -1)

I was reluctant to add this to round 4 because there is a also an issue of how much the vote influences the reputation and what if we need to recalculate. It will require a little bit more math.
Plus it kind of looks out of place because every other FK is a bigserial

Since we agreed to not calculate scores as upvotes minus downvotes (and indeed, the scoring formula we agreed on does not contain a single minus sign), and since IIRC we also have a consensus to show upvotes and downvotes separately, giving downvotes the value −1 IMHO would be counterproductive.

I think we will be offering database dumps; that is, the database is not purely an implementation detail. If sensitive and public data are in different databases, the dump is a no-brainer: Just dump the public database, and don’t dump the sensitive database. Otherwise you’ll have to be careful to sanitize your dump; and one mistake may inadvertently make sensitive details public.

And don’t say we will be careful enough. I bet every single company that had a data leak thought the same. And the fact that data leaks are not uncommon, and not restricted to no-name companies, shows that you can never be careful enough.

On the use case of determining how many people registered but never confirmed their email, that one can be derived entirely from the sensitive data; no cross-database join needed.

1 Like

As far as I can see, that still doesn’t create a need to separate the databases - it’s also solved by the two-user approach with column permissions.

2 Likes

My point is that otherwise there is nothing, except embedded application code, to indicate what is an Upvote vs. a Downvote. We plan on having score (currently two types - net = Up - Down, and a calculated score based on the formula discussed previously) and both need to know what “Up” is and what “Down” is. You could hardcode it all in the code - in which case no need for a VoteTypes table, just hardcode 0 and 1 or 1 and 2 or -1 and 1 (depending on how you look at things). Or you can put a value 1 or -1 in the VoteTypes table. You can’t do "figure out the Upvotes by checking VoteType.name = ‘Up’ because that totally falls apart with any language customization.

I am not referring to SE-style 10 points. Just a mathematical 1 or -1 to make things simple to recalculate - e.g., drop a Member and all associated Votes and you can recalculate any affected Q/A/etc. by Post.NetVote = SUM(VoteType.Value) FOR PostVote.PostId = Post.Id etc.