Schema Proposal Round 2

cbrumbaugh · 1 January 2020 18:21

I read through the feedback and updated the naming convention and the schema. Here is the new updated one.

Users
This table will hold the global user records for an Codidact Instance. A user should only have one email to login with, that would be stored here. I am leaving the password storage and hashing to the experts.

DisplayName (Text 100)
Bio (Text)
Personal Website (Text 255)
Github (Text 100)
Twitter (Text 100)
Email (Text 255) UNQIUE
GravatarHash (Text)
Location (Text 100)
IsFromStackExchange (Boolean default FALSE)
StackExchangeId (Bigint)
IsSameAsInstance (Boolean default TRUE)
StackExchangeValidated (Date)
StackExchangeLastImported (Date)
IsEmailVerified (Boolean default FALSE)
IsGlobalSuspended (Boolean default FALSE)
GlobalSuspensionEndDate (DATE)

Communities
Table for each of the individual sites inside of a Codidact installation

Name (Text 40) UNQIUE
Tagline (Text 100) UNQIUE
URL (Text 255) UNQIUE
HelpText (Text)
FAQ (Text)
IsBeta (Boolean default FALSE)

TrustLevel
This needs to be specific to a site for when we get different language sites and it needs to be localized.

CommunitiesId (bigserial)
Name (Tex t 100) UNQIUE
Explanation (Text) UNQIUE

UsersCommunities

A user may want to join one site and not another, and having a join table between users and sites allows a user to be suspended or gain/lose privileges on a single site. I listed the privilege booleans in the order of lowest to highest, the privileges after IsModerator aren’t listed because we probably are going to take any of moderator’s lower privileges away without demodding them.

CommunitiesId (bigserial)
DisplayName (Text 100)
Bio (Text)
Personal Website (Text 255)
Github (Text 100)
Twitter (Text 100)
Email (Text 255) UNQIUE
GravatarHash (Text)
Location (Text 100)
IsFromStackExchange (Boolean default FALSE)
StackExchangeId (Bigint)
IsSameAsInstance (Boolean default TRUE)
StackExchangeValidated (Date)
StackExchangeLastImported (Date)
SuspensionEndDate (Date)
IsSuspended (Boolean default FALSE)
TrustLevelId (bigserial)
CanAsk (Boolean default TRUE)
CanAnswer (Boolean default TRUE)
CanComment (Boolean default FALSE)
CanSuggestEdits (Boolean default FALSE)
CanFlag (Boolean default FALSE)
CanUpvote (Boolean default FALSE)
CanDownvote (Boolean default FALSE)
CanCloseVote (Boolean default FALSE)
CanEdit (Boolean default FALSE)
CanReview (Boolean default FALSE)
CanProtectPosts (Boolean default FALSE)
CanLockPosts (Boolean default FALSE)
CanSoftDeletePosts (Boolean default FALSE)
IsModerator (Boolean default FALSE)
IsDeleted (Boolean Default FALSE)
UNIQUE CONSTRAINT (UsersId,CommunitiesId)

UsersSuspensions
This table is for recording the specific privilege suspensions. The idea is that we will have an scheduled task running nightly that once the suspension is over will set the correct flag on the UsersCommunities table.

CommunitiesId (bigserial)
UsersId (bigserial) UNQIUE
QuestionBanStartDate (date)
AnswerBanStartDate (date)
CommentBanStartDate (date)
SuggestedEditBanStartDate (date)
FlagBanStartDate (date)
UpvoteBanStartDate (date)
DownvoteBanStartDate (date)
EditBanStartDate (date)
ReviewBanStartDate (date)
ProtectBanStartDate (date)
LockPostsBanStartDate (date)
SoftDeleteBanStartDate (date)
QuestionBanEndDate (date)
AnswerBanEndDate (date)
CommentBanEndDate (date)
SuggestedEditBanEndDate (date)
FlagBanEndDate (date)
UpvoteBanEndDate (date)
DownvoteBanEndDate (date)
EditBanEndDate (date)
ReviewBanEndDate (date)
ProtectBanEndDate (date)
LockPostsBanEndDate (date)
SoftDeleteBanEndDate (date)

Posts
I thought about splitting into a Answers table and and QuestionsTable but doing it in the same table lets comments have a PostsId instead of a QuestionsId and a AnswersId. Meta posts are denoted by the IsMeta column. Still unsure on community wiki/FAQ posts.

UsersId (bigserial)
CommunitiesId (bigserial)
Title (Text 255 Default NULL)
Body (Text)
Upvotes (bigint)
Downvotes (bigint)
NetVotes (bigint)
Score (decimal)
IsAccepted (Boolean Default FALSE)
IsQuestion (Boolean default TRUE)
IsAnswer (Boolean default FALSE)
IsBlog (Boolean default FALSE)
IsCanonical (Boolean default FALSE)
IsClosed (Boolean default FALSE)
IsProtected (Boolean default FALSE)
ParentId (bigserial Default NULL)
IsMeta (Boolean Default FALSE)
IsDeleted (Boolean Default FALSE)
UNIQUE CONSTRAINT (CommunitiesId,Title)

Comments
Table for the comments on posts, both questions and answers.

UsersId (bigserial)
PostsId (bigserial)
ParentCommentId (bigserial DEFAULT NULL)
Body (Text)
Upvotes (bigint DEFAULT 0)
Downvotes (bigint) DEFAULT 0)
NetVotes (bigint) DEFAULT 0
Score (decimal DEFAULT 0)
IsDeleted (Boolean Default FALSE)
DeletedAt (Date)

CommentsVotes

CommentsId (bigserial)
UsersId (bigserial)
VoteTypesId (bigserial)
UNIQUE CONSTRAINT (CommentsId,UsersId)

Tags
Table for all of the tags

CommunitiesId (bigserial)
Name (Text 100 NOT NULL)
Description (Text)
TagWiki (Text default NULL)
IsActive (Boolean default TRUE)
TagsSynonymId (bigserial FK to Tags default NULL)
Usages (bigint)

PostsTags

PostsId (bigserial)
TagsId (bigserial)
UNIQUE CONSTRAINT (PostsId,TagsId)

VoteTypes
Table for the vote types, upvote/downvote.

Name (Text)

PostVotes
The reason for this table is so that votes by spammers/serial voters can be undone.

PostsId (bigserial)
VoteTypesId (bigserial)
UsersId (bigserial)
UNIQUE CONSTRAINT (PostsId,UsersId)

cwellsx · 1 January 2020 18:53

You ignored my previous suggestion that you separate Posts and Topics – that might be important for a ‘normal’ database.

And include the date of a post – essential, no?

And several categories instead of only a boolean IsMeta (etc.).

One good thing about the current Posts structure (i.e. with ParentId which is presumably a reference to another Post) is it would support an n-level tree of posts – instead of only 2-level i.e. “topics contain posts”. Is that good though, and done on purpose?

cbrumbaugh · 1 January 2020 19:01

What’s the difference between Posts and Topics? I thought about splitting questions and answers but that makes the comments harder.

Part of the convention is that every table has a CreatedAt and a LastModifiedAt date fields so that is already there. Unless we want drafts in which case the time posted wouldn’t be the created date hmmm.

Are these the categories you are refering to?

IsQuestion (Boolean default TRUE)
IsAnswer (Boolean default FALSE)
IsBlog (Boolean default FALSE)
IsCanonical (Boolean default FALSE)

The posts structure is so that an answer can be linked to its question.

misha130 · 1 January 2020 19:05

People in the community don’t want to use enums or flags instead of adding columns?
Instead of:

Becomes this:

TABLE UserSocialMedia
SocialMediaType (INT)
Link (TEXT)

and

CanAsk (Boolean default TRUE)
CanAnswer (Boolean default TRUE)
CanComment (Boolean default FALSE)
CanSuggestEdits (Boolean default FALSE)
CanFlag (Boolean default FALSE)
CanUpvote (Boolean default FALSE)
CanDownvote (Boolean default FALSE)
CanCloseVote (Boolean default FALSE)
CanEdit (Boolean default FALSE)
CanReview (Boolean default FALSE)
CanProtectPosts (Boolean default FALSE)
CanLockPosts (Boolean default FALSE)
CanSoftDeletePosts (Boolean default FALSE)

Into

TABLE UserPermissions
UserId (BIGINT)
Permission (INT)

Also some of the bans don’t have an end date atleast on SO.
Like question bans, if you get banned you dont get unbanned after a specific amount of time but you get unbanned if you just keep answering questions and getting rep in a certain amount of time.

cwellsx · 1 January 2020 19:12

If it’s a simple non-tree structure like SE then:

You can comment on a question or an answer (therefore you said you want to call these both “a post”)
Posts have an author, a body, a date, votes, revision history, etc.

Conversely, Topics must have …

At least one first post (which maybe or may not be a question) and possibly more
A title
At least one tag

… whereas Posts cannot.

Similarly a post’s category – whether it is meta – might depend on which topic it belongs to (so IsMeta might be a Topic property not a Post property).

Therefore …

There should be Topics table
Title (and maybe IsMeta, also even CommunitiesId ) should be in Topics and in not Posts
The Post.ParentId should be a foreign key into the Topics table
The PostsTags should be the TopicsTags table instead
AcceptedId could be a foreign key in Topics into Posts (instead of IsAccepted being in Posts)

… that would be simple.

The schema you proposed is potentially far more general/flexible – it allows or can model an n-level tree – which is admirably general but also possibly error-prone – i.e. it defines a big gap between how the data is stored (i.e. what data exists), and how it’s presented (e.g. as topics each with an list of posts instead of a multi-threaded tree) – I don’t know whether that was intentional, maybe it was.

cbrumbaugh · 1 January 2020 19:20

IDK, I threw this together last night and then updated it this morning with the feedback. Maybe enums are better and we should use those. I think that’s worth its own topic where we can discuss the pros and cons.

misha130 · 1 January 2020 19:33

Also shouldn’t most of these tables have the same basic 3 columns that are consistent on all tables:
Like row creation date, row update date & a soft deletion column?

cbrumbaugh · 1 January 2020 19:34

Yes, see the convention here Proposal - DB Naming Conventions I didn’t write those columns out here because they are assumed to be on the table.

misha130 · 1 January 2020 19:36

Well in that case you should change the name of this column because its already the one defined in the name conventions you linked.

manassehkatz · 1 January 2020 20:21

@cwellsx I agree with @cbrumbaugh 's concept. A Question, an Answer, a Blog or a Canonical post are really more the same than different. They all have:

User
Community
Body
Linked Comments
Can be Closed/Deleted/etc.
Need an Edit History and Edit Reviews.
Have Votes Up/Down
Can be Searched together

There are some differences:

Title is not in Answers - but is in all the others.
Tags are not in Answers - but are in all the others.
Answers need a parent Question - the others do not.
Conversely - Questions can have Answers, the others can not.

But overall, the differences are relatively minor - and having a few extra fields hanging on with Answers that will never get used are a small price to pay for much simpler Searching and other common code.

cwellsx · 1 January 2020 21:11

I don’t see why “searching” is much simpler:

If there were separate Posts and Topics, and the UI were like Discourse, then the main page selects topics (not posts).
Or if the UI were like SE’s, then the main page selects topics joined to one post (either the topic’s first post, or its “active” i.e. most-recently-edited post).
If you’re on a page like this one (i.e. showing the details of one selected topic) then the back end selects the topic plus all posts associated with that topic.

Still, have it your way and I’m glad you understood what I was saying at least – or do you?

Having extra stuff in posts (e.g. Title and IsMeta in answer records) means the database doesn’t (even) satisfy second normal form.

I’m all for breaking rules, am I not, but if you’re sure you want to do that this early I imagined more reasoning than this,

I thought about splitting into a Answers table and and QuestionsTable but doing it in the same table lets comments have a PostsId instead of a QuestionsId and a AnswersId. Meta posts are denoted by the IsMeta column. Still unsure on community wiki/FAQ posts.

Maybe we’ve just got different ideas about DB design – mine’s too purist or something.

Given What kind of database should we use? I thought the schema wouldn’t attempt to support n-level trees, yet this one does.

As you say, maybe that’s right though.

manassehkatz · 1 January 2020 21:31

What I’m referring to is more like searching for “any question or answer in a community that has 'xyz” - like the Search field at the top of SE pages. That searches through Questions and Answers - and arguably would search through Blogs and Canonical posts if they existed. Keeping them all in one table makes that easy. And searching for the Home page - just Questions - adds one field to the search (IsQuestion = True or PostType = “Q” or whatever we settle on). And searching for everything in one specific Question is easy “(Id = x) or (ParentId = x)” or something like that. With separate tables it gets messier.

As far as the rest - I am not a DB purist or an expert in RDBMS theory. But I’ve built plenty of DBs and learned a lot the hard way (this stuff didn’t exist in the early 80s when I was in school).

cwellsx · 1 January 2020 21:55

Yes I accepted the theory that questions and answers have a lot in common (i.e. they’re “posts”).

I don’t think answers are topics though – or are they? – so topic-related data (titles, tags, meta-categories, …) could and therefore should be removed from the posts table, is what I was saying.

Come on, I sure you can cope with an elementary JOIN – you’re going to have to anyway, whether it’s joining two separate tables, or in the present schema it’s a self-join .

Me neither.

So you went to school in the 80? “You were lucky!” In the 70s I dreamed of one day growing up to work for some company big enough to own a computer – so that I could program one instead of only reading about programming. I graduated in the early 80s (never using a computer at uni), first job was help build packet switched data network software for a national telco (the proto-internet).

But I read like some of Joe Celko’s books in the 90s – like, SQL for Smarties, see its bit there on Normalization – when I had to act as a DB developer.

manassehkatz · 1 January 2020 22:36

As a kid in the 70s started programming BASIC on an ASR33 connected to my Dad’s company’s minicomputer. 80s at University of Maryland - perfect timing - just missed having to use punch cards, everything on terminals, beginning of PCs, etc.

But back to the topic: I see what you’re saying, even if I disagree with some of it. But some of what I have said is (I think) echoing the initial Schema Proposal from @cbrumbaugh I’d like to see what @JackDouglas and some others have to say about all of this.

misha130 · 2 January 2020 00:50

Made a thread per your suggestion:

cellio · 2 January 2020 01:54

I hadn’t read all of the first version before you posted this second version, so I’m reviewing this without having read all that. Sorry if I’m repeating something you already addressed.

Some of those fields, like the ones about SE accounts and importing data, seem like they should be at the community level, not the instance level. I might have imported my data from Writing but not (yet) from Worldbuilding.

What is IsSameAsInstance in this instance-level table?

What are HelpText and FAQ – literal text or URLs or what? Ultimately a community will have many help topics, many of which will be common across the instance.

IsBeta implies that there are two kinds of sites, beta and not. We haven’t talked about beta-ness at all yet, and I think one thing we’ve learned from SE is that there are more than two natural stages to a community. If we have the notion of site types/stages let’s make that an enum so we can adjust it when we need to.

The ID column here is the actual trust level, right? (0 through 6 as currently proposed.)

Count me with those wondering whether Can[many things] is the best approach. I was expecting something like:

TrustLevelId - which is what drives baseline privileges
Restrictions - some representation (waves hands) of overrides, like temporary edit bans (I guess this is your UsersSuspensions table?)

And then a table for trust-level info that says what privileges come with each level.

For faster lookups, you can then create a view that joins all this together. Maybe the table you’re proposing here is that view and the underlying implementation is not exactly as shown here? Sorry if these are naive questions.

Please plan for there being more types of posts in the future. I think all of the following are Posts:

Questions
Answers
Discussions (a type of post useful on meta)
Blog posts
Canonical posts
Tag wikis
Election nominations
Help topics

All of these are, at their core, blobs of text. All have authors and edit histories.
Most have titles. Most are votable. Some have comments.

Should we be thinking of posts as having a type column rather than a bunch of IsSomething columns that mean we have to change the table definition when we add a post type? (Oops, I guess this is that other conversation again.)

IsMeta is good, because it allows (in principle) any type of post to be meta or not. Meta announcements might be, under the covers, blog posts with IsMeta set, for example.

And anything else that takes comments (election posts, blog posts, etc).

I don’t understand this one. What’s this for?

General question: how does edit history get stored? Posts, comments, and user profiles should keep history. Is each revision of something a new row in the corresponding table, and the server code is responsible for finding and displaying the latest one and making the rest available via history links where applicable?

cbrumbaugh · 2 January 2020 02:25

These fields are duplicated at both the Users and UsersCommunities table in order to take care of both someone wanting to import all, and someone wanting to import just a specific community. That’s also why the Bio field etc. is on both places.

This was proposed by manassehkatz

Most of those fields are in order to have a “master” user definition used when a user connects with a new community or in an existing community sets “IsSameAsInstance”.

here

I would say that those are the starting help text fields, as you mentioned we will probably have more than just the one, but that structure hasn’t been fleshed out yet.

This makes a lot of sense will change it next iteration.

I think I goofed up here, there should be a TrustLevel table where the id column is the actual level and then a TrustLevelCommunity table for each community to customize the text. This would also be where the customized how many items it would take to progress to a level.

With more time to think about this, I am leaning towards reference tables too.

VoteTypes is a reference table for votes, upvote and downvote right now.

See this,

cellio · 2 January 2020 02:49

Suppose I have 20 communities on SE, two of which have Codidact analogues. How is this flag set at the instance level? What happens when a third community gets created and I import data to that?

Tracking data import makes sense at the community level, but I’m having trouble seeing how it makes sense at the instance level. There’s no “import all my SE data” notion; we can only import from and to specific communities, which have to exist.

Ah, that makes sense. Since that’s a column that is only meaningful at the community level and not the instance level, I’m confused by its presence in the instance-level user table.

I’d rather leave help and FAQ out of the schema definition for now, until we work out how we really want these to work. There might end up being a Help table where rows are posts for individual topics (with columns for categories and some way to manage presentation order). I don’t think “help” is one column, and I don’t know that “FAQ” is a separate concept.

On edit history, remember that we’ll need to access that pretty frequently, which is another reason for it not to be off in another entire database. We’re already discussing this on that other thread, though, so I won’t continue that thread here.

Finally, thanks for all your work on this! I apologize for not leading with that.

manassehkatz · 2 January 2020 03:20

Mistake. It should only be in UsersCommunities - to indicate that a particular user says “make all my profile stuff for this community same as at my top-level (Instance) profile”.

We could leave them out. My thought was that every Community will need to have at least one Help and at least one FAQ (both Markdown blobs), each of which could be linked to additional pages, etc.

If we plan from the start for more than one, then each could be a PostsId not NULL.

That’s fine with me.

The one disadvantage of a PostTypesId field is a drop of code needed at the start of the program to “read the PostTypesIds for Q, A, Blog, etc.” because there will be a lot of code specific to each particular type of post and it will need to know “For Questions, SELECT * WHERE PostTypesId = 3” because throwing an extra “JOIN PostTypes ON Posts.PostTypeId = PostTypes.Id WHERE PostTypes.Type = ‘Q’” on every PostType-specific query is technically correct but does not make sense.

That’s the whole discussion about history tables (same schema/DB or separate, all fields or just changed fields, backup/restore vs. history, etc.) In my mind, every revision is a new history row in a separate history table “somewhere” but the server code doesn’t find the latest, it goes to the primary table and retrieves the only copy from there. Only when actual history is needed (view Edit History, research User changes, etc.) do we read from the history tables.

cellio · 2 January 2020 03:38

I think we should assume that each community will have multiple help topics, some of which will be inherited from the instance (how do to X) and some of which will be customized (scope). I’m not a big fan of premeditated FAQs, which usually aren’t really FAQs but just things a site wants to push at users. Emergent FAQs are different, but we have to wait for them to emerge.

Given that there will be multiple topics, I think a HelpTopics table makes sense. If there’s only one entry in the MVP that’s fine, but there will be more before too much longer.

Right. I wasn’t assuming that would be burdensome. I mean, SO stores post types separately like this, and if they can handle that at their scale, we should be fine, right?

My naive preference is to have a history table for each table where we track history, in the same schema in the same database. The primary table always has the current version; the history table stores prior revisions when they exist, with postID+revision# together forming a unique ID (key?) for the row. Does it need to be more complicated than that?