Schema Proposal Round 4

Changes this iteration:

  • Changed the naming convention from PascalCase to snake_case
  • Added a reference tables for social media for the members table.

member
This table will hold the global user records for an Codidact Instance. A user should only have one email to login with, that would be stored here. I am leaving the password storage and hashing to the experts.

display_name (text 100)
bio (text)
email (text 255) UNQIUE
gravatar_hash (text)
location (text 100)
is_from_stack_exchange (boolean default FALSE)
stack_exchange_id (bigint)
stack_exchange_validated (date)
stack_exchange_last_imported (date)
is_email_verified (boolean default FALSE)
is_suspended (boolean default FALSE)
suspension_end_date (date)

community
Table for each of the individual sites inside of a Codidact installation

name (text 40) UNQIUE
tagline (text 100) UNQIUE
url (text 255) UNQIUE
help_text (text)
faq (text)
status (enum)

trust_level
This needs to be specific to a site for when we get different language sites and it needs to be localized.

name (text 100) UNQIUE
explanation (text) UNQIUE

trust_level_community

community_id (bigserial)
is_same_as_instance (boolean default TRUE)
name (text 100) UNQIUE
explanation (text) UNQIUE

member_community

A user may want to join one site and not another, and having a join table between users and sites allows a user to be suspended or gain/lose privileges on a single site. I listed the privilege booleans in the order of lowest to highest, the privileges after is_moderator aren’t listed because we probably are going to take any of moderator’s lower privileges away without demodding them.

member_id (bigserial)
community_id (bigserial)
display_name (text 100)
bio (text)
email (text 255) UNQIUE
gravatar_hash (text)
location (text 100)
is_from_stack_exchange (boolean default FALSE)
stack_exchange_id (bigint)
is_same_as_instance (boolean default TRUE)
stack_exchange_validated (date)
stack_exchange_last_imported (date)
is_suspended (boolean default FALSE)
suspension_end_date (date)
trust_level_community_id (bigserial)
is_moderator (boolean default FALSE)
is_deleted (boolean default FALSE)
UNIQUE CONSTRAINT (member_id, community_id)

social_media
The types of social media that the member can display in his profile

name (text 100)

member_social_media
The social media that the member would like to display in his profile per community

social_media_id (bigserial)
member_id (bigserial)
community_id (bigserial NULL)
url (text)
UNIQUE CONSTRAINT (community_id,member_id,social_media_id)

privilege
Table for privileges

name (text 100)
description (text)

member_community_privilege
For recording which members have which privilege in a community. If a member has a privilege suspended, then that is also recorded here, and a nightly task will auto undo the suspension once the privelege_suspension_end_date has passed.

community_id (bigserial)
member_id (bigserial)
privilege_id (bigserial)
is_suspended (boolean default FALSE)
privilege_suspension_start_date (date)
privelege_suspension_end_date (date)
UNIQUE CONSTRAINT (community_id,member_id, privilege_id)

post_type
Records the type of post, question/answer/blog etc

name (text 100)
description (text)

post
I thought about splitting into a answers table and and questions_table but doing it in the same table lets comments have a post_id instead of a questions_id and a answers_id. Meta posts are denoted by the is_meta column. Type of post is determined by the post_type_id

member_id (bigserial)
community_id (bigserial)
title (text 255 default NULL)
body (text)
upvotes (bigint)
downvotes (bigint)
net_votes (bigint)
score (decimal)
is_accepted (boolean default FALSE)
post_type_id (bigserial)
is_closed (boolean default FALSE)
is_protected (boolean default FALSE)
parent_id (bigserial FK to post default NULL)
is_meta (boolean default FALSE)
is_deleted (boolean default FALSE)
UNIQUE CONSTRAINT (community_id, title)

comment
Table for the comments on posts, both questions and answers.

member_id (bigserial)
post_id (bigserial)
parent_comment_id (bigserial FK to comment default NULL)
body (text)
upvotes (bigint default 0)
downvotes (bigint) default 0)
net_votes (bigint) default 0
score (decimal default 0)
is_deleted (boolean Default FALSE)
deleted_at (date)

comment_votes

comment_id (bigserial)
member_id (bigserial)
vote_types_id (bigserial)
UNIQUE CONSTRAINT (comment_id, member_id)

tag
Table for all of the tags

community_id (bigserial)
name (text 100)
description (text)
tag_wiki (text default NULL)
is_active (boolean default TRUE)
tag_synonym_id (bigserial FK to tag default NULL)
usages (bigint)
UNIQUE CONSTRAINT (community_id, name)

post_tag

post_id (bigserial)
tag_id (bigserial)
UNIQUE CONSTRAINT (post_id, tag_id)

vote_type
Table for the vote types, upvote/downvote.

name (text) UNIQUE

post_vote
The reason for this table is so that votes by spammers/serial voters can be undone.

post_id (bigserial)
vote_type_id (bigserial)
member_id (bigserial)
UNIQUE CONSTRAINT (post_id, member_id)

1 Like

While writing this up I noticed something missing.

Maybe I misread the road map but isn’t there supposed to be a reputation system and actions that hold a certain amount of reputation?

For instance an upvote on a question is +5 & on an answer is +10 and an edit is +2, etc.
I don’t see this represented in the database at all.

A basic idea would be to have a table called “post_action” instead of “post_vote” and “post_action_type” to represent how much each action on a post would do.

Reputation was deliberately left out at this point, I don’t think there is a consensus yet if there is even going to be a point system.

2 Likes

This is just something I noticed, will comment more extensively later.

Seems to miss a link, right?


Regarding reputation: I think we have consensus not to have a reputation system, but a trust level system:

1 Like

A trust system does not mean that we we won’t have a rep system, so far it just looks like privileges will be tied to actions and not to rep.

Given how few people on SE actually use their privileges, I doubt that gaining privileges will be a motivator like rep is.

1 Like

How do we know that? For example, I very much value being able to Edit without Review Queue/Approval Process on the sites where I have gained enough Rep. While I like seeing the Rep count up, part of the reason is that I know it helps get me that privilege. A bunch of privileges I never use, but some I use quite a bit. It actually, to SE’s credit, flows naturally in many ways - actions lead to reputation lead to privileges lead to more actions.

3 Likes

For example on Outdoors.SE

  • Of those who could upvote 4572/9398 → 49% did at least once.
  • of those who could downvote 430/2057 → 21% did at least once.

From memory there are like 6 people who will cast delete votes and like 10 who will review.

1 Like

Thanks for all your work on this! This looks pretty clear and logical to me. I have a few comments/questions.

Quotes from table definitions are edited down to the columns I’m talking about in this comment.

I think 100 characters is short for a tagline (if you mean the short descriptions like on SE), and 255 seems really long for a URL. Or is that what help_text is? I mean the lines like “for writers/artists using science, geography and culture to construct imaginary worlds and settings”.

I don’t think faq belongs here.

The value won’t always be a URL. For Twitter I’d expect a user name. Maybe rename url to value? (Is value a reserved word?) Then the semantics of the value depend on the platform.

I expected the author of a post to be populated in the created_by column. What is member_id here if it’s something other than the author?

We might also have locks, and locks might come in different flavors. Do we need the broader concept of “restrictions” that go in their own table and are linked here? That way we aren’t changing the definition of post any time we add some new restriction.

So different questions with the same title get blocked at the DB, instead of potentially having duplicate questions with different post IDs. That seems fine for initial creation, but what happens when somebody edits a title and unknowingly duplicates an existing question? Will the front end appear to work and then there’ll be an error from the DB through the server and back to the user? (No criticism; just asking.)

Think about meta titles like “why was this question closed?”. Those are bad titles, but this happens fairly often.

And deleted_by?

There might be value in having tag_wiki be a post, so we can track creation time/user, modification history, and other stuff. Yeah, tag wikis don’t get votes, but not all posts can be closed either (only questions), so that seems ok?

1 Like

Early on there was consensus to not have a single reputation number, and later there was some feedback critical of that idea. If we have reputation it will presumably be just a display number (privileges are handled separately). I’ve no objection to designing with the idea that we’ll have that number, deferring the question of whether to use it.

If we had DB support for reputation, what would that entail? A rep score on the user (for easy access), a table of what things produce what amount of rep, and what else? If rep comes only from votes then I don’t think we need anything else; if we wanted to give rep for other things (like edits) we’d need to do more work, I think.

1 Like

That is short for a tagline, and IDK about the URL. Faq may indeed need to be moved.

For all our columns that just hold the name of the thing, I would suggest content because that describes what it is, without being a reserved word.

In this case yes its the same thing, but that’s not always the case across the DB, and for the sake of consistency I would rather have it here.

It might be worth expanding this to a status and a post_status table. That adds two joins but we can add as many statuses as we want.

That’s up the UI to show an intelligent error message that explains why the edit won’t go through. Otherwise people think the system broke.

At least right now, in the proposed history table we are tracking all changes to all tables. Eventually that’s probably not going to be the case as we realize that some changes don’t need to be tracked, but just because its tag table doesn’t mean we won’t have a full edit history.

1 Like

I think all it would take is an extra column on the member table called “rep” and then a community specific table so that each community could adjust the point values of different actions. I know CodeGolf wanted question upvotes to be equal to answer upvotes for a long time.

2 Likes

Yes, good idea – “status” is much better than “restrictions”.

I was more asking how hard (and performant) it would be for the UI to handle that. I guess you always have to ask the DB if it’s (still) ok to make that change, though – maybe there was an edit collision, or maybe the post was just locked or deleted, etc.

Oh, there’s one history table for everything? OK, I thought history tables were paired with regular tables (and thus we’d decide which ones warranted keeping history). Keeping everything now, and we can always prune later if we decide that’s not necessary, seems fine.

Does the history table include the creation event? Because I think we do still want to know who created a tag wiki and when.

1 Like

Actually, I think we had discussed previously that both help_text and faq will end up being one-or-more and likely variants of Post or similar to have the full flexibility needed.

Tagline should be longer - 255 sounds good to me.

URL generically is considered a fairly long field. But that is where url might be https://www.example.com/section/page/?parameters=included_because_no_proper_structure For our purposes, URL here (as opposed to some other parts of the system) is the URL of the homepage which will be something short like https://writing.codidact.com and 100 characters is plenty for that.

Agree that URL is not the right term at all for social media - @cbrumbaugh suggested content - that sounds as good as anything.

There are a bunch of reasons, including:

  • Posts that get migrated between communities will have the created_by assigned to the Moderator who pushed the button but member_id will be the real author
  • Posts that are imported through some automatic process from SE or elsewhere will have created_by assigned to the automated process but member_id will be the real author (when known).

This (as with some of the other issues raised by others) has to do with the automatic fields (created by/time, modified by/time, audit trail stuff, etc.) being a step removed in some cases from “visible actions by users”. This is normal for a big database.

Yes, we will have variants and additional status types, etc. But the idea is that the main Post record will contain a lot of stuff that allows the most common operations to have useful information without extra joins, etc. is_closed will likely be a filter automatically applied to a lot of standard views - e.g., the home page will probably always skip anything marked as is_closed. When is_closed is being included in a display then another query can find out all the details. Similarly, is_protected by itself is enough if we want to show that on the home page (some visible indicator) and then when the full details of the question are displayed the details can be retrieved (along with Answers, Comments and everything else).

Actually, this can come up in a bunch of ways:

  • New user comes over from SE and says “retrieve all my stuff” - a new question not previously retrieved has a title that conflicts with something already added in Codidact.
  • A question is deleted. Normal users can’t see it. User tries to add a new question with the same title - which might even be because OP tried previously, had a horrible question that got deleted and is now coming back (having learned how to ask questions better) and uses (appropriately) the same title.

I noticed this previously and wasn’t sure. Now I am sure remove this constraint as it is too restrictive.

I’m not so sure about that. Unlike most other posts, Tag Wikis not only don’t get votes, they also don’t get comments or get deleted (a tag may get deleted, but that’s different). There are other instances (e.g., Member profile text) where a text field will have Markdown capabilities + History. Actually, almost everything will have History. It is fine.

1 Like

History table (as I define it, though not everyone agrees):

  • Paired with almost every table in the system
  • Named with a consistent prefix or suffix (e.g., history_whatever or whatever_audit or something like that).
  • Includes every field in the original table plus an autoincrement Id (a few ways to handle the Id name conflict - could give the 2nd table history_id or could name the original id to be whatever_id - each has advantages)
  • Record inserted when an insert is done in the original table
  • Record inserted when any changes are made in the original table, containing all fields of the changed record, whether changed or not.
  • While technically slightly more efficient (especially with unchanged text fields) to only include changed fields, this should include all fields as that will allow easy rollback (think “spam undo”) and comparison is easy either way (just have the history view function skip any fields that match).
1 Like

That’s not how I understand it. There was a discussion about rep (or whatever it might be called), and I don’t think there was a consensus. My observation was that while some individual sites might not want it, it will be important to others. This therefore needs to be in the database from the start.

I agree that the method of computing rep, when enabled at all, has not been decided. It may be subject to per-site tweak parameters. However, this discussion is about the database, so the rep score needs to be in there per user per site.

This I think is fine. We can handle the type of link it is in the business data. Rather than it sometimes being url sometimes username, I think it rather just be all URLs.

Agreed

So post_status table. Sounds good.

Also agreed. Saw this and it seemed “smelly” if you don’t mind the technical jargon.

It was proposed in the DB Naming convention round 2 that all tables should have a deleted_by by standard

I think in general we should have a history of all the actions done and a table to how much reputation it gives to each action. We can later decide if to display rep, only display trust level, how to calculate etc. These kind of question are related to business logic.

But for the sake of data persistence we should have these an actions table and a value per action table at least. Also a column that was mentioned here for rep on the member per community.

Also maybe move history to another database? It will kill the storage pretty quickly (terras of data are expensive to keep on a cloud)

History needs to be readily available for things like showing edit history on a post and certain moderator tools. But I also don’t think we’re talking about terabytes of data, at least in our first several years of operation. If we do this right, we can refactor parts of the DB when we need to without disrupting the application code that uses is (because we’ll have an API between them).

1 Like

Storage is cheap these days. And the data that will grow the most with history - editing of Q & A - will absolutely need to be database accessible (as opposed to slow-access-archive). Using AWS RDS Multi-AZ (aka cloud mirrored database storage) as an example, we’re looking at $0.23/GB/month. So 100 GB x 0.23 x 12 = $276/year. Not peanuts, but not a huge amount in the grand scheme of things. Images are where the big storage space is needed (a picture is actually worth quite a bit more than 1,000 words) and the pricing there is on the order of $0.023/GB, so 1 TB/year for the same $276. That 1 TB is enough for somewhere in the range of 100,000 - 500,000 images - and more if we compress to a reasonable level on initial upload.

1 Like

As I understood Menasheh wasn’t talking about a post_history table but rather a log of everything that is happening in the system. Every record that was inserted/deleted/updated.
My comment about the history being in another database was regarding Menashes suggestion.

But also I think we are missing a post_history table that should be in the same database.

1 Like