Database schema (Round 8) - wip

mcalex · 25 March 2020 18:14

Following discussion with @misha130 last weekend, I have made changes to the pgModel to generate the initial database.

From the Round 7 discussion:
Added category_tag_set table
Changed social_media_type to external_association_type
Added ‘soft delete’ fields to all tables - I expect some (enum type) tables don’t want this, but I figured it was easier to add to everything and delete what isn’t wanted later
Added duplicate_post_id and close_reason_id to post_vote

Not from the Round 7 discussion:
Add a column “internal_id (text 50)” to enum tables (setting, privilege, post_type, vote_type)
Why can’t the dsiplay_name be used for querying (or why will internal_id remain consistent when display_name won’t). Also, with an entity framework solution there will be very little sql readable, or otherwise.

Delete the post_status table
This will be useful if we want to maintain post status history (outside of audit). If we don’t want then then it’s easy to delete and make the changes following

Following are the tables from the latest Database schema proposals.
Table images below show the regular and the audit (history) table. I’ve pasted images as a) retyping all field names didn’t interest me, and b) you get to see the constraints as well and check consistency between data and audit this way.

Tag:

Category_TagSet:

Category:

Privilege:

Modified Privilege:

Post_Tag:

TrustLevel:

Member_Privilege:

Post:

Modified Post:

Member:

Post_Vote:

ExternalAssociationType:

Post_Status:

Member_ExternalAssociationType:

Comment_Vote:

Vote_Type:

Modified Vote_Type:

Post_Type:

Modified Post_Type:

Setting:

Modified Setting:

Comment:

Post_StatusType:

Category_PostType:

Post_DuplicatePost:

New tables:
MemberAnnotation:

AnnotationType:

CloseReason:

CloseReason_Subreason:

Notice:

Outsatanding issues:
Cascade rules for ON_DELETE
~~Member table has a network_account_id column. We need a network_account table.~~
~~Post_Vote table has a close_reason_id column. We need a close_reason table~~
There are other unanswered questions in the Round 7 thread. These can be added as issues to github.

SQL file generated by pgModeler and modified version for generated (calculated) fields added to:
https://github.com/mcalexster/mcalex_codidact/tree/mcalex/skeleton/Database/Scripts/base

luap42 · 25 March 2020 18:53

The problem is, that display_name is often configurable by the communities (e.g. rename of badges, translations), whilst we still need to be able to access them consistently for functionality reasons (e.g. badge queries). For readibility in these cases, there’s the internal_id.

Can you explain to me, why we can’t store everything post status related (deleted, closed, locked, …) on post? They would be stored in history, too, via post_history, right?

I guess that the network id will be from an other table and it should therefore be no fk, but should be interpreted by the software.

I suggested this structures in No. 7:

Add close_reason table with at least these columns: display_name (text 50), description (text), parent_id (bigserial FK to closure_reason, nullable), is_active (boolean default to TRUE)

Add post_close_subreason table, which allows for the selected sub reasons of the primary close reason to be saved.

Then you forgot this suggestion:

Then we need a member_history table, which is not the audit log for the member table, but a list of events in the user history, which are useful for moderators (e.g. moderator message sent, suspension added, suspension removed, PII accessed, custom annotation). This should include a member_history_type table.

Which should now be renamed to member_annotations or so.

manassehkatz · 25 March 2020 18:57

Assuming the issue is “network = list of communities, but if each community is a separate database, can’t share this table”, I would recommend having that table in each community’s database. It will be a small table, rarely change, and so it can be kept synchronized (manually if need be) as communities are added/dropped. That will make things much easier for anything within a page (e.g., migration options, user profile details, etc.) that needs to know the names of other communities.

luap42 · 25 March 2020 18:58

Nah. I mean the network account id. That’d be a potentially big table (all accounts on any Codidact community of our instance).

manassehkatz · 25 March 2020 19:13

You mean the list of users?

luap42 · 25 March 2020 19:14

I mean a table containing the contents for something similar to SE’s network profile (mine for example: https://stackexchange.com/users/11267045/mee-the-setup-wizard), which contains connections to accounts on all communities is connected to authentification.

manassehkatz · 25 March 2020 20:02

Yeah, that would be trivial in the “one database” style. With multiple databases, probably the way to do it is to have the authentication/“master” server have just the barebones info - email/password/etc. + master profile + list of communities. Then each community has a profile for each user which links back to the authentication server (but not a true FK because it is a different server) and has a setting for “same as master” (which I think is already in the schema but I’m not looking right now). If a user makes an update “for all communities” then that process would push updates to all the other communities as needed. Which is I think basically the way SE does it (at least from an outsider’s perspective).

luap42 · 25 March 2020 21:37

Yeah. IIRC it’s in the schema and that was what I wanted to suggest.

mcalex · 5 April 2020 06:17

Sorry for update slowness - trying times all 'round at the moment.

Updates based on discussion follow

OK, I thought the new field was going to be configurable (to match ‘display_name’) too. One thing: to better identify its purpose, can it be called something like ‘universal_code’? Or if you don’t like universal, ‘internal_code’. Just not ‘_id’ (by our naming convention). I’ve called it universal_code in the interim*

Well we can, but basic normalization rules suggest we shouldn’t as the status can change over time without changing anything to do with the actual post. Also, do we want to store the status change timeline (‘Post closed on 2020-03-25 16:10:071’, 're-opened on 2020-03-26 12:15:178, ‘closed-03-26 12:22:376’ etc)?

They would be stored in post_history, but I understood the audit tables weren’t for general functionality, just audit. Isn’t that how CQRS works? Not having separate post_status means every time there’s a status change, a twenty plus field table (post_history) gets an extra row added instead of a table that, apart from the audit data, only has fields for the post and the status (though it should probably have start end date/times too). Given that post will likely be the most populated/updated table it makes sense to make it as lightweight as possible

Additionally, separating out the status means the ‘is_closed’, ;is_accepted’, ‘is_protected’ fields can all come out of the post table and be used as post_status_types. This then allows for further status types to be added to the data without changing the structure of the post_table further (is_promoted’, ‘is_historically_significant_but_not_a_good_fit’, ‘is_the_subject_of_a_twitter_storm’ - is_locked as per later instruction) and then storing a whole bunch of essentially null value columns (for all the is_whatever falses).

Yes. I think I skimmed that para after assuming those changes depended on the changes i’m discussing. Now added, and the notice table, and also the member_annotations and member_annotations_type tables.

What is being stored in ‘post_close_subreason’? Is it just a description (text) and link to ‘close_reason’ (with lookup values)? Or a user written field with different text each time? I have assumed the former.

I think that is all the remaining issues from the Round 7 discussion.

From the Round 8 discussion, with regard to the network_account_id issue, if I understand it’s a master_profile_id. Can I suggest that as a name?

Tables changed and added have been included in OP. Script to create database has been uploaded to: https://github.com/mcalexster/mcalex_codidact/blob/mcalex/skeleton/Database/Scripts/base/SchemaProposalCurrent_withGenCols.sql

Comments:
a) With the new category_tag_set table, we have: Tag - M:1 - TagSet - 1:M - Category. What happens if one category instance wants the same tag_set as another, except for a couple tags in the set that only one or the other category allows? Should perhaps a Category be allowed multiple TagSets which would result in a ‘general_tags’ tag_set instance being used by both categories and then ‘category_a_specific’ and ‘category_b_specific’ tag_sets which each individual category would use?

b) should the ‘enum’ tables be converted into actual enums (types) in postgresql and would this negate the ‘universal_code’/‘internal_id’ issue?

cellio · 5 April 2020 18:53

Multiples potentially introduce a lot of UI confusion. Remember that tags are messy, emergent, and created by users who might not have thought deeply about a whole context involving multiple categories and general-vs-specific tags. I’d like to stick with: a category uses one tag set; a tag set may be used by more than one category. If we find that’s not expressive enough, we can refine it later.

manassehkatz · 5 April 2020 19:54

Perhaps if someone adds/edits a Tag that is in a Tag Set used by more than one Category, they should get a reminder like:

This Tag is (“will be” for an Add) used by your current Category, Foo, and also by the Category Bar. If these changes (this new tag) is not appropriate for Bar, please [some appropriate moderation notification action to discuss splitting the Tag Set].

mcalex · 7 April 2020 10:33

Following discord discussion on Apr 6 in #backend-development channel, tables post_type, vote_type and privilege will be turned into postgresql enums (types). The data in these tables aren’t likely to change once finalised, and any changes will also involve significant codebase changes (to deal with whatever the post_type, vote_type or privilege entails). Fire away if you disagree/have any objections, ‘like’ if you agree (i guess, is that how the forum works?).

manassehkatz · 7 April 2020 15:21

I agree that these 3 tables (post_type, vote_type, privilege) will have only very rare changes after MVP, and even then they will almost certainly be additions and not deletions due to the need to maintain compatibility with any prior posts, votes and privileges.

However, the names of these items will vary over time. That can be due to internationalization, tweaking of names (particularly privileges) to make them clearer to new users or other factors determined by various instances and/or communities.

Therefore, I recommend that the textual portions - name, description or similar fields - be stored in a text file, retrieved at system startup, rather than compiled into code. If that is handled properly then moving them out of the database is quite reasonable.

cellio · 7 April 2020 17:01

I expect we will occasionally add to these tables/enums after MVP. For example, we know we want some post types beyond question and answer, and if we add something like endorsements or reactions or public votes, that would presumably affect vote_type. That said, I agree that such changes will be additive and need to be accompanied by code changes.

misha130 · 9 April 2020 22:07

So another thing:

Is the last modified by member id not nullable? Should be nullable since its only on update
The created by member Id is also not nullable? Should be nullable at least on the members table because the only thing that creates a user is the registration or an admin of sorts?

cbrumbaugh · 9 April 2020 22:46

I would make them both NOT NULL, last_modified_by is set to the exact same as the created_by when the row is created and so it would never actually be null.

As far as creating members, I would suggest we setup some admin users to handle that and that id go into the created_by column there.

mcalex · 10 April 2020 01:40

I agree with @cbrumbaugh. Creation == most recent modification and it’s pretty common to fill both fields with the same value in an entity constructor.

For admin/system creation, I was thinking of a ‘system’ member. I’m not sure if cbrumbaugh is referring to real members who are designated as admin users, but it’s essentially the same concept. There don’t look to be too many entities other than ‘member’ that will involve system creation in any event.

Alternatives: there could be a placeholder in the field until the creation process is complete and then the new member becomes the ‘created_by_member_id’ - this matches the real world process.

Or, and thinking aloud now, if a new member has created an account in order to respond to content - adding an answer to a question, a comment on another post, even casting an up/down vote - the writer of the original content is designated as the ‘created_by_member_id’. Some reputation in whatever format we end up with could be assigned to those members; a straight up count, or better, some formula that factors in the quality of the response content …

When you asked in the forum, i thought you might have been referring to entities whose creating member has since been deleted - but I understand these will remain, just some indicator (is_deleted) will be used to designate them as former members. SO makes any member links unclickable effectively removing the member from general existence, but that’s not the only solution, eg, grey out the regular profile page, or a curated version just listing their questions / answers / other posts, a note showing why the member wanted account removal (if the member wants to publish that) etc.

tl/dr I think they both should be not nullable. We have options for exceptional cases.

misha130 · 10 April 2020 02:33

Well to create this system admin member I need a member_id. Trying to implement this is why I am here.

EF doesnt support this and the orm will crash all the time basically because of its expectations.

Maybe on this first system insert:

Ill remove constraint
Add member with null
Set created/modified to newly created member id
Put constraint back

mcalex · 10 April 2020 04:27

Implementation suggestions of the two options:

I suggest following the unix way of segregating the first 999 uids for system type members. There could be a ‘Default User Creator’ a 'Created by SO Migrator; etc or maybe just one Member Creator. Maybe just pick a number eg 100 (leaving the first 99 for global system types) and create members with ids from 100. This should be flexible enough to change later eg flipping from a single user creator to multiple types or even to having one god-like system user (or just ‘nobody’). Postgresql allows for identity key values to start at a certain number - I’ll change that to 1000 if you prefer this way.

It looks like EF Core provides support in a different manner
https://docs.microsoft.com/en-us/ef/core/modeling/generated-properties?tabs=data-annotations#value-generated-on-add

or there’s a couple SO suggestions: https://stackoverflow.com/questions/27077461/how-to-get-next-value-of-sql-server-sequence-in-entity-framework

I don’t know which would be best: if there’s likely to be a need for other ‘system’ user types then the first, if not, or it makes sense to setup the new member as the creating member, then the second, I guess.

Doesn’t EF add navigation entitles with ID = 0 (or CLR default value which for int is 0) when it doesn’t know primary key values in any event? That might be a sql server feature.

misha130 · 10 April 2020 09:38

It does exactly that which violates the constraint since there are no members with id 0.