DB Schema Round 7

Was there a Round 5 of the Schema Proposal? I don’t see it in the thread list.

OK, so I have been going through the Schema proposals with pgModeler to get some practice in and am finished up to Round 7. Diagram below (and dbm file at https://github.com/mcalexster/mcalex_codidact/tree/mcalex/skeleton). Please note, the joins are a bit messy, but with every entity containing two relations back to the member table, this is somewhat unavoidable. Anybody feel free to give it a clean up :slight_smile:

My notes from ‘Schema Proposal Round 7’ work

Problems:

General:

  • Need cascade rules for ON DELETE, ON UPDATE foreign keys

  • text column types don’t have a length param (text(xx)). All text fields are just ‘text’ in the diagram for this reason. Edit: have noticed VARCHAR column specs are starting to appear that do take a length param. Has this been seen? Different from other database systems, in PostgreSQL, there is no performance difference among three character types. In most situation, you should use text or varchar , and varchar(n) if you want PostgreSQL to check for the length limit. (from PGtutorial page: https://www.postgresqltutorial.com/postgresql-char-varchar-text/)

  • Not all non-nullable columns have been indicated as such. I’ve added the spec for some obvious ones, but not all

Categories

  • category table - from the thread, what are ‘calculations’ - do we mean calculation_result (bigint), calculation_status (bool) or something else? (possibly the start of the trust_level stuff in the next point.)

  • category table - is ‘tl’ short for ‘trust_level’? is there a reason it wasn’t written as such? i wrote it as such

  • category table - display_name UNIQUE? I didn’t make it so.

Members

  • member table - network_account_id column. Where is the network_account table?

Tags

  • tag table - body column - defaults to NULL. Additionally, column should be NOT NULL

Enhancements/Suggestions

  • posts table - Should this have a field or link for attachments: files (eg blender), pics, (fiddle) embeds etc?

  • member table - ‘temporary’ suspensions. All suspensions are temporary. If they’re permanent, they’re expulsions or cancellations, no? Is there a need for the word ‘temporary’? I didn’t see the arguments for adding this in - it was just is_suspended and suspension_end_date in Round 4, I don’t think there was a round 5 and it looks like the privilege table made it in twice instead of the member table in Round 6.

Differences to specs in the proposal thread

  • Instead of BigSerial, I have used the ‘Identity (Always)’ type column (I assume we will be using postgresql >= 10.0) which takes one of PG’s int types (BigInt, Int and SmallInt). All keys are BigInt, but I suggest the enum types (usually *_type) tables only need SmallInt

  • category table - slight change to participation field name

  • social_media_type table: made display_name unique and added account_url (or uri - see below) column. Assumes all s_m_sites have a specific account member webpage/link (they do, don’t they?).

  • post_type table: I have assumed display_name is UNIQUE

  • status table: i changed the name of the ‘status’ table to ‘post_status_type’. This reduces the genericness,err genericity, umm non-specificness of ‘status’ (I’m anticipating ‘is_moderator’ and ‘is_administrator’ might potentially land in a member_status_type table down the track). Will potentially do something similar to ‘category’ table for the same reasons by Round 8 dependent upon reaction.

Other questions

  • category (et al) table - *_url or *_uri for website fields? Does it still matter?

  • tag table - synonym_tag_id - does this assume only one synonym for each tag, not a possible bunch or have i misunderstood tags?

    • Supplementary: is the synonym tag (row) mandated to have it’s own synonym_tag_id point back to the tag synonymous to it?
  • Do Comments want Views (like Posts).

    • Supplementary - comment IS-A post? - No, requires a parent. Never mind.
  • Tables to add ‘delete’ fields (is_deleted, deleted_at, deleted_by_member_id) to. Are they all identified in the thread? I haven’t done this bit yet.

  • How important are relationship names (not foreign key names) in the diagram? There’s a lot of *template* in there :-}


Some suggested additions and changes to the DB Naming convention that I can add to that thread but thought might as well be brought up here first:

Primary Key Fields that aren’t TABLE_id are <optional>_<explanatory>_<reason>_TABLE_id (See synonym_tag_id, parent_post_id)

Date fields are really TimeStamp fields unless otherwise required. Also, I thought I read somewhere ‘Date’ fields were to end in _at. There seems to be a mix of (mainly) _date, with some _at and some date columns that have no type identifier. I have gone for _at (while hoping my memory is correct) and think I have got them all consistent.

Alias fields be mandated to be used in column and constraint creation to assist non-db people reading the diagrams with pgModeler (see ‘Also’) and potentially for documentation assistance.

Also
I have set the notation under tables (in the threads) to the table’s comment field. I have added an alias (used by pgModeler to show a user-friendly name) generally with a human readable version of the column name and its type (text, timestamp, count, tally and status - for text, date or timestamp (see above), counting integer felds (eg ‘upvotes’), tallied/calculated integer/decimal fields (‘score’) and bool fields respectively.

For completeness, I have models of the other (seen) Proposal Rounds on the github repo. Is it worth attaching their diagrams to their relevant ‘Schema Proposal’ threads, or will that just pollute the forum with obsolete info?



Contribuing, hoping this is useful.

3 Likes

Please add a column “internal_id (text 50)”. We will have different databases for different communities and this means, that it’s easier to have such a column than guaranteeing consistency with the ids IMHO. Also makes the SQL-query in the code more natural to read, which is IMO good.

We don’t need this table. At least not for MVP and the foreseeable time after that. A post will have a finite amount of status (closed: yes/no, deleted: yes/no, locked: yes/no). The extra table is of no use IMO and adds only more queries. Everything post-related should be added to the post table. This means:

  • Add close_date and close_reason_id(fk) columns to post
  • Add close_reason table with at least these columns: display_name (text 50), description (text), parent_id (bigserial FK to closure_reason, nullable), is_active (boolean default to TRUE)
  • Add post_close_subreason table, which allows for the selected sub reasons of the primary close reason to be saved.
  • Add notice_id column to post
  • Add notice table with these columns: display_name (text), body (text), is_active (boolean default to TRUE)
  • Add is_locked and lock_date column to post.
  • Add delete_date column to post.

This’ll need a lot more information, but at least:

  • duplicate_post_id referencing a post (of which this is a duplicate)
  • close_reason_id referencing close_reason

Then we need a member_history table, which is not the audit log for the member table, but a list of events in the user history, which are useful for moderators (e.g. moderator message sent, suspension added, suspension removed, PII accessed, custom annotation). This should include a member_history_type table.

Your draft is also missing a tag_group table, which should be used instead of a concept of “parent tags” and instead be added to categories (category.tag_group_id).

Please clarify what this field is. With extremely rare exceptions, ids for foreign keys should be integers and not text strings. Don’t worry about SQL-query code being natural to read - speed is far more important and it won’t be “natural” unless every id is replaced with a string.

1 Like

While only luap42 can tell for sure what he meant, I would expect it to be a globally unique identifier of the post. Which most certainly is not a foreign key in th sense of SQL, as it is meant to be consistent between databases.

In particular, that ID would be preserved when moving the post to another site (which due to the decisions made earlier means moving it into a separate database). I think this would be necessary to forward requests to moved posts to the new site. Without this field, we would no longer be able to identify it as the same post.

The alternative would be a mechanism that ensures the primary post_id is unique across databases, but I fear that would be orders of magnitude more complex.

Don’t you use the UUID type for fields which have that purpose (i.e. “to be globally unique even when generated on separate database instances”)?

What @celtschk said is mostly correct, although this is meant only for the enum-like tables. So we can use the following query:

SELECT * FROM close_reason WHERE internal_id="duplicate" 

Instead of this one:

SELECT * FROM close_reason WHERE id = 42

I am not sure, how UUIDs are helpful for the legibility part. And I don’t want to have unique ids for that network-wide, but quite the opposite I want something which is always the same, regardless of DB creation or changes.

For example we might add a close reason, which will get a high id for all existing communities, but a lower one from the db creation script.

I don’t know the “right” answer with respect to C#, etc. But as a general rule, what I have done in Python & PHP is something like this pseudocode:

x = query('SELECT * FROM close_reason WHERE internal-id="duplicate"')
duplicate_reason_id = x['id']
...
duplicates = query('SELECT * FROM posts WHERE (question = TRUE) and (close_id = {duplicate_reason_id}')

In other words, I agree about using actual text to find the key enum-style records. But don’t use that text to find the related records. To put it another way, do not do:

duplicates = query('SELECT * FROM posts JOIN close_reason ON posts.close_id = close_reason.id WHERE (question = TRUE) and (close_reason.internal_id = "duplicate"')

You mean only use it to get the value once and then cache it? I am not sure how well it’ll work for settings, but for everything else it seems fine.

1 Like

Exactly.

Not that the extra JOIN is a big deal. But it complicates the queries and even if the application has to run those tiny little queries to get the IDs at the start of every process (that depends on the language/system - I don’t know how a C# web process works), those little queries would themselves be memory cached in PostgreSQL and run super-fast.

The EntityFramework version:

using (CodidactEntities context = new CodidactEntities())
{
  IQueryable query = 
      context.Posts
             .Where((p => p.Question)   // <- assumes bool 'question' field
             .Include(p => p.CloseReasons)
             .Where(p => p.CloseReason.DisplayName == "duplicate");

  // 'query' can then be converted ToList(); or have its 
  // FirstOrDefault() retrieved as an individual Post entity
  // depending on requirements
}  // end using.  CodidactEntities will now be properly disposed.

https://www.entityframeworktutorial.net/efcore/querying-in-ef-core.aspx

If I’m following correctly, I don’t think that calling it internal_id is different (at database level) from calling it display_name. As long as the thing is called the same name in the different databases (ie, it’s “duplicate” everywhere) the above should work across databases (assuming the display_name column is in a close_reason table linked by FK to a post table).

Imho, calling it dsiplay_name as per naming spec reads more naturally than internal_id. Also, I expect db types would find it unnatural to call a column something_id that wasn’t a table’s primary or foreign key.

1 Like

The problem with display_name is that I’d expect it to be … displayed. In the UI. This leads to problems with internalization and customization. Hence I suggest something different than the display name (which is IMO the label that is shown in the UI) to be used.

1 Like

Ahh, ok - i assumed it was that field. Yes, agreed. I generally use tablename_code (so ‘close_reason_code’) for those things - this goes with tablename_type which is my version of dsiplay_name, (ie the seen in the UI part of the entity).