DB Schema Round 7

mcalex · 18 February 2020 18:22

Was there a Round 5 of the Schema Proposal? I don’t see it in the thread list.

mcalex · 22 February 2020 15:19

OK, so I have been going through the Schema proposals with pgModeler to get some practice in, and have finished up to Round 7. Diagram below (and dbm file at https://github.com/mcalexster/mcalex_codidact/tree/mcalex/skeleton). Please note, the joins are a bit messy, but with every entity containing two relations back to the member table, this is somewhat unavoidable. Anybody feel free to give it a clean-up.

My notes from ‘Schema Proposal Round 7’ work

Problems:

General:

Need cascade rules for ON DELETE, ON UPDATE foreign keys
text column types don’t have a length param (text(xx)). All text fields are just ‘text’ in the diagram for this reason. Edit: I have noticed VARCHAR column specs starting to appear that do take a length param. Has this been seen? “Different from other database systems, in PostgreSQL, there is no performance difference among three character types. In most situation, you should use text or varchar, and varchar(n) if you want PostgreSQL to check for the length limit.” (from the PGtutorial page: https://www.postgresqltutorial.com/postgresql-char-varchar-text/)
Not all non-nullable columns have been indicated as such. I’ve added the spec for some obvious ones, but not all
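
To make the cascade and NOT NULL points concrete, here is a minimal sketch of what explicit rules could look like. It uses an in-memory SQLite database purely as a runnable stand-in for PostgreSQL; the column names (created_by_member_id and so on) and the choice of ON UPDATE CASCADE / ON DELETE RESTRICT are assumptions for illustration, not decisions taken in the proposal.

```python
import sqlite3

# SQLite (in-memory) stands in for PostgreSQL so the sketch is runnable.
# Table/column names and the chosen cascade rules are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
    CREATE TABLE member (
        member_id    INTEGER PRIMARY KEY,
        display_name TEXT NOT NULL
    );
    CREATE TABLE post (
        post_id              INTEGER PRIMARY KEY,
        body                 TEXT NOT NULL,
        created_by_member_id INTEGER NOT NULL
            REFERENCES member (member_id)
            ON UPDATE CASCADE
            ON DELETE RESTRICT
    );
""")

# Commit the seed rows so later rollbacks do not discard them.
with conn:
    conn.execute("INSERT INTO member (member_id, display_name) VALUES (1, 'alice')")
    conn.execute(
        "INSERT INTO post (post_id, body, created_by_member_id) VALUES (10, 'hello', 1)"
    )


def delete_member(member_id):
    """Return True if the delete went through, False if an FK rule blocked it."""
    try:
        with conn:
            conn.execute("DELETE FROM member WHERE member_id = ?", (member_id,))
        return True
    except sqlite3.IntegrityError:
        return False


def insert_post_without_body():
    """Return True if a NULL body was accepted, False if NOT NULL rejected it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO post (post_id, body, created_by_member_id) "
                "VALUES (11, NULL, 1)"
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(delete_member(1))            # False: a post still references member 1
print(insert_post_without_body())  # False: body is declared NOT NULL
```

Whether a deleted member should block, cascade to, or be nulled out of their posts is exactly the kind of rule that still needs deciding per foreign key.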

Other questions

category (et al) table - *_url or *_uri for website fields? Does it still matter?
tag table - synonym_tag_id - does this assume only one synonym for each tag, not a possible bunch or have i misunderstood tags?
- Supplementary: is the synonym tag (row) mandated to have its own synonym_tag_id point back to the tag synonymous to it?
Do Comments want Views (like Posts)?
- Supplementary - comment IS-A post? - No, requires a parent. Never mind.
Tables to add ‘delete’ fields (is_deleted, deleted_at, deleted_by_member_id) to. Are they all identified in the thread? I haven’t done this bit yet.
How important are relationship names (not foreign key names) in the diagram? There’s a lot of *template* in there :-}
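
On the tag synonym question above, one possible reading (an assumption, not something the proposal confirms) is that synonym_tag_id lives on the synonym row and points at the canonical tag. Under that reading a canonical tag can have any number of synonyms, and it should not point back, since that would create a cycle. A small sketch with made-up tag names:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Tag:
    tag_id: int
    name: str
    # Assumed meaning: None marks a canonical tag; otherwise this is the id
    # of the tag that this row is a synonym of.
    synonym_tag_id: Optional[int] = None


def canonical_tag(tags: Dict[int, Tag], tag_id: int) -> Tag:
    """Follow synonym_tag_id until reaching a tag with no synonym pointer."""
    seen = set()
    tag = tags[tag_id]
    while tag.synonym_tag_id is not None:
        if tag.tag_id in seen:
            # Happens if a canonical tag points back at its own synonym.
            raise ValueError(f"synonym cycle at tag {tag.tag_id}")
        seen.add(tag.tag_id)
        tag = tags[tag.synonym_tag_id]
    return tag


# Hypothetical data: two synonyms sharing one canonical tag.
tags = {
    t.tag_id: t
    for t in [
        Tag(1, "javascript"),
        Tag(2, "js", synonym_tag_id=1),
        Tag(3, "ecmascript", synonym_tag_id=1),
    ]
}

print(canonical_tag(tags, 2).name)  # javascript
print(canonical_tag(tags, 3).name)  # javascript
```

If the intent is instead that the canonical row stores its (single) synonym, then the column really would limit each tag to one synonym, and a separate link table would be needed for more.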

Some suggested additions and changes to the DB Naming convention that I can add to that thread but thought might as well be brought up here first:

Primary Key Fields that aren’t TABLE_id are <optional>_<explanatory>_<reason>_TABLE_id (See synonym_tag_id, parent_post_id)

Date fields are really TimeStamp fields unless otherwise required. Also, I thought I read somewhere ‘Date’ fields were to end in _at. There seems to be a mix of (mainly) _date, with some _at and some date columns that have no type identifier. I have gone for _at (while hoping my memory is correct) and think I have got them all consistent.

Alias fields be mandated to be used in column and constraint creation to assist non-db people reading the diagrams with pgModeler (see ‘Also’) and potentially for documentation assistance.

Also
I have set the notation under tables (in the threads) to the table’s comment field. I have added an alias (used by pgModeler to show a user-friendly name), generally with a human-readable version of the column name and its type: text, timestamp, count, tally and status, for text, date or timestamp (see above), counting integer fields (eg ‘upvotes’), tallied/calculated integer/decimal fields (‘score’) and bool fields respectively.

For completeness, I have models of the other (seen) Proposal Rounds on the github repo. Is it worth attaching their diagrams to their relevant ‘Schema Proposal’ threads, or will that just pollute the forum with obsolete info?

Contributing, hoping this is useful.

luap42 · 22 February 2020 17:09

Please add a column “internal_id (text 50)”. We will have different databases for different communities, and this means that it’s easier to have such a column than to guarantee consistency of the ids IMHO. It also makes the SQL query in the code more natural to read, which is IMO good.

We don’t need this table. At least not for MVP and the foreseeable time after that. A post will have a finite amount of status (closed: yes/no, deleted: yes/no, locked: yes/no). The extra table is of no use IMO and adds only more queries. Everything post-related should be added to the post table. This means:

Add close_date and close_reason_id(fk) columns to post
Add close_reason table with at least these columns: display_name (text 50), description (text), parent_id (bigserial FK to close_reason, nullable), is_active (boolean default to TRUE)
Add post_close_subreason table, which allows for the selected sub reasons of the primary close reason to be saved.
Add notice_id column to post
Add notice table with these columns: display_name (text), body (text), is_active (boolean default to TRUE)
Add is_locked and lock_date column to post.
Add delete_date column to post.
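
As a rough sketch of the list above (not the agreed schema), the columns could be laid out as follows. In-memory SQLite is used as a runnable stand-in for PostgreSQL, so types are simplified; the id and body columns, the NOT NULL choices and the post_status helper are assumptions added for illustration, and post_close_subreason is left out.

```python
import sqlite3

# SQLite stand-in for PostgreSQL; types and NOT NULL choices are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE close_reason (
        id           INTEGER PRIMARY KEY,
        display_name TEXT NOT NULL,
        description  TEXT,
        parent_id    INTEGER REFERENCES close_reason (id),  -- nullable
        is_active    BOOLEAN NOT NULL DEFAULT 1
    );
    CREATE TABLE notice (
        id           INTEGER PRIMARY KEY,
        display_name TEXT NOT NULL,
        body         TEXT,
        is_active    BOOLEAN NOT NULL DEFAULT 1
    );
    CREATE TABLE post (
        id              INTEGER PRIMARY KEY,
        body            TEXT NOT NULL,
        close_date      TIMESTAMP,
        close_reason_id INTEGER REFERENCES close_reason (id),
        notice_id       INTEGER REFERENCES notice (id),
        is_locked       BOOLEAN NOT NULL DEFAULT 0,
        lock_date       TIMESTAMP,
        delete_date     TIMESTAMP
    );
""")

with conn:
    conn.execute("INSERT INTO close_reason (id, display_name) VALUES (1, 'Duplicate')")
    conn.execute("INSERT INTO post (id, body) VALUES (1, 'open post')")
    conn.execute(
        "INSERT INTO post (id, body, close_date, close_reason_id) "
        "VALUES (2, 'closed post', '2020-02-22 17:09:00', 1)"
    )


def post_status(post_id):
    """Derive closed/deleted/locked from the post row alone, no status table."""
    row = conn.execute(
        "SELECT close_date IS NOT NULL, delete_date IS NOT NULL, is_locked "
        "FROM post WHERE id = ?",
        (post_id,),
    ).fetchone()
    return {"closed": bool(row[0]), "deleted": bool(row[1]), "locked": bool(row[2])}


print(post_status(1))
print(post_status(2))
```

The point of the sketch is only that each yes/no status can be read off the post row in one query.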

This’ll need a lot more information, but at least:

duplicate_post_id referencing a post (of which this is a duplicate)
close_reason_id referencing close_reason

Then we need a member_history table, which is not the audit log for the member table, but a list of events in the user history, which are useful for moderators (e.g. moderator message sent, suspension added, suspension removed, PII accessed, custom annotation). This should include a member_history_type table.

Your draft is also missing a tag_group table, which should be used instead of a concept of “parent tags” and instead be added to categories (category.tag_group_id).

manassehkatz · 23 February 2020 00:28

Please clarify what this field is. With extremely rare exceptions, ids for foreign keys should be integers and not text strings. Don’t worry about SQL-query code being natural to read - speed is far more important and it won’t be “natural” unless every id is replaced with a string.

celtschk · 23 February 2020 07:58

While only luap42 can tell for sure what he meant, I would expect it to be a globally unique identifier of the post, which most certainly is not a foreign key in the SQL sense, as it is meant to be consistent between databases.

In particular, that ID would be preserved when moving the post to another site (which due to the decisions made earlier means moving it into a separate database). I think this would be necessary to forward requests to moved posts to the new site. Without this field, we would no longer be able to identify it as the same post.

The alternative would be a mechanism that ensures the primary post_id is unique across databases, but I fear that would be orders of magnitude more complex.

cwellsx · 23 February 2020 08:53

Don’t you use the UUID type for fields which have that purpose (i.e. “to be globally unique even when generated on separate database instances”)?

luap42 · 23 February 2020 09:05

What @celtschk said is mostly correct, although this is meant only for the enum-like tables. So we can use the following query:

SELECT * FROM close_reason WHERE internal_id = 'duplicate'

Instead of this one:

SELECT * FROM close_reason WHERE id = 42

I am not sure how UUIDs are helpful for the legibility part. And I don’t want ids that are unique network-wide; quite the opposite, I want something which is always the same, regardless of DB creation or changes.

For example we might add a close reason, which will get a high id for all existing communities, but a lower one from the db creation script.
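
A small sketch of that situation, assuming a close_reason table with an auto-assigned id and a unique internal_id (the UNIQUE constraint and the reason names other than “duplicate” are assumptions). Two in-memory SQLite databases stand in for two community databases:

```python
import sqlite3


def make_db(internal_ids):
    """Create a community database whose close reasons are inserted in order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE close_reason ("
        "id INTEGER PRIMARY KEY, internal_id TEXT NOT NULL UNIQUE)"
    )
    with conn:
        conn.executemany(
            "INSERT INTO close_reason (internal_id) VALUES (?)",
            [(name,) for name in internal_ids],
        )
    return conn


def reason_id(conn, internal_id):
    """Look up the numeric id for a stable internal_id in one database."""
    row = conn.execute(
        "SELECT id FROM close_reason WHERE internal_id = ?", (internal_id,)
    ).fetchone()
    return row[0]


# Existing community: 'duplicate' was added after the other reasons.
old_community = make_db(["off_topic", "unclear", "duplicate"])
# New community: created from an updated script that lists 'duplicate' first.
new_community = make_db(["duplicate", "off_topic", "unclear"])

print(reason_id(old_community, "duplicate"))  # 3
print(reason_id(new_community, "duplicate"))  # 1
```

The numeric ids disagree between the two databases, while the internal_id lookup works unchanged in both.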

manassehkatz · 23 February 2020 15:10

I don’t know the “right” answer with respect to C#, etc. But as a general rule, what I have done in Python & PHP is something like this pseudocode:

x = query("SELECT * FROM close_reason WHERE internal_id = 'duplicate'")
duplicate_reason_id = x['id']
...
duplicates = query(f"SELECT * FROM posts WHERE (question = TRUE) AND (close_id = {duplicate_reason_id})")

In other words, I agree about using actual text to find the key enum-style records. But don’t use that text to find the related records. To put it another way, do not do:

duplicates = query("SELECT * FROM posts JOIN close_reason ON posts.close_id = close_reason.id WHERE (question = TRUE) AND (close_reason.internal_id = 'duplicate')")

luap42 · 23 February 2020 15:42

You mean only use it to get the value once and then cache it? I am not sure how well it’ll work for settings, but for everything else it seems fine.

manassehkatz · 23 February 2020 15:49

Exactly.

Not that the extra JOIN is a big deal. But it complicates the queries and even if the application has to run those tiny little queries to get the IDs at the start of every process (that depends on the language/system - I don’t know how a C# web process works), those little queries would themselves be memory cached in PostgreSQL and run super-fast.
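
A runnable sketch of that look-up-once-then-cache approach, using Python’s functools.lru_cache as the in-process cache and in-memory SQLite in place of PostgreSQL. The table layout follows the pseudocode earlier in the thread (posts.question, posts.close_id); the sample rows and helper names are assumptions.

```python
import sqlite3
from functools import lru_cache

# SQLite stand-in with a handful of made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE close_reason (
        id          INTEGER PRIMARY KEY,
        internal_id TEXT NOT NULL UNIQUE
    );
    CREATE TABLE posts (
        id       INTEGER PRIMARY KEY,
        question BOOLEAN NOT NULL,
        close_id INTEGER REFERENCES close_reason (id)
    );
    INSERT INTO close_reason (id, internal_id) VALUES (7, 'off_topic'), (42, 'duplicate');
    INSERT INTO posts (id, question, close_id)
        VALUES (1, 1, 42), (2, 1, NULL), (3, 0, 42), (4, 1, 7);
""")


@lru_cache(maxsize=None)
def close_reason_id(internal_id):
    """Resolve the text key to its integer id; cached after the first call."""
    row = conn.execute(
        "SELECT id FROM close_reason WHERE internal_id = ?", (internal_id,)
    ).fetchone()
    if row is None:
        raise KeyError(internal_id)
    return row[0]


def closed_question_ids(internal_id):
    """Find questions closed for a reason, filtering on the integer id only."""
    rows = conn.execute(
        "SELECT id FROM posts WHERE question = 1 AND close_id = ? ORDER BY id",
        (close_reason_id(internal_id),),
    )
    return [r[0] for r in rows]


print(closed_question_ids("duplicate"))  # [1]
print(closed_question_ids("duplicate"))  # [1], id lookup served from the cache
```

The cache lives per process, so a long-running worker pays for the lookup once, while a process-per-request setup would repeat the tiny query each time.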

mcalex · 24 February 2020 13:38

The EntityFramework version:

// Assumes: using System.Linq; using Microsoft.EntityFrameworkCore;
using (CodidactEntities context = new CodidactEntities())
{
  IQueryable<Post> query =
      context.Posts
             .Where(p => p.Question)        // <- assumes bool 'question' field
             .Include(p => p.CloseReason)   // <- assumes a single CloseReason navigation property
             .Where(p => p.CloseReason.DisplayName == "duplicate");

  // 'query' can then be converted ToList(); or have its
  // FirstOrDefault() retrieved as an individual Post entity
  // depending on requirements
}  // end using.  CodidactEntities will now be properly disposed.

https://www.entityframeworktutorial.net/efcore/querying-in-ef-core.aspx

If I’m following correctly, I don’t think that calling it internal_id is different (at database level) from calling it display_name. As long as the thing is called the same name in the different databases (ie, it’s “duplicate” everywhere) the above should work across databases (assuming the display_name column is in a close_reason table linked by FK to a post table).

Imho, calling it display_name as per naming spec reads more naturally than internal_id. Also, I expect db types would find it unnatural to call a column something_id that wasn’t a table’s primary or foreign key.

luap42 · 24 February 2020 15:47

The problem with display_name is that I’d expect it to be … displayed. In the UI. This leads to problems with internationalization and customization. Hence I suggest using something other than the display name (which is IMO the label that is shown in the UI).

mcalex · 24 February 2020 15:58

Ahh, ok - I assumed it was that field. Yes, agreed. I generally use tablename_code (so ‘close_reason_code’) for those things - this goes with tablename_type, which is my version of display_name (ie, the part of the entity seen in the UI).
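
To illustrate the split being agreed on here, a minimal sketch: code looks things up by a stable close_reason_code, while the text shown in the UI is resolved separately per locale. The label table, locales and fallback rule are all hypothetical and not part of any proposal in this thread.

```python
# Hypothetical per-locale labels keyed by (close_reason_code, locale).
LABELS = {
    ("duplicate", "en"): "Duplicate",
    ("duplicate", "de"): "Duplikat",
    ("off_topic", "en"): "Off-topic",
}


def label_for(code, locale, default_locale="en"):
    """Return the UI label for a stable code, falling back to the default locale."""
    if (code, locale) in LABELS:
        return LABELS[(code, locale)]
    return LABELS[(code, default_locale)]


print(label_for("duplicate", "de"))  # Duplikat
print(label_for("off_topic", "de"))  # Off-topic (no German label, falls back)
```

A community can then rename or translate the label freely without touching any query that matches on the code.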