Schema Proposal Round 2

manassehkatz · 2 January 2020 04:11

That is pretty close to what I recommend (and what I do IRL). The only exception is that I have learned the hard way that the better way to build that secondary (in this case history) table is to give it a simple autoincrement Id (just like the primary file) and if you need a revision # then add that separately and make postId + revision the secondary key. That avoids a bunch of problems with locking, collisions, etc. But in many cases “revision #” doesn’t really matter, only “the sequence of the revisions” - which can be determined from the sequence of the Ids or from the timestamps.

cellio · 2 January 2020 04:19

Oh, that makes sense! I was hung up on “need a number for sequencing”, but really you just need them to be in order; gaps don’t matter. So autoincrement on the whole table does that for us – win.

manassehkatz · 2 January 2020 04:37

The key (pun intended) is that nobody really cares that something was “the 4th edit of 7”. They just need to always know that the edits will be in order. In fact, if some edits are hidden because they were spam/revenge/etc. and deleted by a Moderator, then the sequence might change from:

1 - a
2 - b
3 - junk
4 - c

to

1 - a
2 - b
3 - c

and with the numbering done on-the-fly and only the sequence stored in the databse, nothing needs to be done in that situation except “delete 3”. Everything else flows automatically.

Actually, that raises an interesting twist: This means there can be situations where the history tables themselves need separate delete flags - i.e., separate from the delete flag transferred from the primary table!

JackDouglas · 2 January 2020 10:14

I’m sure there will be some back and forth about the details of how much to normalize and do on. In the end it might not matter that much what you settle on — but if there is anything you can do now that makes later changes easier, I’d make that top priority, because you won’t think of everything in advance and anyway requirements will change.

In my view, one of the problems with SE is the glacial pace of change — not radical feature changes, but simple bug fixes. I’m sure the reason is that changing anything risks breaking other things (maybe not quite this bad but the same underlying problem).

So I would encourage you to have a broader discussion about architecture now (if you haven’t already). Can you use schemas (or some other method) to separate the system up into logical areas with fewer dependencies? Is it worth having more code duplication to allow more freedom later?

That kind of question can impact how you design your schema. Putting different post types into a single table means less code now but may mean it’s harder to make changes to (say) just ‘answers’ later. I’m not trying to make a big deal of that particular aspect, just use it as an example — On TopAnswers we’ve been toying with the idea of an optional separate tl;dr field in answers that displays on the question summaries which (if present) would show on the question list instead of the first line of the answer we have presently:

Screenshot 2020-01-02 at 10.07.28

We probably won’t end up doing this, but my point is that there is no way we’d have even thought of if until after development had started. There are and will be lots of other things like this that pop up. We want to do everything possible now to avoid getting bogged down later on, which will take away all the fun, for developers and users alike.

cwellsx · 2 January 2020 13:50

Anything’s better than nothing

Data that’s not in second normal form seems inherently “crufty” to me – and tolerable in a toy project?

I’d like a schema to model relationships – it’s a relational database, eh – just like I like to use a typed language and compiler … to limit what can ever happen at runtime.

The schema that’s “necessary and sufficient” is I imagine one which can only contain valid data – and that’s the schema that’s maintainable, can be refactored if requirements change, is my theory. I gather that the nightmare of a DB admin is a badly-normalised DB, which contains invalid data, which must be “scrubbed” a.k.a. cleansed.

This (“correctness”, possibly aka “bondage and discipline”) is especially important when more than one application accesses the data (which isn’t the case at the moment).

If relationships aren’t evident in the schema then where are they evident – it isn’t like the application code is easier to understand than the schema.

The requirement to “make later changes easier” is IMO argument for early strictness – like, you start with a typed language.

Denormalization is – so I’ve read – something you do deliberately (and maybe, traditionally, in a “data warehouse” and not on the “operational database”). And “denormalised” not the same as “unnormalised” data.

This is just theory though. “How bad could it be?” (xkcd: goto)

cellio · 2 January 2020 15:56

To make sure I understand you: are you suggesting that we have, for example, a table for questions and one for answers and one for blog posts and so on, as opposed to a general posts table? Given that one can create a view that joins them all together (right?) if needed, that makes intuitive sense to me – smaller chunk sizes and combining them as needed. Is that what you’re suggesting?

JackDouglas · 2 January 2020 16:41

I do suggest that (for questions and answers, not blog posts) but mainly I’m suggesting setting explicit development priorities rather than trying to decide what the ‘right’ decision is in each case absent those. So many projects get bogged down by ‘technical debt’ over time.

I think the vision here is to create something that will last (growing and adapting) for decades — should that make a difference?

What is core and what is peripheral? What can be developed as libraries so that the core code is kept simpler? Can the database be constructed so that changes down the line are easier — for example by treating it as a separate project that cannot be directly accessed by the application, all access passing through an API that handles transactions rather than updates.

To explain the API question, I mean: is ‘inserting an answer into the ‘answer’/‘post’ table’ something the application can do? I don’t think so — instead the application should have access to a transactional API that exposes an ‘add answer’ function. There are lots of things adding an answer might trigger — notifications, bumping a question etc — and these should all succeed or fail as a single transaction, which is one benefit, but the main benefit is that the database can change independently of the api and the api can change independently of the database.

cwellsx · 2 January 2020 17:28

If the “application” is a web server then there may be more than one application instance?

And so that – transactions – might be implemented using “stored procedures”?

codymullins · 2 January 2020 18:09

What is the purpose of the Stack Exchange tracking columns? This seems odd to belong here.

I’m trying to get up to speed and read through all these posts, so forgive me if this was already discussed - have you considered doing a brief exercise in domain mapping in order to plan ahead thoroughly? It would be a good idea to start with a nicely decoupled domain model with resilient services from the beginning or at least write the core services in a way it’s already decoupled.

It seems from the questions being posed that is what is needed - for example, what IS a Question, a Post, and a Vote? Does a FAQ/Community Wiki involve more than a textfield, potentially? How should Moderator/Admin views be handled?

cbrumbaugh · 2 January 2020 18:15

After the SE content is migrated, then users who come over will be able to claim their content via authenticating their SE accounts. Basically saying I wrote that on SE, so mark me as the author here too please.

codymullins · 2 January 2020 18:27

Based on quick reading through these topics, will we also be importing things from other Q&A sites such as TopAnswers? I would recommend considering splitting this out into a separate table from the user profile. I will think more on this and get back with you.

cellio · 2 January 2020 19:22

This is a good point. We’ve talked about wanting to allow open movement between TopAnswers and Codidact, and not just imports from SE, so we’ll need a way to represent that. We don’t need to implement it now (import from TopAnswers is not MVP), but we should know we’ll have a place to put it later.

JackDouglas · 2 January 2020 20:26

More than one application, yes — a mobile app for example. Or you might mean more than one application instance in the sense of having multiple application servers with one database and that is possible too, though it may not be necessary.

Yes I’d use functions for transactions in the API. There might be other ways of doing it with triggers or rules or whatever, but functions seem like the best option to me.

misha130 · 2 January 2020 20:38

I’d actually like to point that I agree that questions & answers should be separated as these 2 entities can grow into different directions and should not be rooted together even though they are somewhat similar.