IDK, I threw this together last night and then updated it this morning with the feedback. Maybe enums are better and we should use those. I think that’s worth its own topic where we can discuss the pros and cons.
Also shouldn’t most of these tables have the same basic 3 columns that are consistent on all tables:
Like row creation date, row update date & a soft deletion column?
Yes, see the convention here Proposal - DB Naming Conventions I didn’t write those columns out here because they are assumed to be on the table.
Well in that case you should change the name of this column because its already the one defined in the name conventions you linked.
- Linked Comments
- Can be Closed/Deleted/etc.
- Need an Edit History and Edit Reviews.
- Have Votes Up/Down
- Can be Searched together
There are some differences:
- Title is not in Answers - but is in all the others.
- Tags are not in Answers - but are in all the others.
- Answers need a parent Question - the others do not.
- Conversely - Questions can have Answers, the others can not.
But overall, the differences are relatively minor - and having a few extra fields hanging on with Answers that will never get used are a small price to pay for much simpler Searching and other common code.
I don’t see why “searching” is much simpler:
- If there were separate Posts and Topics, and the UI were like Discourse, then the main page selects topics (not posts).
- Or if the UI were like SE’s, then the main page selects topics joined to one post (either the topic’s first post, or its “active” i.e. most-recently-edited post).
- If you’re on a page like this one (i.e. showing the details of one selected topic) then the back end selects the topic plus all posts associated with that topic.
Still, have it your way and I’m glad you understood what I was saying at least – or do you?
Having extra stuff in posts (e.g. Title and IsMeta in answer records) means the database doesn’t (even) satisfy second normal form.
I’m all for breaking rules, am I not, but if you’re sure you want to do that this early I imagined more reasoning than this,
I thought about splitting into a Answers table and and QuestionsTable but doing it in the same table lets comments have a PostsId instead of a QuestionsId and a AnswersId. Meta posts are denoted by the IsMeta column. Still unsure on community wiki/FAQ posts.
Maybe we’ve just got different ideas about DB design – mine’s too purist or something.
Given What kind of database should we use? I thought the schema wouldn’t attempt to support n-level trees, yet this one does.
As you say, maybe that’s right though.
What I’m referring to is more like searching for “any question or answer in a community that has 'xyz” - like the Search field at the top of SE pages. That searches through Questions and Answers - and arguably would search through Blogs and Canonical posts if they existed. Keeping them all in one table makes that easy. And searching for the Home page - just Questions - adds one field to the search (IsQuestion = True or PostType = “Q” or whatever we settle on). And searching for everything in one specific Question is easy “(Id = x) or (ParentId = x)” or something like that. With separate tables it gets messier.
As far as the rest - I am not a DB purist or an expert in RDBMS theory. But I’ve built plenty of DBs and learned a lot the hard way (this stuff didn’t exist in the early 80s when I was in school).
Yes I accepted the theory that questions and answers have a lot in common (i.e. they’re “posts”).
I don’t think answers are topics though – or are they? – so topic-related data (titles, tags, meta-categories, …) could and therefore should be removed from the posts table, is what I was saying.
Come on, I sure you can cope with an elementary JOIN – you’re going to have to anyway, whether it’s joining two separate tables, or in the present schema it’s a self-join .
So you went to school in the 80? “You were lucky!” In the 70s I dreamed of one day growing up to work for some company big enough to own a computer – so that I could program one instead of only reading about programming. I graduated in the early 80s (never using a computer at uni), first job was help build packet switched data network software for a national telco (the proto-internet).
But I read like some of Joe Celko’s books in the 90s – like, SQL for Smarties, see its bit there on Normalization – when I had to act as a DB developer.
As a kid in the 70s started programming BASIC on an ASR33 connected to my Dad’s company’s minicomputer. 80s at University of Maryland - perfect timing - just missed having to use punch cards, everything on terminals, beginning of PCs, etc.
But back to the topic: I see what you’re saying, even if I disagree with some of it. But some of what I have said is (I think) echoing the initial Schema Proposal from @cbrumbaugh I’d like to see what @JackDouglas and some others have to say about all of this.
Made a thread per your suggestion:
I hadn’t read all of the first version before you posted this second version, so I’m reviewing this without having read all that. Sorry if I’m repeating something you already addressed.
Some of those fields, like the ones about SE accounts and importing data, seem like they should be at the community level, not the instance level. I might have imported my data from Writing but not (yet) from Worldbuilding.
What is IsSameAsInstance in this instance-level table?
What are HelpText and FAQ – literal text or URLs or what? Ultimately a community will have many help topics, many of which will be common across the instance.
IsBeta implies that there are two kinds of sites, beta and not. We haven’t talked about beta-ness at all yet, and I think one thing we’ve learned from SE is that there are more than two natural stages to a community. If we have the notion of site types/stages let’s make that an enum so we can adjust it when we need to.
The ID column here is the actual trust level, right? (0 through 6 as currently proposed.)
Count me with those wondering whether Can[many things] is the best approach. I was expecting something like:
TrustLevelId - which is what drives baseline privileges
Restrictions - some representation (waves hands) of overrides, like temporary edit bans (I guess this is your UsersSuspensions table?)
And then a table for trust-level info that says what privileges come with each level.
For faster lookups, you can then create a view that joins all this together. Maybe the table you’re proposing here is that view and the underlying implementation is not exactly as shown here? Sorry if these are naive questions.
Please plan for there being more types of posts in the future. I think all of the following are Posts:
- Discussions (a type of post useful on meta)
- Blog posts
- Canonical posts
- Tag wikis
- Election nominations
- Help topics
All of these are, at their core, blobs of text. All have authors and edit histories.
Most have titles. Most are votable. Some have comments.
Should we be thinking of posts as having a type column rather than a bunch of IsSomething columns that mean we have to change the table definition when we add a post type? (Oops, I guess this is that other conversation again.)
IsMeta is good, because it allows (in principle) any type of post to be meta or not. Meta announcements might be, under the covers, blog posts with IsMeta set, for example.
And anything else that takes comments (election posts, blog posts, etc).
I don’t understand this one. What’s this for?
General question: how does edit history get stored? Posts, comments, and user profiles should keep history. Is each revision of something a new row in the corresponding table, and the server code is responsible for finding and displaying the latest one and making the rest available via history links where applicable?
These fields are duplicated at both the Users and UsersCommunities table in order to take care of both someone wanting to import all, and someone wanting to import just a specific community. That’s also why the Bio field etc. is on both places.
This was proposed by manassehkatz
Most of those fields are in order to have a “master” user definition used when a user connects with a new community or in an existing community sets “IsSameAsInstance”.
I would say that those are the starting help text fields, as you mentioned we will probably have more than just the one, but that structure hasn’t been fleshed out yet.
This makes a lot of sense will change it next iteration.
I think I goofed up here, there should be a TrustLevel table where the id column is the actual level and then a TrustLevelCommunity table for each community to customize the text. This would also be where the customized how many items it would take to progress to a level.
With more time to think about this, I am leaning towards reference tables too.
VoteTypes is a reference table for votes, upvote and downvote right now.
Suppose I have 20 communities on SE, two of which have Codidact analogues. How is this flag set at the instance level? What happens when a third community gets created and I import data to that?
Tracking data import makes sense at the community level, but I’m having trouble seeing how it makes sense at the instance level. There’s no “import all my SE data” notion; we can only import from and to specific communities, which have to exist.
Ah, that makes sense. Since that’s a column that is only meaningful at the community level and not the instance level, I’m confused by its presence in the instance-level user table.
I’d rather leave help and FAQ out of the schema definition for now, until we work out how we really want these to work. There might end up being a Help table where rows are posts for individual topics (with columns for categories and some way to manage presentation order). I don’t think “help” is one column, and I don’t know that “FAQ” is a separate concept.
On edit history, remember that we’ll need to access that pretty frequently, which is another reason for it not to be off in another entire database. We’re already discussing this on that other thread, though, so I won’t continue that thread here.
Finally, thanks for all your work on this! I apologize for not leading with that.
Mistake. It should only be in UsersCommunities - to indicate that a particular user says “make all my profile stuff for this community same as at my top-level (Instance) profile”.
We could leave them out. My thought was that every Community will need to have at least one Help and at least one FAQ (both Markdown blobs), each of which could be linked to additional pages, etc.
If we plan from the start for more than one, then each could be a PostsId not NULL.
That’s fine with me.
The one disadvantage of a PostTypesId field is a drop of code needed at the start of the program to “read the PostTypesIds for Q, A, Blog, etc.” because there will be a lot of code specific to each particular type of post and it will need to know “For Questions, SELECT * WHERE PostTypesId = 3” because throwing an extra “JOIN PostTypes ON Posts.PostTypeId = PostTypes.Id WHERE PostTypes.Type = ‘Q’” on every PostType-specific query is technically correct but does not make sense.
That’s the whole discussion about history tables (same schema/DB or separate, all fields or just changed fields, backup/restore vs. history, etc.) In my mind, every revision is a new history row in a separate history table “somewhere” but the server code doesn’t find the latest, it goes to the primary table and retrieves the only copy from there. Only when actual history is needed (view Edit History, research User changes, etc.) do we read from the history tables.
I think we should assume that each community will have multiple help topics, some of which will be inherited from the instance (how do to X) and some of which will be customized (scope). I’m not a big fan of premeditated FAQs, which usually aren’t really FAQs but just things a site wants to push at users. Emergent FAQs are different, but we have to wait for them to emerge.
Given that there will be multiple topics, I think a HelpTopics table makes sense. If there’s only one entry in the MVP that’s fine, but there will be more before too much longer.
Right. I wasn’t assuming that would be burdensome. I mean, SO stores post types separately like this, and if they can handle that at their scale, we should be fine, right?
My naive preference is to have a history table for each table where we track history, in the same schema in the same database. The primary table always has the current version; the history table stores prior revisions when they exist, with postID+revision# together forming a unique ID (key?) for the row. Does it need to be more complicated than that?
That is pretty close to what I recommend (and what I do IRL). The only exception is that I have learned the hard way that the better way to build that secondary (in this case history) table is to give it a simple autoincrement Id (just like the primary file) and if you need a revision # then add that separately and make postId + revision the secondary key. That avoids a bunch of problems with locking, collisions, etc. But in many cases “revision #” doesn’t really matter, only “the sequence of the revisions” - which can be determined from the sequence of the Ids or from the timestamps.
Oh, that makes sense! I was hung up on “need a number for sequencing”, but really you just need them to be in order; gaps don’t matter. So autoincrement on the whole table does that for us – win.
The key (pun intended) is that nobody really cares that something was “the 4th edit of 7”. They just need to always know that the edits will be in order. In fact, if some edits are hidden because they were spam/revenge/etc. and deleted by a Moderator, then the sequence might change from:
- 1 - a
- 2 - b
- 3 - junk
- 4 - c
- 1 - a
- 2 - b
- 3 - c
and with the numbering done on-the-fly and only the sequence stored in the databse, nothing needs to be done in that situation except “delete 3”. Everything else flows automatically.
Actually, that raises an interesting twist: This means there can be situations where the history tables themselves need separate delete flags - i.e., separate from the delete flag transferred from the primary table!
I’m sure there will be some back and forth about the details of how much to normalize and do on. In the end it might not matter that much what you settle on — but if there is anything you can do now that makes later changes easier, I’d make that top priority, because you won’t think of everything in advance and anyway requirements will change.
In my view, one of the problems with SE is the glacial pace of change — not radical feature changes, but simple bug fixes. I’m sure the reason is that changing anything risks breaking other things (maybe not quite this bad but the same underlying problem).
So I would encourage you to have a broader discussion about architecture now (if you haven’t already). Can you use schemas (or some other method) to separate the system up into logical areas with fewer dependencies? Is it worth having more code duplication to allow more freedom later?
That kind of question can impact how you design your schema. Putting different post types into a single table means less code now but may mean it’s harder to make changes to (say) just ‘answers’ later. I’m not trying to make a big deal of that particular aspect, just use it as an example — On TopAnswers we’ve been toying with the idea of an optional separate tl;dr field in answers that displays on the question summaries which (if present) would show on the question list instead of the first line of the answer we have presently:
We probably won’t end up doing this, but my point is that there is no way we’d have even thought of if until after development had started. There are and will be lots of other things like this that pop up. We want to do everything possible now to avoid getting bogged down later on, which will take away all the fun, for developers and users alike.
Anything’s better than nothing
Data that’s not in second normal form seems inherently “crufty” to me – and tolerable in a toy project?
I’d like a schema to model relationships – it’s a relational database, eh – just like I like to use a typed language and compiler … to limit what can ever happen at runtime.
The schema that’s “necessary and sufficient” is I imagine one which can only contain valid data – and that’s the schema that’s maintainable, can be refactored if requirements change, is my theory. I gather that the nightmare of a DB admin is a badly-normalised DB, which contains invalid data, which must be “scrubbed” a.k.a. cleansed.
This (“correctness”, possibly aka “bondage and discipline”) is especially important when more than one application accesses the data (which isn’t the case at the moment).
If relationships aren’t evident in the schema then where are they evident – it isn’t like the application code is easier to understand than the schema.
The requirement to “make later changes easier” is IMO argument for early strictness – like, you start with a typed language.
Denormalization is – so I’ve read – something you do deliberately (and maybe, traditionally, in a “data warehouse” and not on the “operational database”). And “denormalised” not the same as “unnormalised” data.
This is just theory though. “How bad could it be?” (xkcd: goto)