Proposal -- Topics Table

Instead of this from Schema Proposal Round 2

**Posts**
UsersId (bigserial)
CommunitiesId (bigserial)
Title (Text 255 Default NULL)
Body (Text)
IsMeta 
ParentId 
etc.

**Comments**
UsersId (bigserial)
PostsId (bigserial)
Body (Text)
etc.

**PostsTags**
PostsId (bigserial)
TagsId (bigserial)
UNIQUE CONSTRAINT (PostsId,TagsId)

… I propose …

**Topics**
CommunitiesId (bigserial)
Title (Text 255 Default NULL) 
IsMeta
etc.

**Posts**
UsersId (bigserial)
Body (Text)
TopicsId # instead of ParentId
etc.

**Comments**
UsersId (bigserial)
PostsId (bigserial) # as before
Body (Text)
etc.

**TopicsTags**
TopicsId (bigserial) # instead of PostsId
TagsId (bigserial)
UNIQUE CONSTRAINT (TopicsId,TagsId)

It extracts (into a Topics table) stuff that’s applicable to a whole topic and not to a specific “posts” (and certainly not to any answer) within the topic.

That’s assuming a structure like SE’s and/or Discourse’s – i.e. a list of topics each containing a list of posts (of which the first might be a question).

Having a ParentId in the Post record would allow for n-level trees – each answer is itself a topic – which is an advanced and potentially complicated feature, perhaps for a later feature if ever.

Having a Topics table makes the existence of “Topics” explicit and gives you somewhere to put things (like a title, and whether it’s meta) and to reference things (like tags) which belong to the whole topic and not e.g. to an answer.

Could you clarify what you mean by a “topic”? I don’t understand what concept you’re trying to capture here. A question is incomplete without a title and tag(s), so what’s left for the “topic” layer?

2 Likes

When you ask new a question I’d expect you (i.e. I expect the application and/or the database procedure) to create

  1. a new topic, and
  2. the first post within that topic.

And subsequent answers then belong to the same topic.

Tags and the title and whether it’s meta and which community it belongs to and so on are properties of the whole topic – of the group of the posts – of the question and the answers – and shouldn’t be in the Posts table.

Also, “topics” are more-or-less what you see when you look at this page – https://forum.codidact.org/ – those aren’t exactly “posts”, those are “topics”.

Whereas on this page we’re seeing all the posts within one topic – plus the topic’s title and tags.

That doesn’t necessarily follow – a “Company” for example would be incomplete without “Employees”, and vice versa, nevertheless you don’t put all the “Company” information in the first Employee record in the Employees table.

Oh, so by “topic” you mean a container for everything on the question page? To what end?

1 Like

I think this structure would make sense for a forum software, you have a tooic and subordinate replies which are linear and not hierarchical. However for a Q&A site with hierarchical content it makes less sense. For example, how do you do these simple use cases with simple SQL:

  • Get all answers sorted by score/last activity
  • Determine, whether a vote is on a question or answer. (for example to give then different weight for trust level thresholds)

For all these cases, you’d need a sub query to determine, whether a post is the first one. Alternatively, you could add an IsFirst column, which brings us almost back to the prior variant.

Also what about help center articles, which might be stored as posts (as some suggested)? They’d have a title, but no tags. Do you want to create a new container table for every type? I’d rather have some NULL values than maintaining that.

Therefore I’d object this structure.

1 Like

I’m with @luap42 on this one. I absolutely do not think of a Question as a “big topic” the way that it is in this (and many other) forums. I think of it is as a small “chunk”, as are the answers, which are sometimes grouped (Q + A + A + A…) and sometimes not (Blog, Canonical, Help, FAQ, Tag Wiki, etc.). To me (and it seems we have quite a disagreement going on this), Q & A have more in common with each other (and with the other types of Posts) than they have differences. I could see it sort of the other way around: Instead of what I think @cwellsx is suggesting (correct me if I’m wrong):

  • Table of Q - which each point to a Post
  • Table of A - which each point to a Post and to a Q

instead I prefer:

  • Table of Posts, which includes an indicator that each one is a Q or an A (or a Blog or whatever)
  • If a Post happens to be a Q, then either:
    • include the Q-specific fields (as the original proposed schema), despite the non-use of those fields for A, Blog, etc. or
    • include a pointer to a Q-specific table of the additional information
1 Like

Because it’s wrong to store Company details in the first Employee record, I don’t know – I imagined it’s obvious to someone with an eye for normalising data, don’t people do that anymore?

The schema proposed other flags in the Posts table …

IsAccepted (Boolean Default FALSE)
IsQuestion (Boolean default TRUE)
IsAnswer (Boolean default FALSE)
IsBlog (Boolean default FALSE)
IsCanonical (Boolean default FALSE)
IsClosed (Boolean default FALSE)
IsProtected (Boolean default FALSE)

… some of those might identity types of topic, but some (e.g. IsQuestion versus IsAnswer) appear to identity types of post.

Or you could vote on topics instead of (or as well as) on questions.

“Almost”

So you store all types of data in one table, eh? Interesting.

I expect there’s a word for that “pattern” but I don’t know it.

Me too?

So if “posts” is for questions and/or answers, then “big topic data” (e.g. titles and tags) don’t belong to those posts records. Instead the posts (questions and answers) belong to topics.

More specifically, related posts each belong to the same topic.

You were wrong? I was suggesting:

  • Table of topics, which
    • Have a title
    • Have or belong to or more categories e.g. “IsMeta”, also communityId etc.
    • Have tags (because of there being a TopicsTags table instead of a PostsTags table)
    • Don’t point to a Post (point to a community maybe is about all)
  • Table of posts, which
    • Are questions and/or answers
    • Have a body and an author etc., votes and comments
    • May have a type if that’s important (e.g. “is first” or “is question”)
    • Belong to a topic (have topicId as a foreign key)
    • Don’t necessarily reference each other at all – no self join in the MVP

That is a fundamentally different structure than Q&A. You would end up with:

  • Topics that have NO Q or A or Blog or anything.
  • Topics with MULTIPLE Questions.

Both of which are “forum-like” and not “Q&A-like”. Yes, I want to broaden the definition of “Post” beyond Q or A or Q+A(n), to include Blog, Canonical and as others have suggested Help, FAQ, TagWiki, etc. But with the exception of Q+A(n), each of those has exactly one body text, original author, etc. I think these are two different ways of looking at the overall site structure, which then determines the DB structure.

1 Like

That’s … true?

But “Q+A(n)” is what it’s all about, the rest is icing on the cake. .

But apparently you’ll import like 20 million topics, with several posts per topic, and several tags per topic (not tags per post) – that’s the use-case to optimise IMO (and perhaps the only one in the MVP).

And creating/displaying/editing each (any one) topic is kind of trivial, isn’t it? And the harder bit then is selecting and paging (maintaining, caching) through the various lists of topics – and you propose to do that without even having a “topics” table?

You can do it, it’s just not the most obvious way to design that first use case (i.e. Q+A), imo

That’s sort of true.

To my mind a function of a self-join is to support an n-level tree of posts (e.g. as shown here) – anything less complicated than that, in a relational database, imo deserves the clarity of a finite number of well-identified levels each with its own table (e.g. communities contain topics which contain posts).

This might be an example of Object-relational impedance mismatch already, am I right – i.e. you want to “subclass” posts now – model different types of post, each type having subtly different data and different relationships or other (and, apparently, you want to do that by stuffing everything into one table).

I agree that what you’re proposing is more general – everything in one big table (“big” meaning “lots of records” and “lots of columns” … and lots of types of record all in one table), little relational integrity, add further columns for new use cases.

I don’t know that it’s orthodox?

Like, Gall’s law:

A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.

I guess I think that what I’m proposing would “work” and be “simple” (and in that way future-proof) – being really orthodox in the schema design to avoid any kind of bleeding edge design.

What do you think about @JackDouglas’s suggestion (if I understand him correctly) of having each post type be its own table? So there’s a Questions table, an Answers table, a BlogPosts table, a HelpTopics table, etc, and each table defines the columns that make sense for that post type. (Obviously there will be commonalities, like author, creation date, and edit history.) The advantage seems to be that each post type can be suitably tuned, so we don’t end up in the situation down the road where we can’t do such-and-such “obvious” thing for some new post type because of how we built the Posts table way back at the beginning.

If you actually need to access all the posts through one table (I’m not sure why you would), you can use a view, right?

I know somebody is going to say “but joins!”, but really, is that so bad? I mean, if you have a Posts table you’re going to have to join it to itself to wire up questions with their answers, right? (This is my experience with SEDE, which has that structure, anyway.)

Well, in TopAnswers for comparison, there are (as I understand it), separate Question and Answer tables, with Question getting the connection to Community and the Title, plus a few other fields, and Answer getting the connection to Question - about what you would expect. But Question also has a Type - Question, Meta, Blog. So it seems to be a bit in-between - i.e., Question is for anything “at the top” but not limited to Questions per se.

One of my concerns is the very useful SE search that can look at both Questions & Answers together. On the other hand, SE (at least when looking at All Actions in my User page) can combine Questions, Answers and Comments - and Comments are almost certainly a separate table. On the other hand, “normal” searching would not look at Comments, and single-User-specific searches are going to be small enough that combining a couple of sets of search results is not a big deal.

I still like my way. I see TopAnswers as in-between. I’m very much not in favor of @cwellsx “everything really separate” idea - even though it is more formal (which is often good) and orthodox (which is often good). I tend to be, for better (hopefully mostly) or worse (sometimes…) a gut-feeling programmer. And this is the way I see it.

We’ll never get it perfect the first time around. (Or the 2nd…) We will be making changes to the schema - whether Posts are combined or separate. That is a problem when you are dealing with 500,000,000-record tables (been there). But with 1,000,000 records, even 10,000,000 records - just not that big a deal.

True. And if we follow the “all DB access goes through an internal API” path, then maybe it doesn’t matter as much – if we change the DB layout it only affects the path between that API and the DB, but it doesn’t require changes to server code, let alone client code. Let’s build an API to protect ourselves from ourselves. :slight_smile:

1 Like

Except of course that “API code” is really “server code” just boxed a bit differently.

I thought about this, and decided no because posts have votes and comments.

If Posts were to be split between Questions and Answers, then there also needs to be a QuestionVote and a AnswerVote table plus QuestionComment and a AnswerComment table.

That for me is the big advantage of having posts in a single table.

3 Likes

Which is basically what I’ve been saying: Questions and Answers are much more alike than they may seem, at least within this type of system.

1 Like

That’s reassuring, I guess: i.e. if that schema is good enough for them, then …

(Although just possibly that isn’t the schema of their operational database.)

It isn’t just run-time performance which drove me but relational integrity – for example using the suggested schema:

  • The title must be nullable because answers don’t have a separate title – or the question’s title must be copied into its answers – ditto isMeta and CommunitiesId etc.
  • PostsTags could reference a post which (wrongly) happens to be an answer not a question

I’m also not convinced (haven’t seen) that One Big Table is easier to select from – i.e. to code SELECT statements for all use cases, and to define a good physical index e.g. for when selecting lists of topics.

And you might want to cache something at the “topics” level to improve the performance of selecting pages from the lists of topics – for example, cache the most recent date at which a answer within the topic was edited (for the view which sorts the topics by their “most recent activity” date which is very common). Having a topics table gives you a table in which to contain that new/cached column (instead of adding it to the “questions” record and so necessarily as a nullable column of every other record in the posts table).

So IMO it improves conceptual integrity too (separation of concerns).

There’s “more than one way to skin a cat” (allegedly).

  • If Questions and Answers were in separate tables (one with a QuestionsId and the other with an AnswersId as their PK) they could in theory both still have a PostsId as well – their PostsId being a foreign key into a Posts table, which might contain no columns except PostsId.

    That would be analogous to saying that the two types are both subtypes of Post.

    And that would gives you something – i.e. a PostsId again – to which you can tie what the types have in common … so there’s one PostsVotes table and one PostsComments table (each with PostsId as their foreign key), as you wanted.

  • I wasn’t actually suggesting that though.

    Because Questions and Answers have so much data in common, and to avoid that extra and IMO unnecessary level of indirection (i.e. subclassing), then instead of saying they’re two subtypes (of Posts) I said they’re both the same type with the same properties – and contained in the same table (“Posts”), why not.

    After deciding they can be identical data in the same table, then any-and-all “difference” between the two then I store in a different table (i.e. “Topics”), with posts having TopicsId and their foreign key. That “difference” (between Questions and Answers) is the columns which exist in questions but not in answers, and/or the values which multiple records theoretically always share or “belong to” – i.e. title, tags, whether it’s meta, and the CommunitiesId

Generalising from the above, this suggestion could be instead of some of these fields:

IsAccepted (Boolean Default FALSE)
IsQuestion (Boolean default TRUE)
IsAnswer (Boolean default FALSE)
IsBlog (Boolean default FALSE)
IsCanonical (Boolean default FALSE)
IsClosed (Boolean default FALSE)
IsProtected (Boolean default FALSE)
ParentId (bigserial Default NULL)

So I might suggest:

  • If the post is a question or answer then it has a TopicsId
  • And/or if the post is a blog post then it has a BlogsId – or a HelpsId and so on
  • The auxiliary tables (Topics, Blogs, Helps) contain any data unique to that type (that isn’t particular to every Post type and Post instance)
  • The auxiliary tables – being a few separate table types – provide a convenient way to select all/only instances of that type (e.g. “all topics”)

Alternatively that FK relationship could be in the other direction – e.g. the Blogs table could include a PostsId field to identity the associated Posts record (and no BlogsId in the Posts table). In that case Topics (having more than one post) would have a TopicsPosts join table, or contain a PostsId field for the question and have a TopicsAnswers join table.

If the “API code” is a stored procedure and/or function within the database – instead of in the .NET application – then it is “protecting ourselves from ourselves” in the sense implied by Conway’s law.

1 Like

Note that the “topic” here is essentially what I called the “title” in this other thread.

Also note that there’s also the proposal to have a separate discussion post type on meta. It is not unthinkable that one day we might want to implement threaded posts for discussion-type meta posts. That is, “answers” on a meta discussion would in turn get their own “answers”. So the possibility of supporting forum-type threads in the future, while not the most important consideration right now, should also not be dismissed completely.

I think I was the one (or one of the ones?) who proposed a “discussion” post type, with meta as the use case. Not all meta posts are discussions, but some might be.

I’d like to see post types be about their intrinsic properties – question, answer, blog post, discussion, help topic, etc. Whether they are also meta posts is orthogonal. There could be meta-specific help, for instance, indicated by the “IsMeta” flag on a help-topic post. We’ll probably say that discussion posts are only meta posts (not welcome on main), but again, that should be handled via the flag (plus the software not allowing a path to create non-meta discussion posts).

2 Likes