What kind of database should we use?

cwellsx · 23 December 2019 02:59

I’d like it as a future requirement, to verify that the chosen database engine has a way to support it.

It might, even if it’s MS SQL database – e.g. store in some big XML field?

I’d like to think that it (and tree-like threaded comments) are reasonably foreseeable and that part of the MVP design (not implementation) includes its supporting future requirements and not only the immediate ones, i.e. to not “paint oneself into a corner”.

I wonder how comments are supported in the database anyway? If it’s SQL a naive implementation might put them in a separate table to be joined to topics, possibly instead they could all be denormalized in some XML field already within each topic instance?

manassehkatz · 23 December 2019 03:11

My initial (and still main) thought on comments is that they are a typical SQL table. That’s because they are more than just “blobs of text following a post” - they have (or could have - not necessarily the same as SE, etc.):

Authors
Votes
Flags
Edit History
etc.

So they need all the relational capabilities that everything else has.

Academic-Quantity · 23 December 2019 16:14

I’m inclined to say if a relational database works fine for our core Q&A content, it would be good enough for canonical posts too and if it wouldn’t work for canonical posts, it wouldn’t work for Q&A’s.

There are some valid concerns about using a relational database here, but canonical posts don’t seem to need to factor into that discussion.

But then I might also be missing why you think canonical posts would be so different from regular Q&A’s (in terms of underlying representation, not content).

Although I’m also of the mindset that it’s better to just get started with something and fix it later, which shouldn’t be too hard if your design is good and modular (which is where most people tend to go wrong, unfortunately), and I’m also probably a bigger fan of relational databases than I should be.

manassehkatz · 23 December 2019 16:17

I think a relational database works for “everything” here. The differences between “Q&A”, “Canonical Posts” (whatever we call them) and “Blogs” (whatever we call them) are functional differences based on a few fields, how they are displayed, etc. Fundamentally they are “a bunch of metadata” + “editable Markdown text”, etc. All fit into an RDBS just fine. A ton of common code between them.

cwellsx · 23 December 2019 16:50

So are we all.

I’m heedful though of the corpses of 20 SE clones already, and here you want to out-do SE itself eventually.

It’s that threaded conversations are kind of tree or graph.

And documents are trees too, nested subsections, maybe you want different rules (tags, privileges, commentary) for some section (and subsections) of a book and not others.

Looking back I guess see this and especially e.g. this.

Plus Monica’s saying, “When SE did Doc they had to make a whole new thing” though I don’t know about that.

Maybe we want to be working with trees and structured documents, and relational is better for implementing mere scrapbooks.

The example in Celko’s book I referenced, of “hierarchical” information, was of “parts with sub-parts”.

So SE now is all about 1-to-n relations of different kinds – not about different layers of the same kind of information (tags with subtags, sections with subsections) – and the lack of hierarchy makes the data a big flat unstructured mess (which you can search with a search tool but not organise).

I think that’s fundamental and an artefact or constraint of the database implementation, or am I wrong?

See also ontology I guess (though I never studied comp.sci.).

manassehkatz · 23 December 2019 17:00

(ARGGHH! Posted in wrong place. Deleted. Pasted here and got “Too similar to recent post” - which of course it is! Let’s see if adding this line at the top fixes it.)

What I have often seen suggested - and never found real-world actually worked in a production system (though I am sure there are some out there…somewhere) is this idea of “Let’s throw away efficient, productive, proven, reliable, etc. RDBS (MS SQL, MySQL, PostgreSQL, etc.) and replace with the latest, greatest nebulous not-relational system that handles “trees”, “networks”, “text”, etc.”

IMNSHO, an RDBS is 100% absolutely the type of back-end data store for this type of system. No doubt about it. Not one bit.

That being said, I will admit that:

RDBMS typically don’t handle trees very well. I have used various implementations (both other people’s software and rolled my own) and an RDBS can store a tree well and moderately well retrieve a tree but the functions needed to maintain a tree are typically a mess (to put it nicely). To the point where in my current largest work project, we went over the course of several years from “a few very specific tree-like structures” to “a totally flexible true tree structure (stored in the RDBS)” back to “a few very specific tree-like structures” because maintaining those trees proved to be pretty much ridiculous.
Text is hard to handle well in an RDBMS. An RDBMS typically stores it as a blob. Some index the text in a way that makes searching reasonably fast. But they (again, someone will prove me wrong) don’t handle structure within the text.

End result: We may want to do something to better handle things like:

Creating tree structure or threading or “something” with comments beyond a simple list.
Creating “structure” out of “User created Markdown blobs of text” using various tags within the text.

and possibly other things that an RDBMS doesn’t natively handle very well. But an RDBMS is still the right choice for the primary storage of Q&A, Comments, “Blogs”, Users, etc.

cwellsx · 23 December 2019 17:13

Ironically, here we were deleting and reposting in order to keep up with a moderator’s migrating an evolving sub-topic into it’s own new topic.

Welcome to information being siloed into topics.

Much convenience for the authors, so easily followed by the readers, not.

Very much so: software just like this. That’s what I’m complaining about or afraid of.

Yep, that was my experience too. So.

That’s aka an “impedance mismatch”, isn’t it. It’s not easy to prototype: a square thing in a round hole or whatever the metaphor is, and you can’t just write and expect it to work, it turns on you and fights back.

But do you think it supports tree-like structures well enough, then, to be no hassle? Does “relational” remain like the “obviously correct and best” solution, if users want to present and edit any hierarchical/structured information? Can you make it work with documents (even ontology) and not just a flat array of topics?

Alex · 23 December 2019 18:50

Craigslist’s migration from MySQL to MongoDB is one such example I know of (just to give an example). Edit: actually, after some DuckDuckGoing I found that MongoDB has a list of customers who use their database.

I added this post purely for educational purposes. I don’t mean to add fuel to the fire. I believe everything has its pros and cons and it is merely a choice of choosing the right tool for the job.

For this particular project I believe relational database is a better tool to go with.
_{At the very least for now.}

shagie · 23 December 2019 21:07

Hierarchical queries can be represented in all databases using a nested set model. This is incidentally how Redmine models its data so that it will work with Sql Server, MySQL and Postgresql. Yes, the updates are a bit clunky, but they’re not awful.

That said, given a database being selected, every modern relational database that I’ve touched has a better way of doing it. Oracle has connect by prior. SQL Server and MySQL use recursive common table expressions. PostgreSQL has the ltree module which makes me go “what… oh, that’s… interesting… I’m going to need to find a project to play with that.”

cwellsx · 23 December 2019 21:34

So there is a feasible and supported/obvious way, you reckon, albeit diverse non-standard ways.
Is it PostgreSQL then, in particular, what people are planning for this project , do you know?

manassehkatz · 23 December 2019 21:41

Correct. The plan is definitely PostgreSQL. It has fairly broad approval from the initial group of developers.

paulocoghi · 30 December 2019 10:49

I would like to suggest a non-sql database to give the project scalability, in particular ScyllaDB.

As I said on discord, I believe the project needs to choose a database observing a balance between maintainability and scalability. And SQL databases, when clustered, require all data on all nodes. This doesn’t scale well. I already maintained MariaDB Clusters and it’s expensive to scale.

ScyllaDB is fully compatible to the (already stable and popular) Cassandra API, and was completely rewritten to provide ease of maintenance and performance.

It can provide many and multiple millions of TPS on a single node, making it a perfect choice in my opinion.

paulocoghi · 30 December 2019 10:49

I can do the sysadmin work to create and maintain the ScyllaDB cluster.

cegfault · 30 December 2019 12:44

Although I love ScyllaDB and would want to see this option in the future, I don’t think this is the right time for it. It seems like most of the expertise and work being done now is around Postgre.

It’s more important to get something working than it is for it to be ideal. Because this is an open source project, there will be plenty of options to upgrade to Scylla in the near future (especially if we do a good job of encapsulating the db logic and semantics in code). Ideally, people who are running their own instances should be able to use mysql, percona, maria, postgre, scylla, or really anything they want.

paulocoghi · 30 December 2019 13:34

It’s more important to get something working than it is for it to be ideal.

Agree! If the code is database agnostic and, in some version, it allows the option to change the database, I agree.

The only downside is the migration of data when the instance’s owner chooses to change the database. But I understand that, still, this is a future problem and not a problem for Codidact itself.

I don’t think this is the right time for it.

Scylla is based on Cassandra standard and is more than mature. It’s used by IBM, Samsung, Intel, Yahoo, Comcast, IMVU, etc

It seems like most of the expertise and work being done now is around Postgre.

Disagree. For many architectural reasons, relational databases can’t scale as well as non-relational and I see the expertise and work being mostly done on the non-relational ones.

Codidact can be not only as big as StackOverflow, but it can be naturally greater and bigger, because it’s open source.

And for this reason, I would still vote to make the project database agnostic but with a suggestion to start with Scylla first, allowing the instance owner to start better (on the scalability point-of-view) since the beginning.

paulocoghi · 30 December 2019 14:14

Although Scylla is made to scale, you can start with one node (that’s really easy to install) and add nodes later, transparently.

raphaelschmitz · 31 December 2019 13:20

I feel like it’s a very attractive option on its own, but less so in the context of our current situation.

Most people here do not have experience with Scylla. Also, it’s a volunteer project - let’s say it takes 3 industry work days (3x8h = 24h) to learn Scylla. That can easily be a whole month for volunteer work (4 weekends x 6h).

And then consider that a skeleton project is currently under development and we’ve been working toward that for 2-3 months now. With having to rethink a bunch of other stuff, you might easily look at taking 2 months just to be where we are right now - which still isn’t close to MVP. What we get in exchange is something like “if this really takes off, in 5 years a browser will open a question 50 milliseconds faster”.

All these numbers are just educated guesses and you’re free to create a real business case for this; try and convince engineers with numbers. However, you’ll have to be really convincing to justify the whole project being even later - and I fear the numbers won’t be that much of a difference to begin with.

paulocoghi · 31 December 2019 14:01

Very good point of view, @raphaelschmitz . And I cannot agree more with you! The way you exposed helped me rethink my suggestion

With the basics solved, we can make the project get more traction, more visibility and more contributors, and then yes, more advanced issues can be better addressed.

The support for Scylla is something to worry about not only after the MVP, but after 1.0 or 2.0 release

cellio · 31 December 2019 15:02

Yes. We need to get something running soon that will meet our needs for a while, and we will continue to adjust, extend, or even refactor as we go. It will likely be years before we have scale problems that would make Postgres blink. Let’s get this running now. I want to see active Codidact communities soon, not wait many months while we build something that could handle SO today. We won’t have SO today or likely for many years. Our current approach is sound, not fragile; let’s go with that for now.

jenny_d · 3 January 2020 10:39

Speaking as a sysadmin who’s been roped into supporting various databases, I strongly concur. Postgres is far more reliable.

If you need a volunteer sysadmin a few hours per week, I would like to help. I’m not a DBA as such, but I do know the basics of running a pgsql cluster, setting up backups etc.