Should we join forces with TopAnswers?

manassehkatz · 8 December 2019 23:43

Downloaded. Thank you.

I don’t have time to go through it all right now. But with a quick look I see lots of tables, separate history tables for “everything”, etc. Definitely my style - and the opposite of what some others here have suggested in terms of “a few tables hold everything” - but that’s generally from people who look work at the application level than the database side of things. Of course, there is more than one way to do this, so no real “right” or “wrong”.

JackDouglas · 8 December 2019 23:52

Don’t know about that

You probably don’t mean anyone here wants One True Lookup Table but just in case:

The One True Lookup Table (OTLT) design, also know as “one lookup table to rule them all”, is a classic anti-pattern that looks so sensible at first glance, that you will defend your decision to use it, even as your database performance takes a nose dive, your data integrity is compromised, and your developers stick pins into a voodoo doll in your likeness!

A lot of stuff in that article can and should be generalised wider. Hence my earlier snippet somewhere here about having a question and an answer table!

manassehkatz · 9 December 2019 00:16

Not exactly in Codidct.

But I have gone down some of that path before in other projects, and always regretted it a bit because of the ugliness of the code. And yet, the problem is that sometimes there really are a lot of lookup tables and they will change over time and want to have same admin page for all of them etc. etc. etc…

What we’ve been hashing about a bit here is the question of Questions vs. Answers vs. Blogs vs. Comments vs. Chat. On one extreme is “everything together” since they’re all “text + stuff” and they all ultimately link back to a Question (well, mostly they do). On the other extreme is “everything separate because they’re different - Questions & Blogs have titles & tags; Questions have Answers; Questions/Answers/Blogs have comments; Comments have threaded structure”; etc. My tendency has been (I think) somewhere in the middle. Looking at your SQL, clearly TopAnswers has gone to “everything separate”.

One of the discussion points has been the number of queries. There was a discussion of, for example, putting tags as comma-separated text. I disagreed and said the tag references should be another table (ManyToMany Questions - Tags) and the tag text in another table. Which is how you do it in TopAnswers and just plain makes sense because it lets you change the spelling of a tag or merge tags or other necessary operations trivially. It does means some extra FK lookups along the way for a routine Question display, but that really should not be a big deal.

So thank you - we are likely closer in philosophy on databases than it may have seemed at first. How much that will translate into sharing of structure, data, code, remains to be seen.

JackDouglas · 9 December 2019 00:35

Yes exactly, databases were born to join. Also views exist!

GRMule · 12 December 2019 05:28

All else aside, that front-end is an eyesore.

Marc.2377 · 12 December 2019 06:33

@JackDouglas, thanks for your answer If you can share the DB schema (either in script form or diagrams) it would be awesome. Please reach me on Discord or via email¹ when you have some time for that.

Regarding the “Web API serving the presentation layer” approach, I’m familiar with it and discussed the merits with some of our other developers already. I still don’t think that’s the approach we’ll end up with, but what I propose instead is something quite similar - indeed, an API (a library, to be more precise) that serves both the presentation layer and a Web API, which itself is a form of thin wrapper around the API/library.

Got it! Makes sense.

Btw, are you experienced with SQL Server? If yes, I would love if you could make some quick remarks about whether they compare, as this is the engine I’m most familiar with.

Thanks again, for all the insights.

[¹]: marc@<this project name>.org.

manassehkatz · 12 December 2019 15:18

@Marc.2377 @JackDouglas sent it to me - I’ll send it to you.

JackDouglas · 13 December 2019 19:42

My SQL Server knowledge is only skin-deep I’m afraid. There are some basic ways that I know in which Postgres is more like Oracle than SQL Server:

tables are heaps by default (and achieving on-disk clustering isn’t easy at all with postgres)
use of temporary result sets (temp tables and the like) are idiomatic in SQL Server. Postgres has them but there is more of an Oracle-like tradition of using bigger SQL statements and more CTEs (including writeable CTEs in postgres, which are wonderful) in postgres.
T-SQL is a very basic procedural wrapper around SQL. Postgres has PL/pgSQL (along with a bunch of other languages), which, like PL/SQL on Oracle is a full fledged programming language.

There are quite a few ‘enterprise’ areas where Postgres lags behind the commercial databases like SQL Server. Partitioning support isn’t as good for example. Postgres also doesn’t have query hints, which some would say is a good thing.

If I think of anything else obvious I’ll add it here!

cellio · 17 December 2019 14:15

Not sure where to record this so dropping it here, where the prompting conversation happened:

As part of data-sharing with TopAnswers or anybody else, we need to make sure we can account for the difference between imported content and original content. If a Codidact site has imported data from SE and then imports from TopAnswers, and TopAnswers imported that same SE data, we need to not end up with two copies. Harder: if TopAnswers imported content from SE and then edited it, and Codidact imported it from SE and still has the original version, importing from TopAnswers should update it. Hardest: merge conflicts.

Let’s at least make sure we don’t end up with dupes. I don’t know if we tackle that at the API layer (here are SE URLs for stuff to exclude) or deeper (get everything and let the backend sort it out), but we need to account for it somewhere.

manassehkatz · 17 December 2019 15:04

The key (literally) thing is to make sure that any IDs are kept between systems. Presumably (I haven’t looked) there is a Question ID for each Question coming from SE. That needs to be preserved for both matching to updates from elsewhere (e.g., TopAnswers) but also for future updates from SE. Hopefully TopAnswers is able to keep that ID too and include it (along with the TopAnswers ID) when transferring to Codidact. But yes, it can get messy. And it might mean needing a separate Review Queue specifically to handle Import Merge Conflicts.

JackDouglas · 19 December 2019 17:24

Yes we keep the SE ID and can expose it in any dump or api. You might like to know we invariably need to clean up the imported markdown from SE to make it properly CommonMark compatible — a lot of this is done automatically in the import process (eg ###heading → ### heading) but some things need some manual attention.

cwellsx · 19 December 2019 17:51

Theoretically true, maybe not in practice.

To me ASP means coding server-side C# which implements page responses. The C# fetches data from a database – probably SQL – and renders that to HTML.

Whereas something like the MEAN stack is closer to nothing on the server – instead I think you:

Define what the page data is – in an abstract/convenient form (like just a TypeScript interface definition)
Define a REST API which is to GET that data e.g. as JSON
Render that data into the DOM (using Angular or React)

In theory the two solutions could use the same database, in practice people tend to use different technology stacks.

In the former case (ASP+SQL) there’s Object-relational impedance mismatch which people try to hide using ORMs. A theoretical benefit is getting ACID “for free” – but, might pay more at run-time for hosting a database, and you might have to be real careful about caching and transaction concurrency.

In the latter case the server might be as thin as possible – an out-of-the-box web server which implements the routes by delegating to a data layer like MongoDB which more-or-less directly supports JSON I/O.

JackDouglas · 7 January 2020 14:09

TopAnswers source is now on GitHub: https://github.com/topanswers/topanswers

cc @Marc.2377

Marc.2377 · 20 January 2020 01:02

Nice, thanks for letting us know @JackDouglas!

Is it ok with you and other devs if I share the database schema you and Manasseh gave me (which I loaded into pgModeler and exported to PNG/SVG) with some of our contributors at this time? I can also submit it to you of course.

(P.S. please DM me on Discord if you’d like to discuss details.)

JackDouglas · 20 January 2020 20:12

Yes, now we are on GitHub anyway, please feel free to share!

marklar423 · 24 January 2020 14:54

Not sure if this has been mentioned yet, but what about a federated/distributed system, similar to Mastadon? That way each individual site can run on whichever tech stack they please, but still be part of a network.

It doesn’t have to be very complicated, we’d just need sites that are part of the network implement things like:

A single OAuth login, so that one account works everywhere
A common API for things like user profiles and reputation

So some sites will be on TopAnswers, some on Codidact, and some sites on even more tech stacks or platforms that fit them better.

manassehkatz · 24 January 2020 15:43

I have serious concerns about this. I generally see three types of reactions to “data collection/aggregation”. Note that this is from my personal experience as a computer support professional and also with friends & family - not necessarily any reflection on “SE users” or any other larger group.

Security Conscious and/or Paranoid: No way! I don’t want anyone to know anything about me unless I tell them to. I don’t want to tell Google anything they don’t need to know, and it is scary how much they already “know”. If I have to have account “A” I certainly don’t want them sharing any of my information with “B” and “C”. For that matter, I’ll avoid even creating an account on a site if I can get the most important stuff anonymously - why should yet another company get my email address to spam me?
Make My Life Easy and/or Tell-All and/or I Don’t Care: Sign in from my Facebook account to “everywhere else”? Sure. I’m in Facebook telling the world every step of my life to all my so-called friends, so I might as well use that to connect to “A” and “B” and “C”. They’re going to mash all my data together? So what, the US 3-letter-agencies, China, Russia, the hackers all have everything anyway, so just make my life easy.
In Between: I don’t mind sharing my data around where it makes sense. I actually think about every site where I setup an account. I’ll share stuff if it makes sense, and I don’t shy away from creating an account on yet another site if there is some benefit that outweighs the risk. I might even actually read the Privacy Policy or other legalese on a new site (though I never quite trust it).

I’d like to think that most intelligent people fall into the 3rd category. I know that’s not the case - but I encourage people to actually think about technology (and everything else in the world that affects us) rather than blindly accepting everything (which is dangerous, IMHO) or blocking out anything we can’t guarantee to be perfect (which is very limiting, IMHO).

That clearly directs me, personally, away from “a single OAuth login, so that one account works everywhere”. I don’t care what anyone says in their policies, that type of setup is a magnet for attacks (from the outside, trying to get in and use the system to get to multiple systems at one shot) and abuse (from the inside - who’s to say that an insider at “A” couldn’t use that information to gain access to “A”'s customers who also have “B” and “C” and “D” and corrupt/abuse/misuse those accounts without the customers’ knowledge.

I don’t shy away from creating new accounts on any new system that appeals to be that seems “reasonably” reputable and secure. But when I get to a site where after the first page they say “set up an account or login with your Google or Facebook account” (or something like that), I almost always run away. If I really must get the information, and can’t find it elsewhere, then I create a fresh account. Another username. Another password. But better than having a site I don’t trust automatically linked with other sites that I depend on.

marklar423 · 24 January 2020 16:01

@manassehkatz I hear what you’re saying. I personally prefer to use OAuth since I trust Google with password security more than others, but I can see how not everybody would want that setup.

What if the OAuth was optional - a convenience? So if you prefer you can create an account individually with each site in the network, or for convenience you can use OAuth and link your various network accounts together similar to how SE works now.

manassehkatz · 24 January 2020 16:34

1 - I have no problem (not really sure how it works - to be honest, not sure why it isn’t just all one automatically connected system) with SE combining authentication/account for all of their own sites, just as we plan to have one instance running all the “primary” Codidact communities. So not really an issue there. - it is really just one system.

2 - Yes, I actually do trust Google more than a lot of others. But I still don’t like tying too many unrelated things together. No problem with Google connecting my personal/friends/family, volunteer organization’/work stuff together when it is all hosted on Google anyway (e.g., Google Docs), though even then I try to separate things out and use separate Google accounts for different projects (knowing full well that “its all in Google”). But if I actually used Quora (I don’t - almost every time that Quora shows up with results for a question I’m Googleing, it turns out to be “meh” at best misleading or incomplete typically, and “wrong” at worst), I would set up a separate account and absolutely not connect it to my Google or (if I had one) Facebook account. No point in putting all my eggs in one basket. (FYI, I just tried Quora via its home page - have to login (Google, Facebook or create an account) to get to anything. That is just awful. Ugghh. Staying away. Codidact better not end up like that.)

marklar423 · 24 January 2020 17:02

That’s exactly what I’m trying to avoid too - I’m thinking more of an federation of independent communities rather than a top down one-size-fits all solution.

Think of it as a sort of EU of Q&A sites, where each site is independent, with separate logins, separate data, separate tech stacks even - but are loosely unified by certain things.

OAuth will enable logins to be honored everywhere in the network, and let you link your accounts together if you choose.
A common data API will let people script public data seamlessly like we do with SE now
A set of common, agreed on principles (like content licensing, CoC, etc)

So each site will keep its own identity and ways of doing things, but still be united into a single network! To use an example talked about above, some sites may choose to have downvotes, others won’t.

Take a peek over at Mastadon, which is where I got the idea from - https://joinmastodon.org/. I think the model is a great balance between the benefits of a single large network with a large userbase, plus the flexibility that running independent instances provides.

(And if you’re interested - under the hood, even when you use OAuth it still creates a separate account local to each site. OAuth just saves you the hassle of separate logins and enables you to link your various network accounts together. It’s necessary to share logins across different URL domains, due to browse restrictions on cross-site cookies.)