MVP: Data import

Problem: An empty site gains no traction. We need sites that have SE analogues to be able to launch with content.

Proposal 1 (definitely MVP):
We need a way to “seed” a site from an SE community in a license-compliant way. Posts on our site need to link to the source post on SE and to the SE user. This attribution must be done in such a way that a user who “comes over” to us can re-license their content to us and drop the attribution back to SE, but that “license migration” isn’t itself MVP.

Proposal 2 (probably MVP):
We need a way to pull updates from SE that do not conflict with activity on our site – new questions, new answers, posts that have been edited there but not here. Specifically excluded: deletions. That SE deleted something doesn’t mean we should automatically do so.
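
A rough sketch of the per-post decision this implies, under some assumptions: the `Post` fields, the `edited_locally` flag and the action names are all made up for illustration, and none of this is an actual Codidact or SE API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    se_id: int                    # ID of the post on SE (hypothetical field)
    body: str
    se_revision: int              # last SE revision we imported
    edited_locally: bool = False  # has anyone edited it on our site?


def plan_update(local: Optional[Post], remote: Optional[Post]) -> str:
    """Decide what to do with one post when pulling updates from SE."""
    if remote is None:
        # Gone on SE: deletions are specifically excluded, so never auto-delete.
        return "keep"
    if local is None:
        # New question or new answer that we do not have yet.
        return "import"
    if remote.se_revision > local.se_revision and not local.edited_locally:
        # Edited there but not here: safe to take the newer text.
        return "update"
    # Unchanged on SE, or edited here as well: leave our copy alone.
    return "keep"
```

With this rule a post edited on both sides is left alone, so an update pull can never overwrite work done on our site.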

2 Likes

I’m not sure about seeding. It’s a very powerful tool, but it can backfire very badly.

Having a ton of imported content, most of which doesn’t have an owner (because the owner didn’t move from SE, in addition to all the owners who had already left SE before), is only viable if there are enough people to do some curation and to answer the new questions. Curation of old content includes moderate edits, retagging as needed, fixing broken links, and whatever else needs to be updated.

There are plenty of sites around that just grab SE content wholesale and display it, plus ads. Some of them are read-only, but some of them are actual Q&A sites. Even the actual Q&A sites are worthless because no one is around to curate them. Occasionally a poor soul asks a question there, and no one answers and it’s lost in the mass.

I wonder if it would be better to migrate content on explicit request rather than wholesale. Perhaps let a user import all of the threads where they participated as an asker or answerer, or a subset. And maybe have a privilege that lets users import other threads (tag-based, maybe?), but that part wouldn’t be MVP.

Would new communities start empty (private beta) or start with some imported content? Starting with imported content is only viable if the community policies are close enough.

4 Likes

I’m only imagining us creating communities for which there’s a group of interested users – presumably, initially, that’ll be users on SE who want to go somewhere else. We shouldn’t be scraping everything from SE, only the communities that have users actively interested in coming here (and thus curating and building).

Separately we should talk about how brand-new communities start, but I don’t think that’s MVP. We’re initially reaching out to disgruntled SE communities, at least to get a few beta tests going.

2 Likes

How about importing only questions with accepted answers? That way nobody will expect that thousands of old questions should or will get answers here, and the seed content actually has some value to new users as a source of answered questions. We could even add some more restrictions, possibly dependent on the topic site, as we want to have “enough information to make the site useful” without “overloading the site with lots of relatively old, not so useful, stuff”. Maybe something like:

  • Only questions with an accepted answer
  • Set a minimum number of Votes required of Question and/or Answer - e.g., only transfer if there is a minimum vote of +3 on the Question or the Accepted Answer. Could even be a different value for Q vs. A.
  • Adjust the minimum based on age. New questions tend to be more useful than old questions because they deal with either newer topics (new technology, new programming languages, etc.) or more advanced topics. But old stuff, if popular, is likely to still be useful (I sometimes look for something and find an answer from 10 years ago with an Accepted Answer in the +hundreds range that still answers my question quite well). Something like < 1 month, >= +3; 1 to 6 months, >= +6; 6 months to 1 year, >= +10; > 1 year, >= +20 - or whatever - the actual specifics should be based on a bit of digging through the data on a per-site basis.
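
To make the three rules concrete, here is a minimal sketch. The numbers are just the example figures from the list, a month is approximated as 30 days, and the function names are invented for illustration; real values would come from digging through each site's data.

```python
from datetime import timedelta
from typing import Optional

# (upper age bound, minimum score) pairs from the example figures above.
# Assumption: 1 month = 30 days, 6 months = 182 days, 1 year = 365 days.
THRESHOLDS = [
    (timedelta(days=30), 3),
    (timedelta(days=182), 6),
    (timedelta(days=365), 10),
]
OLDEST_MIN_SCORE = 20  # anything a year old or more


def min_score_for_age(age: timedelta) -> int:
    """Return the minimum score required for a post of the given age."""
    for max_age, min_score in THRESHOLDS:
        if age < max_age:
            return min_score
    return OLDEST_MIN_SCORE


def should_import(age: timedelta, question_score: int,
                  accepted_answer_score: Optional[int]) -> bool:
    """accepted_answer_score is None when no answer was accepted."""
    if accepted_answer_score is None:
        return False  # rule 1: only questions with an accepted answer
    needed = min_score_for_age(age)
    # rule 2: the question OR the accepted answer must reach the minimum
    return question_score >= needed or accepted_answer_score >= needed
```

Keeping the thresholds in a plain list would make it easy to give each site its own table, or separate tables for questions and answers.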
4 Likes

On some sites, particularly the more subjective ones, askers are often reluctant to accept an answer. So it shouldn’t be just acceptance, but maybe we only import things with positive score? Do note one edge case: a question can be weak and downvoted and yet have outstanding answers; I don’t think we want to ignore those.

I’d rather import everything and, if necessary, change how we filter or render the stuff that might be weak, so that users who come over have all their stuff. Maybe it’s sufficient to say that user import triggers targeted data import, i.e. when you connect your Codidact user account to your SE one, we’ll bring over all your stuff (including questions you answered)?

2 Likes

Questions without an accepted answer can be very important. One of my favorite answers is on a question that never got an Accepted Answer because OP didn’t like any answer - most of the answers being very opposed to OP’s ideas. But it is an important Q&A for others to read if they have similar issues. So adding some additional parameters - e.g., if no Accepted Answer then Question score > ‘x’ or Answer score > ‘y’ - would make sense.
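
As a sketch, that fallback could look like this. The default values for `x` and `y` are placeholders I picked so the example runs, not proposed thresholds, and the function name is made up.

```python
from typing import Sequence


def worth_importing(question_score: int, answer_scores: Sequence[int],
                    has_accepted: bool, x: int = 5, y: int = 5) -> bool:
    """Fallback rule for questions that never got an accepted answer."""
    if has_accepted:
        return True
    # No accepted answer: require question score > x or some answer score > y.
    best_answer = max(answer_scores, default=None)
    return question_score > x or (best_answer is not None and best_answer > y)
```

A downvoted question with one outstanding answer still passes, for example `worth_importing(-2, [40, 1], False)`, which also covers the weak-question edge case raised earlier.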

My concern is that simply importing “everything” will get too much junk, though very much site dependent. On the other hand, importing only when an existing SE user arrives will mean we start with too little.

1 Like

Only doing user-based import is too narrow, and I didn’t mean to give the impression I was suggesting that. I meant that regardless of what we do generally, when a user is imported we should supplement by importing everything that user participated in. (I mean Q&A, not comments.)
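
Roughly, the supplemental import would just need the set of threads the user took part in. A minimal sketch, assuming each SE post is available as a dict with hypothetical `kind`, `owner_id` and `question_id` keys (for a question, `question_id` is its own ID):

```python
from typing import Iterable, Set


def threads_to_import(se_user_id: int, posts: Iterable[dict]) -> Set[int]:
    """IDs of every question the user asked or answered; comments don't count."""
    return {
        p["question_id"]
        for p in posts
        if p["owner_id"] == se_user_id and p["kind"] in ("question", "answer")
    }
```

Each returned ID stands for a whole thread, so the importer would then fetch the question together with all of its answers, not only the user's own posts.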

3 Likes

Continuing from User public profiles in MVP - #20 by rodolphito, I think we need a mechanism to import tags or tag groups as sites, for cases like SO which are too large imo. SO is roughly the same size as the rest of SE combined, and it feels like the equivalent of having a “Religion” site and just throwing all of them in there indiscriminately. It could benefit from splitting at some level like “Java ecosystem” site, “.Net CLR ecosystem”, “Databases”, “Web dev”, etc.

It gets tricky though. Even in SE, there are problems with ambiguity between sites - does a question about configuring Apache on Linux go in Superuser, Webmasters, Unix & Linux, etc.? Plus something like “Web dev.” could include any language that can be used as the back-end of a web site plus HTML/JavaScript/CSS/jQuery/etc. on the front-end. Import will need to be done; some level of that is definitely MVP. But it gets messy.

1 Like

I don’t think we should currently be in the business of reorganizing all the things. Let’s import willing communities as they are; if communities want to either merge or split up we can work with them on that, but our baseline assumption, I think, should be that SE Community X wants to become Codidact Community X and carry on from there. SO isn’t going to be among the first to come over, and smaller communities wouldn’t be helped by this sort of refactoring (and some would be actively harmed).

Ontologies are messy. We have to be ok with that.

4 Likes

Agreed 100%. Just gonna repost this here for reference, as it is MVP material IMO:
Have a tickbox, present when answering a question imported from SE, that automatically “crossposts” the answer to SE with link attribution to the answer on Codidact.

We could make all imported content community wiki and get rid of the OP control of the ‘accepted’ answer.

Advantage: We can start managing the answering (and the asking of questions as well) more as a community collaboration and add content to existing answers rather than posting a new answer for every new little idea.

Disadvantage: it would be nice to still have some sort of competitive scoring system (although it becomes a bit of a perverse incentive, it does greatly help contributors feel appreciated and motivated).


This is related to one of the ideas that I had for the statistics website. There is not much need for many more new questions (the majority of new questions are low score, even when corrected for age); what matters more is that the large existing body of questions and answers (including lots of noise) needs to be managed (duplicates, bad content, tagging, improving answers, finding bad answers, etc.).

So as an alternative to the mess at SE/SO I thought of making a wiki-style copy of the content from SE/SO. A website that takes all the content from SE/SO but just organizes it better, and becomes more attractive to a professional community as a reference and as a place to contribute.

A person who wrote something on SE and joins Codidact should still “own” that content here. Technically we don’t have to do that by the attribution rules, but I think it’s important for building engagement, and it’s kind of mean-spirited to say “yeah we know you wrote this, but we filed your name off it anyway”.

That’s for the subset of imported content where accounts here and there have been matched up.

Accepted answers, on the other hand, I think we should just drop on import. The opinion of one person who might not even be here isn’t very important. Let’s drop all answer acceptances on import, and if the author of the question wants to claim it and accept an answer here, that’s still possible. Meanwhile, default sorting for answers applies.
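
In code terms, the two rules above (keep ownership where accounts are matched, drop acceptance always) might look like this. It's only a sketch: the dict keys and the `account_map` lookup from SE user ID to local user ID are assumptions for illustration.

```python
from typing import Dict


def convert_post(se_post: dict, account_map: Dict[int, int]) -> dict:
    """Turn one SE post into a local post record."""
    return {
        "body": se_post["body"],
        # The local owner if the SE account has been matched, otherwise None.
        "owner": account_map.get(se_post["owner_id"]),
        # Always keep the SE user ID so attribution still works.
        "se_owner_id": se_post["owner_id"],
        # Acceptance is dropped on import, whatever it was on SE.
        "accepted": False,
    }
```

An author who later claims the question can accept an answer here through the normal flow, so nothing is lost by resetting the flag.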

5 Likes

One important thing to consider is the licensing question: Since the recent content license change on SE is at least legally questionable, there’s the problem of which license imported content should be under.

For content contributed after the license change, things are easy, as it was contributed under the new license and therefore has to be under that license. For content contributed before the license change, the legal situation appears muddier to me: On one hand, the original posters did provide it under the old license. On the other hand, SE only provides it under the new license (whether legally or not). So it might be illegal to use it under the old license (unless obtained from archives that predate the license change), and it might be illegal to use it under the new license.

Of course when the authors register with our site, they can explicitly allow us to use their content under the new license. But that’s not feasible for automatic import where the author may not be on the new site (or worse, where there is more than one author, as all of them would have to agree to a license change).

1 Like

I would not consider this part of an MVP.

It’s an important feature we need; not just ultimately, but high priority.
Or, to put it differently, using the MoSCoW method I’d consider this a “should have”… but not a “must have”.

Q&A seems like an incredibly simple (as opposed to complex) concept, but I can tell from my experience with starting to write a Q&A website that the devil is in the details. The first things I remember now off the top of my head:

  • registering, logins, password recovery
  • a common UI language
  • user friendly error handling for every page where a user can do stuff, in code & UI
  • before you can even start with any of this, you need a coding standard - in a world where even “tabs vs. spaces” leads to incredible flamewars

That’s already a bunch of work, and I haven’t even started with the actual domain - questions, answers, votes, flags, tags, roles & rights, moderation tools…

We have a ****load of work ahead of us, just creating what most would consider “the basics” in SE:

  • ask questions
  • answer a question
  • at least minimal moderation

And when we have those parts working, we can release it. “Release” as in make it public and put it to work. That’s what I would call an MVP. At that point, I don’t see any reason to wait until we’ve got e.g. this import functionality figured out. Anything else is functionality we can roll out later to make it better and better gradually.

Well, IMHO.

2 Likes

There was a data dump just a few days before the license change. It seems clear to this non-lawyer that that data dump was provided under 3.0 and we can freely take it. The status of content created after the license change is murky and we should seek guidance before pulling that in. (I’ve previously suggested auto-updating from SE and I do want to be able to do that, but I now understand that there might be license complications.)

Anything that a user explicitly brings over is fine, because the author is free to license it to us. (SE doesn’t hold an exclusive license.) We should therefore (not for MVP) make it easy for a user to say “I’m so-and-so on SE (here’s proof); bring all my stuff over and I license it to Codidact”, and we’d go collect everything we can. (I have further ideas about this for when we get there.)

3 Likes

Having thought about it more, I’m inclined to agree. I want data import early because part of the attraction will be “you can pick up where you left off”, but maybe for the earliest communities we do something more manual and selective.

1 Like

Revised proposal: for MVP, a method exists for a community to import data from its corresponding SE community in a license-compliant way. It is up to the community to decide whether and how to use this; a community might decide to start fresh, import selectively, or import everything and continue operations here. A manual, one-time operation meets the need.

License-compliant could mean importing only data that predates the latest questionable license change, or could mean applying different licenses to pieces of content based on last-write date on SE.
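
Both readings fit in one small function. This is a sketch under assumptions: the license strings are labels I'm assuming for the old and new terms, and the cutoff is a parameter because the exact date would need to be confirmed (ideally along with legal guidance).

```python
from datetime import date
from typing import Optional

OLD_LICENSE = "CC BY-SA 3.0"  # assumption: label for pre-change content
NEW_LICENSE = "CC BY-SA 4.0"  # assumption: label for post-change content


def license_for(last_write: date, license_change: date,
                pre_change_only: bool) -> Optional[str]:
    """Pick a license from the last-write date on SE; None means don't import."""
    if last_write < license_change:
        return OLD_LICENSE
    # Written or edited after the change: skip it, or tag it with the new license.
    return None if pre_change_only else NEW_LICENSE
```

A community that wants the cautious option calls it with `pre_change_only=True` and skips anything that comes back as `None`.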

3 Likes

Added to functional spec.