MVP: Data import

Problem: An empty site gains no traction. We need sites that have SE analogues to be able to launch with content.

Proposal 1 (definitely MVP):
We need a way to “seed” a site from an SE community in a license-compliant way. Posts on our site need to link to the source post on SE and to the SE user. This attribution must be recorded in such a way that a user who “comes over” to us can later re-license their posts to us and drop the attribution back to SE, though that “license migration” isn’t itself MVP.

Proposal 2 (probably MVP):
We need a way to pull updates from SE that do not conflict with activity on our site – new questions, new answers, posts that have been edited there but not here. Specifically excluded: deletions. That SE deleted something doesn’t mean we should automatically do so.
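To make the "does not conflict" rule concrete, here is a rough sketch of the per-post decision. All the names (`LocalPost`, `RemotePost`, `imported_rev`, `edited_locally`) are made up for illustration, and it assumes we track which SE revision we last imported and whether anyone has edited the post here. It also assumes we don't create a copy of something that is already deleted on SE when we first see it; that part is my guess, not something decided above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalPost:
    se_id: int            # id of the source post on SE
    imported_rev: int     # SE revision we last imported (hypothetical field)
    edited_locally: bool  # True once anyone has edited the post on our site


@dataclass
class RemotePost:
    se_id: int
    revision: int
    deleted: bool = False


def plan_update(local: Optional[LocalPost], remote: RemotePost) -> str:
    """Return 'create', 'update', or 'skip' for one post seen on SE."""
    if remote.deleted:
        # Deletions on SE are never mirrored automatically.
        # (Also not importing an already-deleted post is an assumption.)
        return "skip"
    if local is None:
        return "create"  # new question or answer we have not seen yet
    if local.edited_locally:
        return "skip"    # pulling the SE edit would conflict with our activity
    if remote.revision > local.imported_rev:
        return "update"  # edited there but not here
    return "skip"        # nothing new
```

A post that was edited on both sides is simply skipped here; whether it should instead be flagged for a human to look at is an open question.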

1 Like

I’m not sure about seeding. It’s a very powerful tool, but it can backfire very badly.

Having a ton of imported content, most of which doesn’t have an owner (because the owner didn’t move from SE, in addition to all the owners who had already left SE before), is only viable if there are enough people to do some curation and to answer the new questions. Curation of old content includes making moderate edits, retagging as needed, fixing broken links, and updating whatever else needs to be updated.

There are plenty of sites around that just grab SE content wholesale and display it, plus ads. Some of them are read-only, but some of them are actual Q&A sites. Even the actual Q&A sites are worthless because no one is around to curate them. Occasionally a poor soul asks a question there, and no one answers and it’s lost in the mass.

I wonder if it would be better to migrate content on explicit request rather than wholesale. Perhaps let a user import all of the threads where they participated as an asker or answerer, or a subset. And maybe have a privilege that lets users import other threads (tag-based, maybe?), but that part wouldn’t be MVP.

Would new communities start empty (private beta) or start with some imported content? Starting with imported content is only viable if the community policies are close enough.

2 Likes

I’m only imagining us creating communities for which there’s a group of interested users – presumably, initially, that’ll be users on SE who want to go somewhere else. We shouldn’t be scraping everything from SE, only the communities that have users actively interested in coming here (and thus curating and building).

Separately we should talk about how brand-new communities start, but I don’t think that’s MVP. We’re initially reaching out to disgruntled SE communities, at least to get a few beta tests going.

1 Like

How about importing only Questions with Accepted Answers? That way nobody will have the expectation that thousands of old questions should/will get answers here, and the seed content actually has some value to new users as a source of answered questions. We could even add some more restrictions, possibly dependent on the topic site, as we want to have “enough information to make the site useful” without “overloading the site with lots of relatively old, not so useful, stuff”. Maybe something like:

  • Only questions with an accepted answer
  • Set a minimum score for the Question and/or the Answer, e.g., only transfer if the Question or the Accepted Answer has a score of at least +3. The value could even be different for Q vs. A.
  • Adjust the minimum based on age. New questions tend to be more useful than old questions because they deal with either newer topics (new technology, new programming languages, etc.) or more advanced topics. But old stuff, if popular, is likely to still be useful (I sometimes look for something and find a 10-year-old question whose Accepted Answer has a score in the hundreds and still answers my question quite well). Something like: < 1 month, >= +3; 1 to 6 months, >= +6; 6 months to 1 year, >= +10; > 1 year, >= +20. Or whatever; the actual specifics should be based on a bit of digging through the data on a per-site basis.
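A quick sketch of how the three rules above could combine, using the example numbers from the last bullet. The function names are hypothetical, months are approximated as 30 days, and treating a question that is exactly one year old as "> 1 year" is an assumption; the thresholds themselves are placeholders until someone digs through real data.

```python
from datetime import timedelta

# (upper age bound, minimum score) tiers, taken from the example numbers above.
# Months are approximated in days; this is an assumption, not a spec.
TIERS = [
    (timedelta(days=30), 3),    # younger than 1 month
    (timedelta(days=182), 6),   # 1 to 6 months
    (timedelta(days=365), 10),  # 6 months to 1 year
]
OLDER_MIN_SCORE = 20            # 1 year or older


def min_score_for_age(age: timedelta) -> int:
    """Return the minimum score required for a question of the given age."""
    for upper_bound, min_score in TIERS:
        if age < upper_bound:
            return min_score
    return OLDER_MIN_SCORE


def should_import(age: timedelta, has_accepted: bool,
                  question_score: int, accepted_score: int) -> bool:
    """Import only accepted-answer threads whose Q or accepted A scores high enough."""
    if not has_accepted:
        return False
    needed = min_score_for_age(age)
    return question_score >= needed or accepted_score >= needed
```

For example, `should_import(timedelta(days=10), True, 3, 0)` passes, while a two-year-old question with scores of 19 and 5 does not. Separate thresholds for Q and A would just mean two tier tables instead of one.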

3 Likes

On some sites, particularly the more subjective ones, askers are often reluctant to accept an answer. So it shouldn’t be just acceptance, but maybe we only import things with positive score? Do note one edge case: a question can be weak and downvoted and yet have outstanding answers; I don’t think we want to ignore those.

I’d rather import everything and, if necessary, change how we filter or render the stuff that might be weak, so that users who come over have all their stuff. Maybe it’s sufficient to say that user import triggers targeted data import, i.e. when you connect your Codidact user account to your SE one, we’ll bring over all your stuff (including questions you answered)?

2 Likes

Questions without an accepted answer can be very important. One of my favorite answers is on a question that never got an Accepted Answer because OP didn’t like any of the answers, most of which were very opposed to OP’s ideas. But it is an important Q&A for others to read if they have similar issues. So adding some additional parameters would make sense, e.g., if there is no Accepted Answer, then require Question score > ‘x’ or Answer score > ‘y’.
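Roughly, the fallback rule could look like the sketch below. The function and parameter names are hypothetical, and the defaults of 0 for ‘x’ and ‘y’ are only placeholders (they amount to "positive score"); real values would have to be picked per site.

```python
from typing import Sequence


def worth_importing(question_score: int,
                    answer_scores: Sequence[int],
                    has_accepted: bool,
                    min_question_score: int = 0,  # 'x', placeholder value
                    min_answer_score: int = 0     # 'y', placeholder value
                    ) -> bool:
    """Keep a thread if it has an accepted answer, or if the question
    or at least one answer scores above its threshold."""
    if has_accepted:
        return True
    if question_score > min_question_score:
        return True
    # A weak or downvoted question still qualifies if any answer is good enough.
    return any(score > min_answer_score for score in answer_scores)
```

With these placeholders, a question at -3 with a single answer at +25 and no acceptance is still imported, which covers the downvoted-question-with-outstanding-answers edge case mentioned earlier in the thread.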

My concern is that simply importing “everything” will bring in too much junk, though how much is very site-dependent. On the other hand, importing only when an existing SE user arrives will mean we start with too little.

1 Like

Only doing user-based import is too narrow, and I didn’t mean to give the impression I was suggesting that. I meant that regardless of what we do generally, when a user is imported we should supplement by importing everything that user participated in. (I mean Q&A, not comments.)
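As a sketch of that supplement step: given whatever thread data we have from SE, pick every thread where the newly linked user asked or answered, and ignore threads where they only commented. The `Thread` shape and the function name are invented for illustration; the real data would come from wherever the importer reads SE content.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Thread:
    question_id: int
    asker_id: int
    answerer_ids: Set[int] = field(default_factory=set)
    commenter_ids: Set[int] = field(default_factory=set)  # deliberately ignored below


def threads_for_user(threads: Iterable[Thread], se_user_id: int) -> List[Thread]:
    """Return the threads where the user participated as asker or answerer."""
    return [
        t for t in threads
        if t.asker_id == se_user_id or se_user_id in t.answerer_ids
    ]
```

This would run in addition to whatever general import filter we settle on, so a thread that fails the score rules still comes over if the linked user asked or answered in it.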

3 Likes