How should we specify what to import (for MVP)?

Our spec for data import says:

For MVP, a method exists for a community to import data from its corresponding SE community in a license-compliant way. It is up to the community to decide whether and how to use this; a community might decide to start fresh, import selectively, or import everything and continue operations here. A manual, one-time operation meets the need.

This post is not about policy; that’s up to individual communities. What I’m asking about here is what we need to be able to support. For example, I can imagine the following use cases:

  • A community might import either nothing or everything; those are easy.

  • A community might import only posts by users with associated SE accounts. Presumably that means that if one of those posts is an answer, you bring its question along with it. Fate of other answers TBD.

  • A community might import only posts satisfying a query like “closed:no score:3+ locked:no” or “(some tag)”.

  • A community might import only specific posts, like maybe they’re mostly starting fresh but they want those four canonical posts the community worked hard on.

I’m not saying all these use cases are MVP, to be clear. The new Writing site imported everything and, as someone here predicted, might be regretting that. TopAnswers imports only specific posts, which would be cumbersome if you want a lot but the right approach if you just want a few things.

How should we approach data-import functionality? Is it sufficient to say that if you can write a SEDE-style query for it we can import it, and if you want that query to be “select *” you can?

(Data import will be from data dumps, not the live site through the API. So it might not actually execute a SQL query; I don’t know enough about how that works.)
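
But to make the question concrete, here’s the sort of selection I’m picturing for the “closed:no score:3+ locked:no” example, assuming the dump data ends up somewhere we can run SQL against. This is only a sketch: SQLite is a stand-in, the column names are guesses at the dump’s fields, and I don’t know where locked status is recorded.

    import sqlite3

    # Assumes the dump has already been staged somewhere queryable; SQLite is a stand-in.
    conn = sqlite3.connect("staging.db")

    # "closed:no score:3+" as a SEDE-style selection. "locked:no" would need
    # whichever field or table actually records locks, which I haven't checked.
    selection = """
        SELECT Id
        FROM posts
        WHERE ClosedDate IS NULL
          AND Score >= 3
    """
    post_ids = [row[0] for row in conn.execute(selection)]

    # The "(some tag)" case would instead filter on the Tags field, e.g.
    #   WHERE Tags LIKE '%<some-tag>%'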

2 Likes

The data dumps are SQL, so in all likelihood a nearly-raw query would be the simplest. Of course, that requires some handholding to make sure the query does what’s expected, and automatically sanitizing it to be safe and sane would probably take too much work for MVP. So the trivial solution is just to make it mostly manual: a discussion is posted on meta about what query to use; then, when there’s consensus, a developer goes in and runs it. No, it’s not technically efficient, but it’s good enough for the first few dozen imports at least.

2 Likes

The data dumps are XML with everything specified in attributes, last I checked. That XML is dumped from a SQL database and can be loaded into one. I just wasn’t sure how one “stages” that to run a SQL query to choose only some of the data to import, hence my hedging. I mean, do you have to load everything into a different SQL database just so you have a way to execute that SQL query? Is that practical?

If we can write a SQL query to choose what to pull from a data dump, I agree that’s easiest – let each community decide what that query should be.
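
If staging is what it takes, maybe the staging copy doesn’t need to be anything heavyweight; a throwaway SQLite file per community might be enough. Something like this is what I’m picturing (file and attribute names are from memory of the dump format, so treat it as a sketch, not a plan):

    import sqlite3
    import xml.etree.ElementTree as ET

    # Stage just the fields we care about from Posts.xml into a throwaway SQLite file.
    conn = sqlite3.connect("staging.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS posts (
            Id INTEGER, PostTypeId INTEGER, ParentId INTEGER,
            Score INTEGER, Tags TEXT, ClosedDate TEXT
        )
    """)

    COLUMNS = ["Id", "PostTypeId", "ParentId", "Score", "Tags", "ClosedDate"]
    INSERT = "INSERT INTO posts VALUES (?, ?, ?, ?, ?, ?)"

    # iterparse streams the file rather than loading the whole dump into memory.
    for _, elem in ET.iterparse("Posts.xml"):
        if elem.tag == "row":
            conn.execute(INSERT, [elem.get(c) for c in COLUMNS])
            elem.clear()
    conn.commit()

    # Each community's selection is then an ordinary SQL query against this table,
    # along the lines of the "closed:no score:3+" example above.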

2 Likes

I think I’m most interested in this. And I think I’d rather provide a list of URLs than figure out a SQL query to do it.
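
Something as small as this would cover the provide-a-list side. The URL shapes are from memory, so treat it as a hypothetical helper rather than a spec; note that an /a/ link identifies an answer, so the importer would still need to pull in the parent question, as noted above.

    import re

    # Hypothetical helper: turn a pasted list of SE links into numeric post IDs
    # the importer can select on. Covers the /q/123, /a/456 and /questions/123/... shapes.
    ID_PATTERN = re.compile(r"/(?:q|a|questions)/(\d+)")

    def post_ids_from_urls(urls):
        ids, unrecognized = [], []
        for url in urls:
            match = ID_PATTERN.search(url)
            if match:
                ids.append(int(match.group(1)))
            else:
                unrecognized.append(url)  # report these rather than silently dropping them
        return ids, unrecognized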

It is actually generally far more complicated than “SQL database begets XML dump begets SQL database”. That can work in very limited circumstances - essentially if one entity controls the two SQL databases. I have written such projects. But that is not what we have here.

  • The XML dump (I admit I haven’t looked at one from SE yet, but this is a near certainty) may mimic the overall structure of the original SQL, but it almost certainly does not include “everything”. It likely merges (SQL JOIN) data that is actually stored separately, makes some translations (whether for human consumption, back-end obfuscation, or something else), and makes other changes of numerous types beyond our control, largely in ways we don’t even know (and can’t know, short of seeing the actual SQL queries or discussing it with an SE DB administrator).
  • The import into our own SQL database will be a piece-by-piece (record-by-record and field-by-field) import based on our database structure. This structure is likely similar to SE’s actual DB structure but unlikely to be exactly the same - and it doesn’t matter because the XML in between will be different anyway.
  • There are other issues with primary keys (ID #s), duplication of data, and other things that mean any data import will not be “trivial”.
  • It may be practical to build a set of SQL database tables precisely matched to the XML format, so that a very simple process can be used to import from XML into those tables. That way any further manipulation is SQL-to-SQL and therefore a bit easier to manage. XML is a human-readable, wordy blob of data which, for a large data set, does not lend itself to in-memory manipulation. Once it is transferred into an initial set of SQL tables, manipulation for transfer into the “real” database is much easier, though likely still not trivial. (See the sketch just after this list.)
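
To put a little flesh on the last two bullets, the staging layer plus the ID bookkeeping might look something like this. Table and column names are illustrative only; in practice the column lists would be generated from the XML attributes, and the “real” schema is whatever we end up building.

    import sqlite3

    conn = sqlite3.connect("staging.db")

    # Staging tables mirroring the dump files (only a few columns shown here).
    conn.executescript("""
        CREATE TABLE se_posts (Id INTEGER, PostTypeId INTEGER, ParentId INTEGER,
                               OwnerUserId INTEGER, Score INTEGER, Title TEXT,
                               Body TEXT, Tags TEXT, ClosedDate TEXT);
        CREATE TABLE se_users (Id INTEGER, DisplayName TEXT, AccountId INTEGER);

        -- Old-ID-to-new-ID bookkeeping, so SE's primary keys never leak into the
        -- real database and a re-run doesn't create duplicate records.
        CREATE TABLE id_map (se_table TEXT, se_id INTEGER, new_id INTEGER,
                             PRIMARY KEY (se_table, se_id));
    """)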

I have done more than my share of data transfers and conversions, from other software packages into my own, between 2 different 3rd-party packages, and even (because I’m a nice guy and my customers pay me to do it) from my own software into 3rd-party replacements. This is all quite doable (e.g., as @ArtOfCode has already done for Writing) but it is not as easy as it seems in the movies.

4 Likes

I think all of a question’s answers should be imported if the question is imported (unless they’re explicitly filtered out by a negative-score cutoff or some other filter). Otherwise, some people could use this to artificially boost their answers by making it appear their mediocre answer was the only one on the post and getting extra votes by showing up earlier than other answers.

3 Likes

Right, we can’t take their DB dump and somehow just load it directly into our DB. We’re not using the same table layout. We have to do some transformation work.

I worked with the SE data dump several years ago. I took a few sites’ worth of data and loaded them into a new DB that I created (Vertica, not Postgres). To do that, I had to transform the XML to JSON and then create my own tables and load the JSON into them. My tables mimicked the XML structure because I was doing analytics, not setting up a live site for millions of people to pound on, so efficiency wasn’t my top concern.
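
For flavor, the transform was roughly this shape, with each row’s attributes becoming one JSON object per line; I’m reconstructing this from memory rather than digging up the old scripts.

    import json
    import xml.etree.ElementTree as ET

    # Each <row> element's attributes become one JSON object per output line.
    with open("Posts.json", "w") as out:
        for _, elem in ET.iterparse("Posts.xml"):
            if elem.tag == "row":
                out.write(json.dumps(dict(elem.attrib)) + "\n")
                elem.clear()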

Any path from SE to Codidact is going to have to pass through some sort of transformation layer, where we apply our structure and constraints. I take that as given.

My question here is focused more on how we’re going to pick out the stuff we want. SQL query against the SE data (which we then transform)? Specific URLs/postIDs as suggested in another comment? Something else?

2 Likes

The way I did this for Writing was to write a script that provides options for specifying what to import - all, by user, by tag, etc. If you’re working off the data dump, that provides all the necessary information in the XML for reconstructing posts and users; if you’re working off the API, you need to issue multiple calls for each. Importing off the dump is significantly quicker.
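
To give an idea of the shape, the option surface looks roughly like this; it’s a stripped-down illustration, not the actual Writing importer.

    import argparse

    # Stripped-down illustration of the option surface; not the actual import script.
    parser = argparse.ArgumentParser(description="Import posts from an SE data dump")
    parser.add_argument("dump_dir", help="directory containing Posts.xml, Users.xml, ...")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--all", action="store_true", help="import every post")
    group.add_argument("--user", type=int, metavar="SE_USER_ID",
                       help="import posts by this user, plus their parent questions")
    group.add_argument("--tag", help="import questions with this tag, plus their answers")
    args = parser.parse_args()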

For MVP at least, this should be a manual process - creating a system that lets it be specified through the UI has complexities that make it more of a time-sink than we can afford right now. Consensus can be reached, per-community, on meta, and a developer can come in and run the right script.

3 Likes

Agreed on manual process; we do not need a nice UI for this any time soon, if ever. I just wanted to clarify the range of what can be included in that query/script/config/whatever, to make sure we have a handle on use cases other than “get everything”. Sounds like you’re way ahead of me and I don’t need to worry about this. :slight_smile:

2 Likes