MVP: Account database schema suggestion

Here’s a suggestion for how our accounts database might look. Note that I’m not a database expert, so feel free to take it apart without mercy.

Note that this does not contain profile data (other than the nickname); I think the profile data (including an optional publicly shown email, which may be different from the account recovery email) is better stored together with the site contents.

I separated account and identity so that we have the option to implement several identities per account later, if we decide to do so (which I personally hope we will). In the MVP, there would be one identity entry per account entry.

Note that the account table would only be accessed at login, during password recovery, or when sending account-related mails. Everything else could be done purely on the identity table. Therefore I don’t think the separate tables are a performance disadvantage, even if we don’t end up supporting multiple identities.

One thing that might be a bad choice is that I made separate tables for each login method. My reasoning is that I suspect it’s easier to add a table than to add fields to the accounts table. The disadvantage is that the link between the authentication type and the corresponding authentication data table is not in the database, but only in the software. Maybe a database expert can tell me whether this is a good or bad idea.

A note on the “private” vs. “public” labels in the comments:

Any table labelled “private” contains information that is only ever given to the account owner and, where required, site operators/moderators. Any table labelled “public” contains information publicly displayed on the site (like the nickname). I tried to avoid mixing public and private data in the same table; however, the account field in the identity table is actually private data. Making a separate table just for the account/identity link seemed like overkill to me; maybe my judgement here was wrong.

Anyway, here is my suggestion:

-- table of accounts: personally identifiable, private
CREATE TABLE accounts (
  id                SERIAL PRIMARY KEY NOT NULL,  -- account Id, used only internally, never appears in the interface
  email             TEXT NOT NULL,                -- email used for things like password resets
  status            INT NOT NULL                  -- active, suspended, deleted, ...
);

-- table of identities: personally identifiable, public
CREATE TABLE identities (
  id                SERIAL PRIMARY KEY NOT NULL,  -- identity Id, used by the site
  account           INT NOT NULL,                 -- the account this identity belongs to
  nickname          TEXT NOT NULL                 -- the name associated with this identity
);

-- table of authentication methods: personally identifiable, private 
CREATE TABLE authenticationMethods (
  accountId INT,               -- the account the identification depends on
  authenticationType INT,      -- e.g. 1 = password login, 2 = StackExchange
  authenticationDataId INT     -- the entry in the corresponding table
);

-- data needed for password authentication (except for email address, as that is already in accounts):
-- not personally identifiable, private
CREATE TABLE passwordAuth (
  id            SERIAL PRIMARY KEY NOT NULL,
  salt          TEXT,
  passwordHash  TEXT
);

-- data needed for login with SE: personally identifiable, private
CREATE TABLE stackAuth (
  id             SERIAL PRIMARY KEY NOT NULL,
  stackAuthData  TEXT                         -- replace by whatever data SE authentication requires
);
 
-- session data: personally identifiable, private
CREATE TABLE sessions (
  cookieData     TEXT PRIMARY KEY NOT NULL,  -- the contents of the session cookie
  identity       INT,                        -- the identity of the session
  expires        TIMESTAMP                   -- the expiry date of the session
);
2 Likes

Some observations:

  • Primary keys should be integers; in particular, they should not be large text-based fields.
  • Minor one: since this is .NET Core, DB names should be PascalCased rather than camelCased.
  • You don’t need authenticationMethods. The User model can simply have two associations, one to passwordAuth and one to stackAuth.
  • You don’t need stackAuth, either. Just store SEAcctId on the accounts table. The authentication process will ask the SE API for the current user’s network ID, and we match that against the accounts table to log them in (see the sketch after this list).
  • Generally, prefer shorter fields over longer fields: nickname could be VARCHAR(255) rather than TEXT. Likewise email. cookieData may be longer than 255 characters, but could still be SMALLTEXT rather than TEXT.
  • Prefer declarative attributes over magic enums: status should be a set of fields, e.g. SuspensionEndDate, IsDeleted. The only acceptable enums are foreign keys.
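
Putting several of these suggestions together, here’s a rough sketch of what the accounts table might look like (the column names, in particular SEAcctId, SuspensionEndDate and IsDeleted, are just placeholders, not a final design):

-- accounts with the suggestions above applied (sketch only)
CREATE TABLE Accounts (
  Id                 SERIAL PRIMARY KEY,      -- integer primary key
  Email              VARCHAR(255) NOT NULL,   -- account recovery email
  SEAcctId           INT NULL,                -- SE network account id, set once the user authenticates via SE
  SuspensionEndDate  TIMESTAMP NULL,          -- NULL = not currently suspended
  IsDeleted          BOOLEAN NOT NULL DEFAULT FALSE
);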

However, this is a good base design for us to iterate on. Authentication data should be in a separate table from user data, so that’s good. Whether or not we have multiple identities per account… well, we don’t know yet, but it’s worth figuring out how we’d design it if we do. Storing session data in the database is also good.

2 Likes

Thank you for the feedback.

I’d like to comment on some of the points.

Regarding integer primary keys: I didn’t know that. The reason I made cookieData the primary key is that it will typically be what is searched for, and it has to be unique anyway, so I thought making it the primary key would speed up that lookup. I also think that the content of a login cookie should not be that large anyway.
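
If integer primary keys are the way to go, a UNIQUE constraint on the cookie column should still give the database an index for that lookup. A possible sketch (purely illustrative):

-- sessions with an integer primary key; the UNIQUE constraint keeps lookup by cookie fast
CREATE TABLE sessions (
  id          SERIAL PRIMARY KEY,
  cookieData  TEXT NOT NULL UNIQUE,   -- the contents of the session cookie
  identity    INT,                    -- the identity of the session
  expires     TIMESTAMP               -- the expiry date of the session
);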

Regarding the authenticationMethods table: the reason for that table is that we might want to add more authentication methods later, and I’m not sure how expensive it is to add a column to an existing table (if I understand correctly, we’d need a new column for each authentication method).

Indeed, I guess one day we’ll want to allow authentication with every OpenID and/or OAuth provider. That would mean a potentially unlimited number of “authentication methods”, all referring to the same table (because the same user may have accounts at many different servers). I don’t think an unlimited number of columns is possible, is it?

Regarding stackAuth: OK, I had no idea what data you need for authentication with SE. Since we’ll have to treat the SE account specially for data import anyway, I assume that putting the information directly into the accounts table makes sense.

Thank you.

That’s not quite what I meant there - while we don’t need the authenticationMethods table, we can keep passwordAuth (and stackAuth, if there weren’t an easier way to do that), and in the future we could have OpenIdAuth too. Each of those tables references the accounts table, so we can define methods on the User object to grab the auth details for any of those methods - think of User.GetById(324767).GetAuthenticationDetails(AuthenticationDetails.OpenId) returning results from the OpenIdAuth table.

OK, I think now I get what you mean. You mean, adding a column to each authentication method table which contains the account ID that this authentication data is for, right?

I agree, that’s probably a better way to do it.

In the password table, there should of course be only one entry per account.
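
To make sure we’re talking about the same thing, a sketch of passwordAuth under that approach (the accountId column and the UNIQUE constraint are just my illustration of the idea):

-- password authentication data, linked directly to the account (sketch only)
CREATE TABLE passwordAuth (
  id            SERIAL PRIMARY KEY,
  accountId     INT NOT NULL UNIQUE REFERENCES accounts(id),  -- one password entry per account
  salt          TEXT NOT NULL,
  passwordHash  TEXT NOT NULL
);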

Pretty much everything @ArtOfCode said. All of it (except for PascalCase) is standard modern database practice. In particular:

  • Every table should have an auto-incrementing integer primary key (SERIAL)
  • String fields need to be sized correctly based on the anticipated data - that means TEXT when you actually need it, but not when you don’t. (The big problem is that TEXT fields typically store a pointer in the record, so retrieving the data triggers an automatic, behind-the-scenes read of another location to get the string.)

In addition, structures need to be considered carefully.

I plan to do a fairly thorough analysis of the planned data and generate a proposed schema, but I will likely not get to that until at least Thursday, as I have too much “real work” that I am behind on. I’ve built a lot of databases over many years, so while I am sure I won’t get everything right (and a lot will change as we work through the details), I have some pretty good ideas on how to structure this for efficiency and expandability.

1 Like

I would add a

VerifiedEMail tinyint(1) NOT NULL DEFAULT '0'

to the accounts table, and set the value to 1 once the user has verified the email as theirs.

1 Like

If instead of a boolean we record a date, then we’ll be able to answer questions like “have I verified my email address since I changed it?”. I don’t know if that’s an important consideration, but it seems that in general, recording events with timestamps/dates is more expressive than just using booleans.
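
For example (these column names are purely illustrative):

-- timestamps instead of a boolean (sketch only)
EmailVerifiedAt  TIMESTAMP NULL,  -- NULL = never verified; otherwise the time of the last verification
EmailChangedAt   TIMESTAMP NULL   -- NULL = never changed; otherwise the time of the last change

“Have I verified my email since I changed it?” then becomes a simple comparison of the two timestamps.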

1 Like

This is where databases get complex. Which is OK - I deal with this stuff all the time. Timestamps are important, but having the status as a separate field too can be well worth the extra byte.

1 Like

Ah. This is where my programming mindset brings the wrong instincts for databases – I was thinking that checking for the presence of the value is sufficient to answer the yes/no question, so it wouldn’t add any burden beyond the extra storage of a timestamp over a byte, but that’s probably not how the query really works, is it?

Yes, checking for the presence of the value is sufficient, and I do that a lot of the time. But using the email record as an example, having that boolean is helpful for a bunch of reasons:

  • Less worry about “what is empty” for a date (0000-00-00 vs. Null/None) - booleans and tinyints are simpler.
  • If there are multiple email addresses (for account recovery, for multiple notifications, or for keeping a history of email changes - e.g., “original was [email protected] but changed to [email protected]”), then having flag fields - “active”, “newsletter”, “recovery”, etc. - makes a lot of sense (rather than having a separate table for each of these types of email address), and having “verified” as one of those boolean/tinyint fields keeps the code cleaner than “for this you need to check whether the date is valid, for that you need to check a boolean field”. (A sketch of such a table follows below.)
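
For illustration, a sketch of such an emails table (the table and column names are mine, purely illustrative):

-- one row per email address, with flag fields instead of separate tables (sketch only)
CREATE TABLE accountEmails (
  id          SERIAL PRIMARY KEY,
  accountId   INT NOT NULL REFERENCES accounts(id),
  address     TEXT NOT NULL,
  isActive    BOOLEAN NOT NULL DEFAULT TRUE,    -- currently in use
  isRecovery  BOOLEAN NOT NULL DEFAULT FALSE,   -- used for account recovery
  newsletter  BOOLEAN NOT NULL DEFAULT FALSE,   -- receives notifications/newsletters
  isVerified  BOOLEAN NOT NULL DEFAULT FALSE    -- ownership has been confirmed
);
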
1 Like

Forgive me if this is a dumb question or if I’m missing something, but do we really need to account for auth with SE if our ultimate goal is to

“develop a community-driven alternative to replace Stack Exchange with something better”

Not a dumb question. A lot of our users will initially be coming from SE, so making that move easy for them seems prudent.

4 Likes

Yes.

One of the planned tasks is to pull content from SE into Codidact based on various criteria TBD. There are a number of advantages to having that content tied to the original user and/or pulling additional content created by those users (e.g., we might pull “positive questions/answers from the past year” initially and then add “everything asked or answered by a user, including old content or answers to negative score questions” when a user authenticates). So being able to somehow authenticate a Codidact account to an SE account has some real value.

4 Likes

Stack Exchange authentication is particularly important for relicensing content imported from Stack Exchange. Relicensing without the consent of the original author would violate copyright. Without authentication, anyone could come to our site and claim “I wrote that, please relicense it” even if they are not the author, which could cause legal trouble. But if we authenticate with SE, we can be reasonably sure that the user is indeed identical to the SE user, and then we can use the SE data to verify that this person really wrote the original SE post.

3 Likes

For passwordAuth, the salt and the hash proper should be stored in the same string, together with the other parameters. See e.g. Argon2, Unix crypt output.
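
For illustration (assuming an Argon2-style encoded string; the exact format depends on the library we end up using), passwordAuth could then shrink to a single hash column:

-- salt, parameters and hash stored in one encoded string (sketch only)
CREATE TABLE passwordAuth (
  id            SERIAL PRIMARY KEY,
  accountId     INT NOT NULL UNIQUE REFERENCES accounts(id),
  passwordHash  TEXT NOT NULL  -- e.g. '$argon2id$v=19$m=65536,t=3,p=1$<salt>$<hash>'
);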

2 Likes

String fields need to be sized correctly based on the anticipated data - that means TEXT when you actually need it, but not when you don’t. (The big problem is that TEXT fields typically store a pointer in the record, so retrieving the data triggers an automatic, behind-the-scenes read of another location to get the string.)

This isn’t true on Postgres; you should use TEXT for almost everything and limit length where necessary with CHECK constraints: https://www.postgresql.org/docs/11/datatype-character.html.
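
For example (the limit of 255 is just illustrative):

-- TEXT with a CHECK constraint instead of VARCHAR(255)
nickname  TEXT NOT NULL CHECK (char_length(nickname) <= 255)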

cc @ArtOfCode

2 Likes

Thanks for the information and the link. I have heard similar but vague statements before, though without documentation, and in my specific past usage it was irrelevant because my main PostgreSQL project uses an ORM to handle the details.

Reading the linked page, it looks to me like:

  • CHAR - slowest and not recommended unless fixed-length strings are actually needed (which is extremely rare in my experience, except for single-character strings)
  • VARCHAR - Arbitrary length up to a field-specific limit
  • TEXT - Arbitrary length up to system limit (1 GB)

and that VARCHAR and TEXT get treated exactly the same inside the database (except for maximum length):

  • <= 126 bytes - 1 byte of overhead (presumably the length plus 1 bit to tag it as a short string) + the actual string
  • > 126 bytes - 4 bytes of overhead (the limit is 1 GB, so that leaves 2 bits for flags - if I designed it, that would be 1 for “long string” + 1 for compressed/uncompressed) + the actual string, possibly compressed

“Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values” matches my mention of an extra disk access for TEXT fields - the difference being that, unlike MySQL (CHAR/VARCHAR always in the main table, TEXT never in the main table), PostgreSQL apparently handles this more intelligently on the fly, based on the actual size of the data.

In addition to making VARCHAR and TEXT effectively the same (except for the maximum length), this means there are no small/medium/large TEXT variants like in MySQL (where the difference is the number of bytes used to store the length).

End result: It seems to me that there is little if any functional difference in PostgreSQL between VARCHAR and TEXT + length constraint - both are stored in something more like MySQL’s TEXT than its VARCHAR. So whether to use VARCHAR or TEXT for fields with a known, relatively short maximum length is more a stylistic choice than a performance one.

2 Likes

End result: It seems to me that there is little if any functional difference in PostgreSQL between VARCHAR and TEXT + length constraint - both are stored in something more like MySQL’s TEXT than its VARCHAR. So whether to use VARCHAR or TEXT for fields with a known, relatively short maximum length is more a stylistic choice than a performance one.

Yes, that’s basically it - good write-up. This answer on SO provides a bit more info and a rationale for TEXT + CHECK over VARCHAR(n): https://stackoverflow.com/a/4849030

4 Likes

Coming from a security standpoint, I would very, very strongly recommend that we not store passwords, even if hashed and salted appropriately. This is because the meaning of “appropriately” has changed multiple times over the last two decades, and it is going to change multiple times over the next decade, too. There are entire companies that make their living just doing this, which is a pretty good indicator that we shouldn’t do it ourselves. In particular, Auth0 is IMO one of the best, and they even have an Open-Source program where we get the benefit of their entire product for free (with attribution). Do we accept having to dedicate dev resources long-term to security issues, or use a free off-the-shelf product? To me, that’s a no-brainer.

If we absolutely must have our own identity management for some reason, then we should use an established product like Identity Server.

Just my $0.02. 🙂

7 Likes