History/Audit DB

This has been kicked around before in a couple of places, but I would like to formally propose it here.

The basic idea is that we create a separate DB for history and audit purposes. The name of the DB will be the same as the normal one, just with _chg appended to the name.

The history DB will have all the same tables as the normal one, with _chg appended to the end. Each table in the history DB will have 4 extra columns.

  • Date of record.
  • What activity was being recorded: INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE
  • DB user
  • UsersId

All of the other columns will have the exact same data types as in the normal table but will default to null.

What happens is that when an INSERT, UPDATE, or DELETE operation happens in the normal DB, a trigger is fired that backs up that data to the history DB.

  • INSERT: All data that was just inserted gets written.
  • DELETE: All data that is about to be deleted gets written.
  • UPDATE: Only the data that changed gets written; if a table has columns A, B, C and only B changes, then only B gets written. The data goes into two rows, one recording the data before the update and one the data after.

It’s possible to generate the triggers from the schema, so we shouldn’t have to write them out by hand.
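As a rough sketch of what generating those triggers could look like - using SQLite for brevity, with made-up table and column names (Users, Users_chg, DisplayName, Bio) - something along these lines:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Users (Id INTEGER PRIMARY KEY, DisplayName TEXT, Bio TEXT);
CREATE TABLE Users_chg (
    HistoryId INTEGER PRIMARY KEY,
    ChangeDate TEXT, Activity TEXT, DbUser TEXT, UsersId INTEGER,
    Id INTEGER, DisplayName TEXT, Bio TEXT
);
""")

def make_insert_trigger(con, table):
    # Read the column list from the schema metadata so the trigger
    # does not have to be written out by hand for every table.
    cols = [row[1] for row in con.execute(f"PRAGMA table_info({table})")]
    col_list = ", ".join(cols)
    new_vals = ", ".join("NEW." + c for c in cols)
    con.execute(f"""
        CREATE TRIGGER {table}_ins AFTER INSERT ON {table}
        BEGIN
            INSERT INTO {table}_chg (ChangeDate, Activity, {col_list})
            VALUES (datetime('now'), 'INSERT', {new_vals});
        END
    """)

make_insert_trigger(con, "Users")
con.execute("INSERT INTO Users (DisplayName, Bio) VALUES ('alice', 'hi')")
row = con.execute("SELECT Activity, DisplayName FROM Users_chg").fetchone()
print(row)  # ('INSERT', 'alice')
```

A real generator would loop over every table in the schema and emit the UPDATE and DELETE triggers the same way; the point is just that the column lists come from the schema metadata, not from hand-written SQL.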

There is a performance hit to doing this, but if a query ever goes wrong, this makes it possible to fix just the mistakes that were made.

Default to null should not be needed. The data will be the same, whatever it is.

I am not so sure it should be a separate DB. The problem with a separate DB is that there are normal functions - e.g., viewing Edit History of a Q or A - that will require access to these tables. With the names different (_chg at the end or History at the beginning or whatever - arbitrary and doesn’t matter as long as it is consistent), there is no reason the tables can’t be in the same DB.

As noted before with other things, this would be Date/Time.

Also one other change: The “Id” will be different. There are at least two ways to do this:

  • Original Id -> OriginalId, new Id automatic

Advantage: Automatic Increment field is always Id.

  • New “ChangeId” automatic field, Original Id -> Id (i.e., same as all other fields that are copied)

Advantage: References between Original Table and History Table will have Id refer to the same thing (Autoincrement Id of the Original Table).
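For illustration, the second option might look like this (SQLite sketch; Posts and Title are placeholder names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Posts (Id INTEGER PRIMARY KEY, Title TEXT);

-- Option 2: the history table gets its own automatic ChangeId,
-- and the copied Id keeps the same meaning as Posts.Id.
CREATE TABLE Posts_chg (
    ChangeId INTEGER PRIMARY KEY,
    ChangeDate TEXT, Activity TEXT,
    Id INTEGER,   -- refers to the autoincrement Id of the original Posts row
    Title TEXT
);
""")
con.execute("INSERT INTO Posts (Title) VALUES ('hello')")
con.execute("""INSERT INTO Posts_chg (ChangeDate, Activity, Id, Title)
               SELECT datetime('now'), 'INSERT', Id, Title FROM Posts""")
row = con.execute("SELECT ChangeId, Id FROM Posts_chg").fetchone()
print(row)  # (1, 1)
```

ChangeId is the history table’s own autoincrement key, while the copied Id still means “the Id of the original Posts row”.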

My concern from immediate-glance is that if some sort of concurrent change occurs, then the audit log, as written out, might be even more inconsistent (especially since updates turn into two rows); this would be especially bad since different DB implementations might handle consistency around triggers differently. I’m guessing that it would take some specific tweaking in the database layer to make it work for each DB backend.

Of course, if there’s a human in the loop, they can probably disentangle the audit log based on context.

If the DisplayName of a user gets changed, then I don’t see the point of storing their Bio again. I have seen systems where an update recorded all the values of the row, and in addition to taking up way more space, it made it difficult to see what changed. Setting the columns to default to null means that if the update doesn’t change the value of a column, then that value is left as null.
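A hedged sketch of what such an update trigger could look like in SQLite (table and column names are made up; SQLite’s IS is used as the null-safe comparison, so unchanged columns are stored as null):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Users (Id INTEGER PRIMARY KEY, DisplayName TEXT, Bio TEXT);
CREATE TABLE Users_chg (
    HistoryId INTEGER PRIMARY KEY, ChangeDate TEXT, Activity TEXT,
    Id INTEGER, DisplayName TEXT, Bio TEXT
);
CREATE TRIGGER Users_upd AFTER UPDATE ON Users
BEGIN
  -- One row for the data before the update: only changed columns are kept.
  INSERT INTO Users_chg (ChangeDate, Activity, Id, DisplayName, Bio)
  VALUES (datetime('now'), 'UPDATE_BEFORE', OLD.Id,
          CASE WHEN OLD.DisplayName IS NEW.DisplayName THEN NULL ELSE OLD.DisplayName END,
          CASE WHEN OLD.Bio IS NEW.Bio THEN NULL ELSE OLD.Bio END);
  -- And one row for the data after the update.
  INSERT INTO Users_chg (ChangeDate, Activity, Id, DisplayName, Bio)
  VALUES (datetime('now'), 'UPDATE_AFTER', NEW.Id,
          CASE WHEN OLD.DisplayName IS NEW.DisplayName THEN NULL ELSE NEW.DisplayName END,
          CASE WHEN OLD.Bio IS NEW.Bio THEN NULL ELSE NEW.Bio END);
END;
""")
con.execute("INSERT INTO Users (DisplayName, Bio) VALUES ('alice', 'long bio')")
con.execute("UPDATE Users SET DisplayName = 'alicia' WHERE Id = 1")
rows = con.execute(
    "SELECT Activity, DisplayName, Bio FROM Users_chg ORDER BY HistoryId").fetchall()
print(rows)  # [('UPDATE_BEFORE', 'alice', None), ('UPDATE_AFTER', 'alicia', None)]
```

Only DisplayName changed, so Bio comes through as null in both history rows.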

A separate DB is mostly for ease of organization: most development isn’t going to touch the change tables, so keeping them elsewhere gets them out of the way when you are looking at the tables in the schema.

I would have the change table have its own primary key and call it something else - HistoryId or something.

The 4 extra columns plus the original primary key need to be indexed.

Yes, it takes more storage. Storage is cheap. As far as seeing what changed - that isn’t easy if you are browsing rows of a table. But it is trivial for software to highlight the differences for you - much as (and more easily than) a diff program shows differences between two blobs of text.

But the flip side is that:

  • If you have fields that actually can be null, then how do you tell “xyz” -> null (changed to null) from “xyz” -> “xyz” (unchanged)? Both end up stored as null.
  • If you want to recover the complete contents of a record at a given point in time (e.g., a high-ranking user (no approvals needed for edits) deliberately messes up a bunch of posts - shouldn’t happen, but I’ll bet it does…), then reverting is more complex than if all data is in every history record.

Any change to structure (new field in a primary table) is going to require the same change in the history tables.

I don’t know if you really need to index all 4 extra columns plus the primary key. Really just the original primary key plus the history table’s own primary key.

Storage may be cheap, but writing data to disk takes a bite out of your performance. If data doesn’t need to be written, why write it?

In the UPDATE_BEFORE row it will be null, and in the UPDATE_AFTER row it will have a value. The if statement does need to be null-safe.
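For what it’s worth, this is why the comparison has to be null-safe - a plain <> silently yields null when one side is null. A small SQLite illustration (MySQL’s null-safe equality operator is <=>):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A plain <> comparison is not null-safe: NULL <> 'x' evaluates to NULL,
# which is treated as false in a WHERE clause or trigger condition.
plain = con.execute("SELECT NULL <> 'x'").fetchone()[0]
# SQLite's IS NOT treats NULL as an ordinary comparable value.
safe = con.execute("SELECT NULL IS NOT 'x'").fetchone()[0]
print(plain, safe)  # None 1
```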

All we need to fix is whatever fields were damaged; the others can be ignored. If they defaced the title and left the body alone, then the title needs to be reverted and the body can be ignored.

100% correct. If the changes aren’t done, the system breaks (ask me how I know). This will be true regardless of whether the tables are in a separate DB or not.

There are going to be lots of rows in the history table; indexing whether the operation was an INSERT/UPDATE/DELETE, who did it, and when it happened makes rolling back changes much faster.
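As an illustration of the kind of indexes that would support that (SQLite sketch, made-up names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Posts_chg (
    HistoryId INTEGER PRIMARY KEY, ChangeDate TEXT, Activity TEXT,
    UsersId INTEGER, Id INTEGER, Title TEXT
);
-- One index to find all history for a given original row, and one to
-- find everything a given user did in a given time window.
CREATE INDEX idx_posts_chg_id ON Posts_chg (Id);
CREATE INDEX idx_posts_chg_who_when ON Posts_chg (UsersId, ChangeDate, Activity);
""")
# The "roll back everything user 7 did" query can then use the index.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Posts_chg WHERE UsersId = 7").fetchall()
print(plan)
```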

I’m probably missing something fundamental here, but it sounds like you’re describing a backup mechanism? Why build a second DB and code to handle it when B/R is standard database functionality?

It’s not for backup and restore; what it is for is tracking changes. It can be used for recovery - I have used this system to undo deletes - but that’s not really backing up.

For instance, if someone changed their display name, then it would record the old one. If they change it multiple times those are all recorded.

Broadly, this makes it possible to undo a single update or delete of any single record instead of having to pull up a backup that may not be up to date.


Oh, I thought edit history would be in the main DB, because it will often need to be available (like when showing post edit histories). This would also allow mods to roll back a problematic change of display name instead of hard-resetting it.


Here is why I would like the backup tables in a separate DB with a picture. See how the tables are listed on the left side?

If you put the history tables in the same place, you double the list of tables to scroll through and most of the time when you are working on it you only care about the actual tables.

It’s not the biggest showstopper in the world, but putting the history tables in their own DB has made my life a lot easier in the past on some of my other projects.

It seems odd to change high-level DB structure to accommodate an IDE. It makes me wonder why the tool can’t do better, or whether it can be used differently.

For example, if history tables had a naming convention that sorted them to the end of the list, would that solve the problem? If history tables were in a different schema (no idea if that’s a terrible idea!), would that solve the problem?


It looks like postgres does things differently from MySQL (which is what I am most familiar with), but from a cursory reading, it does look like a separate schema would solve my concern.

I’m still not convinced. Another possible issue: Foreign Keys. If they are in the same DB then:

  • Users has FK CommunitiesId
  • Users_Changes has FK CommunitiesId (i.e., the exact same) and FK UsersId

with 2 separate Schemas/DBs (however that is defined - not my usual thing, which may be part of my reluctance):

  • Users has FK CommunitiesId
  • Users_Changes has FK CommunitiesId and FK UsersId to tables in a different schema/DB.

Alternatively, Users_Changes has FK Communities_ChangesId but that would require an extra lookup to figure out the “right” Communities_Changes to link to.

As far as @cellio’s naming convention, there are 2 ways to go:

Users -> Users_Change which puts each primary & history table together as a pair

or

Users -> Change_Users which puts all the primary in one batch and all the history in another batch

I do not agree with using a separate database for auditing. I’ve been handed systems like this before and it wasn’t pleasant.

Triggers are evil. They’re awesome, too, sometimes. Here’s why I wouldn’t recommend triggers:

Performance
Triggers firing each time records are inserted or updated will lead to system performance issues, especially on heavy load. There’s a cost associated with this, which shouldn’t be the basis of an entire system’s auditing.

Maintainability
A trigger here, a trigger there, a trigger that triggers another trigger - oh my. Triggers can get confusing and get out of control. Over time they’re tough to maintain when used liberally.

As far as auditing changes goes…this is where domain mapping is crucial. What tables do we truly care about for history? Do I care if someone changed their Github link a million times? Maybe not. Do I care if someone edited a post? Definitely.

If we can draw a domain around what we’re building, we can nicely tailor our platform and design choices with a narrow scope. User Profile domains might not need anything special - have a few endpoints to PUT updates and call it a day. Posts / Answers domains however may need something like an event sourcing style system where the data is immutable. Storage is cheap these days. What you will potentially encounter is dealing with performance - which can be solved when it’s needed.

I’m not saying Event Sourcing is the best decision here, but I’d recommend looking at the pattern or other auditing patterns.


100% agree - based on experience with my own databases and others - with:

  • Same database for auditing
  • Triggers are evil - conceptually awesome, but not always so practical in the real world

As far as “what needs an audit trail”:

I have gradually been pulled over the years - through hard-earned experience - to the idea of auditing “almost everything”. The exception is driven not so much by “what do I think we don’t really need” as by “what tables are just so huge as to be impractical to audit” and also, sometimes, “are there ever any real changes to audit”. Some specifics:

  • Any small table (let’s say 1,000,000 records or fewer) with occasional changes (up to 10,000 changes a day?) - just do it. That’s most of the tables we will have except Posts, Comments, Votes.
  • Any medium-sized table (let’s say up to 100,000,000 records) with occasional changes (10,000 to 1,000,000 changes a day) where the changes are useful - just do it. It will affect performance (but not a lot if done well) and will be far easier than any kind of “send it off someplace else (NoSQL or a text log file or whatever) to retrieve it when we need it”. Just keep it all in SQL, use a big enough server, make sure your indexes are configured appropriately and done. This should cover Posts and Comments for quite a while.
  • Any large table (100,000,000 on up) with occasional changes - think carefully about what will work best. This is where it gets big enough for performance to be a concern, and then you have to balance performance vs. utility.
  • Any large table with hardly any changes - audit changes only. That is a little different from the usual “audit all create, update, delete” because it will make a big difference in performance and storage to NOT create an audit when a record is added. I have one system like that - most of the small to medium tables are audited (should be more, but didn’t start with the concept at the beginning…always harder to add it later) but deliberately not auditing the two biggest tables (energy data from mostly automated sources) which have very few updates.
  • Any small table with frequent changes - do not audit unless there is a business need. I am not talking about “Users like to change their profiles frequently” - in the grand scheme, those changes are “occasional” and important. I mean if we have tables containing cached summary information or other primarily derived information that changes frequently. There is no need to audit (since the information can be recreated from the audit trails of the other tables) and there is great benefit to not auditing in terms of storage & performance.

In our current proposed schema (including variations for reference tables vs. lots of little fields, etc. - that makes little overall difference), the result is that I would audit pretty much everything, including User, Community, Post, Comment, UserCommunity, Tag, etc., including Instance, Community, User and Post specific details (all the reference tables, linked tables, etc.) except:

  • Votes - Each vote is basically Up or Down and rarely changed. If a User changes a Vote, just delete (flag, not true Delete) and add a new Vote - avoids having an Audit table that is nearly a duplicate of the large (in terms of records, not fields) Vote table.

There may be other exceptions, but not that I can think of at the moment.


SQL Server 2016 introduced temporal tables. I am not familiar with PostgreSQL, but I just did some research and found a PostgreSQL temporal tables extension. I have no clue how good / performant / maintained / etc. that extension is. Just putting it out there; maybe somebody else knows.
