Audit tables round 2

Not every table or every column is going to need a record of changes, but when you build the audit system, it's easier to start with everything and pare down from there as things turn out not to be needed or the performance/storage costs get too high.

  • Every table will have a change table behind it, in the same database.
  • The name of the change table will be the original table name with _chg appended
  • The first column of the change table will be chg_id (bigserial) and that will be the primary key
  • The second column will be called chg_activity and it will be an enum ('CREATE', 'UPDATE_BEFORE', 'UPDATE_AFTER', 'DELETE')
  • The third column will be called chg_activity_date and it will default to the date the row was created.
  • The fourth column will be called chg_user and will default to the current_user
  • The rest of the columns will be in the exact same order and of the exact same datatype as they were in the original table.

To build the chg_tables (a concrete sketch of the result follows this list),

  • Take the script that builds the original tables and copy it to a new script.
  • Add _chg to the end of the table names
  • Remove the unique constraints.
  • Remove the foreign key constraints (the foreign key values will be correct when inserted and will stay correct until a row in the original table is deleted).
  • Remove the default values (we want what was actually inserted or updated, not the default).
  • The old primary key column id must keep a NOT NULL constraint (it is no longer the primary key; chg_id is)
  • Add the new columns specified above to the front of the tables
  • Add an index to the following columns at a minimum
    • chg_activity
    • chg_activity_date
    • chg_user
    • id
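
Putting those two lists together, here is a minimal sketch of what the generated DDL might look like. The members table and its columns, and the chg_activity_t type name, are made up purely for illustration:

```sql
-- Hypothetical activity enum shared by all _chg tables
CREATE TYPE chg_activity_t AS ENUM ('CREATE', 'UPDATE_BEFORE', 'UPDATE_AFTER', 'DELETE');

-- Imagine the original table were:
--   CREATE TABLE members (
--       id    bigserial PRIMARY KEY,
--       name  text NOT NULL UNIQUE,
--       email text NOT NULL DEFAULT ''
--   );

-- Matching change table: chg_* columns first, then the original columns
-- in the same order and with the same types, minus defaults and constraints.
CREATE TABLE members_chg (
    chg_id            bigserial PRIMARY KEY,
    chg_activity      chg_activity_t NOT NULL,
    chg_activity_date timestamptz NOT NULL DEFAULT now(),  -- full date/time, per the reply below
    chg_user          text NOT NULL DEFAULT current_user,
    id                bigint NOT NULL,  -- old primary key: NOT NULL, but no longer unique
    name              text,
    email             text
);

CREATE INDEX ON members_chg (chg_activity);
CREATE INDEX ON members_chg (chg_activity_date);
CREATE INDEX ON members_chg (chg_user);
CREATE INDEX ON members_chg (id);
```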

To build the triggers,

  • We don’t want to have to write these out by hand; rather, we want a script that connects to the DB, gets the list of columns in each table, builds the triggers, and writes them to a .sql file.

    This is especially important if we add more tables/columns, or if we decide that we only want to log the columns that actually changed: in the first case all you have to do is rerun the script, and in the second the IF statements get really long and would be easy for a human to mess up. A sketch of what such a generated trigger might look like follows.
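
For illustration only, here is roughly the kind of output such a script might emit for the made-up members table above. The function and trigger names are invented; a real generator would produce one of these per table from the column list it reads out of information_schema:

```sql
-- Example of what the generator might write to the .sql file for members.
CREATE OR REPLACE FUNCTION members_chg_log() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO members_chg (chg_activity, id, name, email)
        VALUES ('CREATE', NEW.id, NEW.name, NEW.email);
        RETURN NEW;
    ELSIF TG_OP = 'UPDATE' THEN
        -- one row for the old values, one for the new
        INSERT INTO members_chg (chg_activity, id, name, email)
        VALUES ('UPDATE_BEFORE', OLD.id, OLD.name, OLD.email),
               ('UPDATE_AFTER',  NEW.id, NEW.name, NEW.email);
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO members_chg (chg_activity, id, name, email)
        VALUES ('DELETE', OLD.id, OLD.name, OLD.email);
        RETURN OLD;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER members_chg_trigger
    AFTER INSERT OR UPDATE OR DELETE ON members
    FOR EACH ROW EXECUTE FUNCTION members_chg_log();  -- EXECUTE PROCEDURE on PostgreSQL < 11
```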


As with a number of other “date” fields, this must be a full date/time, not just date.

That should not be needed. I have done this with Python/Django using another class (AuditTable) to do most of the dirty work. I can’t imagine there isn’t some comparable way to do this in C# etc. Even if it turns out that everything needs to be written out twice (which I doubt), it should all be in one script. That will make it much less likely for errors (changed a field in one place and not the other) to creep in, which is particularly important for code that will hopefully be used by others (who should not have to repeat our “oops” moments of fixing the DB manually in PostgreSQL).

I’ve never been into triggers much, but that is probably due to lack of knowledge and lack of need. However, I would consider the possibility of the “save” function handling this directly rather than having to set up triggers separately. That has the advantage of keeping all the action in the code, whereas (as I understand it, not having done it) triggers are somewhat separate. Again, with Django this is relatively easy (can’t say “trivial” but close) and I am sure similar methods are available in C#.


Is some sort of event sourcing off the table already?

Because this seems like a lot of added complexity ( = points of failure)
just to get you exactly what event sourcing does…


This isn’t event sourcing, it’s just audit history.
For the sake of debugging, restoring data, etc.
It’s not for an architectural purpose.

The only time I actually think this will be used for business logic is post history (if I am not mistaken).

There is support for this in some libraries (mostly EF) with some adjustments. If we use these libraries, there is no need to write scripts/triggers.

The only point I have about this is: maybe use “history” instead of chg? It’s not really obvious what chg means, unless this is a standard I am not aware of.

That’s… what event sourcing does, too.

It’s more of a pattern than an architecture, really. You can easily apply it to only one of your models and not the others. You can also use it inside of DDD, Hexagonal Architecture, etc.

I’m afraid that is not the case. I don’t remember where I saw it here, but IIRC there were good arguments for why we should do a whole bunch of logging. The more extreme opinions even advocated logging literally everything. So with this approach… basically more than doubling the number of columns / schema complexity.

I’m not saying we should use this, but consider for a second what I’ve come up with for answers while working on my own SO clone (while not caring about audit history):
An Answer that is just ID, Question (FK) and CurrentRevision(FK).
An AnswerRevision with ID, Answer (FK), Author (FK) and the actual text.
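
Roughly, in SQL terms (the table and column names here are just my own illustration of that layout, including made-up questions and users tables):

```sql
CREATE TABLE answers (
    id                  bigserial PRIMARY KEY,
    question_id         bigint NOT NULL REFERENCES questions (id),
    current_revision_id bigint  -- points at answer_revisions; constraint added once both tables exist
);

CREATE TABLE answer_revisions (
    id        bigserial PRIMARY KEY,
    answer_id bigint NOT NULL REFERENCES answers (id),
    author_id bigint NOT NULL REFERENCES users (id),
    body      text NOT NULL  -- the actual text of this revision
);
```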

If you follow what SO does, these are already things you need (because on SO you can look at an answer’s change history). Most of it is there already, in an event-sourcing-ish way. Seems wasteful to me to create whole duplicate-and-then-some audit tables.


@raphaelschmitz The “Answer” + “AnswerRevision” style could work. But it:

  • Still duplicates tables (in the sense of “2 tables for Answers instead of 1”, even though the number of columns is much less than 2x)
  • Requires more “application-centered” logic, i.e., a slightly more complex query to display an Answer, in exchange for not needing the (ideally pretty much automatic and therefore not-a-big-deal) copy-to-audit-table code.
  • Does not significantly affect the database storage required. Either scheme will end up with “just a little more than the history needs” data storage. The main difference is with a single action (e.g., the initial create of a record), where the original scheme uses 2x (all columns repeated in the audit trail) vs. your idea using 1x plus a little. But after the first revision it is 3x vs. 2x, after the second revision 4x vs. 3x, and pretty soon they are almost the same.

Yes, coincidentally.

I was just trying to say that there are no plans to use event sourcing in the architecture right now.

@raphaelschmitz I agree with you and have suggested this, too. I’m not a fan of this direction.

Is event sourcing something for DB admins to use, or something for db users to use? We’ll need ordinary users to be able to inspect some history (posts), and moderators will need to be able to inspect more (user history at least), and instance admins might need more. None of those people should have, or need, direct DB access. Does event sourcing (which I know nothing about) meet these kinds of audit needs?


Event sourcing is a pattern that manages the application’s state so that there is a single source of events, and that source records a clear, ordered sequence of those events.

In this event-driven design, commands are executed based on the current events, and the handlers of those events in turn execute their own logic.

One of the basics of event sourcing is having a store (a database, if you will) with tables that record these events, so the logic can then execute whatever is required based on them.
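
For anyone unfamiliar, a typical event store boils down to something like the following. This is only an illustrative sketch, not a proposal for our schema:

```sql
-- One append-only table holding every event, in order.
CREATE TABLE events (
    event_id     bigserial PRIMARY KEY,          -- global ordering of events
    aggregate_id bigint NOT NULL,                -- which entity the event belongs to
    event_type   text NOT NULL,                  -- e.g. 'AnswerPosted', 'AnswerEdited'
    payload      jsonb NOT NULL,                 -- data needed to replay the event
    occurred_at  timestamptz NOT NULL DEFAULT now()
);

-- Current state is derived by replaying (projecting) the events,
-- rather than being the single source of truth itself.
```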

TL;DR It’s for the backend people, not DB admins.

And who knows, maybe we will do event sourcing in our project just because the audit tables are already there.

I know the backend shouldn’t affect the database, but calling classes MemberChg or ChgMember or MemberChange really rustles my jimmies.
How about history?

I didn’t think Chg would show up in any other table names in the regular part of the schema; that’s why I went with it. History sounds fine too.
