
Using triggers + history tables (aka audit tables) is the right answer 98% of the time. Just do it. If you're not already doing it, start today. It is a proven technique, in use for _over 30 years_.

Here's a quick rundown of how to do it generically https://gist.github.com/slotrans/353952c4f383596e6fe8777db5d... (trades off space efficiency for "being easy").
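The gist above targets Postgres; to illustrate the shape of the technique, here's a minimal, deliberately simplified sketch using SQLite through Python's stdlib (table and column names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    price REAL NOT NULL
);

-- History table: same columns plus audit metadata.
CREATE TABLE item_history (
    operation TEXT NOT NULL,                             -- 'I', 'U', or 'D'
    changed_at TEXT NOT NULL DEFAULT (datetime('now')),  -- database clock, not app clock
    id INTEGER,
    name TEXT,
    price REAL
);

-- One trigger per DML operation, writing the row image into history.
CREATE TRIGGER item_ai AFTER INSERT ON item BEGIN
    INSERT INTO item_history (operation, id, name, price)
    VALUES ('I', NEW.id, NEW.name, NEW.price);
END;

CREATE TRIGGER item_au AFTER UPDATE ON item BEGIN
    INSERT INTO item_history (operation, id, name, price)
    VALUES ('U', NEW.id, NEW.name, NEW.price);
END;

CREATE TRIGGER item_ad AFTER DELETE ON item BEGIN
    INSERT INTO item_history (operation, id, name, price)
    VALUES ('D', OLD.id, OLD.name, OLD.price);
END;
""")

conn.execute("INSERT INTO item (id, name, price) VALUES (1, 'widget', 9.99)")
conn.execute("UPDATE item SET price = 12.5 WHERE id = 1")
conn.execute("DELETE FROM item WHERE id = 1")

history = conn.execute(
    "SELECT operation, id, name, price FROM item_history ORDER BY rowid"
).fetchall()
for row in history:
    print(row)  # every version of the row survives, even after the DELETE
```

The point of the `AFTER` triggers is that the application can't forget to write history: any insert, update, or delete is recorded no matter which code path (or ad-hoc session) made the change.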

It's great if you can store immutable data. Really, really great. But you _probably_ have a ton of mutable data in your database and you are _probably_ forgetting a ton of it every day. Stop forgetting things! Use history tables.

cf. https://github.com/matthiasn/talk-transcripts/blob/master/Hi...

Do not use Papertrail or similar application-space history tracking libraries/techniques. They are slow, error-prone, and incapable of capturing any DB changes that bypass your app stack (which you probably have, and should). Worth remembering that _any_ attempt to capture an "updated" timestamp from your app is fundamentally incorrect, because each of your webheads has its own clock. Use the database clock! It's the only one that's correct!



> each of your webheads has its own clock. Use the database clock!

Yes, for consistency you should use the database clock by embedding the calls to `now()` or similar in the query instead of generating it on the client.

But these timestamps aren't sufficient for synchronization. The problem is that they are generated at the start of the transaction, not at the end when it commits. So if you poll a table and filter for recent timestamps, you'll miss rows from transactions that committed out of order. You can add a fudge factor, like querying back an extra few minutes and removing the duplicates, but some transactions will take longer than a few minutes. There's no upper bound on how long a transaction can take in PostgreSQL, and there's a lot of waste in querying too far back. This approach doesn't work if you care about correctness or efficiency.
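A toy simulation of that failure mode and the fudge-factor workaround (plain Python, with integers standing in for timestamps; no real database involved):

```python
# Simulated committed rows: (txn_start_ts, commit_ts). now() in Postgres
# returns the transaction *start* time, so a long-running transaction
# commits carrying an "old" timestamp.
rows = [
    (100, 101),  # fast transaction
    (102, 103),  # fast transaction
    (95, 104),   # long transaction: started first, committed last
]

def naive_poll(rows, last_seen_ts, poll_time):
    """Return rows whose start timestamp is newer than the last one seen.

    Misses the long transaction: its start timestamp (95) is older than
    last_seen_ts even though it committed after the previous poll.
    """
    return [r for r in rows if last_seen_ts < r[0] and r[1] <= poll_time]

def lookback_poll(rows, last_seen_ts, poll_time, lookback, seen):
    """Re-scan an extra `lookback` window and drop already-seen duplicates."""
    out = []
    for r in rows:
        if last_seen_ts - lookback < r[0] and r[1] <= poll_time and r not in seen:
            seen.add(r)
            out.append(r)
    return out

# First poll at t=103 saw the two fast transactions (last_seen_ts is now 102).
# Second poll at t=105:
missed = naive_poll(rows, last_seen_ts=102, poll_time=105)
print(missed)   # [] -- the long transaction (95, 104) is silently lost

seen = {(100, 101), (102, 103)}
caught = lookback_poll(rows, last_seen_ts=102, poll_time=105, lookback=10, seen=seen)
print(caught)   # [(95, 104)] -- recovered, but only because lookback > txn duration
```

The workaround only helps when the lookback window exceeds the longest transaction, which is exactly the unbounded quantity in question; and the wider the window, the more rows get re-scanned on every poll.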


PostgreSQL does provide a way to get the actual wall-clock time of the database server, regardless of transaction start time: `clock_timestamp()`.

There's also `statement_timestamp()` per the docs: "returns the start time of the current statement (more specifically, the time of receipt of the latest command message from the client)."

https://www.postgresql.org/docs/current/functions-datetime.h...

This isn't to say any of these methods are the best (or even good) in all cases. Time is tricky, especially if you're trying to do any sort of sequencing.


Estuary (https://estuary.dev ; I'm CTO) gives you a real-time, data-lake-backed change log of all the changes happening in your database, stored in your cloud storage -- complete with log sequence number, database time, and even before/after states if you use REPLICA IDENTITY FULL -- with no extra setup in your production DB.

By default, if you then go on to materialize your collections somewhere else (like Snowflake), you get synchronized tables that follow your source DB as they update.

But! You can also transform or materialize the complete history of your tables for auditing purposes from that same underlying data-lake, without going back to your source DB for another capture / WAL reader.


I'm evaluating Flow for CDC. Do you support logical decoding messages from `pg_logical_emit_message`? This would allow us to add audit metadata[^1].

[^1]: https://www.infoq.com/articles/wonders-of-postgres-logical-d...


No. But this is neat, and at a glance it looks straightforward to add. Happy to discuss further!


That sure sounds cool but I can't tell from your website that it does any of that. Even giving up on the marketing copy and going straight to the docs... I can't follow them.


Hi fellow Dark Tower friend.

Yes, marketing and docs are not our strongest suits and we need to do better. To be fair, though, we're also trying not to scare off less technical users who see a bullet list like the one above and think "well, this is clearly not for me". It's a hard balance.


Are you using debezium to capture changes?


No. We implemented our own [1] for a few reasons:

* Scaling well to multi-TB DBs without pinning the write-ahead log (and potentially filling your DB's disk) while the backfill is happening. Instead, our connector reads the WAL continuously, so it works well in setups like Supabase that have very restrictive WAL size limits (1GB, iirc).

* Incremental fault-tolerant backfills that can be stopped and resumed at will.

* Flowing TOAST columns through seamlessly to your materialized destination, without requiring that you resort to REPLICA IDENTITY FULL.

* Being able to offer "precise" captures which are logically consistent in terms of the sequence of create/update/delete events.

The last one becomes really interesting when paired with REPLICA IDENTITY FULL because you can feed the resulting before/after states into an incremental computation (perhaps differential dataflow) for streaming updates of a query.

Our work is based off of the Netflix DBLog paper, which we took and ran with.

[1] https://github.com/estuary/connectors/tree/main/source-postg...


I found that referencing session variables from triggers lets me add additional information (eg. a comment on why the change is being made) to the history. I've only done it in a small personal project, but it's worked well there so far.
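For example (a hedged sketch in SQLite via Python's stdlib, since that's easy to run: SQLite has no session variables, so a one-row context table stands in for what you'd do in Postgres with `set_config()`/`current_setting()`; all names here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE account_history (
    id INTEGER,
    balance REAL,
    change_reason TEXT   -- extra context supplied by the application
);

-- Stand-in for a session variable: a one-row table the app writes to
-- before making changes. (In Postgres the trigger would instead call
-- current_setting('app.change_reason', true).)
CREATE TABLE session_ctx (change_reason TEXT);
INSERT INTO session_ctx VALUES (NULL);

CREATE TRIGGER account_au AFTER UPDATE ON account BEGIN
    INSERT INTO account_history
    SELECT NEW.id, NEW.balance, (SELECT change_reason FROM session_ctx);
END;
""")

conn.execute("INSERT INTO account VALUES (1, 100.0)")

# The app records *why* before issuing the change...
conn.execute("UPDATE session_ctx SET change_reason = 'refund for order #42'")
conn.execute("UPDATE account SET balance = 150.0 WHERE id = 1")

# ...and the trigger attaches that reason to the history row.
row = conn.execute(
    "SELECT id, balance, change_reason FROM account_history"
).fetchone()
print(row)  # (1, 150.0, 'refund for order #42')
```

In a real multi-connection deployment you'd want a mechanism that is genuinely session-scoped (Postgres custom settings, or a temp table), so concurrent sessions can't see each other's context.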


Thanks for that example. I got Code Interpreter to port it to SQLite and demonstrate it working here: https://chat.openai.com/share/b5113cb1-10df-4a38-adde-5ec0e7...

I have my own SQLite implementation of a similar pattern (but using columns rather than JSON) which I describe here: https://simonwillison.net/2023/Apr/15/sqlite-history/


I agree, and this is a good approach - it's how we power the activity feed in our app. But it doesn't solve the problem of "pushing the changes" out. Of course, you can always listen to the audit table's WAL changes - best of both worlds.


If you have a GDPR request to delete everything for a user, do you go through the audit table and delete everything related to that user?


Unfortunately yes.



