
Using triggers + history tables (aka audit tables) is the right answer 98% of the time. Just do it. If you're not already doing it, start today. It is a proven technique, in use for _over 30 years_.

Here's a quick rundown of how to do it generically https://gist.github.com/slotrans/353952c4f383596e6fe8777db5d... (trades off space efficiency for "being easy").
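The gist above targets Postgres; to illustrate the shape of the technique, here's a minimal, deliberately simplified sketch using SQLite through Python's stdlib (table and column names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    price REAL NOT NULL
);

-- History table: same columns plus audit metadata.
CREATE TABLE item_history (
    operation TEXT NOT NULL,                             -- 'I', 'U', or 'D'
    changed_at TEXT NOT NULL DEFAULT (datetime('now')),  -- database clock, not app clock
    id INTEGER,
    name TEXT,
    price REAL
);

-- One trigger per DML operation, writing the row image into history.
CREATE TRIGGER item_ai AFTER INSERT ON item BEGIN
    INSERT INTO item_history (operation, id, name, price)
    VALUES ('I', NEW.id, NEW.name, NEW.price);
END;

CREATE TRIGGER item_au AFTER UPDATE ON item BEGIN
    INSERT INTO item_history (operation, id, name, price)
    VALUES ('U', NEW.id, NEW.name, NEW.price);
END;

CREATE TRIGGER item_ad AFTER DELETE ON item BEGIN
    INSERT INTO item_history (operation, id, name, price)
    VALUES ('D', OLD.id, OLD.name, OLD.price);
END;
""")

conn.execute("INSERT INTO item (id, name, price) VALUES (1, 'widget', 9.99)")
conn.execute("UPDATE item SET price = 12.5 WHERE id = 1")
conn.execute("DELETE FROM item WHERE id = 1")

history = conn.execute(
    "SELECT operation, id, name, price FROM item_history ORDER BY rowid"
).fetchall()
for row in history:
    print(row)  # every version of the row survives, even after the DELETE
```

The point of the `AFTER` triggers is that the application can't forget to write history: any insert, update, or delete is recorded no matter which code path (or ad-hoc session) made the change.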

It's great if you can store immutable data. Really, really great. But you _probably_ have a ton of mutable data in your database and you are _probably_ forgetting a ton of it every day. Stop forgetting things! Use history tables.

cf. https://github.com/matthiasn/talk-transcripts/blob/master/Hi...

Do not use Papertrail or similar application-space history tracking libraries/techniques. They are slow, error-prone, and incapable of capturing any DB changes that bypass your app stack (which you probably have, and should). Worth remembering that _any_ attempt to capture an "updated" timestamp from your app is fundamentally incorrect, because each of your webheads has its own clock. Use the database clock! It's the only one that's correct!



> each of your webheads has its own clock. Use the database clock!

Yes, for consistency you should use the database clock by embedding the calls to `now()` or similar in the query instead of generating it on the client.

But these timestamps aren't sufficient for synchronization. The problem is that they are generated at the start of the transaction, not at the end when it commits. So if you poll a table and filter for recent timestamps, you'll miss rows from transactions that committed out of order. You can add a fudge factor, like querying back an extra few minutes and removing the duplicates, but some transactions will take longer than a few minutes. There's no upper bound on how long a transaction can take in PostgreSQL, and there's a lot of waste in querying too far back. This approach doesn't work if you care about correctness or efficiency.
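A toy simulation of that failure mode and the fudge-factor workaround (plain Python, with integers standing in for timestamps; no real database involved):

```python
# Simulated committed rows: (txn_start_ts, commit_ts). now() in Postgres
# returns the transaction *start* time, so a long-running transaction
# commits carrying an "old" timestamp.
rows = [
    (100, 101),  # fast transaction
    (102, 103),  # fast transaction
    (95, 104),   # long transaction: started first, committed last
]

def naive_poll(rows, last_seen_ts, poll_time):
    """Return rows whose start timestamp is newer than the last one seen.

    Misses the long transaction: its start timestamp (95) is older than
    last_seen_ts even though it committed after the previous poll.
    """
    return [r for r in rows if last_seen_ts < r[0] and r[1] <= poll_time]

def lookback_poll(rows, last_seen_ts, poll_time, lookback, seen):
    """Re-scan an extra `lookback` window and drop already-seen duplicates."""
    out = []
    for r in rows:
        if last_seen_ts - lookback < r[0] and r[1] <= poll_time and r not in seen:
            seen.add(r)
            out.append(r)
    return out

# First poll at t=103 saw the two fast transactions (last_seen_ts is now 102).
# Second poll at t=105:
missed = naive_poll(rows, last_seen_ts=102, poll_time=105)
print(missed)   # [] -- the long transaction (95, 104) is silently lost

seen = {(100, 101), (102, 103)}
caught = lookback_poll(rows, last_seen_ts=102, poll_time=105, lookback=10, seen=seen)
print(caught)   # [(95, 104)] -- recovered, but only because lookback > txn duration
```

The workaround only helps when the lookback window exceeds the longest transaction, which is exactly the unbounded quantity in question; and the wider the window, the more rows get re-scanned on every poll.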


PostgreSQL does provide a way to get the actual wall-clock time of the database server, regardless of transaction start time: `clock_timestamp()`.

There's also `statement_timestamp()` per the docs: "returns the start time of the current statement (more specifically, the time of receipt of the latest command message from the client)."

https://www.postgresql.org/docs/current/functions-datetime.h...

This isn't to say any of these methods are the best (or even good) in all cases. Time is tricky, especially if you're trying to do any sort of sequencing.


Estuary (https://estuary.dev ; I'm CTO) gives you a real-time, data-lake-backed change log of all the changes happening in your database, stored in your cloud storage -- complete with log sequence number, database time, and even before/after states if you use REPLICA IDENTITY FULL -- with no extra setup in your production DB.

By default, if you then go on to materialize your collections somewhere else (like Snowflake), you get synchronized tables that follow your source DB as they update.

But! You can also transform or materialize the complete history of your tables for auditing purposes from that same underlying data-lake, without going back to your source DB for another capture / WAL reader.


I'm evaluating Flow for CDC. Do you support logical decoding messages from `pg_logical_emit_message`? This would allow us to add audit metadata[^1].

[^1]: https://www.infoq.com/articles/wonders-of-postgres-logical-d...


No. But this is neat, and at a glance it looks straightforward to add. Happy to discuss further!


That sure sounds cool but I can't tell from your website that it does any of that. Even giving up on the marketing copy and going straight to the docs... I can't follow them.


Hi fellow Dark Tower friend.

Yes, marketing and docs are not our strongest suits and we need to do better. To be fair, though, we're also trying not to scare off less technical users who see a bullet list like the one above and think "well, this is clearly not for me". It's a hard balance.


Are you using debezium to capture changes?


No. We implemented our own [1] for a few reasons:

* Scaling well to multi-TB DBs without pinning the write-ahead log (and potentially filling your DB's disk) while the backfill is happening. Instead, our connector reads the WAL continuously, so it works well in setups like Supabase that have very restrictive WAL size limits (1GB, iirc).

* Incremental fault-tolerant backfills that can be stopped and resumed at will.

* Flowing TOAST columns through seamlessly to your materialized destination, without requiring that you resort to REPLICA IDENTITY FULL.

* Being able to offer "precise" captures which are logically consistent in terms of the sequence of create/update/delete events.

The last one becomes really interesting when paired with REPLICA IDENTITY FULL because you can feed the resulting before/after states into an incremental computation (perhaps differential dataflow) for streaming updates of a query.

Our work is based off of the Netflix DBLog paper, which we took and ran with.

[1] https://github.com/estuary/connectors/tree/main/source-postg...


I found that referencing session variables from triggers lets me add additional information (eg. a comment on why the change is being made) to the history. I've only done it in a small personal project, but it's worked well there so far.
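For example (a hedged sketch in SQLite via Python's stdlib, since that's easy to run: SQLite has no session variables, so a one-row context table stands in for what you'd do in Postgres with `set_config()`/`current_setting()`; all names here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE account_history (
    id INTEGER,
    balance REAL,
    change_reason TEXT   -- extra context supplied by the application
);

-- Stand-in for a session variable: a one-row table the app writes to
-- before making changes. (In Postgres the trigger would instead call
-- current_setting('app.change_reason', true).)
CREATE TABLE session_ctx (change_reason TEXT);
INSERT INTO session_ctx VALUES (NULL);

CREATE TRIGGER account_au AFTER UPDATE ON account BEGIN
    INSERT INTO account_history
    SELECT NEW.id, NEW.balance, (SELECT change_reason FROM session_ctx);
END;
""")

conn.execute("INSERT INTO account VALUES (1, 100.0)")

# The app records *why* before issuing the change...
conn.execute("UPDATE session_ctx SET change_reason = 'refund for order #42'")
conn.execute("UPDATE account SET balance = 150.0 WHERE id = 1")

# ...and the trigger attaches that reason to the history row.
row = conn.execute(
    "SELECT id, balance, change_reason FROM account_history"
).fetchone()
print(row)  # (1, 150.0, 'refund for order #42')
```

In a real multi-connection deployment you'd want a mechanism that is genuinely session-scoped (Postgres custom settings, or a temp table), so concurrent sessions can't see each other's context.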


Thanks for that example. I got Code Interpreter to port it to SQLite and demonstrate it working here: https://chat.openai.com/share/b5113cb1-10df-4a38-adde-5ec0e7...

I have my own SQLite implementation of a similar pattern (but using columns rather than JSON) which I describe here: https://simonwillison.net/2023/Apr/15/sqlite-history/


I agree, and this is a good approach - it's how we power the activity feed in our app. But it doesn't solve the problem of "pushing the changes" out. Of course, you can always listen to the audit table's WAL changes - best of both worlds.


If you have a GDPR request to delete everything for a user, do you go through the audit table and delete everything related to that user?


Unfortunately yes.



