A write-ahead log is not a universal part of durability (eatonphil.com)
159 points by todsacerdoti on July 1, 2024 | 51 comments


Unfortunately this post skips over the "atomicity" part of a write-ahead log.

Assume you start with data on disk AAAAAAAA, read it into memory, and update it to BBBBBBBB, then write it back. If you crash in the middle, you might end up with BBBAAAAA, BBBBBBAA, or even some crazy interleaving. (at least for reasonable file sizes - note that the largest atomic write to many NVMe drives is 128K)

If you ditch the in-memory BTree and write a direct-to-disk one, then with a lot of care (and maybe a bit of copy-on-write) you can ensure that each disk write leaves the database in a crash-consistent state - but that costs multiple writes and fsyncs for any modification that splits or merges BTree nodes, since every intermediate write has to leave the database consistent.

(for those of you old enough to remember ext2, it had the same problem. If you mounted it async and had a bad crash, the data on disk would be inconsistent - you'd lose data, so you'd vow to always mount your filesystem with synchronous writes so you'd never lose data again, then a few weeks later you'd get tired of the crappy performance and go back to async writes, until the next crash happened, etc. etc.)

The advantage of a log is that it allows you to combine multiple writes to different parts of the database file into a single record, guaranteeing (after crash recovery if necessary) that either all changes happen or none of them do. It serves the same purpose as a mutex in multi-threaded code - if your invariants hold when you get the mutex, and you reestablish them before you drop it, everything will be fine. We'd all love to have a mutex that keeps the system from crashing, but failing that we can use a WAL record to ensure that we move atomically from one valid state to another, without worrying about the order of intermediate changes to the data structure.
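
To make that concrete, here's a minimal sketch of the grouping (the record format is illustrative, not any particular database's): all page updates for one transaction go into a single checksummed WAL record, so recovery applies them all or none.

    import os, struct, zlib

    def wal_commit(wal_fd, pages):  # pages: list of (offset, bytes) updates
        # One checksummed record covers every page touched by the transaction.
        payload = b"".join(
            struct.pack("<QI", off, len(data)) + data for off, data in pages
        )
        record = struct.pack("<II", len(payload), zlib.crc32(payload)) + payload
        os.write(wal_fd, record)
        os.fsync(wal_fd)  # commit point: recovery replays all of it or none
        # The real pages can now be written to the main file in any order.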


Good point!

I'm not sure if there are any databases that do your 'with a lot of care' option, but for anyone curious about what that might look like in practice, there are file systems that forgo write-ahead logging and maintain metadata consistency using techniques like soft updates[0] or copy-on-write up to a root[1].

[0]: https://www.usenix.org/conference/1999-usenix-annual-technic...

[1]: https://www.cs.hmc.edu/~rhodes/courses/cs134/fa20/readings/T... (yes, ZFS can be configured to use a WAL too for durability)


> I'm not sure if there are any databases that do your 'with a lot of care' option

LMDB springs to mind.

> LMDB was designed to resist data loss in the face of system and application crashes. Its copy-on-write approach never overwrites currently-in-use data. Avoiding overwrites means the structure on disk/storage is always valid, so application or system crashes can never leave the database in a corrupted state.

https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Databa...
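
A minimal sketch of the copy-on-write idea (not LMDB's actual file format): updated pages go to fresh space, and the database only "becomes" the new version when a single root pointer is flipped last.

    import os

    PAGE = 4096

    def cow_update(db_fd, new_page_no, new_page_bytes, root_slot_offset):
        # Write the updated page to never-before-used space; live data is
        # never overwritten, so the old tree stays valid throughout.
        os.pwrite(db_fd, new_page_bytes, new_page_no * PAGE)
        os.fsync(db_fd)  # new tree version is fully on disk, but not yet live
        # Then flip one small root pointer; a crash before this line simply
        # leaves the old, still-consistent root in place.
        os.pwrite(db_fd, new_page_no.to_bytes(8, "little"), root_slot_offset)
        os.fsync(db_fd)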


> yes, ZFS can be configured to use a WAL too for durability

ZFS always uses the ZIL for sync writes. You can optionally set `sync=disabled` on a dataset to switch it off. I would not describe that as "can be configured to use a WAL".


I imagine “a lot of care” like this:

Ok, let’s write this veeery slowly, good, let’s check now that it got written as intended. Right, let’s check again just in case, but this time slowly and paying a lot of attention… ok, looks good, moving on


For some data structures you get the atomicity for “free” though. For example, a log structured merge tree where writes are done in large transactions could well do without a WAL and still achieve atomicity.


A good example of a database that relies on LSM tree properties to protect against data corruption on unclean shutdown is VictoriaMetrics [1].

[1] https://valyala.medium.com/wal-usage-looks-broken-in-modern-...


One might argue that an LSM tree has a WAL — LSM’s log contains logical records instead of physical changes, but it seems like the essentials are there.


Yes, I guess that’s another way of phrasing it. But LSM tree implementations often have a separate WAL as well, to ensure durability for small transactions.


Ever have an LSMT quit writing a run halfway through? Ever need to try to restore Cassandra, where mutation storage order isn't guaranteed, with the last mutation always winning?

It is absolutely not "atomic" if you have ever had to do a recovery from SSTable snapshots.


> Ever have an LSMT quit writing a run half way through?

That should be very easy to “solve” by just writing the length of the SSTable in a header.
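
A minimal sketch of that fix (the file layout here is hypothetical): a length-plus-checksum header lets startup detect and drop a half-written run.

    import struct, zlib

    def load_run(path):
        with open(path, "rb") as f:
            header = f.read(8)
            if len(header) < 8:
                return None  # truncated header: discard the run
            length, crc = struct.unpack("<II", header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                return None  # half-written run: discard it on startup
            return payload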

> Ever need to try and restore Cassandra where mutation storage order isn't guaranteed, with the last mutation always winning?

This is more of a distributed systems challenge as I understand it: Cassandra is “multi-master” so can end up in a situation where transaction order is hard to sort out.


One of the interesting things I came up against while writing this post was the pretty common misconception that SQLite and Postgres will validate your data with checksums [0]. SQLite leaves this to an optional extension (this is a little more commonly known). Here was an HN comment I stumbled on about Postgres [1]:

> Postgres (and other databases that actually care about data integrity) have per page checksums:

Postgres does support data checksumming (but not metadata checksumming), but it is disabled by default. MongoDB (WiredTiger), on the other hand, checksums by default. I was told after publishing this post that MySQL (InnoDB) also checksums by default, but I did not check that in my survey.

Links for proof are in the post.

[0] https://x.com/eatonphil/status/1807572135340134687

[1] https://news.ycombinator.com/item?id=25231308


That's interesting. [1] seems to be the relevant page in the postgres documentation.

tl;dr: run `SHOW data_checksums` to see if checksums are enabled (they were enabled by default on our GCP Cloud SQL instance). To enable them, shut down postgres and run `pg_checksums --enable --progress`.

1: https://www.postgresql.org/docs/current/checksums.html


I’ve come around to the idea that checksumming isn’t a panacea. Once you have checksums, you need to start thinking about error *recovery* and not just detection. What happens to linked data? Do you just declare the whole SQLite file corrupted and throw up your hands?

If you have backups, that’s one obvious solution, but people might not have uncorrupted backups.

I still think you should do it and emit a warning about it at the very least, but it’s not trivial to handle.


Just use ZFS. :)


Running a database on zfs can work well but introduces a whole new world of whitepapers to read and tuning options to consider :/


You don't have to read any whitepapers. Just copy the `zfs create` command from some guide, it'll work fine.


Leave zfs at the defaults and in the case of mysql/mariadb, turn off the "double writes" setting. It's not hard.


Database engines that live as libraries inside application processes (like WT) have a more pressing need for checksums because the most common attack vector is buggy pointer arithmetic outside the library that stomps on data in memory randomly.


Checksumming is also useful when storage goes bad or other random hardware corruption like cosmic rays or overheating/old computers.


This reminds me of the time I had a disk go 'random' in a RAID. The disk didn't fail; it just returned junk for every Nth read. It happened to be in a big database server too. The crazy part is it took far longer than I would have expected for corrupted data to show up in user applications, so an entire 28 hours of data had to be rolled back.


Yeah, that one is a path dependency issue for postgres. I'm not sure the reasoning goes much past inertia after the initial "it's new, let's not throw the entire user base into it at once", but at the very least it would complicate pg_upgrade somewhat. WiredTiger didn't have that path dependency, being itself new to Mongo to shore up the storage situation.

It's probably about time to swap the default, but some people's once-working pg_upgrade setups that they haven't looked at in a while might break. Probably okay; those things need to happen...once in a while. I suppose some people resent the overhead of Postgres checksumming atop their ZFS/btrfs/dm-integrity/whatever stacks, but they are somewhat rarer.


The weirdest thing: PostgreSQL's checksum algorithm is almost free on modern processors. It uses SIMD instructions extremely well in optimized builds. I've never seen it in a CPU profile, despite always enabling it.

There's no reason not to enable it (except pg_upgrade).


I found the Developer Voices discussion[1][2] with Joran Dirk Greef, creator of TigerBeetle, to be fascinating. They mentioned that the rigorous correctness proofs that exist for Raft and Paxos assume an absence of disk faults, but that more modern theory includes fixes for that, as long as you marry the log writing and consensus algorithms together properly rather than keeping each as a separate black box.

[1] https://podcasts.apple.com/us/podcast/databases-ambitions-an...

[2] https://www.youtube.com/watch?v=ayG7ltGRRHs


Specifically, that modern theory is "protocol-aware recovery" - e.g. https://www.usenix.org/conference/fast18/presentation/alagap...


I second this recommendation. I'm impressed by what TigerBeetle has accomplished, and would love to see more of the industry approach software engineering with their philosophy and rigour.

The TigerStyle video is well worth watching for anyone who wants to build defect-free software systems, or recognizes that this is the correct thing to aim for. https://www.youtube.com/watch?v=w3WYdYyjek4


Maybe good to mention torn pages somewhere too? Both MySQL and Postgres jump through some hoops to both detect and repair them [1][2]. So even in the scenario in the post where fsync is used to harden writes, the database still needs to handle torn pages (or requires a file system / storage that guarantees atomic page writes at the page size the database is using, as several managed/cloud databases do).

[1] https://wiki.postgresql.org/wiki/Full_page_writes

[2] https://dev.mysql.com/doc/refman/8.0/en/innodb-doublewrite-b...
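
A rough sketch of the full-page-writes idea (names here are illustrative, not Postgres internals): the first change to a page after a checkpoint logs the entire page image, so recovery can replace a torn page wholesale instead of patching garbage.

    def log_change(wal, page_no, page_bytes, delta, dirtied_since_checkpoint):
        if page_no not in dirtied_since_checkpoint:
            # First touch after a checkpoint: log the whole page image, so
            # recovery can overwrite a torn page wholesale.
            wal.append(("full_page", page_no, bytes(page_bytes)))
            dirtied_since_checkpoint.add(page_no)
        else:
            wal.append(("delta", page_no, delta))  # later changes log deltas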


Thanks Adam! I think torn writes would still be caught via checksums, no? Although that may be later than you'd wish.

I'm not confident but from reading that page it seems that for Postgres at least, if it did do checksums it might not need to count on page-level atomic writes?


Checksums can detect a torn page, but not always repair it. It's likely a good chunk of the database page is gone (i.e., an amount of data matching the disk / file system atomic write unit size is probably missing). Torn page writes are a pretty common scenario too, so databases need to be able to fully recover from them - not just detect them and report a corruption (i.e., just pull the power plug from the machine during a heavy write workload and you're likely to get one - it doesn't require a cosmic ray to flip a bit :) ).


That's fair. In the post I did mention disk redundancy (and I guess I only implied recovery) as one additional level for safety. Which I think is what you're getting at too.


Disk redundancy won't guarantee torn page protection if the writes across the redundant disks are not coordinated so that one starts only after the other finishes, such that there is always one copy of the page that is not currently being written. So writing to a RAID1 array won't help here without knowledge of how that RAID1's writes work.


Worth noting that there are various tricks to combine the write ahead log and the rest of the on-disk data, meaning the vast majority of data will not be written more than once, and you'll still get the durability and data locality benefits of a WAL.


Could you elaborate? Do you mean an LSM tree or log structured filesystem?


LSM trees also need a log, right?


LSM trees do not need a write-ahead log in the general case:

- When new data arrives, it is converted to an SSTable, which is then stored to disk in an atomic manner before returning 'success' to the client that wrote the data (a sketch of this follows the list). If the computer crashes in the middle of the write, no problem - the partially written SSTable is dropped on database start, since it was never registered in the database.

- When the computer crashes in the middle of a background merge of smaller SSTables into bigger ones, again no problem - the source SSTables are still available after database restart, while the partially written output SSTable can be safely dropped.
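
Here's the sketch mentioned above - a minimal illustration of atomic SSTable publication via write-then-rename (not any particular engine's code):

    import os

    def publish_sstable(dirpath, name, payload):
        tmp = os.path.join(dirpath, name + ".tmp")
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        try:
            os.write(fd, payload)
            os.fsync(fd)  # contents are durable before the table is visible
        finally:
            os.close(fd)
        os.rename(tmp, os.path.join(dirpath, name))  # atomic "registration"
        dfd = os.open(dirpath, os.O_RDONLY)
        try:
            os.fsync(dfd)  # persist the directory entry itself
        finally:
            os.close(dfd)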

VictoriaMetrics and VictoriaLogs use LSM trees without a WAL, while providing good durability levels. They can lose recently ingested metrics or logs on server crash, if they weren't yet converted to SSTables and written to disk. But this is a very good tradeoff compared to data corruption or failed WAL replay in other systems, which use the WAL in improper ways - https://valyala.medium.com/wal-usage-looks-broken-in-modern-... .


I don't agree with your https://valyala.medium.com/wal-usage-looks-broken-in-modern-... article in at least a couple ways.

One specific reason:

> TimescaleDB relies on PostgreSQL’s WAL mechanism, which puts data into WAL buffers in RAM and periodically flushes them to WAL file. This means that the data from unflushed WAL buffers is lost on power loss or on process crash.

That links to the manpage which says "The contents of the WAL buffers are written out to disk at every transaction commit". Maybe there's a missing "TimescaleDB only commits periodically" that makes the quote above true, but any suggestion that PostgreSQL does not guarantee durability of committed transactions out of the box is incorrect.

A broader reason is: it talks about how WALs may be "lost / corrupted" before fsync. Then how the "write directly to SSTable" approach just loses recently added data, and "IMHO, recently written data loss on process crash has lower severity comparing to data corruption". But in general, I'd expect these databases to have a mechanism by which they don't apply a corrupted WAL (typically involving a strong checksum on WAL entries). So ultimately these are two ways of describing the same thing. If those databases really do apply corrupt/half-written/unflushed WAL entries and thus corrupt the previously committed data, yes, that's very interesting, but the smoking gun is missing. The article is either wrong or incomplete.


LSM-trees do need a WAL. The entire idea of LSM-trees is that writes are buffered in memory and written out all at once. But a particular write doesn't wait for the memtable to be flushed. For that reason you still need a WAL (there is committed state in memory).

See https://research.facebook.com/publications/optimizing-space-... and https://www.cs.umb.edu/~poneil/lsmtree.pdf.
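
A minimal sketch of why (the API here is illustrative): put() acknowledges long before any memtable flush, so the WAL record is the only durable copy of a fresh write.

    import os

    class LSMWrites:
        def __init__(self, wal_path):
            self.memtable = {}  # committed but not yet flushed to an SSTable
            self.wal = open(wal_path, "ab")

        def put(self, key, value):
            self.wal.write(f"{key}={value}\n".encode())
            self.wal.flush()
            os.fsync(self.wal.fileno())  # durable before we acknowledge
            self.memtable[key] = value   # the only other copy lives in RAM
            return "ok"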


Those implementations use a WAL, but it seems to be only as a performance optimization to decrease the size of the in-memory index; is there a theoretical reason one is needed? It looks equivalent to a WAL-less write path combined with an almost immediate compaction. If you remove the compaction and don’t delete the WAL it seems like you can eliminate that write amplification (at least temporarily).


The original purpose of an LSM-tree is to take I/O off the critical path of a write (there are other reasons to use them though, for example reducing space amplification).

I would argue that by definition an LSM-tree buffers committed writes in memory, and that means you need a WAL for recovery.

If you are going to immediately flush the memtable then IO is on the critical path. And if you have fine grained updates you'll end up with lots of small files, which seems like a bad thing. It could be reasonable if you only receive batch updates.


Any durable commit is going to have I/O in the critical path unless you're Paxos/Raft replicating in-memory across failure domains (which we're not discussing here), but I think you mean it takes random I/O out of the critical path. You can get that without a WAL, though; just have the LSM keep appending out of order to a growing file and keep the in-memory index. That's the exact same I/O pattern that the WAL would generate, there just isn't an immediate compaction. The in-memory index will stay fragmented for longer, though (which is why I call the WAL a performance optimization above). I suppose the WAL-less design lets you defer compaction for longer, which might be an advantage if you have lots of disk and lots of RAM, but don't want two-thirds of your throughput (read + write) taken away at critical moments.


I actually think you're describing a WAL.


> I would argue that by definition an LSM-tree buffers committed writes in memory, and that means you need a WAL for recovery.

This is true, but note that the WAL does not need to be in the database. You can use an event stream like Kafka and replay blocks of events in the event of a failure. ClickHouse has a feature to deduplicate blocks it has seen before, even if they land on a separate server in a cluster. You still need to store checksums of the previously seen blocks, which is what ClickHouse does. It does put the onus on users to regenerate blocks accurately but the overhead is far lower.


Batched inserts to LSM-based databases aren't so uncommon. For example, ClickHouse expects batched writes - https://clickhouse.com/docs/en/optimize/bulk-inserts . Of course, it supports asynchronous inserts with in-memory buffering for the recently stored data - https://clickhouse.com/docs/en/cloud/bestpractices/asynchron...


Yes, that's true. And not even ClickHouse considers their MergeTree engine to be an LSM-tree (see https://github.com/ClickHouse/ClickHouse/blob/bfc0260f359a0c...).

That doesn't make it wrong, or a bad architecture. It still takes ideas from LSM-trees, and it has similarities. But it can't be called an LSM-tree.


But the WAL is not for durability; it's for atomicity and consistency. And yes, you need fsync to ensure durability.


Nitpick: If you're careful fdatasync() can suffice.
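
For example (a sketch, assuming a Unix-y Python environment): if the log space is preallocated, appends don't change the file size, so fdatasync() has little or no extra metadata to flush.

    import os

    fd = os.open("wal.log", os.O_WRONLY | os.O_CREAT)
    os.ftruncate(fd, 64 * 1024 * 1024)  # preallocate: appends won't grow the file
    os.fsync(fd)                        # harden the size change once, up front
    os.pwrite(fd, b"record", 0)
    os.fdatasync(fd)                    # skips flushing mtime etc. on the hot path
    os.close(fd)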


> durability is a spectrum

This is the truth. My favourite example of this is MemoryDB's "multi-AZ (multi-datacenter) durability" for Valkey - there's a good write-up here https://brooker.co.za/blog/2024/04/25/memorydb.html


In the pseudocode snippet that introduces a WAL, the semaphores are signaled before the fsync occurs. This seems like a mistake but maybe there’s a good reason to do this that I am missing?

Edit: it looks like this is the case in the first batched fsync example, as well.


Good catch! That was a think-o. I've swapped it around so that semaphores are signalled after the fsync.
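
For readers following along, a sketch of the corrected ordering in the post's python-like style (not the post's actual code):

    import os

    def flush_batch(wal_file, batch):  # batch: list of (record_bytes, semaphore)
        for record, _ in batch:
            wal_file.write(record)
        wal_file.flush()
        os.fsync(wal_file.fileno())  # harden the whole batch first
        for _, done in batch:
            done.release()           # only now tell each waiter it's committed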


Thanks for the quick response, and for the nice write-up. I thoroughly enjoyed the python-like pseudocode; it was the main reason I was able to pick that out reasonably quickly!


This is obvious if you took a DB course, and if you didn't, you have no business building a DB. Sadly, all the NoSQL junkware was built by people who didn't.



