On Windows, when I copied a file, disk writes started immediately. On older systems, like Win98, I had to tweak Total Commander's disk buffer to improve copy speed on the same drive. Total Commander even had separate settings for same disk vs. different disk copy buffer sizes.
When I switched to Linux I was immediately surprised that disk writes did not start until the memory was full, and then it would stop reading while flushing dirty data. This happens even if the copy is between different drives: reads stop, writes only, then reads again with no writes to the other disk, repeat. It basically halves the copy speed.
It even happens when I copy to network mounts: it reads 20 GB of data into memory, then reading stops while it tries to flush the data over NFS. NFS times out, the transfer fails. I had to use NFS timeouts of 1 h just to be able to do a backup.
It drives me crazy. Is there any way to make it write immediately, or at least to put a memory limit on dirty data?
The values are crazy high by default (on modern hardware anyway): 10% of memory for vm.dirty_background_ratio and 20% for vm.dirty_ratio (the vm.dirty_background_bytes / vm.dirty_bytes sysctls are the absolute-size alternatives). I wonder why no distro touches these.
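For anyone wanting to experiment, a sketch of lowering those limits via sysctl (the byte values below are illustrative guesses, not recommendations):

```
# /etc/sysctl.d/99-writeback.conf -- illustrative values, tune for your hardware
# Kick off background writeback once 64 MiB of dirty data accumulates...
vm.dirty_background_bytes = 67108864
# ...and block writers once 256 MiB of dirty data is outstanding.
vm.dirty_bytes = 268435456
```

Note that setting the *_bytes sysctls automatically zeroes the corresponding *_ratio ones (and vice versa); apply with `sysctl --system`, or test at runtime with `sysctl -w`.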
To me this seems great and even modest. As RAM is dirt-cheap, this can significantly improve performance (especially when external or remote drives are involved, not only NVMe SSDs) and also prolong SSD life - which saves not just money but also the hassle of replacing them (and when it's a Mac, you can't just replace the SSD at all).
I wish I could configure Windows the same way: whenever it can use RAM to avoid an extra disk write/read - it should.
In the case you describe it may indeed decrease performance (or may not - I'm not a disk I/O or caching expert and know things can be weird), but it may still increase it in other scenarios. Copying files from one physical disk to another is not the only kind of operation that involves the RAM cache for disk I/O.
> Because people complain their system is "slow" if it blocks on disk I/O.
Yeah, I/O blocks drive me mad every time. They are more noticeable on Windows though. Perhaps that's because Windows doesn't RAM-cache enough and I do a lot of USB I/O (USB NIC, USB drives).
> Another set of people also complain Linux takes too long to safely unplug USB drives.
If only it had an API to see how much data for a specific device is currently cached in RAM and to visualize the progress of flushing that cache... Unvisualized long I/O (incl. caching) operations, let alone ones that freeze the UI, indeed feel bad and are a UX bug.
One of the key reasons I prefer Linux over Windows is that Linux freezes much more rarely, no matter the workload.
I've also thought it would be nice if Linux's dirty page handling was more granular. But at the same time, whenever dirty pages are a concern it's usually one large file or one NFS mount or one USB device. The system otherwise doesn't have a great deal of dirty pages to bother reporting on. Also programs have access to selective file flushing with fsync() so there is at least that.
This isn't just about NFS timeouts. Try playing a movie from a rotational disk while simultaneously doing high-volume writes. You will get frequent pauses in your video because the write buffer size is so large that a single writeback will cause the video buffer to drain empty.
On my desktop with 32GB of RAM, I can even get audio to skip when ripping DVDs to disk. That's because practically the entire movie fits into RAM before Linux decides to start the writeback process, and that writeback process will hog the disk for almost a minute. Or it used to, until I reduced the buffer size by a full order of magnitude.
This is just another sad example of buffer bloat: the inability to tune data buffers to the capacity of the underlying stream.
That's another thing I can't understand: Why does NFS timeout when the data transfer is still on? Shouldn't it timeout only when the server is no longer ACK-ing packets?
> Shouldn't it timeout only when the server is no longer ACK-ing packets?
That's exactly what happens. The server ACKs data until it fills its write buffer, and then stalls unresponsive until the entire buffer is flushed to disk. If it takes longer to flush the buffer to disk than the client's timeout, it gives up.
I have personally watched this happen via wireshark where the server doesn't ACK for more than 10 minutes.
That's not it. I only had this problem on a fast-ethernet connection (because I had to share the cable for two connections). The server could write ~ 50 MB/s, but it still timed out on the 10MB/s upload.
It's possible you were seeing another problem, but this issue is more likely to appear with a faster network connection, because the network transfer happens faster than the disk writes.
You can confirm by watching /proc/meminfo and watching the Dirty and Writeback numbers.
Changing up the vm.dirty* settings can help as described here:
Holy shit, I think you and the comment above, along with this thread, may have finally given me the answer to one of the few problems I was never able to solve.
About 4-5 years ago, I was working on a project, and part of it was copying big amounts of data to a system via NFS. At 30 minutes exactly, NFS would croak and the transfer failed.
I think this buffer fill-and-empty flow was killing it. It's a shame I don't work there anymore; I'd definitely want to try tweaking these settings and see if I could solve it.
Yeah that does sound like the symptoms of the problem I discovered. If you ever witness it again, the trick is watching /proc/meminfo for the Dirty and Writeback numbers.
Is there a maximum yet, or per-disk limits? IIRC those would trigger writeback. I always thought wiping a slow USB disk shouldn't consume all available RAM. But it used to. Maybe it still does.
One of the things I improved in my recordMyDesktop fork [0] was an awful tendency for the frame cache writer to accumulate heaps of dirty pages until background writeback would flush them out.
I had 16GiB of RAM which meant quite large swaths of dirty pages would become buffered while the SSD sat idle until writeback began. This would cause high-FPS full-screen recordings in particular to become backlogged and start dropping frames / audio dropouts. Just generally broken behavior for a desktop recorder, especially for a deferred-encode mode that's supposed to be optimized for minimizing system-wide effects/overheads during the recording.
The simple solution I found was to proactively initiate writeback regularly via fdatasync() on the cache fd. [1] I haven't decided yet if more should be done to constrain its buffer cache effects though. The cache files will be read back during encoding in post, so if there's enough RAM it can be desirable to enable reading them back entirely from memory instead of having to hit the disk again... but it would also be nice to let the rest of the system's processes keep their stuff in the page cache. memcg can probably be used to find a balanced solution, but I haven't done any experiments yet. Have any of you handled similar scenarios? What did you do?
> Rather than allowing multiple gigabytes of outstanding buffered writes and deferring writeback until a gigabyte or more has accumulated, you'd set things to trigger writebacks almost immediately and then force processes doing write IO to wait for disk writes to complete once you have more than a relatively small volume of outstanding writes.
I think having the trigger be size-based rather than time-based is the real problem. Or bounds on both...
I probably don't want to buffer writes for more than X seconds, or let the buffer grow beyond Y% of RAM. At least for the time-based limit, you'd really want to be able to say: start writing to disk when the buffer has data over 10 seconds old, but still accept writes into a new buffer, only blocking writers when there's already a buffer being written out and the current buffer is too old or too big.
You're mostly right in that it is similar to the problem of buffer bloat in networking. However, it's not quite the same thing because for example you can do things like write a file to disk & unlink it or overwrite some portion of contents, meaning that by buffering for longer you can avoid the writeback in the first place. By buffering for longer, the kernel is trying to balance things landing on disk and avoiding touching the disk if the dirtied data will be dirtied again. Granted not a common use-case these days, but consider the case where you have object files being created during a build over & over again. It's easy to construct scenarios where you indeed wouldn't want to write to disk so quickly.
It's not, out of hand, a terrible idea to avoid flushing data to disk, and there's no free lunch here: any workload you optimize for will have a different workload that suffers. People try to come up with general heuristics that work in most situations on consumer machines, but there's no one-size-fits-all for all HW + use-case combos. That's why hyperscalers tune the kernel beyond that / have kernel developers writing code to optimize for their use-case. It's telling that the performance analysis in the article is pretty hand-wavy, without any clear demonstration of a concrete problem.
As for sync, I believe the author is mistaken. You can call fsync instead, which is more efficient as it only creates a barrier for writeback of that file descriptor rather than a system-wide sync. And invoking fsync is, I believe, more common than sync. You should be able to have multiple fsync calls in flight concurrently for unrelated files without them blocking on each other too much (ideally the kernel would prioritize those writebacks and interleave them for fairness, but I doubt it does).
The power is negligible compared to the compute. (The disk isn't in stand-by mode if you're doing something that involves transient files.) Moreover, doing the writes now may let you turn the system off. (As long as we're making up situations.)
The goal is optimizing performance as seen by users. Long waits for disk flushes because the system wasted write opportunities is not good performance by any measure.
Nay I'm just refuting the idea that performance is the utmost, singular goal, as is pointed out by vlovich123 4-levels up in the thread https://news.ycombinator.com/item?id=39785690
Modern storage hierarchy has become so convoluted precisely because of many conflicting goals: performance, efficiency, durability, cost, etc.
Ideally the system starts flushing buffered writes basically immediately, but with low device-level queue depth so that subsequently issued higher priority IOs do not suffer from a ton of additional latency.
That is already the case: see the vm.dirty_expire_centisecs sysctl [0] at the VM layer, the commit interval for e.g. ext4 [1] at the filesystem layer, and [2] for their interaction.
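For concreteness, the defaults for those two knobs (per the kernel documentation; check your own system):

```
# Dirty pages older than this are written out on the next flusher wakeup.
# Default: 3000 centiseconds = 30 s.
vm.dirty_expire_centisecs = 3000

# ext4 journal commit interval is a mount option (default 5 s):
#   mount -o commit=5 /dev/sdX /mnt
```

So even with the default size thresholds untouched, dirty data should not sit in RAM indefinitely; it just may sit there far longer than a slow device or NFS timeout can tolerate.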
My systems routinely report transferring files at speeds much faster than the physical medium can actually sustain, only to have the transfer get stuck and fail with a timeout a while later.
Having such behavior be the default is, to my limited understanding, a bug in the Linux kernel.
When I write lots of data to a rusty device and don't sync, the kernel returns control more quickly. From there, I can move on with my work: everything is "as if" the data were actually copied (except for power-loss protection, indeed).
As such, I see this behavior as a feature, it allows me to wait less and do my work quicker.
Unlike Windows, where I must wait for the whole of the data to be written to the disk.
If you are thinking of something like an HTTP download, the transfer has to be done anyway (and I must wait for it). However, I do not need that data to be fully written to the (potentially busy) local device.
My initial comment was about a simple 'cp' copy of a large file from local storage to an NFS mount.
It times out, 'cp' exits with an error, and my file is not transferred. I can reproduce this at will. I think I should report it as a Linux kernel bug, to be honest.
There's "$ sync" to ask the kernel to start the actual write, and the "dd" command has options for this too (e.g. oflag=sync or conv=fsync). It's just not the default, as with many things on Linux, unfortunately.
The theory is compelling but this is the exact sort of problem that needs a representative benchmark. Based on the benchmark, the optimal strategy can then be picked.
I don’t see any stats/graphs or benchmarks in the linked article.
I think that the idea of not initiating writeback immediately derives mostly from the days of spinning rust, where read latencies would be noticeably impacted if you initiated writeback too aggressively: reads, contrary to writes, are synchronous by default, and spinning rust rarely allowed high (by modern standards) IOPS, so it made a lot of sense to buffer writes as much as possible to minimize the number of I/O operations spent on writes, as this would leave as many of those IOPS as possible available for reads.
This is probably much less of a concern today, as NVMe drives - beside having many orders of magnitude higher IOPS capacity - also have (at least on paper) much better hardware support for high I/O concurrency. It may still make sense, even today, if your hardware (or stack) limits IOPS.
As I mention elsewhere in the thread, reading the kernel documentation for the various flags suggests that the kernel devs are also concerned about multiple writes to the same piece of data, and thus buffering lets you potentially elide unnecessary disk I/O.
I found this recent thread interesting, specifically about really considering whether you're going to read the data you just wrote in the near future or not (in which case, use direct IO) and a set of (abandoned?) patches for write-behind caching for sequential writes in Linux (https://lore.kernel.org/lkml/156896493723.4334.1334048120714...).
> Rather than allowing multiple gigabytes of outstanding buffered writes and deferring writeback until a gigabyte or more has accumulated, you'd set things to trigger writebacks almost immediately and then force processes doing write IO to wait for disk writes to complete once you have more than a relatively small volume of outstanding writes.
This is especially true when the thing doing a bunch of buffered writes is in a VM. If the VMM is buffering writes to the host fs, you get the described effects in the host OS and the guest OS.
Edit: maybe you were suggesting the host does direct IO. That is exactly what I recommend. I initially read this differently.
Original:
If you are running the host OS you often have little or no say about what happens in the guest OS. Even if you control both it is likely not trivial to get the apps to use direct IO.
It could even be harmful to use direct IO because that would mean that writes would not stay in the guest buffer cache, forcing what would have been a cached read or minor fault in the guest into what it sees as a physical read.
The written blocks are not going to be shared, except maybe due to KSM. But KSM would do the same if that data was in the guest’s buffer cache if huge pages are not used.
fsync() guarantees that writes have hit the disk. But is there a guarantee about what's written before an fsync()? Can it be anywhere between "nothing" and "everything"? I suppose this must be a loose guarantee if the "write-back" parameter can be tweaked at will.
If you're generating an immutable file, a common technique is to play fsync + rename tricks although I think a more modern technique would be:
1. open directory (dir_fd)
2. create an unnamed O_TMPFILE file (file_fd)
3. write to file_fd
4. fdatasync(file_fd)
5. linkat(file_fd, "", dir_fd, "file name", AT_EMPTY_PATH)
6. fsync(dir_fd)
This should guarantee that "file name" will have either the old contents or the new contents, with no transient version observable. The "old" mechanism is similar in that you write out to a temporary sibling and rename it (these days you'd use RENAME_EXCHANGE with renameat2 to guarantee the atomicity or get an error), the key difference being that the temporary file could be observed on the filesystem, or left around after a machine reboot.
It was always this convoluted because the POSIX APIs suck and the Linux kernel still refuses to provide an API to write large amounts of data transactionally so you have to know the magic incantation for doing it. And note this is when you have all the data up-front. Doing an append is significantly expensive or requires specific filesystem support for reflinks.
SSD and HDD on-board firmware will actually cache locally and then lie about whether the data has been written to the disk or not. Pretty much every level of the stack caches and lies about it, to the point that if you do a deep dive into "has this information been written to disk or not", the answer comes up as "this is impossible to verify".