On Windows, when I copied a file, disk writes started immediately. On older systems, like Win98, I had to tweak Total Commander's disk buffer to improve copy speed on the same drive. Total Commander even had separate settings for same disk vs. different disk copy buffer sizes.
When I switched to Linux I was immediately surprised that disk writes did not start until the memory was full, and then it would stop reading while flushing dirty data. This happens even if the copy is between different drives: reads stop, writes only, then reads again with no writes to the other disk, repeat. It basically halves the copy speed.
It even happens when I copy to network mounts: it reads 20 GB of data into memory, then reading stops while it tries to flush the data over NFS. NFS times out, the transfer fails. I had to use NFS timeouts of 1 h just to be able to do a backup.
It drives me crazy. Is there any way to make it write immediately, or at least to put a memory limit on dirty data?
The values are crazy high by default (on modern hardware anyway): 10% of memory for vm.dirty_background_ratio and 20% for vm.dirty_ratio (the vm.dirty_background_bytes / vm.dirty_bytes sysctls are the absolute-size alternatives). I wonder why no distro touches these.
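For anyone wanting to experiment, a sketch of lowering those limits via sysctl (the byte values below are illustrative guesses, not recommendations):

```
# /etc/sysctl.d/99-writeback.conf -- illustrative values, tune for your hardware
# Kick off background writeback once 64 MiB of dirty data accumulates...
vm.dirty_background_bytes = 67108864
# ...and block writers once 256 MiB of dirty data is outstanding.
vm.dirty_bytes = 268435456
```

Note that setting the *_bytes sysctls automatically zeroes the corresponding *_ratio ones (and vice versa); apply with `sysctl --system`, or test at runtime with `sysctl -w`.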
To me this seems great and even modest. As RAM is dirt-cheap, this can significantly improve performance (especially when external or remote drives are involved, not only NVMe SSDs) and also prolong SSD life - which saves not just money but also the hassle of replacing them (and when it's a Mac, you can't just replace the SSD at all).
I wish I could configure Windows the same way: whenever it can use RAM to avoid an extra disk write/read - it should.
In the case you describe it may indeed decrease performance (or may not - I'm not a disk I/O or caching expert and know things can be weird), but it may still increase it in other scenarios. Copying files from one physical disk to another is not the only kind of operation that involves the RAM cache for disk I/O.
> Because people complain their system is "slow" if it blocks on disk I/O.
Yeah, I/O blocks drive me mad every time. They are more noticeable on Windows though. Perhaps that's because Windows doesn't RAM-cache enough and I do a lot of USB I/O (USB NIC, USB drives).
> Another set of people also complain Linux takes too long to safely unplug USB drives.
If only it had an API to see how much data for a specific device is currently cached in RAM and to visualize the progress of flushing that cache... Unvisualized long I/O (incl. caching) operations, let alone ones that freeze the UI, indeed feel bad and are a UX bug.
One of the key reasons I prefer Linux over Windows is that Linux freezes much more rarely, no matter the workload.
I've also thought it would be nice if Linux's dirty page handling was more granular. But at the same time, whenever dirty pages are a concern it's usually one large file or one NFS mount or one USB device. The system otherwise doesn't have a great deal of dirty pages to bother reporting on. Also programs have access to selective file flushing with fsync() so there is at least that.
This isn't just about NFS timeouts. Try playing a movie from a rotational disk while simultaneously doing high-volume writes. You will get frequent pauses in your video because the write buffer size is so large that a single writeback will cause the video buffer to drain empty.
On my desktop with 32GB of RAM, I can even get audio to skip when ripping DVDs to disk. That's because practically the entire movie fits into RAM before Linux decides to start the writeback process, and that writeback process will hog the disk for almost a minute. Or it used to, until I reduced the buffer size by a full order of magnitude.
This is just another sad example of buffer bloat: the inability to tune data buffers to the capacity of the underlying stream.
That's another thing I can't understand: Why does NFS timeout when the data transfer is still on? Shouldn't it timeout only when the server is no longer ACK-ing packets?
> Shouldn't it timeout only when the server is no longer ACK-ing packets?
That's exactly what happens. The server ACKs data until it fills its write buffer, and then stalls unresponsive until the entire buffer is flushed to disk. If it takes longer to flush the buffer to disk than the client's timeout, it gives up.
I have personally watched this happen via wireshark where the server doesn't ACK for more than 10 minutes.
That's not it. I only had this problem on a fast-ethernet connection (because I had to share the cable for two connections). The server could write ~ 50 MB/s, but it still timed out on the 10MB/s upload.
It's possible you were seeing another problem, but this issue is more likely to appear with a faster network connection, because the network transfer happens faster than the disk writes.
You can confirm by watching /proc/meminfo and watching the Dirty and Writeback numbers.
Changing up the vm.dirty* settings can help as described here:
Holy shit, I think you and the comment above, along with this thread, may have finally given me the answer to one of the few problems I was never able to solve.
About 4-5 years ago, I was working on a project, and part of it was copying big amounts of data to a system via NFS. At 30 minutes exactly, NFS would croak and the transfer failed.
I think this buffer fill-and-empty flow was killing it. It's a shame I don't work there anymore; I'd definitely want to try tweaking these settings and see if I could solve it.
Yeah that does sound like the symptoms of the problem I discovered. If you ever witness it again, the trick is watching /proc/meminfo for the Dirty and Writeback numbers.
Is there a maximum yet, or per-disk limits? IIRC those would trigger writeback. I always thought wiping a slow USB disk shouldn't consume all available RAM. But it used to. Maybe it still does.
One of the things I improved in my recordMyDesktop fork [0] was an awful tendency for the frame cache writer to accumulate heaps of dirty pages until background writeback would flush them out.
I had 16GiB of RAM which meant quite large swaths of dirty pages would become buffered while the SSD sat idle until writeback began. This would cause high-FPS full-screen recordings in particular to become backlogged and start dropping frames / audio dropouts. Just generally broken behavior for a desktop recorder, especially for a deferred-encode mode that's supposed to be optimized for minimizing system-wide effects/overheads during the recording.
The simple solution I found was to proactively initiate writeback regularly via fdatasync() on the cache fd. [1] I haven't decided yet if more should be done to constrain its buffer cache effects though. The cache files will be read back during encoding in post, so if there's enough RAM it can be desirable to enable reading them back entirely from memory instead of having to hit the disk again... but it would also be nice to let the rest of the system's processes keep their stuff in the page cache. memcg can probably be used to find a balanced solution, but I haven't done any experiments yet. Have any of you handled similar scenarios? What did you do?
> Rather than allowing multiple gigabytes of outstanding buffered writes and deferring writeback until a gigabyte or more has accumulated, you'd set things to trigger writebacks almost immediately and then force processes doing write IO to wait for disk writes to complete once you have more than a relatively small volume of outstanding writes.
I think having the trigger be size-based rather than time-based is the real problem. Or bounds on both...
I probably don't want to buffer writes for more than X seconds, or let the buffer grow beyond Y% of RAM. At least for the time-based limit, you'd really want to be able to say: start writing to disk when the buffer has data over 10 seconds old, but still accept writes into a new buffer, only blocking writers when there's already a buffer being written out and the current buffer is too old or too big.
You're mostly right in that it is similar to the problem of buffer bloat in networking. However, it's not quite the same thing because for example you can do things like write a file to disk & unlink it or overwrite some portion of contents, meaning that by buffering for longer you can avoid the writeback in the first place. By buffering for longer, the kernel is trying to balance things landing on disk and avoiding touching the disk if the dirtied data will be dirtied again. Granted not a common use-case these days, but consider the case where you have object files being created during a build over & over again. It's easy to construct scenarios where you indeed wouldn't want to write to disk so quickly.
It's not, out of hand, a terrible idea to avoid flushing data to disk, and there's no free lunch here: any workload you optimize for will have a different workload that suffers. People try to come up with general heuristics that work in most situations on consumer machines, but there's no one-size-fits-all for all HW + use-case combos. That's why hyperscalers tune the kernel beyond that / have kernel developers writing code to optimize for their use-case. It's telling that the performance analysis in the article is pretty hand-wavy, without any clear demonstration of a concrete problem.
As for sync, I believe the author is mistaken. You can call fsync instead, which is more efficient as it only creates a barrier for writeback of that file descriptor rather than a system-wide sync. And invoking fsync is, I believe, more common than sync. You should be able to have multiple fsync calls in flight concurrently for unrelated files without them blocking on each other too much (ideally the kernel would prioritize those writebacks and interleave them for fairness, but I doubt it does).
The power is negligible compared to the compute. (The disk isn't in stand-by mode if you're doing something that involves transient files.) Moreover, doing the writes now may let you turn the system off. (As long as we're making up situations.)
The goal is optimizing performance as seen by users. Long waits for disk flushes because the system wasted write opportunities is not good performance by any measure.
Nay I'm just refuting the idea that performance is the utmost, singular goal, as is pointed out by vlovich123 4-levels up in the thread https://news.ycombinator.com/item?id=39785690
Modern storage hierarchy has become so convoluted precisely because of many conflicting goals: performance, efficiency, durability, cost, etc.
Ideally the system starts flushing buffered writes basically immediately, but with low device-level queue depth so that subsequently issued higher priority IOs do not suffer from a ton of additional latency.
That is already the case: see the vm.dirty_expire_centisecs sysctl [0] at the VM layer, the commit interval for e.g. ext4 [1] at the filesystem layer, and [2] for their interaction.
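For concreteness, the defaults for those two knobs (per the kernel documentation; check your own system):

```
# Dirty pages older than this are written out on the next flusher wakeup.
# Default: 3000 centiseconds = 30 s.
vm.dirty_expire_centisecs = 3000

# ext4 journal commit interval is a mount option (default 5 s):
#   mount -o commit=5 /dev/sdX /mnt
```

So even with the default size thresholds untouched, dirty data should not sit in RAM indefinitely; it just may sit there far longer than a slow device or NFS timeout can tolerate.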
My systems routinely report transferring files at speeds much faster than the physical medium can actually sustain, only to have the transfer get stuck and fail with a timeout a while later.
Having such behavior be the default is, to my limited understanding, a bug in the Linux kernel.
When I write lots of data to a rusty device and don't sync, the kernel returns control more quickly. From there, I can move on with my work: everything is "as if" the data were actually copied (except for power-loss protection, indeed).
As such, I see this behavior as a feature, it allows me to wait less and do my work quicker.
Unlike Windows, where I must wait for the whole of the data to be written to the disk.
If you are thinking of something like an HTTP download, the transfer has to be done anyway (and I must wait for it). However, I do not need that data to be fully written to the (potentially busy) local device.
My initial comment was about a simple 'cp' copy of a large file from local storage to an NFS mount.
It times out, 'cp' exits with an error, and my file is not transferred. I can reproduce this at will. I think I should report it as a Linux kernel bug, to be honest.
There's "$ sync" to ask the kernel to start the actual write, and the "dd" command has options for this too (e.g. oflag=sync or conv=fsync). It's just not the default, as with many things on Linux, unfortunately.
The theory is compelling but this is the exact sort of problem that needs a representative benchmark. Based on the benchmark, the optimal strategy can then be picked.
I don’t see any stats/graphs or benchmarks in the linked article.
I think that the idea of not initiating writeback immediately derives mostly from the days of spinning rust, where read latencies would be noticeably impacted if you initiated writeback too aggressively: reads, contrary to writes, are synchronous by default, and spinning rust rarely allowed high (by modern standards) IOPS, so it made a lot of sense to buffer writes as much as possible to minimize the number of I/O operations spent on writes, as this would leave as many of those IOPS as possible available for reads.
This is probably much less of a concern today, as NVMe drives - beside having many orders of magnitude higher IOPS capacity - also have (at least on paper) much better hardware support for high I/O concurrency. It may still make sense, even today, if your hardware (or stack) limits IOPS.
As I mention elsewhere in the thread, reading the kernel documentation for the various flags suggests that the kernel devs are also concerned about multiple writes to the same piece of data, and thus buffering lets you potentially elide unnecessary disk I/O.
I found this recent thread interesting, specifically about really considering whether you're going to read the data you just wrote in the near future or not (in which case, use direct IO) and a set of (abandoned?) patches for write-behind caching for sequential writes in Linux (https://lore.kernel.org/lkml/156896493723.4334.1334048120714...).
> Rather than allowing multiple gigabytes of outstanding buffered writes and deferring writeback until a gigabyte or more has accumulated, you'd set things to trigger writebacks almost immediately and then force processes doing write IO to wait for disk writes to complete once you have more than a relatively small volume of outstanding writes.
This is especially true when the thing doing a bunch of buffered writes is in a VM. If the VMM is buffering writes to the host fs, you get the described effects in the host OS and the guest OS.
Edit: maybe you were suggesting the host does direct IO. That is exactly what I recommend. I initially read this differently.
Original:
If you are running the host OS you often have little or no say about what happens in the guest OS. Even if you control both it is likely not trivial to get the apps to use direct IO.
It could even be harmful to use direct IO because that would mean that writes would not stay in the guest buffer cache, forcing what would have been a cached read or minor fault in the guest into what it sees as a physical read.
The written blocks are not going to be shared, except maybe due to KSM. But KSM would do the same if that data was in the guest’s buffer cache if huge pages are not used.
fsync() guarantees that writes have hit the disk. But is there a guarantee about what's written before an fsync()? Can it be anywhere between "nothing" and "everything"? I suppose this must be a loose guarantee if the "write-back" parameter can be tweaked at will.
If you're generating an immutable file, a common technique is to play fsync + rename tricks although I think a more modern technique would be:
1. open directory (dir_fd)
2. create an unnamed O_TMPFILE file (file_fd)
3. write to file_fd
4. fdatasync(file_fd)
5. linkat(file_fd, "", dir_fd, "file name", AT_EMPTY_PATH)
6. fsync(dir_fd)
This should guarantee that "file name" will have either the old contents or the new contents, with no transient version observable. The "old" mechanism is similar in that you write out to a temporary sibling and rename it (these days you'd use RENAME_EXCHANGE with renameat2 to guarantee the atomicity or get an error), the key difference being that the temporary file could be observed on the filesystem, or left around after a machine reboot.
It was always this convoluted because the POSIX APIs suck and the Linux kernel still refuses to provide an API to write large amounts of data transactionally so you have to know the magic incantation for doing it. And note this is when you have all the data up-front. Doing an append is significantly expensive or requires specific filesystem support for reflinks.
SSD and HDD on-board firmware will actually cache locally and then lie about whether the data has been written to the disk or not. Pretty much every level of the stack caches and lies about it, to the point that if you do a deep dive into "has this information been written to disk or not", the answer comes up as "this is impossible to verify".