What always nagged at me about the one-file-per-message approach is what happens when you accumulate many messages, perhaps by being on high-volume mailing lists, or never throwing anything away, or both. In particular:
How much space is wasted due to partially filled filesystem blocks? This is less important with today's workstation drives than it was 30 years ago, but perhaps still relevant on a single-board computer with limited flash storage, for example.
How does performance suffer from scanning a directory with millions of files, or if they're spread across multiple directories, from traversing the directories? Even if the delivery and user agents handle it well, what about the command line tools that would make one-file-per-message appealing? What if it's a network filesystem?
Filesystems can be chosen and tuned for their expected contents, of course, as usenet admins once did for news spools. But most users won't maintain a special filesystem just for email; they will expect it to work well on the same fs that they use for everything else.
With those considerations in mind, I can understand the appeal of multiple messages per file, whether it's a database or just plain old mbox format with a nearby index.
Neither approach seems strictly better than the other.
On usenet, message-per-file was abandoned pretty quickly once the volume started to go up, in favour of things like Diablo's "huge file that's a circular buffer" approach. Then the tuning was more about, IIRC, how big you could make inodes to efficiently handle huge (100s of GB) files (not really a problem for mail messages!).
(although I have to say I am 20 years out of usenet admin and maybe things have swung back towards the INN style - it does make long retention easier and modern filesystems are probably much better. Back then we were experimenting with everything from JFS to XFS to FreeBSD to ...)
> How much space is wasted due to partially filled filesystem blocks? This is less important with today's workstation drives than it was 30 years ago, but perhaps still relevant on a single-board computer with limited flash storage, for example.
I store my e-mail in a maildir on ZFS with compression enabled. I have not tuned it in any way. My archive directory is 30% smaller on disk than its total number of bytes, according to "du" (i.e. compression outweighs the space overhead).
For a more "normal person" comparison, I made a fresh ext4 filesystem, copied the archive over, and used "df" to get the exact number of blocks in use; overhead was about 2%. Seems fine to me.
[edit] Median file size is 5744 bytes, 1/10/90/99 percentile sizes in bytes are 1777/3050/41059/169654
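For anyone who wants a rough number for their own maildir without making a scratch filesystem, here is a minimal sketch. It assumes every file occupies a whole number of fixed-size blocks (4096 bytes is just a common default, not something measured above) and ignores inodes, directory entries, inline data, and compression, so it will not necessarily match what "df" reports. The names `block_overhead` and `file_sizes` are made up for this example.

```python
import math
import os


def block_overhead(sizes, block_size=4096):
    """Return (payload_bytes, allocated_bytes, wasted_fraction).

    Simplified model: each file is rounded up to a whole number of blocks.
    wasted_fraction is the slack space divided by the allocated space.
    """
    sizes = list(sizes)  # allow generators; we iterate twice
    payload = sum(sizes)
    allocated = sum(math.ceil(s / block_size) * block_size for s in sizes)
    wasted = (allocated - payload) / allocated if allocated else 0.0
    return payload, allocated, wasted


def file_sizes(root):
    """Yield the size in bytes of every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield os.path.getsize(path)
```

Usage would be something like `block_overhead(file_sizes("/path/to/Maildir"))`, where the path is a placeholder. Total overhead depends on the whole size distribution, since large messages contribute most of the bytes but waste at most one partial block each.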
> How does performance suffer from scanning a directory with millions of files, or if they're spread across multiple directories, from traversing the directories? Even if the delivery and user agents handle it well, what about the command line tools that would make one-file-per-message appealing? What if it's a network filesystem?
General purpose file systems can manage several thousand files per directory; splitting up into directories is probably a good thing (I archive mine per year). Walking the extra directories adds negligible overhead, since all but the leaf directories will have a very small number of entries. To go from one million to ten thousand files per directory you only have to add a single level of 100 directories.
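To make the arithmetic concrete, here is a small sketch of a one-level fan-out. Bucketing by a hash of the file name is an assumption for illustration (I split per year, which works just as well if volume is steady); `shard_path` and `files_per_leaf` are hypothetical helper names.

```python
import hashlib
import math


def shard_path(name, fanout=100):
    """Map a message file name to 'NN/name', with NN in [0, fanout).

    Uses a stable hash so the same name always lands in the same bucket.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % fanout
    width = len(str(fanout - 1))  # zero-pad so directory names sort evenly
    return f"{bucket:0{width}d}/{name}"


def files_per_leaf(total_files, fanout=100):
    """Average number of files per leaf directory, rounded up."""
    return math.ceil(total_files / fanout)
```

With `fanout=100`, `files_per_leaf(1_000_000)` gives 10000, and the top-level directory itself only ever holds 100 entries, which is why walking it costs next to nothing.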