Once you've gzipped to achieve that 3MB storage, binary deltas are useless. The data could be (and almost certainly is) transferred gzipped, then expanded to the full 33MB so that binary diffs could be applied to it later, but setting up a system to do binary diffs is a lot of incidental complexity: xdelta is a surprisingly complex format, and bsdiff is really tuned for executables, not arbitrary content (and is pretty complex too).
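To illustrate why deltas over compressed data don't work, here's a small Python sketch (scaled way down from 33MB, with made-up package entries): a one-entry change in the uncompressed data scrambles essentially everything after that point in the gzip stream, so a binary delta between the two compressed files saves far less than a diff of the plaintext would.

```python
import gzip

# Scaled-down stand-ins for two index snapshots that differ in one entry.
a = b"\n".join(b'{"name":"pkg%d","vers":"1.0.0"}' % i for i in range(10000))
b = a.replace(b'"pkg5000","vers":"1.0.0"', b'"pkg5000","vers":"1.0.1"')

ca = gzip.compress(a, mtime=0)
cb = gzip.compress(b, mtime=0)

# Length of the shared prefix of the two compressed streams: everything
# after the DEFLATE block containing the change diverges, because LZ77
# back-references and Huffman codes no longer line up.
common = next((i for i, (x, y) in enumerate(zip(ca, cb)) if x != y),
              min(len(ca), len(cb)))
print(f"{common} of {len(ca)} compressed bytes shared")
```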

It sounds like the biggest win would be for cargo to keep using git, but clone the crates.io index as a bare repository rather than checking out the plaintext content. Then it would only take 47MB by your count, which is pretty close to 33MB, and you could still get out the plain content with `git cat-file` and friends.
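A sketch of what that looks like, using a throwaway local repo as a stand-in for the crates.io index (paths and the `serde` entry here are invented):

```python
import os
import subprocess
import tempfile

def run(*args, cwd=None):
    return subprocess.run(args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

with tempfile.TemporaryDirectory() as tmp:
    # Throwaway stand-in for the index: one JSON-lines file per package.
    src = os.path.join(tmp, "index")
    run("git", "init", "-q", src)
    with open(os.path.join(src, "serde"), "w") as f:
        f.write('{"name":"serde","vers":"1.0.0"}\n')
    run("git", "-C", src, "add", ".")
    run("git", "-C", src, "-c", "user.email=you@example.com",
        "-c", "user.name=you", "commit", "-qm", "init")

    # A bare clone keeps only the (packed) object database -- no working tree.
    bare = os.path.join(tmp, "index.git")
    run("git", "clone", "-q", "--bare", src, bare)

    # The plaintext is still reachable without a checkout:
    out = run("git", "-C", bare, "cat-file", "-p", "HEAD:serde")
    print(out, end="")
```

The objects in the bare clone are the same delta-compressed packfiles git transfers over the wire, which is where the space saving comes from.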



Technically, Cargo /already/ bundles a full copy of libxdelta as part of libgit2 (in addition to the separate Git binary delta algorithm); I just checked using nm that it's actually included in the binary. It could probably be removed, but, well, it probably adds a lot less than 44MB to the binary size :)

Alternately, since JSON is text, I suppose you could just ensure that whatever emits this hypothetical merged JSON file puts newlines between different packages' entries, and then use a regular text diff (on the uncompressed version, of course). But reading 44MB of JSON isn't instant; it would probably be better to switch to either a binary format, or even something silly like a sorted list of JSON strings separated by newlines.
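For the newline-separated variant, an ordinary text diff already does the right thing. A rough Python sketch with made-up entries, using difflib in place of whatever diff tool the server would actually run:

```python
import difflib

# Hypothetical index snapshots: one JSON object per line, sorted by name,
# so a change to one package touches exactly one line.
old = [
    '{"name":"rand","vers":"0.8.5"}',
    '{"name":"serde","vers":"1.0.195"}',
]
new = [
    '{"name":"rand","vers":"0.8.5"}',
    '{"name":"serde","vers":"1.0.196"}',
]

# The diff is proportional to what changed, not to the 44MB total.
diff = list(difflib.unified_diff(old, new, lineterm=""))
print("\n".join(diff))
```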

There would be some incidental complexity around generating and applying the diffs. You'd probably want to precalculate them on the server side, but it could be rather expensive to calculate, on every change, a diff between the current version and every previous version. Instead, you could have daily checkpoints: each day the server would make a checkpoint and calculate a diff from each of the last N checkpoints; on every update it would recalculate only the diff between the latest checkpoint and HEAD. The client would store both HEAD and a reverse diff to the latest checkpoint (or just store the checkpoint separately and waste a few MB). When it updates, it would revert to that checkpoint, request the diff from there to the new latest checkpoint, and then request the diff from that checkpoint to the new HEAD. If its checkpoint is too old, it would just redownload from scratch.
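A toy model of that scheme, with all names invented and real binary/text deltas replaced by trivial (old, new) snapshot pairs:

```python
# Toy model of the daily-checkpoint scheme; "diffs" here are just
# (old, new) snapshot pairs standing in for real deltas.

def apply_diff(state, diff):
    old, new = diff
    assert state == old, "diff does not apply to this state"
    return new

class Server:
    def __init__(self, head, keep=3):
        self.head = head
        self.keep = keep       # N: how many past checkpoints to diff against
        self.checkpoints = []  # list of (id, snapshot)
        self.diffs = {}        # (from_id, to_id) -> diff

    def make_checkpoint(self):
        cid = len(self.checkpoints)
        # Diff from each of the last N checkpoints to the new one.
        for old_id, old_snap in self.checkpoints[-self.keep:]:
            self.diffs[(old_id, cid)] = (old_snap, self.head)
        self.checkpoints.append((cid, self.head))
        self.diffs[(cid, "HEAD")] = (self.head, self.head)

    def update(self, new_head):
        self.head = new_head
        # On every change, recalculate only the latest-checkpoint -> HEAD diff.
        cid, snap = self.checkpoints[-1]
        self.diffs[(cid, "HEAD")] = (snap, new_head)

class Client:
    def __init__(self, server):
        self.checkpoint_id, self.checkpoint = server.checkpoints[-1]
        self.head = server.head

    def update(self, server):
        latest_id, _ = server.checkpoints[-1]
        if latest_id != self.checkpoint_id:
            key = (self.checkpoint_id, latest_id)
            if key not in server.diffs:
                raise RuntimeError("checkpoint too old: redownload from scratch")
            # Revert to our checkpoint, then jump to the latest checkpoint.
            self.checkpoint = apply_diff(self.checkpoint, server.diffs[key])
            self.checkpoint_id = latest_id
        # Finally apply the latest-checkpoint -> HEAD diff.
        self.head = apply_diff(self.checkpoint, server.diffs[(latest_id, "HEAD")])

s = Server("v1")
s.make_checkpoint()      # checkpoint 0 = "v1"
c = Client(s)
s.update("v2")
s.make_checkpoint()      # checkpoint 1 = "v2"
s.update("v3")
c.update(s)              # v1 -> checkpoint 1 ("v2") -> HEAD ("v3")
print(c.head)
```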

Overall, not a trivial change, but probably not too hard either.

apt-get does something vaguely similar with its pdiff files.



