
The problem is not any particular hosting service such as GitHub; it's that the pointer in your paper is relying on a single point of failure.

There was a bit of drama several years ago when Megaupload was seized and shut down; various small/free projects lost access to the only copy of some of their files. As with your paper, important documents had evolved in forums, which linked to the file hosting service for files that could not be uploaded to the forum. A few projects were the canonical documentation for something that the original author had abandoned. The first result in Google couldn't be updated, creating the same pointer problem as your reference in a paper.

At the time, a lot of people talked about finding a "replacement file hosting service" in the same way people currently talk about finding a replacement for GitHub. Moving to a different service still leaves you with a single point of failure. Instead, when you want to preserve access to data in the long term, you need to assume any single service might fail and build in redundancy.

Instead of saying, "[things] are available online at [URL]", you should include in the paper something like:

    The code and simulation results are
    available as an archive named:
           foo_project-2018-06-16.zip
    The file has the following checksums:
        MD5:  1271ed5ef305aadabc605b1609e24c52
        SHA1: ab69db8315af7de6e673a6ddf128d415157a7c3f
        (...more...)
    The file was originally hosted at:
        $GITHUB_URL
        $GITLAB_URL
        ${OTHER_HOSTING_SERVICE_URLS[@]}
        $INTERNET_ARCHIVE_URL
        $AUTHOR_UNIVERSITY_URL
        $COLLAB_UNIVERSITY_URL
With tools like git (or rsync, etc.), making multiple copies of a project is very easy. Redundancy protects against some risks, but including checksums (and any other relevant metadata) also makes content-addressable searching possible. Even if all of the URLs in your paper eventually become defunct, someone reading the paper in the future may be able to find your data by searching for the file's hash.
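As a rough sketch of how the checksum lines above could be generated, here is a small Python example. The function names `file_digests` and `citation_block` are made up for illustration, and the choice of algorithms is an assumption (SHA-256 is added alongside the MD5 and SHA1 shown in the template); only the standard `hashlib` module is used.

```python
import hashlib
import os


def file_digests(path, algorithms=("md5", "sha1", "sha256"), chunk_size=1 << 20):
    """Return {algorithm: hex digest} for the file at `path`.

    The file is read in chunks so large archives do not need to fit in memory.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def citation_block(path, urls):
    """Format a plain-text availability statement (hypothetical layout)."""
    lines = [
        "The code and simulation results are available as an archive named:",
        "    " + os.path.basename(path),
        "The file has the following checksums:",
    ]
    for name, digest in file_digests(path).items():
        lines.append("    {}: {}".format(name.upper(), digest))
    lines.append("The file was originally hosted at:")
    lines.extend("    " + url for url in urls)
    return "\n".join(lines)
```

Calling `print(citation_block("foo_project-2018-06-16.zip", list_of_urls))` on the real archive would then give text that can be pasted into the paper, with `list_of_urls` being whatever mirrors you actually set up.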

The hosting service isn't the (primary) problem; the paper needs to include a pointer that is more robust than a single reference to a single path on a single server.



Unfortunately most papers (that I have been involved with at least) have very specific size requirements and limits. Spending that much space on sharing the files is not something that can happen. Anything more than "[things] are available online at [URL]" is just going to take up too much space. (Although, from now on I might try to include two URLs)

Besides, if/when GitHub ever shuts down, you know people will end up going through and cloning every repo on the system and archiving them at some similar URL. So everyone can just visit it at notgithub.com/user/repo instead.

Now, I agree that something like that would be awesome. I don't think it's that easy to do, though; it's just not practical. GitHub and GitLab are easy: you just toggle a switch to make a repo public after the paper has been published. Other services may not be as easy, especially setting up university URLs and keeping them private until after publication. For tech-savvy people it might be fine, but not for everyone.

Maybe it would be better if universities either hosted their own git repos, or even banded together and hosted something as an educational service, rather than having a for-profit entity be where it is stored?

The contact info for any author involved with the paper is going to be right there at the beginning of the paper. So IF something were to ever happen to the one link, contact is not going to be hard. Yes, there are risks, but it still gets the job done for the time being.


This is what DOI handles are for[1].

That system sucks (it's always going down, or the handle goes to some university's crappy DSpace system which doesn't work properly, or something).

I'll remain grateful whenever I see a GitHub URL.

[1] https://en.wikipedia.org/wiki/Handle_System


Nice idea, but more robust checksums than MD5 and SHA1 (SHA-256, for example) would be better, especially for archives containing code.
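To make that concrete: MD5 and SHA-1 both have publicly demonstrated collision attacks, so a reader who wants assurance that an archive was not swapped out is better served by a SHA-2 digest. A minimal sketch of how a reader might check a downloaded copy against the digest printed in a paper, using Python's standard `hashlib`; the function name `matches_published_digest` is hypothetical and SHA-256 is an assumed default, not something the original post specified.

```python
import hashlib


def matches_published_digest(path, expected_hex, algorithm="sha256", chunk_size=1 << 20):
    """Return True if the file at `path` hashes to `expected_hex`.

    `expected_hex` is the hex digest as printed in the paper; surrounding
    whitespace and letter case are ignored.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == expected_hex.strip().lower()
```

If this returns False for a copy found on some mirror, the file is either corrupted or not the one the authors published, wherever it happens to be hosted.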


Sounds like a labor-intensive version of IPFS.



