Immutable Infrastructure: No SSH (boxfuse.com)
89 points by axelfontaine on July 13, 2015 | hide | past | favorite | 83 comments


It's quite silly: they mix up two completely separate issues, a deployment problem and a people problem. Image deployments are not new - computing clusters have done it for ages. Immutable deployments aren't new either - actual immutable filesystems booted over the network have been around for many, many years.

Yes, that's going to prevent drift at first. No, it's not going to prevent people from doing silly stuff. At least ssh allows them to do silly stuff in a "normal" and visible way and doesn't stop you from debugging issues when you actually need it. If you enforce no-login without teaching people, they'll just reinvent it... badly - for example by making executable configurations, or pseudo-web-shells.

And there's one big issue with images which will cause a lot of friction in real teams - they're slow to build, and even slower to test. If you're doing it properly, it means first testing locally on some approximate environment, then handing the package over to a system which will test the whole deployment, which takes ages to boot up all components. (Time scales with system complexity - it's not ages for a single webserver, of course.) Anyone developing something related to integrating services will, unfortunately, be frustrated even further.

> Enforcing immutability [...] You need to prevent log in.

No... If you need immutability, you make your system actually immutable and then audit logins when logins are needed. Preventing logins is just punishing people and slowing them down when they need it the most, because you don't trust them.

> Vulnerabilities like ShellShock simply vanish

Said by a person who does not understand why ShellShock was an issue, or why it doesn't matter whether a user who already has local access uses ShellShock.


> Vulnerabilities like ShellShock simply vanish

I'm actually surprised at the naivety of whoever wrote this. I'm not an ops person at all, and even I understood enough about ShellShock to know that this isn't correct at all.


As someone who worked in a company that didn't always have access to the customer machine that ran our software, I can confirm.

There might have been a magic page in the software that allowed raw SQL queries...


Removing ssh ties the hands of your operations team when an outage hits because you have removed their debugging console.

With this change, the team loses access to all the standard OS tools which are useful in the event of a service outage for debugging. What usually ends up happening then is that the standard tools are replaced with substandard and patchy implementations.

If your infrastructure is drifting then a better solution is to restart it more regularly. If you find staff are still making untracked changes then that's not a technology problem - it's a people problem.


A better alternative is to kill entire servers every once in a while and reprovision them. That way, console access effectively becomes read-only, since any changes will soon be lost, but you can still debug stuff.


[deleted]


I hope that's a manual process, because I lose SSH sessions all the time and I'd hate for it to fire off while I was debugging.


Was the talk recorded?


Sorry, it doesn't look like it!

I did find the same presentation but from another event: https://www.youtube.com/watch?v=d6pU4C4PVoY

However, I cannot recall whether this was a subject of the talk or whether I asked it in Q&A. I think it was the latter.


+1. Netflix made Chaos Monkey specifically for that (not that anyone on HN couldn't code an equivalent, just that it's a technique used at scale).


For that matter, you could just take away the "server" abstraction altogether and run isolated processes in production instead, each having an ephemeral filesystem in the form of a container.

SSH access would be more like "heroku run bash" where you don't log into an existing instance, you log into a fresh ephemeral instance that just runs sshd and gets destroyed on logout.

Debugging running instances is more interesting though, and I think is possible if each container has an SSH daemon with ephemeral keys, and a central authority they register with that you can hop through to get shell on any instance.

Ops is a really fun job to be in when you stop thinking of servers as "things that people log into".


Absolutely correct. It's not clear from my reply, but my intention is that servers should reset to a known config on restart. (Edit: spelling)


This.

It's simply "don't ever perform installs, maintenance, or upgrades by manual means".

Do NOT cut off your SSH channel. You'll be kicking yourself pretty quickly when you want to sftp a file off a box. And you might need it in an emergency.

Focus on good automation - whether with immutable systems or otherwise - and be sure you can rebuild your infrastructure completely from source control. Do this by refusing to make manual changes, and knowing any manual changes will often be clobbered by the system.


While I agree (my boss and I are actually having this conversation at present), I would argue that if you're in any sort of cloud environment (which this article assumes), then just having an Ops S3 bucket or similar to send the files to works. That of course still most likely means SSH to get there, but it could be accomplished without it.

FTR, I'm in favor of restricting SSH/RDP, perhaps even disabling it in the security groups (AWS) by default, and enabling it as needed for ops troubleshooting. If we have logging and monitoring in place cleanly enough, then we should be able to troubleshoot the majority of issues off-instance. However, that is almost NEVER the case (even if you have the tools, you still have too many edge cases). Case in point: I've been troubleshooting a Windows CPU usage issue recently. Our monitoring wasn't catching all of the disparate processes, so I couldn't see what was eating it up. Plus, I would pull the node from the ELB prior to investigating, which then caused it to no longer peg the CPU. Had I not been able to log in while "live", I may never have seen what it was.

Obviously, to perform such a function, you need incredible reliability in your monitoring/logging tools. Sadly, that is more difficult to achieve than it should be...


"Sign up for your Boxfuse account."

What? And disturb the tranquil immutability of boxfuse.com?


It actually gets swapped with a new Boxfuse.com on write.


I admit I lol'd.


We do something similar with Nix (https://www.nixos.org). We have a CI server (Hydra) which creates a closure containing our software and all its dependencies, which we then upload to a network of AWS machines created with a few Python scripts.

This works fairly well, but from experience we do need to keep an SSH setup on the machines. Just last week we had a load spike on a production server which was caused by a software bug, triggered by a usage pattern that was unexpected and not covered in our tests. If we did not have SSH (and access to the CLI tools on that machine) we would not have been able to debug this unexpected problem. I guess what I'm saying is that as long as software has bugs and hardware has glitches, we'll sometimes need access to low-level tools which can help us figure out the cause of these unexpected scenarios.


Exactly what I was thinking as I read this. Just use nix. Forget the dogma.


Somewhere in the infrastructure there have to be mutable servers - databases and log servers at the least. I can't think of many architectures where everything can be a closed appliance.

Are they suggesting that every time you need to apply a security patch, you'll have to build a new server and cut your traffic over to it? That sounds like a lot of DNS and Load Balancer config updates (mutable!) even if the newly patched server builds are 99% automated.

The idea of black-box appliances has been around a long time, and it has its place in the modern infrastructure, but I'm not sure it really solves the problems they are trying to solve (which sound more like change management issues).


This type of image is primarily designed for 12-factor apps where all persistent state is kept outside the instance in some geo-redundant highly available system like Amazon RDS or S3.


The problem with Amazon RDS, and they freely admit it, is that it's not built for the scale at which some companies operate. If you need advanced features of your DB, or need to operate at large scales (the magnitude of scale differs by RDS engine), you'll end up wanting to run your own DB on an EC2 instance.


When you have enough business that RDS is too small, you are not going to be heartbroken about paying for a bespoke setup.

Until then, ignorance is bliss.


Well, ignorance also tends to fuel any number of flamewars...

In my experience with RDS as a MySQL admin, Amazon was not using sane defaults for quite some time, which resulted in a remarkably non-performant product. Thankfully, they listened to feedback, so over the course of a few years the MySQL RDS instances started to become much more dependable and useful. I'd be happy to use one today, which is something I couldn't have said too long ago.

And even with RDS, having a DBA available (even just as a consultant) is still quite useful - it's unrealistic to expect your developers to write ideal SQL. You can get a long ways with RDS if your queries and schemas are well tuned.


RDS is not "geo-redundant".

It is a multi-AZ DB, which is very different from multi-region. (Availability Zones (AZs) are within a DC; regions are separate DCs.)


"Availability Zones (AZs) are within a DC" => not really

See this Re:Invent 2014 talk about AWS Innovation At Scale http://mvdirona.com/jrh/talksandpapers/JamesHamiltonReInvent... => Slide 9: "Each AZ is 1 or more DC, No data center is in two AZs, Some AZs have as many as 6 DCs"


Sorry - you are right; at Amazon's scale they are within a region. But that is still not geo-redundant. If a hurricane takes out Virginia, you are still going to lose your data.


AZs are typically 5-15 miles apart. It would need to be a pretty large hurricane that far inland to take out all the various datacenters.

Also, just to clarify, you can do multi-region RDS read-replicas for MySQL: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_...


The Virginia DCs are in different flood plains and on different power grids. A hurricane remnant did hit one year (probably 2005), and a tornado took out power to one of them.


So I've considered this before, and I know it is coming from a good place. It seems great for the good times. At Clarify.io we use immutable infrastructure and we don't log in to boxes. Except when there are problems. And there are problems. Always.

You don't want to have to fight your way in in the middle of a disaster.

At Clarify.io we've considered having an on-login event fire when an admin uses SSH to log in to a box. This would schedule the server for termination within 24 hours. This gets you the best of all worlds:

- Gives tools to debug and restore service when there is a failure

- Forces admins to not depend on SSH to bring a system up

- Replaces servers when they are potentially drifting

- Encourages a "cattle" mentality about the servers
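A minimal sketch of such an on-login hook (hypothetical: the tag name, instance id, and reaper integration are all assumptions, and the AWS call is shown as a dry run that prints the command instead of making it):

```shell
#!/bin/sh
# Hypothetical on-login hook: tag the instance "tainted" with a
# terminate-by timestamp so a reaper job (e.g. Janitor Monkey) can
# replace it within 24 hours. Dry run: prints the AWS CLI call.

taint_instance() {
    # A real version would fetch the id from EC2 instance metadata:
    #   curl -s http://169.254.169.254/latest/meta-data/instance-id
    instance_id="$1"
    deadline=$(( $(date +%s) + 86400 ))   # now + 24h, as a Unix timestamp

    echo "aws ec2 create-tags --resources $instance_id --tags Key=tainted,Value=$deadline"
}

taint_instance "i-0123456789abcdef0"
```

Dropped into /etc/profile.d/ (or wired to sshd's session start), this would run on every interactive login.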


I really like this idea. Would be keen to see what your orchestration code looks like to handle that.


We already use JanitorMonkey. It would likely involve tagging the instance in AWS with a tainted tag, and adding code to JanitorMonkey to do the normal mark-and-sweep.


That's awesome! I don't know how I've missed JanitorMonkey amongst their tools, but that looks great. Definitely giving it a whirl.


"Vulnerabilities like ShellShock simply vanish" - eh .... not really.

ShellShock was not just about SSH; it affected any program that passed data around in environment variables and then invoked bash.

Immutable infrastructure is a great ideal, but it has various downsides. Sometimes people need access to the raw system to do debugging. While "shoot the node and boot a new one" has advantages, it just moves the problem down the line.

The BBC solution is a nice balance.


This is just dumb as nails.

Now I can't login to the production server in order to strace the process to figure out why production is misbehaving.

And don't say you should always replicate the issue in preprod, sometimes you simply can't. Some issues only emerge at prod scale+load+latency that you don't have the financial resources or technical ability to be able to exactly replicate in a testbed.

You are drunk on too much koolaid, go home.


That's great if strace et al. can actually tell you why production is misbehaving. Of course, strace doesn't exist in a vacuum; it changes the way your code executes. Everything is going to be slower, and race conditions and locks may not present themselves so readily. Then you also have the added stress of screwing around with a live server - what if you do something that messes it up worse?

Sometimes it makes more sense just to kill that server and add a fresh one to the pool. Better yet if that process is automated. Good logging can actually be more valuable than mucking about in the live environment (or as I call it, panic mode).


Uhh... you know how we handle drift? We manage the update repository ourselves. New packages don't get released until we are ready to release them...

You don't need to limit ssh to handle state drift. The article's writer doesn't seem to understand the difference between user access and software deployment.


This is what happens when dev teams hate ops teams. Troubleshooting a production bug without a terminal is a recipe for lengthy outages.


I worked on Cloud Foundry buildpacks for 7 months. I was working at the Pivotal Labs offices in NYC and I can tell you this for free: devs want SSH access too.

When a box dies in staging, it's nice to learn why.


Yes, exactly.

This is an old-school ops team that learned about "devops" and "immutable infrastructure" and figured now they have all the necessary arguments to kick the devs off the servers once and for all. Not progress.


> Trouble starts the minute you start relying on commands like this:

    sudo apt-get install mypkg

So what's wrong with `sudo apt-get install mypkg=<version>`?


Do you do this every time? That must be cumbersome; there's no equivalent of Gemfile.lock and bundler for apt-get, is there?

(Or is there? Serious question)


https://help.ubuntu.com/community/PinningHowto

My guess is that apt pinning/holding predates Gemfile.lock by quite a few years :)
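A minimal pinning sketch, for the record (package name and version are placeholders; the file normally lives in /etc/apt/preferences.d/, but a local directory is used here to keep the example side-effect free):

```shell
# Write an apt preferences entry pinning mypkg to an exact version.
PREF_DIR="${PREF_DIR:-./preferences.d}"   # normally /etc/apt/preferences.d
mkdir -p "$PREF_DIR"

cat > "$PREF_DIR/mypkg" <<'EOF'
Package: mypkg
Pin: version 1.2.3
Pin-Priority: 1001
EOF
# A priority above 1000 forces this version, even if it means a downgrade.
# Alternatively, freeze at whatever is installed: sudo apt-mark hold mypkg
```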


This and Ansible are non-answers to me.

The functionality of bundle install and bundle update is not replicated by apt pinning or holding. How do you roll back a bad update to your pinned packages? Put back the old Gemfile.lock and bundle (again)? Does that really work?

I guess I don't really use Ansible so I don't know, but it's very easy to roll back a bundle update if you keep your Gemfile.lock in revision control too. I've used apt pinning before though, and I'm quite sure rollback of a package update that has a dependency conflict with the old version is not as easy.


> I've used apt pinning before though, and I'm quite sure rollback of a package update that has a dependency conflict with the old version is not as easy.

Aptitude has always handled that just fine for me, downgrading the dependencies as well if needed.


You're still losing the information about what you had pinned before by unpinning, unless you can dig it out of dpkg.log after the fact. This is almost nothing like having Gemfile.lock and committing it to the project code repo, saving each changeset after every "bundle update".

It is an issue with the package manager; as the python/virtualenv guy suggested, apt can't keep parallel versions installed at the same time and link the locked version into each separate project that needs a specific version. The conflicting versions do conflict. This is not a problem for bundler, since a given project is not likely to need two separate conflicting versions in a single bundle exec.

There is no "project" concept at all in dpkg. You just have to maintain wholly separate environments for those conflicting dependencies, on separate machines, if they show up in cases where you really need both versions.

Or work with containers instead, which is arguably not really a different solution than already proposed.


> You're still losing the information about what you had pinned before by unpinning, unless you can dig it out of dpkg.log after the fact.

Oh, that's why I mentioned Ansible in my previous comment. I have a version controlled file with:

  - apt: pkg=lib1 version=11.6
  - apt: pkg=app version=33
  ...
To roll back, I can just revert to the right commit and re-run ansible-playbook.

As for multiple parallel versions installed: sure you can. All you need to do is package them using different prefixes. What you can't do is blindly use packages from the main repos; those are built to support the OS programs, not yours. But even those are often made to allow multiple parallel versions: I have Python 2.7 and Python 3 installed, both from Debian's main repo.


Instead of using public repos and pinning the version you want from them, you can run your own repo and import the versions you want. Running your own mirror is a good idea anyway, and there are tools like http://www.aptly.info/ which simplify the process of locking packages in environments.


Well... in Python, we do that the first time around. After that, we get a file saying exactly what version was installed, for future installs. It'd be nice to have something like that for package management... Thinking about it, the issue of drift seems to be more of a problem with the packaging system than with ssh...
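Something close to that does exist for dpkg, though it's not built into apt the way Gemfile.lock is built into bundler. A sketch (the lock-file name is made up):

```shell
# "pip freeze" for Debian packages: record exact installed versions so a
# future install can replay them.
if command -v dpkg-query >/dev/null 2>&1; then
    dpkg-query -W -f='${Package}=${Version}\n' > packages.lock
else
    # Illustrative output shape, for non-Debian systems:
    printf 'bash=5.1-2\ncurl=7.74.0-1\n' > packages.lock
fi
# Replay later with: xargs -a packages.lock sudo apt-get install -y
```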


Usually configuration management tools like Ansible are used.


Nothing, in the heat of the moment. The problem comes six months down the road when something happens and you need to build that box again (horizontal scaling, box death, whatever). Your provisioning script is still pointing at (correct version - 1).


The new box will be brought up to netboot via PXE, which runs whichever automated installer your OS uses, including assigning IPs, partitioning the drives, etc. The script then points to the proper configuration management system you use, which chooses the proper script for whatever the box is -- web frontend, database, etc -- and then finishes out the install. You're never a version behind, because your installer is maintained with every other bit of configuration. A good deployment system means new boxes are never out of date.
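As a sketch of the netboot step (the kernel paths, preseed URL, and hostname are placeholders, and the exact syntax depends on your PXE setup):

```
# pxelinux.cfg/default - boot new machines straight into an automated install
DEFAULT install
LABEL install
  KERNEL debian-installer/amd64/linux
  APPEND initrd=debian-installer/amd64/initrd.gz auto=true url=http://deploy.example.com/preseed.cfg
```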


mypkg-v1 depends on libsomething. mypkg-v2 doesn't depend on it anymore. Now you have machines with libsomething leftovers.

Another issue with Debian packages is that dpkg essentially removes the old package before installing the new one. If you're packaging assets served by nginx/apache, then there is an availability issue in the meantime.


> mypkg-v1 depends on libsomething. mypkg-v2 doesn't depend on it anymore. Now you have machines with libsomething leftovers.

That's what "apt-get autoremove" is for.


Both problems are only an issue if you don't have automated deployment or don't have more than one server. Otherwise, you can simply drop a service from a load balancer / cluster when doing an upgrade, and add it back in once it's back up and passing health checks.

Config drift can be avoided by using the same process to create new VMs and delete old ones, which is also part of confirming that your redundancy setup really works.
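The drain/upgrade/re-add cycle might look like this (shown as a dry run; "lb" and "healthcheck" are hypothetical stand-ins for your load balancer's real API):

```shell
# Rolling upgrade sketch: take each node out of rotation, upgrade it,
# wait for health checks, then put it back. Prints the steps only.
rolling_upgrade() {
    for host in "$@"; do
        echo "lb drain $host"                               # stop new traffic
        echo "ssh $host sudo apt-get install -y mypkg=2.0"  # upgrade in place
        echo "healthcheck --wait $host"                     # block until healthy
        echo "lb enable $host"                              # back into rotation
    done
}

rolling_upgrade web1 web2 web3
```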


Interesting; so, Boxfuse is one of these ("micro" kernels? What's the name for when you run in a hypervisor without a real OS, so no user isolation, no process isolation, etc.?) OS/image-creation tools...

I didn't realize they were production ready yet, but sounds pretty spiffy.

What's interesting is that I think Boxfuse is taking what most people would call a weakness ("if stuff breaks, I can't SSH into the shell and fix it" - because not only is that a 'bad idea', there just isn't a Linux OS there to SSH into anyway) and calling it a strength.

Nice spin. :-)

If I tried Boxfuse, I'd probably look at embedding the Apache Java-based SSHD server - not for app setup, but for debugging... assuming its terminal session would provide useful information in the Boxfuse "there is no OS" environment.

(The boxfuse site uses the term "Secure Micro OS, Few MBs", so if I'm misinterpreting what their platform does, someone please correct me.)


On *BSD kernels you can use securelevel to achieve proper immutability to some extent (unchangeable pf rules, read-only raw disk devices, etc.). Disabling ssh only gives you the guarantee that ssh is going to be disabled, and that is quite different from immutability.


> Vulnerabilities like ShellShock simply vanish.

What? No, they don't! ShellShock was about executing unintended code in a bash process where you control an environment variable but not stdin. If you control stdin, there's no need to sneak a command into an environment variable: you can just type it into the shell.

Apart from this, the article makes no argument for removing SSH: it only argues for restricting access to privileged userids. That is to say, the standard best practice for decades. Root should be reserved for special situations, not used by default.


This is so the wrong question. Immutable infrastructure does not require nor warrant restricting the form of access you have to your services.

Having everything build-oriented (basically building the builder on every deployment, then using it to build the service) is what prevents drift.

Running on a read-only file system is what guarantees immutability and prevents drift.

Not allowing SSH is plugging a single hole which guarantees nothing. Can the app-server write on its own disk? Can it modify configuration files? Maybe even code?

As a developer, I want to be able to write applications that are not specific to any one deployment method. I want to be able to change my mind and host this on a different kind of infrastructure. These kinds of limits force a tight coupling between your application, its tooling, and the deployment method.

Of course, the read/write domain has to be explicit and limited, but this is good practice anyway. Running on a distributed storage grid based on something like Ceph can allow you to address this domain not only explicitly but also in a fault-tolerant and scalable way.

When you cut off SSH you are cutting off half of the tooling we have and you make everything overly complicated. I want to be able to tail any log on any host. I want to be able to scp for inspection any file to my dev machine.

At platform.sh (a PaaS running on a distributed grid of micro-containers) we run a build-oriented immutable infrastructure, so there is no - there can be no - infrastructure drift.

We proxy SSH so security stays the responsibility of the orchestration layer not any single host or service. This means we can filter by role any connection to any service, but developers can still just use the tools that work.


This is not even a problem if you're using Ansible (and possibly others): you just say you want the latest version of the package installed, and Ansible will do an `apt-get update` for you as a preliminary step, then install/update the package if needed.

So basically, all the machines you're running Ansible on will have the latest version (if that's what you want, of course).


What is drift? Local edits, updates not applied, literally ntpd/ntpdate not updating the time. Your host is misconfigured; it's broken.

How do you deal with systems breaking? By monitoring for them. It also helps to make them 'disposable', but this is not a replacement for monitoring. Good monitoring will tell you when time is off, when disks are about to fill up, when permissions are bad and when packages aren't up to date.

Removing ssh does not give you monitoring. It does not prevent system state from changing. It really doesn't do anything but remove a very secure and simple method of communication and file transfer.

Minimal/disposable images? That's fine. It won't make your servers or services immutable. One day your hardware's going to catch on fire and your first clue won't be the smoke billowing into the neighboring rack, it will be the errors in your logs.


I find the better pattern here is to limit and discourage SSH, and then monitor and log the hell out of it. There are numerous tools out there that can centralize any actions taken on a host, sending them to a centralized log. Outright removing all SSH puts you in a rough spot if things go south with some piece of software that your system monitoring / centralized logging doesn't cover 100%, and it makes it way harder to do things like strace a process.
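On Linux, one hedged sketch of "log the hell out of it" (assumes auditd is running; these are snippets, not a complete config):

```
# /etc/ssh/sshd_config
LogLevel VERBOSE            # logs the key fingerprint used for each login

# /etc/pam.d/sshd - record keystrokes of root sessions to the audit log
session  required  pam_tty_audit.so enable=root
```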


Or if one of the server's problems caused logging to not start correctly :)


Which is certainly a valid case, but I would argue that if that IS the case, and it cannot auto-recover, then that's a good candidate for the instance to self-terminate.


If logging never started, then your service heartbeat script should have caught it and raised an alert. Logging being down is Kinda Important.


Also, make it part of the process that each time ssh is used, logging or monitoring is set up to catch what it was used for, much the same way that a test is added when a bug is discovered.


A common response on this thread has been that if you take away SSH, you can't log into individual machines to debug. But you know the other situation where you can't do that? When your software is running on your users' machines rather than yours, i.e. mobile and desktop apps. I agree with the position of the OP, that it's better to make the software robust by design and through testing.


> it's better to make the software robust by design and through testing.

But that's a false dichotomy; having SSH in no way prevents you from designing and testing your software just as well.


I don't know, I think there's a legitimate argument to be made that some tools can become a crutch and a detriment. At one of my consults all the developers lean heavily on IDE debugging at the expense of developing well-defined contracts and interfaces. Nobody blinks twice at having a ton of threads mutating global variables because they have all these great debugging and tracing tools.


If you're unable to debug due to the nature of your software, then that sucks, but it's life. And notice that a lot of devs in that situation try to fix it with various remote debugging tools.

The article seems to advocate putting yourself into that situation when you don't have to, which is silly.


The hard reality of no remote debugging in software deployed to end-user machines leads a wise developer to make potentially inconvenient trade-offs to compensate, e.g. reducing the chance of failure by minimizing dynamism. I believe, and I think the OP would agree, that applying the same discipline to server-side software leads to more robust applications, whereas if you know you have the crutch of remote debugging (via SSH or otherwise) when something goes wrong, you may well be more lax about preventing failures in the first place.


So the idea is to force yourself to make completely unnecessary and "potentially inconvenient trade-offs"? A wise developer should use an appropriate level of robustness without tying one hand behind his own back.

And when something unexpected goes wrong anyway it's going to be great when a problem that might've taken an hour to diagnose instead takes two weeks, because SSH is a crutch.


And if you are well funded, that may not be an issue. However, many people don't live in a world where they can afford to build the tools needed before they determine the viability of their product.


I ssh into machines for many reasons other than to change them. In fact most changes are done remotely with Ansible.


I'm a fan of immutable infrastructure but there is still a need to get access to a container now and again to troubleshoot things. We don't run SSH inside the containers but it's easy enough to write some tooling to find a host running an instance and use docker exec to get a shell on it.


What about logs? What about looking into the server to find out what is going on?


Application logs should be shipped over the network to either your own ELK infrastructure or some hosted solution like Loggly, Logentries or Papertrail. Boot logs can be obtained from your infrastructure (like EC2 instance logs).
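For system logs, a single rsyslog forwarding rule is often enough to get everything off-box (the collector hostname is a placeholder):

```
# /etc/rsyslog.d/50-forward.conf - forward all facilities to a central host
*.*  @@logs.example.com:514    # @@ = TCP; a single @ would use UDP
```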


And when that stops working for some reason? Of course, the real solution is to have out-of-band management.


I have yet to see a decent answer for this question when people talk about 'immutable' hosts.


The network is 100% reliable, always. The internet, doubly so (it's so big, you know). Maybe you missed the memo.


Well as a "circuit switched bigot" Vint must have left me off the mailing list :-)

I used to do OSI international interconnect support and testing, for my sins, many moons ago.


May I suggest another "innovation" - statically link everything into a single blob and run it under systemd.

The building scripts should be written in JavaScript, of course. Package it as a startup. Become a millionaire.


Docker?


Still requires an underlying OS which may need to be patched or upgraded.



