
Removing ssh ties the hands of your operations team when an outage hits because you have removed their debugging console.

With this change, the team loses access to all the standard OS tools which are useful in the event of a service outage for debugging. What usually ends up happening then is that the standard tools are replaced with substandard and patchy implementations.

If your infrastructure is drifting then a better solution is to restart it more regularly. If you find staff are still making untracked changes then that's not a technology problem - it's a people problem.



A better alternative is to kill entire servers every once in a while and reprovision them. That way, console access effectively becomes read-only, since any changes will soon be lost, but you can still debug stuff.
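The "kill and reprovision" policy can be sketched as a tiny victim picker; this is a toy illustration, not any particular tool's API, and `fleet` / `min_healthy` are made-up names:

```python
import random

def pick_victim(fleet, min_healthy=2):
    """Pick one server to terminate and reprovision, but refuse to
    drop the fleet below a minimum healthy count. `fleet` is a list
    of instance IDs (hypothetical; adapt to your provider's API)."""
    if len(fleet) <= min_healthy:
        return None  # too few servers to safely kill one
    return random.choice(fleet)
```

A scheduler would call this periodically, terminate the returned instance, and let your provisioning system replace it from source control.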




I hope that's a manual process, because I lose SSH sessions all the time and I'd hate for it to fire off while I was debugging.


Was the talk recorded?


Sorry, it doesn't look like it!

I did find the same presentation but from another event: https://www.youtube.com/watch?v=d6pU4C4PVoY

However, I cannot recall whether this was a subject of the talk or whether I asked it in Q&A. I think it was the latter.


+1. Netflix made Chaos Monkey specifically for that (not that anyone on HN couldn't code an equivalent; just that it's a technique proven at scale).


For that matter, you could just take away the "server" abstraction altogether and run isolated processes in production instead, each having an ephemeral filesystem in the form of a container.

SSH access would be more like "heroku run bash" where you don't log into an existing instance, you log into a fresh ephemeral instance that just runs sshd and gets destroyed on logout.

Debugging running instances is more interesting though, and I think is possible if each container has an SSH daemon with ephemeral keys, and a central authority they register with that you can hop through to get shell on any instance.
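The "central authority" idea above can be sketched as a minimal in-memory registry; this is a toy model of the concept, not a real service, and all the names (`SSHRegistry`, `register`, `lookup`) are invented for illustration:

```python
class SSHRegistry:
    """Toy sketch: each ephemeral container registers its ephemeral
    SSH host key and address on boot, deregisters on shutdown, and
    ops hop through the registry to reach any live instance."""

    def __init__(self):
        self._hosts = {}  # instance_id -> (host_key, address)

    def register(self, instance_id, host_key, address):
        self._hosts[instance_id] = (host_key, address)

    def deregister(self, instance_id):
        self._hosts.pop(instance_id, None)

    def lookup(self, instance_id):
        """Return (host_key, address) for a live instance, or None."""
        return self._hosts.get(instance_id)
```

In practice the registry would also vouch for the ephemeral host keys (e.g. by signing them), so clients never see trust-on-first-use prompts.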

Ops is a really fun job to be in when you stop thinking of servers as "things that people log into".


Absolutely correct. It wasn't clear from my reply, but my intention is that servers should reset to a known config on restart. (Edit: spelling)


This.

It's simply: never perform installs, maintenance, or upgrades through manual means.

Do NOT cut off your SSH channel. You'll be kicking yourself when you want to sftp a file off of a box pretty quickly. And you might need it in an emergency.

Focus on good automation - whether immutable systems or otherwise - and be sure you can rebuild your infrastructure completely from source control. Do this by refusing to make manual changes, and by knowing that any manual changes will often be clobbered by the system.
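The "manual changes get clobbered" discipline implies you can detect drift between source control and what's running. A minimal sketch, modeling configs as flat dicts (a simplification; real tools compare rendered manifests or package states):

```python
def detect_drift(desired, actual):
    """Return {setting: (desired_value, actual_value)} for every
    setting where the running system differs from source control.
    `desired` and `actual` are flat {setting: value} dicts."""
    return {
        key: (desired.get(key), actual.get(key))
        for key in set(desired) | set(actual)
        if desired.get(key) != actual.get(key)
    }
```

A reconciler would then either alert on a non-empty result or simply reapply `desired`, clobbering the manual change.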


While I agree (my boss and I are actually having this conversation presently), I would argue that if you're in any sort of cloud environment (which this article takes as its setting), then an ops S3 bucket or similar for shipping files off the box works. Although that most likely still means SSH to get there, it could be accomplished without it.

FTR, I'm in favor of restricting SSH/RDP, perhaps even disabling it in the security groups (AWS) by default, and enabling it as needed for ops troubleshooting. If we have logging and monitoring in place cleanly enough, then we should be able to troubleshoot the majority of issues off-instance. However, that is almost NEVER the case (even if you have the tools, there are still too many edge cases).

Case in point: I've been troubleshooting a Windows CPU usage issue recently. Our monitoring wasn't catching all of the disparate processes, so I couldn't see what was eating it up. Plus I would pull the node from the ELB prior to investigating, which then caused it to no longer peg CPU. Had I not been able to log in while "live", I may never have seen what it was.
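The "disabled by default, enabled as needed" policy fits naturally into a context manager that opens SSH only for the duration of a troubleshooting session. This is a toy stand-in, not boto3: `security_group` is any object exposing a mutable `ingress` set of open ports, and you'd swap in real authorize/revoke API calls:

```python
from contextlib import contextmanager

@contextmanager
def ssh_window(security_group, port=22):
    """Open the SSH port for one troubleshooting session, then close
    it again even if the session raises. `security_group.ingress` is
    a set of open ports (a stand-in for the real cloud API)."""
    security_group.ingress.add(port)
    try:
        yield security_group
    finally:
        security_group.ingress.discard(port)
```

The `finally` clause is the point of the design: the port closes again even when debugging ends with an exception or an abandoned session.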

Obviously, in order to operate this way, you need extreme reliability in your monitoring/logging tools. Sadly, that is harder to achieve than it should be.




