
Removing ssh ties the hands of your operations team when an outage hits because you have removed their debugging console.

With this change, the team loses access to all the standard OS tools which are useful in the event of a service outage for debugging. What usually ends up happening then is that the standard tools are replaced with substandard and patchy implementations.

If your infrastructure is drifting then a better solution is to restart it more regularly. If you find staff are still making untracked changes then that's not a technology problem - it's a people problem.



A better alternative is to kill entire servers every once in a while and reprovision them. That way, console access effectively becomes read-only, since any changes will soon be lost, but you can still debug stuff.
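The "kill and reprovision" policy can be sketched as a tiny victim picker; this is a toy illustration, not any particular tool's API, and `fleet` / `min_healthy` are made-up names:

```python
import random

def pick_victim(fleet, min_healthy=2):
    """Pick one server to terminate and reprovision, but refuse to
    drop the fleet below a minimum healthy count. `fleet` is a list
    of instance IDs (hypothetical; adapt to your provider's API)."""
    if len(fleet) <= min_healthy:
        return None  # too few servers to safely kill one
    return random.choice(fleet)
```

A scheduler would call this periodically, terminate the returned instance, and let your provisioning system replace it from source control.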




I hope that's a manual process, because I lose SSH sessions all the time and I'd hate for it to fire off while I was debugging.


Was the talk recorded?


Sorry, it doesn't look like it!

I did find the same presentation but from another event: https://www.youtube.com/watch?v=d6pU4C4PVoY

However, I cannot recall whether this was a subject of the talk or whether I asked it in Q&A. I think it was the latter.


+1. Netflix made Chaos Monkey specifically for that (not that anyone on HN couldn't code an equivalent; just that it's a technique proven at scale).


For that matter, you could just take away the "server" abstraction altogether and run isolated processes in production instead, each having an ephemeral filesystem in the form of a container.

SSH access would be more like "heroku run bash" where you don't log into an existing instance, you log into a fresh ephemeral instance that just runs sshd and gets destroyed on logout.

Debugging running instances is more interesting though, and I think is possible if each container has an SSH daemon with ephemeral keys, and a central authority they register with that you can hop through to get shell on any instance.
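The "central authority" idea above can be sketched as a minimal in-memory registry; this is a toy model of the concept, not a real service, and all the names (`SSHRegistry`, `register`, `lookup`) are invented for illustration:

```python
class SSHRegistry:
    """Toy sketch: each ephemeral container registers its ephemeral
    SSH host key and address on boot, deregisters on shutdown, and
    ops hop through the registry to reach any live instance."""

    def __init__(self):
        self._hosts = {}  # instance_id -> (host_key, address)

    def register(self, instance_id, host_key, address):
        self._hosts[instance_id] = (host_key, address)

    def deregister(self, instance_id):
        self._hosts.pop(instance_id, None)

    def lookup(self, instance_id):
        """Return (host_key, address) for a live instance, or None."""
        return self._hosts.get(instance_id)
```

In practice the registry would also vouch for the ephemeral host keys (e.g. by signing them), so clients never see trust-on-first-use prompts.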

Ops is a really fun job to be in when you stop thinking of servers as "things that people log into".


Absolutely correct. It wasn't clear from my reply, but my intention is that servers should reset to a known config on restart. (Edit: spelling)


This.

It's simply: never perform installs, maintenance, or upgrades through manual means.

Do NOT cut off your SSH channel. You'll be kicking yourself when you want to sftp a file off of a box pretty quickly. And you might need it in an emergency.

Focus on good automation - whether immutable systems or otherwise - and be sure you can rebuild your infrastructure completely from source control. Do this by refusing to make manual changes, and by knowing that any manual changes will often be clobbered by the system.
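The "manual changes get clobbered" discipline implies you can detect drift between source control and what's running. A minimal sketch, modeling configs as flat dicts (a simplification; real tools compare rendered manifests or package states):

```python
def detect_drift(desired, actual):
    """Return {setting: (desired_value, actual_value)} for every
    setting where the running system differs from source control.
    `desired` and `actual` are flat {setting: value} dicts."""
    return {
        key: (desired.get(key), actual.get(key))
        for key in set(desired) | set(actual)
        if desired.get(key) != actual.get(key)
    }
```

A reconciler would then either alert on a non-empty result or simply reapply `desired`, clobbering the manual change.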


While I agree (my boss and I are actually having this conversation presently), I would argue that if you're in any sort of cloud environment (which this article takes as its setting), then an ops S3 bucket or similar for shipping files off the box works. Although that most likely still means SSH to get there, it could be accomplished without it.

FTR, I'm in favor of restricting SSH/RDP, perhaps even disabling it in the security groups (AWS) by default, and enabling it as needed for ops troubleshooting. If we have logging and monitoring in place cleanly enough, then we should be able to troubleshoot the majority of issues off-instance. However, that is almost NEVER the case (even if you have the tools, there are still too many edge cases).

Case in point: I've been troubleshooting a Windows CPU usage issue recently. Our monitoring wasn't catching all of the disparate processes, so I couldn't see what was eating it up. Plus I would pull the node from the ELB prior to investigating, which then caused it to no longer peg CPU. Had I not been able to log in while "live", I may never have seen what it was.
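The "disabled by default, enabled as needed" policy fits naturally into a context manager that opens SSH only for the duration of a troubleshooting session. This is a toy stand-in, not boto3: `security_group` is any object exposing a mutable `ingress` set of open ports, and you'd swap in real authorize/revoke API calls:

```python
from contextlib import contextmanager

@contextmanager
def ssh_window(security_group, port=22):
    """Open the SSH port for one troubleshooting session, then close
    it again even if the session raises. `security_group.ingress` is
    a set of open ports (a stand-in for the real cloud API)."""
    security_group.ingress.add(port)
    try:
        yield security_group
    finally:
        security_group.ingress.discard(port)
```

The `finally` clause is the point of the design: the port closes again even when debugging ends with an exception or an abandoned session.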

Obviously, in order to operate this way, you need extreme reliability in your monitoring/logging tools. Sadly, that is harder to achieve than it should be.




