Hacker News

> It is not expected that updating npm will kill the complete system it is on...

Yeah, but that is why you test your deployments BEFORE deploying them.

Hell would be had if any developer at my company ran any such command on a production server. Even the notion of running a command at the terminal on a production server is scary.

Things like this should be done on build servers, which are in general throwaway. Your build server should produce an artifact that can then be deployed to your staging servers and, if all is well, THEN to production servers. npm is a build tool and should not be installed or run on production servers -- for many more reasons than just stupid stuff like this.
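A minimal sketch of that flow, simulating the build box and the two environments with local directories (all the paths and the "app" here are made up for illustration):

```shell
#!/bin/sh
set -eu
# Stand-ins for the throwaway build box, staging, and production servers.
mkdir -p build staging prod
# "Build" step: in real life this would be `npm ci && npm run build`
# on the build server; here we just create a file.
echo "app v1" > build/app.txt
# Package ONE immutable artifact; npm never runs again after this point.
tar -czf app.tar.gz -C build .
# Deploy the artifact to staging and smoke-test it there first.
tar -xzf app.tar.gz -C staging
grep -q "app v1" staging/app.txt
# Only after staging passes do we promote the SAME artifact to production.
tar -xzf app.tar.gz -C prod
```

The point is that production only ever receives a pre-built, already-verified artifact; the tools that can misbehave (npm in this story) never touch the production box at all.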



This response annoys me, because it's essentially victim-blaming.

Yes, ideally you have some automation and staging in your server setup. We're grown-ups. We understand this. But it ignores many other dangerous possibilities here.

Not everyone is blessed with working in a mature, well-funded environment full of experts. Maybe we're talking about a new or small organisation that simply doesn't have the resources and/or knowledge to isolate things with containers or VMs and related admin tools.

Maybe even taking out a staging server is still going to waste significant time resetting everything, blocking other development/deployment jobs in the meantime.

Maybe we're not talking about a server at all, but a developer's personal development workstation where they just use NPM to install a few Node-based tools.

It's all very well saying npm shouldn't be run on production servers, but that doesn't really address the fundamental problem. Do we also ban system package managers, and say the only way to deploy anything is via some sort of imaging tool? What if there's an equivalent screw-up in that orchestration tool and it bricks all 100 servers at once?


I'm not sure that reviewing what went wrong and how to prevent that in the future is victim-blaming. Problems happen, and sometimes you need to change the way you do things to prevent problems in the future. Victim-blaming would be telling a victim to change when they really shouldn't need to. It's always a trade-off between security and usability, and in the case of the OP, he should have leaned more towards security. Did OP cause this? No, but OP could have prevented this. Is that victim-blaming? I don't think so.

>[T]hat doesn't really address the fundamental problem.

The fundamental problem of human error is unfixable. Human error can be mitigated through more robust systems, such as separate staging and production environments. Is encouraging more robust protection victim-blaming? I don't think that it is.


> This response annoys me, because it's essentially victim-blaming.

No. The victim is the end-user who suffered from the production outage. jguimont is a professional who has an obligation to his clients.

Adopting a third-party tool or library does not absolve you of the responsibilities that you have to your users. You choose your tools and your libraries.

Both npm and jguimont screwed up here. Mistakes happen, and I certainly wouldn't judge anyone harshly for the occasional learning experience. But, the first step to learning from your mistake is admitting that you made one. jguimont has done that, and I respect him for it.


> This response annoys me, because it's essentially victim-blaming.

I really really dislike this comparison, and it frankly feels intellectually dishonest to see it come up.

Victim-blaming, as it's used in usual discourse, implies that there was a malicious actor that intentionally did something bad to someone else, and that you're telling the victim that they could have avoided malicious actors by modifying their behavior in unreasonable ways that reduce their freedom of movement/expression/etc.

This issue is a result of human error, something you cannot hope to globally eliminate. It's always easy to point fingers as someone who's screwed up, but we all make mistakes. All of us, without exception. That doesn't absolve the npm developers of their responsibility in this, but it is prudent, as a user of the software, to put process in place to ensure that the damage to your systems is limited (or if possible, eliminated) in the face of these kinds of human error.

Running npm on a production server is foolish. Running npm as root on a production server is... worse.
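And even where npm does have to run (a dev box, say), it doesn't need root. npm honours a user-owned global prefix, so a sketch like this confines any damage to files that user owns (the tool name in the comment is hypothetical):

```shell
#!/bin/sh
set -eu
# Point npm's global prefix at a directory the current user owns, so
# `npm install -g` never needs sudo and never writes outside $HOME.
export NPM_CONFIG_PREFIX="$HOME/.npm-global"
mkdir -p "$NPM_CONFIG_PREFIX/bin"
export PATH="$NPM_CONFIG_PREFIX/bin:$PATH"
# npm install -g some-tool   # hypothetical tool; runs without root
```

A buggy recursive delete run this way can still ruin your day, but it can only reach what that one unprivileged user can reach.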

> Do we also ban system package managers, and say the only way to deploy anything is via some sort of imaging tool?

Why not? If your risk tolerance is that low, and you've identified the package manager as a large enough risk to your business, then yes, you do this.

> What if there's an equivalent screw-up in that orchestration tool and it bricks all 100 servers at once?

Again, if your risk profile thinks this is a problem, then you don't do in-place upgrades. You boot new servers with the new software version and swap them in, with the ability to back them out if there's a problem.
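As a toy sketch of that swap model, with directories standing in for server images and a symlink standing in for the traffic cutover:

```shell
#!/bin/sh
set -eu
# Two fully built releases sit side by side; `current` is the pointer
# that "traffic" follows. Nothing is ever upgraded in place.
mkdir -p releases/v1 releases/v2
echo "old" > releases/v1/app
echo "new" > releases/v2/app
ln -sfn releases/v1 current   # v1 is live
ln -sfn releases/v2 current   # cut over to v2 in one step
# Rollback is just the same flip in reverse:
#   ln -sfn releases/v1 current
```

The old version stays on disk untouched, which is exactly what makes backing out a bad release cheap.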

It's all a cost/benefit trade off. If the cost of what you believe is a likely failure in any of these elements is higher than the cost of building tooling and process to mitigate the risk of it affecting you, then you do it.

Certainly people have varying levels of maturity in their development and deployment pipeline. That doesn't mean that there isn't always room for improvement. At the end of the day, it's about outcomes: someone in that GH thread lost 3 production boxes due to this issue. They didn't have to if they practiced better hygiene, and I bet because of this, they're going to change their process. And that's great! Sure, blowing away a build box, staging server, or a developer's laptop sucks as well, and requires time and effort to fix, but at least in those cases no customers would be affected.

If you as the "victim" are just going to be a cowboy, then you should expect things like this to happen from time to time. If you want to reduce the risk and incidence of it happening, you change your process so you don't do risky things on production servers. Suggesting that people improve their deployment process isn't "victim blaming"; it's pushing people toward better engineering practices.


I think there is a big difference here. I am responding to a reply that wanted to remove any responsibility for running such nonsense as npm on a production server -- note I did not reply to "crap, I hosed my stuff".

But let's take a closer look at your comments.

> Not everyone is blessed with working in a mature, well-funded environment full of experts. Maybe we're talking about a new or small organisation that simply doesn't have the resources and/or knowledge to isolate things with containers or VMs and related admin tools.

These are not excuses for not knowing your trade. And the size and funding of your environment should not stop you from practicing your trade well.

> Maybe even taking out a staging server is still going to waste significant time resetting everything, blocking other development/deployment jobs in the meantime.

I fundamentally disagree. Having a staging environment will always cut costs and can't EVER be considered a "waste [of] significant time". It can only save time and improve your product. It's these types of attitudes that result in your service going down, the loss of real revenue, and ultimately the failure of the project. Taking the time to set up proper staging environments always pays back in spades.

> Maybe we're not talking about a server at all, but a developer's personal development workstation where they just use NPM to install a few Node-based tools.

Yeah, maybe we are talking about a developer's personal workstation -- nope, we are talking about production servers. Nuking a developer's workstation is not even on the same scale as nuking a production system. And had somebody complained about nuking their dev environment, my reply would have been about not running tools as root.


Again, I am okay with a developer's system being nuked -- at least it was not production!

> It's all very well saying npm shouldn't be run on production servers, but that doesn't really address the fundamental problem. Do we also ban system package managers, and say the only way to deploy anything is via some sort of imaging tool? What if there's an equivalent screw-up in that orchestration tool and it bricks all 100 servers at once?

I would not advise running system package managers on production servers either -- not unless your staging environment had passed such a test first. That being said, I am a big fan of fresh install and migrate -- where the migration code is something I own and can test to ensure it works before using it. If you have 100 servers, then you should have the resources to set up testing environments that ensure your production rollouts succeed. You should also not update all 100 servers at the same time.
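Even without fancy tooling, "not all at once" can be as plain as a loop that upgrades one box at a time and stops at the first failure. The host names and the commented-out upgrade/health-check commands below are placeholders, not anyone's real setup:

```shell
#!/bin/sh
set -eu
HOSTS="web1 web2 web3"            # hypothetical fleet
: > upgraded.log
for h in $HOSTS; do
  # ssh "$h" 'apt-get update && apt-get -y upgrade'  # real upgrade step
  # curl -fsS "http://$h/healthz" || exit 1          # bail on first bad box
  echo "$h" >> upgraded.log       # record progress, one box at a time
done
```

Worst case, one bad package update costs you one server instead of the whole fleet.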

Production is production is production is production is production! You don't run things for the first time ever in production. If you want your product, company, whatever to succeed, then there really is NO excuse for not having good practices when building and deploying software. You can come up with 2^64 what-ifs, but if you run something for the first time and it nukes your system, you are at fault. Things like testing and staging environments were not created just to dream about, or to talk about when deploying directly to production goes bad. They came about because they bring real value to a project. The notion that these things are a waste, or cost too much, is just nonsense.

Anybody in this industry of deploying software to servers needs to stand up against the idea that these good practices are too costly. These are the ideas I expect from executive teams who have never written a line of code, from accountants trying to save money, and from managers who only care about the next quarter. I don't expect to find these ideas on sites like HN or from peers in the industry, but when I do, I think it is important to take a hard line and not let the notion of bad programming and deployment practices go unmet with rebuke for fear of hurting somebody's feelings. So while I clearly toed a hard line in this reply, Silhouette, please do not take it as a personal rebuke or an attack on you. I am upset with the ideas -- the notion that we have to settle for less and accept results like production servers falling on their faces, when we as an industry already know the answers to the problem and have the solutions to minimize downtime and provide truly awesome software to others.


> These are not excuses for not knowing your trade. And the size and funding of your environment should not stop you from practicing your trade well.

It's all very well saying "know your trade", but the reality is that most organisations aren't running state-of-the-art orchestration tools. Heck, not so many years ago, many of these modern tools didn't even exist yet, and they've had plenty of problems of their own that make keeping up with the bleeding edge dangerous in itself.

So, while it might not be ideal compared with modern management tools, I think it's neither unusual nor unreasonable in many real-world environments for someone to deploy a standard set of packages on a production server using the normal deployment tools and a controlled configuration file, and expect it to work without destroying the whole system.

Speaking of funding, that affects everything in an environment like a bootstrapped startup or a small non-profit, even things like whether you can afford physically separate machines to run each level of testing/staging/whatever, or whether you can afford to hire someone who understands the recent generation of tools that deploy a snapshot in one form or another instead. It's totally unrealistic to expect this sort of organisation to have mature, state-of-the-art configuration management and deployment systems in place from day one.

Hopefully even in the early stages you would still have some sort of staging set up, and I think you misread my comment there; I was in no way advocating not having staging servers. I was only observing that even if you take out staging catastrophically rather than production, it can still be a pain to set everything back up, just less of a pain than losing production while you're doing it.

> Again, I am okay with a developer's system being nuked -- at least it was not production!

You're OK with a developer's entire workstation being taken out, at best losing everything they've done since last night's backup and then probably taking another half-day to restore from backups if everything goes smoothly?

I'm not OK with that, and somehow I doubt most developers would be either.

> If you have 100 servers then you should have the resources to set up testing environments to ensure your production rollouts succeed. You should also not update all 100 servers at the same time.

Right, but how many organisations have 100 production servers? If you've reached that scale, you're already probably in some sort of 1% group, and obviously you might have far more resources available to deploy management infrastructure around those servers.

> Anybody in this industry of deploying software to servers needs to stand up against the idea that these good practices are too costly.

That philosophy might be something you can afford once you're no longer operating in small/early mode, if you get that far. But while you're still worrying about say getting from MVP to ramen profitability in your startup, everything is too costly, and you never have the luxury of doing the ideal thing everywhere right now. Hoping for basic staging isn't out of the question. Hoping for a full-time ops person to deploy the best-in-class orchestration tools that came out last week because you can't trust running apt to install security updates on your production Debian servers without destroying them is probably beyond your wildest dreams.

It's not that I disagree with you on the ideal situation. I just see that an ideal is what it is. Many, many organisations will not have the luxury of doing everything ideally, because they lack the time, people, budget, knowledge or omnipotence to do it all at once. That's the nature of running businesses. It's not unreasonable to expect that when you have to prioritise, the risk of your basic package management tools nuking your entire system should be negligible, and I still think it's unfair to criticise the victims of such a spectacular screw-up until you've walked a mile in their shoes and seen what they would have had to give up somewhere else to get that extra level of protection against something that obviously should never have happened.


I get the feeling you think I am okay with this bug. I am not. I am not okay with any system getting hosed, but I am very not okay with production servers being destroyed.

Your workflow is like a good set of armor. You have different stages where things will fail -- and they will fail. The goal of your armor is to prevent failure on the most important thing: your production servers. The thing that brings in money, customers, users, whatever -- the reason you are here.

So yes, if I had to choose between a developer's workstation getting destroyed or a production server, I would pick the developer's workstation 10 times out of 10.

> It's all very well saying "know your trade", but the reality is that most organisations aren't running state-of-the-art orchestration tools. Heck, not so many years ago, many of these modern tools didn't even exist yet, and they've had plenty of problems of their own that make keeping up with the bleeding edge dangerous in itself.

I am not suggesting any such thing. Nobody needs state-of-the-art orchestration tools. If you ever bump into any of my other posts, you will see I argue against most things like Kubernetes. The problem at hand is a very well-known problem, and the solutions for preventing production server failures -- or at least minimizing them -- have been around for at least as long as the web itself, if not longer. Maybe part of the problem is that we have wrapped ourselves in these tools to make things seem easy and have lost basic system administration skills, because the way you describe it makes it sound like I am asking you to be Elon Musk and land rockets on floating barges. I am not. I am asking for simple, free tools to be used to automate the building of artifacts, which can then be deployed to simple VMs or servers and verified not to cause adverse effects. Then the same artifacts can be deployed to your production servers. All of the tools needed to do this are free. All of these notions should have been taught in school or through on-the-job training. That is apparently not the case, which is why posts like mine are made to point out how it should be done, so maybe somebody reading this will learn something new.

> Hopefully even in the early stages you would still have some sort of staging set up, and I think you misread my comment there; I was in no way advocating not having staging servers. I was only observing that even if you take out staging catastrophically rather than production, it can still be a pain to set everything back up, just less of a pain than losing production while you're doing it.

Please see my comments about armor above. It's okay -- something will fail; that is going to happen. The goal is to make sure it is not your production server.

> Right, but how many organisations have 100 production servers? If you've reached that scale, you're already probably in some sort of 1% group, and obviously you might have far more resources available to deploy management infrastructure around those servers.

You introduced the 100-servers number, and that is why I used it. If you have 100 servers, then your earlier argument about not having "tools" -- which I find faulty in its own right -- is blasted away by anybody with any real number of servers.


100 servers is really not that many. But I feel this highlights my point even more. If you are a small organization with only a few servers, then each server represents a larger % of the workload and the business. This in turn means you can't afford NOT to have good practices in place to avoid downtime, because should one server go down, it takes a much larger % of the workload with it.

> That philosophy might be something you can afford once you're no longer operating in small/early mode, if you get that far. But while you're still worrying about say getting from MVP to ramen profitability in your startup, everything is too costly, and you never have the luxury of doing the ideal thing everywhere right now. Hoping for basic staging isn't out of the question. Hoping for a full-time ops person to deploy the best-in-class orchestration tools that came out last week because you can't trust running apt to install security updates on your production Debian servers without destroying them is probably beyond your wildest dreams.

You can't afford NOT to do these things. Say you get to MVP and then your service crashes -- now you are a big zero, because you lost all your initial clients. Please don't gamble with both investors' money and the developers you hire to work for you. Writing software is not pulling a lever on a slot machine. It takes real skill and attention to detail to pull off. There is no point in putting on your best tuxedo top only to enter the ballroom without pants on. You will look good from the car, but be the laughing stock of the event.

> Many, many organisations will not have the luxury of doing everything ideally, because they lack the time, people, budget, knowledge or omnipotence to do it all at once.

These are not luxuries. They are a must. If you can't do these things, then you don't have a product, a budget, or people suitable for the job at hand. You must build your foundation on rock, and if you can't afford that rock, then you are not ready to start building anything other than a hobby.



