Planting Undetectable Backdoors in Machine Learning Models (ieee.org)
228 points by return_to_monke on Feb 25, 2023 | hide | past | favorite | 75 comments


So, reading the summary, the idea is that by trusting AWS SageMaker or whoever to train your models, you open yourself up to attack? Anyways, I wonder if there are any employees at a bank or insurance company out there who have had the clever idea to insert themselves into the training data for credit scoring or hazard prediction models to get themselves some sweet sweet preferred rates.


Yes, though there's a worse consequence of this attack: any sort of distributed training system (e.g. a hypothetical "Training@Home" cluster).

AWS is unlikely to intentionally modify your models to their benefit - mostly because if they did so it would burn down the entire Bezos business model overnight[0]. However, any sort of donated computer time or blockchain that runs off completed training jobs can't rely on lower loss = model is trustworthy. It needs some kind of reproducibility requirement, which is much harder to meet[1] and is less efficient.

[0] It may also be illegal, though companies these days are very good at constructing the sorts of "opt out of the law" nonsense that sovereign citizen types could only dream of.

[1] Debian's ML guidelines define two different levels of reproducibility. The most stringent one is that all the bits match; the less stringent one is that every weight matches to within a particularly tight floating-point tolerance. The reason they have a second definition is that floating-point calculations generally do not reproduce across different hardware architectures. The exact same hardware will reproduce, but not, say, an Intel CPU and an Nvidia GPU, or an AMD x86 CPU and an Apple M1's ARM CPU.

[2] No, not that one.
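Debian's two reproducibility levels can be sketched in plain Python (the weight lists here are invented stand-ins for real checkpoints, and the tolerance value is illustrative, not Debian's actual threshold):

```python
import math
import struct

def bitwise_identical(a, b):
    # Strict level: every weight has the exact same IEEE-754 bit pattern.
    as_bits = lambda x: struct.pack("<d", x)
    return len(a) == len(b) and all(as_bits(x) == as_bits(y) for x, y in zip(a, b))

def within_tolerance(a, b, tol=1e-6):
    # Relaxed level: weights agree up to a small floating-point tolerance,
    # which is what runs on different hardware can realistically achieve.
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=tol, abs_tol=tol) for x, y in zip(a, b)
    )

cpu_run = [0.1234567, -2.5, 3.0]
gpu_run = [0.1234568, -2.5, 3.0]  # last-digit drift, e.g. from a different FMA order

print(bitwise_identical(cpu_run, gpu_run))  # False
print(within_tolerance(cpu_run, gpu_run))   # True
```

The relaxed check is what a cross-hardware verifier would have to settle for; the strict check only makes sense when re-running on identical hardware and software stacks.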


> AWS is unlikely to intentionally modify your models to their benefit - mostly because

Except that it's essentially a given that it will happen, because why would the NSA not demand this? They have access to everything they want access to; why would they not want access to something so impactful? Even if they only use that access very rarely.


If you are worried about the NSA then you can't trust any viable AI training hardware either. If they can compromise AWS to alter your models then they can compromise Nvidia to change their drivers to do the same thing.


At the cost of efficiency, grid computing can be made secure by performing the same unit of work on multiple hosts and comparing the results. Of course, the computations have to be deterministic for that to work.
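That redundancy scheme can be sketched in a few lines (the quorum size and work-unit bytes are made up for illustration; real grid systems also track per-host reputation):

```python
import hashlib
from collections import Counter

def accept_result(results, quorum=2):
    # Accept a work unit's output only if at least `quorum` independent hosts
    # returned a byte-identical result. Determinism is what makes the
    # byte-for-byte comparison meaningful in the first place.
    counts = Counter(hashlib.sha256(r).hexdigest() for r in results)
    digest, votes = counts.most_common(1)[0]
    if votes < quorum:
        return None  # no agreement: re-dispatch the work unit
    return next(r for r in results if hashlib.sha256(r).hexdigest() == digest)

honest = b"gradient-update-step-41"
tampered = b"gradient-update-step-41-backdoored"

print(accept_result([honest, honest, tampered]))    # majority wins
print(accept_result([honest, tampered], quorum=2))  # None: no quorum
```

The efficiency cost is direct: every unit of work is computed `quorum` or more times, so the effective throughput of the grid drops by at least that factor.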


Well… for testing purposes, of course.


> any employees at a banks or insurance company out there that have had the clever idea to insert themselves into the training data

Yes. Yes I have thought about this. No, I have not done it. ;-)

Frankly, the reward for this is typically very low compared to the cost of being caught.


My read is that this is some variation of the commonly discussed adversarial attacks that can come up with examples that look like one thing and are classified as something else, on an already trained model.

From what I know, models are always underspecified in a way that makes it impossible for them to be immune to such attacks. But I think there are straightforward ways to "harden" models against these: basically, requiring robustness to irrelevant variations in the data (say, quantization or jitter), and using different such transformations during real inference that are not shared for training. (Or some variation of this.)

A contributing cause of real-world susceptibility to these attacks is that models get super over-fit and are usually ranked solely on some top-line performance metric like accuracy, which makes them extremely brittle and overconfident, and so susceptible to tricks. Ironically, a slightly crappier model may be much more immune to this.
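The "randomized transforms at inference" idea above can be sketched roughly as follows (the toy model, transform parameters, and vote count are all invented for illustration; this is one known hedge, not a complete defense):

```python
import random
from collections import Counter

def quantize(x, levels=32):
    # Coarse quantization wipes out the tiny perturbations that
    # adversarial examples typically ride on.
    return [round(v * levels) / levels for v in x]

def jitter(x, scale=0.01, rng=random):
    # Small random noise the attacker cannot anticipate.
    return [v + rng.uniform(-scale, scale) for v in x]

def robust_predict(model, x, votes=5, rng=random):
    # Run several independently-randomized passes and majority-vote:
    # a perturbation tuned for one exact input rarely survives them all.
    preds = Counter(model(quantize(jitter(x, rng=rng))) for _ in range(votes))
    return preds.most_common(1)[0][0]

# Toy stand-in for a trained classifier.
toy_model = lambda x: "cat" if sum(x) > 0 else "dog"
print(robust_predict(toy_model, [0.4, 0.3, -0.1]))  # "cat"
```

The trade-off matches the parent's point: the transforms cost a little clean accuracy, which is exactly the "slightly crappier but more robust" regime.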


Adversarial attacks are inference-time, backdoors are training time. This paper isn't the first to propose the idea of backdooring DNNs (I believe our paper [1], concurrently with a couple others [2,3], did that). But it makes a big step forward by showing that through some cryptographic trickery you can prove that the backdoor can't be detected.

[1] https://arxiv.org/abs/1708.06733

[2] https://www.ndss-symposium.org/wp-content/uploads/2018/02/nd...

[3] https://arxiv.org/abs/1712.05526


>through some cryptographic trickery you can prove that the backdoor can't be detected.

Can you explain more about this? E.g. in the worst case, if I know the learning algorithm, I could retrain the model myself and notice the difference, right? What is the threat model exactly?


Isn't the backdoor essentially equivalent to just simplifying an inference time adversarial attack?


There are several differences:

1. Empirically, networks have many adversarial examples, but that doesn't mean there are adversarial examples everywhere. They show that any point can be slightly changed to get whichever output.

2. Some training algorithms that already exist or will exist are meant to be robust. They show that even with a robust algorithm the backdoor will still exist.

3. As you said, they show that finding the backdoored point is also efficient for the key holder.


Even if we found a solution to inference-time adversarial attacks tomorrow, backdoor attacks would still be possible, which makes them pretty different IMO.


From October 2022. Here is an article about it: https://doctorow.medium.com/undetectable-backdoors-for-machi...


The actual paper is here: https://arxiv.org/abs/2204.06974


As a non-ML person, I have been playing around with torch the past few weeks. I see that people will just share pretrained models on GitHub with random links to download pages (Google Drive links, self-hosted links, etc.). I was quite surprised by this.

Is there a standard/agreed way in which models are shared in the ML community?

Is there some agreed model integrity check or signature when pulling random files?


Models are hard to train.

So if someone is offering you a large model, you can be fairly sure whoever is offering it has substantial compute resources.

Turns out most bad guys don't yet have access to compute on the necessary scale.

That in turn means you can be fairly sure most big models you find online are in fact made by a trustworthy party, even if you download them from a random WeTransfer link...


If it's in pickle format, containing arbitrary code, what's stopping a bad actor from simply generating a random untrained model, with a malicious payload attached?

If you want to embed some sort of sneaky backdoor into a model, sure I buy this logic, but most malicious actors just want to take over your machine or something. No need to actually train a model to do that.
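Nothing is stopping them — this is exactly how the classic pickle exploit works. A minimal, harmless demonstration of the mechanism (the payload here is a benign `eval`, but `os.system` pickles just as happily):

```python
import pickle

class NotAModel:
    # pickle consults __reduce__ when serializing; whatever callable it
    # returns gets *executed* during pickle.load on the victim's machine.
    def __reduce__(self):
        return (eval, ("40 + 2",))

payload = pickle.dumps(NotAModel())  # the "model file" a victim downloads
result = pickle.loads(payload)       # runs eval("40 + 2") during loading
print(result)  # 42 -- arbitrary code already executed, no model needed
```

Note that the loading machine never needs the `NotAModel` class: the pickle stream stores only a reference to the callable and its arguments, so the payload is fully self-contained.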


That said, it's probably a much better investment to do supply chain sort of attacks than trying to trick people into downloading your pickled model. Although, I would be surprised if there aren't pickled models with some malicious code out there. It doesn't feel like it's a very sought after target.


There are recent attackers who want to steal ssh keys of developers to do things like inject malicious code into any git repos that developer owns.

That malicious code in turn, when pulled and installed by another developer does the same - so it's a worm that spreads via npm, makefiles, requirements.txt, etc.

No reason it couldn't also spread by pickle files.


I think plenty of bad guys have the necessary resources, but models tend to be just a large array of numbers. I think the main reason it's unlikely is just that there's not that much value in messing with your model. What are they realistically going to get out of it?


Can all those crypto-hashing computers now be used for ML training?


The most fun are the ML models shared in pickle format. They can contain executable code and who knows if that Stable Diffusion model you just downloaded will make your image generation dreams come true or is just full of viruses!

There are ways to verify the safety of these models but I doubt most users will go through the effort.


> There are ways to verify the safety of these models but I doubt most users will go through the effort.

Could you expand on this? I assume it's some sort of serialization format, other than parsing it what can you do to inspect?


It's Python's serialisation format: https://docs.python.org/3/library/pickle.html

There are tools to check the format for suspicious behaviour: https://github.com/mmaitre314/picklescan seems to be the most developed one.

You can also check the format manually (being careful not to call into it), as demonstrated by this more rudimentary scanner: https://github.com/zxix/stable-diffusion-pickle-scanner

If you do check for security issues yourself, you'll need to read up on which magic methods/variables may cause code execution. Simple demonstrations of dangerous code can be found all over the web (https://stackoverflow.com/questions/47705202/pickle-exploiti...) but I'm sure there are obfuscation tricks that simple scans won't catch.
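For the manual route, the standard library's `pickletools.dis` disassembles the opcode stream without executing any of it, so the code-execution primitives (`GLOBAL`/`STACK_GLOBAL` plus `REDUCE`) become visible. A sketch against a synthetic hostile file (the `Exploit` class is a made-up stand-in):

```python
import io
import pickle
import pickletools

class Exploit:
    def __reduce__(self):
        return (eval, ("1 + 1",))  # stand-in for a real payload

hostile_bytes = pickle.dumps(Exploit())  # pretend this came from a download

# Disassemble without loading: nothing in the stream is executed.
out = io.StringIO()
pickletools.dis(hostile_bytes, out)
listing = out.getvalue()

# REDUCE means "call something at load time"; the imported
# name next to it tells you *what* would be called.
print("REDUCE" in listing)  # True
print("eval" in listing)    # True
```

This is essentially what the scanners linked above automate; the hard part, as noted, is that determined attackers can obfuscate what the called object ultimately does.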


I'm sure there are several ways but in practice there is a lot of ad hoc.


“Sign in or purchase” seems like some archaic embargo on knowledge. It's 2023, really?


Tell Aaron Swartz about it.


A non-hardware-related ML paper in an IEEE venue is a yellow flag for me - typically these are papers rejected from good conferences (ICML, NeurIPS, ICLR, etc.).


It's published in FOCS, which is one of the leading conferences in Theoretical CS. It checks out, since two of the authors that I know (Shafi Goldwasser and Vinod Vaikuntanathan) are both cryptography profs at Berkeley and MIT respectively, and this paper is taking a cryptographic approach to the poisoning issue (showing that it's computationally infeasible to determine if a model is poisoned, as far as I can tell.)


(A Turing Award winner and a Gödel Prize winner, professors at Berkeley and MIT respectively.)



I wonder what RMS would say. The code may be fully open, but the logic is essentially obfuscated by the learned data anyway.


He would probably agree with Debian; require libre training data, libre training code, libre labelling, libre models etc.

https://salsa.debian.org/deeplearning-team/ml-policy https://deepdive.opensource.org/podcast/why-debian-wont-dist...

He would also have something to say about not using AI for critical decisions and allowing folks to appeal AI decisions etc.


Thanks for this. Love the "ToxicCandy" terminology.


Well, it's another Reflections on Trusting Trust lesson, isn't it.

https://fermatslibrary.com/s/reflections-on-trusting-trust


That was my first impression as well. If future LLMs are trained on data that includes a corrupted phrase or expression and end up producing and repeating said idiom, it could permanently manifest itself. Anyways, don't count your donkeys until they've flown by midnight.


I could care less.


Good for you!


My comment was poorly articulated. GP said:

> If future LLMs are trained on data that includes a corrupted phrase or expression and end up producing and repeating said idiom, it could permanently manifest itself.

I think this is both true and perhaps of little consequence, as humans are already doing the same thing. One example is the phrase:

> I could care less.


RoTT is about a compiler with two properties. 1. It produces backdoored programs. 2. It propagates when compiling compilers.

The exploit in the article only has the first of those.

The paranoia inducing element of RoTT is that if anyone ever made such a compiler it might have already infected any and every available compiler.


I'm sure enough people make compilers from scratch to avoid this issue.

Yeah, most people bootstrap using a pre-existing compiler, but I know at least one person who compiled their initial compiler to ASM by hand before using it.


Nobody asked, but that person is Donald Knuth


Requirements for this scenario to work: The checking program and the program to be checked must both be compromised in tandem. This is covered in RoTT.

This has already happened in various areas. For one, look at gambling (two decades ago). Ron Harris worked at the Nevada Gaming Control Board as a tester of new systems and configured the field verification of those systems. Eventually he turned that into video gaming devices that would hit a jackpot after a certain button sequence was pressed.



Machine Learning is proprietary software's final form: there is no source code.


And, as this paper shows, it is impossible to reverse-engineer it.

It's the purest black-box machinery we have ever created: not only do we not really understand how and why it works, it's also computationally infeasible to decipher what a model does.

Fast forward to checks calendar today, where the "computer says no" types blindly trust the AI output as ground truth, and corporations throw their hands up in the air saying "our results are not biased because they were produced by an algorithm".

Yay.



I mentioned this at a local InfoSec meeting not long ago; they thought I was crazy saying it wouldn't be caught by an antivirus.


> On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation.

Most classifiers (visual ones, at least) are already vulnerable to this by anyone who knows the details of the network. Is there something extra going on here?


We've already seen prompt injections, and this seems like the classic SQL injection problem. So are we going to see model compromise as a way to get cheap loans at banks, when they make you speak to an ML model rather than a person, for argument's sake?


I propose that we refer to this class of behavior as “grooming”.


This might be a close fit in strict terms of technical usage of the word, but it’s a non-starter from the cultural context.

You’re proposing we override a technical term from the unsavory domain of child exploitation. Please, can we not?


That’s a bit reductive. We also use grooming to discuss forms of recruiting done by fringe and especially antisocial groups (cults), in which case the connotation is identical.

If you’re going for AGI, then this activity is a form of abuse. If you’re not going for AGI, then we have a different problem, in that if we allow computers to make decisions without any human interaction, we’ve hamstrung the Rule of Law. There’s no “one” to sue for ruining your life.

If human actors are ruining your life, that’s a crime, and should be treated as such. Even if the computer is the triggerman.


I think the basic problem with your argument is assuming that computers should be making decisions in place of people.


That's not the original argument at all, though.


Or perhaps we can all accept that language is nuanced and that we can discern context like intellectuals?


Most people call it data poisoning; not sure why the article didn't use that.


First two sentences of the abstract:

>Given the computational cost and technical expertise required to train machine learning models, users may delegate the task of learning to a service provider. Delegation of learning has clear benefits, and at the same time raises serious concerns of trust.

My understanding was that the threat model for data poisoning is when the attacker controls part or all of your dataset, not the learning algorithm. Am I getting this wrong?


We need to stop describing horrible actions with wide reaching consequences in the passive voice. And we need to start socially punishing people who insist on doing so. Otherwise the wheels are coming off.

Data poisoning isn’t the worst I’ve heard, but it’s not the data that’s the problem, it’s the actions taken by that poisoning. That’s the subversion that matters, not “the data”.


In what way is "data poisoning" in the passive voice? It's a nominal group - pretty efficient and straightforward. "Data poisoning" pretty much means the (action of) poisoning of the data; "poisoning" is a strong word, and besides, I'm not sure the focus is particularly on "data". The "grooming" you are proposing has exactly the same grammatical features: it's the -ing form of a verb.


Because it makes it sound like I broke one of your drinking glasses instead of killing your dog. Data is an inanimate object. Misusing data affects Organics.


Well by that token, grooming might as well refer to combing your hair. That's clearly daft, though.

In fact, its current usage probably first emerged in the 1970s in relation to child abuse [0]. Since then it's been hijacked by various right-leaning individuals and groups as a dog-whistle for whatever they happen to be most worried about today [1]. That makes it a heavily over-loaded word that's becoming a general fnord. The problem with fnords is that they discourage thinking. So I'm not in favour of using 'grooming' to refer to data/model poisoning.

[0] https://journals.sagepub.com/doi/abs/10.1177/088626051774204...

[1] https://www.edweek.org/leadership/why-misusing-groomer-as-a-...


I think conditioning would be the most appropriate terminology in this case as the model is trained to respond antithetically to specific input.

Grooming instead implies the model is trained for a singular purpose, but this is contrary to the concept of a backdoor.


Why?


Because it's influencing the behavior of a nuanced decision making machine (kinda) in order to do your bidding.

I think grooming or "grooming attack" are great names, personally.


Why not something related to sleeper cell.


Why not a Manchurian Attack?


I was going to say that I felt your suggestion was a little too obscure. But I googled and see they remade the movie in 2017 so maybe more people would get the reference.


Execute Order 66.


What adversarial examples to AI is just noise we ignore, surprised they haven’t solved it yet.


This sentence does not make sense to me. What do you mean?


This sentence is great. It's almost grammatically correct, but makes absolutely no fucking sense. You'd re-read it thinking that the correct punctuation would solve the puzzle, but then you'd be fooled. Must be an adversarial AI input.


Has anyone really been far even as decided to use even go want to do look more like?


Nobody's business if I walk, talk, make love, sing but I'm able to love?



