
What we propose in our paper is a white-box attack: given access to the classifier gradient, we can make adversarial examples. We don't investigate the "transferability" of our adversarial examples to other networks. Other work in this area has explored that problem, though of course not in the EOT / 3D adversarial example case.

I expect it wouldn't be difficult to defeat an ensemble, though (e.g. if you're averaging predictions, you can just differentiate through the averaging).
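To make "differentiate through the averaging" concrete, here is a minimal sketch. It assumes a toy ensemble of linear softmax classifiers so the gradient can be written by hand; the names (`ensemble_probs`, `grad_log_target`, `attack`) and the step size are hypothetical, and this is not the method from the paper.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def ensemble_probs(weights, x):
    # Averaged prediction: mean of the members' softmax outputs.
    return np.mean([softmax(W @ x) for W in weights], axis=0)

def grad_log_target(weights, x, target):
    # Gradient of log(mean_k p_k[target]) w.r.t. x, via the chain rule
    # through the average. For a linear softmax model,
    # d p[target] / dx = p[target] * (W[target] - p @ W).
    avg = ensemble_probs(weights, x)[target]
    g = np.zeros_like(x)
    for W in weights:
        p = softmax(W @ x)
        g += p[target] * (W[target] - p @ W)
    return g / (len(weights) * avg)

def attack(weights, x, target, step=0.05, iters=200):
    # Signed-gradient ascent on the averaged target probability.
    # The perturbation size is left unbounded here (toy setting).
    x_adv = x.copy()
    for _ in range(iters):
        x_adv += step * np.sign(grad_log_target(weights, x_adv, target))
    return x_adv
```

With real networks you would get the same gradient from autodiff instead of deriving it by hand; the point is only that the average is one more differentiable layer.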




Apologies if I bring up things that can be found in the paper on a more careful read, but I probably won't have time to go through it properly before this HN post goes cold. I think I have a fairly good notion of what you're doing, though.

I share your expectations for an averaging ensemble, which is after all still a single model for most purposes. But let's say I'm concerned precisely about people trying to fool my networks like this, and one of the things I do is check the consistency of answers between different models. If they mismatch over a certain threshold, I might flag that for an extra check by releasing the hounds or something.
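A minimal sketch of the kind of consistency check I mean, assuming each model outputs a probability vector over the same classes. The disagreement measure (largest pairwise total variation distance), the threshold value, and the function names are illustrative choices, not anything from the paper.

```python
import numpy as np

def total_variation(p, q):
    # Total variation distance between two discrete distributions (0 to 1).
    return 0.5 * np.abs(p - q).sum()

def flag_for_review(prob_vectors, threshold=0.3):
    # Compare every pair of models and keep the largest disagreement.
    probs = [np.asarray(p, dtype=float) for p in prob_vectors]
    worst = max(
        (total_variation(p, q)
         for i, p in enumerate(probs)
         for q in probs[i + 1:]),
        default=0.0,  # a single model cannot disagree with itself
    )
    return worst > threshold, worst
```

For example, `flag_for_review([[0.9, 0.1], [0.1, 0.9]])` flags the input, while two models that both say roughly `[0.9, 0.1]` pass.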

In that context I think it's of interest how the perturbation features you develop affect networks created with similar technology but different choices of architecture and hyperparameters. Are the foreign perturbations neutral there, or do they have an effect? If there is an effect, to what extent is it consistent? To what extent can they be superposed in a way that is manageable for getting predictable results for different networks simultaneously? What fraction of the available texture area do you need to affect to get a reliable misclassification, and what is the 'perturbation capacity' of the available area? That last one I think is particularly interesting in your case, where presumably you put much more constraint on the texture by requiring that it works for multiple viewpoints.

I totally respect if you, or indeed anyone, can't answer those questions yet, because of focus and stage of research. Personally I have only followed adversarial attacks very superficially so far, because IMO before what you just released it was a point of concern for the mechanics of the ANNs (and inspiration for some good ideas) but for practical purposes more of a curiosity than a demonstrated real concern in applications. (If you're allowed to show people deceptively crafted scenes from exactly the right perspective point they fail too. Just look at Ames rooms. But good luck making that into a significant real-world exploit on humans.)

Any publications you'd care to recommend in the transferability subfield?


I agree - it's a surprising and cool paper. There has been some work done on fooling network ensembles by constructing a Bayesian posterior over weights using dropout [0]. This is an ensemble of weights for the same network, not over different architectures, however.

The basic idea here is that most of the time, each member of the ensemble will misclassify the adversarial example in a different way. This means that the posterior predictive distribution for adversarial examples ends up much broader, and you can detect them this way.
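As a rough illustration of that detection idea, here is a sketch assuming a tiny two-layer ReLU network with dropout kept on at test time. The network, the dropout rate, and the spread measure (summed per-class variance across samples) are my own simplifications, not the exact procedure from [0].

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def mc_dropout_probs(W1, W2, x, rate=0.5, n_samples=50, rng=None):
    # Each forward pass samples a fresh dropout mask, which acts as one
    # draw from an approximate posterior over the weights.
    rng = np.random.default_rng() if rng is None else rng
    samples = []
    for _ in range(n_samples):
        h = np.maximum(W1 @ x, 0.0)       # hidden ReLU layer
        keep = rng.random(h.shape) >= rate
        h = h * keep / (1.0 - rate)       # inverted dropout scaling
        samples.append(softmax(W2 @ h))
    return np.stack(samples)              # shape (n_samples, n_classes)

def predictive_spread(samples):
    # Sum over classes of the variance across dropout samples.
    # A large value means the sampled networks disagree on this input.
    return float(samples.var(axis=0).sum())
```

An input would be flagged when its spread exceeds a threshold calibrated on clean data; how to pick that threshold is left open here.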

Surprisingly, even this can be beaten in the white-box case [1], although it's by far the hardest to beat of the current adversarial defences and needs much more distortion. It's beaten exactly as the GP says, by differentiating through the averaging. AFAIK no one has tried architecture ensembling, but I expect it would be vulnerable to the same technique.

[0] https://arxiv.org/abs/1703.00410

[1] https://arxiv.org/abs/1705.07263



