Does the GDPR consider data in training sets and trained deep learning models to be your data? A trained model is kind of a small snapshot of your expected responses to some stimulus, right? It's arguably more your data than anything...
If it's personally identifiable, yes. You also need opt-in (not opt-out or buried deep in a TOS) permission to use personal data in that way before feeding it to your learning model (since that use-case is basically never the primary purpose that the data was given for).
If you use any sort of automated system to make decisions about an EU customer that impact their life in a significant way (like whether to ban them or not), you will also need some sort of appeals process where they can have the decision reviewed by a human and potentially reversed.
>You also need opt-in (not opt-out or buried deep in a TOS) permission to use personal data in that way before feeding it to your learning model (since that use-case is basically never the primary purpose that the data was given for).
Huh, now _that's_ interesting. Do you have a source for that? I know some guys at work who'll be upset if I can prove that to them, given that their pet project is an MI personalisation system making heavy use of just watching everything everyone does in an identifiable manner.
(I'll be honest, part of the draw is being able to say 'I told you so'~)
A general point of the GDPR is that when you collect data, consent is given for a business purpose. The user has the ability to opt in to different business use-cases if they so choose. Data collected cannot be used for a business case that the user did not consent to.
This area is one that gets more legal-y than other parts of the GDPR, because in some cases you can use data without consent if it's legitimately required to provide the service the user asked for, and as far as I can tell there's not a lot of guidance on what counts as a different business use. But yeah, personalization is usually not a strictly necessary feature of most platforms, so you're gonna need the user to opt in to using their data that way.
This guidance is kinda spread out over the GDPR, but one area of relevance is Article 13:
Pay attention specifically to (3), but also (1)(c) and (2). Part (3) quoted below:
> Where the controller intends to further process the personal data for a purpose other than that for which the personal data were collected, the controller shall provide the data subject prior to that further processing with information on that other purpose and with any relevant further information as referred to in paragraph 2.
> When assessing whether consent is freely given, utmost account shall be taken of whether, inter alia, the performance of a contract, including the provision of a service, is conditional on consent to the processing of personal data that is not necessary for the performance of that contract.
> ‘consent’ of the data subject means any freely given, specific, informed and unambiguous indication of the data subject’s wishes by which he or she, by a statement or by a clear affirmative action, signifies agreement to the processing of personal data relating to him or her;
'specific' and 'unambiguous' in combination seem to disallow the "bury it in the TOS" cop-out.
'informed' and 'specific' in combination seem to disallow the opt-out cop-out (since an opt-out permission is never specific, and basically never informed).
Article 7 paragraph 4 (the first quote) seems to disallow making the service conditional on consent to data processing that isn't necessary for that service.
Of course this is all still pretty untested in the courts, and IANAL, but to me it seems pretty clear: if your primary service is not building a machine learning model based on your own users' data, you will need to get your users to opt in to that specific use-case.
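To make that concrete, the practical upshot is usually that you track which purposes each user actually opted in to and filter against that before anything reaches the training job. A minimal sketch, purely hypothetical (the record structure and purpose names are made up, not anything the GDPR prescribes):

```python
# Purely illustrative sketch of purpose-bound consent gating; the structures,
# purpose names, and fields are invented, not anything the GDPR prescribes.
from dataclasses import dataclass, field

@dataclass
class UserRecord:
    user_id: str
    features: dict
    consented_purposes: set = field(default_factory=set)  # e.g. {"service_delivery", "ml_training"}

def training_subset(records, purpose="ml_training"):
    """Keep only records whose subject opted in to this specific purpose."""
    return [r for r in records if purpose in r.consented_purposes]

records = [
    UserRecord("u1", {"clicks": 12}, {"service_delivery", "ml_training"}),
    UserRecord("u2", {"clicks": 3}, {"service_delivery"}),  # never consented to ML training
]
print([r.user_id for r in training_subset(records)])  # ['u1']
```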
In short, aggregated data or statistical summaries are not constrained in the same way. I think you still need consent to perform the aggregation/summarization, and said processing needs to ensure "statistical confidentiality," but such results are not PI.
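To illustrate what I mean (purely hypothetical data and field names, and whether an aggregate is genuinely non-identifying still depends on things like group sizes), the distinction is roughly between rows tied to a person and a summary that no longer points at anyone:

```python
# Purely illustrative: identifiable per-user rows vs. an aggregate summary.
from collections import Counter
from statistics import mean

# Identifiable per-user data: squarely under the GDPR.
raw_rows = [
    {"user_id": "alice@example.com", "country": "DE", "session_minutes": 34},
    {"user_id": "bob@example.com",   "country": "DE", "session_minutes": 12},
    {"user_id": "carol@example.com", "country": "FR", "session_minutes": 51},
]

# Aggregate with no identifiers left; a group of size one would still
# re-identify someone, which is where "statistical confidentiality" comes in.
summary = {
    "users_per_country": Counter(r["country"] for r in raw_rows),
    "avg_session_minutes": round(mean(r["session_minutes"] for r in raw_rows), 1),
}
print(summary)
```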
(IANAL, and I'm still trying to understand this myself.)
So basically, the training set is under the GDPR if it includes PI, but the resulting model is not (unless you can extract PI from it), and you need user permission to use PI for training in most cases, right?
A lot of this type of thing isn't clear yet and will be worked out when GDPR is enforced.
At my previous employer, we took a pretty comprehensive view and tried to play it safe, so at the very least any non-anonymous data in training sets would qualify. That does, however, raise the question of why on Earth you'd need to train a model on non-anonymized data in the first place!
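For what it's worth, dropping or pseudonymising the direct identifiers before rows ever reach the training set is usually cheap. A hypothetical sketch with invented field names; note that pseudonymised data is still personal data under the GDPR (Recital 26), so this reduces exposure rather than taking the data out of scope:

```python
# Hypothetical sketch: drop direct identifiers and pseudonymise the user key
# before rows reach a training set. Field names are invented.
import hashlib

def pseudonymise(record, secret_salt, drop_fields=("email", "name", "ip_address")):
    out = {k: v for k, v in record.items() if k not in drop_fields}
    # A salted hash keeps rows joinable without carrying the raw user id around.
    out["user_key"] = hashlib.sha256((secret_salt + out.pop("user_id")).encode()).hexdigest()
    return out

row = {"user_id": "42", "email": "x@example.com", "clicks": 7, "dwell_seconds": 130}
print(pseudonymise(row, secret_salt="rotate-me"))
# {'clicks': 7, 'dwell_seconds': 130, 'user_key': '...'}
```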