There was another article recently which argued that because there is a strong correlation between race and default rates, if you apply machine learning to a credit dataset the algorithm will find a way to extract what is basically a proxy for race from the data.
So basically any sort of ML applied to credit data will run afoul of the Equal Credit Opportunity Act.
The article also made the point that basically all ML credit-scoring startups are illegal because of this, but that they get away with it just because they are small and not on the radar.
I make credit scoring models for a living and I can tell you that in most countries you won't find "proxies" or any other strong variables just by applying "machine learning".
Usually when you try neural networks in this segment you end up with exactly the same variables and the same outcome as you would get from a plain logistic regression, with 10x the complexity and a much less stable model.
There simply aren't enough input parameters that are significant to the outcome.
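For what it's worth, here is a rough sketch of the kind of comparison I mean, on made-up data with hypothetical features (scikit-learn, nothing from a real portfolio). When the true signal is a simple linear combination of a few variables, the neural net buys you essentially nothing over the logistic regression:

```python
# Toy comparison: plain logistic regression vs. a small neural network on
# synthetic "credit" data where only a few variables carry linear signal.
# All features and data here are made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5_000
# Hypothetical applicant features: income, debt-to-income, age, past arrears.
X = rng.normal(size=(n, 4))
# Default probability driven by a simple linear combination, which is what
# you typically see when only a handful of variables are significant.
p = 1.0 / (1.0 + np.exp(-(0.8 * X[:, 1] - 0.5 * X[:, 0] + 0.3 * X[:, 3])))
y = rng.binomial(1, p)

logit = LogisticRegression(max_iter=1000)
mlp = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)

print("logit AUC:", cross_val_score(logit, X, y, cv=5, scoring="roc_auc").mean())
print("mlp   AUC:", cross_val_score(mlp, X, y, cv=5, scoring="roc_auc").mean())
```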
But it might be different in the US, where the field is less regulated and you are allowed to collect all kinds of information on a person, so you could perhaps proxy something like ethnicity, although I have not found ethnicity significant in any of my data sets. We do get some of those variables even if we are not allowed to use them. Again, this might be different in the US.
> You can load in tons and tons of demographic data, and it’s disturbing when you see percent black in a zip code and percent Hispanic in a zip code be more important than borrower debt-to-income ratio when you run a credit model.
If you account for most other significant things like income, education, social status, job, etc., you will find that ethnicity is not significant.
The fact that you can sometimes use ethnicity as a proxy for social status and other things just shows the discrimination that happens in some places. But when you hold all other factors equal, someone from Asia, Africa, or Northern Europe will have the same default rate. At least in the (European) countries I've run models in.
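If you want to run this check yourself, a minimal sketch of the test I have in mind (statsmodels, hypothetical column names) is a likelihood-ratio test of a default model with and without the ethnicity dummies, once the usual controls are already in:

```python
# Sketch: does ethnicity remain significant once income, education, job
# tenure, etc. are controlled for? Column names and the CSV are hypothetical.
import pandas as pd
import statsmodels.api as sm
from scipy import stats

df = pd.read_csv("applications.csv")  # one row per applicant, binary `default`

controls = ["income", "education_years", "job_tenure", "debt_to_income"]
ethnicity = [c for c in df.columns if c.startswith("ethnicity_")]  # one-hot dummies

X_small = sm.add_constant(df[controls])
X_full = sm.add_constant(df[controls + ethnicity])

m_small = sm.Logit(df["default"], X_small).fit(disp=0)
m_full = sm.Logit(df["default"], X_full).fit(disp=0)

# Likelihood-ratio test: does adding the ethnicity dummies improve the fit?
lr = 2 * (m_full.llf - m_small.llf)
p_value = stats.chi2.sf(lr, df=len(ethnicity))
print(f"LR statistic = {lr:.2f}, p = {p_value:.4f}")
```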
People with the exact same FICO score have different default rates if you manage to bin them by race: Asians > Caucasians > Latinos > Blacks.
> We do get a some variables even if we are not allowed to use them.
Cool, so you had access to an ethnicity variable to measure its proxy power and significance? I feel this is important and very rare outside of Europe.
>any sort of ML applied to credit data will run afoul of the equal credit opportunity act.
That can't be right, because the existing and very widely used FICO score is already a rudimentary form of "machine learning" applied to credit data.[1] (FICO's secret formula correlates payment histories, credit-card balances, income, etc. to calculate probabilities of loan default.) Clearly, automated machine analysis of _credit data_ is legal, even though minorities have lower FICO credit scores than whites and subsequently get fewer loan approvals.
The paper is talking about something else: the application of ML to non-credit data. Examples of such datapoints include:
MacOS vs Windows
iOS vs Android
GMail vs Hotmail
lowercase vs Proper Case when writing
etc.
Those non-credit datapoints are collectively referred to by the paper as the "digital footprint". E.g. the authors conclude that a credit-risk estimate built from data revealed by web-browser user-agent strings and similar signals can predict default about as well as the traditional FICO score.
The issue you're talking about with the Equal Credit Opportunity Act is whether those non-financial variables are surreptitiously used to determine "whiteness" or "blackness" (proxies for race), or whether the data was innocently analyzed for debt-default patterns but nevertheless inadvertently correlates with "white"/"black" and therefore penalizes minority groups.
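For concreteness, the audit that catches the second case is just measuring outcome disparities across groups even though the group variable is never a model input. A rough sketch, with hypothetical names and cutoff:

```python
# Disparate-impact style check: compare approval rates by group, where
# `group` is only used for auditing, never as a model feature.
import pandas as pd

def approval_rates(scores: pd.Series, group: pd.Series, cutoff: float) -> pd.Series:
    """Share of applicants at or above the score cutoff, broken out by group."""
    approved = scores >= cutoff
    return approved.groupby(group).mean()

# scores: model output per applicant (higher = lower risk)
# group:  demographic group per applicant, same index as `scores`
# rates = approval_rates(scores, group, cutoff=0.6)
# print(rates, rates / rates.max())  # ratios well below 1 suggest a skew
```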
Downvoters: please point out what is inaccurate about my comment. If I made a mistake in reading the paper, I'd like to learn what I misinterpreted.
It's also possible that if you have good data on a person's financial situation and behaviour (and, more questionably, that of their social circle), race stops being a relevant signal in the algorithm, as it was only ever a proxy for a person's finances and job security.
If there were substantial remaining bias, you could probably measure the impact of race on credit while training the model, but then remove it or treat everyone as the same race when scoring customers.
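A minimal sketch of that idea, assuming a pandas/scikit-learn setup with hypothetical column names: let the model see race (one-hot) during training so the effect is not smuggled into other coefficients, then pin every applicant to the same reference group at scoring time so it never affects an individual decision:

```python
# Train with race included, score race-blind. Columns are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_with_race(df: pd.DataFrame, features: list) -> LogisticRegression:
    X = pd.get_dummies(df[features + ["race"]], columns=["race"])
    model = LogisticRegression(max_iter=1000).fit(X, df["default"])
    model.feature_names_ = list(X.columns)  # remember training column order
    return model

def score_race_blind(model: LogisticRegression, df: pd.DataFrame,
                     features: list, reference: str) -> pd.Series:
    X = pd.get_dummies(df[features + ["race"]], columns=["race"])
    X = X.reindex(columns=model.feature_names_, fill_value=0)
    # Overwrite the race dummies so every applicant is scored as if they
    # belonged to the same reference group.
    for col in [c for c in X.columns if c.startswith("race_")]:
        X[col] = 1 if col == f"race_{reference}" else 0
    return pd.Series(model.predict_proba(X)[:, 1], index=df.index)
```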
(Disclaimer: I do ML for a consumer credit company, not in the USA).
Is it race or culture? Do you know if recent immigrants show the same "race" signal or not?
I wonder how much race signal remains after accounting for g. I know in a lot of other areas once you correct for g there is almost no race signal left.
During the days of officially/semi-officially-sanctioned racial discrimination in the US, it didn't really matter whether you were a dark-skinned person who was born in the US, or a dark-skinned person born in the Caribbean who immigrated here, or a dark-skinned person born in Africa who immigrated here. The only thing that system cared about was your perceived race, defined primarily by your skin color, and so every dark-skinned person got subjected to discrimination.
With recent immigrants today you can see some differences, but it's not due to immigrants having different "culture"; it's due to the historical baggage of the long, long period of discrimination suffered by folks who were already here. The immigrant probably has the benefit, for example, of an extended family that worked hard to save up and send someone to the US and provide advice and support and broker connections; the US-born dark-skinned person has had their extended family deliberately broken up, and subjected to policies that prevent intergenerational accumulation of wealth or other resources. And that's not the only sort of head start the immigrant gets, which means it should be totally unsurprising if we now see better outcomes on average for recent immigrants.
And now all the assumptions and stereotypes based on perceived race are being used as training data for "objective" ML/AI systems whose creators promise they're free of prejudice...
Are we not trying to determine whether these systems are picking up a signal due to race or one due to culture? The only way to do this is to look at people who share the same race but have a different culture.
The historical baggage you describe is culture. We are all shaped by our history and our family's history.