The device type (for example, tablet or mobile), the operating system (for example, iOS or Android), the channel through which a customer comes to the website (for example, search engine or price comparison site), a do not track dummy equal to one if a customer uses settings that do not allow tracking device, operating system and channel information, the time of day of the purchase (for example, morning, afternoon, evening, or night), the email service provider (for example, gmail or yahoo), two pieces of information about the email address chosen by the user (includes first and/or last name and includes a number), a lower case dummy if a user consistently uses lower case when writing, and a dummy for a typing error when entering the email address.
Let's play a game. You are in charge of a large pile of cash and want to make it grow by giving loans. Each day, two people apply, and you can give out one loan (you will have to rank the applicants). When people defraud you, you lose all of the loan. When people don't or can't pay you back, you lose all of the loan. When people pay back the loan, you make a little money.
Day1: User Agent: iPhone latest vs. Windows XP
Day2: Referral: Facebook friend vs. search "cheapest loans"
Day3: Time of interaction: 21:30 vs. 04:30
Day4: Email: ari.johnson@cs.mit.edu vs. hpqwoovz11721@hotmail.com
Day5: Funnel: Someone who spent 10 seconds vs. someone who spent 10 minutes, made a mistake in the name, entered an email address, then deleted it, and entered another email address at a different provider.
Now if your gut feeling does not point you to the first applicant every day, you look at the data for guidance. You find that the number of fraudsters and non-payers is statistically significantly higher among people with the second set of characteristics.
The alternative is to use third-party data providers. That's another can of worms. Or flip a coin and start gambling proper.
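The "statistically significantly higher" check mentioned above can be sketched as a one-sided two-proportion z-test. This is only an illustration: the function name is hypothetical, the counts in the usage lines are invented, and a real analysis would also have to deal with multiple comparisons and sample selection.

```python
import math

def two_proportion_z(bad_a, n_a, bad_b, n_b):
    """One-sided z-test: is the bad rate in group B higher than in group A?

    bad_* are counts of fraudsters/non-payers, n_* are group sizes.
    Returns (z statistic, one-sided p-value).
    """
    p_a = bad_a / n_a
    p_b = bad_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (bad_a + bad_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Upper-tail probability of a standard normal, P(Z > z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Invented counts: 20 bad loans out of 1000 "first" applicants,
# 50 bad loans out of 1000 "second" applicants.
z, p_value = two_proportion_z(20, 1000, 50, 1000)
print(f"z = {z:.2f}, one-sided p = {p_value:.5f}")
```

A small p-value only says the two groups differ in the sample; it says nothing about why they differ.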
That game is common in credit scoring classes for fresh analysts. The class is split into small teams. Each team is given ten anonymized, but real, credit applications and the corresponding credit bureau pulls: five from customers who subsequently defaulted, and the other five from customers who paid the loan back. Each team tries to guess which are which. The team that makes the most correct guesses wins.
Then the results are compared with the FICO score, and usually FICO is clearly better. Even with people with banking experience on the teams, it's very rare to see humans beat the model, partly because humans tend to base their decisions on irrelevant details and their own biases.
Let's change the game. You can allocate an investment to either of two banks. When building credit scoring models, one bank has access to just FICO scores; the other bank has access to FICO scores plus behavioral and digital signature data. Which bank do you allocate your cash to?
Now change the game so FICO is unavailable: For instance, when micro-lending to third-world country entrepreneurs. Do you still feel these digital signatures are irrelevant to making better credit risk decisions?
All banks already use behavioral scores in credit card line management. Mortgages are a different story because there's little they can do after underwriting. That said, the independent variables that go into behavioral risk scores are not like the ones from the article.
In any case that game is a pure gedankenexperiment, at least in the US. In reality banks have to comply with ECOA and a host of other rules and regulations that limit the types of data they can use in credit decisions.
There may be more freedom outside the US, but even there social media probably carries much stronger signal.
Let's say you add these digital signature variables to your credit risk scoring model anyway. The model then falls prey to confusing correlation with causation. What happens to the performance of the model?
I have no idea, as merely adding them may have no effect at all.
However, if the model depends on them exclusively (or in substantial/majority part), which I believe is the main premise, the eventual performance will depend entirely on whether the actual causal relationship that created the correlation continues to hold. If it doesn't, the model will no longer be predictive.
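A toy simulation can make this point concrete. Everything in it is an assumption made up for illustration: a hidden cause ("financial stress") drives defaults, and a digital-footprint proxy is only predictive while it happens to be linked to that cause. The probabilities and the function name are hypothetical, not taken from the article.

```python
import random

def default_rate_gap(n, proxy_linked, seed):
    """Default rate of applicants with the proxy trait minus those without it.

    proxy_linked=True: the proxy is more common among stressed applicants.
    proxy_linked=False: the proxy is independent of stress (the link broke).
    """
    rng = random.Random(seed)
    flagged = [0, 0]    # [defaults, applicants] with the proxy trait
    unflagged = [0, 0]  # [defaults, applicants] without it
    for _ in range(n):
        stressed = rng.random() < 0.3            # hidden cause, assumed 30%
        if proxy_linked:
            has_proxy = rng.random() < (0.7 if stressed else 0.2)
        else:
            has_proxy = rng.random() < 0.35      # same overall frequency
        # Default depends only on the hidden cause, never on the proxy.
        defaulted = rng.random() < (0.25 if stressed else 0.05)
        bucket = flagged if has_proxy else unflagged
        bucket[0] += defaulted
        bucket[1] += 1
    return flagged[0] / flagged[1] - unflagged[0] / unflagged[1]

print("gap while the link holds:", round(default_rate_gap(50_000, True, 1), 3))
print("gap after the link breaks:", round(default_rate_gap(50_000, False, 1), 3))
```

While the link holds, the proxy separates good and bad borrowers; once it breaks, the gap shrinks to noise even though the way defaults are generated has not changed at all.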
> Let's play a game. You are in charge of a large pile of cash and want to make it grow by giving loans. Each day, two people apply, and you can give out one loan (you will have to rank the applicants). When people defraud you, you lose all of the loan. When people don't or can't pay you back, you lose all of the loan. When people pay back the loan, you make a little money.
Actually, many people, including me, played a similar game on Prosper.com (as it was many years ago, not the current version). I can assure you that the reality is unlike your false dichotomy of losing everything [1] versus making a little money.
Even non-performing loans (that were not fraudulent from the get-go or discharged in bankruptcy) aren't totally worthless [2], as there is a thriving business in junk debt, and, presumably, at least some payments were already made.
Interestingly, I still have pennies trickling in from there occasionally, presumably from people who couldn't pay before but now can and are doing so because they know the lenders were individuals and not a faceless corporation.
[1] Prosper had a fraud guarantee, but it was a bone of contention among early users that they did not sufficiently honor it and that they subsequently changed the platform to increase borrower anonymity, thereby increasing the ease/potential for fraud.
[2] How Prosper handled these was another bone of contention.
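To see how much the "lose everything" assumption matters, here is a rough expected-profit sketch for a single loan. The formula is a deliberate simplification (one period, a flat recovery rate on defaulted principal), and all the numbers are invented; none of them come from Prosper.

```python
def expected_profit(principal, p_default, interest_margin, recovery_rate):
    """Expected profit on one loan under a simple one-period model.

    interest_margin: profit as a fraction of principal if the loan is repaid.
    recovery_rate: fraction of principal recovered after a default
                   (partial payments already made, junk-debt sale, etc.).
    """
    gain = (1 - p_default) * principal * interest_margin
    loss = p_default * principal * (1 - recovery_rate)
    return gain - loss

# Hypothetical loan: 1000 principal, 10% default risk, 8% margin.
print(expected_profit(1000, 0.10, 0.08, recovery_rate=0.0))  # total loss on default
print(expected_profit(1000, 0.10, 0.08, recovery_rate=0.3))  # partial recovery
```

With these made-up inputs, moving from no recovery to partial recovery is enough to flip the loan from a loss to a small gain, which is why the all-or-nothing framing can mislead.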
A lot of potential for injustice via false negative judgements; after all, you aren't doing anything wrong by using a particular OS at a particular time of day, etc.