I gotta say, I'm not really seeing the creepy / cringey / evil / whatever-else here...
Anyone (especially the HN crowd) should know they have the data, and if you think they're not carefully analyzing it behind the scenes (like every other tech company who has your data), I've got things to sell you. I personally think a tiny peek like this into the data, much like the usage posts that OKCupid, YouPorn, and others give, is neat.
The problem here (for me personally, at least) is that Uber is not in the business of selling dates/"encounters" and that people don't expect a ridesharing company to go right for the sexual data. Even OKCupid is straddling the line here with http://blog.okcupid.com/index.php/we-experiment-on-human-bei... noting that:
To test this, we took pairs of bad matches (actual 30% match) and told them they were exceptionally good for each other (displaying a 90% match.)
That's really not something people like having done to them. And the "HN crowd" shouldn't have an expectation of privacy and decency in data? Of course they're analyzing data, but it's really the viewpoint from which they do it that is unsettling. OKCupid says "no, duh, we're unethical. Deal with it." Uber says "Check it out! We drew a line between social security checks and prostitution!" (as waterlesscloud notes at https://news.ycombinator.com/item?id=8644138 )
There are a million more beneficial ways that people could be using the data. Fighting hunger, poverty, illiteracy, etc., to me, is a "good" use of Big Data. Looking at sexual habits (when you're not selling sex) or openly manipulating people to get data is, to me, a "bad" use.
The idea that statistics have a moral imperative to be "decent" is as fascinating as it is ridiculous. Anonymized data is not a privacy breach, and Uber probably doesn't have any data that can help with "hunger, poverty, illiteracy, etc.".
I'm sorry if the idea that "people's short overnight stays are evident in their travel data" makes you blush, but that isn't anyone else's problem.
The whole point is that it's not anonymized and it's being inspected for and used for purposes that have little to do with the ostensible customer-service agreement.
People aren't used to transit companies interrogating them about the purposes of their journeys; they just want the transit company to get them from point A to point B. (Imagine if they did this when you got in the car: "Where are you going? Why?")
And obviously from a business perspective the more you understand your customers and their motivations the better you can serve them.
But let's not kid ourselves. This isn't anonymized data. Uber's publishing in a format that is unspecific, but they have all of the detailed data and can poke through it and infer things at their leisure, and they have no compunction around how they're doing it or why.
This is why ethics and trust around data collectors is really important. Uber seems pretty cavalier about it, and that actually is a problem.
> But let's not kid ourselves. This isn't anonymized data. Uber's publishing in a format that is unspecific, but they have all of the detailed data and can poke through it and infer things at their leisure, and they have no compunction around how they're doing it or why.
That's a fairly large accusation to make.
This blog post was originally published in 2012 - two years ago. Since then has anything come out that would confirm your suspicions? I haven't seen anything.
I included the word "decent" because the way they used data goes beyond people's "overnight stays": they previously analyzed people's spending patterns and tied them to welfare checks and prostitution. They immediately call it "one of the coolest things about working for a data-driven company like Uber" afterwards. It's bad data science, not only because it's only a correlation and not an experiment, so nothing can be proven, but because they use these unproven claims to say outlandish and unethical things.
I meant "decent" in an ethical sense, not in a conservative "don't you look at my 'short overnight stays'" sense.
>It's bad data science, not only because it's only a correlation and not an experiment so nothing can be proven, but because they use these unproven claims to say outlandish and unethical things.
I don't disagree that they've not scaled any sort of pinnacle in data science, but neither do I think what they're reporting is uninteresting.
In what way is what they're saying outlandish and unethical?
A little off-topic, but I don't see why OKCupid's actions here are unethical. Their matching algorithm isn't perfect, so they shouldn't treat it as an oracle of truth. How else would they discover false negatives in their algorithm? Especially since, in this case, a false negative is worse than a false positive (not meeting someone you'll like vs having one unsuccessful date).
> How else would they discover false negatives in their algorithm?
This is exactly why research that deals with humans at Universities invariably must pass a human subjects review process. "How else would we discover X?" is certainly not reason to subject anyone to an unethical experiment. Subjecting people to what you likely believe to be a bad date should very definitely raise red flags, even if the details in practice would pass a human subjects review.
And that's the trouble: there's a tremendous space of research that just isn't ethical to carry out on actual living humans. As such, we have to find methods to determine answers to those questions that don't breach ethical standards. The burdens of discovery must lie squarely on the researchers, not on the (often unwitting) experimental subjects.
Do you think that giving someone an artificially inflated OKCupid match really rises to the standard of an unethical experiment though? OKCupid doesn't tell you who to go on a date with; they just suggest potentially good matches. (Right? I'm married and don't tend to troll dating sites, but that's my understanding.) You're free to read their profile, exchange messages, etc., before arranging a date. If it is indeed a bad match, then most likely you would realize your incompatibility early in the process.
People need to at least understand what's being done and they need to give consent before it happens. Otherwise, you're literally toying with people's lives. And in this case it's not in some insignificant way: you're manipulating their romantic and sexual endeavors.
It's actually far, far more invasive than what Uber did as they described it in the blog post.
That's a completely different category of life violation though. Imagine instead that your bank was lying to you about your account balance, modifying it to be plus or minus 3% of the actual balance. Without your consent or knowledge. All to conduct a "psychology/market experiment".
Nothing in this post involved distortion of customer data. They just linked up transaction time/date and geo-location data, then did some simple math. It's not out of the question that your payment processor could replicate this analysis once it cuts a deal to geo-tag your purchase history. Of course, almost all fixed POS hardware is geomapped, and the mobile stuff is trackable, so that's not much of a stretch.
The Uber post almost certainly did not violate anyone's privacy. They ran a bunch of aggregate queries that probably dropped any PII pretty early on. They did not publish a list of riders who took a ride of glory.
(I say they "probably dropped PII" because when you do work of this sort, PII is boring data that slows down your calculations.)
Similarly, what's wrong with observing a correlation between welfare checks and prostitution? It's an interesting observation. It's potentially useful for public policy and fighting poverty (at least American style relative poverty), though of course a more detailed investigation needs to be done.
There's a difference between this type of post and a post by OkCupid. OkCupid is a dating platform and their blog posts are net-positives for their users. What should I say in my first opening message? What do I wear in a picture to attract a mate?
By contrast, it's simply not professional and reeks of juvenile behavior for Uber to be writing a post like this. Just because you have data and have these thoughts, doesn't mean you have to do the analysis and show the world. It doesn't help their users, it's not even that interesting, and it's not relevant to their value proposition as a business.
Feels to me like someone saw the success of OKCupid and their content marketing strategy and tried to shoehorn in something similar with whatever data they had with less than stellar results.
Yes. And when considered in light of Hourdajian's statements about privacy and Uber's data-policies, it is rightly termed as "questionable" in the PDF that references this article.
I think the problem is context. Had uber been a really fun company, we would laugh at insights into our very being.
But since they are accused of trying to dig up dirt on people, this is a chilling reminder that they are more than capable of doing that, and apparently quite willing.
But this was from 2012, when Uber was a "fun" company. They were doing on-demand valentines and mariachi bands. It's in line with a small startup (which they were), and doesn't present any sensitive information.
A limousine service that uses business records to work out when passengers are fucking and then writes articles about doing this, complete with fucking graphs and even some fucking maps, I think qualifies as pretty fucking creepy.
> Anyone (especially the HN crowd) should know they have the data, and if you think they're not carefully analyzing it behind the scenes (like every other tech company who has your data), I've got things to sell you.
That's the creepy bit. Who owns that data? I want to live in a world where I own my data, and it can't be used for creepy purposes like this, or to extract additional value through arbitrage based on asymmetric information availability.
Well, in this digital age, we do not own the data even if we are the ones generating it. So I would not go to the extreme of telling companies not to use my data for any purpose, but I would definitely like some assurances from them that they won't use it for such creepy and unnecessary ends. Time and time again, we see companies abusing our data, be it Facebook manipulating the news feed for experimentation or Uber running such nonsensical studies.
You would have much more control over your personal data in the EU. Companies are required to share it with you on request and are subject to limits on how long they can retain it.
Yeah, the blog post seems fine minus the last sentence. And if they'd simply removed the last sentence, I doubt anyone would have noticed.
The Streisand effect is so well-known that I'm surprised anyone would delete a blog post nowadays.
EDIT: I actually hadn't read the blog post in detail until now, which was more than a little dumb. I thought it was just an analysis of rides along with some neat heatmap images. I didn't realize it was about sexual datapoints.
People aren't really hitting the nail on the head. It's about the level of behavior that companies release analysis on.
See, there's another company that occasionally releases interesting data analytics: Google.
See: Word frequency over time, Predicting the spread of viruses from searches, etc.
The issue is that Uber is trying to explain motive and behavior at the individual level ("I know something about you!"). This is something that would be a definite no-no for Google. The cheekiness of the language certainly doesn't help either.
And yet, there exists a huge database of precise pickup/dropoff points and times for NYC yellow taxicabs that someone obtained using a Freedom of Information Act request. http://www.andresmh.com/nyctaxitrips/
I think I'd rather my data only be available to a private company and their handful of engineers than the whole world.
I agree that the article, much like the similar OKCupid ones, is pretty interesting.
(though it's not very well-written, some analysis a bit iffy, and the guesswork towards the peaks and dips in the graph rather low-effort)
Creepy/evil maybe no, because the data is clearly anonymised. However the cringe is all over this article. OKCupid's stuff could easily be just as cringey, but they know it's important to steer clear from that. Also they're a dating site, if they wrote an article about data-mining one-night stands, that would make sense. Not so much for a taxi company, especially not in light of Uber's general attitude.
The final sentence of the article definitely crossed from "cringe" into "creepy" for me, though. In particular from someone called "Uber".
The PDF in which this article was referenced did so to illustrate the availability of this data for frivolous purposes, and is right to call it "questionable" when considered in light of Hourdajian's statements about privacy and Uber's data-policies.
To me, it's not even the use of data per se that is most creepy about this post. Really, the tone of the essay seems to revel in "having 'fun' with user data," as if a sophomore at a university wrote it.
I mean, I found the idea behind the post interesting: of course you can analyze trends in ridership to draw interesting conclusions. At the end of the day, however, it's a horrible idea to say "Hey, we know which of you are being 'frisky' and where!"
Perhaps with a different motivation, this post wouldn't be nearly as ruinous. How about ridership patterns of sick or socioeconomically disadvantaged people? That's the kind of data that can change lives for the better.
Since I know the author (he had the desk next to me in the lab), I feel I should add something here. Let me just say that he was someone who had recently finished his PhD and was taking a summer data science sabbatical with Uber. I know that they were also working on how to optimize the distribution of cars to best serve a market, but that would be pretty boring to talk about. I think what you are seeing in "having fun" with the data is actually more of the neuroscientist/psychologist coming out in him, combined with his propensity for making science topics interesting and maybe a little sensational. This is the guy who also brought us zombie neuroscience, remember (check it out if you aren't familiar). At best, we are daily trying to infer subjects' internal processing/motivations from the sorts of actual behavioral measures we collect in the lab. This is one where behavioral inferences are perhaps better than self-report. As to your other point, I can even see how knowing how many people are being frisky and where they live could be used to improve public health policy, although that was obviously not the point (and with it being kept anon, of course). But let's not forget that a cohort of Uber riders is not a very uniform sample of the population, and you are going to be somewhat limited in your ability to use it as a tool for social justice. He did a similar analysis, although more in line with our lab work, where he combined a big brain training app's anon user data with their state geography and its demographics to draw some interesting conclusions that you wouldn't have expected. These are just applications of techniques that we use in the lab to make inferences about brain activity, unravel its complexity, and weave an understandable story. I am certainly not trying to defend Uber's other data privacy issues, which are in the news and very concerning, but I don't think this is one of them. I hope knowing more of the backstory helps a little.
Agreed, it's all about the vibe. The writer feels like that uncle who conspicuously left openings for you to talk about your 18-year-old love life in every conversation. And who was just a bit too enthusiastic the time you dated a cheerleader for a few months.
He acknowledges the questionable nature of his blog posts on his website:
> Between June 2011 and August 2011 I worked with my friends over at Uber as their data scientist, writing (what I thought were) amusing, data-driven blog posts (among other, more serious roles).
archive.org respects the robots.txt of the current website owner. This can mean that they have the data but choose not to give you access to it. I have seen cases in the past where a website I once frequented became defunct, then the domain expired, then someone parked a holding page on that domain including a robots.txt that keeps archive.org from displaying the old data (which does not even belong to the current owner of the domain!).
If they wanted to, there are a number of ways Uber could prevent archive.org from displaying that blog post. Many of these ways are due to the good faith under which archive.org operates (nobody is forcing them to respect robots.txt), and some even involve resorting to legal methods. But history is always mutable.
(Nothing but love on my end for archive.org, believe me! But I do want to point out the lengths that some people will go to alter the historical record).
They could just timestamp different versions of robots.txt (which they probably do already), and respect it depending on date (which is more of a hassle, because you have to build it in your UI logic).
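A rough sketch of that idea, for the sake of argument: pick whichever robots.txt was in effect on the capture date and check the capture against it. This is not how archive.org actually works; the function names, the sample dates and rules, and the "ia_archiver" user agent string are all assumptions made for illustration. Only the standard library's urllib.robotparser is used.

```python
from bisect import bisect_right
from datetime import date
from urllib.robotparser import RobotFileParser


def robots_in_effect(versions, capture_date):
    """Return the robots.txt text that applied on capture_date.

    versions is a list of (date, robots_txt_text) pairs sorted by date.
    Returns None if no version had been seen yet on that date.
    """
    dates = [seen_on for seen_on, _ in versions]
    index = bisect_right(dates, capture_date)
    return versions[index - 1][1] if index else None


def may_display(versions, capture_date, url, agent="ia_archiver"):
    """Decide whether a capture may be shown, using the rules of its own era."""
    text = robots_in_effect(versions, capture_date)
    if text is None:
        return True  # assumption: no robots.txt on record means no restriction
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical history: open at first, then a new owner blocks everything.
history = [
    (date(2012, 1, 1), "User-agent: *\nDisallow:\n"),
    (date(2014, 11, 1), "User-agent: *\nDisallow: /\n"),
]
old_capture_visible = may_display(history, date(2012, 6, 1), "http://example.com/post")
new_capture_visible = may_display(history, date(2014, 12, 1), "http://example.com/post")
```

Under this scheme the 2012 capture stays visible even after the 2014 robots.txt blocks everything, which is the behavior the suggestion is after.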
That would not solve the problem they're trying to solve.
Let's say I post something that I shouldn't have posted -- insider stock information, nude photos, whatever. Perhaps something illegal for me to post. I need to make it go away.
I need to be able to create a robots.txt today which affects stuff I posted yesterday.
This is why archive.org respects the current robots.txt for access to past content.
In another deleted post [0] the author talks about using a name-to-gender API to look at ride locations by gender, which implies that these analyses were not done using anonymized data.
You have to start with the original data, which is obviously de-anonymized. Full data -> [gender, time, origin neighborhood, destination neighborhood] leaves you with a pretty anonymous dataset, and is all that would be required for this analysis.
Internal metrics teams nearly always have access to complete data. The issue is sharing non-anonymized data externally.
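To make the "Full data -> [gender, time, origin neighborhood, destination neighborhood]" step concrete, here is a minimal sketch. The record type, field names, and sample values are hypothetical, and truncating the timestamp to the hour is an extra assumption of mine; nothing here is known to reflect Uber's actual schema or pipeline.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RawRide:
    # Hypothetical internal record; field names are assumptions for illustration.
    rider_name: str
    gender: str
    requested_at: datetime
    origin_neighborhood: str
    destination_neighborhood: str


def reduce_ride(ride: RawRide) -> tuple:
    """Keep only (gender, time, origin neighborhood, destination neighborhood)."""
    # Truncating to the hour is an added assumption, not something stated above.
    hour = ride.requested_at.replace(minute=0, second=0, microsecond=0)
    return (ride.gender, hour, ride.origin_neighborhood, ride.destination_neighborhood)


# Made-up sample record.
raw = RawRide("Jane Doe", "F", datetime(2012, 3, 24, 23, 41, 7), "Mission", "SoMa")
reduced = reduce_ride(raw)
```

The name never leaves reduce_ride, but whoever runs it still has to be able to read the full record, so this does not remove the internal-access question.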
I agree it's possible that the name-to-gender mapping was done before the full ride data was handed over to this analyst. (Though just removing real names would still leave a lot to be desired in the anonymizing process).
However there's no mention in these posts of such safeguards, and subjectively the post reads more like the analyst is just fishing around in the full raw dataset of ride times, start and end locations, and names. To wit:
"What else can we learn? First, we can devise a way to statistically assess whether there are more women or men in a neighborhood than we’d expect. [...] We used Rapleaf’s Name to Gender API to assess the likelihood of a rider’s gender given their name, only accepting a match if the probability was >= 95%."
And in the original post, he categorizes rides as possibly related to a late-night hookup based on whether the destination and departure points for 2 rides are within 0.1 mi of each other.
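For a sense of how little machinery those two rules need, here is a sketch of both: the ">= 95%" acceptance threshold and the "within 0.1 mi" test. The function names are made up, the haversine formula is my choice of distance measure (the posts don't say which one was used), and no real name-to-gender API is called.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def accept_gender_guess(label, probability, threshold=0.95):
    """Keep a guessed gender only if the reported probability is >= threshold."""
    return label if probability >= threshold else None


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def endpoints_close(dropoff, pickup, max_miles=0.1):
    """True if the first ride's dropoff and the second ride's pickup are within max_miles."""
    return haversine_miles(*dropoff, *pickup) <= max_miles


# Made-up coordinates: 0.001 degrees of latitude is roughly 0.07 miles.
nearby = endpoints_close((42.360, -71.060), (42.361, -71.060))
farther = endpoints_close((42.360, -71.060), (42.362, -71.060))
```

Note that both inputs, a rider's name and exact start and end coordinates, are exactly the raw fields under discussion.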
>Internal metrics teams nearly always have access to complete data. The issue is sharing non-anonymized data externally.
I disagree pretty strongly with this. Do you think that your average Uber rider would be OK with Uber employees analyzing their ride patterns (with their real names attached) to try to figure out where and when they are having sex? Do you think Uber should allow such access to its employees by policy? (It seems we agree that writing a blog post about it is not a great idea.)
> Do you think that your average Uber rider would be OK with Uber employees analyzing their ride patterns (with their real names attached) to try to figure out where and when they are having sex?
I don't see how this is any different from Google analyzing search data to try to figure out whether I'm pregnant. You could make the argument that "it's an algorithm," but at some point someone had to sit down and build that model.
> Do you think that your average Uber rider would be OK with Uber employees analyzing their ride patterns (with their real names attached) to try to figure out where and when they are having sex?
Sure, as long as Uber isn't broadcasting that information with their name attached. The average person really doesn't care about (or understand the extent of) data analysis (from companies or the government) -- what they care about is public disclosure which may mean personal embarrassment or a lawsuit or other form of inconvenience. People who want to control all their data are hoping for a fantasy world where observations and inferences by third parties are magically made impossible. The reasonable thing to focus lawmaking efforts on is limiting legal forms of disclosure and standardizing safe storage requirements for the raw data -- indeed such laws already exist, with the HIPAA privacy rule perhaps being the best known in the US.
HIPAA's not a great example for you to use, since it does in fact limit access to protected information by employees (under a 'minimum necessary' standard) [0]. You can even serve time in federal prison for a violation without disclosing anything [1].
>People who want to control all their data are hoping for a fantasy world where observations and inferences by third parties are magically made impossible.
I think you are setting up a straw man here. What I suspect the average user expects is for their sensitive personal data to be dealt with in a professional and respectful way, with protections against abuse by rogue employees. There are plenty of companies who deal with private data and understand this well. Potatolicious had a comment on another Uber thread detailing the hoops an Amazon employee has to go through to get access to private customer data [2].
Scrubbing these posts suggests that Uber realizes that they have a real problem, at least at the PR level. I wouldn't be surprised if they are also getting more serious about controls on internal access to ride data.
I think HIPAA is a great example precisely because it goes beyond "don't disclose this"; it also regulates "safe storage requirements", whose purpose is ultimately to make unwanted disclosure (through breaches, rogue employees, etc.) less likely, at whatever scale. (e.g. my plaintext password for a service shouldn't ever be disclosed to even a single person.) I think we're in agreement about people generally expecting professionalism.
Personally, I think they do it with raw access to the database. https://www.uber.com/legal/usa/privacy under the heading "How do we use the information collected" says nothing about anonymizing rideshare data.
Yeah, frankly this doesn't seem all that bad compared to some of the words coming out of executives' mouths.
One could do an analysis like this while still working with anonymized data. Still a bit creepy, but not that different from reports and blog posts you see from other startups and tech companies.
Uber is suffering from a lack of credibility built up by years of mild-to-moderate asshole behavior.
Nothing they've done so far, in isolation, is IMO worth the pitchforks being handed out in tech and mainstream consciousness right now, but taken as a whole it's pretty easy to see why people aren't willing to cut Uber any slack or give them the benefit of the doubt.
So yeah, this thing by itself isn't "that bad", but it's one piece of a large puzzle of Uber's misbehavior.
Off-topic: Because of situations like these, I'm surprised that part of the checklist when launching a PR blog is not: "Block googlearchive/archive.org robots"
There have been very, very few times when a company's webpage was down and I needed to go to google-archive or archive.org to refer to some innocuous information. However, the times that I've used those sites to gather evidence of possible whitewashing? Many, many times, in comparison.
OKCupid is a dating website which deliberately branded themselves as further on the "edgy" and "hookup" side of dating websites. Then you have POF somewhere in the middle, with eHarmony way on the other side, quite opposite of OKCupid.
I'm not sure why Uber would want to put themselves anywhere on that same scale (i.e. aligning your brand with notions of sex and one night stands). There's a time and a place for everything, and for edgy data analysis like this -- that "place" is edgy dating websites who want to be known for hooking up.
It's unprofessional and out of line with their brand image, obviously why the post got deleted. IMO this further validates all the bad press the media has been publishing about Uber.
It can be both, like the annoying coworker who won't shut up about how much sex he had over the weekend, or the guy who raves about his favorite scotch at an AA meeting.
A correct observation does not shield it from being inappropriate or in poor taste.
Sex is OkCupid's business, while it's not Uber's business. And yes, a lot of people find fault with their blog posts too. You'll see lots of comments like that on any related HN submission.
Interesting. In the more socially-aware fora I'm familiar with, what's called "tone policing" or "tone trolling" is seen as a major faux pas, an aggressive act designed to shut people out of the conversation.
Are you implying you agree with that? I always thought that accusations of tone policing were the most childish thing about that kind of forum/community.
The Boston anomaly and map are interesting. The requirements are a trip from 10pm-4am and a trip again 4-6 hours later. However, due to the MBTA shutting down around 12:30am and bars closing at 2am, the criteria capture just about any partygoer.
So if you took an Uber to some bar/club/friend's place at 10-11pm and again after 2am, when all the bars and the T are closed, you're likely counted. I doubt this represents customers having one night stands; it's likely just a heat map of nightlife. This is further explained by the small pocket in Somerville that is not accessible by train, only by bus, where people may opt for an Uber.
That's not to say that there are no rides of glory or whatever the hell kids call it today.
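The criteria as described above are easy to write down, which makes the false-positive problem easy to see. This is only a sketch of the rule as paraphrased in this comment (first trip between 10pm and 4am, second trip 4-6 hours later); the function name, the inclusive boundaries, and the sample times are assumptions.

```python
from datetime import datetime, timedelta


def matches_criteria(first_ride: datetime, second_ride: datetime) -> bool:
    """First ride starts between 10pm and 4am; second ride is 4 to 6 hours later."""
    in_window = first_ride.hour >= 22 or first_ride.hour < 4
    gap = second_ride - first_ride
    return in_window and timedelta(hours=4) <= gap <= timedelta(hours=6)


# A made-up bar-goer: rides out at 10:30pm, rides home at 2:45am after closing.
to_bar = datetime(2012, 3, 24, 22, 30)
home_after_close = datetime(2012, 3, 25, 2, 45)
bar_goer_counted = matches_criteria(to_bar, home_after_close)
```

The gap here is 4 hours 15 minutes, so an ordinary night out satisfies the rule without anything else being true of the rider.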
Would google publish data that shows how searches for porn spike during different times of the day or times of the year, as if it's some "cool and hip and edgy!" insight?
I don't think so.
And for the same reason they don't (whatever reason that is), it would probably also be wise for Uber not to post stuff like this.
I really don't care, nor am I offended. I'm just speculating that Uber doesn't have the brightest team of execs and still has a lot of "growing up" to do.
Google have been fighting a public relations war for a long time now to not appear creepy or stalkerish. I can think of few things they could blog about to make people consider not using Google more than "we know when you're looking for porn".
Uber have not (yet?) been widely called out as being creepy the way Google have. But Uber have data that can be every bit as personal as your search history, and posts like these make it obvious that people at Uber are thinking hard about putting those data to use.
There's a lot lurking under what at first glance appears to be merely a poorly-considered sophomoric post.
Curious to see how companies deal with this kind of data because, ignoring the creep-factor, it is thoroughly interesting to see these sorts of patterns emerge and the only way you can find out this kind of stuff is to track people.
It's the actions of the unscrupulous minority that ruin this for the rest of us. I personally believe that most of the time when companies say "We simply aren't that interested in you." they're probably telling the truth. Stats is pointless if you look at single points. It only takes one person to snoop on an ex or to blow everything up. Unfortunately you have to mitigate that risk, but proper database sanitisation before handing over to the analysts should be sufficient. Provided there is no overlap between the sensitive database and the one the analysts have access to there shouldn't be a problem.
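One way the "sanitise before handing over" step could look, as a sketch only: replace the rider identifier with a keyed hash so analysts can still group rides by rider without seeing who the rider is, and copy across only the coarse fields. The field names and record layout are hypothetical, and using HMAC-SHA256 is my choice for illustration; the key would have to stay on the sensitive side.

```python
import hashlib
import hmac


def pseudonymize(rider_id: str, secret_key: bytes) -> str:
    """Stable pseudonym for a rider; not reversible without the secret key."""
    return hmac.new(secret_key, rider_id.encode("utf-8"), hashlib.sha256).hexdigest()


def sanitise(record: dict, secret_key: bytes) -> dict:
    """Build the analyst-facing row: pseudonym plus coarse fields, nothing else."""
    return {
        "rider": pseudonymize(record["rider_id"], secret_key),
        "pickup_hour": record["pickup_hour"],
        "neighborhood": record["neighborhood"],
    }


# Made-up records and key, for illustration only.
key = b"kept-only-in-the-sensitive-database"
row_a = sanitise({"rider_id": "u1", "name": "Jane Doe", "pickup_hour": 23, "neighborhood": "SoMa"}, key)
row_b = sanitise({"rider_id": "u1", "name": "Jane Doe", "pickup_hour": 4, "neighborhood": "Mission"}, key)
row_c = sanitise({"rider_id": "u2", "name": "John Roe", "pickup_hour": 23, "neighborhood": "SoMa"}, key)
```

Pseudonyms alone don't stop someone from recognising a person by their ride pattern, so this reduces the risk of casual snooping rather than eliminating it.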
I guess it's a side effect of becoming 'big' that you can no longer run this kind of public post without looking extremely unprofessional.
So they blog about it in aggregate. It's not like they would know exactly who each of those riders were and would think about using that data for anything other than the lulz. I'm sure this sort of data wouldn't be interesting for social engineering purposes in the hands of 'others' as well.
It is a bit unsettling that this information is out there, but I agree that it is fairly obvious that they have this data. And as long as they are not exposing the individuals, I don't see it as irresponsible to publish something like this.
There was a related story published recently: "NYC Taxicab Dataset Exposes Strip Club Johns and Celebrity Trips".
I think it's a lot of fun and they pulled some interesting data from it. There are far more pressing concerns in the world right now, people have sex sometimes NBD.
Published 2.5 years ago, removed in the past couple of days, after recent events. It's not like they've grown up over the years, it's that they got called on it.
Uber employees are the kind of people who kept a telescope in their bedroom window to peep on girls down the street. I'll never understand why they're still in business.
They're still in business because they are far better than the entrenched, governmentally-appointed taxi companies in any locality I've been in, except for perhaps Manhattan.
Manhattanite here. Taxis are still terrible for some things. While it's almost never impossible to find one, they still do the same tricks of "oh, the credit card machine is broken," and "I'll only drive uptown right now," and "I don't drive to JFK in the morning." I can't stand a lot of Uber's ethical practices, but I still prefer them over cabs.
Another Manhattanite here, I actually prefer cabs to Uber, though I prefer Lyft over both.
Uber drivers in the last year have, without fail, become much worse at pathfinding than cab drivers. I've had drivers completely miss major turns or get lost while the meter's running. And more recently, I've noticed a pattern of behavior where I'd call an Uber and the car wouldn't even begin moving for > 5 minutes.
I'm not sure what that's all about, maybe they're waiting for Surge to kick in in the hopes of getting a fatter fare? Either way, I have not had an Uber arrive within the estimated time for over a year.
Lyft drivers on the other hand start moving right away after they're assigned.
> "oh, the credit card machine is broken," and "I'll only drive uptown right now,"
It sucks that you have to deal with it, but the solution is really simple. Just get in the cab; don't be a sucker who tells the driver where you're going through the window. If they balk, take out your phone, take a picture of their license at the back, and tell them you're dialing 311. They will immediately fold and take you where you're going.
Ditto credit card - if the credit card machine is broken they are obligated to tell you at the beginning of the ride, and if they don't you can walk away for free.
Ditto the JFK thing - a cab cannot refuse a fare within city limits.
I've never seen a cabbie not fold like a house of cards when threatened with a 311 call. For all its warts the T&LC actually polices driver complaints pretty hard.
They provide a convenient, affordable service that people need. I doubt any significant number of their users care or even know about the company's shady and unethical doings.