Harvesting LinkedIn data for fun and profit (cloudinvent.com)
216 points by stremovsky on Sept 11, 2019 | 57 comments



This is a pretty tame use of the 2012 LinkedIn breach. The breach also contained unsalted hashes, which have mostly been cracked by now.

They all ended up in a huge collection (773 million records) containing email/password pairs from many different sources: https://www.troyhunt.com/the-773-million-record-collection-1...

With so many password variations for a user, you can do credential stuffing to crawl all the private accounts of an email to build a pretty complete profile of the person (not just correlate some linkedin profile like in this post). I am sure someone out there is already doing this for profit.


I used to (ab)use their Outlook Social Connector (OSC) through the now-gone API (https://outlook.linkedinlabs.com/osc/people/details); they stopped it in 2015. I used it just to get names and profile images for easy onboarding (like Gravatar).

There was an LSC-Signature header that was a sha1_hmac("POST%2Fosc%2Fpeople%2Fdetails$auth_token$unix_timestamp", "aa15bd5f089eb93a5b2b4a0e11443cb78e44f34d"), which I reversed from the Social Connector DLL. I never found it posted online, but others must have done the same.
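For anyone unfamiliar with this kind of header, here is a minimal Python sketch of an HMAC-SHA1 signature over a string shaped like the one above. Everything specific in it is an assumption: the secret is a placeholder, the token and timestamp are made up, and I am guessing that `$auth_token` and `$unix_timestamp` are interpolated variables concatenated with no separator. The endpoint is long gone, so this only illustrates the construction.

```python
import hashlib
import hmac
from urllib.parse import quote

# Placeholder secret: hypothetical, not a real key.
PLACEHOLDER_SECRET = b"example-secret-key"


def build_message(method: str, path: str, auth_token: str, unix_timestamp: int) -> str:
    """Assumed layout: URL-encoded method+path, then token, then timestamp."""
    # quote(..., safe="") also encodes "/", so "POST/osc/people/details"
    # becomes "POST%2Fosc%2Fpeople%2Fdetails".
    return f"{quote(method + path, safe='')}{auth_token}{unix_timestamp}"


def sign(message: str, secret: bytes) -> str:
    """Hex-encoded HMAC-SHA1 of the message under the given secret."""
    return hmac.new(secret, message.encode("utf-8"), hashlib.sha1).hexdigest()


# Made-up token and timestamp, for illustration only.
message = build_message("POST", "/osc/people/details", "TOKEN", 1400000000)
signature = sign(message, PLACEHOLDER_SECRET)
```

Note that Python's `hmac.new` takes the key first and the message second, the reverse of the `sha1_hmac(message, key)` order written above. A server checking such a header would typically recompute the value and compare it with `hmac.compare_digest`.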


I don't see the problem with collecting public profiles. They are, you know, public, and people entered their own data in the interest of propagating it.

Nonpublic profiles (or nonpublic data from public profiles) from, say, FB would be different.


> and people entered their own data in the interest of propagating it

To humans, not machines. No one joined LinkedIn to be marketed to by random jabronis or to be added to a CRM for BS intro emails that don't have opt-out links despite being automated.


I don’t think that’s fair to say. I signed up for LinkedIn to publish data to the public that I’m comfortable publishing. I understand that this means random jabronis emailing me and put info out accordingly.

I’m not sure how many people feel like me or feel something else. But I don’t think it’s possible to say people post public profiles only for humans to read. I’m really glad my profile gets harvested by Google and DDG, and I have a public email on purpose.


Yeah, we signed up to have that happen inside of the platform, not outside of it.


Emails are not exposed in public profiles.


That's true to a point. Most workplaces use a standardized format for email addresses. If you have somebody's name, you can send them official looking spam.


Just four letters: GDPR. And anyway: just because something is published does not mean that you can collect it, store it, etc. however you like. That is pretty basic in my opinion.


What does GDPR have to do with the parent post?


The parent post explicitly says they don't have a problem with 3rd parties storing personal information hoovered up from LinkedIn.

The GDPR is very relevant here - restrictions around the storage and processing of personal information are the whole point of it.

While this might seem like a grey area (given the information is public), the GDPR is actually very clear here - you cannot store and process PI without the consent of those individuals.


Not in any way, shape, or form defending 'Harvesting LinkedIn data' - I think it's quite bad. But I'm getting a little concerned by all these 'but GDPR!' arguments I'm seeing out there.

The EU is not a world government, and the GDPR should not apply to non-EU citizens. Europe cannot regulate what I do here in America - the law is simply not applicable. (And I say this as someone who supports more tech company regulation here in the US!) Things like France trying to apply your 'right to be forgotten' to the entire world's Google search results are extremely troubling.

Don't apply your country/region's laws to non-citizens, please :)


America does this all the time though. By your reasoning, surely the DMCA should not apply to Youtube users from Europe?


(As an American) I don't think that America should have that much power, no. I'm also cautiously keeping an eye on the direction Chinese regulation will go here too.


GDPR protects the rights of EU residents (vs only citizens). As an American, if I reside in the EU, I am protected by GDPR.


While this does not apply to LinkedIn, GDPR does not protect you at all, even if you are a full-fledged EU citizen, on sites that do not intend to serve the EU market. Mere accessibility from the EU is not enough to prove this intent.

One could also argue that by having a fine structure that disproportionately affects small businesses (thus consolidating power, money, and personal data in the hands of a few large businesses), GDPR doesn't protect you even on those sites that are subject to it. Some might say that it is actually a privacy killer. But I'll leave that discussion for another day.


> a fine structure that disproportionately affects small businesses

That's simply untrue. From fines already levied we've seen small businesses getting fines of a few thousands, while BA is getting a fine of a couple of hundred million pounds.


The facts are not disputable, as they are contained in the plain text of GDPR for everyone to see. The legislation allows for fines of up to 4% of global revenues, or €20 million, whichever is greater. So the “Googles” of the world face fines of no more than 4% of a single year of revenue. Small businesses face potential fines that could be 100,000% (or more) of their annual revenue, because most businesses make far less than €20 million annually.

That seems like the very definition of disproportionate to me.
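To make the arithmetic concrete, here is a small Python sketch of the ceiling as described above (the greater of 4% of global annual revenue or €20 million). The revenue figures are hypothetical round numbers chosen for illustration, and the function computes only the statutory maximum, not any fine actually levied.

```python
FLAT_CAP_EUR = 20_000_000   # fixed ceiling quoted above
REVENUE_SHARE = 0.04        # 4% of global annual revenue


def max_fine_eur(annual_revenue_eur: float) -> float:
    """Upper bound on the fine: whichever of the two ceilings is greater."""
    return max(REVENUE_SHARE * annual_revenue_eur, FLAT_CAP_EUR)


def max_fine_as_percent_of_revenue(annual_revenue_eur: float) -> float:
    """The same ceiling expressed as a percentage of annual revenue."""
    return max_fine_eur(annual_revenue_eur) / annual_revenue_eur * 100


# Hypothetical businesses, not real companies.
tiny_shop = max_fine_as_percent_of_revenue(20_000)              # 100,000%
mid_size = max_fine_as_percent_of_revenue(5_000_000)            # 400%
large_corp = max_fine_as_percent_of_revenue(100_000_000_000)    # about 4%
```

The two ceilings cross at €500 million in revenue (4% of that is €20 million). Below that point the flat cap dominates, which is where the 100,000% figure comes from for a business earning €20,000 a year.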


That's the maximum fine. The fine that would be levied in a particular case would be dependent on the circumstances of the case.

> Due regard should however be given to the nature, gravity and duration of the infringement, the intentional character of the infringement, actions taken to mitigate the damage suffered, degree of responsibility or any relevant previous infringements, the manner in which the infringement became known to the supervisory authority, compliance with measures ordered against the controller or processor, adherence to a code of conduct and any other aggravating or mitigating factor. The imposition of penalties including administrative fines should be subject to appropriate procedural safeguards in accordance with the general principles of Union law and the Charter, including effective judicial protection and due process.


EU law requires that all fines and penalties be proportionate. Fining a small business 100,000% of their annual revenue is clearly not proportionate. It's groundless FUD.

The proportionality is not in GDPR or any individual law, but set out in the framework treaty under which all EU laws function.


That's not accurate. The part that I think is causing you some confusion could be this section of chapter I

> (23) In order to ensure that natural persons are not deprived of the protection to which they are entitled under this Regulation, the processing of personal data of data subjects who are in the Union by a controller or a processor not established in the Union should be subject to this Regulation where the processing activities are related to offering goods or services to such data subjects irrespective of whether connected to a payment

...so if you're offering any goods or services to non-EU citizens who are in the EU but you are a non-EU company, GDPR still applies if the processing relates to offering them goods and services.

Note however:

> (22) Any processing of personal data in the context of the activities of an establishment of a controller or a processor in the Union should be carried out in accordance with this Regulation, regardless of whether the processing itself takes place within the Union

...and...

> (24) The processing of personal data of data subjects who are in the Union by a controller or processor not established in the Union should also be subject to this Regulation when it is related to the monitoring of the behaviour of such data subjects in so far as their behaviour takes place within the Union.

So monitoring of EU data subjects by non-EU companies and processing data relating to their activities in the EU are definitely covered by GDPR even if you don't intend to offer them goods and services.

Text above quoted from the English text of GDPR as at https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CEL...


You're actually ignoring the relevant part of recital 23:

"Whereas the mere accessibility of the controller’s, processor’s or an intermediary’s website in the Union, of an email address or of other contact details, or the use of a language generally used in the third country where the controller is established, is insufficient to ascertain such intention, factors such as the use of a language or a currency generally used in one or more Member States with the possibility of ordering goods and services in that other language, or the mentioning of customers or users who are in the Union, may make it apparent that the controller envisages offering goods or services to data subjects in the Union."

In other words, don't offer a site in EU languages, accept EU currencies, or ship to the EU and GDPR does not apply (unless you are based there).


> The GDPR is very relevant here

If the author of the parent post isn't in Europe, why is it relevant?


Perhaps I should have been clearer - it's very relevant here, if you are in the EU or do business in the EU.


Just a reminder that using data obtained illegally (e.g. the 2012 LinkedIn hack) is also illegal (depending on your jurisdiction, IANAL, et cetera...)


I don’t think that’s correct. As long as you didn’t collude in the illegal act, it’s legal in the entire US. Not sure about other countries.

This came up most recently with the Manning/Snowden leaks: while the leaks were illegal, using the info is not. There was a lot of press, and here’s a decent post by a law professor explaining the legality: https://jonathanturley.org/2016/10/17/cnn-it-is-illegal-for-...

Data is different from stolen property in that it isn’t property. Once data is made public it is no longer proprietary so it can be used legally even if it was originally obtained illegally. This is different if you were paying for non-public stolen data, but no one here is talking about that.


At least in the western United States, scraping is fine: http://cdn.ca9.uscourts.gov/datastore/opinions/2019/09/09/17... . LinkedIn was even a party to this case!

The UID->email data dump was not, AFAIK, legal though.


That is not at all what that ruling says. The ruling just upholds an injunction until the case is decided. It explicitly does not purport to provide any precedent on the legality of scraping.


That's not exactly right. You're correct the ruling only upheld a preliminary injunction. Like you imply, that's a provisional remedy before trial, subject to change. But in practice the court is unlikely to change its views about how the CFAA operates. And rulings about preliminary injunctions are frequently cited as precedent.

Here, the opinion strongly suggested the CFAA does not prohibit scraping of publicly available data:

"It is likely that when a computer network generally permits public access to its data, a user’s accessing that publicly available data will not constitute access without authorization under the CFAA." (Opinion at 33.)

Since this was a preliminary injunction, this passage won't be binding on other courts; however, it certainly will be cited as persuasive precedent. So will more policy-oriented passages like the following:

"giving companies like LinkedIn free rein to decide, on any basis, who can collect and use data—data that the companies do not own, that they otherwise make publicly available to viewers, and that the companies themselves collect and use—risks the possible creation of information monopolies that would disserve the public interest." (Opinion at 36.)


I am no legal expert, but the judge explicitly warns against reading too much into him upholding the injunction:

> I emphasize that appealing from a preliminary injunction to obtain an appellate court’s view of the merits often leads to “unnecessary delay to the parties and inefficient use of judicial resources.” Sports Form, 686 F.2d at 753. These appeals generally provide “little guidance” because “of the limited scope of our review of the law” and “because the fully developed factual record may be materially different from that initially before the district court.”

The opinion does cite other 9th circuit decisions that imply that the 9th circuit believes that the CFAA does not prohibit scraping, but also explicitly notes that the CFAA is not the only relevant law.

> We note that entities that view themselves as victims of data scraping are not without resort, even if the CFAA does not apply: state law trespass to chattels claims may still be available

I don't see where my claims overstep what is laid out in that opinion.

Edit: The opinion does indicate that there is a good chance that the 9th circuit will eventually rule that public scraping is not covered by the CFAA, but even if the 9th circuit Court does make that ruling, that still would not mean that scraping is legal under other laws.


Your point that other laws might (or might not) apply is a good one that people should note. "Trespass" is one. LinkedIn apparently alleged violation of the DMCA too. It chose not to press those issues in the appeal, so the opinion isn't direct authority regarding those other laws.

Where I disagreed is with your statement that the case "does not purport to provide any precedent on the legality of scraping." It does. It is persuasive precedent that the CFAA does not bar scraping publicly available data.

The first quote you mention ("I emphasize...") is from one judge's concurring opinion, not the court's full opinion. Further, the "little guidance" part of that quote doesn't mean that the opinion provides "little guidance" in general. The concurring judge was making the point that the parties shouldn't have delayed a full trial while waiting for the appeal. By "little guidance," he meant that the appeal provides "little guidance" to these particular parties about how a full-fledged trial will play out.


Please show me where there is a definitive statement in the opinion that the CFAA does not apply to scraping?

The language is very consistent and careful about not doing that because the court did not rule on that matter.


Something doesn't need to be "definitive" to qualify as precedent. See the quote I mentioned above at page 33 ("It is likely..."). That's precedent.


"Precedent" has a specific meaning here, in the context of legal cases:

> In common law legal systems, precedent is a principle or rule established in a previous legal case that is either binding on or persuasive for a court or other tribunal when deciding subsequent cases with similar issues or facts.[1][2][3] Common-law legal systems place great value on deciding cases according to consistent principled rules, so that similar facts will yield similar and predictable outcomes, and observance of precedent is the mechanism by which that goal is attained.

There was no principle or rule established here regarding the CFAA, thus no precedent that must be considered by other courts.

Courts generally try to restrict their rulings to the minimal needed to decide any particular case.

If the court had made a ruling, they would not make a point of qualifying all the statements about the CFAA the way they did.


Sorry for just getting back to this.

The case did establish "principle[s]" that are "persuasive" for courts deciding subsequent cases. Put it this way: Say someone gets indicted for violating the CFAA by scraping a public site. You bet their attorneys will cite hiQ v. LinkedIn as persuasive precedent for dismissing the indictment. And the court, "when deciding" that case, absolutely will consider the Ninth Circuit's statement that it's "likely" that accessing "publicly available data will not constitute access without authorization under the CFAA."

Here's another point: When the Ninth Circuit decides a case, it chooses whether the decision is "published" or "unpublished." The Ninth Circuit rules expressly say that "unpublished" decisions are not precedent.

> Ninth Circuit Rule 36-3(a): "Not Precedent. Unpublished dispositions and orders of this Court are not precedent...."

Here, the Ninth Circuit chose to issue hiQ v. LinkedIn as a published case. If the Ninth Circuit wanted the case not to be precedent, it would not have done so, and easily could have made it "unpublished."


I'm pretty sure if you're using it for research purposes like the numerous other people out there and you can prove it's not malicious, you'll be fine.


Don’t keep those bombs like Capital One hacker Paige Thompson did.

You might hit an extra jailtime jackpot.


Where's the profit part? How did you monetize stolen data?


It's a figure of speech.


I think he is referring to the start up he was working for.

>During my work on the start-up, I developed techniques that allow me to collect and cross-reference a lot of personal data including data from LinkedIn.


By getting publicity he builds social capital, which he (presumably) can turn into real capital through his work.

From his bio "...leading the evolution of startups and enterprises to achieve the highest level of security and compliance."


Here's a friendly Python library that is ideal for this: https://github.com/tomquirk/linkedin-api


> Get rid of duodecimal profile ids. Obscurity is not a solution here.

I don't think it's meant to be a security element, but to disambiguate same name collisions, right?


This is true. The profile URL is customizable, and the profile id can be removed from the url.



I don't really understand the author's claim that...

https://il.linkedin.com/directory/people-a-1/

...contains links to all public LinkedIn profiles. I looked for a bunch of people I know with public profiles and they weren't in there (and neither was I).


This specific subdomain lists people in Israel (IL.linkedin.com). There are other subdomains for other countries.


I think I missed it, but how did he get their emails? That part I didn't understand as I was hoping LI didn't expose that...


>Searching on Google, I found the database from the LinkedIn 2012 hack. Each record had a user id and an email without additional information.

>The link to the LinkedIn user profile was missing and personal information was lacking. As a result, it was not very useful.

I think he is downplaying the value of that hacked database. Without it what would he have? userid and profile url combos...


You can also enumerate users based on phone numbers, so you don't need the database in that case. It's 10k numbers per account, probably also resettable somehow, but I haven't spent much time on it because LinkedIn didn't consider it an issue.


Ah right, just seems silly then, as he is relying on a hacked DB for the most private part.


This seems pretty useless...


He took two data sources (LinkedIn and LinkedIn data hack) and combined them to get first, last, email, linkedin profile ID. Imagine a spammer having millions of active email addresses with first/last.


> Imagine a spammer having millions of active email addresses with first/last.

Don't they already? There have been SOOO many breaches in this area that I rather doubt there are many active emails that don't have some publicly available dataset linking them to first and last names. The valuable thing here is linking that data to the LinkedIn profile ID.


> For the past 15 years I’ve been leading the evolution of startups and enterprises to achieve the highest level of security and compliance.

... serving up over unsecured http


That's creepy. Good catch, OP.


This is clickbait and does not belong on the front page of HN. The author says he scraped public profile URLs and names. When you make your profile on social websites public, then you have chosen to...make them public. He then claims he has emails from LinkedIn, but those emails are from an old data breach, and he even admits that the emails are limited to those found in a 2012 data breach.

Finally, the title of this article says he did this for “fun and profit”. By the author’s own admission, the “profit” part is missing here. He claims the company “...went out of business without getting funding”.

So in other words, he has access to the main LinkedIn website, found a link to a database from an old data breach, and used to work for a now defunct company. None of that translates to “Harvesting LinkedIn data for fun and profit”.



