Building a Dark Web Crawler in Go (creekorful.me)
283 points by aadlani on Sept 23, 2019 | 107 comments



First of all, it's hidden services, not dark web.

Second, to anyone crawling hidden services or crawling over tor, please run a relay or decrease your hops. Don't sacrifice others' desperate need for anonymity for your $whatever_purpose_thats_probably_not_important. It could be a fun thing for you to do, but some people are relying on tor to use the free, secure and anonymous Internet.


Actually, the opposite is true.

People who actually need anonymity need to hide among traffic that is boring. If you reduce the number of hops your crawler is using, you're reducing the amount of boring traffic and making it easier to find the interesting people.

Running a relay in addition to using Tor in the normal way is a good idea, however, as it increases the bandwidth of the network.


In fact it is a bit more severe than that as you are effectively deanonymizing yourself. If everyone else is using a 3-hop circuit but your crawler is using just 2 hops, it wouldn't take much effort to isolate your activity in the network since you're effectively standing out.


There is plenty of traffic in which to hide already. Another bot making and breaking random connections 24/7 is of no additional help.


For one project, I agree, it makes no difference.

But if the received wisdom becomes "if you're not rebelling against an oppressive regime, you should only be using 1 hop" then the advice has real harmful effects.


Actually, the official term is Onion Services (https://2019.www.torproject.org/docs/onion-services.html.en check out the first paragraph)


It seems that they changed the name from hidden service to onion service.


That is correct. The Tor Project has really struggled with negative connotations in the media around the so-called "dark web", so I believe the terminology does play a part.


wow news to me, wonder how long that will take to catch on

onion like "layered" services


A polite suggestion, but this is not currently possible.

The Tor Project recently added a consensus flag which can globally disable single hop client connections as a DDoS mitigation approach. It is currently enabled. (DoSRefuseSingleHopClientRendezvous)


If I were to judge by the statistics of the heartbeat messages I have access to, the number of people trying to create one-hop circuits is huge.


> First of all, it's hidden services, not dark web

For the uninitiated, can you please explain the differences in what they are and how they're accessed?


Author here.

The differences are explained in the post. The dark web is a vast group of services that cannot be accessed without using special software / a proxy.

Hidden services are services running on the TOR network and are accessed using a browser that uses the TOR proxy.

They are a type of dark web service, but not the entirety of it.


It's now just "Tor" and when accessing hidden services it isn't really a proxy. The Onion Router acronym went away back with the Vidalia proxy. But I do miss the old-school Torbutton. It was fun.


The proxy I refer to is the local SOCKS proxy used by the tor browser / your applications to route traffic through the network.


> The dark web is a vast group of services that cannot be accessed without using special software

This takes me back to the 1990s.... "a group of services that cannot be accessed without using special software" definitely matches NNTP, FTP, SMTP, SSH, HTTP, etc.


hell I still have servers that won't do anything for me without https://


gopher://


Thank you.

> The dark web is a vast group of services that cannot be accessed without using special software / a proxy.

Can you name a few examples?



Must one rely on appeals such as this (i.e. a cultural solution) or does tor have a technological solution to the problem you're describing?


They considered requiring users to run a relay when using Tor as a client, but relays should be stable servers, so the plan was dismissed.


Could you explain why relays should be stable servers?


Simple. Imagine you are using tor and one of the relays has problems (instability, high latency or packet loss). You don't know which of the 3 hops failed, so you have no option but to build a new circuit, and we don't want that. That's why tor needs stable, trustworthy relays.

From tor blog:

A new relay, assuming it is reliable and has plenty of bandwidth, goes through four phases: the unmeasured phase (days 0-3) where it gets roughly no use, the remote-measurement phase (days 3-8) where load starts to increase, the ramp-up guard phase (days 8-68) where load counterintuitively drops and then rises higher, and the steady-state guard phase (days 68+).

https://blog.torproject.org/lifecycle-new-relay


> We have no option but to build a new circuit and we don’t want that.

That's true, you never want to rebuild the circuit. But it strikes me that the idea that this is avoidable falls into at least two of the Eight Fallacies of Distributed Computing[1], namely "The Network Is Reliable" and "Topology Doesn't Change".

If we instead assume that the network isn't reliable, and topology does change, then instead of eliminating unreliable nodes and being conservative with changes to the topology, we would focus on reducing the costs of rebuilding a circuit so that network unreliability and topology changes aren't disastrous.

But it sounds like the Tor team has instead decided to bolster these assumptions, to make them less of assumptions; trying to make the network as reliable as possible and trying to make the topology change as little as possible.

I don't mean this to be a harsh criticism of the Tor team. I'm an outsider, and beyond an uncompromising privacy constraint, I don't know all the constraints Tor was built under. I'm sure the tradeoffs made by the Tor team make sense within the context of their constraints. Obviously, the Tor network works well enough to have a large user base, so they have provided a good-enough solution.

But I wonder if changes could be made to Tor's design in the future which would allow quicker adding and removing nodes, and handle network reliability issues better, so that Tor would be faster.

One possibility which stands out to me is to pool circuits and load-balance between them, so that if a circuit begins to have issues, you still are connected along other circuits while you build a new circuit to replace the unreliable one. This possibly would run into issues where an adversary could correlate traffic from different circuits to unmask clients, so you'd have to be careful, but I'm not sure these problems would be insurmountable.

[1] https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu...


Yeah, your suggestion sounds good.

But remember: What tor is doing is hard. They are doing complex crypto, networking, security.... The hard stuff. The real stuff. The Tor Project is a nonprofit organization with limited capabilities. They are doing their best. It took 3 years to design and implement DoS mitigation techniques, for example.

Your proposed plan could take over 10 years, even for a well funded corporation. It might take time and fail. It might create huge vulnerability due to code complexity. Afaik, tor can’t risk that.


Agreed, as I said, "I'm sure the tradeoffs made by the Tor team make sense within the context of their constraints."


Well if we made a BLOCKCHAIN where you had to give a certain amount of bandwidth for torcoins then cash them in for bandwidth you could use...

I mean, it's a dumb idea, but until you force people to contribute, most won't. I've looked at doing it, but didn't want to deal with legal headaches b/c creepy pedos use tor, too.


I am no TOR expert, but how does decreasing the amount of hops or running your own relay decrease the privacy of other people?


Running large requests over many hops takes up limited Tor resources so it makes using Tor slower/harder for other people. Decreasing the amount of hops means that you use less resources, running your own relay means that you provide resources to others.


I don't think the other reply you got answered your question.

The thought is that other people are hiding in your noise. Make less noise, other people stand out more.


i just like how 'dark web' turned into 'Tor' at some point :'). there's tons of others... :s guess ppl forgot


A short list of those tons would be wonderfully helpful to share. :)


From the top of my head:

I2P, Freenet, Tor, ZeroNet


I think by a technical definition, every company intranet would count (minus some poor security configurations). But that doesn't really seem to be a fair companion to the more commonly discussed dark web.


> other’s desperate need for anonymity

Can somebody list some positive, legitimate, not illegal uses to desperately be anonymous?


The "not illegal" part is the catch here. Something can be illegal but still legitimate if the laws are illegitimate. Someone trying to exercise freedom of speech or the press under an oppressive regime would need anonymity to avoid being jailed or killed.


I think this is an important cultural reframing that needs to occur, sooner rather than later.

Most people, when they hear of things like "the dark web" and cryptocurrency think about the massively publicized instances of drug trafficking and ordering a hit on someone.

It's going to take a lot of work to reframe the utility and purpose of them to a more universal, humanitarian angle.

People in this world live in oppressive circumstances. This should be viewed as a step toward helping them not be systemically silenced.


Under the idea of legitimate and illegitimate use, why wouldn't drug trafficking be legitimate? It gets drugs off the streets, decreases violence compared to street-level drug dealing, increases safety (while the reputation of online sellers isn't a great metric, we are talking relative to the person on the street corner), and generally involves only adults.

If one is willing to argue that the US government throwing someone in a cage because they grew or bought the wrong plant is legitimate, then I don't see how they have any standing to complain about China doing something to someone who held up the wrong sign at a protest.


I suspect that since illicit drug trafficking has a strong social stigma, it may not be the best thing to lead with. It can, however, be discussed with nuance in a way that could change minds. I definitely think there's a lot of legitimacy to what you're saying.

Reminds me a bit of what you see with how some societies approach drug addiction. Providing a safe space with clean needles vs. throwing them in prison. There's a lot to think about.

And I think we've seen some of that with the marijuana legalization across the US. The state adoption had strong initial resistance, but public opinion began to shift once it got out of the shroud of stigma and moral enforcement.


Aside from situations like China where state censorship of history and news are an actual thing, or in situations where whistleblowers need to protect their identity... Some people just want privacy.

Give me a solid reason for why you want corporations and governments to have access to detailed records of everything you do online.

There's value in that data to certain groups of people and we may not like what the future looks like once that value is tapped to its potential.


Anonymity allows you to sow an action without reaping the societal karma of the action.

In a good, free society, maybe anonymity isn’t important.

But in a bad society, one in which collaboration on a cause is punished, but each individual desperately wants to collaborate and change something fundamental...

Anonymity allows the planning of synchronized action.

——

Mass or targeted misinformation also threatens the planning of synchronized action.


> In a good, free society, maybe anonymity isn’t important

There is something of a Catch 22 here I think. A society in which it's difficult or costly to do something anonymously is essentially a society with total surveillance.

And it seems intuitive that a surveillance state is not good or free.

So there is an argument you can make that good and free societies should allow anonymity even if they are the sort of society where it is least needed.


Journalism, whistleblowing, accessing censored information, preventing stalkers from tracking you, etc.


How exactly can a stalker track me online if I simply stop logging in to services? Honest question, because I don't know why Tor would be any better than simply browsing in incognito mode


IP address is the big one, but there are other things that let you narrow down users on the same IP or a user switching between IPs, like tracking cookies, identification of which subset of hardware your GPU falls into based on how it renders some WebGL stuff (which can sometimes allow identification of a specific model of phone, especially when combined with other fingerprinting methods), specifics of screen size, what plugins/extensions you have installed at specific version numbers, etc. Tor only directly addresses the IP point, but the Tor browser should be disabling that other leaky browser stuff as well. I think they were accidentally leaking IPs through WebRTC a while back, or something like that, and I'm sure there will be more issues going forward.


You are still sending packets over your router, to your ISP, out onto the internet, and to the destination server. You leave fingerprints everywhere (browser, OS, resolution, fonts, enabled features, cookies, etc.): forever cookies, DNS cookies. The list goes on.

You are being tracked at the very least as an abstract person. If any of the above fingerprints are linked to a real identity (logging in just once even, or posting your email on a forum) then you are now being tracked even logged out.

If you use Tor and log into services, it has no benefit. Tor, the browser and other distributions, will still leave fingerprints, but they will no longer be unique and match only you; they will match everyone using Tor.

Tor, the protocol, will hide that you are the one receiving or sending packets.

"why Tor would be any better than simply browsing in incognito mode"

Incognito mode does nothing to hide packets or source/destination you are communicating to. Your ISP could literally pull up all non-https sites you visited along with their content, assignable to you, airstrike, as a person. Tor would block this.


Would you mind posting your full name and personal details here, please, right now? If not, then it seems you are also in (desperate) need of anonymity right now...


Circumventing censored websites is a legitimate use. VPNs are more popular but the thing is that the VPN obviously knows who you are and what you're browsing. Tor makes you anonymous to the middleman as well. Tor wouldn't know if you're watching porn but a VPN would. By the way, that's not a recommended use of tor since you consume a decent amount of bandwidth (:P)


The whole idea is to enable it for positive legitimate uses that are illegal. If it isn't illegal, then you really don't need that level of anonymity and there are simpler technical solutions that don't give up as much quality.


Whistleblowing for one, though legality is questionable I suppose. To me this falls under the more general premise of fighting back against a (perceived or otherwise) tyrannical govt or organization, a very grey legal area.


Disclaimer: I have rather small experience with Golang and just skimmed the crawler code.

From what I could see, the author made an effort to make the crawler distributed with k8s (which I don't think is needed, considering there are only approximately 75,000 onion addresses) using modern buzzword technology, but the crawler itself is rather simplistic. It doesn't even seem to index/crawl relative URLs, just absolute ones.
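A minimal sketch (not the project's actual code) of how relative links could be made absolute before queueing, using only Go's standard library; the onion address is a made-up placeholder:

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// The page that was just crawled (placeholder address).
	base, err := url.Parse("http://exampleonionplaceholder.onion/docs/index.html")
	if err != nil {
		panic(err)
	}

	// Links extracted from the page: relative and absolute alike.
	links := []string{"../about.html", "/contact", "http://other.onion/"}

	for _, l := range links {
		ref, err := url.Parse(l)
		if err != nil {
			continue // skip malformed hrefs
		}
		// ResolveReference applies RFC 3986 resolution against the base URL,
		// so both relative and absolute links come out as crawlable URLs.
		fmt.Println(base.ResolveReference(ref).String())
	}
}
```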


Assume 100 pages on each onion address (it’s probably power-law but let’s just assume that’s the mean). Latency with Tor is super high. Assume average of 5s to load a single page. This is generous because tail latency will probably dominate mean latency in this setting.

These things can happen in parallel but let’s also assume no more than 32 simultaneous TCP connections per host through a Tor proxy.

So we're looking at ~75k × 100 × 5 / 32 seconds ≈ 14 days to run through all of them. You may not need to distribute this, but there are situations (e.g. I want a fresh index daily) where it is warranted.
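Spelled out with the same assumed numbers (a trivial Go sketch, purely to make the back-of-the-envelope estimate reproducible):

```go
package main

import "fmt"

func main() {
	const (
		onions      = 75_000 // observed onion addresses (per the metrics figure above)
		pagesPer    = 100    // assumed mean pages per service
		secondsPer  = 5.0    // assumed average page load over Tor
		connections = 32     // assumed simultaneous connections through the proxy
	)
	totalSeconds := onions * pagesPer * secondsPer / connections
	fmt.Printf("~%.1f days\n", totalSeconds/86400) // ~13.6 days, i.e. roughly two weeks
}
```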


Author here. I'm fairly new to Golang too and it's my first project.

Regarding the number of onion addresses available you are wrong. Addresses are encoded in Base32 which means there are 32 characters available. So there are 32^16=1.208925819614629174706176×10^24 addresses available.

Not taken but available.

I agree with the fact that the crawler is really simplistic. But the project is new (2 months I think) and has to evolve. You can make a PR if you want to help me improve it!


"Addresses are encoded in Base32 which means there are 32 characters available. So there are 32^16=1.208925819614629174706176×10^24 addresses available."

As a defense against the parent comment, though, this proves way too much. It doesn't matter how much k8s you throw at that, you're never going to so much as find your first site, if you're looking at the problem that way.

That's not really a relevant number here.


Offtopic nitpick:

>Addresses are encoded in Base32 which means there are 32 characters available. So there are 32^16=1.208925819614629174706176×10^24 addresses available.

I sorta understand what you mean, technically it's 32 characters per position (5 bits), and 16 positions. In v2 .onion addresses, that is.

v3 ones [1] are 56 positions, but not all the bits are used for addressing, so the same formula wouldn't quite work to calculate real theoretical capacity. IIRC someone already made a site which generates unlimited links to v3 addresses (without having them lead anywhere, of course).

[1] https://trac.torproject.org/projects/tor/wiki/doc/NextGenOni...


> IIRC someone already made site which generates unlimited links to v3 addresses (without having them lead to anywhere, of course)

V3 addresses are just ed25519 pub keys and a couple byte changes. You can use Go libraries like Bine [0] to generate as many V3 (or V2) addresses as you want from keys.

0 - https://godoc.org/github.com/cretz/bine/torutil#OnionService...
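For illustration, here's a standalone sketch of that derivation straight from the rend-spec-v3 construction (not using Bine; it assumes golang.org/x/crypto/sha3 is available):

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/base32"
	"fmt"
	"strings"

	"golang.org/x/crypto/sha3"
)

// onionAddressV3 derives a v3 onion address from an ed25519 public key:
//   CHECKSUM      = SHA3-256(".onion checksum" || PUBKEY || VERSION)[:2]
//   onion_address = base32(PUBKEY || CHECKSUM || VERSION) + ".onion"
func onionAddressV3(pub ed25519.PublicKey) string {
	h := sha3.Sum256(append(append([]byte(".onion checksum"), pub...), 0x03))
	raw := append(append([]byte(pub), h[0], h[1]), 0x03)
	return strings.ToLower(base32.StdEncoding.EncodeToString(raw)) + ".onion"
}

func main() {
	pub, _, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	// Prints a syntactically valid 56-character address that points nowhere.
	fmt.Println(onionAddressV3(pub))
}
```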


I think the 75,000 figure is coming from these stats [1].

[1] https://metrics.torproject.org/hidserv-dir-onions-seen.html


I assure you that there are fewer than 10k unique onion addresses. It's huge overkill to have a distributed system crawl something this small.

edit: onion services, not addresses


I'd be concerned that the DB is going to contain some pretty nasty stuff that might be hard to explain in front of a judge.


A crawler of the surface web will have this problem too.


If you avoid storing images, are there any other items you could be liable for?


The URLs to those items perhaps? I do recall relatively old-ish debate about the legality of Google et al having "links" to illegal content, despite not hosting/storing them.


You are right. That's why it's an educational project and not a public search engine


By the way, Ahmia publishes a blacklist with the hashed addresses for every onion URL that they have discovered to host abusive content. You could use that blacklist to filter out those sites so you don't even crawl them and also to periodically purge any matching URL that may have already made it into your index.
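A minimal sketch of that filtering idea. The hash function (MD5 here) and input format are assumptions for illustration only; check the format of Ahmia's published blacklist before relying on this:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// In practice this set would be loaded from Ahmia's published blacklist,
// one hash per line. The single entry here is a dummy value.
var blacklist = map[string]bool{
	"00000000000000000000000000000000": true, // placeholder hash
}

// isBlacklisted hashes the onion domain and checks it against the list,
// assuming the list contains lowercase hex MD5 digests of bare addresses.
func isBlacklisted(onionDomain string) bool {
	sum := md5.Sum([]byte(onionDomain))
	return blacklist[hex.EncodeToString(sum[:])]
}

func main() {
	for _, c := range []string{"exampleonionplaceholder.onion"} {
		if isBlacklisted(c) {
			continue // never crawl, and purge from the index if already present
		}
		fmt.Println("ok to crawl:", c)
	}
}
```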


IANAL but "educational project" won't fly in court, and nor should it.


Programs like https://www.hacksplaining.com/ exist purely as educational programs that teach you to exploit known flaws in web security and have no issue with the law.


Right, but possession of those items does not constitute a violation of the law. Whereas the possession of child exploitation material does. No matter the reasoning.

I would tread lightly crawling the dark web. There are cases where the FBI has admitted to running services on TOR, to collect IP addresses:

https://www.wired.com/2013/09/freedom-hosting-fbi/


> Right, but possession of those items does not constitute a violation of the law. Whereas the possession of child exploitation material does. No matter the reasoning.

What about when the FBI/CIA does it? Genuine question.


No one watches the watchmen.


There is a legal exception for legitimate law enforcement activities.


But it is totally possible to host your own server with flawed security that you are able to legally hack. It might be rare for that to be the actual use case, but it is totally a possibility. With the illegal material being discussed here, there is no such equivalent no matter how out of the box a justification one is willing to aim for.


I don't host a Trandoshan instance, nor do I give access to a database of results. I only provide access to the source code.

Why should I face legal problems?


Probably okay to have the source code to the engine.

However, if you have used the system, which creates a database of questionable dark web links on your machine, that could be tricky to explain... and easy to implicate you.


Because some eager police detective or DA might read your article, raid you and find your personal instance/DB full of nasty stuff. Some of the nasty stuff will not only be illegal to distribute, but actually illegal to possess at all. Child abuse stuff for example.

I am guessing you have some personal instance you use at least for testing/"education", right?


Please stop the FUD, or point to an example of a software dev getting contacted about their software being used by a third party to exploit children


As others have pointed out, that's not what he said, but since you asked:

https://www.npr.org/sections/alltechconsidered/2016/04/04/47...


There is a bunch of stories like that.

E.g. https://www.ccc.de/en/updates/2018/hausdurchsuchungen-bei-ve...

> On June 20th, board members of the „Zwiebelfreunde“ association in multiple German cities had their homes searched under the dubious pretence that they were „witnesses“ while their computers and storage media were confiscated.


That's something very different though. Exit nodes are providing a service and are, for all intents and purposes, the only visible client on the clearnet (and might not even be involved: there's nothing stopping you from running a private proxy on the same machine you run your exit node on). TOR-developers that do not run exit nodes but contribute to TOR typically don't get searched, at least to my knowledge.

Content that's illegal to possess is a different issue, though I'm sure they'd make for an interesting case because a crawler downloading, saving and parsing an HTML page isn't as clear cut as a human evaluating and deciding what to download and store. "The suspect has the hard- and software necessary to download this content" shouldn't be enough to convince a judge to issue a search warrant, but then again, judges probably have very little technical knowledge.


The Zwiebelfreunde raids were not because of their TOR (hi dewey) activities, but rather because they collected donations for the Riseup email service.

If the police can convince a judge to raid the board members of a registered club, and their families, just because they, among many other things, collected some donations for a US org, then some overzealous police detective or DA going after some dev who made a web crawler for the "dark web" and is probably in possession (knowingly or not) of illegal content isn't much of a stretch either.


They can, but I'm not so sure that "may or may not possess illegal content" is enough for a search warrant - it's true for most of the population after all (running an exit node or collecting funds on behalf of a third party on the other hand is true only for a tiny fraction of the population). Granted, the chances are somewhat higher for IT people and higher still for people that write crawlers, but "we think he might, it's not impossible that he doesn't" is a bit thin, and unless they're trying to go after you for unrelated reasons, DAs don't love to have their asses handed to them by judges.


In many jurisdictions around the world it is enough. In Germany you need a "begründeter Anfangsverdacht" (reasonable initial suspicion) and what's reasonable is essentially up to the judge signing the warrant.

Hell, they used to raid people accused by third parties of copyright infringement (for private personal use), about a decade back or one and a half, but thankfully that stopped now. They would come early in the morning, present you with a warrant that said "based on evidence provided by <third party>..." (i.e. somebody somehow collected an IP address you might have used off of some file sharing swarm), take all your shit and scare your neighbors and quite often your parents because they raided a lot of minors too.

I know two people who this happened to personally. One guy wanted his stuff back, to which the DA replied that if they got to keep his stuff they would drop the case (I kid you not), and the other guy had his stuff returned about 2 years later, except for his HDDs. And his stuff not only included computers, CDs, DVDs, a printer, but they had actually seized books... paper ones... wat. Neither was convicted of any crimes in the end (IIRC they both had the thing dropped because "minor offense" not worth pursuing).

Turns out that a little googling of the German internet around that time turned up a lot of similar cases and some people claiming the police and DAs did that to get new computers "cheap"...


What are you talking about?

The OP wrote a crawler and used it to crawl Tor. Depending on where they live, accessing the content might be illegal, and storing some of the content on their computer might be illegal as well.

Law enforcement might be monitoring some domains, or have set up some honeypots that the OP might crawl automatically.

You don't want to end up in court having to argue about why your computer accessed some child pornography and downloaded it, and trying to explain to a jury that you did not do those things, but that the crawler you programmed did.

Sure, nobody might end up raiding the OPs home, and even if they do, the OP might be able to successfully survive a jury. But just having to go through that might suck.

If the OP only wrote the software and never used it, then they are fine. But from the article, they did use it, so who knows where the crawler landed. Chances are nowhere good.


I didn't talk about the dev getting contacted about third parties abusing the software, but about the dev keeping a DB of indexed content for development/testing/"education" that would most likely include illegal-to-possess content.

And that some eager police people like to "inconvenience" people connected to TOR somehow isn't exactly new, either. E.g. there have been multiple raids against TOR exit node operators in different countries around the world in the past, even when the police was fully aware it was a TOR exit node that did not store information.

Maybe I'm just too paranoid - then again I used to run a TOR exit node myself and had a bunch of less than pleasant run-ins with the police, tho thankfully no raids.



Unless you’re Pete Townshend.


To anyone experimenting with such stuff, take care and don't make your services publicly available. The dark web especially is full of highly illegal content such as child pornography, and in some jurisdictions even "involuntary possession" (such as in browser caches) may be enough to convict you.


Do you think I should add a license on GitHub to mention that? To protect me and the users who will use the crawler?


yes.


I’ve been pretty surprised at how big hidden services have become

Dread, the dark net reddit, is surprisingly vibrant

I think it's weird that people almost don't want to hear positive stories about the dark net.

It'll be funny when news articles and romcoms just start "forgetting" to qualify their plot piece with the "it's scary" trope


I thought dread was dead?


It's not, hit up dark.fail for the onion link to dark.fail and browse the latest onion links


So is bitcoin, they tell me...


Crawlers are fun!

If you're new to the field and want something that's easy to set up & polite, I strongly recommend Apache Storm Crawler (https://github.com/DigitalPebble/storm-crawler).


A well-written article with a lot of technical details. Well done.

However, I'm wondering what would be a good practical purpose of crawling dark web.


Thank you!

There's no practical purpose for the crawler. It's more an educational project than anything.


Weird, for some reason your comments are being instantly marked as "dead." I think there's some kind of filter that's tripping out for your account since it's new. I vouched for your two comments so hopefully everyone can see them now, but an admin (i.e. dang?) will need to look into this for a longer term solution.


Thank you sir. Actually my other comments are invisible too. That's weird.


I did the same in Racket when I made a Tor search engine. Here's the source code of the crawler!

https://github.com/torgle/torgle/blob/master/backend/torgle....


Any HTTP-aware software that supports SOCKS proxies can access information on hidden services, so any crawler can do it. I fail to see what is novel about that, except that it uses k8s and mongo and a catchy blog title.
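To illustrate the point, a minimal Go sketch that fetches a page through Tor's default local SOCKS port using golang.org/x/net/proxy (the onion address is a placeholder, and this assumes a Tor daemon is listening on 127.0.0.1:9050):

```go
package main

import (
	"fmt"
	"io"
	"net/http"

	"golang.org/x/net/proxy"
)

func main() {
	// Dial through the local Tor SOCKS5 proxy; hostname resolution (including
	// .onion names) happens on the Tor side, not locally.
	dialer, err := proxy.SOCKS5("tcp", "127.0.0.1:9050", nil, proxy.Direct)
	if err != nil {
		panic(err)
	}

	client := &http.Client{Transport: &http.Transport{Dial: dialer.Dial}}

	resp, err := client.Get("http://exampleonionplaceholder.onion/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.Status, len(body), "bytes")
}
```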


So how well would this thing work? What I am asking is what percentage of all the tor hidden service sites out there would get detected by it?



Sounds like a recipe to score yourself a free FBI visit


Generally the FBI doesn't give a hoot until you start distributing illegal stuff....


What does suck is being put on IP blacklists by various providers for merely running a Tor relay, not an exit node. There are several websites I can only access through a VPN because my IP is associated with running a relay.


Go is a horrible language in which to write a crawler. The main problem is that NLP and machine learning code simply isn't as prevalent and robust as it is in Java and Python.


Go is great for a crawler. What does NLP and ML have to do with crawling?



