A couple of months ago I processed all the metadata from the Common Crawl project to extract all indexed domain names. That was about 10 TB of metadata and resulted in 26 million domain names. EC2 costs to process it were only about $10. If anyone is interested, let me know.
This was actually a fun Saturday afternoon project. I spawned a single c4.8xlarge (10 Gbit) instance in US-EAST-1 (next to where the Common Crawl Public Data Set lives in S3) and downloaded 10 TB spread over 33k files in +/- 30 simultaneous 'curl | gunzip | grep whatever-the-common-crawl-url-prefix-was' pipelines, getting a solid 5 Gbit/s transfer speed.
The bottleneck was userland CPU, so probably the gzip processes. It took about 5 hours. Cutting off the path and piping through sort and uniq took another hour or so.
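Roughly, the pipeline looked something like the sketch below; the manifest file, S3 prefix, and grep/sed patterns are placeholders, not the exact ones I used.

    # paths.txt stands in for the crawl manifest listing the ~33k metadata files.
    # Fetch and scan ~30 files at a time; each job is curl | gunzip | grep.
    xargs -P 30 -I{} sh -c \
      'curl -s "https://commoncrawl.s3.amazonaws.com/{}" | gzip -dc | grep "^WARC-Target-URI:"' \
      < paths.txt > urls.txt

    # Cut off the scheme and path, then dedupe to get the domain list.
    sed -E 's|^WARC-Target-URI: *https?://([^/[:space:]]+).*|\1|' urls.txt | sort -u > domains.txt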
The only cost was the price/hour of running the EC2 instance. Network costs were zero, since you're only transferring data into your instance and inbound transfer is free.
In my experience, and I suppose it depends on the data, grep is often the bottleneck in data pipeline tasks like the one you describe. The silver searcher (https://github.com/ggreer/the_silver_searcher) is, in my experience, about 10x faster than grep for tasks like pulling fields out of JSON files. It's changed my life.
pv (pipe viewer, http://www.ivarch.com/programs/pv.shtml) and top are pretty handy for measuring this kind of thing. You should be able to see exactly which process is using how much CPU, and what your throughput is.
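For example (the URL and pattern below are just placeholders), pv dropped into the middle of a pipeline reports throughput at each stage, while top shows which process is burning the CPU:

    URL="https://example.com/some-large-file.gz"   # placeholder
    PATTERN="example"                              # placeholder
    # -c keeps the two pv meters from garbling each other's output, -N labels them.
    curl -s "$URL" | pv -cN download | gzip -dc | pv -cN uncompressed | grep "$PATTERN" > out.txt

top (press 1 to see per-core load) then tells you which of the processes is pegging a core.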
Synchronicity :-) Just this week I scraped and analyzed only the homepages from the Alexa top 1 million [1] and the Majestic Million [2]. I used rqworker [3] and a fleet of mini-servers from Scaleway [4] to do the scraping. Some results:
- 1.8MM successful scrapes
- 25GB total size (stored in postgres)
- 1 server to host redis and postgres
- 9 physical servers for workers (4-core ARM servers) = 36 total cores
- Peak rate of ~ 100 reqs/second achieved across all workers (36 physical cores total)
- I saw that I could oversubscribe workers to cores by a factor of 2x (72 total workers) to achieve ~75% utilization on each server
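For a rough idea, launching an oversubscribed set of workers on one box could look something like this; the queue name, Redis address, and the use of rq's CLI here are assumptions for illustration, not my exact setup:

    # 4 cores per server, 2x oversubscription = 8 workers per box.
    for i in $(seq 1 8); do
      rq worker --url redis://10.0.0.1:6379/0 scrape &
    done
    wait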
All in all it was a fun project that provided quite a bit of learning. There are a lot of levers to pull here to make things faster, more robust, and more portable. Next steps are to optimize the database and wrap a simple Django app around it for exploring the data.
Or maybe push it further and try my hand at these 26MM domains?
My intention is to get a complete list of .no (Norwegian) domains, and http://norid.no refuses to give that list out to anyone.
I would like to be able to continuously check the Norwegian IP space for compromised sites, because it would be interesting to see the results. Of course, doing this on a bigger scale would be cool as well.
A warning about parsing zone files... the grammar is deceptively tricky.
While TLD registries will probably provide you with files in a sane subset[0] of that specified in RFC 1035, there are a number of things that will NOT work in general:
- Splitting the file into lines (paren blocks and quoted strings can span lines, strings can contain ';', etc.)
- Splitting the file on whitespace (it's significant in column 1 and inside strings)
- Applying a regex (you'll need lookahead for conditional matching and it'll get ugly fast)
Don't go down the road of assuming it's a simple delimited file.
I use the BIND tool named-compilezone to canonicalize zone files, which lets me apply simple regex parsing, because I can then assume one record per line, all fields present, and no abbreviated names. The main disadvantage is that it is not very fast.
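A minimal sketch of what I mean; the zone name, file names, and records below are made up for illustration:

    # Build a tiny zone exercising constructs that break naive line/whitespace
    # splitting: a paren block spanning lines, a ';' comment, and a quoted
    # string containing a semicolon and spaces.
    printf '%s\n' \
      '$TTL 3600' \
      '@   IN SOA ns1.example.no. hostmaster.example.no. (' \
      '        2023010101 ; serial' \
      '        7200 3600 1209600 3600 )' \
      '    IN NS  ns1.example.no.' \
      'ns1 IN A   192.0.2.1' \
      'www IN TXT "contains ; a semicolon and spaces"' \
      > raw.zone

    # Canonicalize: -f/-F are the input/output formats, -o is the output file,
    # followed by the zone name and the input file.
    named-compilezone -f text -F text -o canonical.zone example.no raw.zone

    # The output is one fully-qualified record per line with explicit TTL and
    # class, so simple field-based tools become safe:
    awk '$4 == "NS" { print $5 }' canonical.zone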
I had been downloading the zone file for .PK domains on a daily basis until they blocked the zone transfers. Based on a comparison of these daily zone files, I managed to publish statistics [1] and also broke the news about hacked .PK domains [2], which was picked up by all the leading tech blogs and news agencies.
Currently, I cannot find a way to get the zone file, even by officially requesting it from the registry manager.
What if someone were to maintain an unofficial list with one domain per line, freely available as a daily torrent or served directly? Would there be a rights problem with mirroring and filtering ICANN data?
A lot of TLDs don't provide zone files unless you are a registrar. They would probably not be happy if someone put those out to the public. For .com and the like they would probably not care as much.
The best public list of domains I have found is the Project Sonar DNS (ANY) scans. I don't know how they do it, but their scans are pretty complete, at least for .dk domains, which are the ones I use.
Sadly, their download speeds are often bad; they really should provide a torrent. But it's free and they have some really interesting datasets, so it's worth the wait.
Unfortunately, as part of the application you are compelled to sign forms promising that you won't make "significant" parts of the zone file publicly available in any way (at least this was my experience when applying to Verisign for .com and .net zone file access).
Does GitHub need any publicity? Honest question, as I've found that anyone who would ever use the functionality GitHub provides is already very aware of git and GitHub.
People shift between services like GitHub, Bitbucket, and the alternatives all the time. Perhaps not often on an individual basis, but at any one time many people are deciding where to put their stuff.
Almost anything that gets the name of a particular service bumped up to the top of someone's consciousness for a little while will shift some of those decisions toward that service.
This is why even the world's most popular brands (Apple, Coke, etc) never stop spending money on marketing / PR :)
It could be used to create a competing system. ICANN would never allow this. If someone tried to put this together, I think they would quickly find their access to the data revoked.
FWIW, a TLD zone file does not contain every registered domain name, just those with DNS records. There are typically a good number of domain names that are registered but have no records, for reasons such as reserved names, malicious-content takedowns, etc.
Exactly. The title "How to Download a List of All Registered Domain Names" is not correct.
For example, in .com/.net, if a registrar puts a domain "on hold", that pulls it from the zone file, so it will not appear in any zone file download. This is in fact a way for a company to keep a domain under wraps. As such:
- register the domain name (in .com)
- pull the name servers
The name won't be in any zone file, and it will be off the radar until it has nameservers or until someone checks whether it's been registered.
(Note: other registries may work differently, or the same; the above is specific to .com and .net.)
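A quick way to see this in practice (the domain below is a placeholder for whatever you want to check):

    # No NS answer: the name is not delegated, so it won't appear in the zone file...
    dig +short NS some-held-domain.com

    # ...but whois may still show it as registered.
    whois some-held-domain.com | grep -iE 'domain name|status'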
We're in the process of getting all the zone files we can, to reduce the number of DNS requests we have to do, but the real kicker is the whois databases; for example, AFNIC asks for €10K for access to a copy of its database...
I've wondered about this previously, as I run my own blacklists for $work's mail servers, thinking about how I could slightly "penalize" brand-new domain names, correlate "spammy" domains with certain nameservers, and so on.
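Something like this rough sketch is what I had in mind; whois output formats differ per TLD, so the "Creation Date:" match is an assumption that fits .com/.net-style output, the 30-day threshold is arbitrary, and 'date -d' assumes GNU date:

    domain="example.com"   # placeholder
    created=$(whois "$domain" | grep -m1 -i 'Creation Date:' | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}')
    if [ -n "$created" ]; then
      age_days=$(( ($(date +%s) - $(date -d "$created" +%s)) / 86400 ))
      # Hypothetical threshold: treat anything under 30 days old as suspicious.
      [ "$age_days" -lt 30 ] && echo "$domain is only $age_days days old: add spam score"
    fi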
No. The list of available domains is the list of all possible domains less the list of registered domains, and the list of registered domains is vastly smaller than the list of possible domains: labels of up to just ten characters from [a-z0-9-] already give on the order of 10^15 possible .com names, versus roughly 10^8 registered ones. The list of available domains would mostly consist of junk nobody would be interested in.
Yeah. My point was that he wrote an article titled "How to Download a List of All Registered Domain Names" and then didn't even mention the existence of ccTLDs. Which is like writing an article titled "How to learn to speak every language" and then pretending there are only 10 languages in existence.
There are very good reasons for this data being closed, not least of which is that allowing zone transfers by arbitrary individuals is an excellent way of letting your DNS server be DoS'd.
I'm not sure what you mean. Do you know what a zone transfer is? If you wanted to get a list of the domains and records published in a nameserver, you would perform a zone transfer. Because that can amount to quite a bit of data being transferred, a nameserver that allows unrestricted zone transfers is exposing a vector for a denial-of-service attack against itself.
If you're a domain registry, your zone files are huge. Allowing arbitrary zone transfers could put massive sustained strain on your DNS infrastructure, and because only a very small number of nameservers actually need to perform zone transfers against yours, you're better off locking the ability down.
If you're running your own nameservers, then it's still worth locking down zone transfers for similar reasons. At the very least, it gives you a degree of defence in depth as you're giving attackers less of an opportunity to gather information on the structure of your network. If they could simply do a zone transfer to find out all the names in a given zone, then they don't have to do more costly brute force enumeration to guess at the hosts in the zone.
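For anyone curious, testing whether a given nameserver allows transfers is a one-liner (the names below are placeholders):

    # An open server streams back every record in the zone;
    # a locked-down one answers with "Transfer failed."
    dig AXFR example.com @ns1.example.com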
edit: available as a torrent here: https://all-certificates.s3.amazonaws.com/domainnames.gz?tor...