Ever wondered how many open FTP servers there are? (github.com/massivedynamic)
131 points by Lolapo on Sept 18, 2016 | 92 comments



I used to leave a file called README on my public ftp directory, which contained only:

cat: README: No such file or directory

I'd occasionally get email from frustrated people who had trouble trying to read the README file, so I'd tell them to simply run "emacs README", and emacs would solve all of their problems. I don't know if my passive aggressive emacs evangelism ever worked, because I never heard back from them.


Reminds me of the Abbott and Costello skit "Who's on first" [1]

[1] https://youtu.be/kTcRRaXV-fg?t=63


Hilarious. Reminds me of many tricks and fun from the good old days.


I got my first internet account on my high school's VAX in 1992 in order to download Dr. Dobb's code from ftp.mv.com. The VAX had a help file installed which was a list of thousands of anonymous FTP servers, most of which requested that you please not use anonymous FTP during business hours to avoid overloading them; shortly I was downloading all kinds of things from wuarchive.wustl.edu (which had a LOT of stuff) and WSMR-SIMTEL20.army.mil, which worried my dad. I bound some function keys on my terminal to different internet commands (FTP, TELNET) and internet sites (like those I mentioned) so I could FTP to wuarchive or TELNET to HPCwire with two keystrokes.

At some point I got hold of Scott Yanoff's list of interesting Internet services (capitalizing Internet was still justifiable at the time) and learned about the Weather Underground, Archie, HPCWire, and this new thing called the "World Wide Web" — I started telnetting to a server at the University of Kansas where I could use Lynx, and it seemed pretty clear that this was going to be a big deal, because of how enormously much easier it was than downloading text files over FTP.

So not only have I wondered how many open FTP servers there are, my exploration of the internet pretty much started with a list of them.

Nowadays I occasionally look for FTP servers because they tend to be less of a pain in the ass for downloading stuff than HTTP servers — you can usually get a full list of what they have, and they never interrupt you with CAPTCHAs. It's kind of like a real-world "shibboleet" — I guess sometimes assholes push mandates for CAPTCHAs and whatnot on a company's technical folk, but they leave FTP open because the assholes don't know about it.

If you're wondering how many open HTTP servers there are, Netcraft does a pretty good monthly survey.


In my youth, I probably downloaded hundreds of gigabytes from wuarchive.wustl.edu, ftp.cdrom.com, metalab.unc.edu, and the like. Over a dial-up modem, no less.

Before I had local Internet access (i.e., long distance calls), I used the various "FTP by e-mail" services with my free "Juno" e-mail account (they had toll-free numbers for access!).


"hundreds of gigabytes"? Do you mean megabytes? Don't know about you, but when I was downloading stuff from those sites, downloading "hundreds of gigabytes" would have literally have taken a few years at minimum, many decades more realistically.


200 gigabytes at 57600 bits per second is a bit under 46 weeks. They could usually sustain that bandwidth; the bottleneck was usually the modem.
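
Back of the envelope, if you want to check (integer division floors the ~45.9 weeks):

    $ echo $(( 200 * 10**9 * 8 / 57600 / 604800 ))
    45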


When I was using dialup speed in the 90s, I certainly didn't have any gigabyte storage.


One of my favorite ways to gauge "how far we've come" is that I remember connecting to a BBS in Europe in the early 90's that boasted 30 gigabytes of downloadable files. That would have filled up about 20000 floppy disks, and barring any kind of creative phreaking would have cost hundreds of thousands of dollars to download.


Ah, I was at 2400 bps until I had a university ethernet connection.


Anybody remember when modems cost $1 per baud?


"capitalizing Internet was still justifiable at the time"

Given the Internet is a particular thing, shouldn't it still be used as a proper noun?


AP Style Alert. Internet and web are no longer to be capitalized. Check the googles.


I don't work for the AP so I will continue to capitalize it.


They are 100% wrong and should not be encouraged.


Language mostly evolves by usage, not mandate.


That isn't the rule in English; the telephone system is also still a particular thing, as are the ocean, the sky, and the air.

What would you think if someone asked you to call them "on the Telephone"?


I think he meant that the name 'Internet' refers to a unique thing, the working global communication network connecting smaller networks of computers. The word 'internet', if it is to be accepted as a word with stable meaning in written language, should then not refer to the unique global network, but to something general, less unique, like the information service the 'Internet' is known to enable.

If I saw the name 'Telephone system' with a capital T in the middle of a sentence it would be curious, since such a name, in contrast to 'Internet', is extremely uncommon. Nevertheless, if the writer meant a unique system that has the name 'Telephone system', it would be correct to capitalize. However, when somebody asks me to "call me on the telephone", they do not refer to the unique global telephone system; they just refer to a common service, like they would if they said "send it to me by mail" or "you can find it on the internet".


It is. And it's fairly straightforward:

https://en.wikipedia.org/wiki/Capitalization_of_"Internet"


I have a fun story about open FTP servers.

When we were undergrad students, a friend and I wondered exactly that same question about FTP servers. So we wrote a script that tested random IPv4 for FTP servers using nmap and then attempted to connect.
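
In spirit, the scan was something like this (a reconstruction, not the original script; the nmap flags and the curl-based anonymous check are just one plausible way to do it):

    # pick a random unicast IPv4 address
    ip="$((RANDOM % 223 + 1)).$((RANDOM % 256)).$((RANDOM % 256)).$((RANDOM % 256))"
    # is port 21 open?
    if nmap -p 21 --open -oG - "$ip" | grep -q '21/open'; then
        # try an anonymous login and list the root directory (curl defaults to anonymous FTP)
        curl -s -l --connect-timeout 5 "ftp://$ip/" && echo "open FTP: $ip"
    fi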

To our surprise we found quite a few. Most of them were small or not writable, but at least one of them was writable, and it had hundreds of GB of free space. So we connected to it, and then saw that it had huge data files containing what seemed like random junk.

Then, we tried to ssh to the box, but no ssh server was listening. So we tried to telnet in. That worked but we were prompted:

    Enter password: 
Ah, too bad, we will never guess what the password is… Mh, let's try "123456" just in case… But then, instead of logging us in or telling us that we entered the wrong password, it simply said:

    Re-enter password:
Mh? So we re-entered "123456".

    The password has been set.
Haha! The owner must never have set this up. And then we were in. It was a very strange system with a few commands with standard names but non-standard behaviors, and a very minimalistic shell. By poking around and searching the web, we understood that we were actually connected to a surveillance camera and that the big files were probably parts of video.

Poking around more gave us access to other such cameras in the same network and, more importantly, to a web interface from which we could see the videos streaming live. We saw offices and people doing stuff like taking the garbage out (I don't know why but that is one of the more precise images I can recall ^^). The only distinctive thing I remember was a big sign saying "Miami Fitness Club".

After that we never did anything about it as we had other things to do, but I kind of cherish this story as a fond memory of my first year at the ENS.


I remember doing something similar except it was with my cable provider, when they first came into town.

Since all clients would basically be connected to a LAN, as soon as I found I could port scan random users I started doing it.

Of course there were a lot of businesses on the network that apparently used FTP to move files around but were unsecured.

I spent the better part of a couple of weeks just going through the data (never actually downloaded anything since I knew it could land me in trouble with my parents).

Like you said, I also cherish this story since it was a rare peek at other people's lives... without even knowing a 12-year-old had access to all their business files.


@Home was wide open around 1999ish. My friend would send random (if you can call goatse.cx random) print jobs to various open SMB printer shares. Ah, to be young again.


  $ lz5
  -bash: lz5: command not found
  $ brew install lz5
  ==> Auto-updated Homebrew!
  Updated Homebrew from b5a6b4e to 7926114.
  Error: No available formula with the name "lz5" 
  ==> Searching for similarly named formulae...
  Error: No similarly named formulae found.
  ==> Searching taps...
  Error: No formulae found in taps.
So, rather than complain about this, how do I do something to fix it? Can I add lz5 as a tap to Homebrew?

EDIT: Let me try this again, in a more productive way.

lz5 does not currently exist in Homebrew. I'd like to fix this. I've never done this before. Does anyone have advice? Is it as simple as forking lz5, then adding a tap to Homebrew?

Thanks. And as a note, this edit occurred after specialp's replies. They were right: this originally wasn't a productive comment.



Ah, it'd be a cask? Cool, thanks.


https://github.com/Homebrew/homebrew-core/blob/master/.githu... is a better link. It would just be a new formula in homebrew-core.
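
Roughly along these lines (a sketch, not a tested recipe; the tarball URL is a placeholder for an actual lz5 release):

    # generate a formula skeleton from the upstream release tarball (URL is a placeholder)
    brew create https://github.com/inikep/lz5/archive/<version>.tar.gz
    # edit the generated formula, then build and sanity-check it locally
    brew install --build-from-source lz5
    brew audit --strict lz5
    # then open a pull request against Homebrew/homebrew-core with the new formula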


Yes, it doesn't exist as a binary package for OS X, thus it does not exist.


Again: Rather than complain, what can I do to make this a little easier for OS X people?


Yes after you edited your comment


I wonder how many of those are just mirrors of Linux distros and other open-source software, how many have more interesting things (including software), and how many of those were deliberately configured to be open for sharing. There is the somewhat-well-known filesearch.ru if you want to look for things on this non-HTTP part of the Internet. (If I remember correctly, Google used to index FTPs too and you'd get plenty of results with the right queries, but that seems to have mostly and silently disappeared...)


Actually, most of them are. It's pretty easy to eliminate them if you have the hostname entry from the scan and cross-reference it with public distro mirror lists. Also exclude any .edu servers, names of universities, etc. If you want to look for actual files, there's hardly anything better than http://filemare.com. Crawlers like Napalm (http://www.searchftps.net/) focus more on servers that are meant to be public. Using filemare one can find the interesting things ;-)


There's still a surprising number of niche FTP sites around: MSX, demoscene, etc. Mostly older scenes that pre-date the WWW and are still around. I think it might be useful to test every port on every IP to see what happens protocol-wise. Limiting to just common ports probably misses lots of cool things.


How many ports are there?


According to the Reserved IP addresses list there are 588,514,304 reserved addresses, and since there are 4,294,967,296 (2^32) IPv4 addresses in total, there are 3,706,452,992 public addresses. There are 65,536 ports, so one would have to scan all of those ports on all of those addresses.
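
Which works out to an absurd number of probes; quick shell arithmetic:

    $ echo $(( 2**32 - 588514304 ))            # public IPv4 addresses
    3706452992
    $ echo $(( (2**32 - 588514304) * 65536 ))  # address/port combinations
    242906103283712

Even at a million probes per second, that's roughly 7.7 years of scanning.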


> I wonder how many of those are just mirrors of Linux distros and other open-source software

Mine are, at least.


S/he talks about it as if it's something bad, something unfixed. The whole thing sounds like it but in particular "[to be excluded] go fix your shit".

I agree that if you don't want people to access it, you should secure it. Yet not all these servers are accidentally open: my ftp on 80.100.131.150 (I assume it's in there) hosts a copy of Damn Small Linux because all downloads were extremely slow or broken at the time.


Agreed. Having an open FTP server with read-only access is no more problematic than having an open HTTP server.


I remember back in the 90s, people would hunt for FTP servers that allowed anonymous writes. Companies that didn't know how to secure servers would suddenly be hosting a ton-o-warez.


I used to 'accidentally' 'misconfigure' my servers and watch what would show up. When a new group found such a server, it was usually first used for 'internal distribution' of the latest releases, especially if it was an fxp-capable server. After a while, when it got more busy (hitting rate limits, disk full), it was downgraded to 'end user distribution'.

So I got to watch the latest leaked movies without having to directly deal with (spend time on) being 'inner circle' :)


And other things. There was a subculture called the "WA" for a while that took over a handful of open FTP servers and posted nonsensical poetry during the 90s -- kind of a precursor to the FSM and a contemporary to the "Church of the Sub-genius"[1]. I've seen printouts of their "work" proudly posted on numerous college dorms during the mid to late 90s. The WA never became as big as either of the other satire religions, but I definitely remember their presence on a handful of tech-school ftp sites.

1 - https://en.wikipedia.org/wiki/Church_of_the_SubGenius


They could have made the world's first distributed remote file storage, parcelling out customers' files across 20 different servers. They just had to back up each file to multiple FTP sites and add new ones as their actual owners rooted them out. I'm sure availability would be fantastic, and the costs of running not-your-servers are pretty darn low.


Especially hidden down some deep directory structure, some directories even named .. and . to try to hide themselves.


I recall other folder names that would presumably make windows machines throw up when you clicked on them. Stuff with "LPT:" or "COM:" in the names.


Warez ftp sites used to use high ascii in the file names. Used to have that memorized.


> would suddenly be hosting a ton-o-warez.

which could then be found by searching for "index of"...


It's been a while, but I believe that's a response from an Apache server, not FTP.


... if only we had had search engines then. :)


I used to telnet archie.mcgill.ca to search the collected index of archive sites, back in 1991 or so.

https://en.wikipedia.org/wiki/Archie_search_engine


That still happens.


And then you could use FXP to transfer to another "pub". Most open FTPs had their /pub publicly writable, and thus they were called "pubs".


Side note: "xz -9 -e" compresses the file to 3,296,864 bytes, whereas "lz5 -15" only compresses the original file to 4,643,261 bytes. The xz-compressed file is 29% smaller.

Even "gzip -9" compresses the file to 4,035,858.

So I wonder why lz5 was chosen for compression.


There's more low-hanging fruit before picking a different compressor, too. For instance, sorting the data:

    $ zcat openftp4_all_20160918.gz | sort | gzip -9 > sorted.gz

    $ ls -lha *.gz
    -rw-r--r--+ 1 mappu mappu 3.9M Sep 18 18:17 openftp4_all_20160918.gz
    -rw-r--r--+ 1 mappu mappu 2.2M Sep 18 18:19 sorted.gz
`zpaq` should out-compress xz on the unsorted file, too, but I haven't tried it.


How long did that "xz -9 -e" take compared to "lz5 -15" or "gzip -9"?


Numbers on my machine, ordered by compression duration:

    Compressor    Duration    Size
    cat           0.0s        11385584 [*]
    gzip -1       0.2s        4918778 [*]
    bzip2 -1      0.9s        3334653
    bzip2 -9      0.9s        3122347 [*]
    xz -1         1.1s        4222016
    gzip -9       2.4s        4085818
    lz5 -15       6.9s        4643261
    xz -9 -e      10.7s       4033589
    zpaq -m5      34.2s       2655834 [*]
A [*] indicates that no better compressor was faster. Test methodology was `cat | $compressor > output` run 3-4 times to get an average.
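
In script form, that was roughly the following (a reconstruction; list.txt stands in for the decompressed list, and each run would be repeated a few times to average):

    # time each compressor on the same input; output is discarded, we only measure duration
    for c in "gzip -1" "gzip -9" "bzip2 -9" "xz -1" "xz -9 -e"; do
        printf '%-12s' "$c"
        { time cat list.txt | $c > /dev/null; } 2>&1 | grep real
    done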

I'm surprised by how bzip2 turned out.


I recommend using Shodan for this kind of stuff.

https://shodan.io


Scanning the IPv4 space: I know there are many different projects that do it. I was thinking about how I would do this today. I believe the first step would be to enumerate all the IPv4 blocks (/22, etc.), then calculate the address of each host from the prefix. Then, in a pool of threads, try to connect(2) to each address on some type of service with a timeout. If it succeeds, consider that address as up. I would consider doing this in an async loop with epoll(7), so that many connections could be attempted at once to improve throughput.

Anyway, nmap can probably do this and is a great tool.
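
A crude shell version of that connect-with-timeout idea, just to sketch the shape of it (a real scanner would use non-blocking sockets or raw SYNs, as the replies note; targets.txt is a hypothetical list of addresses):

    # probe port 21 on each target with a 3-second timeout, ~200 probes in flight
    # (needs bash 4.3+ for wait -n)
    while read -r ip; do
        ( timeout 3 bash -c "exec 3<>/dev/tcp/$ip/21" 2>/dev/null && echo "$ip" ) &
        while (( $(jobs -rp | wc -l) >= 200 )); do wait -n; done
    done < targets.txt
    wait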


You wouldn't want to use nmap for WAN scanning. nmap is really great if both the scanning and target nodes are somewhat distributed across a LAN. Internet-wide mapping is done using Z{map,grab} (censys.io) or masscan. Reduce using the Censys or Shodan search.


If you want to have your scan done in a reasonable amount of time, you probably want to use raw packets, and not connect.

If you don't mind all the flak you'll get, just send a SYN on the port you care about to each IP (maybe skip RFC 1918, multicast, and reserved addresses; or only send to addresses included in BGP announcements). If you send one packet to each address, including the addresses you should probably skip, that's about 4 billion packets; if you do it at 1M pps (which should fit on a 1Gbps ethernet connection), that's less than two hours.


That's what masscan automates. To fully utilize masscan, one requires a friendly ISP. And even then, they can only be masscan-friendly as long as their peers are. If you annoy the peers enough, they'll just drop you (http://www.sudosecure.com/ecatels-harboring-of-spambots-and-...). Many datacenters classify port-scanning as an offensive action, even with low packet throughput.

Pulling data from research servers (such as Censys), reducing and then scanning is always a good idea.
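
Concretely, that automation looks something like this (a sketch; the rate and the exclude file are whatever you and your ISP can tolerate):

    # SYN-scan the whole IPv4 space for port 21, skipping reserved/do-not-scan
    # ranges listed in exclude.txt, writing results in list format
    masscan -p21 0.0.0.0/0 --rate 100000 --excludefile exclude.txt -oL ftp21.txt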



The submission links to a blog post on how the data was retrieved: http://255.wf/2016-09-18-mass-analyzing-a-chunk-of-the-inter...

> For this little experiment, I've setup a single KVM instance, running a single 2GHz vCore with 2GiB of RAM and 10GiB of HDD space. This is sufficient. Probing for ftp access is an extremely CPU-intensive task. You are going to hit bottlenecks in this order:
>
> CPU
> Memory
> a whole lot of nothing
> network
>
> While the rescan was running, only about 1 to 2kpps were exchanged, while the CPU was pinned at 100%.

So this means his setup spent about 1-2 million clock cycles per probe. That's a lot!

I suppose this is because he runs the probe script once per IP address? I suspect that an implementation which stays in-process would be at least an order of magnitude faster.


Sure. Faster even with a better scheduler. I just wanted to show how the simplest and most redneck way still finishes in a reasonable amount of time. :-)


I was amazed how fast that went. Was fully expecting the story to unfold with how you rented out 100 AWS servers to complete the task, instead it was just one computer and only took hours.


It's all about reducing data offline before throwing the kitchen sink at the internet.



> Search Type: Sub String Exact Regular Expression

Google has an advantage in terms of index size, but definitely loses in terms of precision. The way it munges queries is a bit unsettling at times, and even the "exact" option doesn't always seem to work the way it should.


Have you tried verbatim mode? (Thanks https://news.ycombinator.com/item?id=12046056)


That is a huge improvement over the normal search, but unfortunately it's not truly verbatim. For example, searching for `self` with the backticks turns up results for just self with no backticks. It works with some symbols but not others, not sure why.


Now this DOES bring on the nostalgia... :-)

I recall there used to be a veronica too (Very Easy Rodent Oriented Network something or the other....), then a jughead too ...


I remember using the UI via telnet to archie.mcgill.ca. It showed percent progress, using a 32-bit signed int for the file position, so when an archive exceeded 2GB the percentage became a negative count toward 0% rather than counting up to 100%.


Are there any legal implications for doing this? Was going to do something similar with Redis and MongoDB.


This might be a useful starting point:

https://blog.shodan.io/its-still-the-data-stupid/

Shodan crawls for most NoSQL/queueing software, including MongoDB and Redis.

And related to OP: we also crawl for FTP and attempt anonymous as well as a few other things to better understand FTP deployments on the Internet.


Related to your related: Nice. Is this research public or commercial?


All the data is searchable for free on https://www.shodan.io


I'm not the person you replied to but the "we" refers to shodan.io.


Anonymous read access was once akin to a website. You could host a newsletter or some source code. An open FTP server doesn't necessarily mean a misconfiguration.

This is not the same for something like redis or mongo. You could try and prove otherwise one day to a judge but that would be your battle and I don't suggest it.


People do internet-wide scans all the time, for black hat and white hat reasons. I'm not aware of anyone doing this for research purposes ever having seen any legal consequences. But you usually get some angry emails to your abuse address.

Of course this doesn't mean that some court somewhere won't decide it's illegal. But it's a common and widespread practice.


I'm wondering where the legal boundary lies, though; port scanning might be a grey area but is mostly fine.

Connecting to an FTP service (i.e. logging in), even just for 5 or so seconds... I'm not so sure.

Especially when big companies with lax security and aggressive lawyers might see this as a "hacking attempt".


The law (in the US) pretty much defines hacking as accessing a system you don't have authorization to access. Very vague. Even a second of access to something you don't have permission for is illegal.


If your question is concerning US or German (/EU) law and how it affects scanning, you can drop me an email (minxomat@gmail.com) and I can give you some insights.


http://ftpsearch.ntnu.no used to index many of them


That would be fun to use together with the side project RandomFtpGrabber (https://github.com/albertz/RandomFtpGrabber), which downloads random stuff from a list of FTPs.


"All in all, there were exactly 18,454,087 things that responded to a banner fetch... The JSON file is about 4GiB." where can I download this JSON file? I only see the list of ips available.


I specifically left the mass scanning part out. If you want to experiment with scan data without scanning yourself, a good place to start is https://censys.io.


> consider piping the list through shuf each time you try something new. You know why.

No, I don't know why. Can someone explain? (without being condescending, preferably)


It's not entirely clear to me either, though I would assume it's so that the first server in the list doesn't get constantly hammered by a hundred people going through the list sequentially.
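
i.e., something like this (the curl probe is just a placeholder for whatever you're actually trying against each host):

    $ zcat openftp4_all_20160918.gz | shuf | while read -r host; do curl -s -l --connect-timeout 5 "ftp://$host/"; done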


hmm. Yeah that makes sense.

Would be better if they simply stated that instead of playing insinuation based guessing games.


How many of them are honeypots, intentionally left open?


How many are there with write access?

And how do they protect against spam?


any stats on how many servers are running Pure-FTPd vs. Paradise FTP vs. some other software?


There are ~300,000 Pure-FTPd instances: https://www.shodan.io/search?query=product%3Apure-ftpd+port%...

Not sure how to fingerprint Paradise FTP (searching for "Welcome to Paradise" also returned some non-relevant results) but there aren't many that contain "paradise" in their welcome banner.

Shodan also fingerprints lots of other FTP software (check out the "Top Products" section):

https://www.shodan.io/report/WHJBsZqV

ProFTPD is by far the most popular choice at the moment.

Note: Doing a search that uses a filter (ex: "product") requires a free Shodan account. None of the above require paid access, you just need a free account.
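
For command-line use, the Shodan CLI (installed with the Python package) can run the same kind of counts once a free API key is configured:

    $ pip install shodan
    $ shodan init YOUR_API_KEY
    $ shodan count 'product:pure-ftpd port:21'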


I can see having an open anonymous FTP that was read-only as a way of serving some files that needed to be downloaded and didn't contain any sensitive information. It's no different from providing a URL to them.

Or even a write-only one so people could deposit data.

The only real problem is a read-write one where people can use it to exchange information.



