Ever wondered how many open FTP servers there are?

DonHopkins · on Sept 18, 2016

I used to leave a file called README on my public ftp directory, which contained only:

cat: README: No such file or directory

I'd occasionally get email from frustrated people who had trouble trying to read the README file, so I'd tell them to simply run "emacs README", and emacs would solve all of their problems. I don't know if my passive aggressive emacs evangelism ever worked, because I never heard back from them.

anchpop · on Sept 18, 2016

Reminds me of the Abbott and Costello skit "Who's on first" [1]

[1] https://youtu.be/kTcRRaXV-fg?t=63

jb1991 · on Sept 18, 2016

hilarious. reminds me of many good old days tricks and fun.

kragen · on Sept 18, 2016

I got my first internet account on my high school's VAX in 1992 in order to download Dr. Dobb's code from ftp.mv.com. The VAX had a help file installed which was a list of thousands of anonymous FTP servers, most of which requested that you please not use anonymous FTP during business hours to avoid overloading them; shortly I was downloading all kinds of things from wuarchive.wustl.edu (which had a LOT of stuff) and WSMR-SIMTEL20.army.mil, which worried my dad. I bound some function keys on my terminal to different internet commands (FTP, TELNET) and internet sites (like those I mentioned) so I could FTP to wuarchive or TELNET to HPCwire with two keystrokes.

At some point I got hold of Scott Yanoff's list of interesting Internet services (capitalizing Internet was still justifiable at the time) and learned about the Weather Underground, Archie, HPCWire, and this new thing called the "World Wide Web" — I started telnetting to a server at the University of Kansas where I could use Lynx, and it seemed pretty clear that this was going to be a big deal, because of how enormously much easier it was than downloading text files over FTP.

So not only have I wondered how many open FTP servers there are, my exploration of the internet pretty much started with a list of them.

Nowadays I occasionally look for FTP servers because they tend to be less of a pain in the ass for downloading stuff than HTTP servers — you can usually get a full list of what they have, and they never interrupt you with CAPTCHAs. It's kind of like a real-world "shibboleet" — I guess sometimes assholes push mandates for CAPTCHAs and whatnot on a company's technical folk, but they leave FTP open because the assholes don't know about it.

If you're wondering how many open HTTP servers there are, Netcraft does a pretty good monthly survey.

jlgaddis · on Sept 18, 2016

In my youth, I probably downloaded hundreds of gigabytes from wuarchive.wustl.edu, ftp.cdrom.com, metalab.unc.edu, and the like. Over a dial-up modem, no less.

Before I had local Internet access (i.e., long distance calls), I used the various "FTP by e-mail" services with my free "Juno" e-mail account (they had toll-free numbers for access!).

dmd · on Sept 18, 2016

"hundreds of gigabytes"? Do you mean megabytes? Don't know about you, but when I was downloading stuff from those sites, downloading "hundreds of gigabytes" would have literally have taken a few years at minimum, many decades more realistically.

kragen · on Sept 18, 2016

200 gigabytes at 57600 bits per second is a bit under 46 weeks. They could usually sustain that bandwidth; the bottleneck was usually the modem.

nissehulth · on Sept 18, 2016

When I was using dialup speed in the 90s, I certainly didn't have any gigabyte storage.

13of40 · on Sept 18, 2016

One of my favorite ways to gauge "how far we've come" is that I remember connecting to a BBS in Europe in the early 90's that boasted 30 gigabytes of downloadable files. That would have filled up about 20000 floppy disks, and barring any kind of creative phreaking would have cost hundreds of thousands of dollars to download.

dmd · on Sept 18, 2016

Ah, I was at 2400 bps until I had a university ethernet connection.

DonHopkins · on Sept 18, 2016

Anybody remember when modems cost $1 per baud?

Esau · on Sept 18, 2016

"capitalizing Internet was still justifiable at the time"

Given the Internet is a particular thing, shouldn't it still be used as a proper noun?

akshatpradhan · on Sept 18, 2016

AP Style Alert. Internet and web are no longer to be capitalized. Check the googles.

Esau · on Sept 18, 2016

I don't work for the AP so I will continue to capitalize it.

acheron · on Sept 18, 2016

They are 100% wrong and should not be encouraged.

walterbell · on Sept 18, 2016

Language mostly evolves by usage, not mandate.

kragen · on Sept 18, 2016

That isn't the rule in English; the telephone system is also still a particular thing, as is the ocean, the sky, and the air.

What would you think if someone asked you to call them "on the Telephone"?

effie · on Sept 18, 2016

I think he meant that the name 'Internet' refers to a unique thing, the working global communication network connecting smaller networks of computers. The word 'internet', if it is to be accepted as a word with stable meaning in written language, should then not refer to the unique global network, but to something general, less unique, like the information service the 'Internet' is known to enable. If I saw the name 'Telephone system' with capital T in the middle of a sentence it would be curious, since such a name, in contrast to 'Internet' is extremely uncommon. Nevertheless, if the writer meant a unique system that has this name 'Telephone system', it would be correct to capitalize. However, when somebody asks me "call me on the telephone", they do not refer to unique global telephone system, they just refer to common service, like they would if they said "send it to me by mail" or "you can find it on the internet".

AckSyn · on Sept 18, 2016

It is. And it's fairly straightforward:

https://en.wikipedia.org/wiki/Capitalization_of_"Internet"

p4bl0 · on Sept 18, 2016

I have a fun story about open FTP servers.

When we were undergrad students, a friend and I wondered exactly that same question about FTP servers. So we wrote a script that tested random IPv4 for FTP servers using nmap and then attempted to connect.

To our surprise we found quite a few. Most of them were small or not writable, but at least one of them was writable, and it had hundreds of GB of free space. So we connected to it, and then saw that it had huge data files containing what seemed like random junk.

Then, we tried to ssh to the box, but no ssh server was listening. So we tried to telnet in. That worked but we were prompted:

    Enter password:

Ah, too bad, we will never guess what the password is… Mh, let's try "123456" just in case… But then, instead of logging us in or telling us that we entered the wrong password, it simply said:

    Re-enter password:

Mh? So we re-entered "123456".

    The password has been set.

Haha! The owner must never have set this up. And then we were in. It was a very strange system with a few commands with standard names but non-standard behaviors, and a very minimalistic shell. By poking around and searching the web, we understood that we were actually connected to a surveillance camera and that the big files were probably parts of video.

Poking around more gave us access to other such cameras in the same network and more importantly to a web interface from which we could see the videos streaming live. We saw offices and people doing stuff like taking the garbage out (I don't know why but that is one of the more precise image I can recall ^^). The only distincive thing I remember was a big sign saying "Miami Fitness Club".

After that we never did anything about it as we had other things to do, but I kind of cherish this story as a nice souvenir of my first year at the ENS.

saganus · on Sept 18, 2016

I remember doing something similar except it was with my cable provider, when they first came into town.

Since all clients would basically be connected to a LAN, as soon as I found I could port scan random users I started doing it.

Of course there were a lot of businesses on the network that apparently used FTP to move files around but were unsecured.

I spent the better part of a couple of weeks just going through the data (never actually downloaded anything since I knew it could land me in trouble with my parents).

Like you said, I also cherish this story since it was a rare peak at other peoples lives... without even knowing a 12 year old had access to all their business files.

tthayer · on Sept 18, 2016

@Home was wide open around 1999ish. My friend would send random (if you can call goatse.cx random) print jobs to various open SMB printer shares. Ah, to be young again.

sillysaurus3 · on Sept 18, 2016

  $ lz5
  -bash: lz5: command not found
  $ brew install lz5
  ==> Auto-updated Homebrew!
  Updated Homebrew from b5a6b4e to 7926114.
  Error: No available formula with the name "lz5" 
  ==> Searching for similarly named formulae...
  Error: No similarly named formulae found.
  ==> Searching taps...
  Error: No formulae found in taps.

So, rather than complain about this, how do I do something to fix it? Can I add lz5 as a tap to Homebrew?

EDIT: Let me try this again, in a more productive way.

lz5 does not currently exist in Homebrew. I'd like to fix this. I've never done this before. Does anyone have advice? Is it as simple as forking lz5, then adding a tap to Homebrew?

Thanks. And as a note, this edit occurred after specialp's replies. They were right: this originally wasn't a productive comment.

sepbot · on Sept 18, 2016

https://github.com/Homebrew/brew/blob/master/Library/Homebre...

sillysaurus3 · on Sept 18, 2016

Ah, it'd be a cask? Cool, thanks.

reaperhulk · on Sept 18, 2016

https://github.com/Homebrew/homebrew-core/blob/master/.githu... Is a better link. It would just be a new formula in homebrew-core

specialp · on Sept 18, 2016

Yes it doesn't exist in a binary package for OSX so thus it does not exist.

sillysaurus3 · on Sept 18, 2016

Again: Rather than complain, what can I do to make this a little easier for OS X people?

specialp · on Sept 18, 2016

Yes after you edited your comment

userbinator · on Sept 18, 2016

I wonder how many of those are just mirrors of Linux distros and other open-source software, how many have more interesting things (including software), and how many of those were deliberately configured to be open for sharing. There is the somewhat-well-known filesearch.ru if you want to look for things on this non-HTTP part of the Internet. (If I remember correctly, Google used to index FTPs too and you'd get plenty of results with the right queries, but that seems to have mostly and silently disappeared...)

summarity · on Sept 18, 2016

Actually most of them. It's pretty easy to eliminate them if you have the hostname entry from the scan and cross-reference with public distro mirror lists. Also exclude any edu servers and names of universities etc. If you want to look for actual files, there's hardly anything better than http://filemare.com. Crawlers like Napalm (http://www.searchftps.net/) focus more on servers that are meant to be public. Using filemare one can find the interesting things ;-)

bane · on Sept 18, 2016

There's still a uprising number of niche ftp site around: MSX, Demoscene, etc. Mostly older scenes that pre-date the WWW and are still around. I think it might be useful to test every port on every ip to see what happens protocol-wise. Limiting to just common ports is probably missing lots of cool things.

andai · on Sept 18, 2016

How many ports are there?

aeinstein1 · on Sept 18, 2016

According to Reserved IP addresses there are 588,514,304 reserved addresses and since there are 4,294,967,296 (2^32) IPv4 addressess in total, there are 3,706,452,992 public addresses. There are 65536 ports, so one would have to scan all those ports on all those addresses.

jldugger · on Sept 18, 2016

> I wonder how many of those are just mirrors of Linux distros and other open-source software

Mine are, at least.

lucb1e · on Sept 18, 2016

S/he talks about it as if it's something bad, something unfixed. The whole thing sounds like it but in particular "[to be excluded] go fix your shit".

I agree that if you don't want people to access it, you should secure it. Yet not all these servers are accidentally open: my ftp on 80.100.131.150 (I assume it's in there) hosts a copy of Damn Small Linux because all downloads were extremely slow or broken at the time.

a3_nm · on Sept 18, 2016

Agreed. Having an open FTP server with read-only access is no more problematic than having an open HTTP server.

djsumdog · on Sept 18, 2016

I remember back in the 90s, people would hunt for FTP servers that allowed anonymous writes. Companies that didn't know how to secure servers would suddenly be hosting a ton-o-warez.

roel_v · on Sept 18, 2016

I used to 'accidentally' 'misconfigure' my servers and watch what would show up. When a new group found such a server, it eas usually first used for 'internal distribution' of thr latest releases, especially if it was an fxp-capable server. After a while when it got more busy (hitting rate limits, disk full) it was downgraded to 'end user distribution'.

So I got to watch the latest leaked movies without having to directly deal with (spend time on) being 'inner circle' :)

bane · on Sept 18, 2016

And other things. There was a subculture called the "WA" for a while that took over a handful of open FTP servers and posted nonsensical poetry during the 90s -- kind of a precursor to the FSM and a contemporary to the "Church of the Sub-genius"[1]. I've seen printouts of their "work" proudly posted on numerous college dorms during the mid to late 90s. The WA never became as big as either of the other satire religions, but I definitely remember their presence on a handful of tech-school ftp sites.

1 - https://en.wikipedia.org/wiki/Church_of_the_SubGenius

ryankupyn · on Sept 18, 2016

They could have made the world's first distributed remote file storage, parcelling out customer's files across 20 different servers. They just had to back up each file to multiple ftp sites and add new ones as their actual owners rooted them out. I'm sure availability would be fantastic, and the costs of running not-your-servers are pretty darn low.

ChrisClark · on Sept 18, 2016

Especially hidden down some deep directory structure, some directories even named .. and . to try to hide themselves.

rb2k_ · on Sept 18, 2016

I recall other folder names that would presumably make windows machines throw up when you clicked on them. Stuff with "LPT:" or "COM:" in the names.

gscott · on Sept 18, 2016

Warez ftp sites used to use high ascii in the file names. Used to have that memorized.

leeoniya · on Sept 18, 2016

> would suddenly be hosting a ton-o-warez.

which could then be found by searching for "index of"...

Timethy · on Sept 18, 2016

It's been a while but I believe that's a response from Apache server not ftp.

jlgaddis · on Sept 18, 2016

... if only we had had search engines then. :)

dalke · on Sept 18, 2016

I used to telnet archie.mcgill.ca to search the collected index of archive sites, back in 1991 or so.

https://en.wikipedia.org/wiki/Archie_search_engine

api · on Sept 18, 2016

That still happens.

specialp · on Sept 18, 2016

And then you could use FXP to transfer to another "pub" Usually most open FTPs had their /pub public writable and thus they were called "pubs"

jftuga · on Sept 18, 2016

Side note: "xz -9 -e" compresses the file to 3,296,864 where as "lz5 -15" only compresses the original file to 4,643,261 bytes. The xz compressed file is 29% smaller.

Even "gzip -9" compresses the file to 4,035,858.

So I wonder why lz5 was chosen for compression.

mappu · on Sept 18, 2016

There's more low-hanging fruit before picking a different compressor, too. For instance, sorting the data:

    $ zcat openftp4_all_20160918.gz | sort | gzip -9 > sorted.gz

    $ ls -lha *.gz
    -rw-r--r--+ 1 mappu mappu 3.9M Sep 18 18:17 openftp4_all_20160918.gz
    -rw-r--r--+ 1 mappu mappu 2.2M Sep 18 18:19 sorted.gz

`zpaq` should out-compress xz on the unsorted file, too, but i haven't tried it.

jlgaddis · on Sept 18, 2016

How long did that "xz -9 -e" take compared to "lz5 -15" or "gzip -9"?

mappu · on Sept 18, 2016

Numbers on my machine, ordered by compression duration:

    Compressor    Duration    Size
    cat           0.0s        11385584 [*]
    gzip -1       0.2s        4918778 [*]
    bzip2 -1      0.9s        3334653
    bzip2 -9      0.9s        3122347 [*]
    xz -1         1.1s        4222016
    gzip -9       2.4s        4085818
    lz5 -15       6.9s        4643261
    xz -9 -e      10.7s       4033589
    zpaq -m5      34.2s       2655834 [*]

A [*] indicates that no better compressor was faster. Test methodology was `cat | $compressor > output` run 3-4 times to get an average.

I'm surprised by how bzip2 turned out.

danielrm26 · on Sept 18, 2016

I recommend using Shodan for this kind of stuff.

https://shodan.io

ryanmccullagh · on Sept 18, 2016

Scanning the IPv4 space. I know there are many different projects that do it. I was thinking about how I would do this today. I believe the first step would be to arrive at all the IPv4 blocks (/22, etc), then do a calculation to arrive at the address of each based on the prefix. Then in an array of threads or so, try to connect(2) to the address on some type of service in a timeout handler. If it succeeds, then consider that address as up. I would consider doing this in an async loop with epoll(7), so that many connections could be attempted at once and improve through put.

Anyway. nmap can probably do this and is a great tool

summarity · on Sept 18, 2016

You wouldn't want to use nmap for WAN. nmap is really great if both the scanning and target nodes are somewhat distrbuted across a LAN. Internet mapping is done using Z{map,grab} (censys.io) or masscan. Reduce using the Censys or Shodan search.

toast0 · on Sept 18, 2016

If you want to have your scan done in a reasonable amount of time, you probably want to use raw packets, and not connect.

If you don't mind all the flak you'll get, just send a SYN on the port you care about to each IP (maybe skip rfc1918, multicast and reserved addresses; or only send to addresses included in bgp announcements). If you send one packet to each addresses, including the addresses you should probably skip that's 4 Billion packets; if you do it at 1M pps (should fit on a 1Gbps ethernet connection), that's less than two hours.

summarity · on Sept 18, 2016

That's what masscan automates. To fully utilize ms, one requires a friendly ISP. And even then, they can only be ms-friendly as long as their peers are. If you annoy the peers enough, they'll just drop you (http://www.sudosecure.com/ecatels-harboring-of-spambots-and-...). Many datacenters classify port-scanning as an offensive action, even with low package thoughput.

Pulling data from research servers (such as Censys), reducing and then scanning is always a good idea.

jlgaddis · on Sept 18, 2016

masscan: https://github.com/robertdavidgraham/masscan

sigill · on Sept 18, 2016

The submission links to a blog post on how the data was retrieved: http://255.wf/2016-09-18-mass-analyzing-a-chunk-of-the-inter...

> For this little experiment, I’ve setup a single KVM instance, running a single 2GHz vCore with 2GIB of RAM and 10GiB of HDD space. This is sufficient. Probing for ftp access is an extremely CPU-intensive task. You are going to hit bottlenecks in this order: > > CPU > Memory > a whole lot of nothing > network > > While the rescan was running, only about 1 to 2kpps were exchanged, while the CPU was pinned at 100%.

So this means his setup spent about 1-2 million clock cycles per probe. That's a lot!

I suppose this is because he runs the probe script once per IP address? I suspect that an implementation which would stay in-process would be at least an order of magnitued faster.

summarity · on Sept 18, 2016

Sure. Faster even with a better scheduler. I just wanted to show how the simplest and most redneck way still finishes in a reasonable amount of time. :-)

bemmu · on Sept 18, 2016

I was amazed how fast that went. Was fully expecting the story to unfold with how you rented out 100 AWS servers to complete the task, instead it was just one computer and only took hours.

summarity · on Sept 18, 2016

It's all about reducing data offline before throwing the kitchen sink at the internet.

barefootcoder · on Sept 18, 2016

Good old archie... http://archie.icm.edu.pl/archie-adv_eng.html

userbinator · on Sept 18, 2016

> Search Type: Sub String Exact Regular Expression

Google has an advantage in terms of index size, but definitely lose in terms of precision. The way it munges queries is a bit unsettling at times, and even the "exact" option doesn't seem to always work the way it should.

daurnimator · on Sept 18, 2016

Have you tried verbatim mode? (Thanks https://news.ycombinator.com/item?id=12046056)

mikeash · on Sept 18, 2016

That is a huge improvement over the normal search, but unfortunately it's not truly verbatim. For example, searching for `self` with the backticks turns up results for just self with no backticks. It works with some symbols but not others, not sure why.

webtechgal · on Sept 18, 2016

Now this DOES bring on the nostalgia... :-)

I recall there used to be a veronica too (Very Easy Rodent Oriented Network something or the other....), then a jughead too ...

dalke · on Sept 18, 2016

I remember using the UI through telnet archie.mcgill.ca. It showed percent progress. The location used a 32-bit signed int. When the archive exceeded 2GB, the percentage became a negative count to 0% rather than count-up to 100%.

heydonovan · on Sept 18, 2016

Are there any legal implications for doing this? Was going to do something similar with Redis and MongoDB.

achillean · on Sept 18, 2016

This might be a useful starting point:

https://blog.shodan.io/its-still-the-data-stupid/

Shodan crawls for most nosql/ queueing software including mongodb and redistribution.

And related to OP: we also crawl for FTP and attempt anonymous as well as a few other things to better understand FTP deployments on the Internet.

summarity · on Sept 18, 2016

Related to your related: Nice. Is this research public or commercial?

achillean · on Sept 18, 2016

All the data is searchable for free on https://www.shodan.io

jlgaddis · on Sept 18, 2016

I'm not the person you replied to but the "we" refers to shodan.io.

MichaelRenor · on Sept 18, 2016

Anonymous read access was once akin to a website. You could host a newsletter or some source code. An open FTP server doesn't necessarily mean a misconfiguration.

This is not the same for something like redis or mongo. You could try and prove otherwise one day to a judge but that would be your battle and I don't suggest it.

hannob · on Sept 18, 2016

People do internet-wide scans all the time, for black hat and white hat reasons. I'm not aware of anyone doing this for research purposes ever having seen any legal consequences. But you usually get some angry emails to your abuse address.

Of course this doesn't mean that some court somewhere may think this is illegal. But it's a common and widespread practice.

djhworld · on Sept 18, 2016

I'm wondering where the legal boundary lies though, port scanning might be a grey area but mostly fine.

Connecting to an FTP service (i.e. logging in), even just for 5 or so seconds....I'm not so sure.

Especially when big companies with lax security and aggressive lawyers might see this is as a "hacking attempt"

colejohnson66 · on Sept 18, 2016

The law (in the US) pretty much defines hacking as accessing a system you don't have authorization to access. Very vague. Even a second of access to something you don't have permission to do is illegal.

summarity · on Sept 18, 2016

If your question is concerning US or German (/EU) law and how it affects scanning, you can drop me an email (minxomat@gmail.com) and I can give you some insights.

awqrre · on Sept 18, 2016

http://ftpsearch.ntnu.no used to index many of them

albertzeyer · on Sept 18, 2016

That would be fun to use together with fun side project [RandomFtpGrabber](https://github.com/albertz/RandomFtpGrabber) which will download random stuff from a list of FTPs.

andrewfromx · on Sept 18, 2016

"All in all, there were exactly 18,454,087 things that responded to a banner fetch... The JSON file is about 4GiB." where can I download this JSON file? I only see the list of ips available.

summarity · on Sept 18, 2016

I specifically left the mass scanning part out. If you want to experiment with scan data without scanning yourself, a good place to start is https://censys.io.

repler · on Sept 18, 2016

> consider piping the list through shuf each time you try something new. You know why.

No, I don't know why. Can someone explain? (without being condescending, preferably)

sprocket · on Sept 18, 2016

It's not entirely clear to me either, though I would assume it's so that the first server in the list doesn't get constantly hammered by a hundred people going through the list sequentially.

repler · on Sept 18, 2016

hmm. Yeah that makes sense.

Would be better if they simply stated that instead of playing insinuation based guessing games.

pmontra · on Sept 18, 2016

How many of them are honeypots, intentionally left open?

amelius · on Sept 18, 2016

How many are there with write access?

And how do they protect against spam?

andrewfromx · on Sept 18, 2016

any stats on how many servers are running Pure-FTPd vs. Paradise FTP vs. some other software?

achillean · on Sept 18, 2016

There are ~300,000 Pure-FTPd instances: https://www.shodan.io/search?query=product%3Apure-ftpd+port%...

Not sure how to fingerprint Paradise FTP (searching for "Welcome to Paradise" also returned some non-relevant results) but there aren't many that contain "paradise" in their welcome banner.

Shodan also fingerprints lots of other FTP software (check out the "Top Products" section):

https://www.shodan.io/report/WHJBsZqV

ProFTPD is by far the most popular choice at the moment.

Note: Doing a search that uses a filter (ex: "product") requires a free Shodan account. None of the above require paid access, you just need a free account.

GirlsCanCode · on Sept 18, 2016

I can see having an open anonymous FTP that was read-only as a way of serving some files that needed to be downloaded and didn't contain any sensitive information. It's no different from providing a URL to them.

Or even a write-only one so people could deposit data.

The only real problem is a read-write one where people can use it to exchange information.