The History of the URL: Domain, Protocol, and Port

gumby · on July 8, 2016

This is a good article. A few nits:

It was the ARPANET (or the arpanet since most systems were case-insensitive in those days - Multics, and later Unix, were exceptions, not the rule) as in area's network, that used arpanet protocols like NCP. You do use "the" the first time in your article but seem to have dropped it after that.

CHAOSNET was just a LAN protocol like "ethernet" or pup -- also used what we call 10base2 "thicknet" coax. It was developed at MIT's AI Lab and was pretty much used only there and at a few institutions close to MIT like Symbolics and LMI.

In the NCP days routing was handled by IMPs (Interface Message Processors) which were not PDPD-11s, and when '11s were used they were smaller than the 11/70s which you used to illustrate the article (11/70s were the largest PDP-11s made -- still 16 bit unlike the 36-bit PDP-10s which were the mainstay of academic computer science in those days).

> In this era before ‘mail servers’, if my computer was off you weren’t sending me an email.

In that era few people had what you would consider a personal computer and more likely you logged into a timesharing system that had your mail along with everything else. So your statement is true, yet an anachronism. Even if you did have your own host, the upstream host (one earlier in the ! path) would have your message so you could consider it literally your mail server.

wahern · on July 9, 2016

DNS was never ASCII only, and I've never seen DNS software make that assumption--that "every piece of internet hardware from the last fourty years, including the Cisco and Juniper routers used to deliver this page to you [assumes ASCII]".

The essay links to RFC 1035 to support its claim of ASCII only, but RFC 1035 actually says is

"However, future additions beyond current usage may need to use the full binary octet capabilities in names, so attempts to store domain names in 7-bit ASCII or use of special bytes to terminate labels, etc., should be avoided."

and

"Although labels can contain any 8 bit values in octets that make up a label, it is strongly recommended that labels follow the preferred syntax described elsewhere in this memo, which is compatible with existing host naming conventions. "

Indeed, some country TLD servers were (and maybe still are) supporting non-punycoded UTF-8 directly.

Lookups are supposed to be case-insensitive, but it's always been verboten to actually modify the case of names in a DNS packet. A query reply is supposed to include the identical question name in an 8-bit clean manner. Indeed, some DNS clients will arbitrarily randomize the case of names to add an element of randomness to thwart DNS spoofing attacks. (If the answer isn't the same 8-bit name, you ignore it just as if it came from a different IP address then you sent it to.) Unfortunately there exist enough broken DNS proxies out that software like Firefox or Chrome can't do this without headaches, but I've never encountered such broken software myself (at least, not that I knew about). At worst I've seen query responses which lack the question portion altogether, and this can cause timeouts (rather than immediate failures) for software which enables anti-spoofing measures. But I've also seen responses which lack the same QID, too. There's always broken software; the threshold for when you can ignore it is highly context dependent.

bluejekyll · on July 9, 2016

> some DNS clients will arbitrarily randomize the case of names to add an element of randomness to thwart DNS spoofing attacks.

I believe this is undefined behavior. It shouldn't be something you count on. The only reference I found in the spec that implies this is:

The question section of the response matches the question section of the query

From rfc 1034. Which isn't very specific, but could be interpreted by some in the way you mean.

If you want to secure the request, it's best to randomize the QID and outbound port. If a server responds with the wrong QID, I'd ignore it.

wahern · on July 9, 2016

From RFC 1034 S. 3.1: "[D]omain name comparisons for all present domain functions are done in a case-insensitive manner, assuming an ASCII character set, and a high order zero bit. When you receive a domain name or label, you should preserve its case. The rationale for this choice is that we may someday need to add full binary domain names for new services; existing services would not be changed."

First, we can't speak of it being undefined in the same manner as we do undefined in the C standard. The DNS standards weren't this rigorous, and didn't use consistent terminology like MUST and SHOULD universal in today's RFCs.

Second, they were explicit that while the existing services (e.g. IN class and A record type) were ASCII-based and case-insensitive, the binary protocol was meant to be 8-bit clean, that some labels might be 8-bit in the future, and it was expected and mandated that this capability be preserved. So strictly speaking, the RFC allowed a server to, e.g., modify the case of an A record label on the wire, but not of some unknown label. In practice it's easier to simply treat all labels in an 8-bit clean manner, and that's in fact what major implementations do. You literally have to go out of your way to do otherwise while still obeying the standard.

Caching name servers like BIND and unbound will reply with the identical question label. For example, notice in the following how the TTL is decremented (and thus being pulled from cache) but the query case is preserved:

  % dig -t A google.com                               
  ; <<>> DiG 9.8.3-P1 <<>> -t A google.com
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20838
  ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
  
  ;; QUESTION SECTION:
  ;google.com.			IN	A
  
  ;; ANSWER SECTION:
  google.com.		105	IN	A	172.217.4.206
  
  ;; Query time: 0 msec
  ;; SERVER: 192.168.2.1#53(192.168.2.1)
  ;; WHEN: Sat Jul  9 00:45:57 2016
  ;; MSG SIZE  rcvd: 44

  $ dig -t A GoOgLe.com                               
  ; <<>> DiG 9.8.3-P1 <<>> -t A GoOgLe.com
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7947
  ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
  
  ;; QUESTION SECTION:
  ;GoOgLe.com.			IN	A
  
  ;; ANSWER SECTION:
  GoOgLe.com.		95	IN	A	172.217.4.206
  
  ;; Query time: 0 msec
  ;; SERVER: 192.168.2.1#53(192.168.2.1)
  ;; WHEN: Sat Jul  9 00:46:07 2016
  ;; MSG SIZE  rcvd: 44

In reality, the core DNS infrastructure was perfectly capable of fully supporting raw UTF-8 labels (Though a DJB page suggests that some older versions of Unix gethostbyname stripped 8-bit labels.) Unlike other infrastructure, the implementations were fairly homogenous (until a few years ago BIND absolutely dominated), so ad hoc (and broken) implementations were few and far between. And unlike other infrastructure, there was very little incentive to violate 8-bit cleanliness. The biggest problems were not that some ad hoc implementations modified case, per se, but that some ad hoc caching proxies would reply with the case of a cached record. That's out of sheer laziness, or because they didn't read the standard closely enough. It's telling that BIND, unbound, and other major caching proxies are careful to preserve case in the reply even though that's not necessarily the easiest solution.

The real problem was edge software, like browsers, e-mail clients, etc, that baked in way more assumptions than warranted. Arguably IDNA and punycode took more effort to roll out than would have alternatives based on raw UTF-8. The core infrastructure software wasn't a real barrier, and the IDNA solution required more code at the edges. While the major browsers were facing lots of work regardless, most ad hoc software would have been fine just fixing 8-bit cleaniness problems and then punting on things like glyph security issues, especially if they weren't directly user facing. The vast majority of edge software would have just required some slight refactoring, not huge rewrites with library dependencies for the new compression scheme, etc.

rconti · on July 9, 2016

> The first 32 identified the remote host, similar to how an IP address works today. The last eight were known as the AEN (it stood for “Another Eight-bit Number”), and were used by the remote machine in the way we use a port number

Gold.

Great read, it hits home for me with the right mix of nostalgia, history from before my time, and funny little things I never knew.

b15h0p · on July 9, 2016

Another nitpick: on iOS Safari, that pizza-poo-domain name actually does show up in the address bar. So there has to be another mechanism that prevents the Amazon-with-Cyrillic-"a"-trick which I guess involves normalization.

gregrata · on July 8, 2016

Great read! "Thanks" for all the Wikipedia links - I ended up wasting a few hours reading more details

echeese · on July 8, 2016

For http:com/example/foo/bar/baz how would you determine what the host is?

wtbob · on July 8, 2016

It doesn't actually matter: in a world which used that sort of addressing, one could imagine saying to com 'give me HTTP info for your example/foo/bar/baz,' to com/example 'give me HTTP info for your foo/bar/baz' and so forth; in that case, com would just say, 'hey, go talk to 266.328.0.1 (that's what I call example)' and 266.328.0.1 would cheerfully return the information stored at the filesystem path /foo/bar/baz, or it could say, 'hey, I call foo 463.622.42.17' and your browser would keep resolving.

Me, I kinda wish we wrote URLs as http://com.example.host.invalid/path/to/resource.

gumby · on July 8, 2016

> Me, I kinda wish we wrote URLs as http://com.example.host.invalid/path/to/resource.

The UK's predecessor to the DNS worked this way (a "big endian" hierarchy). Sorry I can't remember the network name; if I remember it was rooted in gb.

Theodores · on July 9, 2016

I am old enough to remember the final year of JANET with the backwards addresses.

At the time I was at Plymouth Marine Laboratory with a university email address of the type researcher@uk.ac.pml - i.e. backwards. However, in those days there were many different things about networks, you could have several connector types in a room so anything beyond email was a bit like the difference between travelling across state borders in the U.S. and travelling across the Iron Curtain. I can't remember how one got from one's VT terminal to the wider internet on VAX/VMS but that was possible. FTP and some Telnet was how it worked, none of this www stuff.

The change of address structure to normal internet style was not that big of a change, you would think it would have been as traumatic as changing what side of the road to drive on or the Millennium Bug, but, the change happened with no huge amount of work needed or resultant disruption.

rjsw · on July 9, 2016

I am old enough to have been a JANET site administrator.

The JANET when I used it ran over a private X.25 network with a few gateways to BT's public X.25 network. There was a gateway to the internet at University of London Computer Centre but it only provided an FTP client.

cbd1984 · on July 9, 2016

JANET, which routed email to Czechoslovakia, as the legend goes...

Back then mainly the Computer Science departments had email, so they'd have domain names beginning with a cs. in the ARPA scheme, but, since JANET did it backwards, you'd have to rearrange the domain name so it ended with a .cs for that network. If you did that and didn't reverse it back, the domain name would have a ccTLD of .cs, which is what Czechoslovakia used.

(The .cs ccTLD existed until 1995, years after Czechoslovakia ceased to. The .su ccTLD (Soviet Union) still exists.)

jameshart · on July 8, 2016

Of course that combined host/path winds up leaking not just what hosts you are interested in but what specific resources you are after to your DNS resolver - would be quite damaging to Privacy and security of https

wtbob · on July 9, 2016

True; as digi_owl indicates, you could only ask for the next element in the path, too.

Also, you are implicitly trusting each level in the tree to be honest, anyway: even if you say, 'hey com, give me example,' he could always give you the address of a computer he controls instead of the real com.example, and thus get the next item in the path from you when you ask him to resolve it.

You can't get away from trust: whether it's trust in DNS, or trust in CAs, or evne trust in the great masses reporting public keys seen in the wild, you can't get away from trust.

digi_owl · on July 8, 2016

I guess rather than handing every resolver the full path, you could go "give me example" and then respond depending on it giving a address or a directory listing in return.

zackbloom · on July 8, 2016

That's very interesting! Removing the separation between DNS resolving hosts and applications resolving paths. All paths could be resolved by a hierarchical DNS-like system which you also ran inside your service to route requests to subfolders. Cool.

inopinatus · on July 8, 2016

Fun fact: nowhere in the HTTP protocol specification does it say "use DNS". It is a convention that we do. It is a further convention that we use A records. And in my opinion it was a travesty that HTTP/2 did not mandate using DNS with SRV records.

LukeShu · on July 9, 2016

I was about to disagree with you and say that it indirectly says to use DNS through RFC 3986. But you're right!

RFC 3986 says to (unless otherwise directed by the URI scheme) to use the operating system's registered name resolution mechanism:

                                                        Instead, it
   delegates the issue of registered name syntax conformance to the
   operating system of each application performing URI resolution, and
   that operating system decides what it will allow for the purpose of
   host identification.  A URI resolution implementation might use DNS,
   host tables, yellow pages, NetInfo, WINS, or any other system for
   lookup of registered names.

It's just a convention that all modern operating systems (primarily) use DNS!

wahern · on July 8, 2016

Corporate firewalls have ruined the internet. You can't put a website on an arbitrary port. The shift to SSL won't even change that, because sysadmins and similar professionals like categorizing traffic by port, and no amount of reasoning will change that. They still think you can filter good/bad, allowed/disallowed by port, even if you can't scan it.

Load-balancing isn't within the ambit of HTTP, so that doesn't weigh in favor of SRV.

The _service._proto template for SRV names is pointless for anything already communicating over HTTP, as it's going to be either too specific/redundant or not specific enough for your particular scenario. For example, I'm using SRV for a automatic service registration and discovery project and ignoring it entirely. But ignoring that in a standard is awkward.

Finally, IPv6 (presuming we get there!) will make ports redundant. You can just assign a new address to your service. The historical baggage will be annoying (requiring root to bind to ports <1024), but with VMs and containers corporate software doesn't bother with old-school best practices like that.

And we're ignoring that there's already an installed base and market for load balancers, redirectors, etc. I understand that the dream is to be able to move that logic back toward the edges, and I whole heartedly agree. But as with filtering by port number, the majority of IT professionals just don't think that way.

inopinatus · on July 10, 2016

We achieve change incrementally. If you think corporate firewalls and inflexible admins should define the architecture of the internet, you have already lost. In practice, they adapt. Believe me they are already having to, with the change of wire format and connection behaviours due to HTTP/2 requests.

Actually nothing persuades a stick-in-the-mud sysadmin quite like a ratified standards document: that is something they can relate to. I say this with confidence because I am a stick-in-the-mud sysadmin.

The major immediate utility of SRV records is one we haven't mentioned: they can be aliases at apex records, because they are not A/AAAA records.

The broader case is that all protocols should be using SRV because overloading address records for service discovery has so many misbehaviours. Allocating multiple IPv6 addresses doesn't fix the problem of overloaded symbolic names. Using address records hurts the adoption of DNS for federation.

zackbloom · on July 8, 2016

He, unfortunately, didn't disambiguate between example.com/foo/bar/baz, foo.example.com/bar/baz, etc. so it's unclear.

It is kind of amusing to think of the entire Internet as a giant directory though.

thaumasiotes · on July 9, 2016

> It is kind of amusing to think of the entire Internet as a giant directory though.

How else would you think about it?

krapp · on July 9, 2016

A distributed operating system, with the URL being a shell command sent to some remote application. The resemblance between a URL and a directory path is becoming an anachronism anyway, as:

    GET http://example.com/foo/bar/baz

might as well be

    return com.example.foo(bar, baz)

more often then than not.

thaumasiotes · on July 9, 2016

I don't understand the distinction you're trying to draw. On a personal machine, a directory path might look like /usr/bin/vi whereas a command would look like /usr/bin/vi.

Command invocations require you to identify what command you want. So we provide locations.

krapp · on July 9, 2016

My point is that a URL is often an abstraction of an invocation of a function or class with segments and query parameters mapping to methods or arguments, and that it doesn't need to have any relationship to file paths on the server, despite superficially looking like a file path.

jacquesm · on July 9, 2016

> A distributed operating system, with the URL being a shell command sent to some remote application.

All too often URLs really do allow you to issue shell commands (or raw SQL), even today!

userlabs · on July 8, 2016

very good info thanks

jhardy54 · on July 8, 2016

> This restriction on HTML was ultimately removed in 2007 and that same year Unicode became the most popular encoding on the web.

Nitpick: Unicode is a character set, UTF-8 is an encoding.

zackbloom · on July 8, 2016

Should be fixed shortly, thanks!

bluejekyll · on July 9, 2016

Nit pick: the modern internet is built on IP, not TCP/IP, but it would be fair to say most protocols use TCP today, but definitely not all.

RickHull · on July 8, 2016

Great article. Another nitpick:

> It’s important to dispel any illusion that these decisions were made with precence for the future the domain name would have.

I don't think precence is a word, and I'm not sure what would make sense as its replacement.

juliendorra · on July 8, 2016

Prescience (knowledge of things before they happen)

schoen · on July 8, 2016

Now I wonder if "foreknowledge" could have come into English as a calque of Latin "praescientia".

zackbloom · on July 8, 2016

Fixed, thanks so much!