Ethernet History Deepdive – Why Do We Have Different Frame Types? (lostintransit.se)
175 points by un_ess 32 days ago | 83 comments



Ironically, this version of the header published in 1980 is what we still use to this day.

IMHO Ethernet is one of the great examples of backwards compatibility in the computing world. Even the wireless standards present frames to the upper layers as if they were Ethernet. It's also a counterexample to the bureaucracy of standards bodies --- the standard that actually became widely used was the one that got released first. The other example that comes to mind is OSI vs DoD (TCP/IP).
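
For the curious, that 1980 DIX header really is still the header software sees today: 6-byte destination MAC, 6-byte source MAC, 2-byte EtherType, then payload. A minimal Python sketch of parsing it (purely illustrative):

  import struct

  def parse_ethernet_ii(frame: bytes):
      # 14-byte DIX/Ethernet II header: destination MAC, source MAC, EtherType
      dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
      return dst.hex(":"), src.hex(":"), hex(ethertype), frame[14:]

  # The EtherType values registered decades ago are still the ones in use:
  # 0x0800 = IPv4, 0x0806 = ARP, 0x86DD = IPv6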


> It's also a counterexample to the bureaucracy of standards bodies --- the standard that actually became widely used was the one that got released first.

Sounds like a cautionary tale: whatever gets released first will stick. If you make a blunder, generations will have to live with it (like IPv4).


OSI was usable as a wide area network before TCP/IP was.


[citation needed]


Personal experience.

I was developing OSI applications in 1989; you could order an X.25 circuit from your PTT and run OSI over it using the ISODE [1] toolkit.

The earliest ISP in my country didn't start until 1992.

[1] https://en.wikipedia.org/wiki/ISO_Development_Environment


TCP/IP was developed in the 1970s and adopted as the protocol standard for ARPANET (the predecessor to the Internet) in 1983.


And commercialization didn't happen until around 1994. OSI was still being proposed as a successor when the first BGP4 RFC came out.

Before commercialization, IP was mostly the realm of government and education.


I feel like your experience was quite different to mine. I used TCP/IP in the late 80s at university and doing commercial contract work. I remember OSI existing but at the places I worked it was treated as less common.


OSI is very widely used; most of the large ISPs use it. End consumers are just unaware of that fact. See https://en.wikipedia.org/wiki/IS-IS


I think this is a stretch, like saying the X.500 Directory System is widely used because PKIX technically adapts X.509 and thus your TLS certificates depend on the X.500 directory system. End users aren't just "unaware" that it's actually the X.500 system; it functionally isn't the X.500 system. PKIX mandates an "alternate" scheme for the Internet, and the directory called for by X.500 has never actually existed.

Likewise, IS-IS is the protocol that OSI standardized, but we're not using it as part of an OSI system.


X.500 is widely used in the form of LDAP and Active Directory, however.


Active Directory is not based on X.500 and LDAP was directly created as an alternative to the DAP standard that is part of X.500.

While X.500 is a precursor to both of these things, and influenced both of these things, and both of these things interoperated with X.500, they are not X.500. X.500 is for all intents and purposes pretty much dead in 2024, although I did deploy an X.500 based directory service in 2012 and it's probably still alive and running.


Ironically there's probably more X.500 remnants in Active Directory than in the UMich-derived LDAP servers, thanks to its Exchange 5.5 legacy.


LDAP is just a "lightweight" protocol for accessing an X.500 directory service.

The semantics of the accessed store are the same, IIRC. The protocol definition explicitly talks about being used to access X.500 data stores as a complementary interface to DAP (which requires the lower layers of the OSI stack, vs. LDAP's raw stream of bytes).


Yes, that is why it was invented, as I alluded to. LDAP was an alternative to DAP. DAP is part of the X.500 standard; LDAP is not. LDAP, when it was first invented, was built to access an X.500 directory. That is no longer a base requirement, and most directories in the wild are not built on X.500. Standards mean things: just because LDAP was originally built to access X.500 doesn't mean it's part of X.500, and X.500 is a very detailed and specific standard which most directory services no longer follow.


To reiterate in a slightly different form: LDAP is an inheritor of, and successor to, DAP/X.500, but not backwards compatible and only superficially resembling it at this point.


I would call it a stretch to say it was widely used by ISPs. Some old ones may still be using Integrated IS-IS as their IGP (early OSPF had scaling issues and complicated workarounds for them), but that's nothing like widely using the ISO stack. They might have used IS-IS to route NSAPs in their network at some point in time to manage ATM-era equipment; the ISP I worked at still had some ctunnels for that purpose, though I doubt they still have them.


It was still widely used for IPv4 and v6 in the 2000s. An ex-telco engineer told us that the two major reasons were that IS-IS was simply more efficient, and that it didn't require IP communication between routers - meaning troubleshooting and management were easier than provisioning p2p routes for IP-based protocols.


It's still widely used among old ISPs - why switch when it's just as good or better than the alternative and your engineers know it best? But it's not the whole OSI stack, as the first reply implied.

It was popular primarily because OSPF didn't scale well with hundreds to thousands of routers with the minimal CPU power even large core routers had.


Scalability was one part. Later, the fact that you didn't need a separate routing setup for v6 was also part of it.

But for a telco, the fact that you didn't have to set up IP connectivity for routers to see each other was also a crucial ability.


OSPF has supported unnumbered links for a long time; you don't need IP on the link with the point-to-point and multicast network types.


Unnumbered just means you have no IP address exclusive to the link. IS-IS works without establishing IP connectivity at all; it's not just a trick to handle leaky internals.

This also means autoconfiguration and supporting non-IP protocols (like routed Ethernet) are simpler, though I will admit that OSPFv3's default use of link-local v6 for router interfaces made the gap smaller.


still good reasons to choose IS-IS for greenfield :) hell, they took IS-IS and did TRILL with it for DC (rip TRILL).

but yeah, I saw one piece of SONET crap hanging off a ctunnel that spoke CLNP back in maybe 2012 or 2013? haven't seen much of it, but still learned the ctunnel stuff back in '09 because it still could occasionally rear its head.


I worked for an old ISP in 2012 (it started when the state-owned telco was privatized). We also had the ctunnel to manage old SONET crap, still had modem pools running, racks full of purple Sun machines still running, VAX, HP-UX, AS/400, etc. Even the old SMSC that was 5 racks of DEC servers was still there; it was never removed, as they didn't need the space because modern equipment was using all the available power anyway (it was also in a mountain, so a bit of a hassle to remove).


I wish layer 2 and layer 3 were 'refactored' to force all links to be point to point, which they effectively are in the modern world. When was the last time you saw Ethernet frame collisions because you used a hub instead of a switch?

We'd get rid of the idea of a broadcast domain. We'd get rid of MAC addresses and ARP. Switches and routers would become the same device. We'd just use IP addresses for routing, and the 'next hop' would always be the opposite end of the link you sent a packet over.

The world would be a simpler place, and no functionality would have been lost.


All wifi is a giant collision domain. Also, each segment of a wired network is a collision domain.

What you are describing is more in line with MPLS or Infiniband.

I agree with your frustration. I prefer to design networks that start routing right at the access port, or even using an agent, virtual network port, or VPN endpoint at the client or application (like QUIC), but that is very expensive from a resource standpoint.

IPv6 is also another way to get closer to what you are describing.

In my perfect world, we'd move to something like a mashup of MPLS and HIP (https://en.wikipedia.org/wiki/Host_Identity_Protocol)

If you want to study something more "routed" and more point to point, look at private mobile networks (5G).

What we don't want is more layers of abstraction... that's making everything slow, brittle, and impossible to troubleshoot.


WiFi uses a different protocol than classic Ethernet, with "Collision Avoidance" instead of "Collision Detection". The reason is that one WiFi station cannot know what sources of radio interference exist at the other stations of a network, because it may hear only a part of them at its location.

So all remnants of the original Ethernet could be removed from wired Ethernet, which does not need layer 2 protocols, while keeping adequate layer 2 protocols for wireless communications. Besides WiFi, there are also long-range point-to-point wireless links, where directive antennas are used at both ends. For these, there is no difference from wired links, so they do not need layer 2 protocols.


>Also, each segment of a wired network is a collision domain

huh? where "segment" means where you are using a hub not a switch? cuz that was a long time ago


If the links are operating at only 10Mbps or 100Mbps, it's possible for them to operate in half-duplex mode (e.g., if the cables don't have all 8 pins wired properly), even with a switch. In this mode, there's a collision domain between the host device and the switch.


> When was the last time you saw ethernet frame collisions because you used a hub not a switch?

10base-T1S is just beginning to ramp up in the automotive industry, which modifies the super-successful 100base-T1 to be cheaper by (a) allowing cheaper PHYs; (b) allowing cheaper endpoints due to the lower data rate to handle; (c) allowing lower-spec single twisted pair wiring; and ... (d) allowing multi-drop. This is intended to allow Ethernet to push down into the space that CAN-FD is currently occupying, and looks likely to succeed, at least in some niches.


> 10base-T1S

I think that standard is a huge mistake... 10 Mbit/s isn't enough for a modern vehicle (no cameras, radars, screens, etc). Many sensors alone can push megabits, and in the modern world engineers want to send their data JSON-formatted, not as bitfields.

Instead they should have used a CDMA-like design, with the PHY being a 2-cent microcontroller for things like bulbs and micro switches. Then, for things like cameras which require more megabits, use a 30-cent microcontroller with a higher chip rate, all transmitting on the same bus and using code division to avoid needing to worry about scheduling.


> I think that standard is a huge mistake...

You have to start somewhere. They're going for 25 Gbps:

> In addition to the more computer-oriented two and four-pair variants, the 10BASE-T1,[20] 100BASE-T1[21] and 1000BASE-T1[22] single-pair Ethernet (SPE) physical layers are intended for industrial and automotive applications[23] or as optional data channels in other interconnect applications.[24] The distances that single pair operates at full duplex depends on the speed: 1000m (1km) with 802.3cg-2019 10BASE-T1L; 15 m or 49 ft with 100BASE-T1 (link segment type A); up to 40 m or 130 ft using 1000BASE-T1 link segment type B with up to four in-line connectors. Both physical layers require a balanced twisted pair with an impedance of 100 Ω. The cable must be capable of transmitting 600 MHz for 1000BASE-T1 and 66 MHz for 100BASE-T1. 2.5 Gb/s, 5 Gb/s, and 10 Gb/s over a 15 m single pair is standardized in 802.3ch-2020.[25] In June 2023, 802.3cy added 25 Gb/s speeds at lengths up to 11 m.[26]

* https://en.wikipedia.org/wiki/Ethernet_over_twisted_pair#Sin...


The 100base-T1 and 1000base-T1 standards already exist, and are already widely adopted. 10base-T1S is intended to replace CAN, which is sub-10 Mb/s, and often ~2 Mb/s; this is a niche where cost comes first. For non-multidrop links, intermixing 1000, 100, and 10 Mbit/s ethernet links on the same switch is trivial, so each link can be independently chosen and cost-minimized; and for sensors and high bandwidth items with safety impact, multidrop is generally not the preferred approach anyway. Basically -- 10base-T1S fits a new niche (for Ethernet), where the niches you mention are already well addressed.


So it's like RS-485, but more complicated?


I think this is a terrible idea. Daisy chain multi drop is so last century. Switched point to point networks are so much better!


On paper, yes absolutely.

In reality, the wiring harness is one of the more expensive and complex components in a modern vehicle. The majority of the data being carried is low speed and low risk signalling: climate system controls, entertainment system, lighting control, etc. Obviously braking, steering, throttle controls and things like that are a different class.

Look at a modern luxury vehicle and how many things are in one door alone. Accent lights, window controls and motors, locks and lock controls, speakers (yeah, often plural), side mirrors and controls, side-looking cameras, etc. The wiring harness into a door needs consolidation or else it becomes a giant heavy thing; the multi-drop approach makes a lot of sense here.


A really interesting article covering this: “The world in which IPv6 was a good design” https://apenwarr.ca/log/20170810

It talks about how when IPv6 was being designed, they wanted to do exactly that: drop most of the layer 2 stuff, abandon the idea of a bus network, make everything point-to-point, all switches would be L3 routers, etc.

Search for “What if instead the world worked like this?” for the relevant part.

My question though, is how would IP assignment work for each of the intermediary devices between me and (say) my ISP’s gateway? My computer is plugged into a switch right now, which is plugged into another switch, which is plugged into my router, which has a point-to-point link to the ISP gateway. Would my router get a /64, then delegate a /68 to the next “router” (ie. The physical thing I currently call a switch), which would delegate a /72 to the next one, etc? How would it determine the optimal IP allocation? What if there’s a cycle? Aren’t we sorta reinventing spanning tree at this point? (I’m genuinely curious about this, because I don’t really grok all of the implications of an “everything is L3” world like this.)


For the v6-specific world, scoped addresses and scoped multicast are explicitly for that purpose. You do not need to hierarchically subnet each following router; you just need to be able to express the "next hop" for the subnets you need to route towards.

You use link-local autoconfiguration, and use appropriately-scoped multicast addresses to ask "all-nodes" or "all-routers", making autoconfiguration a breeze compared to the v4 world. In the v4 world a similar setup is also possible, though the specific details differ, and you have to set up addresses manually for each p2p link.
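
A minimal sketch of the scoped-multicast part, assuming Linux-ish Python with a placeholder interface name and port (real router/neighbor discovery is ICMPv6; UDP here just keeps the example short):

  import socket

  ifindex = socket.if_nametoindex("eth0")   # placeholder interface
  s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
  s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_MULTICAST_IF, ifindex)
  # ff02::2 is the link-local all-routers group; the scope id in the address
  # tuple pins the packet to this one link, no global addressing required.
  s.sendto(b"anyone routing out there?", ("ff02::2", 9999, 0, ifindex))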


I mean sure, you’d definitely use a scoped address to talk to the next hop, but it still doesn’t solve how the router/switch knows which port to send the packet to for its next destination.

Say I have a global unicast address on my desktop, 2 hops from my router, and I want to allow traffic to it. My router gets a packet sent to it over its link-local address, with a destination header of my desktop’s IP. Say it has 4 ports (each going to another router/switch, each with its own link-local address.)

How would it know which port to use as the next hop? It would need a routing table, and that would need to be configured automatically if we want to work as well as switches do today. What would be the protocol for this auto configuration? BGP or something like it? How do the routers know the available address space? Or are we just stipulating that we’d invent a protocol for this, if it had ever happened?

In ethernet we have the Spanning Tree Protocol for this, to discover the topology of an Ethernet network and know which links to use for which MAC addresses (including the ability to detect cycles.) I feel like something like spanning tree would still need to exist in an all-unicast, no-ethernet, L3-only world. Does such a thing exist already, or would we need to invent it in this counterfactual universe?


STP does not cover filling in MAC forwarding tables; it only tells the switches which ports to shut down so that no loops are created. RSTP and later evolutions just made it faster, as well as capable of creating multiple spanning trees, but the core logic is still based on the "oh fuck" realisation of a corporation that sold a non-routable Ethernet protocol and thus had to figure out how to deal with larger L2 domains than designed for (while keeping the logic simple so switches wouldn't get too expensive).

For IP, OSI, and routed Ethernet alike, you run a complete routing protocol (IS-IS actually handles OSI, IP, and routed Ethernet), where routers send "hello" packets to announce themselves on the links and establish communication. This is how IP routing has worked since some of the oldest routing setups. With IP-bound protocols like OSPFv2/v3 or BGP you have to configure connections between the routers and tell them to peer with the router at a given IP; IS-IS has its own independent low-level mechanism, so you connect to a link and tell it which router IDs to trust/peer with.


Interesting.

I guess since IPv6 solves the address allocation problem natively (with SLAAC), it's not allocation you need to worry about so much as "what IPs does this router see vs that router, recursively", which it sounds like IS-IS can provide.

This would mean each router box in my network would grab its own random (SLAAC) address within the /64 advertised by my “main” router (found via multicast all-routers), and would then each (through IS-IS or something similar) announce and forward individual /128 routes for each host plugged into them. The network would converge such that each individual box would have a complete routing table of what next-hop to use for each IP address in the tree.

I can see this being a preferable setup if it allows you to completely eliminate layer 2 from the equation… alas this never happened so it’s all a thought exercise.


> Would my router get a /64, then delegate a /68 to the next “router” (ie. The physical thing I currently call a switch)

This is another weird thing about networking. As far as I've been able to learn, a "router" is a device with two ports that handles transmission of data between those ports, whereas a "switch" is a device with more than two ports that handles transmission of data between those ports.

But nobody would ever care about that distinction.


This is wrong and networking people care very much.

(Wildly oversimplifying here, there are always exception, YMMV, no warranty expressed or implied, may cause blurred vision or a rash)

A switch is a device that handles things like "I need to get a packet from my desktop PC to the printer down the hall". It has lots of ports because, usually, there are a lot of things local to you that you might want to talk to, and you want that traffic as fast as possible without the expense and overhead of 'routing'. If those things aren't on the same L3 network, a switch won't be able to get there[1].

A router is something that maintains a table that maps non-local (not the same L2 network, collision domain, VLAN, whatever) destinations to a 'next hop' based on various metrics[2]. In the general case, routers are concerned with questions like "I'm in Atlanta and I need to get this packet to Tokyo so is the best way to send it via my connection to Comcast or Level-3?".

The degenerate case for routers is a two-port box that does nothing but move packets not destined for something in your local network to your ISP or other upstream network for forwarding to a non-local destination. Since that's the use case most folks see, it's easy to misunderstand the bigger pix. Much is done via various kinds of virtual interfaces now, but I have in my career worked on routers with hundreds of physical ports.

[1] Yes, some switches have a router in them. Stop overthinking. [2] Yes, it's more complicated than that. Stop overthinking.


Too expensive to do in an ASIC. There is a reason the MAC table is bigger than the routing table on a chip: it is cheaper. Think of an ASIC as a box that can be divided up into smaller boxes that act as an index. The total number of boxes is limited by the size of the ASIC. The bigger the chip, the more boxes and the greater the cost. To do MAC forwarding takes 2 boxes. To do routing takes 5. To do an ACL match takes 14. This is the reason OpenFlow never really worked on switches at scale. What you are asking for is something similar to MPLS, and that is an expensive feature die-size wise. I have highly oversimplified this post, but it is mostly correct at a 1000 ft level.


If you force all links to be conceptually point to point, you probably make it harder to do some things. Already 1G and higher force full duplex, and 100base-TX full duplex is very common. I've still got a couple 10baseT half duplex devices though.

I have redundant internet/NAT routers at home (overkill!), and they communicate amongst each other to decide which is active and which isn't, but either way, the active one ARPs for the router address with 00-00-5E-00-01-01 as the MAC address. The rest of the network just sends off-network packets to 00-00-5E-00-01-01, and failover happens because switches figure out which port is currently using that address.

I share a different mac address for the upstream connection, which is PPPoE (sadly), but same deal --- when failover happens, the new computer starts using the address and everything figures it out, because stations are allowed to move to different ports by design.


You can pretty much replicate 1:1 what you describe in the redundancy case with IP; just replace "relearn which port the MAC address associated with that IP is on" with "relearn which port the next hop for that IP is on".

Things tend to get a little messier than people expect in figuring out the "what values do I use for the point to point links and how do they get assigned" step of things, though there are some clever answers there too.


That sounds a lot like ATM, where you called a machine and received a point-to-point pipe. Though you had to call first, unlike Ethernet where you fling packets into the ether at will. ATM over SONET is used heavily in telco but is on its way out in favor of OTN and Ethernet.


Actually, I know of one large system that heavily relies on having racks and racks of servers all located in the same broadcast domain. It makes the networking a bit more complicated, but in turn the software is a lot less complicated. It's a decent trade-off.


One thing that isn't mentioned is that the physical layer at the time was 'flat' ie: a network had a shared wire. That means bus arbitration (to prevent collisions) was a big deal. Token ring solved that by passing tokens, which presumably guarantees latency. I believe Ethernet just raised a line high, and it was up to everyone to respect that.

Of course that changed when switches came out. I have a 10/100 hub in a closet somewhere for debugging, since it's nice to not have to remember how to get into a switch and set the monitor port.

Token ring equivalents are still used in lots of places. From what I remember cable modem data is basically token ring off of channel 0 (though that may not be accurate anymore).


> I believe Ethernet just raised a line high, and it was up to everyone to respect that.

It's actually much simpler than that. When you transmit, you also listen. If what you hear is not what you sent, there is a collision, and you back off.


To be specific, it's even more basic than that. For 10Base-5 and all the coax ethernets, it was "if there's more energy on the wire than you are transmitting, a collision is present".


Exactly. CSMA/CD (carrier sense multiple access with collision detection)
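
A toy sketch of that listen-while-you-transmit loop with the classic truncated binary exponential backoff; the "medium" object here is hypothetical, not any real driver API:

  import random, time

  SLOT_TIME = 51.2e-6   # 512 bit times at 10 Mbit/s

  def send_with_csma_cd(medium, frame, max_attempts=16):
      for attempt in range(max_attempts):
          while medium.carrier_sense():           # wait until the wire is quiet
              pass
          if medium.transmit_and_listen(frame):   # heard exactly what we sent?
              return True
          medium.send_jam()                       # collision: jam, then back off
          k = min(attempt + 1, 10)
          time.sleep(random.randint(0, 2 ** k - 1) * SLOT_TIME)
      return False                                # excessive collisions, give up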


I wish we could have another and bump the packet size.

We're at the point where we can have millions of packets per second going through a network interface, and it starts to get very silly.

It's at the point where even a 10G connection requires some thought to actually perform properly. I've managed to get bottlenecked on high end hardware requiring a whole detour into SR-IOV just to get back to decent speeds.


> I wish we could have another and bump the packet size.

The clock precision (100s of ppm) of the NIC oscillators on either side of a network connection gives a physical upper limit on the Ethernet packet size. The space between the packets lets the slower side "catch up". See https://en.wikipedia.org/wiki/Interpacket_gap for more info.

We could use more precise oscillators to allow longer packets, but at a higher cost.
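
Back-of-the-envelope, using the parent's numbers (assuming a worst case of 100 ppm fast vs 100 ppm slow, and the standard 96-bit gap):

  ppm_mismatch = 200e-6     # 100 ppm fast sender vs 100 ppm slow receiver
  ipg_bits = 96             # standard interpacket gap

  drift = 1500 * 8 * ppm_mismatch
  print(f"drift over a 1500-byte frame: {drift:.1f} bit times")              # ~2.4

  max_frame_bits = ipg_bits / ppm_mismatch
  print(f"frame that would eat the whole gap: {max_frame_bits/8:.0f} bytes") # ~60000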


You don't need that as much with modern protocols. The point of 8b/10b or 64b/66b is that it guarantees enough edges for receivers to be self-clocking, with the incoming bits more or less thrown directly into a PLL.


That's a separate concern.

The previously mentioned issue is that to never buffer packets in a reclocking repeater on a link, you _need_ the incoming packet rate to never be higher than the rate at which you can send them back out, or else you'd fill up/buffer.

If your repeaters are actually switches, this manifests as whether you occasionally drop packets on a full link with uncongested switching fabric. Think two switches with a 10G port and 8 1G ports each used to compress 8 long cables into one (say, via vlan-tagging based on which of the 8 ports).


Realistically I think we would be fine to make packet size significantly larger than Ethernet would currently allow if we really wanted. E.g. Infiniband already has 1x lane speeds of 200 Gbps without relying on any interpacket gap for clock sync at all. Ethernet, on the other hand, has been consistently increasing speed while decreasing the number of bits used for the interpacket gap since it's less and less relevant for clocking. Put a few bytes back and you could probably do enormous sizes.


I don't get how that limits the packet size. If a sender's clock is 500 ppm faster than an intermediate node's, you need 500 ppm of slack. That could be short packets with a short gap, or large packets with a large gap.

Ethernet specs the IPG as a fixed number of bits, but it could easily be proportional to the size of the previous packet.


(Intentional) jumbo frames at layer 2 and expanded MTUs at layer 3 are certainly available (as you may know). In fact it seems (I am, it should be obvious, not an expert) that using jumbo frames is more or less the common practice by now. There does in fact seem to have been some standards drama about this, too: I can't find it now, but IIRC in the '00s someone's proposal to extend the header protocols to allow the header to indicate a frame size of over 1500 bytes was rejected, and nothing seems to have been done since. At the moment it seems that the best way to indicate max. Ethernet frame sizes of over 1500 is an optional field in LLDP(!) https://www.juniper.net/documentation/us/en/software/junos/u... and the fall-back is sending successively larger pings and seeing when the network breaks(!) https://docs.oracle.com/cd/E36784_01/html/E36815/gmzds.html .
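
Here's a rough Linux-only sketch of that "probe until it breaks" idea, with the Linux socket option values written out by hand (they may not be exposed by name in Python's socket module) and a placeholder target address; note that without ICMP feedback this mostly reflects the locally known/cached MTU rather than the true end-to-end path:

  import socket

  IP_MTU_DISCOVER = 10   # Linux <linux/in.h> values
  IP_PMTUDISC_DO = 2     # always set DF; oversize sends fail with EMSGSIZE

  def probe_mtu(host, port=33434, lo=1200, hi=9000):
      s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
      s.connect((host, port))
      best = lo
      while lo <= hi:                    # binary search on UDP payload size
          mid = (lo + hi) // 2
          try:
              s.send(b"\x00" * mid)
              best, lo = mid, mid + 1
          except OSError:                # EMSGSIZE: payload + 28 > known MTU
              hi = mid - 1
      return best + 28                   # add 20 (IPv4) + 8 (UDP) header bytes

  print(probe_mtu("192.0.2.1"))          # placeholder address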


The common advice I've heard for jumbo frames is not to enable them unless you can do it for every device on your LAN, and even then it's probably not worthwhile outside specific situations like a separate iSCSI network or such.

I just now ran iperf3 from my Mac to my Synology without jumbo frames:

  [ ID] Interval           Transfer     Bitrate         Retr
  [  7]   0.00-10.00  sec  10.0 GBytes  8.61 Gbits/sec    0             sender
  [  7]   0.00-10.00  sec  10.0 GBytes  8.60 Gbits/sec                  receiver
Given how rarely I actually care to saturate the 10Gbit link, I'd rather use the hypothetically slightly slower default settings that are highly likely to work in all scenarios.


It seems much more effective on plain 1000Base. It's the difference between 850Mb/s and 975Mb/s for me.


That difference is due to your devices being relatively poor at handling high-pps workloads, not really the particular speed the link is running at. You can certainly get more than 850 Mb/s on a 1000BASE-T link with a standard MTU, on the order of ~100 Mb/s more.


That surprises me a little. From the Wikipedia article[0] I'd expected jumbo frames to be only about 5% more efficient.

[0] https://en.wikipedia.org/wiki/Jumbo_frame
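
Rough arithmetic behind that ~5% expectation, counting the fixed per-frame wire overhead (preamble+SFD 8, Ethernet header 14, FCS 4, minimum IPG 12 bytes) plus 40 bytes of IPv4+TCP headers per packet:

  wire_overhead = 8 + 14 + 4 + 12        # bytes of fixed cost per frame
  ip_tcp = 40                            # IPv4 + TCP headers, no options

  def goodput_ratio(mtu):
      return (mtu - ip_tcp) / (mtu + wire_overhead)

  print(f"1500 MTU: {goodput_ratio(1500):.1%}")   # ~94.9%
  print(f"9000 MTU: {goodput_ratio(9000):.1%}")   # ~99.1%

So framing alone buys roughly 4-5%; a bigger observed jump (850 -> 975 Mb/s) points at per-packet CPU/interrupt cost rather than header overhead, as the sibling comment says.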


The use of jumbo frames is NOT normal and is only done in very specific setups, generally something like a storage system that is isolated to its own layer 2 network. At some point your jumbo network has to hit the rest of the network, and packet fragmentation is then done in software by the CPU, which is very expensive and not line rate. The normal outcome is that you break your network.


I'm not certain what my point is, but I wanted to mention that jumbo frames don't work over the Internet. More of a LAN thing.


My local internet exchange has a 1500 vlan and a 9000 vlan. My understanding is there are many fewer peers on the 9000 vlan, but it's not zero.

If you want to use jumbo packets on the internet at large, you need to have working path MTU detection, which realistically means at least probing, but you really should have that at 1500 too, because there's still plenty of broken networks out there. My guess is you won't have many connections with an effective mtu above 1500, but you might have some.


Separate peering VLANs for those using 1500 byte peering and 9000 byte peering about sums up how much of a PITA it is to mix things and expect PMTUD to work.

I'd be willing to bet my lunch that 10x more places have been moving down to assuming 1280-byte connections (since IPv6 guarantees it) than have been peering on the internet at >1500 (not counting 1504 for VLAN tags and whathaveyou).


MTUs are one of the eternal gremlins of networking, and any choice of MTU will almost certainly be either too large for the present day or too small for the future. 1500 was chosen back when computers ran at dozens of megahertz and it was actually kind of large at the time.

Changing the MTU is awful because parameters like MTU get baked into hardware in the form of buffer sizes limited by actual RAM limits. Like everything else on a network once the network is deployed changing it is very hard because you have to change everything along the path. Networks are limited by the lowest common denominator.

This kind of thing is one of the downsides of packet switching networks like IP. The OSI folks envisioned a network that presented a higher level interface where you'd open channels or send messages and the underlying network would handle all the details for you. This would be more like the classic analog phone network where you make a phone call and a channel is opened and all details are invisible.

It's very loosely analogous to CISC vs RISC where the OSI approach is more akin to CISC. In networking RISC won out for numerous reasons, but its simplicity causes a lot of deep platform details to leak into upper application layers. High-level applications should arguably not even have to think about things like MTU, but they do.

When higher level applications have to think about things like NAT and stateful firewall traversal, IPv4 vs IPv6, port remapping, etc. is where it gets very ugly.

The downside of the OSI approach is that innovation would require the cooperation of telecoms. Every type of connection, etc., would be a product offered by the underlying network. It would also give telecoms a ton of power to nickel and dime, censor, conduct surveillance, etc. and would make anonymity and privacy very hard. It would be a much more managed Internet as opposed to the packet switching Wild West that we got.


Most high level applications do not deal with any of the low level details. They open a channel by specifying a hostname and a port number, and are given a reliable bidirectional byte stream from the network layer.

As far as most applications are concerned, the hostname is just a string that is interpreted by the network layer, be that through a DNS lookup or by parsing it as an address native to the underlying network protocol.

A minority of applications get fancy and request a datagram-oriented link, which the network layer also provides (with an admittedly small limit of 65 KB that leaks up from below).

Few applications ever go deeper than that.


Maximum packet size is already configurable on most NICs. 9000 is a typical non-default limit. If you increase the limit, you must do so on all devices on the network. https://en.wikipedia.org/wiki/Jumbo_frame
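
If you want a quick sanity check that every Linux box on the segment actually agrees, the configured link MTU is readable from sysfs (interface name is a placeholder):

  def link_mtu(ifname="eth0"):
      with open(f"/sys/class/net/{ifname}/mtu") as f:
          return int(f.read())

  print(link_mtu())   # 1500 by default, 9000 with jumbo frames enabled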


I’ve seen Cisco gear that supports 8192… that was fun to figure out with a separate network team :) “yup, we’ve enabled jumbo frames!”


> I wish we could have another and bump the packet size.

That's why I'm in full support of a world ending apocalypse that allows society to restart from scratch. We've made so many bad decisions this time around, with packet sizes being some of the worst.


Maybe we can then also redefine pi as 2*pi, while we're at it.


Or just use tau and call it "tau"


What a throwback! I remember when the Tau Manifesto came out: https://tauday.com/tau-manifesto


As long as we also make sure electrons are positively charged this time.


Then I'm going to be really confused about positrons!


Oh, but have you heard the news about negatrons?


I mean, larger packets (and working path MTU detection) could be useful, but with large (1500-byte) packets and reasonable hardware, I never had trouble pushing 10G from the network side. Add TLS and some other processing, and older hardware wouldn't keep up, but not because of packetization. Small packets are also a different story.

All my hardware at the time was Xeon 2690, v1-4. NICs were Intel X520/X540 or similar (whatever SuperMicro was using back then). IIRC, v1 could do 10G easily without TLS, 8-9G with TLS; v3 improved AES acceleration and we could push 2x10G. When I turned off NIC packetization acceleration, I didn't notice much change in CPU or throughput, but if packetization were a bottleneck the change should have been significant.

At home, with a similar-age desktop processor, a dual-core Pentium G3470 (Haswell, same gen as a 2690v3), I can't quite hit 10G in iperf, but it's close-ish; another two cores would probably do it.

In some cases, you can get some big gains in efficiency by lining up the user space cpu with the kernel cpu that handles the rx/tx queues that the NIC hashes the connection to, though.


I discovered that putting a 10G interface into a bridge implies a very significant slowdown. Linux has to do stuff on the CPU to do the bridging, so that turns off a large part of the card's acceleration.

That's not a good thing for a server that runs a bunch of VMs.

Fortunately SR-IOV exists, but it seems a tad silly to me that I have to do all this weird PCIe passthrough stuff just for this. It's nice, don't get me wrong, but a bit too exotic for what should be a simple setup.


I always found it ironic that virtualization made me have to care about the hardware more than I ever had to before.

2013 me: slap the app on a Dell. It'll be fine.

2017 me: aw crap, the NIC doesn't support SR-IOV. What do you mean, I need a special driver? Oh lordy, I'm pinning a whole damn CPU just so DPDK can pull packets off the wire?


Oh yeah, bridged mode on my little pentium system brings perf way down. It was fine on 1G, but when I upgraded to 10G and wanted to hit numbers, I needed to stop doing software bridging. For me, I have slots and NICs, so I moved away from virtual ethernet on a bridge to actual ports; main host gets the 10G, and everything else gets to use 1G ports.

No SR-IOV on my board, but that's ok.


The story is a bit like the xkcd classic:

https://xkcd.com/927/

Looks like it took some years for one standard to prevail. Also, TCP/IP was not the clear winner in the early days.



