K8s Service Meshes: The Bill Comes Due (matduggan.com)
233 points by zdw 8 months ago | 239 comments



I never understood the appeal of service meshes. Half of their reason to exist is covered by vanilla Kubernetes; the rest is inter-node VPN (e.g. WireGuard) and tracing (Cilium Hubble). Unless I’m missing something, encrypting intra-node traffic is pretty silly.

K8S has service routing rules, network policies, access policies, and can be extended up the wazoo with whatever CNI you choose.

It’s similar to Helm, in that Helm puts a DSL (values.yaml) on top of a DSL (Go templates) on top of a DSL (k8s YAML); here it is routing, authentication, and encryption on top of... well, routing (service route keys), authentication (netpols), and encryption.

It boggles the mind!
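
To make the "vanilla Kubernetes" point concrete, here is a minimal sketch of a NetworkPolicy restricting which pods may talk to an API workload (all names are hypothetical, and enforcement depends on a CNI that implements NetworkPolicy, e.g. Cilium or Calico):

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-frontend-to-api    # hypothetical policy name
      namespace: my-app              # hypothetical namespace
    spec:
      podSelector:
        matchLabels:
          app: api                   # the policy applies to the api pods
      policyTypes:
        - Ingress
      ingress:
        - from:
            - podSelector:
                matchLabels:
                  app: frontend      # only frontend pods may connect
          ports:
            - protocol: TCP
              port: 8080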


I've worked on several k8s clusters professionally but only a few that used a service mesh, Istio mainly. I'll give you the promise first, then the reality.

The promise is that all of the inter-app communication channels are fully instrumented for you. Four things mainly: 1) mTLS between the pods, 2) network resilience machinery (rate limiting, timeouts, retries, circuit breakers), 3) fine-grained traffic routing/splitting/shifting, and 4) telemetry with a huge ecosystem of integrated visualization apps.
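
To make points 2 and 3 concrete, this is roughly what the mesh lets you declare without touching application code; a minimal sketch assuming Istio's VirtualService API, a hypothetical service named "reviews", and v1/v2 subsets already defined in a DestinationRule:

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: reviews                  # hypothetical service
    spec:
      hosts:
        - reviews
      http:
        - route:
            - destination:
                host: reviews
                subset: v1           # subsets assumed to exist in a DestinationRule
              weight: 90
            - destination:
                host: reviews
                subset: v2
              weight: 10             # 10% canary traffic shift
          retries:
            attempts: 3
            perTryTimeout: 2s        # per-attempt timeout
          timeout: 10s               # overall request timeout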

Arguably, in any reasonably large application, you're going to need all of these eventually. The core idea behind the service mesh is that you don't need to implement any of this yourself. And you certainly don't want to duplicate all of this in each of your dozens of microservices! The service mesh can do all the non-differentiated work. Your services can focus on their core competency. Nice story, right?

In reality, it's a little different. Istio is a resource hog (I've evaluated Linkerd, which is slightly less heavyweight, but still). Rule of thumb: for every node with 8 CPUs, expect your service mesh to consume at least a CPU. If you're using smaller nodes on smaller clusters, the overhead is absurd. After setting up your k8s cluster + service mesh, you might not have room for your app.

Second, as you mention, k8s has evolved. And much of this can be done, or even done better, in k8s directly. Or by using a thinner proxy layer to only do a handful of service-mesh-like tasks.

Third, do you really need all that? Like I said, eventually you probably do if you get huge. But a service mesh seems like buying a gigantic family bus just in case you happen to have a few dozen kids.


One major usage of services meshes that I’ve come across is for the transparent L7 ALB. gRPC, which is now very common, uses long-running connections to multiplex messages. This breaks load-balancing because new gRPC calls, within a single connection, will not be distributed automatically to new endpoints. There is the option of DNS polling, but DNS caching can interfere. So, the L7 service mesh proxy is used to load balance the gRPC calls without modification of services.

https://learn.microsoft.com/en-us/aspnet/core/grpc/loadbalan...
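
For comparison, the Kubernetes-native half-measure alluded to above is a headless Service, which makes DNS return individual pod IPs so a gRPC client can balance across them itself; a minimal sketch with hypothetical names (it still leaves re-resolution and balancing to the client, which is exactly the part the L7 proxy takes over):

    apiVersion: v1
    kind: Service
    metadata:
      name: grpc-backend         # hypothetical service name
    spec:
      clusterIP: None            # headless: DNS returns pod IPs instead of one virtual IP
      selector:
        app: grpc-backend
      ports:
        - name: grpc
          port: 50051
          targetPort: 50051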


Look, back in the day, things weren't encrypted, so you could listen in on your neighbor's phone calls, read their email, hack their bank accounts. Wireshark and etherdump and the most fun of all, driftnet. So, since then, everything has to be encrypted, lest someone hack their way to the family jewels. Never mind that the number of breaks needed to get there means there are usually bigger fish to fry. The important thing is to sprinkle magic encryption dust on everything because then we know it's Very Secure. (That's not to deride the fact that encryption is important, because it is, but sometimes it goes a bit far when there are other gaping holes that should be patched first.)


Usually, unless someone is really doing naive things, you will need access to a lot of almost-physical things to sniff traffic. You almost need physical access to the room where either the server or the client is, even with unencrypted traffic. People say, "but they can sniff it at Level3"; they sure can, IF they have actual access to Level3 at a higher level than just using them for normal traffic: a hacked switch or router or so. Probably state actors can and do pull that off, but outside of that, it's really not so easy to get to the unencrypted traffic of just a random target. You should still encrypt things when you can, of course, but you don't have to get quite that paranoid about it.

All major hacks are 0-days (well, an un-updated WordPress install is not necessarily a 0-day; a lot of 0-days are exploited months or years later), stolen credentials (usually via social engineering), brute-force password attacks, or applications that are left open (root/root for MySQL with 3306 open to the world). Those have nothing to do with (un)encrypted traffic.


If you have the ability to execute code on a CPU, and that CPU is connected to a bus, and that bus is connected to a network card, you can sniff traffic. If your data and business processes include at least one entity A that lacks absolute trust in at least one other entity B in your cluster, then A's traffic being visible to B is bad.


Yes, but if you know that I run unencrypted traffic on my network and if I tell you that, you still won't be able to get to any of that if you cannot get into our network. Even if I tell you that I host at provider X and the traffic is unencrypted until it hits our webserver, you still won't be able to sniff any of it without getting very intimate with someone who has deeper access. Just hiring a machine at the same provider and putting the card in promiscuous mode is not going to get you anything from us.


It's not just a specific actor targeting a specific entity though; it's any malicious dependency being run in a privileged environment.


Yes, that's true. But then you might have bigger issues I would say. But agreed. It's a good reason to make sure it's all closed off.


Look at the default capabilities below; as a poster above mentioned, NET_RAW and MKNOD are enabled by default.

https://docs.docker.com/engine/reference/run/#runtime-privil...

Unless you perfectly drop all privileges from every pod you are open to attack.

Containers are not security contexts; they are namespaces that require all actors that can launch a VM to actively drop privileges.

This is an intentional design decision and not a bug.
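
For anyone who wants to close this off without a mesh, a minimal sketch of dropping the default capability set per container (image and names are hypothetical):

    apiVersion: v1
    kind: Pod
    metadata:
      name: hardened-app                       # hypothetical
    spec:
      containers:
        - name: app
          image: registry.example.com/app:1.0  # hypothetical image
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL                          # removes NET_RAW, MKNOD, and the rest of the defaults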


Service Meshes are something necessary for a small portion of Fortune 500s which have 1000s of microservices. Sure you could use load balancers but it becomes cost efficient to move towards a client-side load balancer.

If you aren't a Google, Apple, Microsoft, etc. scale company, then a service mesh might be a tad overkill


You're close, but it's really when you have thousands of microservices using either shitty languages or shitty client RPC libraries where you can't easily perform client-side load balancing.

There are plenty of languages and RPC frameworks where you can solve this without resorting to a service mesh.

Practically, and to your point, service meshes solve an organizational problem, not a technical one.


I don't get this either. Doesn't the mesh become a scalability bottleneck just like load balancers?

On that scale I'd expect people to use client-selected replicated services (like SMTP), and never something that centralizes connections (no matter how close to the services it sits).

You can always add observability at the endpoints. Unless your infrastructure is very unusual (like some random part of it costing millions of times more for no good reason, as on the cloud), this is not a big challenge; you add it to the frameworks your people use. (I mean, people don't go and design an entire service with whatever random components they pick, or do they?)


With Istio (envoy) you run a "sidecar" container in your pods which handles the "mesh" traffic, so it scales with the number of instances of your pods.


Oh, thanks. That does solve the issue.


So like a DNS SRV record with multiple entries. Or Anycast, if you're being fancy


Or IPVS... wait that's built into Kubernetes, it's kube-proxy.


kube-proxy operates at L3/L4 while service meshes generally operate at L7, so they can load balance on a per-HTTP-request basis. That's particularly useful for the long-lived connections commonly used by gRPC and others.


Right? Like, I really don't understand the problem that service mesh solves that isn't already solved by more standardized technologies


Isn't kube-proxy already a client-side load-balancer?


I agree that intra node encryption, if implemented by sidecars, is just wasting CPU cycles.

Small note: unless it has changed recently, containerd's default capabilities list includes CAP_NET_RAW, so hostNetwork=true pods can sniff all traffic.


I like that istio does mtls. It also helps with monitoring the requests.


I actually never understood the appeal of Kubernetes in the first place. I have production apps running on bare bones VMs serving millions of customers. Is this sort of complexity really necessary? At this point I would just consider serverless options. Sure, they would be a little more expensive, but that's a huge savings if we account for engineering teams' time.


Counter-take: I never understood the appeal of virtualizing the hardware. Is that complexity really necessary?

Of course there's tradeoffs, but I think it's a specific perspective that says that Kubernetes is any more complex than virtualizing the hardware and scheduling multiple VMs across real hardware.


I personally am of the opinion that you only really need one of these. Containers don't really need VMs and vice versa to get all of the benefits of abstracting away the fact that your hardware might break. It's just a choice of which abstraction you prefer to operate at (of course, clouds will put your containers in VMs anyway because they need that abstraction). It sort of sounds like the parent prefers the VM abstraction, and you prefer the container abstraction.


100% agree that containerising on VMs is ridiculous.


You didn't respond with any benefits of k8s though. What's the true value-add? Surely, coding an entire infrastructure in YAML is not it because that's horrific for anyone who wrote actually working software (one that didn't need 20+ commits of "try again" for a single feature to start working anyway).


I find it hard to believe that you've considered this for more than two seconds and can't think of a single reason why k8s might be a good fit for someone's requirements. But here's one:

It's a curated, extensible API that provides a decent abstraction over a heterogeneous collection of hardware. Nobody's done that before, and it's extraordinarily useful for being able to define intent.

No-one forces you to use YAML, it's just a serialisation format.

Once you have a bunch of components that implement this API, it becomes trivial to deploy pretty much any level of complexity of containerised application, without having to care too much about the actual exact details of how scheduling, networking, storage etc. is implemented. Even better, I can hit two different clusters configured in two completely different ways with the same manifest and get roughly the same result.

It's the abstraction.


> I find it hard to believe that you've considered this for more than two seconds and can't think of a single reason why k8s might be a good fit for someone's requirements.

That could also tell you that you're in a bubble of converts and have forgotten what it's like to live without k8s. ;) There are always at least two sides of the coin.

> No-one forces you to use YAML, it's just a serialisation format.

Really? So when do I get to describe my singular app that needs Postgres and Kafka in 7-10 lines in a .txt file and not {checks our company's ArgoCD backend repo} ... 8 YAML files? I ain't got all day or a week, where is that? Nice declarative MINIMAL syntax with all the BS inferred for me (like names and stuff, why should I think of 5-10 names of stuff? Figure it out, automated tool!) that only concentrates on what is being deployed. It can generate everything else just fine, or at least it should, in a more sane world anyway.

Excuse the slightly combative tone, I ain't trolling you here but you also come across as a bit blind about things outside the k8s holy land.

> Once you have a bunch of components that implement this API, it becomes trivial to deploy pretty much any level of complexity

Nope, nothing ever becomes trivial with k8s. I was on a call with two senior platform engineers and we needed 3-4 hours to describe a single small app that needs Postgres and Kafka and to listen on a single port, while needing a single env var and a few super small config files. These guys provisioned an entire network of 500+ pods working perfectly for years. They made 10+ mistakes while trying to help me deploy this small app on Argo. (And they've done the same dozens of times at this point.)

Do with that info what you will but I'd strongly disagree if your takeaway is "they are not that good" -- because they have done quite a lot very successfully (with and on k8s).

> It's the abstraction.

My 22+ years of programming have taught me that people enjoy abstractions too much and make huge messes with them. I am not convinced that having an "abstraction" at all is even a good selling point anymore.


shrug

You originally said "I don't see how it's useful", and I observed that some people find it useful.

I'm sorry you've had a bad experience, but it's a bit short-sighted to extrapolate from that to "this is universally useless".

> you also come across as a bit blind about things outside the k8s holy land.

You're not the only one with multiple decades of experience in writing code and managing infrastructure. I've used a lot of the tools in the toolbox, I know when each is likely to be more or less appropriate.


Ah, sorry if I sounded like I was saying it's universally useless. I know it helps people and, more seriously speaking, I'm aware of most of what it abstracts. But from where I'm standing it didn't do it well. Too much YAML, and I have to hand-hold it too much. It absolutely can do better at inferring or generating names / IDs, for example.


> Really? So when do I get to describe my singular app that needs Postgres and Kafka in 7-10 lines in a .txt file

Could you post those 7-10 lines needed to fully manage a Postgres AND Kafka deployment? I'm by no means a master, but I have a decent amount of experience outside the "k8s holy land" and I have no idea how to accomplish that.


We didn't need anything fancy at all, 99% can just be the defaults. That's what I meant. The other 1% is basically: user, password, port, topic name, replicas, partitions. Could have been two lines, namely two URLs.


No backups? Never do major version upgrades? Neither of these things are covered by the Postgres defaults. And even without them, I don't think I can get Ansible to install Postgres in < 10 lines. Yet all of these are covered in my ~30 line YAML running Postgres on Kubernetes. Plus of course recovering from pod and node failure, load balancing and more.

I think for most places, most of the time Kubernetes is overkill. Cloud in general isn't great bang for the buck. But speaking to your anecdote, having actually deployed these things in the past: 3-4 hours to deploy a custom application AND Postgres AND Kafka is a pretty compelling use case FOR Kubernetes. It would certainly take me a lot longer doing a proper job of it managing the system directly or using something like Ansible.
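
For context, the ~30-line YAML described above typically leans on an operator rather than a bare StatefulSet; a rough sketch assuming the CloudNativePG operator and its Cluster resource (names and bucket are hypothetical, and exact fields vary by operator and version):

    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: my-postgres                # hypothetical
    spec:
      instances: 3                     # replication plus automated failover
      storage:
        size: 20Gi
      backup:
        retentionPolicy: "30d"
        barmanObjectStore:
          destinationPath: s3://my-backup-bucket/pg   # hypothetical bucket
          s3Credentials:
            accessKeyId:
              name: backup-creds       # hypothetical Secret holding S3 credentials
              key: ACCESS_KEY_ID
            secretAccessKey:
              name: backup-creds
              key: SECRET_ACCESS_KEY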


OK, you are correct. Let's introduce nuance:

I as a dev should not waste days on this -- something I have no experience in. I can add the basic service / ingress / whatever and move on with life.

The infra / platform guys can add HA and backups later, right?

> I think for most places, most of the time Kubernetes is overkill.

Then we agree. I am not sure this app in particular was a good fit for k8s though; having it on a DO droplet + managed Postgres + managed Kafka would have taken me 1 hour, 40 minutes of which would be me cursing because I forgot to put a host name and port combo somewhere. :D

> I think for most places, most of the time Kubernetes is overkill.

True, but as you yourself said, backups and high availability are much harder on bare metal -- for devs anyway. I can do an amazing job at it because I do it at home for my stuff buuuuuuuut, it's going to take me days, maybe even two work weeks. Not a good time investment.

That's why I am resenting it when I have to do it: it's wasting my time with something that I never learn properly so I have to relearn it every time from scratch because I can't engrave it in my memory and that is because I never do it for long enough stretches of time to indeed remember it.

I already complained to my manager about it and he took it seriously, by the way. I am not raging at you or others about a deficient process in my employer's company (one that I plan to help fix, or at least mitigate somewhat). I am displeased with how everyone just settled on a very, VERY low local maximum -- and somehow nobody wants to rock the boat.

We can do better. We should do better. But alas, guys like myself have worked 22+ years on stuff they don't love -- and I already have burnout and hate programming for money, which is another tragedy entirely -- and will NEVER EVER get the chance to change anything at all because I have no choice but to work for the man... for now. Though at 40+ I am not sure I have the strength to somehow command huge commissions for consulting or whatever else. Anyhow, off-topic.

Again, we should do better.


If you didn't have helm, you would be writing your own regex scripts. I don't see how this would be better.


If you didn’t have helm, you’d be using one of the other, much better tools, and be happier for it.


I generally meant, if you didn't have something like helm you would use regex. In any case, please elucidate, what tool do you like best?


Not the poster you're replying to, but when it comes to deploying your own applications: generating Kubernetes manifests with whatever language you're already using and feeding JSON to `kubectl apply -f -` can accomplish the same outcome with less effort.

Helm is still useful for consuming 3rd party charts, but IMO its status as the "default" is due to inertia more than good design.


In my own experience I started off doing this because I wasn't ready to learn helm. However, after using helm once I didn't see the reason to do it in my own code any more.


I get the sense from the responses (and downvoting) that I will eventually learn to dislike helm!


Which ones would you recommend?


Any, it doesn’t matter which as long as you don’t have to count spaces in yaml by hand.

If you really want a concrete recommendation try https://cdk8s.io/.


Spinnaker


CUE


Really? What's wrong with https://github.com/mikefarah/yq? Works quite fine.


> Infrastructure teams can usually implement a feature faster than every app team in a company, so this tends to get solved by them.

Well, that's comparing apples to oranges. Product teams have completely different goals, e.g. adoption/retention/engagement, so naturally internal cluster encryption is so far out of scope that in fact only the platform team can reasonably implement it. I don't see how that statement is relevant. You don't send an electrician to build a brick wall


Application security should be everyone's responsibility. Architects, developers, and operations.

Too many times have I seen architects and developers completely ignore it to make their jobs easier, leaving it to operations/infrastructure to implement. It's easy to twist the arm of business people with a "I can't ship feature X if you want me to look at security Y".

If everyone took this seriously perhaps we would have fewer issues.


I agree, I was just making a point that different teams have different priorities and thus different scope. Saying "PodA can only talk to PodB over mTLS" is very different to "Users need to login using oauth". Who is going to build the product if product team is working on the service mesh?


you can implement mtls (and almost all the other service mesh features) without service meshes, and it's usually better because of lower overhead, less total complexity (see for example the fat client libraries in use by google, netflix, etc.). but people don't want to think about this so leave it to infra teams to plaster a service mesh over everything.


> mtls without service meshes

You can, but it's absolutely a pain in the neck. Services need to load the certs from the filesystem on boot-up and trust the certs provided by other services. To manage trust, you need a certificate authority. Now you need to load the certificate authority's cert, and you need to manage rotation of certs. You need to help developers set up laptop-local certificate authorities and get them to issue certs so that you have Dev/Prod parity. You need to ensure that developers are enforcing modern ciphersuites, not doing bullshit "insecure-skip-verify" kind of toggles that make their jobs easier (because remember, their job isn't security, it's shipping features), not accepting self-signed certs or other certs not signed by the certificate authority. You need to make sure all this stuff is put in the testing suite to make sure it keeps getting maintained, and you need the files for these tests marked in CODEOWNERS to be under InfoSec control to ensure nobody rips them out just because they're inconvenient. And you need to copy this for every single service you run in production and every single development team.

You know what else you can do? Write your own web server (/sarcasm). I mean, who needs nginx? Probably writing your own will have lower overhead and less total complexity, not running a bunch of features that you don't use. And probably it will not be anywhere close to as good as a battle-hardened web server used by millions of engineers that gets regular support.

Personally I think it's debatable whether services really need mTLS within a private network. It's mostly a question of what scale you're running at; probably there's higher benefit-to-effort-ratio InfoSec projects to tackle. But if you do decide you need it, unless you can prove that the overhead is unworkable for your requirements, really you need to bite the bullet and put in a service mesh.


I haven't worked with a service mesh, I worked at a company that did everything you are describing.

I don't get how you don't still need to do all that? Is the traffic between the local proxy and the service in plaintext then? Encryption is just between proxies?


Yes, traffic between the generic service and the mesh entrypoint is clear text, BUT since the proxy is in a sidecar of the generic service's pod, it shares the same "localhost" by means of Linux network namespaces, so it's virtually isolated (if there isn't a bug) from other code running on the same node. When it exits the pod's localhost, traffic is already encrypted.
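
Roughly what the injected pod looks like, as a hand-written sketch (in practice the mesh injects the proxy container automatically; names and image tags are illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      name: app-with-sidecar                    # hypothetical
    spec:
      containers:
        - name: app                             # speaks plain HTTP on localhost
          image: registry.example.com/app:1.0   # hypothetical image
          ports:
            - containerPort: 8080
        - name: proxy                           # mesh sidecar; shares the pod's network namespace,
          image: envoyproxy/envoy:v1.30-latest  # so "localhost" is the same for both containers
          ports:
            - containerPort: 15001              # traffic is redirected here and leaves the pod as mTLS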


Oh i see, that makes sense I think. Thanks!


I read the GP as it’s easier to have the single infrastructure team implement it than have every single product team add support in their service.

I mean most app servers abstract away https on the server level and most dev is done unencrypted. So this seems reasonable.


> Istio has become infamous among k8s infrastructure staff as being the cause of more problems than any other part of the stack

True in my opinion. It's very complicated and the docs are confusing af.


I attended a talk at KubeCon last year on how one company adopted the Istio service mesh. I lost the guy in the first 10 minutes of the talk, it was so complicated, and decided that a service mesh is 100% not going into our k8s clusters.

Recently an overly confident security engineer came to us and demanded that we get a service mesh because that's a SOX requirement. I have no idea where these people get pipe dreams like that.


I think Istio docs are great! I do agree that it's complicated, and I think their API is more confusing than it needs to be. The ontology of DestinationRules, VirtualServices, ServiceEntries, Gateways (as in the K8s resource), gateways (as in the istio gateway Helm chart) is not the best.


At my last company the devops guy was installing istio for 2 years before he gave up. K8s by itself was just fine.


The docs are TERRIBLE once you need to actually use them in prod.


Not having worked with K8s, it seems to me a number of things that service meshes are capable of can be done by SDN (e.g. Tailscale, ZeroTier). As far as I'm aware, SDN can do encryption and service discovery (via things like custom DNS) just fine. Can someone explain to me the differences and tradeoffs involved?


Cilium is a service mesh via being an SDN. That's hinted at in the article.


That's been my thinking for a while, too. I work extensively on Kubernetes-hosted apps, but our org has (wisely, probably) eschewed service meshes in favour of ingress-based solutions. However, the simplicity of those solutions makes things creaky and error-prone.

Rather than injecting sidecar containers that set up networking and so on, having pods join an existing SDN that just works with no app-side config would be a much more elegant solution.

Other than Cilium, I'm not aware of an SDN like ZeroTier or Wireguard that works seamlessly with Kubernetes this way (and which also works on managed Kubernetes like GKE and EKS).


As great as reducing complexity is, I just don't see how it's possible to avoid implementing a service mesh in a FedRAMP moderate or high impact level environment. You essentially need to implement mTLS to meet SC-8(1), and to implement mTLS at scale, you need something like a service mesh.
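
For reference, once a mesh is in place the encryption-in-transit piece of that control largely reduces to one small resource; a sketch assuming Istio's PeerAuthentication API:

    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: default
      namespace: istio-system        # applying it in the root namespace makes it mesh-wide
    spec:
      mtls:
        mode: STRICT                 # reject any plaintext traffic between workloads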

Are there other ways of going about this for FedRAMP moderate or high IL?


> to implement mTLS at scale, you need something like a service mesh

What makes you think that?


I think people often don't realize that depending on the language runtime, micro-services can easily be a must.

Most service boundaries at organizations are "I need a different version of a pinned package we can't upgrade." This is common in languages where there is support for only using one version of a given package, and it's worse if there isn't a culture of preserving function APIs. E.g. any Python company with pandas/numpy in the stack will need to split the environments at some point in the future, no ifs, ands, or buts!


I have heard that the reason Docker (and containers in general) took off was that they solved the problem of Python's awful package management. I didn't believe it until I saw people put Python in production and have to deal with this. At this point, I would rather have a physical snake in my server racks than any Python code.


Do we live in different worlds? Virtual environments solve 99% of all python packaging and installation use cases.


I guess so. Virtual environments don't solve any of the problems, as Python authors either don't specify the version or their dependencies' authors don't specify it. Try running any program which is >3 years old. I remember in one of the programs I needed to pin 10 dependency versions manually to make it run.


> I needed to pin 10 dependency versions manually to make it run

um... You should have been pinning all your dependencies from the very start


It wasn't my code. I pin all my dependencies, but a lot of Python code has been written by people without software engineering experience, like university students and to an extent ML engineers.


OK, but your original contention was that

> Virtual environments don't solve any of the problems, as Python authors either don't specify the version or their dependencies' authors don't specify it.

This is a problem of lack of training or willful poor practice, not an issue with venvs themselves which absolutely do solve this issue.

Agreed, the problem would not exist at all if an approach like that of say, node development, existed, but here we are.


But only in what you deploy. For the love of God don’t pin versions in libraries unless you’re really sure it has a completely different api. That’s how you get a dependency graph with no solutions.


> But only in what you deploy

You mean only your direct dependencies, right? And yes, of course


I mean the whole pip freeze output, if you’re still using that (this works like a lock file).


Agreed, or if you don't, you should run pip-compile during CI/merge in order to pin every dependency in the entire tree.

These are very solvable (and have been solved for a while) problems. It really pains me that most people are either unaware or do not dive deep enough below the surface to find them.


Dependency pinning is one of those things where I do see valid use cases, but "making it run" isn't one of them. It should primarily be used to deal with incompatible versions until you can make the necessary changes.

If you depend on dependency pinning due to unmaintained code then you should go deal with the problem directly. Say you pin 10 external libraries to three year old versions, how many security holes does that expose you to?

That's really my issue with dependency pinning: you end up with software that is just allowed to rot, making upgrades more difficult with every passing year.


> Say you pin 10 external libraries to three year old versions, how many security holes does that expose you to?

> That's really my issue with dependency pinning: you end up with software that is just allowed to rot, making upgrades more difficult with every passing year

You seem to be suggesting not pinning and just pulling in the latest (minor, I presume?) versions every time you redeploy.

A better way is to pin _all_ your direct dependencies and check for minor version upgrades on a regular basis, especially for publicly available services. A proper CI system should allow you to do this with confidence.

Occasionally you will be forced into major version upgrades, but in my experience this is rare.

This should be a regular and essential part of software maintenance and security vigilance.


> A better way is to pin _all_ your direct dependencies and check for minor version upgrades on a regular basis

I'd agree with that, I just don't believe that to be happening on any significant scale. Reasonably I think you should be able to do pinning like:

  library>=3.0,<4
Depending on the versioning scheme of the library, but yes, pulling in minor release. We currently do that using Debian packages and just rely on the OS to provide security updates to underlying Python packages.

Pinning is fine if you manage updates, but if you're then pulling in three year old libraries, which may come with their own pinned third party dependencies, I still think you're doing it wrong. In the example from GP it appears that they are pinning versions as a way to avoid forking and patching a deprecated and unmaintained package.

I'm not opposed to version pinning as such, but if you don't have a plan to stay on top of security updates, then you're better off pulling in any minor update. It's not just at redeploy; you may reasonably need to pull dependency updates more frequently than you redeploy.

If you deploy using Docker e.g., your CI system needs to be constantly pulling in updates, rebuilding and testing container images. People just don't do that. Most developers I worked with don't even care to update their container images if they aren't updating their own code. I've on multiple occasions had to deal with developers who absolutely lost their mind because we didn't automatically pull in OS updates, yet they themselves shipped outdated Java libraries or at one point even relied on an outdated version of an alpha release of Tomcat that hadn't been updated for three years. The only difference was that their dependencies were in containers and therefore, in their mind, "safe".


> Depending on the versioning scheme of the library, but yes, pulling in minor release. We currently do that using Debian packages and just rely on the OS to provide security updates to underlying Python packages.

Why not use a venv or Docker? It's a world of difference. There are multiple issues with relying on the OS package manager for Python dependency management.

Overall it sounds like you know good practice and you know the value of venvs, pinning and Docker but you've had to deal with some workplaces with really shoddy practices. Still, that's no reason to say "venvs don't work" or to give up altogether on the notion of good practice by saying "People just don't do that".

I've worked in shoddy shops, I've worked in shops with excellent practices. In some places I've been the one that made them transition from the former to the latter.


In this case you're accidentally upgrading your dependencies only when rebuilding. Especially with microservices, some things can run for years without being rebuilt.


I have a Python service which does some AI job, but it just doesn't build anymore. I have a "golden image" which I carefully back up, because losing it would be a catastrophe.


Had the same problem with node 16->18 recently. No recent OS version would build some extension. Had to rearchitect a good bit of an app to get back to something working.


Python has a real problem with version incompatibilities for the interpreter, and a few packages require C libraries of specific versions. But outside of that, virtualenvs solve all of the issues of "how do I run those two programs together".

After the Py2 vs. Py3 thing settled down, almost all of the operations issues went away.

That said, Python has a really bad situation about dependencies upgrade on the development side. But Docker won't help you there anyway. Personally, at this point I just assume any old Python program won't work anymore.


I have used python in production for years, multiple servers, multiple racks, and deployment has always been as simple as

./deploy.sh pull sync migrate seed restart

pull calls git pull, sync runs pipenv sync, migrate runs django migrate, seed runs django management command seed, restart calls systemctl --user restart


This was not my experience building infra for a startup heavily leveraging a Python monolith. It was painful AF (both when developing locally and deploying to VMs) and Docker made the deployment story palatable (build, push to hundreds of VMs, run).


Any details as to why?


Virtual envs, dependency hell, etc. I just want a binary to build and push (previously to VMs, now to k8). Docker does that for Python.

Target simplicity, it’s the ultimate sophistication.


I target simplicity by having one virtual environment instead of many and synchronizing dependencies with the main branch instead of... hell?


Have you tried deploying to hundreds or thousands of VMs? In my experience, managing container deployment and state is much easier, vs wrestling with inconsistent env state on compute for whatever reason.


For the life of me I don't understand how deploying to 10 machines is functionally different from deploying to a billion machines as the process is exactly the same. Unless you sabotaged your deployment machines with some manual meddling, it's the same with Docker images and Ansible.


When you run a single, simple Python service, it's fantastic. When you scale your Python, it's an awful dependency hell.


Can you please explain how this dependency hell manifests itself if you are using a dependency resolver such as poetry or pipenv which locates the appropriate maximum version? Once you've locked the appropriate versions, you just do pipenv sync in prod like I said.


Yes - you pull in two third party libraries that need two incompatible versions of a mutual dependency. That library (the common dependency) is not backwards compatible due to deprecated functions or whatnot. That dependency could also be Python itself.

This is an incredibly common occurrence, especially with ML systems which are not designed by people with an engineering mindset.


Not getting the package versions you want is a local dev problem and not specific for python at all. I will remind you that this thread is about deploying to prod and dispelling vague assertions of nebulous problems that are supposedly solved by docker.


That only works until you have a conflicting dependency (same codebase: a.py imports libraryA which needs dep>3.0.2, and b.py imports libraryB that needs dep==1.8.3). Then you're screwed.


no, pipenv install will resolve that and use the appropriate max version.


It will use the max version, but that version will likely have deprecated things that the other dependency relies on in its chosen version. Specifying that you must use an old version of a library is how a lot of Python maintainers resolve deprecations in their dependencies.


“Microservices can easily be a must” please explain.

Your example talks of packaging issues.


Microservices (or really services in general) solve some of these packaging issues. If I have my application that depends on package A that pulls in dependency C version 1.x, and also on package B that pulls in dependency C version 2.x, this just doesn't work in Python, and many other languages. The only way to make it work is either rectify my dependencies so all my versions match (by running one of your dependencies out of date) or to split them up so my application is composed of one service that pulls in package A and another that pulls in package B, and have them talk over some IPC.


Did you read the same article I did? How is this relevant?


I meant to reply to "good lord is this what modern microservices are like?".


The article is about service meshes and the tradeoffs amongst them. Going back several years, teams at companies ask about feature X - mtls a big one. The discussion goes to - should we use a service mesh, often the answer was no.

K8s is a great platform with many options, but many decision makers have little knowledge (or don’t research) the implication of their choices.


> decision makers have little knowledge (or don’t research) the implication of their choices.

I hate how true this statement is within the industry. Too many of these C-level executives base decisions off whatever CEO summit they recently attended.

“Every app to use microservices!111!”

“Hybrid cloud. We are doing it”

“Serverless, let’s start using this”

“We are fully going to the cloud!11!!”

Then when the results come in, the complaints start rolling in:

1) wHy iS aPp SlOwEr (after MSA)

2) gUys, iNfRaStruCtUrE cost is SoArInG (shifting to “cloud”)

3) the ApP is ToO cOmPlEx (after MSA, and “serverless”)

Some of these aging dinosaurs need to be put out to pasture


> Too many of these C-level executives base decisions off whatever CEO summit they recently attended. [...] Some of these aging dinosaurs need to be put out to pasture

AFAIK, the virus of C-suite IT bad ideas doesn't discriminate on the basis of age.


It is worse than that. Many decision makers are making their decisions based on advice from people who fired up k8s and all its gadgets for their pet project or google.


I read the article as being about service meshes now being a cost item whereas they were free or low cost. I’m not sure debating the technical merits speaks to that.


There were technical components mentioned in the article. Yes, cost comparison was the main thrust.


this is a personal attack


How so?


I take what he said as admitting that he is one of the "many decision makers [who] have little knowledge (or don’t research) the implication of their choices", and that objective fact stings him personally.


yes it was also a joke


Service meshes as a default pattern is harmful.

Ideally you would be using immutable events over a message bus as a default pattern.

Just as anti corruption layers have a use case. So do service meshes.

But because service meshes are synchronous communication, with indirection typically through a sidecar, they will have the very real costs of synchronous communication in distributed systems.

IMHO the problem isn't that they are problematic when applied to the correct use case, it is that as a default pattern they inherently will cause problems with little to no benefit for systems that can use events.

It is simply the same as any tight coupling, it should be avoided unless a specific use case justifies the costs.

Cargo culting it in because it is superficially 'easier' will typically result in code that doesn't have the scaffolding in place to support events and is very expensive to pivot away from if you don't intentionally write your code to support looser coupling in the future.

Istio, Envoy, Linkerd, etc... currently cater to the synchronous request/response communication between microservices.

It doesn't matter how decentralized their implementation is, it is still essentially sync operations over a network.

We have known about the costs of that decision for a long time irrespective of the implementation complexity.

It is a 'sometimes' food.


This was a good read, as someone using K8s a lot in the last two years but not service meshes yet, it gave me a lot to think about.

I understand there are various advantages like metrics, etc, but the encrypted traffic between pods and services is the one that I can see many orgs demanding. If not a service mesh what other options are there?


I am sort of surprised that the solution to encrypting your traffic isn't using HTTPS internally, and people rolled these wild systems of proxies. It makes sense thinking about it now because you have no idea whether you need the encryption or not (two containers may be on the same host), but this seems to be a pyrrhic victory.

Once again, the bad implications of a seemingly good idea come to bite us.


The solution _is_ running HTTPS internally. But as an application developer, do you want to manage certificates? Do you want to make sure you're using the "correct" TLS versions and that your app can talk to other TLS versions? What happens when team Z deploys their legacy Java app that doesn't? Where are you getting your certificates from?

All of these go away with a service mesh and sidecar model, where everything is encrypted for you and certificates are completely managed by the service mesh. No need to roll your own PKI. Your app only needs to speak plain HTTP, and from its perspective every other app is also just speaking plain HTTP, no TLS in sight. It makes developers' jobs a lot easier, only concentrating on shipping features vs. wrangling with TLS.

Saying all of that, you could just roll your own fine. Depending on the competency of your company.


The first thing developer does implementing HTTPS is chatgpting "how do I disable certificate validation".

There's no way I'd trust random developer to implement mTLS properly.


You can trust that the same way you trust any other part of your product. Code review, automated test.

If adding basic mTLS security to every service is that much of a pain, maybe you have too many of those?


To set up an open-source service mesh, the infra team anyway has to configure a private certificate authority and cert-manager to create k8s secrets for the service mesh components. From there, it's straightforward to extend the common deployment template (hopefully there is one) to mount a volume with an auto-rotated certificate. All an application developer has to do is to use that certificate, which is much less effort than what you are implying.
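
A minimal sketch of what that per-workload certificate might look like with cert-manager, assuming a pre-configured private ClusterIssuer (all names are hypothetical):

    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: my-service-mtls              # hypothetical
      namespace: my-namespace
    spec:
      secretName: my-service-mtls-cert   # Secret the deployment template mounts as a volume
      duration: 24h
      renewBefore: 8h                    # cert-manager rotates well before expiry
      dnsNames:
        - my-service.my-namespace.svc.cluster.local
      issuerRef:
        name: internal-ca                # assumes a private CA ClusterIssuer already exists
        kind: ClusterIssuer
      usages:
        - server auth
        - client auth                    # client auth is what makes it usable for mTLS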


It’s not less effort. I’ve done both ways in production for large teams. What you described is literally entirely automated by the mesh in a more secure and maintainable way than a bespoke hand rolled solution.


"Large teams" is the key, there has to be enough services and/or language diversity to justify the extra complexity.


> But as an application developer, do you want to manage certificates?

It might take years of trying to avoid TLS to learn that all the alternatives are far worse. So—yes, just bite the bullet, it's not that bad once you internalize the model, and you really only need to solve this once per organization.


It does have some advantages in that you don't have key material on every single server. How much this risk actually matters will vary. I suspect most enterprises would actually be relatively safe with their frontend certificates being compromised, though the same cannot be said about the actual compromise of the backend application.


I assume you would either use self-signed certs or run a private CA to do this. You wouldn't get the attestation benefits, but there's no material of any value in each container if you do this. The encryption is the only benefit.

Reading that comment, I sort of understand why people don't do this because it takes some understanding of HTTPS and cryptography to do this properly.


What I get is that the proxies are a really complete "not my problem" solution, one that puts all the burden of setting up the network encryption in the hands of the people with direct access to set up and debug your network.

If this doesn't look like a real thing to you, congratulations, you are in a well run organization that doesn't have this problem.

Anyway, whether the costs of this thing are worth the gain, I have no idea.


It boils down to the complexity of running an internal DNS and of issuing certificates. If you don't do that, HTTPS is useful only for encryption against passive eavesdroppers.


As someone that got a lot of value out of client-side load balancers as a language-neutral network library (e.g. we ran Envoy under Synapse for a while): from the start, the Istio design choices tended to address problems we didn't have / felt like architectural dream states, while making problems we did have harder to address. The messaging and overall focus seems to have improved and focused more on meat-and-potatoes features, but there's still a number of things that are hard to tweak via the standard Istio APIs that we have to backdoor in there.


Istio reminds me of the saying on Sawbones (a medical history podcast) when it comes to “cure-alls”: Cure-alls cure nothing.


Just like you probably shouldn't be using K8s, you probably shouldn't be using a service mesh. Only add it when you actually need it. You'll know when that happens.


Assuming that this is yet another DevOps tech I'll be expected to be an expert in, I looked up Istio [0]:

    As organizations accelerate their moves to the cloud, they are, by necessity, modernizing their applications as well. But shifting from monolithic legacy apps to cloud-native ones can raise challenges for DevOps teams.

    Developers must learn to assemble apps using loosely coupled microservices to ensure portability in the cloud.
This just isn't true. You don't need microservices to use the Cloud.

[0] https://cloud.google.com/learn/what-is-istio


I think they mean ‘you must be able to run your monolith in a way which only starts a small subset of services’ to leverage the cloud cost model, otherwise just rent a colo.


Well, it is true in the same sense that many think they need a meal kit delivery service if they want to avoid eating out.


K8S is quickly turning into BPML


Seems to me it's an ebb and flow of complexity; sometimes it increases, like with sidecar proxies and their issues (how they might listen on NICs differently, or how they cause grey failures), and sometimes it decreases, like when the sidecar moves to the kernel, gets rewritten in eBPF, and we can remove another IP-filtering firewall chain. So right now it seems to be a bit of an ebb, as these things are becoming modularised.


> When you start using Kubernetes one of the first suggestions you'll get is to install a service mesh.

Who on earth is giving you this advice? Service meshes are squarely in the "you'll know when you need it, and it's not day 1" bucket.

Is this the experience people are having with K8s? Welcome to Kubernetes, here’s a service mesh gl;hf. If this advice is common, no wonder some people think K8s is over complicated.

To anyone not aware: it _does not_ have to be this complicated. Default ingress controller and a normal Deployment will carry you really, really far.
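
For anyone new to this, "a normal Deployment plus the default ingress controller" really is just a handful of resources; a minimal sketch with hypothetical names, image, and hostname:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
            - name: web
              image: registry.example.com/web:1.0   # hypothetical image
              ports:
                - containerPort: 8080
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: web
    spec:
      selector:
        app: web
      ports:
        - port: 80
          targetPort: 8080
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: web
    spec:
      rules:
        - host: web.example.com                     # hypothetical hostname
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: web
                    port:
                      number: 80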


I'm sad this is the top comment on an incredibly detailed and well researched article.

The context of the content is obviously for Kubernetes enterprise platform teams and not deploying your first nginx pod. The author also gives specific examples of why a service mesh is useful and in what scenarios it shines.


> The author also gives specific examples of why a service mesh is useful and in what scenarios it shines.

The author's entire point was money, which is weird, as it is peanuts compared to the DevOps salaries of the people needed to work with this complex setup. Unless you are in a startup and good with tech, in which case a service mesh doesn't make a lot of sense. Why do you need encryption or metrics collection above ingress? Nginx does a good enough job for metrics.


It might come as a shocker for you, but many medium-sized orgs prefer to pay salaries for people who can do N things and just augment their capabilities with some Salas/paid service. But if the service is too costly, they will not allocate budget for it. Also, unlike what many Salas/software vendors would have you believe, those solutions usually need tweaking & knowledge to be operated. And if any issues happen, swift support comes at a hefty premium. Also, not all salaries are SV-level, while this software usually has the same cost wherever you are in the world.


> It might come as a shocker for you

Thanks for the unneeded tone. But the solution that author likely uses is linkerd which is free for 50 users. And even then very likely it would be less than $1000/month. If it is too much for the org to pay, I don't think they need it.

Service mesh is not something a small or medium sized company needs.

> those solutions usually need tweaking & knowledge to be operated. And if any issues happen, swift support comes at a hefty premium.

Exactly my point. The base price is not what is gonna cost them the most when using advanced tools like this. Even if it was free, it would have cost pretty much the same for any medium-sized org.


Man, big corporations are often weird and way more complex than first (or tenth) look can tell.

You can have a guy banking 200k a year in base salary begging his superiors for a fast laptop for work and compilation, and the request is not even rejected, it just hangs indefinitely. It would pay for itself within a few days with increased efficiency. Or say a Java dev fruitlessly begging for an IntelliJ IDEA license that costs less than 1 MD.

Been there, done that (in some way); now I just accept whatever boundaries are given, and let everybody know that due to this my work will take X amount of time. If they don't like it, escalate it so that the management structure is aware of unreasonable expectations within the given situation. If the corp is so rotten that this all doesn't matter and it's a long-term situation, just leave such a toxic place; you may feel scared of uncertainty but later you will thank yourself.

Health is a finite resource and leaking fast, you often don't notice it since warning signals come way too late.


> Thanks for the unneeded tone.

Sorry for that, it was unnecessary indeed.

> But the solution that author likely uses is linkerd which is free for 50 users. And even then very likely it would be less than $1000/month. If it is too much for the org to pay, I don't think they need it.

Didn't check the Linkerd plans in detail, but the author says 2 grand per cluster per month. Now, even a medium org (like where I work) can easily have 10 clusters between different environments or different workloads. It's not that weird. And that would already push the price to 20k a month, which needs to be justified. Especially if you were using the same thing but not paying anything before.


Salas?


Autocorrected SaaS?


It was SaaS indeed, sorry!


K8s, service meshes, all this infrastructure complexity to me seems like more of a band-aid to bad backend practices and lack of standardization.

This kind of world arises if an org has, say 1000 engineers, or ~100-200 teams, and every team wants to do their own thing with accountability improperly mapped etc.

And infrastructure engineers turn into police with an obvious overemphasis on observability.

I guess a large amount of complexity in infra and wastage is done to hide human inefficiencies.


The flip side is that some of this standardisation lets smaller teams very easily get this functionality without the man-hours required to implement it in your application layer.

(Usual caveats apply: don't bother until you need it, etc. Personally though, I've found dropping in Linkerd or the Cilium CNI about an afternoon's work, with no application code changes necessary.)
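
For reference, the "afternoon's work" with Linkerd mostly amounts to installing the control plane and opting namespaces in; a minimal sketch of the opt-in, assuming Linkerd's standard injection annotation (namespace name is hypothetical):

    apiVersion: v1
    kind: Namespace
    metadata:
      name: my-app                     # hypothetical namespace
      annotations:
        linkerd.io/inject: enabled     # the proxy injector adds the sidecar to new pods here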


Or there is a company mandate to encrypt everything due to customer and legal needs. Then instead of rolling your own solutions you use an established mesh and move on to other problems. The complexity is not the problem here; your lack of understanding of why someone may need a mesh is. The last sentence is because you pre-judge the need and the platform it needs.

Yes, I do this for a living and know what I'm talking about.


To answer your question, yes this is the majority of the experience and is almost a given decision in the initial setup, in part because of the security association. Especially in enterprises it's easier to just say service mesh than have to think the decision through.

Regardless of service meshes, though (although they're a part of it), the complicated reputation is not undeserved and not solely due to meshes.


I have a small cluster and the only reason I'd want a service mesh is to inspect req/resp between services for better debugging. And I'm not even sure they can do that.

Retries for free are good I guess, but not something essential. MTLS is something I don’t want at all.


Couldn't you accomplish that with tracing? You may not even need any changes in your application, as eBPF may be able to automatically instrument the application. Alternatively, if you use Cilium as a CNI, this functionality can come out of the box. https://docs.cilium.io/en/latest/observability/visibility/


Exactly: if you don't need the functionality provided, don't use a mesh; but conversely, if you do, then use it. Honestly not very complicated, but mTLS is very complicated to do correctly.


If you don't hear this advice, you are simply not in the audience that routinely uses service meshes. They are quite useful in enterprise environments.


> If this advice is common, no wonder some people think K8s is over complicated

I mean, if even Google who were behind it in the first place, are saying it's complicated and have 3 levels of managed Kubernetes services, I'd say it's pretty clear to everyone it is indeed complicated. Some of the complexity is simply due to the complex problems it solves, some is footguns, some is layers upon layers of abstractions to glue around design deficiencies.

> To anyone not aware: it _does not_ have to be this complicated. Default ingress controller and a normal Deployment will carry you really, really far

You're absolutely right. However, Kubernetes is rarely chosen after a careful evaluation of requirements; it's more often than not because it's considered necessary or for resume driven development. In that case, service mesh is easy to tack on, regardless of need.


And the default Gateway API is going to let your ops people help you go far without adding a bunch of coordination over ingress.


It's yet another tradeoff. It adds more security for the apps too: now your apps don't need to have TLS stuff, you let the sidecar Envoy proxy handle that, etc.


Yeah, I'm aware of how useful they can be; my comment was that for someone's first introduction to K8s, they're probably overkill.


Yes, a bit overkill for first use; you definitely want to tinker around with it without the extra moving parts at first. But I'd say if somebody is venturing into that space they should also consider security and should certainly try to get the added layers. A lot of companies miss out on security, implement it as an afterthought, and run into more pain for everybody involved.


This article gives an overview of the 5 or so most popular service mesh options for Kubernetes for enterprise installations. So if you are reading it and are not the target audience, keep that in mind. Also, to the people in the comments arguing against service meshes with "it seems", "i assume", "i think", etc.: please read the list of concerns that a service mesh is good at resolving, also in this article. If you don't have these concerns, it doesn't mean they don't exist elsewhere, right?


Good lord, is this what modern microservices are like?

How is it better to have service A make a request to a proxy, which requests another proxy, which requests service B? I get the security benefits, but the network architecture is boggling. How many PBs of data are sent each day for what could be a monolithic service?

Actually, to that end: for companies that embrace microservices and see the value in them, how do they manage network traffic for these kinds of K8s setups at scale? Surely they're not using REST and HTTP. Is it as simple as protobufs over HTTP? QUIC? Something else?

Edit: lots of great discussion below but I really meant how do they manage network TRAFFIC, not microservices in general :)


I really hate this pattern in principle, but it isn't as bad as you're making it out to be. Most of the service meshes operate as sidecars, which is to say that if service A is calling service B through a mesh, there are two proxies in between the services, but proxy A is on the same machine as A, and proxy B is on the same machine as B. So it is kind of offensive to have 3 network hops involved, but actually only one of those is over the wire, and in most cases I don't think the proxies are causing any measurable latency. (At least, whatever latency they cause is smaller than the latency caused by the TLS encryption, which is necessary and the whole point.)


That makes sense! I didn't realise the proxies were sidecars, but that seems obvious in retrospect.


You're asking some really good questions here

I find that microservices and this type of architecture have become a religion - you do it this way because you do it this way. You add another layer of complication because that's what you do now. You add this product because that's what you do now. Now you do it this way. Now you stop doing this thing and do this thing instead. It's all proclamations and a truly insane level of complexity and often a truly stunningly low level of performance achieved from some very powerful hardware because everything is behind at least twenty layers of abstraction and you're like, encrypting traffic which is just being passed between VMs which are on the same hardware, but because you can't guarantee that they're always on the same hardware you have to encrypt and use a proxy and... oh wow

Watching it from the outside is a bit exhausting, it just seems to be so much churn and overhead.


> How is it better have service A request to a proxy, which requests to another proxy, which requests to service B? I get the security benefits of that, but the network architecture is boggling. How many PBs of data are sent each day for what could be a monolithic service?

The tradeoff here is to decouple services in order to allow them to be developed somewhat independently of each other. Monoliths remove the overhead of network requests but they present their own challenges. You have a lot of implicit dependencies and feature development becomes complicated with changes having unexpected effects very far from the source.

Ultimately engineering organizations need to decide the model that works best for them. Neither is inherently better, they’re solving different problems.


> The tradeoff here is to decouple services in order to allow them to be developed somewhat independently of each other. Monoliths remove the overhead of network requests but they present their own challenges. You have a lot of implicit dependencies and feature development becomes complicated with changes having unexpected effects very far from the source.

I have seen, developed, designed, managed, deployed, operated, fondled, and otherwise been around thousands of large systems, anything between trivial importance and "must always be running, in the national interest". I was around when SOAs were the hot new thing and SOAP was being rumoured as the thing that was going to save us from everything. I fondly recall an overpaid Compaq consultant talking about "token passing systems" when they were describing message queues.

I have seen exactly two systems that really had to be designed and built along a microservice architecture. Both of these had requirements that introduced a scale, scope, and complexity you simply don't see very often. All other microservice architectures didn't solve for requirements; they solved for organisational inefficiencies, misalignments, mismanagement, and, in no small part, ego.

When discussing this topic, proponents of microservices talk about many of the advantages these architectures bring, and they are often not completely wrong. What is lacking from these discussions is often a sense of perspective: "Is this solving real problems we have?", "What is the compound lifecycle cost of this approach?", and, of course, "How much work is involved in displaying the user's birthday on the settings page?"[1], and "when will Omega Star get their fucking shit together?!"[2].

Don’t start with microservices as the default. I will typically work out the monolithic approach as a point of departure. Want microservices? I’m open to that, just demonstrate how that will be better.

[1][2] https://m.youtube.com/watch?v=y8OnoxKotPQ


> All other microservice architectures didn’t solve for requirements, they solved for organisational inefficiencies, misalignments, mismanagement, and - in no small part - ego.

Being able to be developed, maintained, and operated by the real organization that exists, rather than an ideal organization that doesn't, is in fact usually a practical if not a theoretical requirement, and it's usually easier to adapt architecture than to adapt the organization.


Absolutely. But in many cases you'll find companies have far more than one microservice per team (or squad). At that point you start to wonder. Some devs seem to use microservices for what modules used to do.


> The tradeoff here is to decouple services in order to allow them to be developed somewhat independently of each other.

You don't need a service mesh for that, though? Heck, you don't even need an ingress for service-to-service traffic. ingress-nginx does the job well without being overly complex and, most importantly to me, logs when something is wrong, which I cannot say for Istio, which I was fighting earlier this week while it was just happily RST'ing a connection and saying nothing about why it was deciding to do that.


There are a lot of good and bad reasons to adopt a mesh, some of which might relate to your concerns. The things I like most about them, working in infrastructure:

1. I can have a unified set of metrics for all services, regardless of language/platform or how diligent the team is at instrumenting their apps.

2. I can guarantee zero trust with mTLS, without having to rely on application teams dealing with HTTPS or certificates (see the sketch after this list).

3. I can implement automation around canary releases without much lift from dev teams. Other projects leverage these capabilities and do it for you as well.

4. I can get the equivalent of tcpdump for a pod pretty easily which I can use to help app teams debug issues.

5. I can improve app reliability with automatic retries and timeouts.

Probably some other things as well... That said, it can be a big increase in complexity to your system the pains of which aren't always distributed to the folks getting the benefits.
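
As a sketch of point 2: with Istio, for instance, mesh-wide strict mTLS is roughly a single policy object, assuming the standard PeerAuthentication resource and the default root namespace, with no application changes:

    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: default
      namespace: istio-system   # applying it in Istio's root namespace makes it mesh-wide
    spec:
      mtls:
        mode: STRICT            # sidecars reject plaintext peer traffic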


> 1. I can have a unified set of metrics for all services, regardless of language/platform or how diligent the team is at instrumenting their apps.

But they're only going to be coarse metrics, like requests/second. You're still going to need application-specific metrics.

> 2. I can guarantee zero trust with mTLS, without having to rely on application teams dealing with HTTPs or certificates.

I do like the idea of this feature of service meshes. It is a slog to get teams to do this responsibly. But, like I said: fighting Istio to understand why it was RST-ing a connection, for no apparent reason. Not logging errors is worse. Perhaps the idea is sound, but the implementation leaves one desiring more.

I should mention that the same Istio service mesh above is a SPoF in the cluster it runs in, on account of being a single pod. I can't tell if the people who set it up were clueless or if that's the default; I suspect the latter.

> 3., 4., and 5., as well as actually using mTLS in 2.

TBH, these are just benefits I've never been able to realize. I'm stuck slogging through the swamp of service mesh marketing and the people who want to bring the light of their savior the service mesh but without actually getting their hands dirty doing the work of deploying it.


Just gonna touch on (1):

The fact that you need other metrics does not subtract from OP's original point. It's still good, and much better overall, to handle a comprehensive set of metrics at the infra level than to instrument every app.

About the coarseness, I think that's not really true. Proxies are freaking powerful and do a lot of stuff at L7, too much in fact (look at Envoy, jesus christ). That's one of the reasons why, despite the insane complexity of service meshes, they are paramount for observability.


> I can guarantee zero trust with mTLS, without having to rely on application teams dealing with HTTPs or certificates.

OK, but what's your threat model for this? Great, you can tie service versions to each other, but they are just proxies.


1. I agree with the sibling comment, that's usually not the depth I need for application level info. So the entire skill and infrastructure cost for metrics still exists. But it's nice to have as basic data for triage.

2. I always wonder whether that's a timing thing vs. NetworkPolicies and encrypted inter-node traffic. Are there realistic attack scenarios where it's possible to read intra-cluster traffic but not mess with the cluster, or even read the intra-pod traffic?

3. I've been quite disappointed with how little k8s provides here. I wish it was easier to move traffic off of an old version and only shut the pod down once the last connection was done :/ Maybe I need to look into a service mesh for that?

4. What's the difference to e.g. kubeshark, or just attaching a tcpdump debugcontainer to the pod? Another instance of first to market / potentially nicer ecosystem?

5. I get squeamish with infra-level activities like this. Yes, technically the HTTP method and some headers should make it obvious whether a retry is safe or might break at-least/at-most-once semantics. But that requires well-behaved applications, while the premise here is infra imposing behaviour to allow applications to be looser about these kinds of things.


For #5, I like mesh-level retries for a few reasons, but perhaps the biggest is avoiding retry storms by using budgets: https://linkerd.io/2.15/tasks/configuring-retries/#retry-bud...
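
For context, a rough sketch of what a budget looks like in a Linkerd ServiceProfile, going by the linked docs (the service name is made up, and individual routes still need to be marked retryable separately):

    apiVersion: linkerd.io/v1alpha2
    kind: ServiceProfile
    metadata:
      name: my-svc.my-ns.svc.cluster.local   # hypothetical service FQDN
      namespace: my-ns
    spec:
      retryBudget:
        retryRatio: 0.2            # retries may add at most 20% extra load
        minRetriesPerSecond: 10    # floor so low-traffic services can still retry
        ttl: 10s                   # window over which the ratio is computed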


Can someone enlighten me on this: if authorization policies (which pods can communicate with which services) were built into kube-proxy, wouldn't that solve the use case for a large percentage of service mesh deployments?


kube-proxy works by configuring iptables (oversimplified), and its scope is limited to L4. It doesn't operate at L7.


Isn't that what k8s network policies do?


The support varies by CNI so I wouldn't call it "built-in".

But yeah I forgot this existed.
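
For reference, the L3/L4 version of "which pods can talk to this service" looks roughly like this (names are made up); note it says nothing about HTTP methods or paths, which is where the L7 mesh features come in:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: api-allow-frontend     # hypothetical policy
      namespace: my-ns
    spec:
      podSelector:
        matchLabels:
          app: my-api              # pods this policy protects
      policyTypes:
      - Ingress
      ingress:
      - from:
        - podSelector:
            matchLabels:
              app: frontend        # only frontend pods may connect
        ports:
        - protocol: TCP
          port: 8080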


> Edit: lots of great discussion below but I really meant how do they manage network TRAFFIC, not microservices in general :)

I don't really understand the question you're asking, but maybe the answer is that network pipes are just bigger than the scale most people are operating at? I don't think anything I've ever done has really had that many qps, and if it has, it's more likely to raise an eyebrow that says "who's spamming requests" than "I guess we've made it to the big time".

REST & protobufs are orthogonal. Empirically, literally nobody is doing REST, and most things are just ad hoc, poorly to not-at-all defined JSON/HTTP with a few HTTP verbs sprinkled in to make everyone feel good. It could be protobuf, too, if you like, but unless you have some truly gargantuan JSON, it really won't matter in the end. Compression will make up enough of the difference in size on the network. Some languages don't have to allocate the keys a billion times, too, though even in Python, it's a while before it starts to hurt.

What I see more of is processes just inexplicably using gigabytes upon gigabytes of RAM, burning through whole years of CPU time for no particular reason before just dropping back to nominal levels like nothing happened, and dev teams that can't coherently understand the disconnect between just how much power a modern machine has, and what their design doc says their process is supposed to do (hint: something that shouldn't take that many resources).

I like microservices, but there should be strong areas of responsibility to them. For most companies, I think that's ~2–3 services. At my current company, it's ~2 services + a database, with the rest being things like cron jobs or really small services that are just various glue or infra tooling.


> How do they manage these kinds of K8s setups at scale?

Our DevOps team starts off the monthly all hands meeting by leading a ritual during which they ceremoniously sacrifice an animal from the Fish and Wildlife Service's Threatened & Endangered Species list while the rest of the company chants:

    Exorcizamus te, omnis immundus spiritus
    omnis satanica potestas, omnis incursio
    infernalis adversarii, omnis legio,
    omnis congregatio et secta diabolica.


"Getting a SCSI chain working is perfectly simple if you remember that there must be exactly three terminations: one on one end of the cable, one on the far end, and the goat, terminated over the SCSI chain with a silver-handled knife whilst burning *black* candles."

-- Anthony DeBoer

"SCSI is *not* magic. There are *fundamental* *technical* *reasons* why you have to sacrifice a young goat to your SCSI chain every now and then."

-- John F. Woods


I LOL-ed at this because I used to cry when no goats were around.


haha, IT staff doing mystic chants reminds me of this video I made over a decade ago:

https://www.youtube.com/watch?v=8Sj3_NfDYeU


I needed this laugh so much. Thank you.


> How many PBs of data are sent each day for what could be a monolithic service?

The companies that need service meshes couldn't possibly run everything in a single monolith. They already have several if not dozens of monoliths, each one typically coming into the architecture when a large enterprise acquires another company and its monolith, or simply different business units / product lines that don't talk to each other because of sheer organizational scale.

It's called a service mesh, not a microservice mesh. First you get the benefits of wrapping, securing, and monitoring each of your monoliths; then you reap more benefit when you start to break up each of those monoliths so common concerns can be addressed by common services and provide a unified experience to the enterprise's customers/users.


The data over the wire is the effect of having microservices and has little to do with service meshes. If anything, a service mesh could help, by transparently enabling compression or upgrading to HTTP/3 without having to bake that into every app. Also, from a networking perspective, the proxies are hosted next to the instances, so there is no difference there.

We run 1800 services in our mesh. Most of them are REST/HTTP, a select few gRPC and GraphQL. We even have some SOAP/HTTP services in it.


With 1800 services, do you feel like you are programming/architecting code the same way you might in a complex monolithic codebase, working across them all?

Or are the services just cogs in the machine, managed by individual devs who are themselves cogs in the machine? I imagine the latter, and presume that's the main benefit of so many services, but I'm genuinely curious, as I've never experienced that many services in one architecture.


I would say it's right about the middle of those two extremes.

A well functioning service mesh (or even just a well maintained and discoverable ingress controller) is essentially invisible to the individual dev teams. Just think how modern stacks work from a frontend dev's perspective: team wants to use an additional feature, so they find the budgeted credit card, sign up to a random third-party provider, get their access token, and go. From the codebase standpoint, they merely added a new roundtrip to a random service and process the responses in their code.

From the dev team's point of view, having the same feature available internally, behind "just another URL", makes no big difference. Maybe less politics around vendor spend and, with luck, easier integration with the remote service auth. Almost certainly less wrangling with compliance and legal.

Whether that URL is provided by an ingress with a proper FQDN, or a service mesh entry with otherwise unresolvable name, is (and should be) irrelevant.

Modern distributed systems have long since become too large for any single person to fully comprehend them through and through. There is no Grand Design[tm], they are all results of organic changes and evolution. Service discovery and routing can be architected. Individual services within the system can be architected. The complete system where hundreds or even thousands of services interact can not.


> A request to a proxy, which requests to another proxy, which requests to service B?

I hesitate to tell you how many layers are between me an HN's servers right now.


Right, but all those layers exist so that HN's one server can service your request and send a response.

How much traffic would be generated if, for every request to HN, six other requests fired? And every time you comment, a cascade of requests fired in HN's imaginary K8s cluster?

It’s not that there is anything wrong with this, or that the tradeoff isn’t worth it… it’s just so much data flying back and forth over the wire.


Ultimately, in the age of massive cloud compute, the constraining resource in an organization is engineering hours, not CPU/memory/bandwidth. And even then, I've yet to encounter a system (outside of massively parallel MPI-based HPC) where saturating network pipes became the bottleneck for a system before CPU/memory utilization.

Microservice architecture means there's lot of data flying around, but it keeps local resource utilization predictable.


I thought that the constraining resource was money in all but the top 1000 or so businesses in the world with effectively 99%+ margins, mostly because they control their markets absolutely and can raise prices or change demand like the big three algorithmic advertising companies. Everybody else in the cloud goes to the wall without very careful cost control, which is by no means automated or low cost, either.


Most often, the wire is virtual; it's just copying bytes between processes.


> so much data flying back and forth over the wire

A mesh doesn't add any extra data back and forth over a wire.

It adds some data being copied from one place in RAM to another place in RAM.


> How do they manage these kinds of K8s setups at scale?

You check configs into the monorepo, and there is tooling to continuously sync them with the actual state of the infrastructure.

The advantage of microservices at scale is that team X breaking the build doesn’t affect team Y. This scale is probably not until you have 1000+ engineers however.


To be clear, that is the advantage of services. Microservices are a culture of taking that isolation as far as you can.


You do deployments as one used to do with big C++ programs; you can also delete everything and deploy again to make sure everything is new. The latter is preferred. Doing continuous deployments is very hard, and I'm not sure it's possible; at least I haven't seen it work well, even though it's touted as a fundamental k8s feature. K8s groups resources and presents a platform; how you use it is up to you, and to management.


Traffic within a data center is relatively cheap, is it not? Of course, I think managed network proxies tend to be expensive, but I think that mostly comes from the machine costs, not the network itself. You pay a bit of a tax for network serialization, but it's unlikely to be the bottleneck in most things ime.

Plus, applications built on an external database (e.g., postgres not sqlite) will already have 1 hop. Going from 1 to 2 hops is less dramatic than going from no hops to 1. And I guess implicitly any service sending lots of data to the client already has 1 hop.

Imo the only case where a network proxy would be egregious is an in memory database kind of workload where the response size is small relative to the size of the data accessed (maybe something like a custom analytics engine), but that's pretty niche.


> Traffic within a data center is relatively cheap, is it not?

Complexity is very, very expensive.


> Traffic within a data center is relatively cheap,

Most organizations will run their services spread across multiple data centers / AZs.


Hm, wouldn't you want to set it up to route traffic preferentially to a nearby location?


> for what could be a monolithic service

On larger projects:

a) It is sometimes impossible to run the entire application on a laptop. And so having a micro-service means you can quickly iterate and test before embedding it into the wider system.

b) You will commonly run into conflicting transitive dependencies which you simply can't work around. Classic example being Spark on the JVM which brings in years old Hadoop libraries.

c) The inter-relationships between components can become so complex that you really want to be able to use canary or blue/green deployment techniques to reduce risk.


The trick is to know which code you want to maintain yourself and which stuff you want to get off the shelf. Going the off-the-shelf way also means being stateless and using distributed transactions with all their pitfalls (eventual consistency, fire and forget, ...).

Good architects will know how and when to do what, e.g. start with a modulith instead of a monolith to ease refactoring into microservices once the user count goes through the roof and vertical scaling won't cut it any more.


They manage traffic with "service meshes", which include anything from relatively "basic" distributed iptables/IPVS VIPs (kube-proxy) plus DNS, to full-blown intercepting proxies a la Istio/Linkerd.


It's all HTTP with hundreds of milliseconds of latency per request.


Heh, wait until you hear how cell signalling and basically any biological process in your body works :) It's an absolute clusterfuck of dependencies and side effects. It's like spacebar heating functionality[1] all the way down...!

In my thinking, we might as well get used to the levels of indirection and complexity that AIs will be comfortable with. I suspect it will be more akin to what biological computation is "comfortable" with, and less like what our minds happen to prefer

But I digress :) yes, microservices strike me as wiiild

[1]: https://xkcd.com/1172/


To me, every one of these sounds like "why don't we put 10ys(tm) inside ZFWs(tm) inside Funnylettes(tm) proxy-edge-borderline stuff, so everything is so messy and incomprehensible that you need an AI to explain your application deployment".


Same. I've never worked in infrastructure teams because I simply don't care about this stuff, and I always find it funny how the infrastructure details tend to leak out to us application developers.

Come on, I just want to deploy my app somewhere. I don't want to know about yamls and kubernetes, I just want a button where I deploy my branch.

It often feels like people working on infrastructure and platform development do it for its own sake, forgetting that they were supposed to be enablers for the other teams building on top of it.


I work on an infrastructure team, and I always find it funny how app developers think it's possible to abstract everything away. All abstractions leak! Your code will reach a physical machine at some point in the process. We platform people do our best to abstract things for you, but we can't magically bypass the law of leaky abstractions (https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-a...)


I grew up in the 1980s, when you had to know your platform well to get nice software written.

I find it annoying that there are a lot of young (and not-so-young) developers who don't take the time to properly understand where their apps are running in order to make the best of it and also not to write crappy and slow software.

I see people climb mountains in Java when they had all they needed in the database or in the OS.


I also suspect people balk at infrastructure because it means learning a new and complicated thing to solve their problem, but the shape of the solution is the same whether they build it in their application as a library or whether they avail themselves of off-the-shelf infrastructure offerings (although there are some problems that can only be solved by infrastructure).

Either way, you have to understand the solution at a high level, and I think much of the balking at infrastructure is really just balking at needing to understand the fundamental nature of their own problem.


I'm an application developer, and I'm also an infra guy, because you can't get away from the fact that your code will run on a machine, and you need to know what happens. I don't have much respect for application developers who "don't want to know anything about infra", sadly.


I generally agree with you. In my experience, the infrastructure engineers are very often the best software engineers. Like there have been half a dozen times in the last year that an infra team at my company has essentially done the architecture design work and high level implementation for a complex system that the dev teams technically owned, but they kept stalling out on.

I think there are at least two reasons for this: the first is that infrastructure engineers (e.g., SREs) just think about reliability and architecture a lot more than SWEs and the second is that the infra eng position generally selects for people who have a penchant for learning new things (virtually every SRE was an SWE who raised their hand to dive into a complex and rapidly evolving infrastructure space).

Also, the best SWEs have been the ones who accepted the reality of infrastructure and learned about it so they could leverage its capabilities in their own systems. And because our org allows infra to leak to SWEs, the corollary is that SWEs are empowered to a high degree to leverage infrastructure features to improve their systems.


The problem is that our field has a deep-seated problem with differentiating good engineering from over-engineering; the latter is taken as the former. In my experience, SREs/devops will just as often over-engineer and leak their abstractions to everyone else, due to paranoia and the current fads in the "cloud native computing" field. Suddenly people need to know a dozen concepts to debug the path of an HTTP request. This was not needed before; it's mostly a consequence of holes we've dug for ourselves (like microservices) and are now using to justify our over-engineering, all within the part that was previously seen as the absolute simplest to manage: stateless app servers.


I mean, maybe? I won’t doubt that some people are prone to overkill or hype-driven engineering, but also most of the simplest solutions still look like Kubernetes (e.g., a cloud provider or a platform-as-a-service). The most severe, prevalent danger is people thinking these things are too complex and then trying to roll their own alternative on bare VMs by cobbling together disparate technologies which only the very best sysadmins can do passably (of course every other sysadmin thinks they are in this elite group) and even then the thing they build still leaks implementation details except now there is no market of engineers who are experienced with this bespoke alternative to Kubernetes nor is there a wealth of documentation available online.


> The most severe, prevalent danger is people thinking these things are too complex and then trying to roll their own alternative on bare VMs

No, it's not. This was simply not the case before k8s. Running a stateless app behind a (cloud) LB talking to a (cloud) DB has never been the hard part, and it's even easier today when something like Go is essentially just starting a single binary, or using Docker for other languages. People seem to have forgotten how far so few components get you. But the incentives for too many people involved align towards increasing complexity.


I mean, yes, if you have a small org with one CRUD app on a handful of VMs, but if you want nice things like "lower environments that mirror production environments", "autoscaling", etc then yeah you pretty quickly run into the kind of complexity that justifies Kubernetes, a cloud provider, a PaaS, etc. Essentially, the problems start coming when you have to manage more than a handful of hosts, and you pretty quickly find yourself reinventing Kubernetes, but you're cobbling it together from a bunch of different tools developed by different organizations with different philosophies using different configuration schemes and your team (including everyone you hope to hire) has to understand them all.


> if you have a small org with one CRUD app on a handful of VMs

I think most would be surprised how big a chunk of modern apps fit within that space without active intervention by stuff like microservices. No need to stop at a handful of VMs, though, although I imagine most companies could easily be covered by a few chunky VMs today.

And yes, if you're a PaaS, multi-tenancy something something, then sure, that sounds more like a suitable target audience of a generic platform factory.


I don't know. If you have more than a few hosts, it already seems like you need some configuration management setup so you can reproducibly set up your boxes the same way every time. I certainly wouldn't want to work anywhere with more than a couple of production systems that have humans SSH-ing into them to make changes.

And if that's the territory you're in, you probably need to set up log aggregation, monitoring, certificate management, secrets management, process management, disk backups, network firewall rules, load balancing, DNS, a reverse proxy, and probably a dozen other things, all of which are either readily available in popular Kubernetes distributions or else added by applying a manifest or installing a Helm chart.

I don't doubt that there are a lot of systems that are running monoliths on a handful of machines, but I doubt they have many development teams which are decoupled from their ops team such that the former can deploy frequently (read: daily or semiweekly) and if they are, I'm guessing it's because their ops team built something of comparable complexity to k8s.


No one changed VMs manually over SSH beyond perhaps deep debugging. Yes, creating a VM image might be needed, or it might not, depending on the approach [1]. Most of the things you listed are primarily an issue after you've already dug a hole with microservices that you then need to fill. VMs are still a managed solution. I'm not sure where people have gotten the idea that k8s is somehow easier and doesn't require significant continuous investment in training and expertise by both devs and ops. It's also a big assumption that your use case fits within all of these off-the-shelf components rather than having to adapt them slightly, or account for various caveats, which then instantly requires additional k8s expertise. Not to mention the knowledge required to actually debug a deployment with a lot of moving parts and generic abstraction layers.

One downside I also see compared to the "old-school" approach, albeit maybe an indirect one, is that it's also a very leaky abstraction that makes the environment setup phase stick around seemingly in perpetuity for everyone rather than being encapsulated away by a subset of people with that expertise. No normal backend/frontend dev needed to know what particular linux distro or whatever the VMs were running or similar infra details, just focus on code, the env was set up months ago and is none of your concern now (and I know there's some devops idea that devs should be aware of this, but in practice it usually just results in placeholder stuff until actual full-system load testing can be done anyway). So a dev team working on a particular module of a monolith should be just as decoupled as with microservices. Finally, for stateless app servers, the maintenance required was much rarer than people seem to believe today.

I realize that a lot of this is still subjective and involves trade-offs, but I really think the myth-building that things were maintenance-ridden and fragile before is far overblown.

[1] https://cloud.google.com/compute/docs/containers/deploying-c...


I found it refreshing when a developer said to me point blank, "I don't care where this runs". She didn't say that she didn't care that it runs, but she was being upfront that she had other concerns more relevant to her, so she left the operations work for my team, but crucially, she followed our advice/requests on how to build things to work with our infra. Not everyone can do everything; it's why there are specialties, but it means that collaboration is vital.


I don't care where my stuff runs either, but, when it breaks, I care that I have a good enough mental model of what's going on that I can debug it. Whenever I see web developers not knowing how an HTTP server works, a part of me dies a little.


I remember learning about keyboard matrix encoders because I got into building MAME arcade projects when it was still a fairly young idea and consumer products that integrate arcade controls with computers were not commonly available. When you learn about them, you'll understand why older and/or cheaper keyboards have artifacts known as blocking and ghosting. It's something that I at least found interesting, but I doubt most people can be bothered to care. And as far as how that keyboard encoder is creating the ones and zeroes and then sending them over the USB bus - that might as well be magic to me.

I suppose we all live at the abstraction level that suits us. I'm of the opinion that a lot of the modern infrastructure we work with is over-abstracted. I'm not sure most developers even know that an HTTP server exists, thanks to the work of modern devops, where you push a commit and it gets automatically deployed to some sort of container orchestration system. At some level it has to be this way; we stand on the work that was done before us, "the shoulders of giants" as it's often said with more poetic flair. So I wonder if it's just me getting older and not understanding the new, or if we have genuinely gone astray in some ways. And those don't need to be mutually exclusive, either.


I don't know, I'm 41 and I understand the application, the WSGI layer, HTTP, TLS, TCP, containers, OSes, CI, etc. Less so lately, as I've transitioned to leadership, but how would I deploy and debug my applications if I didn't know all this stuff?

I guess I've traditionally built and deployed all my apps myself, without the luxury of an infra person, so I had to learn all this, including Linux administration.


Similar age and skillset here, though much more comfortable on the operations end of things than the development side, although I have built small apps that did see some production use, which was cool to do.

It's a wide-ranging skillset no doubt, but still built on the work that precedes it. Did you build the servers that ran your code? Rack them and set up the networking? Build redundant power systems for them? HVAC? I'm sure you've done some of that, but no one person can really do it all. I think the saying is something like "fish don't know they're wet", meaning that they're so adapted to the ocean that they don't really have any idea that there are other ways to live. I get that feeling when I hear older folks talk about working on mainframes; it just sounds like an alien world. I came of age in the era of distributed systems, Microsoft vs Open Source, but it was always on commodity x86 hardware and TCP networks. There's no law of the universe that said it had to evolve that way. It seems like the new way is "cloud-native", which I'm not sure I understand or like (as I've said, it may just be me getting old); my biggest problem is that it seems to give ever more power to Amazon and other members of the tech oligopoly, rather than any technical shortcoming.


I agree with you on the cloud-native stuff, I'm sure it's great after a certain scale, but I think people switch to it way too early, and now we're running our three apps on a huge stack not many people can know unless they spend a lot of time understanding the configs.

Maybe it's the same as before, maybe I'm just less familiar and it seems huge to me.


I do infra alongside dev: ssh, rsync, remote volumes, tunnels, load balancers, proxies. These are not trademarked words that belong to a single knowledge domain.

I am afraid people are so focused on company products that they will forget what infrastructure is.


In a good infrastructure setup, any average dev should be able to get a new hello-world CRUD service deployed to a test environment, with the DB and AWS service backing of their choice, and have it reachable by other services within an hour, without ANY additional config. Take the subdomain name from the repo name and be done with it!

If that is not possible, you have failed your job as masters of the infrastructure.

At my company, for this I need to copy-paste some terraform in repo 1, write some scaling config in repo 2 and put in new, encrypted credentials and other configs in repo 3, then finally I can deploy my docker app from repo 4.

I can do it all myself in two hours, but it constitutes total failure on the infrastructure team's side.


While you're writing your 'hello world' app, the infrastructure you deploy to has to support every single permutation of everything every developer in the company writes, multi-tenanted, multi-environmental, multi-region, highly-available, backed-up, in-flight and at-rest encryption, permissioned appropriately, ISO27001 compliant, and that's not even getting to observability and metrics. Infrastructure people tend to have a much harder job, especially with ignorant developers who just want a one-button deploy. Too many developers only care about 1/3 of the SDLC.


The fact that you ended with "I can do it all myself in two hours" indicates to me you're horribly ignorant of all the complexity you want to hand wave away.


I'm not in devops-land, but I pretty much hear the same from devops guys: "Instead of running pip/poetry/npm/yarn/whatever-is-popular-this-month, waiting for it to download gigs of dependencies, setting up nginx in a specific way, starting 10 services for a web shop, using the exact Python or Node version, and praying to the infra gods that everything holds water, why can't we copy a single jar/binary, start it, and forget about it?" ;)

I'm getting the impression that the whole industry shifted focus from solving actual problems in the most simple way to draining money and keeping things afloat as much as possible.


You say you don't want to understand infrastructure and just want a button that magically deploys your app, yet in the same breath you complain that devops folks are "working on infrastructure and platform development for it's own sake". These are contradictory sentiments.

"Leaking" infrastructure details to developers is actually less work for them because they are deferring the finer details of the app deployment to the developers who made it. Your request for a magic button that doesn't require you to understand anything about infrastructure is exactly why we have so much complexity.


I agree that you never escape the requirement to have knowledge of how your application is deployed. Things like autoscaling, security, performance, etc., all intrinsically affect how you actually write your application.

I think there's a broader point, though: very often, infra teams will pursue solutions that solve problems through the lens of their own interests, without good oversight from the broader organisation on developer experience and economies of scale.

I've seen and worked in environments at both ends of that scale and the gap in dev-ex, and agility as a result can be absolutely staggering.

So often it's to avoid 'vendor lock-in', only for the infra team to become the 'vendor'. It's complex, so they grow by necessity to be expensive, but they still lack the resources to provide a clean experience that can be easily migrated off of, resulting in lock-in anyway.

As a dev, the cost isn't my concern, but what's frustrating is knowing that it's possible to deploy a new service, with all the bells and whistles, in an hour, and being unable to.


The problem in complex, large infrastructures is that I know exactly what I need to do and have no idea how to even get started. There are no obvious entry points; there are layers upon layers of abstractions that make developing IaC easy for those who maintain it, but they reduce comprehensibility to almost zero.

My most recent example: deploy a lambda function. Building the image: 1 day. Deploying it to prod: 2 weeks.


Those pesky details like cache line misses and inter node latencies should just go away. I am far too important to know about such realities. Why should I care about cores or processor architecture, I should just be able to write my code and something should figure out the details.

If you agree with the above statement, please go into management.


Wow, it's pretty hard to read and understand someone who has this mindset. The best developers understand the platform their apps run on and the infrastructure needed for it. There is no artificial separation between platform and apps; you cannot run your beautiful app without the infrastructure it needs. This is coming from someone who has done both and still does. Many times, in new teams especially, you must create the infrastructure and then the apps. Guess what: if you can't, you don't move forward. I think you want to write some code and then throw it over a wall and make it someone else's responsibility. Not very useful ;(. How about testing, is that for someone else to do? You must be new to this game.


I don't agree that what we're doing stops at enabling other teams to build. That's part of the role but reducing it to that siloes infrastructure away from other domains and adds unnecessary friction.

You shouldn't have to know about the YAMLs and the k8s, but you should know about infrastructure concerns and how they relate to you and your design choices, just as infra guys should be aware of what's going on on the front side. Having to work with close-minded counterparts who refuse to elaborate further than "make it work now, or else" is about the worst thing there is. Being willing to participate in good-faith back-and-forth discussions around those subjects is important and pays dividends.


Honestly I’ve never worked anywhere that made deployment easier for application teams than with Kubernetes. The alternative to infrastructure details leaking everywhere has historically been a tedious coordination process with an Ops team whose incentive is to say “no” as often as possible. The deployment frequency drops precipitously and consequently the size of changes goes up significantly so when something inevitably breaks you have to sift through tons of changes to find the culprit and you have to coordinate even more to fix the release. And since this one ops team is supporting every team at the company, “coordination” hurts. I’m sure it doesn’t have to be that way—that there are some all-star ops teams who can make a more traditional VM-based infrastructure run as smooth as the median SRE team can build out a Kubernetes platform, but I’ve never seen it.

It’s also worth noting that sometimes the details leak because app teams want them to leak—they want to build some feature that requires some particular infrastructure solution (e.g., we want our app to run some background task in response to requests, and we don’t want to do it in the VM lest the autoscaler—which is background-task-agnostic—kill it while the background task is running).


This type of prima donna attitude ultimately negatively impacts customer experience. If your code is really that amazing and such gospel can only come from your hands, consider hiring a real software engineer to do the grunt work of making it work robustly, securely and in a performant manner on the substrate to which it is deployed.



