Microservices require distributed debugging; distributed debugging requires distributed tracing. Just imagine, as I've been trying to push forward in my own Ph.D. work, that you could debug a process across microservices. This is why we want this; ideally done a bit more resiliently and thoroughly than the original OP.
…or, I guess, Barbara Liskov in 1987, with Argus. And yet, we still seem to debug programs interactively in isolation. Perhaps it's because all of those systems assumed a system developed in isolation: one whose components didn't evolve independently and weren't implemented in different programming languages communicating through different network protocols instead of function/method invocations.
Academics have tried to make this a reality for years. I suggest revisiting Waldo's "A Note on Distributed Computing" and working forward from there. If you want to go back further, look at Argus, Emerald, and the original Hermes (from DEC).
As someone who both programmed Erlang professionally and published academically at Erlang venues for a long time: no.
These optimizations "for runtime" are not well supported by Erlang (i.e., cluster performance changes dramatically when the behavioral characteristics of message passing switch from local to remote to remote-cluster), and this was already discussed in Waldo's paper back in the 90s. Dynamic relocation is not well supported either (i.e., unless you use global, which falls apart quickly under network anomalies, about which I, and several others, have written papers), and the runtime hardly provides any introspection into cluster performance.
Sadly, distributed Erlang had the edge on programming distributed systems almost 20 years before they became pervasive, but has since been left to atrophy and hasn't seen any real innovation in quite a long time.
OpenTelemetry's Java implementation does this, but it actually does it in a way that non-GRPC things can access the context as well: it ensures that the context propagates through both the CoroutineContext and the thread-local state that OpenTelemetry itself uses to carry tracing information, so Java code called from a Kotlin coroutine still sees it.
e.g., I handle a request, get the incoming context, and have to stash it because I might execute a coroutine that is suspended/resumed across different threads; then I execute another GRPC call in a Java library that happens to start on one thread, get rescheduled, and resume on receiving the response on a different thread, in a possibly different thread pool.
The OpenTelemetry handling for this is quite complex: it must be used as a javaagent so it can actually instrument the underlying libraries with the code necessary for handling thread scheduling/context switches in thread pools (e.g., ForkJoinPool), in threads themselves with cooperative scheduling in application code (e.g., Thread), and in Kotlin's coroutine handling, which is mostly codegen (e.g., async, suspend fun).
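For what it's worth, if you wire this up by hand instead of relying on the javaagent, the io.opentelemetry.context API does expose explicit wrapping helpers. A minimal sketch of the idea, assuming an application-owned executor (and note it only covers the plain-Java thread-hop case, not Kotlin coroutines, which is exactly why the agent exists):

    import io.opentelemetry.context.Context;

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ManualPropagation {
        public static void main(String[] args) {
            // Plain executor: tasks submitted here do NOT see the caller's context.
            ExecutorService pool = Executors.newFixedThreadPool(2);

            // Context.taskWrapping() captures Context.current() at submission time
            // and re-installs it on whatever worker thread runs the task.
            ExecutorService propagating = Context.taskWrapping(pool);

            propagating.submit(() -> {
                // Context.current() here matches the submitter's context, so
                // trace/span identifiers stored in it survive the thread hop.
            });

            // Equivalent one-off form: wrap a single Runnable before handing it off.
            Runnable task = Context.current().wrap(() -> { /* downstream call here */ });
            pool.submit(task);
            pool.shutdown();
        }
    }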
Finally, in my own Ph.D. work, we did a similar thing to propagate trace identifiers for a dynamic analysis for fault injection, and we quickly ran into a problem: not only is the propagation difficult in itself, but you also run the risk of running out of header space if you store any (longish?) information, because GRPC over HTTP2 enforces a maximum allowed header size.
I'm also not sure what you mean by "the context doesn't propagate between containers and/or pods" -- GRPC isn't aware of these Docker/Kubernetes aspects at all.
Do you actually mean that, unless explicitly propagated to a subsequent downstream RPC, the data is dropped? If so, that's by design.
However, most large-scale organizations that are doing distributed tracing (e.g., Twitter, Uber) have either invented, reproduced, or leveraged OpenTelemetry's design for this precise thing.
Naive context propagation isn't (really) the difficult part of most of these designs -- it's what you've done: using an interceptor, reading the data, and assigning it automatically on subsequent requests (roughly the pattern sketched below). The challenge is dealing with this under many different, real-world conditions:
- concurrency and thread scheduling;
- not all services using the same version of downstream RPC libraries;
- not all calls being GRPC -- some use HTTP (and different HTTP libraries, at that);
- crossing message-passing boundaries: i.e., I receive a request, write to a Kafka queue or a reliable workflow backend (e.g., Cadence, Temporal), re-read the request, and then execute a subsequent RPC as a result of that message.
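For reference, here is a rough sketch of that naive interceptor pattern against grpc-java's ClientInterceptor API; the x-trace-id header name and the ThreadLocal holder are made up for illustration (real deployments use something like the W3C traceparent header and a context library):

    import io.grpc.CallOptions;
    import io.grpc.Channel;
    import io.grpc.ClientCall;
    import io.grpc.ClientInterceptor;
    import io.grpc.ForwardingClientCall;
    import io.grpc.Metadata;
    import io.grpc.MethodDescriptor;

    public class TracePropagatingInterceptor implements ClientInterceptor {
        // Hypothetical header name for illustration only.
        static final Metadata.Key<String> TRACE_ID =
            Metadata.Key.of("x-trace-id", Metadata.ASCII_STRING_MARSHALLER);

        // Hypothetical holder, populated by a matching ServerInterceptor on the inbound side.
        static final ThreadLocal<String> CURRENT_TRACE = new ThreadLocal<>();

        @Override
        public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
                MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
            return new ForwardingClientCall.SimpleForwardingClientCall<ReqT, RespT>(
                    next.newCall(method, callOptions)) {
                @Override
                public void start(Listener<RespT> responseListener, Metadata headers) {
                    String traceId = CURRENT_TRACE.get();
                    if (traceId != null) {
                        headers.put(TRACE_ID, traceId);
                    }
                    // If this outgoing call runs on a different thread than the one that
                    // handled the inbound request, CURRENT_TRACE is null here and the
                    // context silently drops -- the failure mode described above.
                    super.start(responseListener, headers);
                }
            };
        }
    }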
If you're using Kotlin, I suspect you will run into these challenges. Tune your thread pools up/down, restrict your JVM's resources, and you'll suddenly see that if the thing that handles the request uses different threads/coroutines/etc. than the code block that issues the downstream RPC, you'll start dropping the context without explicit handling of that case.
In fact, a very simple test case in Java, where several concurrently executed CompletableFutures each issue RPCs on a very small thread pool, should be enough to see the issue.
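Something along these lines (a sketch only; the ThreadLocal stands in for whatever the interceptor stashed, and no actual RPCs are needed to reproduce the drop):

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class DroppedContextTest {
        // Stand-in for the stashed trace context (what an inbound interceptor populated).
        static final ThreadLocal<String> TRACE = new ThreadLocal<>();

        public static void main(String[] args) {
            ExecutorService pool = Executors.newFixedThreadPool(2); // deliberately tiny

            TRACE.set("request-abc123"); // set on the "request handling" thread

            List<CompletableFuture<String>> calls = IntStream.range(0, 8)
                .mapToObj(i -> CompletableFuture.supplyAsync(() -> {
                    // Stand-in for "issue a downstream RPC": read the context the
                    // client interceptor would attach to outgoing headers.
                    return TRACE.get(); // null -- the worker threads never saw the set()
                }, pool))
                .collect(Collectors.toList());

            calls.forEach(f -> System.out.println("outgoing trace id: " + f.join()));
            pool.shutdown();
        }
    }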
Thanks for the comment, it seems like you know a lot more about this than I do.
This is a solution that has worked well for my company, which averages <1 req/s, so, yes, I have not tested it under more extreme conditions. This is version 1.0.0, so it is quite new and naive by design. I was posting here to get some feedback on the initial version and see how I can improve it, which you have given me!
Feel free to contribute to the project! It seems like your expertise applies nicely!
It's too bad that there is still no JEP for an official context in Java.
Disclaimer: I wrote the OTel one, and I am sad to see yet more context implementations being made (including the OTel one); these really all need to be on the way out.
I went to the Community College of Rhode Island ('00-'05), enjoyed it quite a bit, and felt that, while I had some weird and unnecessary courses, I actually learned a lot. I recommend community college for those who can't get a traditional university education because of finances, family obligations, or other reasons.
In '98, I started working for an internet company (right when the internet was becoming commonplace), and in 2000, I graduated from high school. I was admitted to the University of Rhode Island and didn't like it at all -- I was a commuter and spent most of the time I wasn't at school at my job. I dropped out 3 days into my second semester.
I worked full-time at that company for several years while going to night school 3-4 nights a week; I went on to become a junior developer at Berklee College of Music (building their online music school), was promoted to senior, and then managed a team there. Once I graduated from CCRI, I moved to Northeastern University and did the online program, which was quite new in 2006; then, to finish my final requirements, I took courses at night after my 9-5 at Berklee. I'm now a Ph.D. student at Carnegie Mellon University -- a top-tier research university -- working on resilience engineering; I'm proposing this year and hoping to defend about 1.5 years from now.
I had to take a lot of weird courses during my CCRI experience. I had to take introduction to computers in '00: the professor gave a quiz at the start of class each week, and I would show up late because I worked a real job. I failed the first semester: one of the quizzes was on using the mouse. I also took introduction to the internet; however, I basically got an immediate A because the university bought internet access through my company and I was the contact. I had to take classes on Microsoft Excel, Word, Access, etc. I passed them. This was peak 2000.
By far, the most important classes I ever took were the following:
- Intro, Intermediate, and Advanced C# programming. I had never done C# programming before, .NET was VERY NEW, and we wrote a few desktop programs. The professor was hard; I got an A for building a CIDR subnet calculator, which no one understood because... hey, it was 2001. I didn't touch .NET again until I got a research internship position at Microsoft Research in Redmond, where I had to program C# and somewhat knew what I was doing.
- Advanced Databases. While I had to build everything in Microsoft Access, my professor made our project group go to another organization in the university, interview them about their problems, and build a database and form interface for interacting with it based on our interviews. This was the most real software engineering course I ever took, until I taught CMU-313, which is designed that way. Let me emphasize the time difference here: that course I took at CCRI was in 2002/3 and I taught CMU-313 in 2021. We need more of that style of course.
All in all -- do community college if that's what you have access to, work hard, work with your professors, and it will (hopefully) not be a limiting factor.
I should be clear (as both someone who experienced this and someone who has been involved in MS/Ph.D. admissions):
Brown University admitted me as a "special student" and allowed me to take master's courses as long as I paid the $6k fee per course (in 2013), but when I asked about pursuing an actual master's or Ph.D.-level degree, I was told, rather disrespectfully, that there was absolutely no chance since I didn't have a computer science degree from a major university. Therefore, I had to independently seek out my own research opportunities to build a resume that would allow me to be admitted to an actual CS institution.
Therefore, I would say that if your goal is research, focus on the more traditional path.
For sure, we believe the algorithm changes that we made work across all languages; they are inherent to the architecture of microservice applications and not specific to a particular programming language.
That said, we currently do not have any client libraries for node.js. Therefore, we would probably have to write client libraries for the common GRPC and HTTP libraries that are used in node.js. As I am not a node.js programmer, I do not know what those are. However, since we have written client libraries for Python (HTTP and GRPC clients) and Java (one HTTP client and the primary GRPC client), we believe this is a "small matter of programming."