
Antithesis employee here. Happy to jump in and answer any burning questions people might have about multiverse debugging.


Is the hypervisor multicore? How do you handle shared memory non-determinism? What is the runtime slowdown for shared memory multicore execution (let's say 16 cores, if you need a concrete example)?


Found the answer in a different post [1]. The hypervisor and virtual machines are single-core only. The talk also indicates that all I/O operations need to be manually rewritten to use the instrumented mechanism, so it demands a highly paravirtualized guest OS. Logically, that means there are probably no cross-VM shared memory interfaces either. So, no shared memory and thus no need to deal with shared memory non-determinism.

This is just a standard replay engine from what I can tell.

[1] https://news.ycombinator.com/item?id=41501577


No, we don’t require any paravirtualization at all, and nothing needs to be manually rewritten. I’m not sure where you got that impression.

It also is not in any sense a replay engine. We don’t need to record anything except the inputs!


At timestamp 23:40 in the video by Alex Pshenichkin from 2024-06-10, it says data ingestion comes via VMCALL interactions. As such a call is literal nonsense if you are not virtualized, any such call inherently means you are using a paravirtualized interface. Now maybe FreeBSD has enough standardized paravirtualized drivers similar to virtio that you can just link it up, but that would still be a paravirtualization solution with manual rewrites, just somebody else already did the manual rewrites. Has the fundamental design changed in the last 3 months?

This is exactly a replay engine (or I guess you could say replay engines are deterministic simulators). How do you think you replay a recording except with a deterministic execution system that injects the non-deterministic inputs at precise execution points? This is literally how all replay engines work. Furthermore, how do you think recordings work except by recording the inputs? That is literally how all recording systems designed to feed replay engines work. The only distinction is what constitutes non-determinism in a given context. At the whole-hypervisor level, it is just I/O into the guest; at the process level, it is just system calls that write into the process; at the threading level, it is all writes into the process. These distinctions are somewhat interesting at an implementation level, but they do not change the fundamental character of the solution, which is that they are all replay engines or deterministic simulators, whatever you want to call them.
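
For concreteness, here is a minimal sketch of the record/replay pattern I'm describing (Python, all names illustrative, nothing to do with Antithesis internals): recording means logging each non-deterministic input at the execution point where it enters the program, and replaying means injecting the same values back at the same points.

    # Minimal record/replay sketch; illustrative only, not Antithesis code.
    import json, random, time

    class Recorder:
        def __init__(self):
            self.log = []                     # ordered non-deterministic inputs

        def nondet(self, kind, value):
            self.log.append({"kind": kind, "value": value})
            return value

        def now(self):                        # wall-clock time is an input
            return self.nondet("time", time.time())

        def rand(self):                       # so is entropy
            return self.nondet("rand", random.random())

        def save(self, path):
            with open(path, "w") as f:
                json.dump(self.log, f)

    class Replayer:
        def __init__(self, path):
            with open(path) as f:
                self.log = iter(json.load(f))

        def nondet(self, kind, _value=None):
            entry = next(self.log)            # inject the recorded value
            assert entry["kind"] == kind      # at the same execution point
            return entry["value"]

        def now(self):
            return self.nondet("time")

        def rand(self):
            return self.nondet("rand")

    def workload(env):
        # Deterministic logic + injected inputs => deterministic re-execution.
        return [env.rand() for _ in range(3)] + [env.now()]

    rec = Recorder()
    first = workload(rec)
    rec.save("trace.json")
    assert workload(Replayer("trace.json")) == first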


> Let’s get more concrete. Let’s use this to solve a real problem. My server has crashed and its process has exited! No worries, I’ll just rewind time, attach a debugger to the process, and set a breakpoint or capture a thread dump:

Is this kind of stuff only possible in an Antithesis Environment?


Yes, unfortunately we have not figured out how to rewind time in the real world yet. When we do, there are a lot of choices I'm going to revisit...


... but the intro makes it sound like this system is valuable in investigating bugs that occurred in prod systems:

> I’ve been involved in too many production outages and emergencies whose aftermath felt just like that. Eventually all the alerts and alarms get resolved and the error rates creep back down. And then what? Cordon the servers off with yellow police tape? The bug that caused the outage is there in your code somewhere, but it may have taken some outrageously specific circumstances to trigger it.

So practically, if a production outage (where I think "production" means it cannot be in a simulated environment, since the customers you're serving are real) is caused by very specific circumstances, and your production system records some, but not every attribute of its inputs and state ... how does one make use of antithesis? Concretely, when you have a fully-deterministic system that can help your investigation, but you have only a partial view of the conditions that caused the bug ... how do you proceed?

I feel like this post is over-promising but perhaps there's something I just don't understand since I've never worked with a tool set like this.


(I work at Antithesis)

I think you're right that the framing leans towards providing value for prod issues, but we left out how we provide value there. We're just used to experiencing that value ourselves, so it deserves some explanation.

Basically this is where guided, tree-based fuzzing comes in. If something in the real world is caused by very specific circumstances, we're well positioned to have also generated those specific circumstances. This is thanks to parallelism, intelligent exploration, fault injection, our ability to revisit interesting states in the past with fast snapshots, etc. (see the rough sketch at the end of this comment).

We've had some super notable instances where a customer finds a bug in prod, realizes it's that weird bug they've been ignoring that we surfaced a month ago, and then uses this approach to debug it.

The best docs on this are probably here: https://antithesis.com/docs/introduction/how_antithesis_work...
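
And if it helps to visualize the exploration loop, here's a rough, hypothetical sketch of tree-style search over a deterministic simulation with cheap snapshots (none of this is Antithesis internals; Sim, interesting, and the scoring are made up for illustration):

    # Hypothetical sketch of snapshot-based tree exploration; not our internals.
    import copy, random

    class Sim:
        """Stand-in for a deterministic simulation of the system under test."""
        def __init__(self, seed):
            self.rng = random.Random(seed)
            self.state = {"queue_depth": 0, "crashed": False}

        def step(self):
            # Inject one unit of randomized input (requests, faults, delays...).
            self.state["queue_depth"] += self.rng.choice([-1, 1, 2])
            if self.state["queue_depth"] > 5:
                self.state["crashed"] = True

    def interesting(sim):
        return sim.state["queue_depth"]       # score: prefer stressed states

    frontier = [Sim(seed=0)]                  # root snapshot
    for _ in range(1000):
        parent = max(frontier, key=interesting)   # revisit a promising state
        child = copy.deepcopy(parent)             # "fast snapshot" stand-in
        child.step()
        if child.state["crashed"]:
            print("found a crashing history; it replays deterministically")
            break
        frontier.append(child)

Because every branch is deterministic, any crashing path found this way can be replayed exactly, which is what makes "we probably already generated those circumstances" useful in practice.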


This was my thinking as well. Prod environments can be extremely complicated and issues often come down to specific configuration or data issues in production. So I had a lot of trouble understanding how the premise is connected to the product here.


> Yes, unfortunately we have not figured out how to rewind time in the real world yet.

10 bucks says you get complaints for not implementing the "real world" feature.


The intro mentions that ordinarily, we have to pay a high upfront cost to record info that we might need to debug later.

> When we succeed at this, we collect huge volumes of logs “just in case” they provide some crucial clue, incurring equally huge storage costs.

The 'packets from the past' section says we can just retroactively decide what we should have recorded.

Doesn't that mean we're effectively recording everything always? What's the cost of this? Or is all of this under the assumption that we never have to debug something that happened outside of the simulation environment, e.g. in response to an actual in-bound request from a customer? If this is just saying we can afford to save everything in our development environment ... well in that context recording the logs probably wasn't a "huge storage cost" either, right? Or am I missing something basic here?


You're right that if you tried to do something like this using record/replay, you would pay an enormous cost. Antithesis does not use record/replay, but rather a deterministic hypervisor (https://antithesis.com/blog/deterministic_hypervisor/). So all we have to remember is the set of inputs/changes to entropy that got us somewhere, not the result of every system operation.
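
To make the distinction concrete, here's a toy sketch (hypothetical names, nothing Antithesis-specific) of why deterministic execution lets you keep only the seed and external inputs: everything downstream can be regenerated on demand, which is what makes "retroactively deciding what to record" affordable.

    # Toy sketch of "remember only inputs/entropy"; illustrative only.
    import hashlib, random

    def run(seed, external_inputs):
        """Deterministic execution: same seed + inputs => same full history."""
        rng = random.Random(seed)
        state, history = 0, []
        for msg in external_inputs:
            state = (state * 31 + msg + rng.randrange(1000)) % (10**9 + 7)
            history.append(state)         # detail we never need to store
        return history

    seed, inputs = 42, [7, 3, 9, 1]       # the only things kept on disk
    digest = hashlib.sha256(str(run(seed, inputs)).encode()).hexdigest()

    # Later, to inspect anything "from the past", just re-run the same inputs:
    assert hashlib.sha256(str(run(seed, inputs)).encode()).hexdigest() == digest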


The classic time/space tradeoff question: if I run Antithesis for X time, say 4 hours, do you take periodic snapshots/deltas of state so that I don't have to re-run the capture for O(4 hours) from scratch just to go back 5 seconds?


Yes! See Alex's talk here: https://www.youtube.com/watch?v=0E6GBg13P60

In fact, we just made a radical upgrade to this functionality. Expect a blog post about that soon.
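
For anyone curious about the shape of the tradeoff, here's a hypothetical sketch (not Antithesis internals): checkpoint every K steps, and rewinding to an arbitrary point then costs at most K steps of deterministic re-execution from the nearest checkpoint, rather than a replay from the start.

    # Hypothetical snapshot/re-execution sketch; illustrative only.
    SNAPSHOT_EVERY = 1000                      # steps between checkpoints

    def step(state, t):
        return (state * 1103515245 + t) % (2**31)   # deterministic transition

    snapshots, state, probe = {0: 0}, 0, None
    for t in range(1, 100_001):
        state = step(state, t)
        if t % SNAPSHOT_EVERY == 0:
            snapshots[t] = state               # modest storage cost
        if t == 99_995:
            probe = state                      # remember one value to check

    def state_at(target):
        """Rewind: restore nearest earlier snapshot, replay only the delta."""
        base = max(s for s in snapshots if s <= target)
        s = snapshots[base]
        for t in range(base + 1, target + 1):
            s = step(s, t)                     # at most SNAPSHOT_EVERY steps
        return s

    assert state_at(99_995) == probe           # "go back 5 seconds" is cheap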


Is there a way to take this for a test drive without talking to sales? :)


Thanks for this, Phil! Blog posts like this one that break down complex topics into digestible pieces are a big help for the space and are some of my favorites.

Antithesis employee here. Happy to jump in and answer any burning questions people might have about Deterministic Simulation Testing (DST).

