It's a strict improvement over the technology before LLM, which usually just ass...

RolfRolles · 2025-01-31T01:51:47 1738288307

No, it's really not a strict improvement. A meaningless name like `v2` does at least convey that you, as the analyst, haven't understood the role of the variable well enough to rename it to something that fits its inferred purpose. If the LLM comes up with an "informative" name that is poorly suited to what the variable actually does, that name can waste your time by misleading you about the variable's role.

skissane · 2025-01-31T03:53:36 1738295616

I think Ghidra could do better even without any LLM involved. Ghidra will define local variables like this:

SERVICE_TABLE_ENTRY* local_5c;

I wish it at least did something like:

SERVICE_TABLE_ENTRY* local_5c_pServiceTableEntry;

Oh yeah, there’s probably some plugin or Python script to do this. But I just dabble with Ghidra in my spare time.
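
As a rough illustration of the suffix idea, here is a minimal pure-Python sketch that derives a name like the one above from a default name and a type string. It does not touch Ghidra's scripting API; `suggest_name` and its naming rules (a `p` per pointer level, CamelCase from underscore-separated words) are assumptions made up for this example.

```python
import re


def suggest_name(default_name: str, type_name: str) -> str:
    """Append a type-derived suffix to a default name (hypothetical helper)."""
    # One 'p' prefix per level of pointer indirection.
    pointer_depth = type_name.count("*")
    base = type_name.replace("*", "").strip()
    # Split the type name on underscores and whitespace.
    words = [w for w in re.split(r"[_\s]+", base) if w]
    # ALL-CAPS words become Capitalized; mixed-case words keep their inner case.
    camel = "".join(
        w.capitalize() if w.isupper() else w[:1].upper() + w[1:] for w in words
    )
    return f"{default_name}_{'p' * pointer_depth}{camel}"


print(suggest_name("local_5c", "SERVICE_TABLE_ENTRY*"))
```

Because the suffix is computed only from the default name and the current type, rerunning the function after a retype regenerates it, which is one possible answer to keeping the two in sync.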

It would be great if it tracked the origin of a variable/parameter name, and could show them in a different colour (or some other visual distinction) based on their origin. That way you could easily distinguish “name manually assigned by analyst” (probably correct) vs “name picked by some LLM” (much more tentative, could easily be a hallucination).
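
One way to picture the origin-tracking idea is a small record that stores where each name came from and picks a display colour from it. This is only a sketch: the `Origin` categories, the `NamedVariable` class, and the ANSI colour choices are all hypothetical and are not taken from Ghidra.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    DEFAULT = "default"  # auto-generated placeholder such as local_5c
    ANALYST = "analyst"  # renamed by hand, probably correct
    LLM = "llm"          # suggested by a model, treat as tentative


# ANSI colour codes chosen arbitrarily for this sketch.
COLOURS = {
    Origin.DEFAULT: "\033[90m",  # grey
    Origin.ANALYST: "\033[32m",  # green
    Origin.LLM: "\033[33m",      # yellow
}
RESET = "\033[0m"


@dataclass
class NamedVariable:
    name: str
    origin: Origin

    def render(self) -> str:
        """Return the name wrapped in the colour assigned to its origin."""
        return f"{COLOURS[self.origin]}{self.name}{RESET}"


variables = [
    NamedVariable("local_5c", Origin.DEFAULT),
    NamedVariable("pServiceTable", Origin.ANALYST),
    NamedVariable("retryCount", Origin.LLM),
]
for v in variables:
    print(v.render(), f"({v.origin.value})")
```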

rgovostes · 2025-01-31T22:20:31 1738362031

In my view one of the most pressing shortcomings of Ghidra is that it can't understand the lifetimes of multiple variables with overlapping stack addresses: https://github.com/NationalSecurityAgency/ghidra/issues/975

Ghidra does have an extensive scripting API, and I've used LLMs to help me write scripts to do bulk changes like you've described. But you would have to think about how you would ensure the name suffix is synchronized as you retype variables during your analysis.

skissane · 2025-02-01T02:35:34 1738377334

Yeah, I don't know why they don't use something like SSA – make every line of code which performs an assignment create a new local variable.

Although I suppose when decompiling to C, you need to translate it out of SSA form when you encounter loops or backwards control flow.
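
To make the "new variable per assignment" idea concrete, here is a minimal sketch of SSA-style renaming over straight-line code. The toy IR (a list of `(target, operands)` pairs) and the `to_ssa` function are invented for this example. It deliberately ignores branches and loops, which is exactly where real SSA needs phi functions at control-flow merges and where translating back to C gets harder.

```python
from collections import defaultdict


def to_ssa(assignments):
    """Rename straight-line assignments so each target is defined exactly once.

    `assignments` is a list of (target, operands) pairs in a made-up toy IR.
    """
    version = defaultdict(int)  # number of assignments seen so far per variable
    result = []
    for target, operands in assignments:
        # Reads use the latest version; names never assigned (e.g. parameters)
        # are left unchanged.
        renamed = [f"{op}_{version[op]}" if version[op] else op for op in operands]
        # Each assignment creates a fresh version of its target.
        version[target] += 1
        result.append((f"{target}_{version[target]}", renamed))
    return result


code = [("x", ["a"]), ("y", ["x", "b"]), ("x", ["y"]), ("z", ["x"])]
for target, operands in to_ssa(code):
    print(target, "<-", ", ".join(operands))
```

Here the second assignment to `x` becomes `x_2`, so its lifetime is distinct from `x_1` even if both would share a stack slot.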

rgovostes · 2025-01-31T00:17:40 1738282660

I disagree with it as a blanket statement. It’s related to the problem of hallucinations in LLMs; sometimes they come up with plausible, misleading bullshit. The user quickly relies on the automatic labeling as a crutch, while exhausting mental effort trying to distinguish the bullshit from the accurate labels. The cryptic symbols do not convey meaning but consequently can’t mislead you.

I don’t reject the whole concept and am bullish on AI-assisted decompilation. But the UX needs to help the user have confidence in the results, just like source code-level static analyzers generate proofs.