
It is not ridiculous at all. Those things have pretty precise definitions and type segregation absolutely does remove a bunch of soundness issues related to type confusion.

You can think of it as the rather classic "Vec of struct + numeric IDs" that is used a lot e.g. in Rust to represent complex graph-like structures.

This, combined with bounds checking, is absolutely memory safe. It has a bunch of correctness issues that can arise due to index confusion, but those are not safety issues. When combined with some kind of generational counter, those correctness issues also go away, but they are only caught at runtime, not at compile time (and they incur a runtime cost).
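
As a rough sketch in C (hypothetical names like node_pool and node_handle; the Rust version would be a Vec of slots with the same checks), the pattern looks something like this:

  #include <stdint.h>
  #include <stddef.h>

  /* Hypothetical illustration of "array of structs + handles with generations". */
  typedef struct { uint32_t index; uint32_t generation; } node_handle;

  typedef struct {
      int      value;
      uint32_t generation;   /* bumped every time the slot is reused */
      int      live;
  } node_slot;

  typedef struct {
      node_slot slots[1024];
      size_t    count;
  } node_pool;

  /* A stale or out-of-range handle yields NULL instead of a pointer into
     freed-and-reused storage: a correctness error caught at runtime,
     never a memory-safety violation. */
  int *node_get(node_pool *p, node_handle h) {
      if (h.index >= p->count) return NULL;            /* bounds check */
      node_slot *s = &p->slots[h.index];
      if (!s->live || s->generation != h.generation)
          return NULL;                                 /* generation check */
      return &s->value;
  }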

Rust's memory safety is about avoiding liveness issues (that become type confusions since all memory allocators will reuse memory for different types), nothing more, nothing less.


> but those are not safety issues.

They are not memory safety issues, but they can definitely lead to security issues via some sort of confused deputy attack.

For example a capability based system that relied on just this form of memory safety would be pointless.

Of course this can be mitigated by adding version counters to objects and object selectors.


FWIW, arrays of structs + integer handles is the primary way objects are represented in performance-engineered C++.


Weird that this treats uninitialized variables as unknown values. For example in ex3.c, the program

  int main(){
      int x;
      if (x <= 42){
              assert(x != 12345);
      }
  }
is of course UB in C, even though under an "uninitialized is random" model the program is valid and does not assert (as the model checker concludes).

(Even at -O1, clang gets rid of the whole function, including the ret instruction. I'm surprised it does not at least leave a ud2 for an empty function to help debugging, since it would not cost anything: https://godbolt.org/z/eK8cz3EPe )


Well, specifically, it counts uninitialized variables as being set to a non-deterministic value. The point of this tool isn't to optimize functions as a compiler would, but rather to find bad behavior based on its machine model.

This isn't perfect, of course. In this case, the compiler can rightly treat x as uninitialized, meaning that its value could be <= 42 at one point, and not be <= 42 at another point. Since the uninitialized variable isn't "pinned", it could technically be different values in different locations.

CBMC's machine model works differently. In this case, x is assigned a non-deterministic value. The branch condition creates a refinement of this value within the scope of that statement. If the branch is taken, then by SMT rules, x can't equal 12345, because it was already refined as being <= 42.
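
A minimal sketch of that refinement using CBMC's own idioms (a bodyless nondet_ function and the __CPROVER_assume built-in; this is the user-visible convention, not the tool's internal representation):

  #include <assert.h>

  int nondet_int(void);   /* no body: CBMC treats the return value as non-deterministic */

  int main(void) {
      int x = nondet_int();        /* roughly how the model treats the uninitialized x */
      __CPROVER_assume(x <= 42);   /* the taken branch refines the value */
      assert(x != 12345);          /* holds under the refinement, so the check passes */
  }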

On its own, a model checker can miss situations like these. It's why I recommend -Wall -Werror -Wpedantic in conjunction with CBMC. The compiler should catch this as a warning, and it should be upgraded as an error.


> In this case, the compiler can rightly treat x as uninitialized, meaning that its value could be <= 42 at one point, and not be <= 42 at another point. Since the uninitialized variable isn't "pinned", it could technically be different values in different locations.

LLVM has a "freeze" instruction to stop propagation of undefined values (although I think it was added later than the first version), so that the value is "pinned" as you say. However, an "undef" value, if you do not use "freeze", will not behave this way.

I think that the C compiler should "pin" such undefined values where they are used, but I don't know which compilers have an option to do this. (Perhaps CBMC should also have such a switch, so that you can use the same options that you will use with the C compiler.)

With this ex3.c file, the optimizer should be allowed to produce a function that does nothing and has an unspecified return value (which might or might not be the same each time it is executed). (If optimizations are disabled, then it should actually compile the conditional branch instruction.)


This is one reason why I explored model checking machine code output, since at this point, the behavior is very much defined, even if it differs from implied source behavior. But, this gets complicated for other reasons.


This is actually one of the reasons I've been tinkering with CBMC: https://www.philipzucker.com/pcode2c/ The idea is to use Ghidra to lift the machine code into a specialized C interpreter and then compare it to the source, or ask other questions, via CBMC.


This kind of analysis is excellent.

I do recommend standardizing on hardware and toolchain for projects like these, as it can ensure that the machine code is matched.

The third phase of my work will round the corner with machine code analysis. Currently, I'm working on constructive proofs and equivalence proofs of imported C code to handle the pieces that CBMC doesn't do so well, as part of my second phase. So far, I can extract efficient C from Lean 4, but I want to import C directly into Lean 4 to prove it equivalent to extracted code. Hand-written C is easier to read than the line noise I can extract.

In particular, model checking doesn't fare well with data structures, algorithms, and cryptography. These can be torn down quite a bit, and at least we can verify that the algorithms don't leak memory or make bad integer assumptions. But, if we want to verify other properties, this requires constructive proofs.

Between 80 and 95 percent of the time, depending on the type of code being written, CBMC is enough. I'm now solving for that 5 - 20 percent of code that CBMC can't easily tackle.


> model checking doesn't fare well with data structures, algorithms, and cryptography

To the contrary, that's what I'm using it for in most of my projects. It found interesting algorithmic bugs in my ctl find function (3-way comparison callback missing cases) and is used to crack poor hashes. My tiny-regex matcher also benefited from it.

Also a lot of baremetal firmware.


ooooooh. link?

What do you mean by standardizing on hardware and toolchain? I am currently choosing ghidra and cbmc, and it seems like the approach is applicable to any compiler or arch that ghidra supports without too much change. I have only been doing x86-64 and gcc so far though


Sadly, I don't have a link for this yet. My goal is to write this up once I have a compelling example.

In the meantime, check out my github. https://github.com/nanolith

Currently, I'm feeding commits into libcparse, which is the start of this effort. That stream is currently about 180 days out of phase with what's in my local repositories, but it is slowly trickling in. The first step of this second phase will be to use libcparse to verify itself via constructive proofs in Lean. Currently, and by that I mean what hits in March or April, libcparse has a mostly C18 compliant lexical scanner for the C preprocessor. The goal is to have a SAX-like C18 parser that can detect all C declarations and definitions. Tooling will include a utility that imports C declarations and definitions into both Lean 4 and Coq. This is a moonlighting effort for an upcoming book I intend to write.


As for standardizing on hardware and toolchain, what I mean is that I try to get projects to commit to using specific hardware and a specific toolchain. The goal is to limit variables for automated testing and analysis.

OpenBSD, in particular, is adamant about testing their OS on real hardware for platforms that they support. Their definition of "working" is on real hardware. I think that's a reasonable goal.

It helps with analysis like this, as you don't have to worry about different compilers or different hardware. GCC, in particular, can produce different output depending on which hardware it was compiled on, unless a specific configuration is locked down. If there is a flaw in how it does vector operations on a particular Xeon processor, this won't be caught if testing locally on a Core i5 processor that produces different code.


Since any use of the variable is undefined behavior, that's what we want to be informed about.

The reasoning about nondeterministic values should be spared for situations when it's other than undefined behavior. For instance, accessing an uninitialized structure member that came from malloc isn't undefined behavior. It's (I think) unspecified behavior. In an implementation that has trap representations for some types, it could hit a trap.


That's understandable, and I would suggest reaching out to the CBMC developers. It shouldn't be difficult to add an uninitialized variable pass to the solver.

Practically speaking, -Wall -Werror should catch this. Any use of a tool like CBMC should be part of a defense in depth strategy for code safety.

It does in clang.

   $ cc -Wall -Werror bar.c
   bar.c:5:11: error: variable 'x' is uninitialized when used here [-Werror,-Wuninitialized]
      if (x <= 42){
          ^
   bar.c:4:12: note: initialize the variable 'x' to silence this warning
      int x;
           ^
            = 0

   $ cc --version
   clang version 16.0.6
It also does in gcc.

   $ gcc -Wall -Werror bar.c
   bar.c: In function 'main':
   bar.c:5:10: error: 'x' is used uninitialized [-Werror=uninitialized]
    5 |       if (x <= 42){
      |          ^
   bar.c:4:11: note: 'x' was declared here
    4 |       int x;
      |           ^
   cc1: all warnings being treated as errors

   $ gcc --version
   gcc 12.2.0


> I would suggest reaching out to the CBMC developers

For something I don't use?

How about this: I would suggest the CBMC developers read HN comments about their stuff, when it comes up.


You're right, notifying them directly would be too much effort.

The sensible way for developers to become aware of possible improvements they could make is for them to constantly scan the rest of the internet.


I do that! Some of my projects are packaged in distros. In the past, I've found discussions of problems among distro people, who didn't contact upstream at all, but just tried to patch things on their own. I fixed things on my end, and in such a way that their patch would fail to apply when they update. No words exchanged at all.

I'm not a user of CBMC. I read about it here, I wrote a comment here, and that's the end of it.

But for a minute I will bite. The submission has no link to the actual project. I found that by a web search. The project has no project trappings: no mailing list, or anything of the sort. The page says, "For questions about CBMC, contact Daniel Kroening". The name is a link to his home page. His home page has no obvious contact information. There are links to papers; probably his e-mail address is given in some of them. Let's click on "A Tool for Checking ANSI-C Programs". Hey, e-mail links in that page, before we even get to the PDF. But ... from 2004. Is k-----g@cs.cmu.edu from two decades ago going to work, given that he's at Oxford?

Suppose I hunt down his current e-mail address. What do I say? "Dear Sir, I feel really awkward to be contacting you about your CBMC program, which I have never actually used, but believe me when I say I have a valuable suggestion ..."

If I used the thing, I'd fix it myself and send a patch!


https://github.com/diffblue/cbmc The github repo is active and they are very responsive to issues. As you say, it is unnecessary and probably undesirable to post an issue for a tool you don't use. Links to it appear in multiple places in the submission. The state of the cprover website is unfortunate and represents the state of the project unfairly imo.


I do not have an account on github and do not engage projects that have selected github as their only means of interaction.


FWIW, I have had luck mailing old research email addresses; I still read my 20-year-old university email. It has one of the best signal-to-noise ratios of any communication channel I have.

I would not mail if it were only about uninitialized variables.


People can comment on a project in public discussion forums without then also having a responsibility to become a contributor to the project.


I certainly agree -- but then they should not expect anything to be done about whatever they're complaining about.

I interpreted kazinator's original comment as essentially saying "This looks like something I might use myself, but for this dealbreaker issue."


>Since any use of the variable is undefined behavior

use of the variable as an lvalue is not undefined


When I say that a variable is used, I mean that its value is accessed. This is in line with compiler jargon. In this basic block, the variable x has no next use at the first line; the assignment of 42 is a dead assignment.

  x = 42
  y = w
  x = 43
Search the web for "next-use information".


I recall trying to track down some fun bugs caused by uninitialized memory, which were made harder to find by the fact that the debug runtime zeroed memory allocations while the release runtime did not.


The declaration didn't reserve the address? What else is the point of having a declaration?

I don't think accepting the random initial value means that the value may continue to change until the first time you set a value.


A declaration doesn't reserve an address. It just means that "x is a variable of type M." How this actually works depends on the compiler.

Weird things happen to uninitialized variables during optimization. Basically, nothing can be assumed about the value, because using an uninitialized variable results in undefined behavior. The original un-optimized generated code may very well have some default value, maybe. Probably not. But, the optimizer can decide that this value is represented by register R0 here and R4 there, and since the variable is uninitialized, there's no point in doing a spill or a load. It really comes down to how the optimization passes were written.


I'll take your word for it, but I would say there is a point, which is that the symbol was declared.

And I'm not talking about a default value; this doesn't imply a default value, even though assigning an address will come with some value. But I do get that defining the name and type is a separate thing that doesn't necessarily require assigning an address any more than assigning a value.

The difference is, any default value would be arbitrary. 0 is just a value that has a meaning that the compiler can't know. It's almost as arbitrary as 37.

But picking the next available address to attach to the symbol is not like that.

The compiler always picks that itself, and the programmer always has to ask what it is when they want to know it. The compiler can safely do that much right at declaration time without making any assumptions. The value is still random, which is fine, because 0 is almost the same as random: it's a matter of chance whether 0 would actually be the best default value for this particular item vs 1, -1, size, -size, size/2, etc.


> But picking the next available address to attach to the symbol is not like that.

> The compiler always picks that itself

One of the best analogies I heard, years ago on IRC, is that good source code is like select cuts of meat and fat, and the compiler is like a meat grinder. What comes out of the meat grinder is hamburger. It bears little resemblance to the source code, syntax, or semantics. But, it is meat.

I'll talk about register machines with a stack, since code is generated this way on most architectures. But that's not necessarily something that can be relied upon. The compiler will preserve "relevant meaning" once the optimization passes are finished. Side effects and return values (if used by the caller, in the case of LTO) may be preserved, but that's largely it.

That being said, let's talk about what a variable is from the perspective of a compiler. A variable is decomposed into a value, which may be referenced, may be changed, or may need to be preserved. The optimizer aggressively strips away anything that isn't immediately needed. If the compiler needs to preserve a value, it can choose several ways of doing so. If the value is fixed, it may just become a constant in code. Fixed values can also be produced synthetically. If the value is set at runtime and referenced later, it could be set in one register and transferred to another register. If neither is possible, the value could be spilled to the stack. But this spill location is not necessarily assigned to this variable: it could be used for another variable later in the function when this one falls out of scope. Spill locations aren't necessarily fixed. On many ABIs, there are scratch registers, registers that must be preserved between calls (the called function must preserve them if used), and registers that can be overwritten after a call. How a value is preserved depends entirely on whether it must be preserved, and on which way of doing so is most efficient.

If a variable in source code is referenced but never assigned, then the optimizer can optimize away any "storage" (i.e. stack or a register) for this variable. When the variable is "read", the value used is arbitrary, if any value is used at all; the "read" has no meaning and has been optimized away. The value isn't fixed to a random value either at compile time or at runtime. It's whatever flotsam happens to be at that particular place in code at that particular time. If the value is "read" once in one location and once again in another location, even on the next line of source code, these values could be the same or entirely different. There's no telling what the compiler will do, because the optimizer is free to interpret this in any way. The only way to ensure that a value is preserved by the optimizer is to explicitly assign the variable a value in the source code. This assignment is respected, assuming that the variable is referenced afterward. Beyond that, we can't make any meaningful assumptions.


thank you


Uninitialized local variables are documented as a source of a certain type of nondeterminism: http://www.cprover.org/cprover-manual/modeling/nondeterminis...

So the checker treats them as defined, but with an unknown value. You could have written this instead:

    extern int unknown_int_value(void);
    int x = unknown_int_value();
And leave unknown_int_value undefined (so its definition is not visible to the analyzer). Or write a function and take x as a parameter.

I suspect CBMC does this to have a convenient syntax for this frequent scenario. Apparently, it's used quite often, as in these examples: https://model-checking.github.io/cbmc-training/cbmc/overview...

It seems that CBMC is not intended to check production sources directly against C semantics, but to prove things about programs written in a C-like syntax.


I see the "call an undefined prototype" style more often, perhaps for this reason. You are probably right that this is undefined behavior, but it's subtle. https://stackoverflow.com/questions/11962457/why-is-using-an... I suspect CBMC just picks some concrete behavior for undefined behavior. It may not be a good detector for that. I'm not sure. This gets into shaky territory of understanding for me.


Speaking as a compiler writer:

Any invocation of undefined behavior should be considered an assertion failure as far as any model checker should be concerned. Compilers can--and will--treat undefined behavior as license to alter the semantics of your program without any constraint, and most instances of undefined behavior are clearly programmer error (there is no good reason to read uninitialized memory, for example).

Reading an uninitialized variable is not "subtle" undefined behavior. It's one of the most readily accessible examples not only of what undefined behavior can exist, but also the ways compilers will mutilate your code just because you did it. To be honest, if something as simple as the consequences of reading uninitialized memory are shaky understanding for someone trying to prove code correct, that will completely undermine any trust I have in the validity of your proofs.


Also, even if the compiler does what a "Real programmer" type thinks it "should" do for this case (and I agree with you that you're not entitled to expect that), you aren't guaranteed that there's some particular value since you never initialized it.

Your operating system likely feels entitled to assume that if you never wrote to this page of RAM you don't care what exactly is in it. After all what kind of lunatic reads a bunch of unknown data, says "Yeah, that's coincidentally what I wanted" and just leaves it unmodified? No, almost anybody would write data they want to keep instead. So, if you never wrote to this particular page of RAM and your OS finds it convenient to swap that page for a different one, no harm no foul right? But now the contents of your uninitialized variable changed!


> So, if you never wrote to this particular page of RAM and your OS finds it convenient to swap that page for a different one, no harm no foul right? But now the contents of your uninitialized variable changed!

No sane OS will do this. Any page that's handed to a process that was last written by a different process must be zero'd (or otherwise have every address initialized) by the OS to avoid leaking information across process boundaries. You could, in theory, have a page that was munmap'd by /this/ process be handed back to the same process to fill a request for a different virtual address without zeroing it, but I can't imagine that any OS tracks the last writer to enable this "optimization" in the few cases it would apply.


glibc marks freed pages with MADV_FREE, and on Linux, an MADV_FREE page won't be unmapped immediately. It's possible to read a value from an MADV_FREE'd page, have Linux reclaim the page, and then read it again and now observe a zero-init'd page. If malloc returns a new allocation from a freed page, it isn't written to by the allocator (which would prevent the page from being subsequently zeroed if it hasn't already been), so it doesn't guarantee that the contents won't be randomly zeroed.

Whether or not this is a sane OS I leave an exercise to the reader, but it is nonetheless the property of a common OS.
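
A minimal sketch of the mechanism being described, calling madvise directly rather than going through glibc's allocator (the zeroing only becomes observable if the kernel actually reclaims the page, e.g. under memory pressure):

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void) {
      size_t len = 4096;
      unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED) return 1;

      memset(p, 0xAB, len);         /* dirty the page */
      madvise(p, len, MADV_FREE);   /* kernel may lazily reclaim it */

      /* A read here may still see 0xAB -- or, if the page was reclaimed in the
         meantime, a fresh zero-filled page. Only a subsequent write cancels the
         MADV_FREE and pins the contents. */
      printf("first byte: 0x%02x\n", p[0]);
      return 0;
  }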


> glibc marks freed pages with MADV_FREE, and on Linux, an MADV_FREE page won't be unmapped immediately. It's possible to read a value of an MADV_FREE'd page, have Linux swap the page out, and then read it again and now observe a zero-init'd page. If malloc returns a new allocation to a freed page, it isn't written to by the allocator (which would prevent the page from being subsequently zeroed if it hasn't already done so), so it doesn't guarantee that the contents won't be randomly zeroed.

Oh.

I feel like I should have more to say than that, but... oh.


Wouldn't that be canceled by the read fault?

(there's also usually a write, unless the allocation-head-metadata is on a different page than the parts of the allocation you're acting spooky with)

Edit: after watching the video elsewhere in the thread, it was indeed crossing a page boundary, but that behavior has to be a kernel bug since the documentation says "accesses" are sufficient to repopulate it.


No, it's specifically a write that causes the page to actually be freed. Per the man page:

> However, subsequent writes to pages in the range will succeed and then kernel cannot free those dirtied pages, so that the caller can always see just written data. If there is no subsequent write, the kernel can free the pages at any time. Once pages in the range have been freed, the caller will see zero-fill-on-demand pages upon subsequent page references.


Ah, I mixed up MADV_DONTNEED (which is older) with MADV_FREE.


Neither Linux nor Windows has this sort of uninitialized memory, and it would be quite the security hole if they did.

In general, the things your compiler thinks are UB are not the same things your OS or CPU thinks are undefined.


At least Linux does this, and it is not a security hole because it only reuses pages with memory of the current process (as in the worst case you end up reading data from some other part of your program, but not other programs).

And yes, it does happen in practice, most famously it has been mentioned in the "The strange details of std::string at Facebook" talk at CppCon 2016 https://youtu.be/kPR8h4-qZdk?t=1150&si=2R358wniZfxTJLmc


After watching, that has to be a kernel bug, since the documentation says "accesses" will cause repopulation, not "writes".

Edit: nevermind, I mixed up MADV_DONTNEED with MADV_FREE


The problem is that uninitialized memory really is subtle. After initializing all members of a struct, the padding bytes of that struct are still uninitialized. Yet, it is valid to memcpy the whole struct somewhere else; even with a hand-written char*-based copying loop. (and indeed, this is a good reason for reading uninitialized memory!) So reading uninitialized memory is not always undefined behavior; just most of the time.

The C standard text is barely sufficient for the purposes of compilers, but wholly insufficient for what a model checker would need to know.
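
A small illustration of the point (hypothetical struct; the general rule is that copying any object's bytes as unsigned char, padding included, is well-defined):

  #include <stdio.h>
  #include <string.h>

  struct S {
      char c;   /* typically followed by padding bytes before 'l' */
      long l;
  };

  int main(void) {
      struct S a;
      a.c = 'x';
      a.l = 42;   /* every named member initialized; the padding is still indeterminate */

      struct S b;
      memcpy(&b, &a, sizeof a);   /* valid even though the padding bytes are uninitialized */

      printf("%c %ld\n", b.c, b.l);
      return 0;
  }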


> After initializing all members of a struct, the padding bytes of that struct are still uninitialized.

Depends on how you initialize them. If you do it all in one go, then yes. If you partially initialize it, then no: some of the padding bytes are guaranteed to be zero. (Observing this and maintaining this invariant is an exercise for the reader.)


For reference, GCC has `__builtin_clear_padding` which is very useful to avoid problems there.
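
Usage is just a call on a pointer to the object (a sketch assuming GCC 11 or later and a struct with padding):

  struct S { char c; long l; };

  int main(void) {
      struct S s;
      s.c = 'x';
      s.l = 42;
      __builtin_clear_padding(&s);   /* zeroes only the padding bytes; members are untouched */
      return 0;
  }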


> there is no good reason to read uninitialized memory, for example

That's an assumption on your part.

One very good reason might be to ensure that memory really is there given the optimistic allocation schemes and to force a page to be read into core. You could also write to it to get the same effect. Another good reason to do a read on uninitialized memory is to see what's still there from the previous program that ran. Now arguably that's a nefarious reason but it is still a good one and it may help uncover bugs in the underlying system.


The code doesn't read uninitialized memory. Objects with automatic storage need not be in memory at all. The C99 standard only uses the word "memory" in normative sections when talking about "malloc".


Yes, I do not trust my understanding of C. Which is why I want mechanized assistance.


I apologize for using the word subtle. What I should have said is that I don't understand the topic, and reading about it has left me with the sense that one should be very careful.


You're spot on. The CVE lists are filled with people who thought they understood this stuff better than they really did.



I'll note that some form of undefined behavior checking / documentation is on the roadmap for the next major version: https://github.com/diffblue/cbmc/issues/7732


This is memory safe if you never re-use a (generational index, address) pair.

They seem to use 64-bit ones, so to overflow the counter for a single memory cell you would have to allocate and free it every cycle (!) at 4 GHz for about 150 years.
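
(Roughly: 2^64 ≈ 1.8 × 10^19 increments, divided by 4 × 10^9 per second, is about 4.6 × 10^9 seconds, i.e. roughly 146 years.)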

Basically as long as you accept this incredibly small "memory leak" you're fine.


By memory leak you mean that every memory location has a limited number of alloc+free cycles, and once it passes that, the memory is unusable?

Maybe you should have said, "to overflow the generational index you would have to exclusively allocate and free it with no work being performed for 150 years. Once the overflow happens the only consequence is that this memory cell will never be allocated again."


Actual clipping is expensive, so indeed a "guard band" is used: as long as the triangle stays inside the region allowed by the internal precision of the rasterizer, the off-screen pixels are simply "ignored".

See e.g. https://fgiesen.wordpress.com/2011/07/05/a-trip-through-the-... for a nice detailed explanation.


  In actual hardware shading is done 32 or 64 pixels at a time, not four. The problem above just got worse.
While it's true that there is "wasted" execution in 2x2 quads for derivative computation, it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient.

I don't think it's publicly documented how the "packing" of quads into lanes is done in the rasterizer for modern GPUs. I'd guess something opportunistic (maybe per tile), taking advantage of the general spatial coherency of triangles in mesh order.


> it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient

I am no GPU expert, but I performed some experiments a while ago indicating that this is in fact how it works, at least on nvidia.

I would expect it simplifies the fragment processing pipeline to have all the interpolants come from the same triangle. Another factor that comes to mind is that, due to the 2x2 quad-padding, you would end up with multiple shader executions corresponding to the same pixel location, coming from different triangles; that would probably involve complicated bookkeeping. Especially given MSAA.


It would be interesting to see how you were testing for that, because at least on AMD it's fairly certain that a single thread can be shading multiple primitives.

For example, from the ISA docs [1], pixel waves are preloaded with an SGPR containing a bit mask indicating just that:

> The new_prim_mask is a 15-bit mask with one bit per quad; a one in this mask indicates that this quad begins a new primitive, a zero indicates it uses the same primitive as the previous quad. The mask is 15 bits, not 16, since the first quad in a wavefront begins a new primitive and so it is not included in the mask

The mask is used by the interp instructions to load the correct interpolants from local memory.

In fact, in the (older) GCN3 docs [2] there is a diagram showing the memory layout of attributes from multiple primitives for a single wavefront (page 99).

That being said, of course I expect this process to be "lazy": you would not want to buffer execution of a partially filled thread forever, so depending on the workload you might measure different things.

[1] https://developer.amd.com/wp-content/resources/RDNA2_Shader_...

[2] http://developer.amd.com/wordpress/media/2013/12/AMD_GCN3_In...


I think I drew that conclusion from the following: I rendered a full-screen quad using two triangles, and made each fragment display a hash of its threadgroup id. Most threadgroups were arranged in nice, aligned 4x8 rectangles, but near the boundary, they became morphed and distorted so they could stay within the same triangle. That said, it occurs to me now that this could be an opportunistic thing; I am going to try to repeat that experiment, but with many triangles which are all smaller than a single threadgroup.


The linked AMD guide seems to suggest the author is correct

>Because the quad is rendered using two separate triangles, separate wavefronts are generated for the pixel work associated with each of those triangles. Some of the pixels near the boundary separating those triangles end up being organized into partial wavefronts


If you haven't already, I'd recommend compiling llvm/clang in debug or release+asserts mode when working on it. The codebase relies heavily on debug-only assertions, even for relatively trivial things (a missing implementation of some case, an invalid combination of arguments, ...), which means it's pretty easy to get weird crashes down the line with assertions disabled.
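
For example, something along these lines configures a release build with assertions kept on (standard LLVM CMake options; adjust paths, generator, and project list to taste):

  cmake -S llvm -B build -G Ninja \
      -DCMAKE_BUILD_TYPE=Release \
      -DLLVM_ENABLE_ASSERTIONS=ON \
      -DLLVM_ENABLE_PROJECTS="clang"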


Wait, I'm pretty sure operand collection logic & banking is there to keep the number of ports on the SRAM low, so basically you're arbitrating and buffering requests coming from high register count instructions (say 3 input fma) and potentially multiple pipelined SMT threads (not thread in the nvidia sense, thread in the "a whole wavefront/warp" sense).

However, to me it seems that's completely orthogonal to the vector lanes: I don't see why two parallel lanes in a single thread (e.g. a 64-element GCN wavefront) would need cross-connected logic at the register file, since almost all instructions _do not_ read/write data from another lane.

There are a few cross-lane shuffle / reduce instructions, but it seems to me that those would be handled in a dedicated execution unit. (They are not really the fast-path / common case.)


> There are a few cross-lane shuffles / reduce instruction but it seems to me that those would be handled in a dedicated execution unit. (they are not really the fast-path/common case)

Yes, you essentially need a (kind of) crossbar for shuffle and value broadcast. But as far as I know there is no unit dedicated to this on Nvidia GPUs. However, depending on the GPU microarchitecture, shuffle and broadcast may be implemented differently (e.g. through the load/store units).

Note that I said "crossbar" for simplicity and because there is little information available; I doubt that all the paths really exist.


If you're interested, the dev gave a cool GDC talk going over the design: https://www.youtube.com/watch?v=prXuyMCgbTc


Very interesting talk. My favorite takeaways:

1) The basic physics of sand, water and gases is quite simple (a minimal version of the sand rule is sketched below). I was impressed by the fact that he implemented most of it in Q-Basic way back. Thinking about this a bit more, Q-Basic looks like the perfect environment to play around with such pixel simulations, albeit a bit slow, I guess.

2) The hard part is mixing rigidbody simulation and pixel simulation. Also updating the world using multiple threads.

3) Procedural world generation uses something called Herringbone Wang Tiles [1]. It reduces visible seams and the algorithm itself does not look too complicated.

[1] https://nothings.org/gamedev/herringbone/herringbone_tiles.h...
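
A minimal sketch of the core sand rule from point 1 (hypothetical grid layout; real implementations add water, gases, randomness, and careful multi-threaded update ordering). Call step() once per frame:

  #define W 64
  #define H 48

  enum { EMPTY = 0, SAND = 1 };

  /* One bottom-up pass: each sand cell falls straight down if possible,
     otherwise tries to slide diagonally left or right. Processing from the
     bottom row upward means each grain moves at most one cell per step. */
  void step(unsigned char grid[H][W]) {
      for (int y = H - 2; y >= 0; y--) {
          for (int x = 0; x < W; x++) {
              if (grid[y][x] != SAND) continue;
              if (grid[y + 1][x] == EMPTY) {
                  grid[y + 1][x] = SAND; grid[y][x] = EMPTY;
              } else if (x > 0 && grid[y + 1][x - 1] == EMPTY) {
                  grid[y + 1][x - 1] = SAND; grid[y][x] = EMPTY;
              } else if (x < W - 1 && grid[y + 1][x + 1] == EMPTY) {
                  grid[y + 1][x + 1] = SAND; grid[y][x] = EMPTY;
              }
          }
      }
  }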


Regarding 2, I would really like to know more about how you can implement rigid bodies in a full pixel simulation. Anyone have any pointers?


Frankenstein your pixel sim with an existing rigid body sim.


Appreciate it! As a backend engineer (not a game dev) lacking a proper math and computer science background, what areas might I want to focus on learning to implement a simulation game similar to this?


The quickest path is to just start writing real-time simulations of stuff and reference how other similar simulations went about implementation: terrain and erosion, cloth, fluid, etc.

It's easier if you have a bunch of random linear algebra and calculus knowledge lying around in your head, but trying to study that stuff isn't the most direct route.


I don't have a good reference to share, but also read up on game tick loops and clock time. Processor speeds vary and thread scheduling is nondeterministic, so simulation code must be written carefully to produce a deterministic result.
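
The usual starting point is a fixed-timestep loop: the simulation always advances in constant increments no matter how long frames take (a generic sketch with hypothetical hooks, not code from the talk):

  #include <stdbool.h>

  /* Hypothetical hooks provided elsewhere. */
  double now_seconds(void);     /* monotonic clock */
  void   simulate(double dt);   /* advance the world by a fixed dt */
  void   render(void);

  void run(void) {
      const double dt = 1.0 / 60.0;   /* the world only ever advances in fixed steps */
      double accumulator = 0.0;
      double previous = now_seconds();

      while (true) {
          double current = now_seconds();
          accumulator += current - previous;
          previous = current;

          /* However long the frame took, consume it in whole dt steps, so the
             same inputs always produce the same sequence of simulation states. */
          while (accumulator >= dt) {
              simulate(dt);
              accumulator -= dt;
          }
          render();
      }
  }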


I don't think that you even need to produce a deterministic result in a game.


Same-machine determinism is probably not too bad, and would greatly aid debugging; cross-machine determinism is mainly really useful for lockstep multiplayer I think.

But yeah, it wouldn't really matter in-game unless you're trying to roll time backwards/forwards as a game mechanic.

Also of note, all games can be considered simulations, with varying complexity and rules (and players given a way to interact with the system) -- from that perspective, the only requirements to implementing a game simulation is whatever is required to implement any game.


This is correct. If you want to create a multiplayer game, you can go about it by running the simulation on everyone's computer, and sending only the inputs to the simulation across machines (lockstep networking). This requires the simulations to be deterministic, which can be tricky to get right due to differences across architectures / compilers.

This is the approach used in a lot of RTS games. Starcraft 1 and 2 both work this way. I believe the original doom used this method as well.

You can get away with not having deterministic simulations via having the server run the simulation, and sending the data required for clients to render the state of the world that they can see. This requires a lot more bandwidth and complicates the network code quite a bit (or at least demands a certain architecture for the code). You still need to run some of the simulation on the client just for having acceptable visuals due to how long it takes for data to travel from server to client (level collision might also be simulated on client for example), but this client side simulation is just for visuals. Still though it's perhaps easier to get a reliable solution working like this due to how hard it is to get deterministic simulations.

The big downside to lockstep, besides it being hard to get deterministic simulations running across machines, is that it opens you up to certain types of hacks. Because the simulation is running on everyone's machine, this allows malicious players to hack the code to do things like let them see through walls or reveal the entire map.

Some games like (some) fighting games use an approach where they just let the simulation run on the client, predicting what they think the inputs will be, and if they get data from the server which doesn't agree with the predicted inputs, the game will revert to the last known good state, and just re-run the simulation with the correct inputs. This sounds insane but if for example a player is holding down W to move forward, it's more likely that W will continue to be held down rather than being released on any given frame.


Regarding the fighting game approach, that would be an interesting application of machine learning. Given a player's previous performance in a fight, predict what they will do and use that as a better heuristic in the absence of additional input. The downside is that lag would be much less identifiable as such, because things that require volition (e.g. an attack) might be predicted and shown to an opponent even if they never happen.


I found the following book insightful, in understanding some of the concepts involved in simulating physics and "life-like" behaviors.

The Nature of Code: Simulating Natural Systems with Processing - Daniel Shiffman

You can read the entire book online here: https://natureofcode.com/book/


Cool looking book, thanks :)


That sounds very unlikely?

"The reason" why memory is slower than registers (apart from size obviously) is that register dependencies are static and can be easily tracked by the core. Memory is way harder since you don't know all the addresses in advance (any other random store could actually be storing to the stack).

That's why they have to use costly (associative) structures like the store queue.

The way cores track the stack is usually for return address prediction because it's a huge win for little cost and almost no well behaved program overwrites return addresses manually (low mispredict rate).

As far as I know, all x86 cores emit 2 uops for a push or a pop (load/store + sp adjust). I guess it can save you some frontend bandwidth but that's about it.


Nice article.

Of course if you get some of the benefits of run time specialization, you also get some of the downside, i.e., overspecialization.

The heuristics for knowing when to stop specializing are a really hard part of those kinds of systems.

You could have a program that does a single write() on 100k different files. You could have thousands of processes each writing to a single file and bloating the I$ with useless copies of write.


This isn't exactly a new issue in runtime code generation; for example, the HotSpot Java JIT was introduced over 20 years ago.


The typical approach to this (and yes, it can of course still fail) is to use counters and only optimize once you have seen a given pattern enough times.

