
In this case the linked list will actually be lower latency, as you'll only be touching two locations, the root and the node itself, while with the queue you'll have an additional hop through its data array.

And yes, of course you would allocate your nodes from a linear, contiguous (but possibly chunked) fixed-size bump allocator first.
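
To make that concrete, here's a minimal sketch of what such a chunked bump allocator could look like, assuming 64-byte nodes (Chunk, bump_alloc, and the 4096-node chunk size are made up for illustration, not something the parent specified):

    struct Node { char rep[64]; };

    struct Chunk {
        Chunk* next;          // chunks are chained, not contiguous with each other
        Node nodes[4096];     // but each chunk is one contiguous slab
    };

    static Chunk* chunk;      // current slab
    static int bumped;        // nodes already handed out from it

    // Hand nodes out linearly until the slab runs dry, then chain on a
    // fresh one. Freed nodes would go onto the free list, so this path
    // mostly runs while the arena warms up.
    Node* bump_alloc() {
        if (!chunk || bumped == 4096) {
            Chunk* c = new Chunk;
            c->next = chunk;
            chunk = c;
            bumped = 0;
        }
        return &chunk->nodes[bumped++];
    }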



I'm just making this up as I go, but I'm only seeing 1x pointer dereference in the hot loop (the "if side" of the following if statement; the "else side" is more complex, but not really bad):

    if(this->size >= 0){
      // fast path: pop a recycled pointer off the current slab
      return lastLinkedSlab->heap[this->size--];
    } else {
      // slow path: this slab is spent; advance to the next one
      auto old = lastLinkedSlab;
      lastLinkedSlab = lastLinkedSlab->next;
      delete old;
      if(lastLinkedSlab == nullptr){
          // nothing left to recycle; a fresh slab would need its
          // heap[] filled (e.g. from a bump allocator) first
          lastLinkedSlab = new LinkedSlab();
      }
      this->size = 16384 - 1;
      return lastLinkedSlab->heap[this->size--];
    }
Or something like that. I'm not really seeing how this scheme incurs many indirect pointer references.

> And yes, of course you would allocate your nodes from a linear, contiguous (but possibly chunked) fixed-size bump allocator first.

Hmm, that's probably a bit more efficient, but probably not a big deal because it's a "once ever" optimization. You usually want an optimization to be exercised over and over in a loop before it pays off...


so, what I have in mind is this. Let's assume in both cases the allocator is static. In the list case you have:

    struct node { node* next; };
    static node * root;

    void * alloc() {
         node * result = root; // 1 load
         if (result) { // fast path
           root = result->next; // 1 load + 1 store
           return result;
         } else {
           // slow path: refill the free list, then retry
           return nullptr; // placeholder
         }
    }
There is a single load on the critical path (loading root into result); the additional load of result->next and the store are not on the critical path, as the result does not depend on them. Also, loading result->next will touch a cacheline that will be accessed very soon by the caller of alloc anyway.
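
For completeness, a hedged sketch of what the elided slow path could do, assuming fresh blocks come from something like the chunked bump allocator upthread (bump_alloc and the batch size of 64 are made up):

    // Slow path: carve a batch of fresh blocks out of the bump arena,
    // thread them onto the free list, then retry the fast path.
    void * alloc_slow() {
        for (int i = 0; i < 64; ++i) {
            node * n = (node *) bump_alloc(); // each block doubles as a list node
            n->next = root;
            root = n;
        }
        return alloc();
    }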

In the queue case you have:

    struct queue { void ** data; int size; };
    static queue q;
    void * alloc() {
         int sz = q.size - 1; // 1 ld [a] + 1 sub
         if (sz >= 0) { // fast path
              q.size = sz; // 1 store
              void ** data = q.data; // 1 ld [b]
              return data[sz]; // 1 ld [c]
         } else { /* slow path: refill q.data, then retry */ return nullptr; }
    }
In this case the chain b->c is on the critical path (a can run in parallel with b, and the sub has minimal latency); the store is not on the critical path. So, as you can see, there is an additional dereference. It will also touch the additional cacheline pointed to by data+sz.

If you are doing many allocations in a very tight loop, then the store in the linked list example suddenly becomes part of the critical path and can become a problem, while the compiler can extract some parallelism from the size computation in the queue example. If that scenario is important for you then the queue can be preferable.
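
To illustrate that last point, here's a rough sketch (mine, not the parent's) of four unrolled allocations under each scheme; out is a made-up destination buffer and all bounds/null checks are omitted:

    void take4_queue(void ** out) {
        int sz = q.size;
        out[0] = q.data[sz - 1]; // four independent loads: the compiler
        out[1] = q.data[sz - 2]; // (or the CPU) can issue them in parallel
        out[2] = q.data[sz - 3];
        out[3] = q.data[sz - 4];
        q.size = sz - 4;         // single size update
    }

    void take4_list(void ** out) {
        // each load of root->next feeds the next pop, so the four
        // loads form a serial dependency chain through memory
        out[0] = root; root = root->next;
        out[1] = root; root = root->next;
        out[2] = root; root = root->next;
        out[3] = root; root = root->next;
    }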


Okay, I had something wrong earlier, but I deleted the incorrect post. That's what I get for rushing and trying to do this in one go. I think I got it correct this time:

    struct Node{
        char rep[64]; // 64 bytes per node
    };

    struct StackLink {
      struct StackLink* prev;
      Node* nodePtrs[4096];
    };

    static StackLink* root; // nullptr when empty
    static int stackLinkSize;
It'd be a stack, not a queue. So I changed the name to "stack" to better represent the strategy. stackLinkSize doesn't need to be part of the link structure. That's why the fast path is simply:

    return root->nodePtrs[stackLinkSize--];
This basically breaks down into:

    dereference(root + offsetOf(nodePtrs) + stackLinkSize * sizeof(Node*))
Just one dereference, if I'm counting correctly.

I think I agree with you that "size" is dereferenced (that's dereference #2), but since it's hot I don't think it would be the bottleneck for anything. L1 references are very fast.


You also need to load root. If you do not count that, then in the linked list case there are no dereferences (root is literally the pointer to the allocated block).

The stack case will touch three cachelines (root, nodePtrs, and the final block). The linked list touches only two (there is no nodePtrs).

TBH it would be very hard to design a not-completely-artificial benchmark where one design makes a difference compared to the other. The only way is to test on an actual application and check what fits best.


> TBH it would be very hard to design a not-completely-artificial benchmark where one design makes a difference compared to the other. The only way is to test on an actual application and check what fits best.

I think I can agree to that.

I should note that I designed this allocator to be used on the GPU. "size" is shared across a wavefront / SIMD unit, and popcount(exec_mask) determines how many items to alloc at a time. (If 200 threads call alloc, they collectively grab 200 pointers all at once by doing size -= popcount(exec_mask).)
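
For readers who haven't seen the trick, here's a rough CUDA-flavored sketch of that wavefront-aggregated pop (the parent describes AMD wavefronts; g_size, g_slots, and the 32-lane warp here are my own illustrative assumptions, and the empty-stack slow path is omitted):

    __device__ int g_size;             // stack top, shared by all threads
    __device__ void* g_slots[1 << 20]; // made-up pointer stack

    __device__ void* alloc() {
        unsigned mask = __activemask();       // lanes currently calling alloc
        int count = __popc(mask);             // allocations the warp needs
        int leader = __ffs(mask) - 1;         // lowest active lane
        int lane = threadIdx.x & 31;
        int base = 0;
        if (lane == leader)
            base = atomicSub(&g_size, count); // one atomic claims `count` slots
        base = __shfl_sync(mask, base, leader);
        // this lane's rank among the active lanes below it
        int rank = __popc(mask & ((1u << lane) - 1));
        return g_slots[base - 1 - rank];      // each lane pops a distinct slot
    }

The whole warp pays for a single atomic rather than one per lane, which is the win being described.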


Oh, in that context, being able to parallelize allocation this way must be a big win. Of course it wouldn't work with a free list.



