
Should really give the big-O notation for common operations for comparison of the approaches.

---

There are some variations of Chunk/Bucket arrays which offer different characteristics.

HATs (Hashed Array Trees[1]) are close to what is presented - the index block is always a power of 2 in length and points to data blocks of the same length - ie, it grows as: 1x1, 2x2, 4x4, 8x8, 16x16, ...

The downside is each time the length reaches a new power of 4 it requires the data to be shuffled about, which may cause latency spikes.
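Element access in a HAT is plain shift-and-mask arithmetic. A minimal sketch (the `hat` struct and `hat_at` are hypothetical names, assuming data blocks of length 2^k):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical HAT: a 2^k-entry index block pointing at data blocks
   of 2^k elements each. Element i lives in data block (i >> k),
   at offset (i & (2^k - 1)). */
typedef struct {
    unsigned k;    /* log2 of the block length */
    int **blocks;  /* index block              */
} hat;

static int *hat_at(const hat *h, size_t i) {
    size_t block = i >> h->k;                      /* which data block */
    size_t slot  = i & (((size_t)1 << h->k) - 1);  /* offset within it */
    return &h->blocks[block][slot];
}
```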

RAOTS (Resizable Arrays in Optimal Time and Space[2]) is a variation where the index block's length is approximately the square root of the array's length. The structure grows by adding a new data block and resizing only the index block. The old blocks remain part of the structure and don't need reshuffling as in a HAT, which avoids the latency spikes, but there are more individual allocator calls. However, blocks are always a power of 2 in size, which makes them arena-friendly.

Both of these have O(1) element access and waste only O(√n) space, which makes them a good fit where memory may be a constraint.

---

In the double-and-copy style array, one trick is we can avoid storing `capacity` and compute it on demand - determine the MSB of the length (using `lzcnt` or `bsr`), and then determine the next power of 2 sufficient to hold that length. In C23, this is `stdc_bit_ceil` (<stdbit.h>).
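For instance, a sketch of computing capacity on demand (`capacity_for` is a hypothetical name; a portable loop stands in for `stdc_bit_ceil` on toolchains without C23's <stdbit.h>):

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of 2 >= len: the capacity of a double-and-copy
   array never needs to be stored, since it's recomputable from
   the length. (With C23: return stdc_bit_ceil(len);) */
static size_t capacity_for(size_t len) {
    size_t c = 1;
    while (c < len)
        c <<= 1;
    return c;
}
```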

In the "Xal" structure (which has several other names), we can use `stdc_first_leading_one` to determine the index in the index block, then zero out the leading one bit to determine the index in the bucket.

This structure would be more efficient if processors provided a `log2mask` instruction which computes both the MSB and the remainder after masking the MSB in one instruction. (Similar to how `divmod` instructions work where the quotient and remainder are both computed in one instruction).
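A sketch of that lookup, assuming bucket sizes 1, 2, 4, 8, ... and using GCC/Clang's `__builtin_clz` as a stand-in for C23's `stdc_first_leading_one` (`locate` is a hypothetical name):

```c
#include <assert.h>
#include <stdint.h>

/* With bucket sizes 1, 2, 4, 8, ..., element i maps to
   bucket = MSB position of (i + 1), and the offset within the
   bucket is (i + 1) with that leading one bit zeroed out. */
static void locate(uint32_t i, uint32_t *bucket, uint32_t *slot) {
    uint32_t n = i + 1;
    *bucket = 31u - (uint32_t)__builtin_clz(n);  /* index-block index */
    *slot   = n & ~(1u << *bucket);              /* clear the MSB     */
}
```

This is exactly the pair of results a hypothetical `log2mask` instruction would return at once.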

---

[1]: https://en.wikipedia.org/wiki/Hashed_array_tree

[2]: https://cs.uwaterloo.ca/research/tr/1999/09/CS-99-09.pdf


Thread locals don't fully solve the problem. They work well if you immediately call the closure, but what if you want to store the closure and call it later?

    #include <stdlib.h>
    #include <string.h>
    #include <stddef.h>
    #include <threads.h>    // thread_local macro (C11/C17; a keyword in C23)

    typedef int (*comp)(const void *untyped_left, const void *untyped_right);

    thread_local int in_reverse = 0;

    __attribute__((noinline))
    int compare_impl(const void *untyped_left, const void *untyped_right, int in_reverse) {
        const int* left = untyped_left;
        const int* right = untyped_right;
        return (in_reverse) ? *right - *left : *left - *right;
    }

    comp make_sort(int direction) {
        in_reverse = direction;
        int compare(const void *untyped_left, const void *untyped_right) {  // GNU C nested function extension
            return compare_impl(untyped_left, untyped_right, in_reverse);
        }
        return compare;
    }

    int main(int argc, char* argv[]) {

        int list[] = { 2, 11, 32, 49, 57, 20, 110, 203 };

        comp normal_sort = make_sort(0);
        comp reverse_sort = make_sort(1);

        qsort(list, (sizeof(list)/sizeof(*list)), sizeof(*list), normal_sort);
            
        return list[0];
    }
Because we create `reverse_sort` between creating `normal_sort` and calling it, we end up with a reverse sort despite clearly asking for a normal sort.


The answer is that you wrap it, and you don't return until the thing stored in the thread local is no longer needed.


So we can't return the closure?

Then it's clearly only half a solution.

The example I gave above should work fine in any language with first-class closures.


> The closure problem can be neatly described by as “how do I get extra data to use within this qsort call?”

    _Thread_local struct {
      void *data;
      int (*compare)(const void *a, const void *b, void *data);
    } _qsort2_closure;

    static int _qsort2_helper(const void *a, const void *b) {
        return _qsort2_closure.compare(a, b, _qsort2_closure.data);
    }

    void qsort2(void *base, size_t elements, size_t width, int (*compare)(const void *a, const void *b, void *data), void *userData)
    {
        _qsort2_closure.data = userData;
        _qsort2_closure.compare = compare;
        qsort(base, elements, width, _qsort2_helper);
    }
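A usage sketch (reproducing qsort2 so it compiles standalone; `by_distance` and the key are hypothetical - any comparator that needs extra context works the same way):

```c
#include <assert.h>
#include <stdlib.h>

/* qsort2 from above, reproduced so this sketch compiles on its own. */
static _Thread_local struct {
    void *data;
    int (*compare)(const void *a, const void *b, void *data);
} _qsort2_closure;

static int _qsort2_helper(const void *a, const void *b) {
    return _qsort2_closure.compare(a, b, _qsort2_closure.data);
}

static void qsort2(void *base, size_t elements, size_t width,
                   int (*compare)(const void *a, const void *b, void *data),
                   void *userData) {
    _qsort2_closure.data = userData;
    _qsort2_closure.compare = compare;
    qsort(base, elements, width, _qsort2_helper);
}

/* Hypothetical comparator: order ints by distance from a key passed
   through userData - the "extra data" within the qsort call. */
static int by_distance(const void *a, const void *b, void *userData) {
    int key = *(int *)userData;
    int da = abs(*(const int *)a - key);
    int db = abs(*(const int *)b - key);
    return (da > db) - (da < db);
}
```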


you also need to restore the _qsort2_closure when done. But again you are reinventing dynamic scoping with all its advantages and disadvantages.


> you also need to restore the _qsort2_closure when done

No, I do not. It will be reassigned on the next call.

> But again you are reinventing dynamic scoping

No. I’m not reinventing anything. I’m using the existing feature of thread local variables.

The usage of such is entirely an implementation detail of qsort2 with the exception of recursion.

Dynamic scoping typically refers to defining variables which have scope outside of their call stack. No usage of this API requires it.

Can you just try to learn something new?


This is the classic lexical vs dynamic scoping. Dynamic scoping works great until it doesn't.


Don’t use C then? It sounds like you want JavaScript, Python, or Lisp.

Once again, the caller of the API does not declare any variables so there is no dynamic scoping.


> x86 is unusual in mostly having a maximum of two operands per instruction[2]

Perhaps interesting for those who aren't up to date: the recent APX extension allows 3-operand versions of most of the ALU instructions with a separate destination register, so we don't need to clobber a source or use temporary registers - making them more RISC-like.

The downside is they're EVEX encoded, which adds a 4-byte prefix to the instruction. It's still cheaper to use `lea` for an addition, but now we will be able to do things like

    or rax, rdx, rcx
https://www.intel.com/content/www/us/en/developer/articles/t...


It's due to the way the instruction is encoded. `lea` would've needed special treatment in syntax to remove the brackets.

In `op reg1, reg2`, the two registers are encoded as 3 bits each in the ModRM byte which follows the opcode. Obviously, we can't fit 3 registers into the ModRM byte because it's only 8 bits.

In `op reg1, [reg2 + reg3]`, reg1 is encoded in the ModRM byte. The 3 bits that were previously used for reg2 are instead `0b100`, which indicates a SIB byte follows the ModRM byte. The SIB (Scale-Index-Base) byte uses 3 bits each for reg2 and reg3 as the base and index registers.

In any other instruction, the SIB byte is used for addressing, so syntax of `lea` is consistent with the way it is encoded.

Encoding details of ModRM/SIB are in Volume 2, Section 2.1.5 of the ISA manual: https://www.intel.com/content/www/us/en/developer/articles/t...
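Since ModRM and SIB share the same 2-3-3 bit layout, the packing can be sketched in a few lines (register numbers are the standard encodings rax=0, rcx=1, rbx=3; `pack233` is a hypothetical helper):

```c
#include <assert.h>
#include <stdint.h>

/* ModRM is mod(2) | reg(3) | rm(3); SIB is scale(2) | index(3) | base(3) -
   the same 2-3-3 packing. */
static uint8_t pack233(unsigned hi2, unsigned mid3, unsigned lo3) {
    return (uint8_t)((hi2 << 6) | (mid3 << 3) | lo3);
}

enum { RAX = 0, RCX = 1, RBX = 3, SIB_FOLLOWS = 4 /* 0b100 in the rm field */ };
```

For example, `lea rax, [rbx + rcx]` encodes as 48 8D 04 0B: REX.W, the 8D opcode, ModRM(mod=00, reg=rax, rm=0b100 signalling a SIB byte), then SIB(scale=0, index=rcx, base=rbx).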


This is why you should use Graphene's contact scopes and only allow access to contacts that you want to contact on WhatsApp.


It's not only xor that does this, but most 32-bit operations zero-extend the result of the 64-bit register. AMD did this for backward compatibility, so existing programs would mostly continue working, unlike Intel's earlier attempt at 64 bits, which was an entirely new design.

The reason `xor eax,eax` is preferred to `xor rax,rax` is due to how the instructions are encoded - it saves one byte which in turn reduces instruction cache usage.

When using 64-bit operations, a REX prefix is required on the instruction (a byte in 0x40..0x4F), which serves two purposes: the MSB of the low nybble (W) being set (ie, REX prefixes 0x48..0x4F) indicates a 64-bit operation, and the low 3 bits of the nybble provide an extra bit each for the ModRM register field and the base and index fields in the SIB byte, allowing registers r8-r15 to be used, as x86 otherwise only provides 3 bits (8 registers) per field.
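The REX layout described above (0100WRXB, where R, X, B extend ModRM.reg, SIB.index, and ModRM.rm/SIB.base respectively) can be sketched as:

```c
#include <assert.h>
#include <stdint.h>

/* REX = 0100 W R X B. W selects 64-bit operand size; R, X, B are the
   extra high bits for the ModRM reg, SIB index, and rm/base fields. */
static uint8_t rex(unsigned w, unsigned r, unsigned x, unsigned b) {
    return (uint8_t)(0x40 | (w << 3) | (r << 2) | (x << 1) | b);
}
```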

A recent addition, APX, adds 16 more registers (r16-r31), which need 2 additional bits per field. There's a two-byte REX2 prefix for this (0xD5 ...). REX2 replaces the REX prefix when accessing r16-r31 and still contains the W bit, but it also includes an `M0` bit selecting which of the two main opcode maps to use, replacing the 0x0F prefix - so it has no additional cost over the REX prefix when accessing the second opcode map.


> It's not only xor that does this, but most 32-bit operations zero-extend the result of the 64-bit register. AMD did this for backward compatibility.

It's not just that, zero-extending or sign-extending the result is also better for out-of-order implementations. If parts of the output register are preserved, the instruction needs an extra dependency on the original value.


This. It's for renaming.


Except for `xchg eax, eax`, which was the canonical nop on x86. Because it was supposed to do nothing, having it zero out the top 32-bits of rax would be quite surprising. So it doesn't.

Instead you need to use the two-byte, general-purpose encoding of `xchg` (87 C0) rather than the one-byte 90 to get the zero-extending behavior for `xchg eax, eax`.


The primary focus of Lions/seL4 is security, and formal verification of it. It's a necessary project for a world where CVEs are frequently published for the Kernel and other underlying system services, in part because they're still using a 1970s OS design which didn't have internet-connected computers in mind.

The seL4 kernel uses an Object-Capability system (which, ironically, has also been around since the '70s) in place of the Access Control Lists used in POSIX systems. The idea behind capabilities is "don't separate designation from authority": the capability is both a designator of a resource and the authority to use it - there are no separate ACLs. Separating designation from authority leads to confused deputies - where a process with lower privileges can trick a higher-privileged process into using its higher privileges on the first process's behalf - which very often leads to privilege escalation vulnerabilities (ie, RCE as root).

Each process has a "cspace" (Capability Space), which holds the capabilities the process may use. When the process performs a system call, it references a capability slot from a set of tables in its own cspace where the capability is held, and the Kernel verifies that the capability is valid in order to perform the syscall. The capabilities can be revoked at any time, and importantly, they cannot be forged - only delegated.

Additionally, seL4 is a microkernel design which is intended to have a small attack surface. It provides only a small number of system calls like `Send`, `Recv`, `Yield`, and combinations thereof, such as `Call`, `Reply`, and `ReplyRecv`. It's a generic interface where an IPC buffer is passed along with a set of delegated capabilities, but there are no specific calls such as `SYS_write` or `SYS_mmap` - these are built on top of the `Send`/`Recv` syscalls, and their services are provided by other user-space daemon processes.

One of the best developments of seL4 is the "Mixed-criticality system" support, which provides capabilities for processing time - to enable some processes to have real-time support without becoming CPU hogs.

seL4 can also be seen as a hypervisor which can virtualize other kernels, and part of the Lions effort is to be able to utilize Linux drivers using isolated Linux guests.

To learn more, the seL4 manual is a good place to start: https://sel4.systems/Info/Docs/seL4-manual-latest.pdf. There are also some videos on YouTube by Gernot Heiser discussing the MCS support.


We could potentially use a conditional move and an unconditional jump to make the branch target predictor do the work instead - and flood it with a bunch of targets which are intended to mispredict. Eg, we could give 255 different paths for abandon and select one randomly:

    #define BYTE_VALUES \
        X(0x00) X(0x01) X(0x02) X(0x03) X(0x04) X(0x05) X(0x06) X(0x07) X(0x08) X(0x09) X(0x0A) X(0x0B) X(0x0C) X(0x0D) X(0x0E) X(0x0F) \
        X(0x10) X(0x11) X(0x12) X(0x13) X(0x14) X(0x15) X(0x16) X(0x17) X(0x18) X(0x19) X(0x1A) X(0x1B) X(0x1C) X(0x1D) X(0x1E) X(0x1F) \
        X(0x20) X(0x21) X(0x22) X(0x23) X(0x24) X(0x25) X(0x26) X(0x27) X(0x28) X(0x29) X(0x2A) X(0x2B) X(0x2C) X(0x2D) X(0x2E) X(0x2F) \
        X(0x30) X(0x31) X(0x32) X(0x33) X(0x34) X(0x35) X(0x36) X(0x37) X(0x38) X(0x39) X(0x3A) X(0x3B) X(0x3C) X(0x3D) X(0x3E) X(0x3F) \
        X(0x40) X(0x41) X(0x42) X(0x43) X(0x44) X(0x45) X(0x46) X(0x47) X(0x48) X(0x49) X(0x4A) X(0x4B) X(0x4C) X(0x4D) X(0x4E) X(0x4F) \
        X(0x50) X(0x51) X(0x52) X(0x53) X(0x54) X(0x55) X(0x56) X(0x57) X(0x58) X(0x59) X(0x5A) X(0x5B) X(0x5C) X(0x5D) X(0x5E) X(0x5F) \
        X(0x60) X(0x61) X(0x62) X(0x63) X(0x64) X(0x65) X(0x66) X(0x67) X(0x68) X(0x69) X(0x6A) X(0x6B) X(0x6C) X(0x6D) X(0x6E) X(0x6F) \
        X(0x70) X(0x71) X(0x72) X(0x73) X(0x74) X(0x75) X(0x76) X(0x77) X(0x78) X(0x79) X(0x7A) X(0x7B) X(0x7C) X(0x7D) X(0x7E) X(0x7F) \
        X(0x80) X(0x81) X(0x82) X(0x83) X(0x84) X(0x85) X(0x86) X(0x87) X(0x88) X(0x89) X(0x8A) X(0x8B) X(0x8C) X(0x8D) X(0x8E) X(0x8F) \
        X(0x90) X(0x91) X(0x92) X(0x93) X(0x94) X(0x95) X(0x96) X(0x97) X(0x98) X(0x99) X(0x9A) X(0x9B) X(0x9C) X(0x9D) X(0x9E) X(0x9F) \
        X(0xA0) X(0xA1) X(0xA2) X(0xA3) X(0xA4) X(0xA5) X(0xA6) X(0xA7) X(0xA8) X(0xA9) X(0xAA) X(0xAB) X(0xAC) X(0xAD) X(0xAE) X(0xAF) \
        X(0xB0) X(0xB1) X(0xB2) X(0xB3) X(0xB4) X(0xB5) X(0xB6) X(0xB7) X(0xB8) X(0xB9) X(0xBA) X(0xBB) X(0xBC) X(0xBD) X(0xBE) X(0xBF) \
        X(0xC0) X(0xC1) X(0xC2) X(0xC3) X(0xC4) X(0xC5) X(0xC6) X(0xC7) X(0xC8) X(0xC9) X(0xCA) X(0xCB) X(0xCC) X(0xCD) X(0xCE) X(0xCF) \
        X(0xD0) X(0xD1) X(0xD2) X(0xD3) X(0xD4) X(0xD5) X(0xD6) X(0xD7) X(0xD8) X(0xD9) X(0xDA) X(0xDB) X(0xDC) X(0xDD) X(0xDE) X(0xDF) \
        X(0xE0) X(0xE1) X(0xE2) X(0xE3) X(0xE4) X(0xE5) X(0xE6) X(0xE7) X(0xE8) X(0xE9) X(0xEA) X(0xEB) X(0xEC) X(0xED) X(0xEE) X(0xEF) \
        X(0xF0) X(0xF1) X(0xF2) X(0xF3) X(0xF4) X(0xF5) X(0xF6) X(0xF7) X(0xF8) X(0xF9) X(0xFA) X(0xFB) X(0xFC) X(0xFD) X(0xFE) X(0xFF)
        
    void resolve(Transaction *t) {
        void* jt[] = {
            #define X(n) [n] = &&abandon##n,
            BYTE_VALUES
            #undef X
            [0] = &&send,   // label address (labels-as-values GNU extension)
        };
        
        static uint8_t branch = 0;
        bool csend = should_send(t);
        branch = csend ? 0 : branch;      // cmov or SETcc
        goto *jt[branch];
        
        #define X(n) \
        abandon##n: \
            abandon(t); \
            branch = random() % 256; \
            return;
        BYTE_VALUES
        #undef X
        
        send:
            if (csend) send(t);
            else abandon(t);
            return;
    }
    
Assuming no inherent bias in the low byte produced by `random`, there's only a ~1/255 chance that an abandon branch will correctly predict, though this is also true for the indirect jump to the send branch. The conditional branch inside send, though, should only mispredict ~1/256 of the time (when the random low byte is 0).

If we're sending significantly more often than 1/256 calls to resolve, it may be possible to train the BTP to prefer the send branch, as it will correctly predict this branch more often than the others which are chosen randomly - though this would depend on how the branch target predictor is implemented in the processor.


rand() doesn't do what you think it does. It's a multiply then an add, then return particular bits of the current seed. See "Linear congruential generator" for more information.

On GCC, it's multiply by 1103515245 then add 12345, and return low 30 bits. On MSVC, it's multiply by 214013 and add 2531011 then return bits 16...30.
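A sketch of the MSVC-style generator described above (the well-known sequence after seeding with 1 starts 41, 18467, 6334, ...; `msvc_rand` is a hypothetical name):

```c
#include <assert.h>
#include <stdint.h>

/* LCG: new seed = seed * 214013 + 2531011 (mod 2^32);
   the output is bits 16..30 of the updated seed. */
static uint32_t msvc_rand(uint32_t *seed) {
    *seed = *seed * 214013u + 2531011u;
    return (*seed >> 16) & 0x7FFFu;
}
```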


I don't specifically mean stdlib `rand` (hence the comment about assuming no inherent bias) - just some arbitrary PRNG. Updated the code above to say `random` so as not to confuse.


The branch taken hint (3EH) was re-added in Redwood Cove (2023), but it's only for static prediction where the branch predictor has not yet encountered the branch - ie, useful for things you would only use once or twice but would likely take the branch. Once the branch predictor has some information the static prediction hint is ignored, so it's best to omit it for anything that will eventually have dynamic branch prediction (ie, run many times), because you'll be consuming bytes in the i-cache which serve no purpose.


If the branch is only taken once, how can you realize a significant performance benefit of more than a few ns?


Cold branches come to mind -- something like an interrupt handler that runs often enough, but not in high enough bursts.


It's most likely a combination of both genetics and society - neither are absolutes. There is no concrete evidence that intelligence is purely a social construct, nor that it is genetic. We simply don't know.

People get cancelled not for saying that it is genetic, but for questioning whether it may be. Of course, we will never know if we're not allowed to ask. Cancel culture is anti-science.

Watson may have been racist, but questioning whether there is a relationship between genetics and intelligence by itself is not racism.


We are allowed to ask this question, and we have asked it, and we've found that the evidence does not validate the premise of inherent racial intelligence or other racial essentialist views[0]. Claims like "Asians have the highest IQ" are not meaningful or scientifically valid.

[0]https://en.wikipedia.org/wiki/Race_and_intelligence#Research...


This US-government provided table of average SAT scores in the United States in 2023 (https://nces.ed.gov/fastfacts/display.asp?id=171) has breakdowns by race/ethnicity of the test-taker, and clearly shows Asians with the highest average score of any racial category in the chart - which is evidence for something you could pithily summarize as "Asians have the highest IQ". The relationship between SAT scores, IQ, and intelligence in an everyday sense, and how representative the people categorized in this chart are of everyone on the planet who could be grouped into the same racial categories, are more complicated questions. Nonetheless, the hypothesis that there are genetic differences between people of different racial groups that affect their intelligence, in a similar way to how they affect more obvious racial correlates such as hair and skin color, is not obviously wrong.


> This US-government provided table of average SAT scores in the United States in 2023

If you look at their source[0], there's no information about how they controlled for confounders (because it's impossible as they acknowledge[1].)

There's a strong correlation between "education of parents" and "SAT score"[2] which implies that family wealth is a strong contribution to a child's SAT score (something we all know anyway); that's also backed up by [3].

(I'd suggest that [4] also contributes to the positive correlation between familial wealth and test scores but perhaps in a more oblique "the higher goals are aimed at by kids who have the backing to contemplate them because of family support structure, tutoring, ability to pay for the degree(s), etc." way.)

(Similarly for [5], I suppose - there's a distinct correlation between what I'd say was "perceived difficulty of major" and the mean SAT scores. Again probably down to familial wealth, support, tutoring, etc.)

Someone who's an actual statistician would probably rip this apart much more thoroughly and with more rigour than I, of course.

[0] https://reports.collegeboard.org/media/pdf/2023-total-group-...

[1] "Relationships between test scores and other background or contextual factors are complex and interdependent. Caution is warranted when using scores to compare or evaluate teachers, schools, districts, or states, because of differences in participation and test taker populations."

[2] Bottom of page 4: "Highest Level of Parental Education"

[3] Bottom of page 5: "Median Family Income"

[4] Bottom of page 8: "Degree-Level Goal"

[5] Top of page 8: "Intended College Major"


>There is no concrete evidence that intelligence is purely a social construct, nor that it is genetic.

There isn't even any concrete evidence that it's a good thing.


Intelligence has a significant genetic component, otherwise it couldn't have evolved.

"Intelligence isn't genetic" is the left's version of creationism.


No one is claiming that intelligence isn't genetic. Certainly not "the left."

The claim is that race as commonly understood and defined (specifically by eugenicists like Watson) has no genetic basis, and therefore claims which follow from that definition such as "Asians have higher IQ" are not scientifically valid, and do not prove the validity of Watson's racial views.

For some reason sparkie just decided to reframe my comment around a claim I didn't make and now here we are litigating a "leftist" strawman.

I'm so tired.

