...and on the Haswell PEXT runs in *one* uop with a latency of 3! That is nothin...

ant6n · on Oct 8, 2017

ARMV8 has bit field extract/insert, which are fairly powerful, and I'd say more generic than "shl ax,N", which is really only there for legacy reasons (it's an instruction with 16-bit operator prefix).

RISC generally tries to keep instructions as generic/useful as possible to keep silicon small.

to3m · on Oct 8, 2017

I was thinking about this today. Imagine treating it N bits at a time. N=4, perhaps. 16 cases, so it won't take up much space, and you can just write out each case independently. Perhaps it's easy to have a table? - then you might do 8 bits at a time, maybe.

This suboperation does an N-bit version of the total operation: takes N bits of the mask, the corresponding N bits of the input, and produces an S-bit result, R. (S is the population count of that part of the mask.)

Suppose N=4. Then for a 64-bit value, you can do 16 of these in parallel, and you've got 16 intermediate results, R0...Rf, and 16 result sizes, S0...Sf. Then combine them. Overall result = R0 | R1<<S0 | R2<<S0+S1 | R3<<S0+S1+S2 | ... | Rf<<S0+S1+S2+...+Se.

I've no idea whether this is actually easy to implement this way, but at least it requires no loop ;)

----

12-bit example where N=4.

Mask is %000101011000, and value is %lkjihgfedcba. (I've labelled each bit of the value by its position, since that's the key part.)

Intermediate results: R0=%000d, S0=1. R1=%00ge, S1=2. R2=%000i, S2=1.

R0's contribution to the overall result = R0 = %d.

R1's = R1<<S0 = R1<<1 = %ge0.

R2's = R2<<S0+S1 = R<<3 = %i000.

So the overall result is %000d|%ge0|%i000 = %iged.