Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

...and on the Haswell PEXT runs in one uop with a latency of 3! That is nothing short of amazing for an operation which, from its description, would seem to require a microcode loop or at least a few more cycles to collect an arbitrary number of bits with arbitrary gaps between them:

http://www.felixcloutier.com/x86/PEXT.html

The only unfortunate thing is that it was introduced quite recently in terms of x86 history (instead of being present early on but just microcoded), so earlier software can't take advantage of it nor any subsequent improvements, and uses a pretty complex encoding (VEX) only usable in protected mode. But if anything, this is another datapoint in the argument in favour of CISC --- try doing this on a MIPS, RISC-V or even ARM! With less powerful ISAs, even something as simple as the "shl ax, 4" (shift the lower 16 bits only) turns into a multi-instruction sequence of masking and combining.



ARMV8 has bit field extract/insert, which are fairly powerful, and I'd say more generic than "shl ax,N", which is really only there for legacy reasons (it's an instruction with 16-bit operator prefix).

RISC generally tries to keep instructions as generic/useful as possible to keep silicon small.


I was thinking about this today. Imagine treating it N bits at a time. N=4, perhaps. 16 cases, so it won't take up much space, and you can just write out each case independently. Perhaps it's easy to have a table? - then you might do 8 bits at a time, maybe.

This suboperation does an N-bit version of the total operation: takes N bits of the mask, the corresponding N bits of the input, and produces an S-bit result, R. (S is the population count of that part of the mask.)

Suppose N=4. Then for a 64-bit value, you can do 16 of these in parallel, and you've got 16 intermediate results, R0...Rf, and 16 result sizes, S0...Sf. Then combine them. Overall result = R0 | R1<<S0 | R2<<S0+S1 | R3<<S0+S1+S2 | ... | Rf<<S0+S1+S2+...+Se.

I've no idea whether this is actually easy to implement this way, but at least it requires no loop ;)

----

12-bit example where N=4.

Mask is %000101011000, and value is %lkjihgfedcba. (I've labelled each bit of the value by its position, since that's the key part.)

Intermediate results: R0=%000d, S0=1. R1=%00ge, S1=2. R2=%000i, S2=1.

R0's contribution to the overall result = R0 = %d.

R1's = R1<<S0 = R1<<1 = %ge0.

R2's = R2<<S0+S1 = R<<3 = %i000.

So the overall result is %000d|%ge0|%i000 = %iged.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: