'What are assembly instructions like PEXT actually used for?

I watched a YouTube video on the Top 10 Craziest Assembly Language Instructions, and some of these instructions have no obvious application to me. What's the point of something like PEXT, which gathers just the bits of a source operand at the positions where a mask operand has 1s? How would the compiler know when to use this instruction? Same/similar questions about carry-less multiplication.

Disclaimer: I know little to nothing about assembly language. Maybe I should read up on it!

I hope this question is stackoverflow-appropriate.



Solution 1:[1]

You can find some applications listed in the paper about a hardware unit for PDEP/PEXT:

There are many emerging applications, such as cryptography, imaging and biometrics, where more advanced bit manipulation operations are needed. While these can be built from the simpler logical and shift operations, the applications using these advanced bit manipulation operations are significantly sped up if the processor can support more powerful bit manipulation instructions. Such operations include arbitrary bit permutations, performing multiple bit-field extract operations in parallel, and performing multiple bit-field deposit operations in parallel. We call these permutation (perm), parallel extract (pex) or bit gather, and parallel deposit (pdep) or bit scatter operations, respectively.

Performing Advanced Bit Manipulations Efficiently in General-Purpose Processors

Bit permutation is extremely common in bitboards, for example reversing bytes/words or mirroring bit arrays. There are lots of algorithms in that area that require extensive bit manipulation, and people had to get creative before the era of PEXT/PDEP. Later, many card game engines adopted the same technique to handle an entire game state in just one or a few registers.

PDEP/PEXT is also used to greatly improve bit-interleaving performance, which is common in algorithms like Morton codes. An example of this:

The multiplication technique invented for bitboards is also commonly used in many algorithms in Bit Twiddling Hacks, for example interleaving bits with a 64-bit multiply. That technique is no longer needed once PDEP/PEXT is available.

You can find more detailed information in Bit permutations and Hacker's Delight

Another use for PDEP/PEXT is extracting/combining fields whose bits are not in contiguous positions, for example when disassembling RISC-V instructions: the immediate bits are scattered around to make the hardware design simpler, but that makes them a bit messier to work with in software without PDEP/PEXT.

Some other applications:

I think the pext / pdep instructions have HUGE implications to 4-coloring problem, 3-SAT, Constraint Solvers, etc. etc. More researchers probably should look into those two instructions.

Just look at Binary Decision Diagrams, and other such combinatorial data structures, and you can definitely see the potential uses of PEXT / PDEP all over the place.

https://news.ycombinator.com/item?id=19137260


How would the compiler know when to use this instruction?

Compilers can recognize common patterns and optimize the instruction sequence, but for advanced instructions like these, programmers usually need to call the intrinsics explicitly from high-level code.

Solution 2:[2]

PDEP (Parallel Deposit) and PEXT (Parallel Extract) are meant to be a convenient way to deposit and extract bit fields. I'd bet there are good low-level use cases for them.

For actual uses: I wrote a Sudoku solver that used PEXT in a couple of functions to extract bit values. Thanks to PEXT I was able to extract 4 elements in a single instruction (vs. 1 for the normal approach). It was really convenient. If you really want, I could put up a code snippet on Compiler Explorer to show the difference.

Solution 3:[3]

The following isn't directly about the usage of PDEP / PEXT, since it concerns their performance, but that affects whether using them makes sense. I've got a Zen 2 Ryzen Threadripper 3990X CPU, and I tested the throughput of PDEP and PEXT with the intrinsics of MSVC++ and Intel C++ under Windows 11, and clang++ and g++ under Linux. Here's the code:

#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <cstdint>
#include <atomic>
#if defined(_MSC_VER)
    #include <intrin.h>
#elif defined(__GNUC__) || defined(__llvm__)
    #include <immintrin.h>  // compile with -mbmi2 so _pdep_u64 / _pext_u64 are available
#endif

using namespace std;
using namespace chrono;

// Sink for the results so the compiler can't optimize the loops away.
atomic_uint64_t aSum( 0 );

int main()
{
    constexpr size_t
        N = 0x1000,
        ROUNDS = 10'000;
    vector<uint64_t> data( N, 0 );
    mt19937_64 mt;
    uniform_int_distribution<uint64_t> uid( 0, ~0ull );
    for( uint64_t &d : data )
        d = uid( mt );
    auto pdep = []( uint64_t data, uint64_t mask ) -> uint64_t { return _pdep_u64( data, mask ); };
    auto pext = []( uint64_t data, uint64_t mask ) -> uint64_t { return _pext_u64( data, mask ); };
    // Returns the average time per operation in nanoseconds.
    auto bench = [&]<typename Permute>( Permute permute ) -> double
    {
        uint64_t sum = 0;
        auto start = high_resolution_clock::now();
        constexpr uint64_t MASK = 0x5555555555555555u;  // every other bit
        for( size_t r = ROUNDS; r--; )
            for( uint64_t d : data )
                sum += permute( d, MASK );
        double ns = (double)(int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / ((double)N * ROUNDS);
        ::aSum = sum;
        return ns;
    };
    cout << bench( pdep ) << endl;
    cout << bench( pext ) << endl;
}

According to the data on agner.org, PDEP / PEXT should have a latency and throughput of slightly below 20 clock cycles on my Zen 2 CPU. On Intel CPUs since Haswell, the latency is only 3 clock cycles and the throughput is a whopping one instruction per clock cycle.
But according to my measurements each instruction takes about 35 ns, i.e. about 150 clock cycles on my CPU. There's no measurement error, and the disassembly I checked matches what you'd write in assembly. So I'm curious about the numbers for other CPUs; maybe you'll report them here. That would help assess whether using PDEP or PEXT makes sense.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Sep Roland
Solution 2: Dom324
Solution 3: Bonita Montero