The __byte_perm function in CUDA is a powerful intrinsic designed for byte-level manipulation of 32-bit integers, commonly used in GPU-accelerated applications like cryptographic operations, data formatting, and Bitcoin script construction. It allows developers to select and rearrange bytes from two 32-bit inputs, x and y, into a new 32-bit output using a selector s. This guide provides a detailed explanation of how __byte_perm works, with a focus on constructing selector values, and includes practical examples to clarify its usage and address common pitfalls, particularly in the context of your Bitcoin P2SH script construction.
__device__ unsigned int __byte_perm(unsigned int x, unsigned int y, unsigned int s);- Parameters:
x: A 32-bit unsigned integer providing the first four bytes of the source.y: A 32-bit unsigned integer providing the next four bytes of the source.s: A 32-bit selector that determines which bytes are chosen for the output.
- Return: A 32-bit unsigned integer composed of four bytes selected from the 8-byte source based on
s.
The __byte_perm function combines x and y into an 8-byte source, from which four bytes are selected to form the result. The selector s specifies which bytes to pick and their order in the output. Here’s a step-by-step breakdown:
The 8-byte source is formed by concatenating x and y:
- Bytes 0–3: From
x, where byte 0 is the least significant byte (LSB, bits 7:0) and byte 3 is the most significant byte (MSB, bits 31:24). - Bytes 4–7: From
y, where byte 4 is the LSB (bits 7:0) and byte 7 is the MSB (bits 31:24).
For example, if x = 0x12345678 and y = 0x9ABCDEF0, the source bytes are:
| Index | Byte | Source |
|---|---|---|
| 0 | 0x78 | x[7:0] |
| 1 | 0x56 | x[15:8] |
| 2 | 0x34 | x[23:16] |
| 3 | 0x12 | x[31:24] |
| 4 | 0xF0 | y[7:0] |
| 5 | 0xDE | y[15:8] |
| 6 | 0xBC | y[23:16] |
| 7 | 0x9A | y[31:24] |
The selector s is a 32-bit integer, but only the lower 16 bits are used, divided into four 3-bit fields:
- Bits 2:0: Select the byte for result[7:0] (LSB).
- Bits 6:4: Select the byte for result[15:8].
- Bits 10:8: Select the byte for result[23:16].
- Bits 14:12: Select the byte for result[31:24] (MSB).
Each 3-bit field (values 0–7) specifies an index into the 8-byte source. The result is constructed as:
result[7:0] = source[(s >> 0) & 0x7]result[15:8] = source[(s >> 4) & 0x7]result[23:16] = source[(s >> 8) & 0x7]result[31:24] = source[(s >> 12) & 0x7]
In your code, selectors are often written in binary with 3-bit groups separated by apostrophes (e.g., 0b011'001'000'100). This represents a 12-bit or 16-bit value, where each group corresponds to a field:
- Leftmost group: Bits 14:12 (for result[31:24]).
- Next group: Bits 10:8 (for result[23:16]).
- Next group: Bits 6:4 (for result[15:8]).
- Rightmost group: Bits 2:0 (for result[7:0]).
For example, s = 0b011'001'000'100 translates to 0x3104 in hex, with fields 3, 1, 0, 4.
The result is a 32-bit integer stored in little-endian memory (e.g., [b0, b1, b2, b3]). When printed with your logic:
printf("%02x %02x %02x %02x ", (word >> 24) & 0xFF, (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF);It displays bytes in big-endian order: MSB (bits 31:24) first, LSB (bits 7:0) last. For result = 0x76a914c5, it prints 76 a9 14 c5.
To achieve a specific output, follow these steps:
- Identify Desired Bytes: Determine which bytes from the source are needed for each position in the result.
- Assign Indices: Map each desired byte to its source index (0–7).
- Build Selector: Place each index in the appropriate 3-bit field of
s:s = (index3 << 12) | (index2 << 8) | (index1 << 4) | index0index0for result[7:0],index1for result[15:8],index2for result[23:16],index3for result[31:24].
- Write in Binary: Optionally, express
sas a binary literal with 3-bit groups (e.g.,0b011'001'000'100).
Below are examples to illustrate __byte_perm usage, including cases from your code and additional scenarios to highlight common patterns and pitfalls.
Goal: Output all bytes of x = 0x12345678.
- Inputs:
x = 0x12345678(bytes[78, 56, 34, 12]),y = 0x9ABCDEF0(bytes[F0, DE, BC, 9A]). - Source:
[78, 56, 34, 12, F0, DE, BC, 9A]. - Desired Result:
[78, 56, 34, 12](hex0x12345678). - Selectors: Choose bytes 0, 1, 2, 3.
result[7:0] = source[0] = 78→ index 0.result[15:8] = source[1] = 56→ index 1.result[23:16] = source[2] = 34→ index 2.result[31:24] = source[3] = 12→ index 3.
- Selector:
s = (3 << 12) | (2 << 8) | (1 << 4) | 0 = 0x3210. - Binary:
0b0011'0010'0001'0000. - Code:
unsigned int result = __byte_perm(0x12345678, 0x9ABCDEF0, 0x3210); // result = 0x12345678, prints 12 34 56 78
Goal: Output all bytes of y = 0x9ABCDEF0.
- Inputs: Same as above.
- Desired Result:
[F0, DE, BC, 9A](hex0x9ABCDEF0). - Selectors: Choose bytes 4, 5, 6, 7.
- Selector:
s = (7 << 12) | (6 << 8) | (5 << 4) | 4 = 0x7654. - Binary:
0b0111'0110'0101'0100. - Code:
unsigned int result = __byte_perm(0x12345678, 0x9ABCDEF0, 0x7654); // result = 0x9ABCDEF0, prints 9A BC DE F0
Goal: Output 76 a9 14 c5 with x = 0x7600a914, y = h[0] = 0x069f34c5.
- Inputs:
x = 0x7600a914→ bytes[14, a9, 00, 76].y = 0x069f34c5→ bytes[c5, 34, 9f, 06].- Source:
[14, a9, 00, 76, c5, 34, 9f, 06].
- Desired Result:
[c5, 14, a9, 76](hex0x76a914c5, prints76 a9 14 c5). - Selectors:
result[7:0] = c5 = source[4]→ index 4.result[15:8] = 14 = source[0]→ index 0.result[23:16] = a9 = source[1]→ index 1.result[31:24] = 76 = source[3]→ index 3.
- Selector:
s = (3 << 12) | (1 << 8) | (0 << 4) | 4 = 0x3104. - Binary:
0b0011'0001'0000'0100or0b011'001'000'100. - Code:
unsigned int result = __byte_perm(0x7600a914, 0x069f34c5, 0x3104); // result = 0x76a914c5, prints 76 a9 14 c5
Goal: Reproduce your working example with x = 0x0014a976.
- Inputs:
x = 0x0014a976→ bytes[76, a9, 14, 00].y = 0x069f34c5→ bytes[c5, 34, 9f, 06].- Source:
[76, a9, 14, 00, c5, 34, 9f, 06].
- Desired Result:
[c5, 14, a9, 76](prints76 a9 14 c5). - Selector:
s = 0b0000'0001'0010'0100 = 0x0124.- Bits 2:0:
100= 4 → source[4] =c5. - Bits 6:4:
010= 2 → source[2] =14. - Bits 10:8:
001= 1 → source[1] =a9. - Bits 14:12:
000= 0 → source[0] =76.
- Bits 2:0:
- Code:
unsigned int result = __byte_perm(0x0014a976, 0x069f34c5, 0x0124); // result = 0x76a914c5, prints 76 a9 14 c5
Goal: Understand why s = 0b011'001'000'111 failed.
- Inputs: Same as Example 3.
- Selector:
s = 0b011'001'000'111 = 0x3107.- Bits 2:0:
111= 7 → source[7] =06. - Bits 6:4:
000= 0 → source[0] =14. - Bits 10:8:
001= 1 → source[1] =a9. - Bits 14:12:
011= 3 → source[3] =76.
- Bits 2:0:
- Result:
[06, 14, a9, 76](hex0x76a91406, prints76 a9 14 06). - Issue: You reported
14 9f c5 06, suggesting a possible mismatch inh[0](e.g.,h[0] = 0x14c59f06) or a printing error. The selector picked byte 7 (06) instead of byte 4 (c5).
In your code, __byte_perm is used to construct a Bitcoin P2SH-P2WPKH script, where scriptBytes[0] needs to start with specific opcodes (76 a9 14) followed by a hash byte (c5). The working example:
scriptBytes[0] = __byte_perm(h[0], 0x14, 0x5401);Produces 00 14 c5 34, showing how __byte_perm interleaves bytes from h[0] and a constant. For your case, adjusting the selector to 0x3104 with x = 0x7600a914 aligns the bytes correctly.
- Incorrect Selector Fields:
- Problem: Using wrong indices (e.g.,
0x3107selecting byte 7 instead of 4). - Solution: Double-check the source byte indices and align with desired output positions.
- Problem: Using wrong indices (e.g.,
- Byte Order Confusion:
- Problem: Misinterpreting
xorybyte positions (e.g., assuming big-endian). - Solution: Always list bytes as LSB to MSB for
xandy.
- Problem: Misinterpreting
- Printing Logic:
- Problem: Printing in big-endian order can confuse memory order expectations.
- Solution: Verify result in memory (little-endian) matches print order (big-endian).
- Hash Byte Misalignment:
- Problem: Assuming
h[0]bytes are in a different order. - Solution: Confirm
h[0]’s byte order (e.g.,0x069f34c5→[c5, 34, 9f, 06]).
- Problem: Assuming
- Use Hex for Selectors: Hex values (e.g.,
0x3104) are less error-prone than binary literals. - Verify Byte Order: Always confirm whether the desired output refers to memory layout or print order. For big-endian printing, reverse the byte order in memory.
- Test with Known Values: Use simple inputs to validate selector logic.
- Document Selectors: Comment the intended byte selections (e.g.,
// selects [76, a9, 14, c5]). - Test Selectors Incrementally: Use temporary print statements to verify each
__byte_permoutput. - Account for Endianness: When working with little-endian
uint32_tinputs and big-endian printing, calculate selectors based on the memory layout needed for the print order.
The __byte_perm function is a versatile tool for byte manipulation in CUDA, but its selector mechanism requires careful construction. By understanding the 8-byte source and 3-bit field selectors, you can achieve precise byte arrangements. The examples provided, especially tailored to your Bitcoin script use case, should help you avoid common errors and use __byte_perm effectively in your CUDA programs.