Efficient minimal PCG32 implementation for CUDA-enabled GPUs


May 01, 2024
Written by Patryk Pochmara, supervised by course leader Krzysztof Kaczmarski, Ph.D.

Introduction

The purpose of this project is to find an efficient implementation of the PCG32 [2] PRNG algorithm on CUDA-enabled GPUs. The main focus lies on providing a CUDA kernel which fills an array in device memory with generated numbers in the same order as a classical CPU implementation would.

This paper will not analyse the statistical or cryptographic properties of the PCG family of generators, nor will it compare their performance to other PRNG algorithms. Its focus lies in finding the fastest implementation of the PCG32 algorithm.

PCG32

The minimal PCG32 standard specifies that the internal state of the generator consists of two 64-bit integers, denoted S and I respectively. These integers are the seed of the generated sequence. I remains constant throughout the lifetime of the generator, while S changes with each produced number. The standard also defines a constant multiplier M = 6364136223846793005.

Two functions are used to generate a series of random numbers based on the internal state:

  • The state-transition function S_{n+1} = S_n · M + I, which advances the generator state by 1.

  • The output function, which takes S and produces one pseudo-random 32-bit number.

Algorithm

General idea

We can generalise the state-transition function to allow for any positive offset n:

S_n = S_0 · M^n + I · (M^{n-1} + M^{n-2} + ... + M^0) = S_0 · M^n + I · Σ_{i=0}^{n-1} M^i

Aside from S, this function also takes a second argument, n.

The output function, while defined by the PCG32 standard, is not important in the context of efficient sequence generation, since its complexity is constant and its result is a pseudo-random number that is not used to calculate subsequent numbers.

This means that efficiently generating the random sequence translates to generating the sequence of values of S, which breaks down to calculating the subsequent powers M^1, M^2, M^3, ... and sums M^0, M^0 + M^1, M^0 + M^1 + M^2, ...

Brown’s algorithm

In order to calculate S_n faster than naively (linear complexity) we can utilise Brown's algorithm [1], which has O(log_2 n) complexity. It can be thought of as an extension of the well-known fast integer exponentiation algorithm.

sum = 0
power = 1
increment = 1
multiplier = M
while (n != 0)
    if (n % 2 == 1)
        power = power * multiplier
        sum = sum * multiplier + increment
    increment = increment * (multiplier + 1)
    multiplier = multiplier * multiplier
    n = n / 2
S = S * power + I * sum

One-number-per-thread approach

To generate N random numbers a classical CPU implementation performs only O(N) operations. This is because the linear single-threaded algorithm calculates each value S_i only once and immediately uses it to calculate S_{i+1}.

On the other hand, if we execute a CUDA kernel where each thread calculates one random number using Brown's algorithm, each thread performs O(log_2 N) operations. That gives O(N · log_2 N) operations in total, which is not work-efficient, because most computations are repeated across threads. For instance, half of the threads would calculate S_1 and a quarter would calculate S_3 as intermediate values while calculating their target S_t, where t is the thread's index. Profiling also shows that this approach's performance is limited by computation throughput.

Multi-linear approach

Once power_n = M^n and sum_n = Σ_{i=0}^{n-1} M^i have been calculated for a particular n, they can be reused. This means generating the sequence S_n, S_{2n}, S_{3n}, ..., S_{mn} requires only 2m multiplications and m additions aside from the initial calculation of power_n and sum_n.

If in a CUDA kernel each thread generates K out of N random numbers (see figure 1), we can use Brown's algorithm to calculate power_T and sum_T just once. Such a kernel would be composed of T = ⌈N/K⌉ threads, each of which would calculate its own S_t and then, using power_T and sum_T, generate the sequence S_t, S_{t+T}, .... In total, all threads perform O(T · (K + log_2 T)) = O(N + T · log_2 T) operations. This means work-efficiency increases as the number of threads decreases, but at the cost of parallelism.

Figure 1: Illustration of the multi-linear idea. Each of 4 threads generates 3 to 4 numbers for a total of 14. Initial S_t are calculated with Brown's algorithm (iteration 0); subsequent S_{t+4i} are generated using power_4 and sum_4, which were also calculated with Brown's algorithm.

Parallelism, i.e. the maximum number of concurrent threads, is still limited on a GPU, despite being many orders of magnitude larger than on a CPU. If a multi-linear kernel is tuned to utilise all of the GPU's cores while not creating too many threads, we can achieve maximum performance.

Optimisations

Profiling shows that utilising vectorized memory access allows for better interleaving of computations and memory stores, which in turn yields a small performance increase.

Figure 2: Illustration of the multi-linear idea using vectorized memory access.

It is also worth noting that the multi-linear approach is most efficient when T is a power of 2. This is because Brown's algorithm is based on treating the exponent as a sum of powers of 2, which means that power_T and sum_T are simply the values of multiplier and increment after log_2 T loop iterations. If T is not a power of 2, it is necessary to use Brown's algorithm twice: once to calculate S_t and once to calculate power_T and sum_T, unless an optimal (potentially GPU-specific) T is hard-coded and these two values are precomputed.

Additionally, the multiplier and increment values for each iteration of Brown's algorithm can be precomputed and loaded into the GPU's constant memory. If we limit the number of iterations to 16 (thus limiting the number of threads to 2^16), then storing these constants takes only 16 · 2 · 8 = 256 bytes.

Implementation

Fragments of source code included in this paper have been adapted for readability from the benchmarked implementation of the multi-linear approach [4]. For benchmark results see section 2.6.

This implementation takes the initial generator state (S_0 and I) as an argument and does not return the final generator state (S_N). However, it is simple to modify it to read S_0 from device memory and/or write S_N to device memory for a subsequent kernel to utilise. Alternatively, a global offset to the generated numbers' indices can easily be added as a kernel parameter.

Assumptions

The code presented below makes use of the pcg32_random_t type, which is used by the official PCG32 minimal C implementation [3], to represent a generator instance containing its internal state S and I, referenced as .state and .inc.

These code listings also assume all proposed optimisations (2.4.1) are present, i.e. memory access is vectorized, the number of threads is a power of 2, and there exist two globally accessible arrays in constant memory, uint64_t multiplier[] and uint64_t increment[], which are populated with precomputed values for consecutive iterations of Brown's algorithm.

Output function

__device__ uint32_t pcg32_xsr(uint64_t state)
{
    uint32_t xsh = (uint32_t)(((state >> 18u) ^ state) >> 27u);
    uint32_t rot = state >> 59u;
    return (xsh >> rot) | (xsh << ((-rot) & 31));
}

Brown’s algorithm

__device__ void offset_state(pcg32_random_t &gen, int64_t index)
{
    uint64_t power = gen.state;
    uint64_t sum = 0;
    for (int it = 2; index != 0; it++, index >>= 1)
    {
        if (index & 1)
        {
            power = power * multiplier[it];
            sum = sum * multiplier[it] + increment[it];
        }
    }
    gen.state = power + gen.inc * sum;
}

Note the `int it = 2` in the loop initialisation. This makes each thread calculate S_{4t} instead of S_t as its starting state, matching the 4 numbers covered by each vectorized store.

Multi-linear approach

__device__ void generate_numbers(uint32_t *results, pcg32_random_t gen, int64_t index,
                                 const int64_t numbers, const int threads_log2)
{
    const int64_t step = 1LL << threads_log2;
    while (index * 4 + 3 < numbers)
    {
        uint4 result_vec;
        uint64_t s = gen.state;
        result_vec.x = pcg32_xsr(s);
        s = s * multiplier[0] + gen.inc;
        result_vec.y = pcg32_xsr(s);
        s = s * multiplier[0] + gen.inc;
        result_vec.z = pcg32_xsr(s);
        s = s * multiplier[0] + gen.inc;
        result_vec.w = pcg32_xsr(s);
        ((uint4*)results)[index] = result_vec;
        
        gen.state = gen.state * multiplier[threads_log2 + 2]
                    + gen.inc * increment[threads_log2 + 2];
        index += step;
    }
    for (int i = 0; index * 4 + i < numbers; i++)
    {
        results[index * 4 + i] = pcg32_xsr(gen.state);
        gen.state = gen.state * multiplier[0] + gen.inc;
    }
}

Kernel

__global__ void kernel(uint32_t *results, pcg32_random_t gen,
                       const int64_t numbers, const int threads_log2)
{
    const int64_t index = threadIdx.x + blockIdx.x * (int64_t)blockDim.x;
    offset_state(gen, index);
    generate_numbers(results, gen, index, numbers, threads_log2);
}

Benchmarks

Presented here are the results of benchmarks of the multi-linear approach with all proposed optimisations implemented. For the technical specification of the benchmark setup see section 2.6.1.

Figure 3: Average time [ps] per number in a multi-linear kernel generating 2^30 numbers. Time values are placed on a logarithmic scale (base 2), grouped by the number of threads in a kernel.

Figure 3 shows that the time per generated number is smallest when the total number of threads in a kernel is between 2^14 and 2^22. Below this range the GPU cannot be fully utilised, while above it the computational cost of repeated work becomes significant.

In order to measure the GPU's memory throughput, a simple kernel was benchmarked in which each thread stores a single 4x4-byte vector to device memory. Its total memory usage is equal to the amount of memory occupied by the numbers generated by the corresponding multi-linear kernel.

Figure 4: Average execution time [ns] of a multi-linear kernel with 2^14 threads, with the "throughput" kernel for reference. Time values are placed on a logarithmic scale (base 2), grouped by the number of generated numbers.

The benchmark results show that the multi-linear kernel's execution time is comparable to that of the "throughput" kernel. This demonstrates that the multi-linear algorithm comes close to the maximum throughput of the GPU.

Technical specifications of machines on which benchmarks were performed

GPU                    VRAM        CPU                        RAM
1x Tesla K20m          1x 5 GiB    2x Intel Xeon E5649        15 GiB
4x Tesla P100          4x 16 GiB   2x Intel Xeon E5-2695 v4   512 GiB
8x NVIDIA A100-SXM4    8x 40 GiB   2x AMD EPYC 7742 64-Core   1 TiB

Table: Technical specifications of benchmark machines

While the machines may be equipped with more than one CPU or GPU, the benchmark utilised only one of each.

Rejected ideas

Alternative to Brown’s algorithm

An alternative algorithm with logarithmic complexity was developed to calculate the offset state S_n, but, despite being more complex, it was less efficient. The algorithm was based on the following observations:

  • Sum of a geometric series: Σ_{i=0}^{n-1} M^i = (M^n − 1) / (M − 1)

  • (x / y) mod z = (x mod (y · z)) / y, which here gives:
    ((M^n − 1) / (M − 1)) mod 2^64 = ((M^n − 1) mod ((M − 1) · 2^64)) / (M − 1)

  • From the Chinese remainder theorem [5]:
    (M^n − 1) mod ((M − 1) · 2^64) = ((M^n − 1) mod p) · q · q' + ((M^n − 1) mod q) · p · p'
    where p = (M − 1) / 4, q = 2^66 and p · p' + q · q' = 1 (Bézout's identity)

  • (M^n − 1) mod (M − 1) = 0, so (M^n − 1) mod ((M − 1) / 4) = 0

To calculate S_n, first the lower 66 bits of M^n were computed and then, using the reasoning above, Σ_{i=0}^{n-1} M^i was calculated. This algorithm required extended-precision integer arithmetic, including division, which was implemented using Barrett's algorithm. The benchmarked version featured multiple optimisations, including PTX routines for arithmetic, but was unable to match the performance of Brown's algorithm.

Opportunistic reuse of calculated offset states

In order to reduce the overlap (i.e. identical computations) between the calculations of different threads, an idea was explored in which threads publish calculated offset states to global memory. When a new block began execution, it would use the latest published states as the basis for its own offset-state calculations.

This solution did provide a performance gain over the one-number-per-thread approach, but required reading and storing 8-byte state values in global memory, which was significantly slower than the multi-linear approach, which only performs 4-byte stores. Another downside is the need for a second-stage kernel to perform the final calculation of random numbers from the published sequence of states, i.e. to apply the PCG32 output function.

Summary

The goal of this project has been accomplished. The proposed implementation [4] provides a way to generate a sequence of PCG32 random numbers on a CUDA-enabled GPU in the same order as a CPU implementation. It achieves perfect coalescing and very high utilisation of memory bandwidth, which results in near-speed-of-light performance. Work-efficiency is also similar to that of the linear approach, with overhead depending only on the number of threads in the kernel.

Bibliography


    1. Forrest B. Brown. Random number generation with arbitrary strides. In: Transactions of the American Nuclear Society 71 (Dec. 1994), pp. 202–203. ISSN 0003-018X. URL: https://www.osti.gov/biblio/89100

    2. Melissa E. O'Neill. PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation. Tech. rep. HMC-CS-2014-0905. Claremont, CA: Harvey Mudd College, Sept. 2014. URL: https://www.cs.hmc.edu/tr/hmc-cs-2014-0905.pdf

    3. Melissa E. O'Neill. PCG32 minimal C implementation. 2018. [Accessed 21-May-2024]. URL: https://www.pcg-random.org/download.html#minimal-c-implementation

    4. Patryk Pochmara. CUDA PCG32. GitHub repository. 2024. URL: https://github.com/Patonymous/cuda-pcg32

    5. K. H. Rosen. Elementary Number Theory and Its Applications. 1993. URL: https://books.google.pl/books?id=gfnuAAAAMAAJ
