OpenRAND
0.9
OpenRAND: A C++ Library for Reproducible Random Number Generation in Parallel Computing Environments
|
OpenRAND comes with four generator classes: Philox
, Tyche
, Threefry
and Squares
. They all have a similar interface: given a seed
and counter
, the generator can produce a stream of random numbers (upto 2^32 numbers per object). seed
should be self-explanatory, we'll introduce counter
later.
You can also draw random numbers of other types, such as long long int
, double
, float
, etc.
The API follows numpy style. If you want a floating point from a normal distribution, use randn
:
You can pair these generators seamlessly with the built-in distribution functions from C++'s <random> library:
In a parellel program, it's a good idea to let all your threads have their local random streams. In an OpenMP program you can -
Make sure the seeds are unique, OpenRAND will ensure these streams will not have any correlation between them. This is an important property of OpenRAND that sets it apart from many libraries out there.
Another thing to remember is that these generator objects are very cheap to create, destroy and have a very small memory footprint. So don't worry about creating a new generator object for each thread even if you've got a million threads. Which brings us to-
It's time to introduce the counter
argument.
You can think of a single seed
capable of producing a unique stream of length 2^64. This stream is divided into 2^32 substreams, each of length 2^32. The counter
argument is used to select a substream.
For reproducibility, on GPU, it's better to think in terms of work-unit or processing element instead of thread. For example, in a history-based simulation, a work-unit models the entire lifespan of one particle. In a ray tracing renderer, it can be a pixel index. A work-unit can undergo multiple kerel launches in it's lifetime.
One random number stream per work-item isn't ideal, it's often better to have a unique stream at each kernel launch for a work-item. In this way, you can avoid the overhead of a seperate kernel launch just to initialize random states, loading them from global memory inside each kernel, and saving back the modified state etc. counter
helps you get around all that, by cheaply creating a substream for each kernel launch.
Here's an example of a typical use-case on GPU: a monte carlo paticle simulation code that runs for 10,000 time steps. We can simply use the iteration number as counter
. For seed
, we assume each thread below has a unique global id atribute called pid
.
Interestingly, notice we've spent 0 bits of global memory to keep track of the random number states. In theory, you will need about 96 bits of memory (64 bit seed + 32 bit counter) per work-item to maintain the state of a random number generator. This is already pretty small. But we expect that in majoriy of cases, like the example above, even that won't be necessary.