2022-08-24
Today I pair programmed with Isaac, working through some C++ code from a library that he's calling via its Rust bindings. The part we looked at was processing some data and applying a Fourier Transform to it.
We came across some code that instructed the compiler to process it using Single Instruction, Multiple Data (SIMD) parallelism. This is the first time I've come across SIMD or these kinds of instructions to the compiler, and both concepts were a bit strange for me, so I decided to look into them a bit more!
A 'pragma' is a language construct that tells the compiler how to process input (the name comes from the same Greek word as 'pragmatic'). I find this interesting because in languages I'm familiar with, it's up to the compiler to decide how to execute a program.
C++ has a #pragma directive which passes implementation-specific instructions to the compiler - for example, asking it to use processor features like SIMD if they're available. Here's an example using OpenMP:
#pragma omp parallel for simd
for (int i = 0; i < x; i++)
{
    // code here can be executed in parallel
}
// no closing pragma is needed - the directive applies to the loop that follows it
SIMD means that the same instruction is applied to multiple pieces of data simultaneously, and it's generally supported on modern CPUs. It's used for operating efficiently on vector data, which is why the same idea shows up in GPUs - for example, changing the brightness of a screen means adding the same value to large arrays of data representing pixels.
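To make that concrete, here's a small sketch of the brightness example (my own illustration, not code from the library we were reading - pixels, n and brightness are just placeholder names). The pragma asks the compiler to vectorise the loop with SIMD instructions where the target CPU supports them:

#include <cstddef>

void brighten(float* pixels, std::size_t n, float brightness)
{
    #pragma omp simd
    for (std::size_t i = 0; i < n; i++)
    {
        pixels[i] += brightness;   // the same operation applied to every element
    }
}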
There are two processor types that are used to implement this:
What I found confusing about this initially was why we would do this rather than apply multiple different instructions simultaneously. I think the answer is that SIMD is an efficient way to do operations on contiguous data without having to use full concurrency (like threads). It's efficient on contiguous data because loading multiple adjacent items into the same vector register at once is cheaper than loading them one by one.
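One way to see this is with explicit vector intrinsics. This is just my own sketch using x86 SSE (it assumes an x86-64 CPU, and skips handling the last few elements when the length isn't a multiple of 4): four adjacent floats are loaded into one 128-bit register and added with a single instruction, instead of four separate scalar adds.

#include <cstddef>
#include <immintrin.h>

void add_to_all(float* data, std::size_t n, float value)
{
    __m128 v = _mm_set1_ps(value);               // broadcast value into all 4 lanes
    for (std::size_t i = 0; i + 4 <= n; i += 4)
    {
        __m128 chunk = _mm_loadu_ps(&data[i]);   // load 4 adjacent floats at once
        chunk = _mm_add_ps(chunk, v);            // one instruction adds all 4 lanes
        _mm_storeu_ps(&data[i], chunk);          // store 4 results back
    }
    // a real version would also handle the n % 4 leftover elements
}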