Performance improvements with SIMD (Single Instruction, Multiple Data)

“Single Instruction, Multiple Data” (SIMD) refers to a form of parallel data processing in which a processor performs the same operation on multiple data elements simultaneously. This approach is used when large amounts of uniformly structured data need to be processed efficiently.

To visualize this, consider vector addition. Normally, we would iterate over the elements of the two vectors and add them pairwise to obtain the result vector.

In the code example, the implementation would look like this:

C
int main() {
  float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
  float b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
  float c[4];

  /* Element-wise addition: one scalar operation per loop iteration. */
  for (int i = 0; i < 4; i++) {
    c[i] = a[i] + b[i];
  }

  return 0;
}

It can be clearly seen that the operation is always the same (single instruction), while the data changes with each loop iteration (multiple data).

SIMD allows the same addition to be performed in a single step instead of four by loading the arrays into SIMD registers. These registers are typically 128, 256, or 512 bits wide. In this example, a 128-bit SIMD register is used because it holds exactly four 32-bit float values.

C
#include <xmmintrin.h>

int main() {
  /* _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses,
     so the arrays are explicitly aligned (C11 _Alignas). */
  _Alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
  _Alignas(16) float b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
  _Alignas(16) float c[4];

  __m128 vecA = _mm_load_ps(a);         /* load four floats into a 128-bit register */
  __m128 vecB = _mm_load_ps(b);
  __m128 vecC = _mm_add_ps(vecA, vecB); /* add all four lanes in one instruction */
  _mm_store_ps(c, vecC);                /* write the result back to memory */

  return 0;
}

Do I need to customize my code to benefit from SIMD?

Yes and no. Modern compilers can identify such constructs and convert them into SIMD code automatically (auto-vectorization). However, even where this is possible in theory, compilers do not recognize every opportunity, especially in complicated code. In performance-critical sections it can therefore be advisable to write the operations explicitly using C compiler intrinsics.

Hardware implementation

SIMD is implemented in modern CPUs by dedicated execution units called SIMD units or vector units. These units are designed for parallel data processing and can operate on many data elements simultaneously.

In CPUs from Intel and AMD, SIMD is implemented through the SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) instruction set extensions. SSE supports SIMD operations on 128-bit registers, while AVX extends this to 256 bits and AVX-512 to 512 bits.

In Arm processors, SIMD functionality is provided by the Neon instruction set, whose registers are 128 bits wide.

Notes

It is important to keep in mind that moving data between memory and the SIMD registers causes additional overhead that can affect the runtime of the program. Data should therefore be kept in SIMD registers for as much work as possible, and loads and stores reduced to a minimum.

Additionally, the use of C compiler intrinsics limits the target platforms, since not every processor provides the corresponding SIMD instructions. In such cases, it may be necessary to implement an alternative, scalar code version to support processors without these instructions.
