The STREAM benchmark measures the memory bandwidth of the architecture. We simply add two vectors, which is our "Hello World" program. A simplified version of the C# code can be found in our SDK on GitHub.
As for the Expm1 benchmark, we run the generated code against the best handwritten native code we could write. This allows us to compare the performance of hybridized code with that of handwritten native code using intrinsics.
```csharp
[EntryPoint("run")]
public static void Run(int N, double[] a, double[] b)
{
    int start = threadIdx.x + blockDim.x * blockIdx.x;
    int step = blockDim.x * gridDim.x;
    for (int k = start; k < N; k += step)
    {
        a[k] += b[k];
    }
}
```
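For reference, the bandwidth figures reported below follow the usual STREAM-style accounting (this is our assumption about the tables' units, in GB/s): each element of `a[k] += b[k]` reads `a[k]` and `b[k]` and writes `a[k]` back, i.e. 3 × 8 = 24 bytes of traffic. A minimal sketch of that accounting:

```c
/* Effective bandwidth of a[k] += b[k]: per element, a[k] and b[k] are
   read and a[k] is written back, i.e. 3 x 8 = 24 bytes of traffic
   (write-allocate traffic is not counted, per the usual STREAM
   convention). Returns GB/s. */
double stream_add_bandwidth_gbs(long n, double elapsed_s) {
    return (double)n * 24.0 / elapsed_s / 1e9;
}
```

For instance, streaming one billion elements in 50 ms works out to 480 GB/s, the order of magnitude reported for the P100 below.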
No Hint
In a first version, we don't give any hint about alignment or index coalescence. This has little to no effect on CUDA machines, and the hardware prefetcher seems to do a very good job on Intel Xeon machines. On Xeon Phi, however, this leads to a gather instruction, which carries a small performance penalty. We'll see later how to restore performance.
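The effect can be illustrated in plain C (an analogue of the situation, not the actual generated code): when the compiler cannot prove that consecutive vector lanes touch consecutive addresses, a vectorized loop must load through gathers, whereas a provably unit-stride index allows contiguous vector loads and stores.

```c
#include <stddef.h>

/* Indirect addressing: the compiler cannot assume consecutive lanes
   touch consecutive addresses, so a vectorized version must use gather
   instructions (e.g. vgatherdpd on Xeon Phi / AVX-512). */
void add_indexed(size_t n, double *a, const double *b, const int *idx) {
    for (size_t i = 0; i < n; i++)
        a[idx[i]] += b[idx[i]];
}

/* Unit-stride addressing: vectorizes into contiguous vector
   loads and stores. */
void add_contiguous(size_t n, double *a, const double *b) {
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}
```

Both loops compute the same result when `idx` is the identity; only the generated memory instructions differ.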
Architecture | Generated (GB/s) | Handwritten (GB/s) | Ratio |
---|---|---|---|
NVIDIA P100 | 479.9 | 495.4 | 96.8% |
NVIDIA K20c | 167.0 | 185.0 | 90.2% |
Intel Xeon Phi 7210 | 327.7 | 381.7 | 85.8% |
Intel Xeon E5-1620 v3 (3.5 GHz) | 34.4 | 35.0 | 98.3% |
Coalesced – Aligned
In a second version, we specify that our backing storage is an aligned memory location, and that indices remain coalesced and aligned:
```csharp
[EntryPoint("StreamDouble")]
public static void StreamDouble(alignedstorage_double a, alignedstorage_double b, int offset, int count)
{
    alignedindex start = offset + threadIdx.x + blockIdx.x * blockDim.x;
    alignedindex end = count + offset;
    for (alignedindex i = start; i < end; i += blockDim.x * gridDim.x)
    {
        a[i] = a[i] + b[i];
    }
}
```
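In plain C, an analogous hint can be given with GCC/Clang's `__builtin_assume_aligned` (a sketch of the idea, not the Hybridizer mechanism behind `alignedstorage_double` and `alignedindex`):

```c
#include <stddef.h>

/* Promising 64-byte alignment (one AVX-512 vector / cache line) lets the
   compiler emit aligned, unit-stride vector loads and stores instead of
   unaligned or gather accesses. The promise must hold: passing a pointer
   that is not actually 64-byte aligned is undefined behavior. */
void stream_add_aligned(size_t n, double *pa, const double *pb) {
    double *a = __builtin_assume_aligned(pa, 64);
    const double *b = __builtin_assume_aligned(pb, 64);
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}
```

Callers must then guarantee the alignment, e.g. by allocating the buffers with `aligned_alloc(64, ...)`.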
Architecture | Generated (GB/s) | Handwritten (GB/s) | Ratio |
---|---|---|---|
Intel Xeon Phi 7210 | 373.8 | 381.7 | 97.9% |