The STREAM benchmark measures the memory bandwidth of the architecture. We simply add two vectors, which is our "Hello World" program. A simplified version of the C# code can be found in our SDK on GitHub.
As for the Expm1 benchmark, we run the generated code against the best handwritten native code we could write. This allows us to compare the performance of hybridized code with that of handwritten native code using intrinsics.
```csharp
[EntryPoint("run")]
public static void Run(int N, double[] a, double[] b)
{
    int start = threadIdx.x + blockDim.x * blockIdx.x;
    int step = blockDim.x * gridDim.x;
    for (int k = start; k < N; k += step)
    {
        a[k] += b[k];
    }
}
```
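For reference, the bandwidth figures reported below follow the usual STREAM-style accounting (this is our assumption about the tables' units, in GB/s): each element of `a[k] += b[k]` reads `a[k]` and `b[k]` and writes `a[k]` back, i.e. 3 × 8 = 24 bytes of traffic. A minimal sketch of that accounting:

```c
/* Effective bandwidth of a[k] += b[k]: per element, a[k] and b[k] are
   read and a[k] is written back, i.e. 3 x 8 = 24 bytes of traffic
   (write-allocate traffic is not counted, per the usual STREAM
   convention). Returns GB/s. */
double stream_add_bandwidth_gbs(long n, double elapsed_s) {
    return (double)n * 24.0 / elapsed_s / 1e9;
}
```

For instance, streaming one billion elements in 50 ms works out to 480 GB/s, the order of magnitude reported for the P100 below.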
No Hint
In a first version, we don't give any hint about alignment or index coalescence. This has little to no effect on CUDA machines, and the hardware prefetcher seems to do a very good job on Intel Xeon machines. On Xeon Phi, however, this leads to a gather instruction, which carries a small performance penalty. We'll see later how to restore performance.
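The effect can be illustrated in plain C (an analogue of the situation, not the actual generated code): when the compiler cannot prove that consecutive vector lanes touch consecutive addresses, a vectorized loop must load through gathers, whereas a provably unit-stride index allows contiguous vector loads and stores.

```c
#include <stddef.h>

/* Indirect addressing: the compiler cannot assume consecutive lanes
   touch consecutive addresses, so a vectorized version must use gather
   instructions (e.g. vgatherdpd on Xeon Phi / AVX-512). */
void add_indexed(size_t n, double *a, const double *b, const int *idx) {
    for (size_t i = 0; i < n; i++)
        a[idx[i]] += b[idx[i]];
}

/* Unit-stride addressing: vectorizes into contiguous vector
   loads and stores. */
void add_contiguous(size_t n, double *a, const double *b) {
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}
```

Both loops compute the same result when `idx` is the identity; only the generated memory instructions differ.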
Architecture | Generated (GB/s) | Handwritten (GB/s) | Ratio |
---|---|---|---|
NVIDIA P100 | 479.9 | 495.4 | 96.8% |
NVIDIA K20c | 167.0 | 185.0 | 90.2% |
Intel Xeon Phi 7210 | 327.7 | 381.7 | 85.8% |
Intel Xeon E5-1620 v3 (3.5 GHz) | 34.4 | 35.0 | 98.3% |
Coalesced – Aligned
In a second version, we specify that our backing storage is an aligned memory location, and that indices remain coalesced and aligned:
```csharp
[EntryPoint("StreamDouble")]
public static void StreamDouble(alignedstorage_double a, alignedstorage_double b, int offset, int count)
{
    alignedindex start = offset + threadIdx.x + blockIdx.x * blockDim.x;
    alignedindex end = count + offset;
    for (alignedindex i = start; i < end; i += blockDim.x * gridDim.x)
    {
        a[i] = a[i] + b[i];
    }
}
```
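In plain C, an analogous hint can be given with GCC/Clang's `__builtin_assume_aligned` (a sketch of the idea, not the Hybridizer mechanism behind `alignedstorage_double` and `alignedindex`):

```c
#include <stddef.h>

/* Promising 64-byte alignment (one AVX-512 vector / cache line) lets the
   compiler emit aligned, unit-stride vector loads and stores instead of
   unaligned or gather accesses. The promise must hold: passing a pointer
   that is not actually 64-byte aligned is undefined behavior. */
void stream_add_aligned(size_t n, double *pa, const double *pb) {
    double *a = __builtin_assume_aligned(pa, 64);
    const double *b = __builtin_assume_aligned(pb, 64);
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}
```

Callers must then guarantee the alignment, e.g. by allocating the buffers with `aligned_alloc(64, ...)`.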
Architecture | Generated (GB/s) | Handwritten (GB/s) | Ratio |
---|---|---|---|
Intel Xeon Phi 7210 | 373.8 | 381.7 | 97.9% |