The Intel Xeon Phi is an implementation of the MIC (Many Integrated Core) architecture.
It holds several independent cores (61 in our setup), each with 512-bit vector units [1]. Each core is hyper-threaded with up to four threads. The vector instructions are very similar to SSE or AVX, but the instruction set is much more complete. Moreover, the new gather and scatter operations ease vector access to memory by performing a lookup in a single instruction.
Memory bound or compute bound
One of the metrics we analyse is the ratio between raw compute performance and memory bandwidth. Asymptotically, it gives the number of operations that can be performed per memory operation, which helps define the limit between memory-bound and compute-bound problems.
Chip | Bandwidth (GB/s) | Single Precision (GFLOPS) | ratio | Double Precision (GFLOPS) | ratio |
---|---|---|---|---|---|
SE10P | 352 | 2130 | 24.2 | 1065 | 24.2 |
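
The ratio column can be reproduced with a short calculation. The sketch below assumes that the bandwidth is expressed in GB/s, the peaks in GFLOPS, and that one "memory operation" is the load of one 4-byte (single precision) or 8-byte (double precision) value; the function name `ops_per_element` is only illustrative.

```python
def ops_per_element(peak_gflops, bandwidth_gbs, bytes_per_element):
    """Floating-point operations available per element read from memory."""
    # Billions of elements that can be read per second at the given bandwidth.
    elements_per_second = bandwidth_gbs / bytes_per_element
    return peak_gflops / elements_per_second


# SE10P figures taken from the table above (units assumed: GFLOPS and GB/s).
sp_ratio = ops_per_element(2130, 352, 4)  # 4-byte single-precision values
dp_ratio = ops_per_element(1065, 352, 8)  # 8-byte double-precision values
print(f"SP: {sp_ratio:.1f}, DP: {dp_ratio:.1f}")  # SP: 24.2, DP: 24.2
```

Under these assumptions, a kernel that performs fewer than about 24 operations per value loaded would tend to be memory-bound on this chip, and one that performs more would tend to be compute-bound.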
Bandwidth benchmark
We analyse the read bandwidth of the Intel Xeon Phi with two tests, ECC and no-ECC; which setting applies depends on how critical memory reliability is.
Chip | Peak (GB/s) | ECC (GB/s) | ratio | No-ECC (GB/s) | ratio |
---|---|---|---|---|---|
SE10P | 352 | 162.08 | 46.0% | 168.04 | 47.9% |
Note on madd and GFLOPS
Not every algorithm can make full use of the madd (multiply-add) operation. In this document, we instead treat madd as just another kind of floating-point operation. Most architectures execute a madd in one cycle, or at least in the same number of cycles as an add or a mul; we thus count it as a single flop. As a result, the raw compute power of the hardware is halved compared to marketing figures. Algorithms that reconstruct multiply-add instructions from the evaluation graph are widespread in compilers.
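
As a small illustration of this accounting, the sketch below converts a peak figure that counts each madd as two flops into one that counts it as a single flop. The helper name `madd_as_one_flop` is hypothetical, and treating the 2130 and 1065 GFLOPS figures as madd-counted-twice peaks is an assumption drawn from the halving described above.

```python
def madd_as_one_flop(marketing_gflops):
    """Convert a peak that counts madd as two flops into one that counts it as one."""
    return marketing_gflops / 2


# Assumed madd-counted-twice peaks for the SE10P, in GFLOPS.
sp_peak = madd_as_one_flop(2130)  # single precision
dp_peak = madd_as_one_flop(1065)  # double precision
print(sp_peak, dp_peak)  # 1065.0 532.5
```

These values match the peaks used in the compute benchmark, where the double-precision figure is rounded to 533.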
Compute benchmark
For this benchmark, we use a Taylor expansion of the expm1 function. We know the number of operations, and no branching occurs.
Chip | Peak SP (GFLOPS) | Single Precision (GFLOPS) | ratio | Peak DP (GFLOPS) | Double Precision (GFLOPS) | ratio |
---|---|---|---|---|---|---|
SE10P | 1065 | 879 | 82.5% | 533 | 440 | 82.5% |
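
A kernel of this kind can be sketched as follows. This is not the benchmark code itself: the Horner form, the degree of 12, and the name `expm1_taylor` are assumptions chosen for illustration. The point is that the operation count is known in advance and the loop contains no data-dependent branch.

```python
import math


def expm1_taylor(x, degree=12):
    """Approximate exp(x) - 1 with a truncated Taylor series in Horner form.

    Returns (value, flops), counting each multiply-add step as one flop.
    """
    # Coefficients 1/k! for k = 1..degree; constants, not counted as flops.
    coeffs = [1.0 / math.factorial(k) for k in range(1, degree + 1)]
    acc = coeffs[-1]
    flops = 0
    for c in reversed(coeffs[:-1]):
        acc = acc * x + c  # one multiply-add per step
        flops += 1
    acc *= x  # final multiply: the series has no constant term
    flops += 1
    return acc, flops


value, flops = expm1_taylor(0.5)
print(flops)  # 12
```

With a series of degree n, this form costs n flops per evaluated value when a madd counts as one flop, so the achieved rate follows directly from the number of values processed and the elapsed time.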
Memory-Compute limit revisited
Finally, we revisit the first metric, using the achieved performance instead of the peak figures.
Chip | Bandwidth (GB/s) | Single Precision (GFLOPS) | ratio | Double Precision (GFLOPS) | ratio |
---|---|---|---|---|---|
SE10P | 168.02 | 879 | 22 | 440 | 22 |