January 2013 - Altimesh

The Intel Xeon Phi is an implementation of Intel's MIC (Many Integrated Core) architecture.

It holds several independent cores (61 in our setup), each with a 512-bit vector unit[1]. Each core is hyper-threaded and runs up to four hardware threads. The vector operations resemble SSE or AVX but form a much more complete set. In addition, the new gather and scatter operations simplify vector access to memory by performing an indexed lookup in a single instruction.
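To make the gather and scatter semantics concrete, here is a minimal scalar reference model in Python. It only mimics what the vector instructions compute, lane by lane; the function names `gather` and `scatter` are illustrative and are not the actual intrinsics. A 512-bit register would hold 16 single-precision lanes; the example uses 4 for brevity.

```python
def gather(memory, indices):
    """Read one element per lane: lane k receives memory[indices[k]]."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Write one element per lane: memory[indices[k]] = values[k]."""
    for i, v in zip(indices, values):
        memory[i] = v


# Hypothetical lookup table and index vector, for illustration only.
table = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
lanes = gather(table, [7, 0, 3, 3])   # indices may repeat and be unordered
scatter(table, [1, 2], [-1.0, -2.0])  # writes back to arbitrary positions
```

On the hardware, each of these loops corresponds to a single vector instruction instead of one scalar load or store per lane.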

Memory bound or compute bound

One of the metrics we analyse is the ratio between raw compute performance and memory bandwidth. Asymptotically, this gives the number of operations that can be performed per memory operation. It helps define the boundary between memory-bound and compute-bound problems.

Chip	Bandwidth (GB/s)	Single Precision (GFLOPS)	ratio	Double Precision (GFLOPS)	ratio
SE10P	352	2130	24.2	1065	24.2
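The ratio column can be reproduced with a short calculation. This sketch assumes that bandwidth is expressed in GB/s, compute performance in GFLOPS, and that a "memory operation" means reading one element (4 bytes in single precision, 8 bytes in double precision). These assumptions are inferred from the fact that they reproduce the 24.2 figure; the function name is illustrative.

```python
def ops_per_element(gflops, bandwidth_gb_s, bytes_per_element):
    """Floating-point operations available per element read from memory."""
    # Giga-elements per second that the memory system can deliver.
    elements_per_second = bandwidth_gb_s / bytes_per_element
    return gflops / elements_per_second


sp_ratio = ops_per_element(2130, 352, 4)  # single precision, 4-byte floats
dp_ratio = ops_per_element(1065, 352, 8)  # double precision, 8-byte doubles
```

A kernel that performs fewer operations per element than this ratio is limited by memory bandwidth; one that performs more is limited by compute.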

Bandwidth benchmark

We measure the read bandwidth of the architecture (Intel Xeon Phi) with two tests: ECC enabled and ECC disabled. Which figure applies depends on how critical memory reliability is for the application.

Chip	Peak (GB/s)	ECC (GB/s)	ratio	No-ECC (GB/s)	ratio
SE10P	352	162.08	46.0%	168.04	47.9%
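The bookkeeping behind a read-bandwidth figure is simply bytes read divided by elapsed time. The sketch below illustrates it on the host in Python. It is single-threaded and is not the benchmark used for the table above, so its result is not comparable to those figures; the function name and buffer size are arbitrary choices.

```python
import time


def read_bandwidth_gb_s(n_bytes, repeats=5):
    """Estimate read bandwidth by scanning a buffer and timing the scan."""
    buf = bytearray(n_bytes)  # zero-filled buffer
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        buf.count(b"\x01")  # no match exists, so the whole buffer is read
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)  # keep the fastest run to reduce noise
    return n_bytes / best / 1e9


measured = read_bandwidth_gb_s(16 * 1024 * 1024)
```

The buffer should be larger than the last-level cache; otherwise the measurement reflects cache bandwidth and not memory bandwidth.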

Note on madd and GFLOPS

Not every algorithm can make full use of the madd (multiply-add) operation. In this document, we therefore treat madd as just another kind of floating-point operation. Most architectures execute madd in one cycle, or at least in the same cycle count as add or mul, so we count it as a single flop. Under this convention, the raw compute power of the hardware is half the marketing figure. Algorithms that reconstruct multiply-add instructions from the evaluation graph are widespread in compilers.
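The counting convention can be written down as a short sketch. The `madd_weight` parameter is an illustrative device: 1 corresponds to the convention used in this document, 2 to the marketing convention. The peak figures are taken from the first table.

```python
def flops(n_add, n_mul, n_madd, madd_weight=1):
    """Count floating-point operations; a madd counts as madd_weight flops."""
    return n_add + n_mul + madd_weight * n_madd


# Example: a kernel made of 8 madd instructions and nothing else.
ours = flops(0, 0, 8)                        # madd counted as one flop
marketing = flops(0, 0, 8, madd_weight=2)    # madd counted as mul + add

# Halving the marketing peaks gives the peaks used in this document.
peak_sp = 2130 / 2
peak_dp = 1065 / 2
```

Here `peak_sp` is 1065 GFLOPS and `peak_dp` is 532.5 GFLOPS.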

Compute benchmark

For this benchmark, we use a Taylor expansion of the expm1 function (exp(x) - 1). The number of operations is known exactly, and no branching occurs.
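A minimal sketch of such a kernel follows. The original text does not state the expansion order or the evaluation scheme, so the 12 terms and the Horner form used here are assumptions chosen for illustration. With Horner evaluation, every loop step is one multiply-add, which makes the operation count easy to state.

```python
import math


def expm1_taylor(x, terms=12):
    """Evaluate x + x^2/2! + ... + x^terms/terms! with Horner's scheme."""
    coeffs = [1.0 / math.factorial(k) for k in range(1, terms + 1)]
    acc = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        acc = acc * x + c  # one multiply-add per step, no branch on the data
    return acc * x         # final multiplication restores the leading x


def flops_per_value(terms=12):
    """(terms - 1) madd plus 1 mul, each counted as a single flop."""
    return terms


def achieved_gflops(n_values, terms, seconds):
    """Convert a timed run into GFLOPS under the one-flop madd convention."""
    return n_values * flops_per_value(terms) / seconds / 1e9


approx = expm1_taylor(0.5)
```

For small arguments the result agrees closely with `math.expm1`; the truncated series is not meant to be accurate for large `x`, only to provide a fixed, branch-free amount of work.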

Chip	Peak SP (GFLOPS)	Single Precision (GFLOPS)	ratio	Peak DP (GFLOPS)	Double Precision (GFLOPS)	ratio
SE10P	1065	879	82.5%	533	440	82.5%

Memory-Compute limit revisited

Finally, we revisit the first metric, this time using the achieved performance figures.

Chip	Bandwidth (GB/s)	Single Precision (GFLOPS)	ratio	Double Precision (GFLOPS)	ratio
SE10P	168.02	879	22	440	22

[1] Instruction set available at https://software.intel.com/sites/default/files/forum/278102/327364001en.pdf