IBM announced the latest generation of its Power processor, the Power 8, at the Hot Chips 2013 conference. Altimesh plans to implement support for this processor in the Hybridizer. We present here early results of our first experiments on a sample Power 8 system. Results were obtained at the Montpellier technology center, through remote access to a Power 8 machine. Within a few days of experiments, we were able to reach between 75% and 83% of peak compute and bandwidth performance, with an estimated 80% usage for a computational finance use-case.
Experimental context
We have been given access to a system equipped with Power 8 processors at the Montpellier technology center. The system we had access to holds two 4.116 GHz Power 8 chips with 12 cores each, and all memory banks populated. SMT 8 was activated, and the resulting virtual cores (counting up to 2 * 12 * 8 = 192) are further referred to as cores.
We experimented with hand-written code on a Linux operating system, using the GCC 4.8.2 20140120 (Red Hat 4.8.2-12) compiler.
We study two performance indicators: compute and bandwidth.
Compute benchmark
For compute, we have two experiments. The first is a pure raw-performance test with no practical application, based on the Whetstone benchmark. The second is inspired by a real use-case, in which we implement an approximation of the expm1 function using a Taylor series.
Whetstone
For this Whetstone derivative, we run a very large number of iterations on a small set of input values (256 times the number of cores), so as to exercise the system at maximum usage. The pseudo-code is the following:
```cpp
double Whet(int N, double ix1, double ix2, double ix3, double ix4, double t)
{
    double x1 = ix1;
    double x2 = ix2;
    double x3 = ix3;
    double x4 = ix4;
    double xx;
    for (int j = 0; j < N; ++j)
    {
        xx = x3 - x4;
        x1 = (x1 + x2 + xx) * t;
        x2 = (x1 + x2 - xx) * t;
        xx = x1 - x2;
        x3 = (x3 + x4 + xx) * t;
        x4 = (x3 + x4 - xx) * t;
    }
    return x1 + x2 + x3 + x4;
}
```
Each iteration accounts for 14 floating-point operations: 10 additions or subtractions and 4 multiplications. Note that this algorithm cannot benefit from fused multiply-add, since each multiplication follows the additions rather than combining with one of them.
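For illustration, here is a minimal sketch of the kind of OpenMP driver such a kernel can be run from. The function name, input values and sizes below are illustrative, not the exact benchmark harness; 0.499975 is the damping constant of the classic Whetstone benchmark.

```cpp
#include <omp.h>
#include <vector>

double Whet(int N, double ix1, double ix2, double ix3, double ix4, double t); // kernel above

// Hypothetical driver: 256 inputs per virtual core, each iterated N times,
// so the run stays compute bound and all hardware threads are kept busy.
double run_whetstone(int N)
{
    const int inputs = 256 * omp_get_max_threads();
    std::vector<double> out(inputs);

    #pragma omp parallel for
    for (int i = 0; i < inputs; ++i)
        out[i] = Whet(N, 1.0, -1.0, -1.0, -1.0, 0.499975);

    double sum = 0.0;                       // consume the results so the
    for (int i = 0; i < inputs; ++i)        // compiler cannot discard the work
        sum += out[i];
    return sum;
}
```

Timing such a call and dividing 14 × N × inputs by the elapsed time yields a GFLOPS figure.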
Expm1
For this experiment, we also use an input set that is a large multiple of the number of cores, and we iterate the operation twelve times to ensure we are compute bound rather than limited by global memory access.
The pseudo-code is the following:
```cpp
double expm1(double x)
{
    return ((((((((((((((15.0 + x)
        * x + 210.0)
        * x + 2730.0)
        * x + 32760.0)
        * x + 360360.0)
        * x + 3603600.0)
        * x + 32432400.0)
        * x + 259459200.0)
        * x + 1816214400.0)
        * x + 10897286400.0)
        * x + 54486432000.0)
        * x + 217945728000.0)
        * x + 653837184000.0)
        * x + 1307674368000.0)
        * x * 7.6471637318198164759011319857881e-13;
}
```
Each iteration accounts for 1 add, 2 multiplies and 13 multiply-adds.
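The constants are the ratios 15!/k!, and 7.6471637318198164759011319857881e-13 is 1/15!, so the expression is a Horner evaluation of the truncated series x + x²/2! + … + x¹⁵/15!. Below is a small self-contained check against the standard library (not part of the benchmark, only illustrative; the function is renamed to avoid clashing with libm's expm1):

```cpp
#include <cmath>
#include <cstdio>

// Degree-15 Taylor approximation from the text.
double expm1_taylor(double x)
{
    return ((((((((((((((15.0 + x) * x + 210.0) * x + 2730.0) * x + 32760.0)
        * x + 360360.0) * x + 3603600.0) * x + 32432400.0) * x + 259459200.0)
        * x + 1816214400.0) * x + 10897286400.0) * x + 54486432000.0)
        * x + 217945728000.0) * x + 653837184000.0) * x + 1307674368000.0)
        * x * 7.6471637318198164759011319857881e-13; // 1 / 15!
}

int main()
{
    // The truncated series is only meant for moderate |x|, where the
    // neglected term x^16/16! is small.
    const double xs[] = { -0.5, -0.1, 0.1, 0.5, 1.0 };
    for (unsigned i = 0; i < sizeof(xs) / sizeof(xs[0]); ++i)
        std::printf("x = % .2f   taylor = % .17g   exp(x)-1 = % .17g\n",
                    xs[i], expm1_taylor(xs[i]), std::exp(xs[i]) - 1.0);
    return 0;
}
```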
Fused multiply-add and GFLOPS
Some algorithms, such as our Whetstone test and expm1, cannot inherently benefit 100% from the fused multiply-add. As a result, the obtained FLOPS cannot reach peak, not because of the system but because of the algorithm. In order to best measure usage of the system, we verify the instructions used in the generated assembly and measure performance in complex FLOP operations (CFLOPS), for which a fused multiply-add counts as 1 CFLOP, since the Power 8 can issue this instruction at the same throughput as a multiply or an add.
The Power 8 processor has various vector and scalar units. We assume that the design of the execution pipes is similar to that of the Power 7: reading [1], we see that each core has two execution pipes, each of which can perform a vector complex FLOP (two double-precision fused multiply-adds) per cycle. This amounts to a total of 4 double-precision multiply-adds per cycle per core. For two 4.116 GHz Power 8 chips with 12 cores each, this results in 395.136 GCFLOPS (here, 1 G = 1e9).
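As a worked example of this accounting (the per-core pipe width is our assumption from [1], not a vendor-confirmed specification):

```cpp
#include <cstdio>

int main()
{
    // Assumed peak: 2 execution pipes per core, each completing one 2-wide
    // double-precision fused multiply-add per cycle.
    const double ghz = 4.116, cores = 2 * 12, fma_per_cycle_per_core = 2 * 2;
    std::printf("peak = %.3f GCFLOPS\n", ghz * cores * fma_per_cycle_per_core); // 395.136

    // expm1 kernel: 1 add + 2 mul + 13 fma per call.
    // An fma counts as 2 FLOPs but as 1 CFLOP.
    const double flops  = 1 + 2 + 2 * 13;   // 29 FLOPs per call
    const double cflops = 1 + 2 + 13;       // 16 CFLOPs per call
    // E.g. the 297.94 GCFLOPS measured for expm1 (double) in the table below
    // corresponds to 297.94 * 29 / 16, i.e. about 540 GFLOPS.
    std::printf("%.1f GFLOPS\n", 297.94 * flops / cflops);
    return 0;
}
```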
We ran the tests 20 times, and took the best run.
For the Whetstone test, we present a single test configuration, which is the best we could achieve (the generated code has the best possible configuration: no copy to memory at any point, only the unavoidable instruction dependencies). For the expm1 test, we used different code constructs and present the best obtained result in the table below.
Optimization flags for g++-4.8: -O3 -mvsx -maltivec -fopenmp -mtune=power8 -mcpu=power8 -mpower8-vector
Test | Peak (GCFLOPS) | GFLOPS | GCFLOPS | Ratio |
---|---|---|---|---|
Whetstone | 395.14 | 326.86 | 326.86 | 82.7% |
Expm1 – double | 395.14 | 540.03 | 297.94 | 75.4% |
Expm1 – single | 790.28 | 1041.51 | 574.63 | 72.7% |
Bandwidth Test
We performed three bandwidth tests, inspired by the STREAM benchmark [2]. One is read-only; another is an accumulation (two reads and one write within the same page); the last one is a copy (reads from one page and a write to a different page).
The hardware platform we tested had its memory banks fully populated, with an announced peak of 368 GB/s: 256 GB/s for reads and 128 GB/s for writes.
In this section, since we use GHz to compute bandwidth, we take 1 GB/s = 10^9 bytes/s.
Read test
The read test is the sum of all the elements of a vector. Since compute performance has outpaced bandwidth performance since the nineties, summing all the elements of a vector is memory bound.
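A minimal sketch of such a read kernel, assuming OpenMP and a plain array (illustrative, not the exact benchmark code):

```cpp
#include <cstdlib>

// Illustrative read-bandwidth kernel: sum all the elements of a large array.
double read_test(long n)
{
    double* a = (double*) std::malloc(n * sizeof(double));

    // First touch in parallel: each page is physically allocated on the
    // NUMA node of the thread that touches it, which is also the thread
    // that will read it below (same static schedule).
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] = 1.0;

    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (long i = 0; i < n; ++i)
        sum += a[i];                 // only this loop would be timed

    std::free(a);
    return sum;
}
```

The bandwidth is then n × sizeof(double) divided by the time of the second loop.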
Read/Write in place
This test is a read-write in place: the same page is used for reading and for writing. Here we focus on aggregated versus split bandwidth, that is, whether a read and a write operation can be performed within the same cycle.
Copy
This last test is a copy test, which performs two reads from one location and a write to another location. This test, in conjunction with the read/write in place, will help us understand the behavior of the cache system.
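For clarity, the two remaining access patterns can be sketched as follows. The loop shapes are our reading of the descriptions above, not the exact benchmark code; buffer allocation and placement follow the read test.

```cpp
// Read/write in place ("accumulation"): two streams are read and one of
// them is written back, so the write hits the same pages that were just read.
void accumulate(double* a, const double* b, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] += b[i];            // 2 reads (a, b), 1 write (a)
}

// Copy: reads from one location, write to a distinct one. A plausible shape
// consistent with the counts in the table below (two explicit reads, plus a
// third when the destination lines are loaded before being overwritten).
void copy(double* c, const double* a, const double* b, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        c[i] = a[i] + b[i];      // 2 explicit reads, 1 write to another page
}
```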
Test | Measured time (s) | Reads | Writes | Read time at peak (s) | Write time at peak (s) | Ratio |
---|---|---|---|---|---|---|
Read | 0.004759638 | 1 | 0 | 0.00390625 | 0 | 82% |
R/W (in place) | 0.01162115 | 2 | 1 | 0.0078125 | 0.0078125 | 67% |
R/W (copy) | 0.01635323 | 3 | 1 | 0.01171875 | 0.0078125 | 72% |
The measured read bandwidth is 210.1 GB/s, which is 82% utilization of the peak. Obtaining such performance with a naïve implementation is very satisfactory (note: we needed to distribute memory according to the CPU affinity of the OpenMP threads to achieve this performance on a multi-processor system).
The read/write in place reaches 67% utilization; hence our assumption on concurrent reads and writes is valid: we need to read twice as much data as we can write. However, it seems that we would need a better understanding of the paging system and of cache invalidation to hide the latency induced by cache misses. Maybe some form of prefetching would benefit the system.
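We have not tested software prefetching here; purely as an illustration, hints could be inserted with GCC's __builtin_prefetch, the prefetch distance being a hypothetical tuning parameter:

```cpp
// Hypothetical variant of the in-place kernel with explicit prefetch hints.
void accumulate_prefetch(double* a, const double* b, long n)
{
    const long DIST = 64;                        // elements ahead; to be tuned
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        if (i + DIST < n) {
            __builtin_prefetch(&a[i + DIST], 1, 0);  // rw = 1: will be written
            __builtin_prefetch(&b[i + DIST], 0, 0);  // rw = 0: read only
        }
        a[i] += b[i];
    }
}
```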
The final test proves that we need to load a page before being able to write to it, for cache consistency.
None of these tests exercised transactional memory or any other related feature. That could be a further experiment.
System
- S824, 24 POWER8 cores @ 4.1 GHz
- Fully populated DDR3 memory banks
- 8 threads per core (SMT 8)
- Red Hat Linux operating system
- Open-source GCC compiler