The Kepler K20[1] is built from SMX units (Streaming Multiprocessors), which are best compared to CPU cores. Each SMX has its own cache, instruction dispatch units, and memory interface. A Kepler SMX (the K20X has 14 of them) holds 192 single-precision floating-point units, each of which can perform a multiply-add per cycle (the K20X is clocked at 732 MHz). The announced peak performance is therefore 3.95 Tflops. Each SMX also holds 64 double-precision floating-point units with the same instruction throughput, for an announced 1.31 Tflops.
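As a sanity check, the announced peaks can be reproduced from these figures; a minimal Python sketch using the unit counts and clock quoted above:

```python
def peak_gflops(n_smx, units_per_smx, clock_ghz):
    # Marketing peaks count a one-cycle multiply-add as 2 flops.
    return n_smx * units_per_smx * 2 * clock_ghz

# K20X: 14 SMX, 732 MHz
sp = peak_gflops(14, 192, 0.732)  # single precision: ~3935 Gflops, i.e. ~3.95 Tflops
dp = peak_gflops(14, 64, 0.732)   # double precision: ~1312 Gflops, i.e. ~1.31 Tflops
print(round(sp), round(dp))
```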
Work distribution on Kepler is organized in warps of 32 entries. Since every thread within a warp executes the same operation, with predication to skip inactive lanes, we can risk an analogy with CPU vector units (current AVX systems have 8 single-precision lanes). Each SMX has four warp schedulers with two dispatch units each, so up to two independent instructions can be issued per warp per cycle[2].
Each SMX can run several contexts at the same time. This context distribution is somewhat flexible, but works best when the instructions are the same (note the single instruction cache per SMX). Up to 2048 “threads” can run at the same time, which amounts to 64 simultaneous warps. Hiding the latency of some operations (such as memory accesses) requires maximizing the number of warps active at the same time.
Note that 2 Mbits of registers are available per SMX (65536 registers of 32 bits), for a rough total of 26 Mbits on a K20c. This large register file has to be shared among the active threads: at full occupancy that narrows it down to 1024 bits per entry, that is, 32 registers of 32 bits.
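The per-thread register budget at full occupancy follows directly; a small sketch of the arithmetic:

```python
register_file_bits = 2 * 1024 * 1024   # 2 Mbits per SMX (65536 registers of 32 bits)
max_threads = 2048                     # maximum resident threads per SMX

bits_per_thread = register_file_bits // max_threads   # 1024 bits per thread
regs_per_thread = bits_per_thread // 32               # 32 registers of 32 bits
print(bits_per_thread, regs_per_thread)
```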
Memory bound or compute bound
One of the metrics we analyse is the ratio between raw compute performance and memory bandwidth. As an asymptotic behaviour, it gives the number of operations that can be performed per memory operation, and helps define the boundary between memory-bound and compute-bound problems.
Chip | Bandwidth (GB/s) | Single Precision (Gflops) | Ratio | Double Precision (Gflops) | Ratio |
---|---|---|---|---|---|
K20C | 208 | 3519 | 67.7 | 1173 | 45.1 |
K20X | 250 | 3951 | 63.2 | 1317 | 42.1 |
K40 | 288 | 4291 | 59.6 | 1430 | 39.7 |
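The Ratio columns are the peak flop rate divided by the rate at which operands can be loaded; a sketch of that computation for the K20C row:

```python
def flops_per_load(gflops, bandwidth_gbs, word_bytes):
    # Loads per second = bandwidth / word size; the ratio is flops per value loaded.
    return gflops / (bandwidth_gbs / word_bytes)

sp = flops_per_load(3519, 208, 4)  # single precision (4-byte words): ~67.7
dp = flops_per_load(1173, 208, 8)  # double precision (8-byte words): ~45.1
```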
Bandwidth benchmark
We analyse the read bandwidth of the architecture with two tests, ECC on and ECC off, depending on how critical memory reliability is for the application.
Chip | Peak (GB/s) | ECC (GB/s) | Ratio | No-ECC (GB/s) | Ratio |
---|---|---|---|---|---|
K20C | 208 | 154.30 | 74.2% | 184.99 | 88.9% |
K20X | 250 | 182.68 | 73.2% | 220.12 | 88.2% |
K40 | 288 | 192.65 | 68.6% | 217.29 | 81.0% |
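From these figures we can also read the cost of ECC itself, i.e. the fraction of no-ECC bandwidth retained when ECC is on; a sketch for the K20C row:

```python
peak, ecc, no_ecc = 208.0, 154.30, 184.99  # GB/s, K20C row of the table

ecc_efficiency = ecc / peak    # ~74.2% of the theoretical peak
ecc_cost = ecc / no_ecc        # ECC retains ~83% of the no-ECC bandwidth
print(ecc_efficiency, ecc_cost)
```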
Note on madd and GFLOPS
Not every algorithm can make full use of the madd operation. In this document we rather treat madd as just another kind of floating-point operation. Most architectures execute madd in one cycle, or at least in the same cycle count as add or mul; we therefore count it as a single flop. Under this convention, the raw compute power of the hardware is half of the marketing figures. Algorithms that reconstruct multiply-add instructions from the evaluation graph are widespread in compilers.
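Under this counting convention, the peak figures used in the following benchmark are simply the marketing peaks divided by two (K20C shown; the tables round to the nearest Gflops):

```python
marketing_sp, marketing_dp = 3519, 1173  # K20C Gflops, madd counted as 2 flops

peak_sp = marketing_sp / 2  # ~1760 Gflops with madd counted as a single flop
peak_dp = marketing_dp / 2  # ~586 Gflops
print(peak_sp, peak_dp)
```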
Compute benchmark
For this benchmark, we use a Taylor expansion of the expm1 function. We know the number of operations, and no branching occurs.
On Kepler, an SMX has 4 warp schedulers feeding 6 groups of execution units (6 warp instructions can start per cycle). Hence using more than 66.6% of the hardware requires dual issue, that is, Instruction Level Parallelism (ILP). This feature cannot be requested programmatically; we must instead give the compiler and the driver opportunities to exploit it.
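The actual benchmark kernel runs on the GPU; as an illustration of the evaluation scheme, here is a minimal Python sketch of the expm1 Taylor expansion in Horner form, where each loop iteration is exactly one multiply-add (the function name and term count are ours, not the benchmark's):

```python
import math

def expm1_taylor(x, terms=10):
    # expm1(x) = x * (1/1! + x/2! + x^2/3! + ...), evaluated in Horner form:
    # every iteration is a single multiply-add, so the flop count is known
    # in advance and no branching depends on the data.
    acc = 1.0 / math.factorial(terms)
    for n in range(terms - 1, 0, -1):
        acc = acc * x + 1.0 / math.factorial(n)  # one madd per term
    return acc * x

print(expm1_taylor(0.1), math.expm1(0.1))
```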
Chip | Peak (SP, Gflops) | Single Precision (Gflops) | Ratio | Peak (DP, Gflops) | Double Precision (Gflops) | Ratio |
---|---|---|---|---|---|---|
K20C | 1760 | 1418 | 80.6% | 586 | 540 | 92.2% |
K20X | 1968 | 1599 | 81.3% | 656 | 591 | 90.1% |
K40 | 2146 | 1608 | 74.9% | 715 | 632 | 88.4% |
Memory-Compute limit revisited
We finally revisit the first metric, using the measured bandwidth and the achieved compute performance.
Chip | Bandwidth (GB/s) | Single Precision (Gflops) | Ratio | Double Precision (Gflops) | Ratio |
---|---|---|---|---|---|
K20C | 154.30 | 1418 | 37 | 540 | 28 |
K20X | 182.68 | 1599 | 35 | 591 | 26 |
K40 | 192.65 | 1608 | 33 | 632 | 26 |
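The achieved ratios follow from the same division as the theoretical ones, operations per value loaded; a sketch for the K20C row:

```python
def flops_per_load(gflops, bandwidth_gbs, word_bytes):
    # Achieved flops per value actually loaded from memory.
    return gflops / (bandwidth_gbs / word_bytes)

sp = flops_per_load(1418, 154.30, 4)  # single precision: ~37
dp = flops_per_load(540, 154.30, 8)   # double precision: ~28
```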