Hybridizer HOWTO – My First Project

Hybridizer is a compiler that lets you run a single version of your C# or Java code on any hardware.
In this tutorial, we explain how to create a first C# project targeting the GPU. We will illustrate with Hybridizer Essentials.

Warning/Disclaimer

We don’t support the entire C# language or .NET Framework. The main known limitations are (see the sketch below):

  • Heap allocation (new A()) from a device thread (except for arrays)
  • System.Collections is not supported
  • The string type is not supported
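To make these limitations concrete, here is a hypothetical device-side method (the method and variable names are illustrative; the EntryPoint attribute and the array-parameter pattern are the ones used later in this tutorial). The commented-out lines show constructs that would be rejected in device code:

        [EntryPoint]
        public static void Scale(double[] data, int n, double factor)
        {
            // Supported: primitive locals, loops, arithmetic, array indexing.
            for (int i = 0; i < n; ++i)
            {
                data[i] *= factor;
            }

            // Not supported in device code:
            // var list = new List<double>();    // System.Collections
            // var obj  = new SomeClass();       // heap allocation of an object
            // string s = "value: " + factor;    // string type
        }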

Prerequisites

Software

You first need to install the following software:

  • Visual Studio 2012, 2013, 2015 or 2017. Warning: in Visual Studio 2017 with CUDA 9.2 or earlier, you need to install the v140 toolset from the Visual Studio Installer.
  • Ensure your Visual Studio installation supports C++ and not just C#.
  • CUDA toolkit 8.0, 9.0, 9.1, 9.2 or 10.0
  • Any version of the Hybridizer, including the free version, Hybridizer Essentials

License

You need to request a Hybridizer Subscription.
Subscriptions are our new licensing model for Hybridizer Essentials. They can migrate from one machine to another (only one machine being authorized at a time).
Trials are unique and attached to your email address, while you can purchase as many commercial subscriptions as you want.
If you have not already purchased one, you can request a trial for Hybridizer Essentials. To do that, click Hybridizer->License Settings in Visual Studio:

hybridizer configuration

If you opted for the trial, provide your email address and click Subscribe:

hybridizer license settings

You should receive your license in your mailbox soon. If not, please contact us or create an issue on GitHub.

Open your mailbox, and select the license text as follows:

hybridizer license mail

Paste this text in the license textbox in Hybridizer Configuration, and click Refresh License.

Hybridizer should validate the subscription, assign a license to your machine, and tell you the following:

hybridizer valid license

First project

You have two options:

Brand new project

From Visual Studio, click File, New, Project. Choose C#, Altimesh:

project template

Build the C# project, then the generated native project, and run.

From existing C# project

First, create a new or open an existing C# console application.

Right click on the project in the solution explorer, and select “Hybridize Project”:

This step will create a native CUDA project and add it to your solution. It will handle the files generated by Hybridizer from your managed C# project.

Fill the requested fields and click “Generate”:

create-satellite-project

If everything worked correctly, several things happened in the background:

  • Your C# project now references Hybridizer.Runtime.CUDAImports. This assembly provides all the necessary attributes to hybridize methods, a CUDA wrapper, and a memory marshaller. We will come back to those in later posts.
  • A native project has been created and added to your solution. This project references two files, hybridizer.generated.cpp and hybridizer.wrappers.cu. The first one will contain a cubin module. The second will export native symbols.
  • If not already existing, an x64 platform configuration has been added to your solution.

Before building anything, change the configuration to x64. 32-bit support is indeed being deprecated by NVIDIA, and nvrtc requires 64 bits.

Create a kernel

In your main class, add the following code:

        [EntryPoint]
        public static void Hello()
        {
            Console.Out.Write("Hello from GPU");
        }

The EntryPoint attribute tells the hybridizer to generate a CUDA kernel, as if you wrote:

__global__ void Hello() {
    printf("Hello from GPU\n");
}

You can now build the C# project, and then the satellite project. You can inspect the generated files to see what the Hybridizer produced:

  • hybridizer.generated.cpp contains a big array of bytes, which is the device code of your kernel.
  • hybridizer.wrappers.cu exports a symbol:
    extern "C" DLL_PUBLIC int ConsoleApplication3x46Programx46Hello_ExternCWrapper_CUDA(...)
    {
    	CUresult cures ;                                                                                 
    	if (__hybridizer__gs_module.module_data == 0)                                                    
    	{                                                                                              
    		cures = cuModuleLoadData (&(__hybridizer__gs_module.module), __hybridizer_cubin_module_data) ; 
    		if (cures != CUDA_SUCCESS) return (int)cures ;                                                 
    	}                                                                                              
    	                                                                                                 
    	CUfunction __hybridizer__cufunc ;                                                                
    	                                                                                                 
    	cures = cuModuleGetFunction (&__hybridizer__cufunc, __hybridizer__gs_module.module, "ConsoleApplication3x46Programx46Hello") ;   
    	if (cures != CUDA_SUCCESS) return (int)cures ;                                                   
    	  // more generated code ...                                                                                                                                             
    	cures = cuLaunchKernel (__hybridizer__cufunc, ...) ; 
    	if (cures != CUDA_SUCCESS) return (int)cures ; 
    	int cudaLaunchRes = (int)::cudaPeekAtLastError ();                                                                                                     
    	if (cudaLaunchRes != 0) return cudaLaunchRes;                                                                                                          
    	int __synchronizeRes = (int)::cudaDeviceSynchronize () ;                                                                                               
    	return __synchronizeRes ;                                                                                                                              
    
    }
    

Run it

In your main method, add the following boilerplate code:


        static void Main(string[] args)
        {
            cuda.DeviceSynchronize();
            HybRunner runner = HybRunner.Cuda("ConsoleApplication3_CUDA.vs2015.dll").SetDistrib(1, 2);
            runner.Wrap(new Program()).Hello();
        }

with the appropriate generated dll name. This code:

  • registers the generated dll as a CUDA dll: HybRunner.Cuda("ConsoleApplication3_CUDA.vs2015.dll")
  • configures kernel calls to run with 1 block of 2 threads: SetDistrib(1, 2)
  • registers a Program instance as a kernel container: runner.Wrap(new Program())
  • runs the generated method (a data-parallel variant is sketched below).
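
The Hello kernel above runs the same code in every launched thread. As a hedged sketch (method and variable names are illustrative; the threadIdx/blockIdx/blockDim/gridDim intrinsics are the ones exposed by Hybridizer.Runtime.CUDAImports in the official samples), a more typical kernel distributes work over an array and is invoked the same way:

        [EntryPoint]
        public static void Add(double[] a, double[] b, int n)
        {
            // Grid-stride loop: each thread handles indices spaced by the
            // total number of launched threads (blockDim.x * gridDim.x).
            for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += blockDim.x * gridDim.x)
            {
                a[i] += b[i];
            }
        }

        static void Main(string[] args)
        {
            const int N = 1024 * 1024;
            double[] a = new double[N];
            double[] b = new double[N];
            for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0 * i; }

            HybRunner runner = HybRunner.Cuda("ConsoleApplication3_CUDA.vs2015.dll").SetDistrib(16, 128);
            // Array arguments are handled by the memory marshaller mentioned above.
            runner.Wrap(new Program()).Add(a, b, N);
        }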

Then run:

hello from gpu

Congratulations! You just successfully ran your first C# kernel on the GPU!

IBM Power 8

IBM announced the latest generation of its Power processor, the IBM Power 8, at the Hot Chips 2013 conference. Altimesh plans to implement support for that Power processor in the Hybridizer. We present here early results of our first experiments on a sample Power 8 processor. Results were obtained in the technology center of Montpellier, with remote access to a Power 8 machine. Within a few days of experimentation, we were able to reach between 75% and 83% of peak compute and bandwidth performance, with an estimated 80% usage for a computational finance use case.

Experiments context

We have been given access to a system equipped with Power 8 processors at the Montpellier technology center. The system we had access to holds two 4.116 GHz Power 8 chips, with 12 cores each, and all memory banks filled. SMT 8 was activated, and virtual cores (counting up to 2*12*8 = 192) are further referred to as cores.

We experimented with hand-written code on a Linux operating system, using the GCC 4.8.2 20140120 (Red Hat 4.8.2-12) compiler.

We study two performance indicators: compute and bandwidth.

Compute benchmark

For compute, we have two experiments: the first is a pure raw-performance test which has no application in practice; it is based on the Whetstone benchmark. The second is inspired by a real use case where we implement an approximation of the expm1 function using a Taylor series.

Whetstone

For this Whetstone derivative, we run a very large number of iterations on a small set of input values (256 times the number of cores), to account for a maximum usage of the system. The pseudo-code is the following:

double Whet (int N, double ix1, double ix2,
                    double ix3, double ix4, double t)
{
    double x1 = ix1;
    double x2 = ix2;
    double x3 = ix3;
    double x4 = ix4;
    double xx;
    for (int j = 0; j < N; ++j)
    {
        xx = x3 - x4;
        x1 = (x1 + x2 + xx) * t;
        x2 = (x1 + x2 - xx) * t;
        xx = x1 - x2;
        x3 = (x3 + x4 + xx) * t;
        x4 = (x3 + x4 - xx) * t;
    }

    return x1 + x2 + x3 + x4;
}

Each iteration accounts for 14 floating-point operations: 10 adds or subtracts and 4 multiplies. Note that this algorithm cannot benefit from fused multiply-add.

Expm1

For this experiment, we also use a large multiple of the number of cores, and iterate the operation twelve times to ensure we are compute bound with respect to global memory access.

The pseudo-code is the following:

double expm1(double x)
{
            return ((((((((((((((15.0 + x)
                * x + 210.0)
                * x + 2730.0)
                * x + 32760.0)
                * x + 360360.0)
                * x + 3603600.0)
                * x + 32432400.0)
                * x + 259459200.0)
                * x + 1816214400.0)
                * x + 10897286400.0)
                * x + 54486432000.0)
                * x + 217945728000.0)
                * x + 653837184000.0)
                * x + 1307674368000.0)
                * x * 7.6471637318198164759011319857881e-13;
}

Each iteration accounts for 1 add, 2 muls and 13 multiply-adds.

Fused multiply add and GFLOPS

Some algorithms, such as our Whetstone test and expm1, cannot inherently benefit 100% from the fused multiply-add. As a result, the obtained FLOPS cannot reach peak, not because of the system but because of the algorithm. In order to best measure usage of the system, we verify the instructions used in the assembly, and measure performance in complex flop operations (CFLOPS), for which a fused multiply-add counts as 1 CFLOP, since the Power 8 can execute this instruction at the same throughput as a mul or an add.

The Power 8 processor has various vector and scalar units. We assume that the design of the execution pipes is similar to the design of the Power 7: reading [1], we can see that each core has two execution pipes, each of which can perform a complex FLOP in one cycle. This holds for a total of 4 double precision multiply-adds per cycle per core. For two 4.116 GHz Power 8 chips with 12 cores each, this results in 395.136 GCFLOPS (here, 1 G = 1e9).
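
Spelling out the arithmetic behind this figure:

\[ 2 \;\text{chips} \times 12 \;\text{cores} \times 4 \;\tfrac{\text{madd}}{\text{cycle}\cdot\text{core}} \times 4.116 \;\text{GHz} = 395.136 \;\text{GCFLOPS} \]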

We ran the tests 20 times, and took the best run.

For the Whetstone test, we present a single test configuration, which is the best we could achieve (the generated code already has the best configuration: no copy to memory at any point, but obviously some instruction dependency). For the expm1 test, we used different code constructs and present the best obtained result in the table below.

Optimization flags for g++-4.8: -O3 -mvsx -maltivec -fopenmp -mtune=power8 -mcpu=power8 -mpower8-vector

Test           | Peak (GCFLOPS) | GFLOPS  | GCFLOPS | Ratio
Whetstone      | 395.14         | 326.86  | 326.86  | 82.7%
Expm1 – double | 395.14         | 540.03  | 297.94  | 75.4%
Expm1 – single | 790.28         | 1041.51 | 574.63  | 72.7%
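
As a consistency check between the operation counts given above and this table: one expm1 evaluation is 1 add + 2 muls + 13 multiply-adds, i.e. 16 complex flops, or 29 flops when a multiply-add is counted as two operations. The GFLOPS and GCFLOPS columns should therefore differ by a factor of 29/16, which they do; Whetstone has no multiply-adds, so its two columns are equal.

\[ \frac{\text{GFLOPS}}{\text{GCFLOPS}} = \frac{1 + 2 + 2\times 13}{1 + 2 + 13} = \frac{29}{16} \approx 1.81, \qquad \frac{540.03}{297.94} \approx 1.81 \]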

Bandwidth Test

We performed three bandwidth tests, inspired by the STREAM benchmark [2]. One is read-only, another is accumulation (two reads, one write in the same page); the last one is copy (one read and one write in different pages).

The hardware platform we tested had its memory banks filled, with an announced peak of 368 GB/s: 256 GB/s for reads, and 128 GB/s for writes.

In this section, since we use GHz to compute bandwidth, we use 10^9 bytes/s = 1 GB/s.

Read test

The read test is the sum of all elements within a vector. Since the nineties, compute performance has outpaced bandwidth performance, so summing all the elements of a vector is memory bound.

Read/Write in place

This test is a read-write in place: the same page is used for reading and for writing. In this test, we focus on aggregated bandwidth versus split bandwidth, that is: whether it is possible to perform a read and a write operation within the same cycle.

Copy

This last test is a copy test which performs two reads in one location, and a write in another location. This test, in conjunction with the read/write in place test, will help us understand the behavior of the cache system.

Test           | Time (s)    | Reads | Writes | Read time (s) | Write time (s) | Ratio
Read           | 0.004759638 | 1     | 0      | 0.00390625    | 0              | 82%
R/W (in place) | 0.01162115  | 2     | 1      | 0.0078125     | 0.0078125      | 67%
R/W (copy)     | 0.01635323  | 3     | 1      | 0.01171875    | 0.0078125      | 72%

The measured read bandwidth is 210.1 GB/s, which is 82% utilization of the peak. Obtaining such performance with a naïve implementation is very satisfactory (note: we needed to distribute memory according to the CPU affinity of the OpenMP threads to achieve such performance on a multi-processor system).

The read/write in place test achieves 67%; hence our assumption of concurrent read-write is valid. We need to read twice as much data as we can write. However, it seems that we need a better understanding of the implementation of the paging system and cache invalidation to hide the latency induced by cache misses. Maybe some form of prefetching would benefit the system.

The final test proves that we need to load a page before being able to write to it, for cache consistency.

None of these tests exercised transactional memory or any other related feature. That could be a further experiment.

System

- S824: 24 POWER8 cores @ 4.1 GHz
- Fully populated DDR3 memory banks
- 8 threads per core
- RedHat Linux operating system
- Open-source gcc compiler

NVIDIA Kepler

The Kepler K20 [1] is made of SMXs, which are most closely comparable to CPU cores. Each SMX has its own cache, instruction dispatch units, and memory interface. A Kepler SMX (there are 14 on the K20X) holds 192 single-precision floating-point units, each of which can do a multiply-add in a single cycle (732 MHz clock on the K20X). As a result, the announced peak performance is 3.95 Tflops. It also holds 64 double-precision floating-point units with the same instruction throughput, for an announced 1.31 Tflops.

Work distribution on a Kepler is organized in warps of 32 entries. Since each thread within a warp performs the same operation, with potential skipping, we can risk an analogy with CPU vector units (current AVX systems have 8 single-precision lanes). Each SMX has four warp schedulers with two dispatch units each, so each warp scheduler can issue up to two independent instructions per cycle [2].

Each SMX can run several contexts at the same time. This context distribution is somewhat flexible, but works best if the instructions are the same (note the single instruction cache per SMX). The total number of “threads” run at the same time is up to 2048, which amounts to 64 warps at the same time. Hiding the latency of some operations (such as memory accesses) requires maximizing the number of warps active at the same time.

Note that the number of registers available is 2 Mbits per SMX, for a rough total of 26 Mbits on a K20c. This large register file has to be shared amongst the active warps, narrowing it down to 1024 bits per entry, that is, 32 registers of 32 bits.

Memory bound or compute bound

One of the metrics we analyse is the ratio between raw compute performance and memory bandwidth. This provides, as an asymptotic behaviour, the number of operations that can be performed per memory operation. It helps define the limit between memory-bound and compute-bound problems (a worked example follows the table below).

Chip | Bandwidth (GB/s) | Single precision (GFLOPS) | Ratio | Double precision (GFLOPS) | Ratio
K20C | 208              | 3519                      | 67.7  | 1173                      | 45.1
K20X | 250              | 3951                      | 63.2  | 1317                      | 42.1
K40  | 288              | 4291                      | 59.6  | 1430                      | 39.7
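
For instance, for the K20C line above, counting 4 bytes per single-precision value and 8 bytes per double-precision value:

\[ \frac{3519 \;\text{GFLOPS}}{208 \;\text{GB/s} \,/\, 4 \;\text{B}} \approx 67.7, \qquad \frac{1173 \;\text{GFLOPS}}{208 \;\text{GB/s} \,/\, 8 \;\text{B}} \approx 45.1 \]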

Bandwidth benchmark

We analyse the read bandwidth of the architecture, with two tests: ECC and no-ECC, depending on how critical memory reliability is.

Chip | Peak (GB/s) | ECC (GB/s) | Ratio | No-ECC (GB/s) | Ratio
K20C | 208         | 154.30     | 74.2% | 184.99        | 88.9%
K20X | 250         | 182.68     | 73.2% | 220.12        | 88.2%
K40  | 288         | 192.65     | 68.6% | 217.29        | 81.0%

Note on madd and GFLOPS

Not every algorithm can make full use of the madd operation. In this document, we rather consider madd as just another kind of floating-point operation. Most architectures have a one-cycle madd, or at least the same cycle count as an add or a mul; we thus count it as a single flop. In this respect, the raw compute power of the hardware is halved compared to marketing figures. Algorithms that reconstruct multiply-add instructions from the evaluation graph are widespread in compilers.
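
For instance, the K20C peaks used in the compute benchmark below are (approximately) the marketing figures of the first table halved:

\[ 3519 / 2 \approx 1760 \;\text{GFLOPS (SP)}, \qquad 1173 / 2 \approx 586 \;\text{GFLOPS (DP)} \]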

Compute benchmark

For this benchmark, we use a Taylor expansion of the expm1 function. We know the number of operations, and no branching occurs.

NVIDIA Kepler

On Kepler, there are 4 warp schedulers and 6 warp instruction units. Hence, using more than 66.6% of the hardware requires Instruction Level Parallelism (ILP). This feature is not exposed programmatically; we rather need to provide the compiler and the driver with opportunities to use it (an illustrative construct is sketched after the table below).

Chip | Peak SP (GFLOPS) | Single precision (GFLOPS) | Ratio | Peak DP (GFLOPS) | Double precision (GFLOPS) | Ratio
K20C | 1760             | 1418                      | 80.6% | 586              | 540                       | 92.2%
K20X | 1968             | 1599                      | 81.3% | 656              | 591                       | 90.1%
K40  | 2146             | 1608                      | 74.9% | 715              | 632                       | 88.4%
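
To make the ILP remark above concrete, here is an illustrative sketch only (written with the Hybridizer C# intrinsics used in the first part of this document, not the actual benchmark code): each thread carries two independent dependency chains, so consecutive floating-point instructions do not depend on each other and the scheduler has independent instructions to dual-issue.

        [EntryPoint]
        public static void TwoChains(double[] x, double[] y, int n)
        {
            int half = n / 2;
            int step = blockDim.x * gridDim.x;
            for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < half; i += step)
            {
                double a = x[i];
                double b = x[i + half];
                for (int k = 0; k < 12; ++k)
                {
                    // Two independent chains: the b-updates never read a, and
                    // vice versa, exposing instruction level parallelism.
                    a = a * 0.5 + 1.0;
                    b = b * 0.5 + 1.0;
                }
                y[i] = a;
                y[i + half] = b;
            }
        }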

Memory-Compute limit revisited

We finally revisit the first metric, with the achieved performances.

Chip Bandwidth Single Precision Ratio Double Precision Ratio
K20C 154.30 1418 37 540 28
K20X 182.68 1599 35 591 26
K40 197.65 1608 33 632 26

[2] “Furthermore, some degree of ILP in conjunction with TLP is required by Kepler GPUs in order to approach peak performance, since SMX’s warp scheduler issues one or two independent instructions from each of four warps per clock.” – from the Kepler Tuning Guide

Intel Xeon PHI

The Intel Xeon PHI is an implementation of the MIC (Many Integrated Core) architecture.

It holds several independent cores (61 in our setup), with 512-bit vector units [1]. Each core is hyper-threaded with up to four threads. Vector operations are very similar to SSE or AVX, yet much more complete. Moreover, the new gather and scatter operations ease vector access to memory, performing a lookup in a single instruction.

Memory bound or compute bound

One of the metrics we analyse is the ratio between raw compute performance and memory bandwidth. This provides, as an asymptotic behaviour, the number of operations that can be performed per memory operation. It helps define the limit between memory-bound and compute-bound problems.

Chip  | Bandwidth (GB/s) | Single precision (GFLOPS) | Ratio | Double precision (GFLOPS) | Ratio
SE10P | 352              | 2130                      | 24.2  | 1065                      | 24.2

Bandwidth benchmark

We analyse the read bandwidth of the architecture (Intel Xeon Phi), with two tests: ECC and no-ECC, depending on how critical memory reliability is.

Chip  | Peak (GB/s) | ECC (GB/s) | Ratio | No-ECC (GB/s) | Ratio
SE10P | 352         | 162.08     | 46.0% | 168.04        | 47.9%

Note on madd and GFLOPS

Not every algorithm can make full use of the madd operation. In this document, we rather consider madd as just another kind of floating-point operation. Most architectures have a one-cycle madd, or at least the same cycle count as an add or a mul; we thus count it as a single flop. In this respect, the raw compute power of the hardware is halved compared to marketing figures. Algorithms that reconstruct multiply-add instructions from the evaluation graph are widespread in compilers.

Compute benchmark

For this benchmark, we use a Taylor expansion of the expm1 function. We know the number of operations, and no branching occurs.

Chip  | Peak SP (GFLOPS) | Single precision (GFLOPS) | Ratio | Peak DP (GFLOPS) | Double precision (GFLOPS) | Ratio
SE10P | 1065             | 879                       | 82.5% | 533              | 440                       | 82.5%

Memory-Compute limit revisited

We finally revisit the first metric, with the achieved performances.

Chip  | Bandwidth (GB/s) | Single precision (GFLOPS) | Ratio | Double precision (GFLOPS) | Ratio
SE10P | 168.02           | 879                       | 22    | 440                       | 22