Hybridizer HOWTO — Printf and Builtins

The concept of intrinsics allows extensions and finer control over how code is generated. We also extend this concept to existing methods whose code we do not control: for those, the equivalent of the attribute is described in a builtin file. This makes it possible to call Console.Out.Write in C# and generate a printf in the CUDA code, or to call System.Math.Exp and have it replaced by exp from cmath.


[EntryPoint("TestPrintf")]
public void testPrintf()
{
    Console.Out.WriteLine("Comment from Thread {0} Block {1}",
        threadIdx.x, blockIdx.x);
}

[EntryPoint("TestExp")]
public void TestExp()
{
    double exp = System.Math.Exp(1.0);
}
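
Invoking these entry points follows the same pattern as the other samples in this series. Here is a minimal sketch; the DLL name, the grid configuration and the Program class holding the methods are placeholders, not part of the sample above:

// wrap an instance of the class containing the entry points (names are placeholders)
HybRunner runner = HybRunner.Cuda("Builtins_CUDA.dll").SetDistrib(4, 32);
dynamic wrapped = runner.Wrap(new Program());
wrapped.testPrintf();          // each CUDA thread prints its thread and block index
cuda.DeviceSynchronize();      // ensure device-side printf output is flushed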

Hybridizer HOWTO — Libraries Integration

It is also possible to use/integrate existing libraries for which device functions are defined, hence extending the concept of intrinsic functions to intrinsic types.

Note that in that case, the functions do not need an implementation if no behavior is expected in plain C#.


[IntrinsicType("curandStateMRG32k3a_t")]
[IntrinsicIncludeCUDA("curand_kernel.h")]
[StructLayout(LayoutKind.Sequential)]
public unsafe struct curandStateMRG32k3a_t
{
    public fixed double s1[3];
    public fixed double s2[3];
    public int boxmuller_flag;
    public int boxmuller_flag_double;
    public float boxmuller_extra;
    public double boxmuller_extra_double;
    [IntrinsicFunction("curand_init")]
    public static void curand_init(ulong seed,
        ulong subsequence, ulong offset,
        out curandStateMRG32k3a_t state)
    { throw new NotImplementedException(); }
    [IntrinsicFunction("curand")] public uint curand()
    { throw new NotImplementedException(); }
    [IntrinsicFunction("curand_log_normal")]
    public float curand_log_normal(float mean, float stdev)
    { throw new NotImplementedException(); }

} 
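
Once the intrinsic type is declared, device code can use it like any other struct. Below is a hedged sketch of a kernel drawing log-normal samples with it; the entry point name, signature and grid-stride loop are illustrative and not part of the library sample:

[EntryPoint]
public static void GenerateLogNormal(float[] result, int N, ulong seed)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    // one generator state per thread, each on its own subsequence (illustrative choice)
    curandStateMRG32k3a_t state;
    curandStateMRG32k3a_t.curand_init(seed, (ulong)tid, 0, out state);
    for (int k = tid; k < N; k += blockDim.x * gridDim.x)
    {
        result[k] = state.curand_log_normal(0.0f, 1.0f);
    }
}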

Hybridizer HOWTO — Intrinsics

It’s often useful to call intrinsics or builtin functions provided by CUDA. You might also already have a highly optimized CUDA header which you’d like to reuse from your C# application. This can be done with the IntrinsicFunction attribute: when generating the source code, the function call is replaced by the name given in the attribute, and the body of the C# function is ignored.

Intrinsic functions

Consider this code sample:

class IntrinsicFunction
{
    [IntrinsicFunction("printf")]
    public static void printf(string format, double val)
    {
        Console.WriteLine(val);
    }

    [IntrinsicFunction("erf")]
    private static double Erf(double x)
    {
        double ax = x > 0.0 ? x : -x;
        const double a1 = 0.254829592;
        const double a2 = -0.284496736;
        const double a3 = 1.421413741;
        const double a4 = -1.453152027;
        const double a5 = 1.061405429;
        const double p = 0.3275911;
        double t = 1.0 / (1.0 + p * ax);
        double y = 1.0 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.Exp(-ax * ax);

        return x > 0.0 ? y : -y;
    }

    [EntryPoint]
    public static void run()
    {
        printf("%.17lf\n", Erf(1.0));
    }

    public static void Run()
    {
        Console.WriteLine("IntrinsicFunction :: ");
        Console.WriteLine(":: C# :: ");
        run();

        HybRunner runner = HybRunner.Cuda("ConsoleApplication96_CUDA.vs2015.dll").SetDistrib(1, 1);
        dynamic wrapped = runner.Wrap(new IntrinsicFunction());

        Console.WriteLine(":: CUDA :: ");
        cuda.DeviceSynchronize();
        wrapped.run();
    }
}

The C# implementation of Erf is too simple to be accurate; it’s better to rely on the CUDA implementation.
To do that, we decorate the C# function with the IntrinsicFunction attribute. Hybridizer reads the name property of that attribute and replaces calls to Erf with calls to the native erf function from cmath.

Intrinsic types

Sometimes we already have a good native CUDA implementation of a function. In that case we don’t want Hybridizer to process our C# symbol; we would rather have it use the one we provide in a custom header.

[IntrinsicInclude("myheader.cuh")]
    class IntrinsicType
    {
        [IntrinsicFunction("myfunction")]
        private static double myfunction(double x)
        {
            return 42.0;
        }

        [IntrinsicFunction("printf")]
        public static void printf(string format, double val)
        {
            Console.WriteLine(val);
        }

        [EntryPoint]
        public static void run()
        {
            printf("%.17lf\n", myfunction(3.0));
        }

        public static void Run()
        {
            Console.WriteLine("IntrinsicType :: ");
            Console.WriteLine(":: C# :: ");
            run();

            HybRunner runner = HybRunner.Cuda("ConsoleApplication96_CUDA.vs2015.dll").SetDistrib(1, 1);
            dynamic wrapped = runner.Wrap(new IntrinsicType());

            Console.WriteLine(":: CUDA :: ");
            cuda.DeviceSynchronize();
            wrapped.run();
        }
    }

The IntrinsicInclude attribute on the class tells Hybridizer to include this header in the generated file.
We therefore write a custom header:
#pragma once
__device__ inline double myfunction(double x) {
	return x * x + 2.0;
}

The C# version of myfunction won’t be processed, and myheader.cuh will be included.
With the Hybridizer Software Suite, no further modification is needed. However, Hybridizer Essentials relies on nvrtc, which doesn’t include headers automatically. You then need to provide the header paths to Hybridizer, using the “Additional JITTER Headers” option:
[Screenshot: the “Additional JITTER Headers” project option]

Results

You can download the above code from this archive. Running it shows that the C# code has been properly replaced by native functions:

[Screenshot: output of the intrinsic calls]

Hybridizer HOWTO — Hello World

Our hello world is the addition of two vectors. The C# code is downloadable from our GitHub.

Hello World: simple work distribution

We start with a simple way to express parallelism: the Parallel.For construct, which is natively provided by .NET. We place the EntryPoint attribute on the method to trigger hybridization:

[EntryPoint]
public static void VectorAdd(double[] a, double[] b, int N)
{
    Parallel.For(0, N, (i) => { a[i] += b[i]; });
}

As usual, we need to invoke this method with some boilerplate code.
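
A minimal sketch of that boilerplate follows, assuming the generated assembly is named HelloWorld_CUDA.dll (as in the grid-configuration snippet further down) and that VectorAdd lives in a class named Program, both of which are placeholders:

const int N = 1024 * 1024;
double[] a = new double[N];
double[] b = new double[N];
// ... fill a and b with input data ...

// placeholder grid: 32 blocks of 128 threads
HybRunner runner = HybRunner.Cuda("HelloWorld_CUDA.dll").SetDistrib(32, 128);
dynamic wrapped = runner.Wrap(new Program());
wrapped.VectorAdd(a, b, N);    // runs the hybridized kernel on the GPU
cuda.DeviceSynchronize();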

Hello World: explicit work distribution

We can also distribute work explicitly, using a CUDA-like syntax: threadIdx/blockDim and blockIdx/gridDim. This is customizable and the names can be changed, but the concept is the same:


[EntryPoint]
public static void VectorAdd(double[] a, double[] b, int N)
{
    for (int k = threadIdx.x + blockDim.x * blockIdx.x;
        k < N; k += blockDim.x * gridDim.x)
    {
        a[k] += b[k];
    }
}

Explicit work distribution can be used, for example, to distribute work across a 2D grid, as in the sketch below.
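
A hedged sketch of such a kernel, adding two row-major matrices using both grid dimensions; the names and data layout are illustrative:

[EntryPoint]
public static void MatrixAdd(double[] a, double[] b, int rows, int cols)
{
    // the y dimension walks the rows, the x dimension walks the columns
    for (int j = threadIdx.y + blockDim.y * blockIdx.y; j < rows; j += blockDim.y * gridDim.y)
    {
        for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < cols; i += blockDim.x * gridDim.x)
        {
            a[j * cols + i] += b[j * cols + i];
        }
    }
}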

Grid configuration

To achieve high bandwidth, we need to configure the grid properly. Using enough blocks and threads increases occupancy and hides latency by running concurrent blocks. We do that as we would in CUDA:

cudaDeviceProp prop;
cuda.GetDeviceProperties(out prop, 0);
HybRunner runner = HybRunner.Cuda("HelloWorld_CUDA.dll").SetDistrib(prop.multiProcessorCount * 16, 128);

Performance measurements

We can now compile this in Release|x64 and profile the execution with Nsight. We reach very high occupancy:

[Screenshot: Hello World occupancy in Nsight]

We reach 337.8 GB/s on a GTX 1080 Ti (Pascal), which is 96% of the bandwidth test result on this GPU:

[Screenshot: Hello World bandwidth on the GTX 1080 Ti]