November 2017 - Altimesh

Fractals demo application

Posted on November 17, 2017 by regis portalez

We just released a demo app for fractal rendering. We hybridize the same C# code into several flavors.

Hybridizer : Fractal demo application

Posted on November 17, 2017 (updated November 20, 2017) by regis portalez

As a way to demonstrate Hybridizer’s capabilities, we wrote a simple Windows demo application for fractal rendering.

Demo application

Compute code is written in C# and features floating point arithmetic, function calls, conditions, loops, Parallel.For and bit manipulations:

[Kernel]
public static int IterCount(double cx, double cy, int maxiter)
{
    int result = 0;
    double x = 0.0;
    double y = 0.0;
    double xx = 0.0, yy = 0.0;
    while (xx + yy <= 4.0 && result < maxiter)
    {
        xx = x * x;
        yy = y * y;
        double xtmp = xx - yy + cx;
        y = 2.0 * x * y + cy;
        x = xtmp;
        result += 1;
    }

    return result;
}

[EntryPoint]
public static unsafe void Render(uint* output, 
                                 double fx, double fy, double sx, double sy,
                                 int height, int width, int maxiter)
{
    Parallel.For(0, width * height, tid =>
    {
        int i = tid / width;
        int j = tid - i * width;
        double hx = sx / (double)width;
        double hy = sy / (double)height;
        double cx = fx + hx * j;
        double cy = fy + hy * i;
        output[tid] = GetColor(IterCount(cx, cy, maxiter), maxiter);
    });
}

[Kernel]
public static uint GetColor(int iterCount, int maxiter)
{
    if(iterCount == maxiter)
    {
        return 0;
    }

    return ((uint) (iterCount * (255.0 / (double)(maxiter - 1)))) << 8;
}

This code is embedded in a simple Windows Forms application with user controls (zoom in/out, increase the iteration count). The user chooses which flavor is executed with the radio buttons on the left.

There are four options:

C# : plain C# code
CUDA : CUDA code generated by Hybridizer, running on the most recent GPU in your machine
AVX : native C++ generated by Hybridizer, specialized for AVX instructions
AVX2 : native C++ generated by Hybridizer, specialized for AVX2 instructions (fused multiply-add instructions are reconstructed)

For each frame, a high-resolution clock measures the computation time, which the application then displays.
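The demo's own timing code is not shown in this post. As a minimal sketch, per-frame timing in plain C# could look like the following, assuming `System.Diagnostics.Stopwatch` as the high-resolution clock. `FrameTimer` and `TimeMs` are hypothetical names introduced for illustration only.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not part of the demo): times one frame of rendering work.
public static class FrameTimer
{
    // Runs renderFrame once and returns the elapsed wall-clock time in milliseconds.
    public static double TimeMs(Action renderFrame)
    {
        Stopwatch watch = Stopwatch.StartNew();
        renderFrame();
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds;
    }
}
```

A caller would wrap whichever flavor is selected, for example `double ms = FrameTimer.TimeMs(() => RenderFrame());`, where `RenderFrame` stands for the selected rendering call. `Stopwatch.IsHighResolution` tells whether the underlying timer is backed by a high-resolution performance counter.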

Performance

For more accurate measurements of Hybridizer’s performance, please see our blog posts about Mandelbrot and about Hybridizer versus Numerics.Vector.
The code in this example is not heavily optimized. Still, it shows dramatic speed-ups without touching a line of the C# compute code.

Measured in double precision with 10K iterations
GeForce 1080 Ti — Core i7 4770S
Flavor	Rendering time (ms)	Speed-up
C#	2871	1
AVX	945	3.03
CUDA	197	14.6

Looking at the generated assembly, we can see that most of the code is vectorized for AVX2, including fused multiply-add instructions (vfmadd231pd, vfmadd132pd):

C5 9D 59 ED        vmulpd      ymm5,ymm12,ymm5
C4 E2 C5 B8 AC 24  vfmadd231pd ymm5,ymm7,ymmword ptr [rsp+520h]
C5 D5 58 AC 24 20  vaddpd      ymm5,ymm5,ymmword ptr [rsp+0B20h]
C5 7D 28 AC 24 20  vmovapd     ymm13,ymmword ptr [rsp+0D20h]
C4 41 15 58 ED     vaddpd      ymm13,ymm13,ymm13
C4 62 D5 98 AC 24  vfmadd132pd ymm13,ymm5,ymmword ptr [rsp+720h]
C5 E5 5E 84 24 80  vdivpd      ymm0,ymm3,ymmword ptr [rsp+180h]
C5 FD 29 84 24 C0  vmovapd     ymmword ptr [rsp+0C0h],ymm0
C4 C1 15 5C EC     vsubpd      ymm5,ymm13,ymm12
C4 41 65 5E E8     vdivpd      ymm13,ymm3,ymm8
C5 DD 5C D2        vsubpd      ymm2,ymm4,ymm2
C4 41 65 5E F1     vdivpd      ymm14,ymm3,ymm9
C5 CD 5C E1        vsubpd      ymm4,ymm6,ymm1
C5 7D 28 84 24 A0  vmovapd     ymm8,ymmword ptr [rsp+0AA0h]
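To illustrate what “reconstructing” a fused multiply-add means at the source level, here is a scalar sketch of one Mandelbrot update written both ways. It is not a line-by-line reading of the listing above. It assumes .NET Core 3.0 or later for `Math.FusedMultiplyAdd`, and `FmaSketch`, `StepPlain` and `StepFused` are hypothetical names.

```csharp
using System;

// Hypothetical illustration: one update z -> z^2 + c on (x, y).
public static class FmaSketch
{
    // Separate multiplies and adds, as in the C# IterCount loop.
    public static (double x, double y) StepPlain(double x, double y, double cx, double cy)
    {
        double xx = x * x;
        double yy = y * y;
        return (xx - yy + cx, 2.0 * x * y + cy);
    }

    // Same update with each multiply-add pair fused (a single rounding per pair).
    public static (double x, double y) StepFused(double x, double y, double cx, double cy)
    {
        double yy = y * y;
        double newX = Math.FusedMultiplyAdd(x, x, cx - yy);   // x*x + (cx - yy)
        double newY = Math.FusedMultiplyAdd(2.0 * x, y, cy);  // (2x)*y + cy
        return (newX, newY);
    }
}
```

Because a fused operation rounds once instead of twice, the two versions can differ in the last bits for general inputs. They agree exactly when all intermediate values are representable.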

Run it

If you have an AVX-compatible CPU or a CUDA-enabled GPU, you can download and run these pre-built binaries:
FractalRenderer

Fixed Cash Flow

Posted on November 14, 2017 by regis portalez

In financial applications, many premium and greek evaluations, as well as risk analyses, rely on discounting. This benchmark is our “hello world” of discounting: we discount a set of cash flows over a large number of simulations. Yield curves are stored as rates and linearly interpolated. Pseudo code is given below.

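float price = 0.0f;
for (int k = 0; k < _cashFlowsCount; ++k)
{
    // Skip cash flows whose payment date is not after the simulation date.
    if (simulDate >= _paymentDate[k])
        continue;

    // Count the discounts actually performed (used for the throughput metric).
    cashFlowCount++;
    price += _values[k] * _yc.GetDiscountFactor(simId, timePoint, _paymentDate[k] - simulDate);
}

return price;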

There are 2000 cash flows and 31 tenors, that is, 31 rate values per yield curve. We compute the sum of discounted cash flows for a number of simulations and a number of future dates (500 in our case).
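The body of `GetDiscountFactor` is not shown in this post. As a rough sketch, a discount factor on a linearly interpolated rate curve could be computed as below. Everything here is an assumption made for illustration: the class name `YieldCurveSketch`, the single-curve signature (the benchmark also takes a simulation id and a time point), flat extrapolation outside the tenor range, and continuous compounding.

```csharp
using System;

// Hypothetical yield curve: one rate per tenor, linearly interpolated in between.
public sealed class YieldCurveSketch
{
    private readonly double[] _tenors; // times to maturity, ascending (assumed in years)
    private readonly double[] _rates;  // one rate value per tenor

    public YieldCurveSketch(double[] tenors, double[] rates)
    {
        if (tenors.Length == 0 || tenors.Length != rates.Length)
            throw new ArgumentException("tenors and rates must be non-empty and of equal length");
        _tenors = tenors;
        _rates = rates;
    }

    // Linear interpolation of the rate; flat outside the tenor range (assumption).
    public double GetRate(double t)
    {
        int last = _tenors.Length - 1;
        if (t <= _tenors[0]) return _rates[0];
        if (t >= _tenors[last]) return _rates[last];

        int hi = Array.BinarySearch(_tenors, t);
        if (hi >= 0) return _rates[hi];  // t falls exactly on a tenor
        hi = ~hi;                        // index of the first tenor greater than t
        int lo = hi - 1;
        double w = (t - _tenors[lo]) / (_tenors[hi] - _tenors[lo]);
        return _rates[lo] + w * (_rates[hi] - _rates[lo]);
    }

    // Continuous compounding is an assumption; the post does not state the convention.
    public double GetDiscountFactor(double t)
    {
        return Math.Exp(-GetRate(t) * t);
    }
}
```

With 31 tenors per curve, the binary search costs about five comparisons per discount. A linear scan or a precomputed index would be equally valid choices.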

We count the discounts actually performed to obtain a discounts-per-second metric.

System	Flavor	Millions Discounts per second (SP)	Millions Discounts per second (DP)	Comment
INTEL – i7-4770S – 3.1 GHz	C#	325	359
INTEL – i7-4770S – 3.1 GHz	AVX	1165	1111
NVIDIA – K20C	CUDA	23628	9426	usage of float4 in C#
NVIDIA – K40	CUDA	23757	11646	usage of float4 in C#

Stream benchmark

Posted on November 14, 2017 by regis portalez

The stream benchmark measures the memory bandwidth of the architecture. We simply add two vectors, in what is our “Hello World” program. A simplified version of the C# code can be found in our SDK on GitHub.

As for the Expm1 benchmark, we run the generated code against the best handwritten native code we could write. This lets us compare the performance of hybridized code with that of handwritten native code using intrinsics.

[EntryPoint("run")]
public static void Run(int N, double[] a, double[] b)
{
    int start = threadIdx.x + blockDim.x * blockIdx.x;
    int step = blockDim.x * gridDim.x;
    for(int k = start; k < N; k += step) 
    {
        a[k] += b[k];
    }
}
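The loop above is a grid-stride loop: each thread starts at its global index and advances by the total number of threads. The following plain C# sketch emulates that indexing sequentially on the CPU to show that every element is visited exactly once. `GridStrideSketch` and its explicit `gridDim` and `blockDim` parameters are hypothetical; they stand in for the built-in `threadIdx`, `blockIdx`, `blockDim` and `gridDim` values used in the kernel.

```csharp
using System;

// Sequential CPU emulation of the grid-stride loop, for illustration only.
public static class GridStrideSketch
{
    public static void Run(int n, double[] a, double[] b, int gridDim, int blockDim)
    {
        int step = blockDim * gridDim; // total number of emulated threads
        for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx)
        {
            for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx)
            {
                // Same start and step as in the kernel above.
                int start = threadIdx + blockDim * blockIdx;
                for (int k = start; k < n; k += step)
                {
                    a[k] += b[k];
                }
            }
        }
    }
}
```

Since the start indices cover 0 to `step - 1` without overlap and every thread advances by `step`, the result does not depend on the grid and block sizes chosen.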

No Hint

In a first version, we give no hint about alignment or index coalescence. This has little to no effect on CUDA machines, and hardware prefetching seems to do a very good job on Intel Xeon machines. On Xeon Phi, however, it leads to a gather instruction, which carries a small performance penalty. We’ll see later how to restore performance.

Architecture	Generated	Handwritten	Ratio
NVIDIA – P100	479.9	495.4	96.8%
NVIDIA – K20C	167	185	90.2%
INTEL – XEON PHI – 7210	327.7	381.7	85.8%
INTEL – Xeon E5 1620 v3 – 3.5 GHz	34.4	35	98.3%

Coalesced – Aligned

In a second version, we specify that the backing storage is an aligned memory location, and that the index stays coalesced and aligned:

[EntryPoint("StreamDouble")]
public static void StreamDouble(alignedstorage_double a, alignedstorage_double b, int offset, int count)
{
    alignedindex start = offset + threadIdx.x + blockIdx.x * blockDim.x ;
    alignedindex end = count + offset;
    for (alignedindex i = start; i < end; i += blockDim.x * gridDim.x)
    {
        a[i] = a[i] + b[i];
    }
}

Architecture	Generated	Handwritten	Ratio
INTEL – XEON PHI – 7210	373.8	381.7	97.9%