Poster GTC 2017 – Hybrid Vector Library—From Memory Bound to Compute Bound with NVVM

When small tasks are submitted to the GPU, grid scheduling and synchronization costs may be much higher than the cost of the computations themselves, even if those computations were run on a CPU. In this case, the benefit of GPU computing is lost. Leveraging runtime compilation, we illustrate an approach that generates source code to replace a list of library API calls with a single kernel call. The benefits are twofold: (1) scheduling costs are reduced to a minimum, as a result of merging several calls into a single one; (2) executing an aggregate kernel on a vector of values results in a compute-bound implementation.
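
The idea of merging a list of calls into one generated kernel can be sketched on the CPU in plain Python. This is only an analogy under stated assumptions: it does not use NVVM or any GPU API, and the operation names (`add`, `mul`, `sub`) and the functions `generate_fused_source`, `compile_fused` and `run_unfused` are hypothetical names made up for this illustration.

```python
# Hypothetical sketch: fuse a list of element-wise "library calls" into one
# function whose source is generated and compiled at runtime.

# Expression templates for the made-up operations used in this sketch.
OPS = {
    "add": "({acc} + {arg})",
    "mul": "({acc} * {arg})",
    "sub": "({acc} - {arg})",
}


def generate_fused_source(calls, name="fused_kernel"):
    """Return Python source for one function applying every call per element."""
    expr = "x"
    for op, arg in calls:
        expr = OPS[op].format(acc=expr, arg=repr(float(arg)))
    return (
        f"def {name}(values):\n"
        f"    # Single pass: each element is read once, all operations are\n"
        f"    # applied, then the result is written once.\n"
        f"    return [{expr} for x in values]\n"
    )


def compile_fused(calls, name="fused_kernel"):
    """Compile the generated source at runtime and return the callable."""
    namespace = {}
    exec(compile(generate_fused_source(calls, name), "<generated>", "exec"), namespace)
    return namespace[name]


def run_unfused(calls, values):
    """Reference path: one full pass over the data per call."""
    passes = {
        "add": lambda v, a: [x + a for x in v],
        "mul": lambda v, a: [x * a for x in v],
        "sub": lambda v, a: [x - a for x in v],
    }
    for op, arg in calls:
        values = passes[op](values, float(arg))
    return values


calls = [("mul", 2), ("add", 1), ("sub", 0.5)]
fused = compile_fused(calls)
print(generate_fused_source(calls))
print(fused([1.0, 2.0, 3.0]))
```

The unfused path makes one pass over the data per call, while the fused function makes a single pass and does more arithmetic per element loaded, which is the shift from memory-bound toward compute-bound that the abstract describes.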

GTC On-Demand link


Talk GTC Europe 2016 – How Pascal And Power 8 Will Accelerate Counterparty Risk Calculations at BNP Paribas

Since the financial crisis of 2008, regulators have been increasingly demanding in terms of risk analysis and stress scenario simulations. In this talk, we present an approach for counterparty risk calculations based on Directed Acyclic Graphs (DAGs). Calculations are arranged in a graph whose nodes are parts of the simulation. Nodes hold temporary data that may be reused by other calculations further along the graph. This technique offers great flexibility, benefits from improvements in hardware capability, and is resilient to new regulatory requirements and demands. We will illustrate the potential benefits of Pascal, based on the expected performance of NVLink, and show how these features help in the DAG compute environment.
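
A minimal sketch of the node-reuse idea, assuming a simple pull-based evaluation: each node computes its value once, keeps it as temporary data, and hands the stored result to every later calculation that depends on it. The `Node` class and the toy node functions are hypothetical and are not taken from the talk.

```python
class Node:
    """One part of a calculation in a directed acyclic graph."""

    def __init__(self, name, func, deps=()):
        self.name = name
        self.func = func          # computes this node from its inputs
        self.deps = list(deps)    # upstream nodes
        self._cached = None       # temporary data held by the node
        self._done = False
        self.evaluations = 0      # counts how often func actually ran

    def value(self):
        """Return the node's result, computing it only on first request."""
        if not self._done:
            inputs = [dep.value() for dep in self.deps]
            self._cached = self.func(*inputs)
            self._done = True
            self.evaluations += 1
        return self._cached


# Toy graph: two downstream calculations share one upstream node.
scenarios = Node("scenarios", lambda: [1, 2, 3])
calc_a = Node("calc_a", lambda s: [x * 10 for x in s], [scenarios])
calc_b = Node("calc_b", lambda s: [x + 5 for x in s], [scenarios])
total = Node("total", lambda a, b: [x + y for x, y in zip(a, b)], [calc_a, calc_b])

print(total.value())            # combines both branches
print(scenarios.evaluations)    # the shared node ran only once
```

A new requirement can then be added as one more node attached to existing ones, without recomputing what the graph already holds.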

GTC Europe 2016 — S6155 — Presentation

Link to conference agenda

Talk GTC 2016 – Java Image Processing: How Runtime Compilation Transforms Memory-Bound into Compute-Bound

A wide variety of image processing algorithms are naturally parallel. However, depending on the filter size or neighborhood search pattern, memory access is critical for performance. We'll show how loop reordering and fine-tuning of memory locality help achieve the best performance. Using Hybridizer to automate the transformation of Java bytecode into CUDA source code, together with the new CUDA runtime compilation feature, we transformed execution from memory-bound to compute-bound. Applying this technique to oil and gas image processing algorithms results in interactive response times on production-size datasets.
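
As a small illustration of loop reordering, the sketch below computes the same vertical box sum over a row-major image with two loop orders. It is a hypothetical example, not the algorithm from the talk, and it is not expected to show a speedup in plain Python; it only makes the access pattern visible. The function names and the clamp-at-border policy are assumptions.

```python
def vertical_sum_strided(img, width, height, radius):
    """Outer loop over columns: consecutive reads are `width` elements apart."""
    out = [0] * (width * height)
    for x in range(width):
        for y in range(height):
            total = 0
            for dy in range(-radius, radius + 1):
                yy = min(max(y + dy, 0), height - 1)  # clamp at the borders
                total += img[yy * width + x]
            out[y * width + x] = total
    return out


def vertical_sum_contiguous(img, width, height, radius):
    """Reordered loops: the innermost loop walks along x, so reads and
    writes touch neighbouring elements of the row-major array."""
    out = [0] * (width * height)
    for y in range(height):
        row_out = y * width
        for dy in range(-radius, radius + 1):
            yy = min(max(y + dy, 0), height - 1)  # clamp at the borders
            row_in = yy * width
            for x in range(width):
                out[row_out + x] += img[row_in + x]
    return out


image = list(range(12))  # 4 x 3 image stored row by row
a = vertical_sum_strided(image, 4, 3, 1)
b = vertical_sum_contiguous(image, 4, 3, 1)
print(a == b)
```

Both versions do the same arithmetic; only the order in which memory is visited changes, which is the property that matters for caches and for coalesced access on a GPU.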

see it at GTC On-Demand — ID S6314

Poster GTC 2016 – Using CLANG/LLVM Vectorization to Generate Mixed Precision Source Code

At Supercomputing 2015, NVIDIA announced the Jetson TX1. This platform is the first available to natively expose mixed-precision instructions. However, this instruction set requires that operations on 16-bit floating-point values be performed in pairs, using the half2 type, which packs two values into a single register.
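
The pairing constraint can be pictured with a short sketch of the storage layout only: two 16-bit floating-point values packed into one 32-bit word. It uses Python's `struct` module with the IEEE 754 half-precision format code `e`. The helper names are hypothetical, placing the first value in the low 16 bits is an assumption of this sketch, and no GPU instruction is involved.

```python
import struct


def pack_half2(first, second):
    """Pack two values as 16-bit floats into one 32-bit unsigned word.

    Assumption of this sketch: `first` occupies the low 16 bits.
    """
    return struct.unpack("<I", struct.pack("<2e", first, second))[0]


def unpack_half2(word):
    """Recover the two 16-bit floats stored in a 32-bit word."""
    return struct.unpack("<2e", struct.pack("<I", word))


def to_half2_words(values):
    """Pair up a sequence of values, padding odd-length input with 0.0."""
    padded = list(values)
    if len(padded) % 2:
        padded.append(0.0)
    return [pack_half2(padded[i], padded[i + 1]) for i in range(0, len(padded), 2)]


word = pack_half2(1.0, 2.0)
print(hex(word))
print(unpack_half2(word))
print(len(to_half2_words([1.0, 2.0, 3.0])))
```

Because each operation acts on a pair, scalar code has to be rearranged so that neighbouring values are processed together, which is why a vectorizer is a natural tool for generating such code.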

see it at GTC On-Demand — ID P6352
