Thursday, 7 May 2015

SMAUG Updates and GPU Training

In this post I describe codes used at Sheffield for magnetohydrodynamics; in particular, I discuss GPU codes and some of the training and resources that are available for GPU computing.

Parallel magnetohydrodynamic (MHD) algorithms are important for numerical modelling of highly inhomogeneous solar and astrophysical plasmas. SMAUG is the Sheffield Magnetohydrodynamics Algorithm Using GPUs: a 1-3D MHD code capable of modelling magnetised and gravitationally stratified plasma.

SMAUG is based on the Sheffield Advanced Code (SAC), a novel fully non-linear MHD code designed for simulations of linear and non-linear wave propagation in strongly gravitationally stratified magnetised plasma. See the reference at the Astronomy Abstracts Service. A full description of SMAUG is available in the paper entitled "A Fast MHD Code for Gravitationally Stratified Media using Graphical Processing Units: SMAUG".
The development version of SMAUG is available at CCPForge. The development code can be obtained using SVN, as follows:

svn checkout

SMAUG is also published on Google Code, and the release version is available using the following SVN command (see "checkout the SMAUG release version"):

# Non-members may check out a read-only working copy anonymously over HTTP.
svn checkout smaug-read-only

The key priority for the development version of SMAUG is the correct implementation of a scalable multi-GPU code. Significant effort has been made to develop this using the Wilkes and Emerald clusters and the Iceberg cluster at The University of Sheffield.

In summary, the main task is completing and validating this scalable multi-GPU implementation.

One of the best regularly updated guides is the Programming Guide in the NVIDIA developer zone, and there is a good collection of CUDA links on Delicious. There is support for GPU computing through the NVIDIA GPU Research Centre at The University of Sheffield. In addition to the provision of GPUs on the Iceberg HPC cluster, the GPU Research Centre provides training in GPU computing.

The recent introduction to CUDA course organised at Sheffield is based on a course provided by EPCC. As well as an introduction to basic CUDA programming, the hands-on exercises focus on techniques that can be used to improve the performance of a highly threaded code such as one running on a GPU. Because a GPU is a co-processor used for code acceleration, it is necessary to consider the movement of data between the GPU and the host processor. For example, we look at the need to consider the following performance inhibitors.
  1. Data transfer to and from the GPU
  2. Device under-utilisation - a GPU has many cores, so we need to ensure that our computation provides enough threads of execution to fully utilise all of the cores.
  3. GPU memory bandwidth - as with any processing unit, data must be moved into memory (e.g. registers and caches) used for floating-point operations. One of the reasons for the excellent performance of a GPU is its memory bandwidth and its ability to move data to the processing cores. On a GPU it is best to move data in a few large chunks rather than in many small chunks. The chunks of memory should also be contiguous portions of memory rather than scattered fragments. In GPU speak this is referred to as memory coalescing.
  4. Code branching - the cost of branching operations such as if statements depends on how the threads executing within a processing core on a GPU diverge. It is generally best to group threads so that those executing together take the same branch.
The course exercises provide excellent practice in understanding these points. The problem is based on an edge-detection algorithm for image processing: a Gauss-Seidel iterative algorithm is used to reconstruct an image from edge data.

The exercise identifies the data-transfer bottleneck: for example, rather than copying data between device and host at every iteration, a memory-address swapping technique is used and data copying at every step is avoided.

The example considers occupancy by comparing performance when each thread processes first a single row of the image matrix, then a single column, and finally an individual element of the matrix, thereby improving the occupancy and the arithmetic intensity (the ratio of computation to memory accesses).

The third example considers memory coalescing and compares the performance of the algorithm when threads are used to compute each row and then each column of the matrix. The row case is clearly faster because the matrix elements used by the thread are in adjacent memory locations, i.e. coalesced; the column case is slower.

GPUs group threads into collections called thread blocks; the programmer can control the number of thread blocks and the number of threads that execute per thread block. The final example investigates how performance is influenced by the thread-block size.

Once armed with an understanding of these ideas, a programmer can begin to develop GPU codes; the course problem is a great help in understanding these concepts.
