Thursday, 7 May 2015

SMAUG Updates and GPU Training

In this post I describe codes used at Sheffield for magnetohydrodynamics. In particular, I mention GPU codes and describe some of the training and resources that are available for GPU computing.

Parallel magnetohydrodynamic (MHD) algorithms are important for numerical modelling of highly inhomogeneous solar and astrophysical plasmas. SMAUG is the Sheffield Magnetohydrodynamics Algorithm Using GPUs. SMAUG is a 1-3D MHD code capable of modelling magnetised and gravitationally stratified plasma.

SMAUG is based on the Sheffield Advanced Code (SAC), a novel, fully non-linear MHD code designed for simulations of linear and non-linear wave propagation in gravitationally strongly stratified magnetised plasma. See the reference at the Astronomy Abstracts Service. A full description of SMAUG is available in the paper entitled "A Fast MHD Code for Gravitationally Stratified Media using Graphical Processing Units: SMAUG".
The development version of SMAUG is available at CCP forge. The development code can be obtained using SVN, as follows:

svn checkout

SMAUG is published on Google Code, and the release version is available using the following SVN command (see "checkout the SMAUG release version"):

# Non-members may check out a read-only working copy anonymously over HTTP.
svn checkout smaug-read-only

The key priority for the development version of SMAUG is the correct implementation of a scalable multi-GPU version. Significant effort has been made to develop this using the Wilkes and Emerald clusters and the Iceberg cluster at The University of Sheffield.

In summary, the main tasks are as follows
One of the best up-to-date guides is the Programming Guide at the NVIDIA Developer Zone, and there is a good collection of CUDA links on Delicious. There is support for GPU computing through the NVIDIA GPU research centre at The University of Sheffield. In addition to the provision of GPUs on the iceberg HPC cluster, the GPU research centre provides training in GPU computing.

The recent introduction to CUDA course organised at Sheffield is based on a course provided by the EPCC. As well as introducing basic CUDA programming, the hands-on exercise focuses on the techniques that can be used to drive the performance of a highly threaded code such as one running on a GPU. As a GPU is a co-processor used for code acceleration, it is necessary to consider the movement of data between the GPU and the host processor. For example, we look at the need to consider the following performance inhibitors.
  1. Data transfer to and from the GPU.
  2. Device under-utilisation - a GPU has many cores, so we need to ensure that our computation provides many threads of execution, fully utilising all of the cores.
  3. GPU memory bandwidth - as with any processing unit, data must be moved into the memory (e.g. registers and caches) used for floating point operations. One of the reasons for the excellent performance of a GPU is its data bandwidth and its ability to move data to the processing cores. On a GPU it is best to move data in a few large chunks rather than in many small chunks. The chunks should also be contiguous portions of memory rather than scattered fragments. In GPU speak this is referred to as memory coalescing.
  4. Code branching - the cost of branching operations such as if statements depends on the threads that execute together within a processing core on a GPU. It is generally best to group these calls together.
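The effect of grouping branches, as in point 4, can be illustrated with a toy model. This is only a sketch in plain Python and not part of the course material: it assumes threads execute in groups (warps) of 32, as on NVIDIA hardware, and that a group whose threads disagree on an if statement must run both paths one after the other. The function name `branch_passes` is made up for illustration.

```python
WARP_SIZE = 32  # assumed number of threads that execute in lockstep

def branch_passes(takes_branch):
    """Count serialised passes over an if/else, given one boolean per thread.

    Toy model: a warp whose threads all agree runs one path (1 pass);
    a warp with mixed outcomes runs both paths in turn (2 passes).
    """
    passes = 0
    for start in range(0, len(takes_branch), WARP_SIZE):
        warp = takes_branch[start:start + WARP_SIZE]
        passes += len(set(warp))  # 1 if uniform, 2 if divergent
    return passes

# 64 threads with alternating outcomes: every warp diverges.
interleaved = [i % 2 == 0 for i in range(64)]
# The same outcomes, grouped so that each warp follows a single path.
grouped = sorted(interleaved)

print(branch_passes(interleaved))  # 4
print(branch_passes(grouped))      # 2
```

In this model the same amount of work costs half as many passes once threads that take the same branch are placed next to each other.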
The course exercises provide excellent practice in understanding these points. The problem is based on an edge detection algorithm for image processing: a Gauss-Seidel iterative algorithm is used to reconstruct an image from edge data.

The exercise first identifies the data transfer bottleneck. Rather than copying data between device and host at every iteration, a memory address swapping technique is used, so data copying at every step is avoided.

The next example considers occupancy by comparing performance when each thread processes a single row of the image matrix, then a single column, and finally an individual element of the matrix, thereby improving the occupancy and arithmetic intensity. Arithmetic intensity is the ratio of computation to memory accesses.

The third example considers memory coalescing and compares the performance of the algorithm when threads are used to compute each row and then each column of the matrix. The row case is clearly faster because the matrix elements used by the thread are in adjacent memory locations, i.e. coalesced. The column case is slower!

GPUs group threads into collections called thread blocks. The GPU programmer can control the number of thread blocks and the number of threads which execute per thread block. The final example investigates how performance is influenced by the thread block size.
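The relationship between problem size, threads per block and number of thread blocks can be sketched as follows. This is a plain Python emulation rather than CUDA code, and the function names are hypothetical. It shows the usual ceiling division for the number of blocks and the bounds check needed because the last block may contain surplus threads.

```python
def blocks_needed(n_elements, threads_per_block):
    """Smallest number of blocks whose threads cover n_elements (ceiling division)."""
    return (n_elements + threads_per_block - 1) // threads_per_block

def covered_indices(n_elements, threads_per_block):
    """Emulate the global index each thread would compute, with a bounds guard."""
    indices = []
    for block in range(blocks_needed(n_elements, threads_per_block)):
        for thread in range(threads_per_block):
            # In CUDA this is blockIdx.x * blockDim.x + threadIdx.x
            idx = block * threads_per_block + thread
            if idx < n_elements:  # surplus threads in the last block do nothing
                indices.append(idx)
    return indices

n = 1000
for tpb in (64, 128, 256):
    print(tpb, blocks_needed(n, tpb))  # 64 16, then 128 8, then 256 4
```

Each choice of block size covers every element exactly once; which one runs fastest depends on the hardware and the kernel, which is what the final course example measures.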

Once armed with an understanding of these ideas, a programmer can begin to develop GPU codes. The course problem is a great help in understanding these concepts.
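As one concrete instance of these ideas, the address swapping technique used in the exercise can be sketched in plain Python. This is an assumption-laden illustration, not the course code: it uses a simple one-dimensional Jacobi-style averaging update with two buffers (rather than the Gauss-Seidel image reconstruction of the exercise), and the function name `relax` is made up. The point is that after each sweep the roles of the two buffers are exchanged by swapping references, so no data is copied between iterations. On a GPU the analogous step is to keep both buffers in device memory and swap the device pointers between kernel launches, copying back to the host only once at the end.

```python
def relax(values, iterations):
    """Repeatedly replace interior points by the average of their neighbours."""
    old = list(values)
    new = list(values)  # both buffers start with the fixed boundary values
    for _ in range(iterations):
        for i in range(1, len(old) - 1):
            new[i] = 0.5 * (old[i - 1] + old[i + 1])
        old, new = new, old  # swap references; no element is copied
    return old  # after the swap, `old` holds the most recent result

print(relax([0.0, 0.0, 0.0, 0.0, 4.0], iterations=1))  # [0.0, 0.0, 0.0, 2.0, 4.0]
result = relax([0.0, 0.0, 0.0, 0.0, 4.0], iterations=200)
```

After enough sweeps `result` approaches the straight line between the two fixed end values.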

Wednesday, 6 May 2015

Remote Visualisation: A Guide and Example

Exploring and visualising large data sets is an important capability for many areas of research, including computational solar physics. Since many of the data sets are stored on remote servers rather than local workstations, it is necessary to use a remote visualisation technique that generates visual output from the data set and delivers the rendered imagery to the user's desktop. Remote visualisation is undertaken using thin clients accessing remote high quality visualisation hardware. It removes the need to transfer data and allows researchers to visualise data sets on remote visualisation servers attached to the high performance computer and its storage facility.

Just as OpenGL is the standard used for 3D graphical visualisation, VirtualGL is the standard used for remote visualisation. VirtualGL is an open source package which gives any UNIX or Linux remote display software the ability to run 3D applications with full hardware acceleration. VirtualGL can also be used in conjunction with remote display software such as VNC to provide 3D hardware accelerated rendering for OpenGL applications. VirtualGL is very useful in providing remote display to thin clients which lack 3D hardware acceleration.
Protocols used to communicate graphical information over a network.

A VirtualGL Client Runs on the User Workstation Which is Served with Graphics from a High Quality High Performance Rendering Device Such as an NVIDIA Graphics Card e.g. the Fermi M2070Q 
To use the NVIDIA Fermi M2070Q Graphical Processing Unit on the iceberg High Performance Computer at The University of Sheffield, a number of options are available for starting a VirtualGL client. With Iceberg we make use of TigerVNC and the Sheffield application portal, Sun Global Desktop (SGD). To use this capability it is necessary for users to join the GPU visualisation group (gpu-vis) by emailing

To initiate a remote visualisation session the following steps should be followed:
• Start a browser and go to
• Log in to Sun Global Desktop (as shown below)
• Under Iceberg Applications start the Remote Visualisation session
• This opens a shell providing a port number (XXXX) and instructions to either
  • Open a browser and enter the address
  • Start TigerVNC Viewer on your desktop
• Use the address
• XXXX is a port address provided on the iceberg terminal
• When requested use your usual iceberg user credentials

From a Microsoft Windows workstation, the SSH client PuTTY and TigerVNC can be used as follows.

• Log in to iceberg using PuTTY
• At the prompt type qsh-vis
• This opens a shell with instructions to either
  • Open a browser and enter the address
  • Start TigerVNC Viewer on your desktop
• Use the address
• XXXX is a port address provided on the iceberg terminal
• When requested use your usual iceberg user credentials
Putty Session Starting the VirtualGL Server (note the instructions provided)

Starting Tiger VNC
Typical VirtualGL Remote Visualisation Session
Instead of using the qsh-vis command, an alternative is to start the VNC job directly with an SGE command, e.g. if there is a different memory requirement:
qrsh -l gfx=1 -l mem=16G -P gpu-vis -q gpu-vis.q /usr/local/bin/startvncjob