Wednesday, 4 February 2015

Data Standards for Computational Modelling of Astrophysical Plasma: The Hierarchical Data Format

In this post we present Matlab-based routines for translating between the data input formats used in our MHD codes. Using Matlab gave us an easy route to understanding the HDF5 standard, and it provides an introduction to the structure of the APIs used for manipulating HDF5 format data. Standards and formats for data input and output are important for many reasons. They:

  1. Enable us to get data into and out of a simulation.
  2. Allow data to be read into and written out of data analysis and visualisation routines.
  3. Determine the size of the data stored on the file system.
  4. Determine the speed with which data may be read into and out of an application.
  5. Determine how data is divided between processors when running an application over a distributed array of processors.

The ideal scenario is a file format specification that enables interoperability between the simulation and analysis routines without the need for data transformation routines. One of the standards we consider here is known as HDF5. Many astrophysical simulation codes already support HDF5; here is a list.

  • The stellar atmosphere simulation code Bifrost: Code description and validation
  • Flash - MHD code
  • Athena MHD code (Athena might include modules to output HDF5 format)
  • Pencil MHD code (a module has been implemented enabling HDF5 support)
  • Tristan-MP particle-in-cell code

Many MHD and plasma physics codes are available (List of MHD/HD codes). Not all of them support the same standards, so it is necessary to provide routines that transform data between the different format standards in use. Many different data standards are in use.

By using the HDF5 standard it is possible to improve a code's interoperability (and thereby enhance the software sustainability of that particular code). HDF is a file format standard for storing large quantities of numerical data. Since the HDF format is self-describing, it is possible to interpret the structure and contents of a file without outside information. The HDF4 standard was limited by its use of 32 bit integers, which restricted the possible file sizes, and by the lack of an object model. The HDF5 standard improved on this by providing two types of object:
  • Datasets (multi-dimensional arrays of a homogeneous type)
  • Groups (container structures which can hold datasets and other groups)
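
The two object types can be sketched with Matlab's high level HDF5 functions. This is only an illustration: the file name, the group path /data/grid_demo and the dataset name density are hypothetical and are not taken from our routines or from any format specification.

```matlab
% Sketch: one dataset stored inside nested groups (all names are hypothetical).
fname = fullfile(tempdir, 'hdf5_objects_demo.h5');
if exist(fname, 'file')
    delete(fname);            % h5create fails if the dataset already exists
end

rho = rand(8, 8, 8);          % a homogeneous 3D array of doubles

% h5create makes the dataset and any missing intermediate groups.
h5create(fname, '/data/grid_demo/density', size(rho));
h5write(fname, '/data/grid_demo/density', rho);

% Read the array back and list what the containing group holds.
rho_back = h5read(fname, '/data/grid_demo/density');
info = h5info(fname, '/data/grid_demo');
disp(info.Datasets(1).Name);
```

Here /data and /data/grid_demo are groups, and density is a dataset held by the innermost group.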
Metadata is stored in the form of user-defined, named attributes attached to the groups and datasets. There are many tools and software packages for manipulating, managing and interfacing with HDF5 data. For the SAC/SMAUG codes we are adopting the GDF file standard, which is based on HDF5. Full details of the format are provided by the yt-project (see reference 3). Here we illustrate the GDF standard through output from the HDFView application when used to view a sample GDF format input file for SAC/SMAUG.
The underlying datasets and groups used for the GDF standard

The data group comprises a single grid with 13 datasets, each holding one of the fields used in our MHD simulation.
The field types group holds attribute information for each of the field types; these attributes give the units and the name of the actual field.

The simulation parameters group is an HDF group of size zero, but it has a set of attributes that set the parameters for the simulation, including the iteration number and the current time.
The gridded data format group is an HDF group of size zero, but it has attributes that record information about the software used to generate the GDF file.
We have written the following functions using Matlab and its HDF5 function library. The library is split into a high level and a low level part. The high level functions read and write HDF5 files, create datasets and read and write attributes. The low level functions allow direct interaction with the HDF5 API itself. It was necessary to use the low level functions in order to specify the structure of the groups, the datasets and the attributes.
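
As a minimal sketch of why the low level functions are needed: the high level functions listed above create datasets and attributes, so a group that holds only attributes has to be created through the low level interface. The example below is not the code from our classes. The file name is hypothetical, and the group and attribute names are illustrative only and should be checked against the GDF specification before use.

```matlab
% Sketch: an empty group that carries only attributes (names are illustrative).
fname = fullfile(tempdir, 'hdf5_lowlevel_demo.h5');

% H5F_ACC_TRUNC overwrites any existing file of the same name.
fid = H5F.create(fname, 'H5F_ACC_TRUNC', 'H5P_DEFAULT', 'H5P_DEFAULT');
gid = H5G.create(fid, 'simulation_parameters', ...
                 'H5P_DEFAULT', 'H5P_DEFAULT', 'H5P_DEFAULT');

% Attach a scalar double attribute to the group with the low level API.
type_id  = H5T.copy('H5T_NATIVE_DOUBLE');
space_id = H5S.create('H5S_SCALAR');
attr_id  = H5A.create(gid, 'current_time', type_id, space_id, 'H5P_DEFAULT');
H5A.write(attr_id, 'H5ML_DEFAULT', 0.5);

% Release every identifier that was opened.
H5A.close(attr_id);
H5S.close(space_id);
H5T.close(type_id);
H5G.close(gid);
H5F.close(fid);

% The high level functions can then add and read further attributes.
h5writeatt(fname, '/simulation_parameters', 'current_iteration', int32(10));
t    = h5readatt(fname, '/simulation_parameters', 'current_time');
it   = h5readatt(fname, '/simulation_parameters', 'current_iteration');
info = h5info(fname, '/simulation_parameters');
fprintf('time = %g, iteration = %d, datasets in group = %d\n', ...
        t, it, numel(info.Datasets));
```

The group contains no datasets, which corresponds to the size zero groups shown by HDFView above.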

We have provided a number of functions and script files, which are documented briefly below.

function writegdf3D(filename, simparams, simgridinfo, simdata)

This function takes as input the name of a file and the three main GDF objects: simparams, simgridinfo and simdata. Each of these corresponds to one of the main groups used in the GDF standard, as described above. The function creates the group structures in the GDF file and writes the data held in the objects into them.

function writesac3D(filename, simparams, simgridinfo, simdata, mode)

This function takes as input the name of a file and the three main GDF objects: simparams, simgridinfo and simdata. Each of these corresponds to one of the main groups used in the GDF standard, as described above. The function writes the data contained in simparams, simgridinfo and simdata into a SAC format data file. Data is written in binary format if mode is specified as 'binary' and in ASCII format if mode is specified as 'ascii'.
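
The difference between the two output modes can be sketched as follows. This is a generic illustration of binary versus ASCII output in Matlab and does not reproduce the SAC file layout or the internals of writesac3D; the file names and the data array are hypothetical.

```matlab
% Sketch only: NOT the SAC file layout, just the binary/ASCII distinction.
data = linspace(0, 1, 1000);               % hypothetical field values

% Binary output: each double is stored exactly in 8 bytes.
binname = fullfile(tempdir, 'mode_demo.bin');
fid = fopen(binname, 'w');
fwrite(fid, data, 'double');
fclose(fid);

% ASCII output: human readable, one value per line, larger on disk.
ascname = fullfile(tempdir, 'mode_demo.txt');
fid = fopen(ascname, 'w');
fprintf(fid, '%.16e\n', data);
fclose(fid);

% Compare the resulting file sizes.
binfo = dir(binname);
ainfo = dir(ascname);
fprintf('binary: %d bytes, ascii: %d bytes\n', binfo.bytes, ainfo.bytes);
```

Binary output is compact and fast to read, whereas ASCII output is easier to inspect by eye and to move between machines.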

The script files hdf5tosac3D.m and sac3Dtohdf5.m illustrate how we use the above functions.

hdf5tosac3D.m

As suggested by the name, this script reads a GDF format file and stores the data and parameters in the simparams, simgridinfo and simdata objects. These are then written to the output SAC format file using the writesac3D function. The user needs to set gdffilename and sacfilename at the start of the script.

sac3Dtohdf5.m

As suggested by the name, this script reads a SAC format file and stores the data and parameters in the simparams, simgridinfo and simdata objects. These are then written to the output GDF file using the writegdf3D function. The user needs to set gdffilename and sacfilename at the start of the script.

sac3Dgetpic.m

This script reads a single SAC format binary file into a set of structures.

Matlab Classes

The HDF5 functionality is implemented in three Matlab classes, which do the work of creating and writing the objects simparams, simgridinfo and simdata. They have read and write functions for the HDF structures. One of the most complex is the simgridinfo class, which has methods for creating the H5 structures. These routines make use of the low level HDF5 library. Although the Matlab documentation is excellent, some effort was required to get this correct.

These routines are available on GitHub; see reference 10 below.

Links


