Parallel I/O
As the data sets we process with ParaView get bigger, we are finding that for many of our users a significant portion of the wait time in ParaView is I/O. We had originally assumed that moving our clusters to parallel disk drives with greater overall read performance would largely fix the problem. Unfortunately, in practice we have found that our read rates from the storage system are often far below its potential.
This document analyzes the parallel I/O operations performed by VTK and ParaView (specifically reads, since they are by far the most common), hypothesizes about how these might impede I/O performance, and proposes a mechanism to improve it.
File Access Patterns
VTK and ParaView, by design, support many different readers with many different formats. Rather than iterate over every reader that exists or could exist, we categorize their file access patterns here. We also try to identify which readers exhibit which access patterns. Readers that do not read in parallel data will not be considered, for obvious reasons. Also out of consideration are readers that are thin wrappers over other I/O libraries (e.g. HDF5); we have little control over these readers other than to report problems and hope they get fixed.
- We may have control over how these formats are used, however. For example, when using HDF5, we have control over whether we use groups and datasets to break the data into pieces or whether we use the hyperslab mechanism provided by HDF5. I have heard, for example, that using a large number of datasets might degrade performance. This is why the Chombo folks use one dataset to store all of their AMR block data. Berk 10:08, 8 Feb 2007 (EST)
Common File
There are some cases where all the processes in a parallel job will each read in the entire contents of a single file. The parallel data readers are usually smart enough to read in only the portion of the data that they need; however, there is usually a collection of "metadata" that all processes require in order to read the actual data. This metadata includes things like domain extents, number formats, and what data is attached. Oftentimes this metadata is packaged up into its own file (along with pointers to the #Individual Files that hold the actual data).
In this case, every process independently reads the entire metadata file because the readers are designed to perform reads with no communication amongst the processes. This is so that the readers will also work in a sequential read mode, where data is broken up over time rather than across processors and there are no other processes with which to collaborate.
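To make the pattern concrete, the following is a minimal sketch of this access pattern, assuming a hypothetical metadata file named dataset.pvd. Every process opens and reads the whole file on its own, with no coordination between processes.

 // Minimal sketch of the "common file" pattern: every rank independently
 // opens and reads the same (hypothetical) metadata file in its entirety.
 #include <mpi.h>
 #include <cstdio>
 #include <fstream>
 #include <sstream>
 
 int main(int argc, char *argv[])
 {
   MPI_Init(&argc, &argv);
   int rank;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
   // Each process issues its own read of the entire metadata file.
   std::ifstream metaFile("dataset.pvd");
   std::stringstream metadata;
   metadata << metaFile.rdbuf();
   std::printf("rank %d read %zu bytes of metadata\n",
               rank, metadata.str().size());
 
   // ... each rank now parses the metadata on its own and opens only the
   // piece files it is responsible for ...
 
   MPI_Finalize();
   return 0;
 }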
Why it is inefficient:
The storage system is bombarded with a bunch of similar read requests at about the same time. All the requests are likely to be near each other (with respect to offsets in the file) and hence on the same disk. Unless the read requests are synchronized well and the parallel storage device has really good caching that works across multiple clients (neither of which is very likely), the storage system will be forced to constantly move the disk head to satisfy all the incoming requests without starving any of them. Moving the head causes a delay measured in milliseconds, which slows the reading to a crawl.
What readers do this:
All the readers that have a metadata file use this access pattern. This includes the PVDReader as well as all of the XMLP*Readers and the pvtkfile (partitioned legacy VTK files).
- The spy plot reader used to have a metadata file that listed all of the actual data files. Does it still have this? Also, is there an EnSight equivalent to this (maybe the SOS file)? --Ken 13:59, 29 Jan 2007 (EST)
- The spyplot reader no longer requires a metadata file. The EnSight SOS file is exactly what you describe and is needed for distributed EnSight files. In all of these cases, the metafile is very small and is accessed once (or maybe twice) to be read as a whole. How slow is reading such a small file going to be? Berk 10:23, 8 Feb 2007 (EST)
Monolithic File
Data formats that handle small to medium sized data are almost always contained in a single file for convenience. If these data formats can be scaled up, they often are. For example, raw image data can be placed in an image file of any size. Although generating such a large single file often requires some dedicated processing, it is sometimes done for convenience and ease of use.
For this type of file to be read into ParaView in a parallel environment, it is important that the reader be able to read in a portion of the data. For example, the raw data image reader accepts extents for the data, and then reads in only the parts of the file associated with those extents.
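As a rough illustration (the image dimensions, extent, and file name below are hypothetical), reading a 2D extent out of a raw, row-major image file turns into one short seek-and-read per scanline rather than one large contiguous read:

 // Sketch of extent-based reading from a single raw image file.  A logically
 // contiguous 2D extent becomes many short, scattered reads in the linear file.
 #include <cstdio>
 #include <vector>
 
 int main()
 {
   const long dimX = 4096;                  // whole image width (1 byte/voxel, row-major)
   const long startX = 0, countX = 2048;    // this process's extent in X
   const long startY = 0, countY = 2048;    // this process's extent in Y
 
   std::vector<unsigned char> piece(countX * countY);
   std::FILE *file = std::fopen("image.raw", "rb");
   if (!file) return 1;
 
   // One seek and one 2048-byte read per scanline of the extent: 2048
   // separate requests instead of a single large one.
   for (long j = 0; j < countY; ++j)
   {
     long offset = (startY + j) * dimX + startX;
     std::fseek(file, offset, SEEK_SET);
     std::fread(&piece[j * countX], 1, countX, file);
   }
   std::fclose(file);
   return 0;
 }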
Why it is inefficient:
As in the case of the #Common File, we have a lot of read accesses happening more-or-less simultaneously on a single file because each process is trying to read its portion of the data. However, in this case the problems are compounded because each process is reading a different portion of the data. Thus, whatever caching strategy the I/O system implements has little to no chance of satisfying multiple requests with a single read.
To make matters worse, the reads are often laid out in a way that minimizes I/O performance. The reason is that even though the extents are logically cohesive, the pieces are actually scattered over various portions of the disk. For example, in the image below, a 2D image is divided amongst four processors. Although each piece is contiguous in two dimensions, each partition is scattered in small pieces throughout the linear file on disk.
So, in practice, the storage system is bombarded by many requests for short segments of data. With no a priori knowledge, the storage system is required to fulfill the requests in a more-or-less ordered manner so that no process gets starved. To satisfy these requests, the disk is forced to move its read head back and forth, thereby slowing the read down to a crawl.
What readers do this:
Any reader that can read sub-pieces from a single file falls into this category. The raw image data format is the clearest example, as it contains no metadata and is completely divided amongst the loading processes.
There are also many data formats that contain some metadata, which is loaded in the manner described in the #Common File section, but where the vast majority of the file is actual data that is divided amongst the processors. These readers include all of the XML*Readers (the serial, not parallel, versions), the legacy VTK format reader, and the STL reader.
- I believe that the XDMF files are most often in this category. Although there is no requirement for all blocks/pieces to be in the same file, most of the time one HDF5 file is used. I am guessing they assume that the HDF5 parallel I/O functionality will handle scalability. Do these parallel file systems distribute large files over several disks, similar to RAID 0? If that is the case, there may be some performance gain over a serial file system. Berk 10:27, 8 Feb 2007 (EST)
Individual Files
Oftentimes large amounts of data are split amongst many files. Sometimes this is because the original data format does not scale, but most of the time it is simply convenient. Large data sets are almost always generated by parallel programs, and it is easy and efficient for each process in the original parallel job to write out its own file.
How the individual files are divided amongst processors varies from reader to reader. If there are many more files than processors, then the sharing is minimal. For smaller numbers of larger files, sharing becomes more likely.
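As a simple illustration only (the file-name pattern and file count are hypothetical, and each reader has its own assignment policy), a round-robin division of piece files over processes might look like this:

 // Sketch of one possible division of N piece files over P processes
 // (round-robin).  With many more files than processes, no file is shared.
 #include <mpi.h>
 #include <cstdio>
 
 int main(int argc, char *argv[])
 {
   MPI_Init(&argc, &argv);
   int rank, numProcs;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
 
   const int numFiles = 64;  // hypothetical number of piece files
   for (int i = rank; i < numFiles; i += numProcs)
   {
     char fileName[64];
     std::snprintf(fileName, sizeof(fileName), "piece_%04d.vtu", i);
     std::printf("rank %d reads %s\n", rank, fileName);
     // ... open and read fileName; each file is touched by exactly one process ...
   }
 
   MPI_Finalize();
   return 0;
 }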
Why it is inefficient:
Usually it is not. Having each process read from its own independent files means that individual reads are usually longer (giving the storage system more opportunity to optimize) and that individual files are more likely to be loaded from separate disks.
Of course, ParaView is at the mercy of how the files were written to the storage system. It could be the case that the files are distributed in a strange striped manner that encourages many short reads from a single disk. But this situation is less likely than a more organized layout.
There is also the circumstance where multiple processors still access a single file. This raises the same issues discussed in #Monolithic File. However, in this situation the number of processors accessing an individual file is reduced, so the effects are less dramatic.
What readers do this:
The Exodus reader is the purest example of this form. Not only is the metadata repeated for each file, but this particular reader is designed such that no two processors read parts of the same file.
The rest of the file formats incorporating multiple files also have a central metadata file associated with them. These are listed in the #Common File section.
- Doesn't the Exodus reader read the metadata from the first file on all processes? Berk 10:29, 8 Feb 2007 (EST)
Parallel I/O Layer
Before we can optimize our readers for parallel reading from parallel disks, we need an I/O layer that provides the essential tools for doing so. This layer will collect multiple reads from multiple processes and combine them into larger reads that can be streamed from disk more efficiently and then distributed to the processes.
There exist several implementations of such an I/O layer, and we certainly do not want to implement our own from scratch. A good choice is the I/O layer that is part of the MPI version 2 specification and is available in most MPI distributions today. Due to its wide availability and efficiency, this is the layer that we would like to use.
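For illustration, below is a minimal sketch of the kind of collective read the MPI-2 I/O layer provides; the file name and chunk size are hypothetical. Each process posts its own request, and the MPI implementation is free to combine the requests into fewer, larger reads.

 // Minimal sketch of an MPI-2 collective read.  The "_all" (collective)
 // variant lets the MPI implementation aggregate the per-process requests.
 #include <mpi.h>
 #include <vector>
 
 int main(int argc, char *argv[])
 {
   MPI_Init(&argc, &argv);
   int rank;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
   MPI_File file;
   MPI_File_open(MPI_COMM_WORLD, "data.raw",
                 MPI_MODE_RDONLY, MPI_INFO_NULL, &file);
 
   // Each process reads its own contiguous 1 MB slab of the shared file.
   const int chunkSize = 1 << 20;
   std::vector<char> buffer(chunkSize);
   MPI_Offset offset = static_cast<MPI_Offset>(rank) * chunkSize;
   MPI_File_read_at_all(file, offset, buffer.data(), chunkSize,
                        MPI_BYTE, MPI_STATUS_IGNORE);
 
   MPI_File_close(&file);
   MPI_Finalize();
   return 0;
 }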
However, ParaView must work in both serial and parallel environments. This means that MPI (and hence its I/O layer) is not always available. One way around this is to have multiple versions of each reader: one that reads using standard I/O and one optimized for parallel reads. Unfortunately, this approach would lead to a maintenance nightmare, causing more bugs and worse support.
Instead, we will build our own I/O layer which is really just a thin veneer over the MPI I/O layer. When MPI is available, we will use the MPI library. When it is not, we will swap out the layer for something that simply calls POSIX I/O functions.
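A rough sketch of what such a veneer might look like is shown below. The class name, method names, and the PARAVIEW_USE_MPI macro are hypothetical, not an actual VTK/ParaView API; the point is that the same interface forwards to MPI-IO when MPI is available and to plain stdio calls otherwise.

 // Hypothetical sketch of the thin I/O veneer: one interface, two backends.
 #include <cstdio>
 #ifdef PARAVIEW_USE_MPI
 #include <mpi.h>
 #endif
 
 class ParallelFileReader
 {
 public:
   bool Open(const char *fileName)
   {
 #ifdef PARAVIEW_USE_MPI
     return MPI_File_open(MPI_COMM_WORLD, fileName, MPI_MODE_RDONLY,
                          MPI_INFO_NULL, &this->MPIFile) == MPI_SUCCESS;
 #else
     this->File = std::fopen(fileName, "rb");
     return this->File != nullptr;
 #endif
   }
 
   // Read 'length' bytes starting at 'offset' into 'buffer'.
   bool ReadAt(long long offset, void *buffer, int length)
   {
 #ifdef PARAVIEW_USE_MPI
     // Collective call: the MPI layer may merge requests from all processes.
     return MPI_File_read_at_all(this->MPIFile, offset, buffer, length,
                                 MPI_BYTE, MPI_STATUS_IGNORE) == MPI_SUCCESS;
 #else
     // Serial fallback: plain seek and read.
     if (std::fseek(this->File, static_cast<long>(offset), SEEK_SET) != 0)
       return false;
     return std::fread(buffer, 1, length, this->File) ==
            static_cast<size_t>(length);
 #endif
   }
 
 private:
 #ifdef PARAVIEW_USE_MPI
   MPI_File MPIFile;
 #else
   std::FILE *File = nullptr;
 #endif
 };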