Data Array Strides and Iteration
One of the most important issues we have run into with our in-situ work is the ability to shallow copy arrays with strides that are different than those native to the vtkDataArrays. That is, the simulation stores its data in a different way than VTK does.
We want vtkDataArray to conform to data layouts that are provided by simulations. Our current solution is to subclass the appropriate data array and handle the different layout in the subclass. That works fine so long as the calling codes call the virtual accessor methods (such as GetTuple). However, using these methods is slow (see timings below), so many filters opt to grab a properly cast C pointer and access the array directly. If an application does this, then there is no choice but to deep copy the array, which is back to the bad situation we had to begin with. Another problem which may come up is that there are several accessor methods in vtkDataArrayTemplate and its subclasses (such as GetValue) that are not virtual. If something calls one of these methods on a modified data array subclass, something wrong will happen.
This document explores ways to handle different striding in data arrays while allowing algorithms efficient access to the data.
Data Layouts to Support
Let us briefly consider the different ways we expect do find data to be lain out. We can then postulate on a general storage mechanism that handles all of them.
VTK Layout
Everyone knows VTK's Layout, right. Components for each tuple are in contiguous areas of memory. Tuples are packed tightly in order.
|
|
... |
|
Spaced Out Tuples
It might be the case that each tuple is separated by some amount of space. For example, the components in a 3-tuple may actually be 6 spaces away from each other.
|
|
... |
|
This type of storage can happen if a simulation is storing all its variables together as one large tuple.
Component Major Order
We have found that it is also common to have each component to be stored together. That is, there is a separate array for each component that stores the component for each tuple.
T0C0 | T1C0 | T2C0 | ... | TnC0 |
T0C1 | T1C1 | T2C1 | ... | TnC1 |
T0C2 | T1C2 | T2C2 | ... | TnC2 |
A simple example of this type of layout is the Exodus file format. Each component of a vector is stored in its own array.
Contiguous Chunks
It is sometimes the case that not all of the values are stored in contiguous arrays. It can be that a block of values are stored in one array, another block stored in another array, and so on. This is the case for the internal CTH data. The data for a 3D block is stored in contiguous planes. Each plane is stored in a separate area of memory.
Layouts Not Considered
Here are some of the data layouts that we will not necessarily support.
- Variable spacing between components. If there is no constant distance between component values for a reasonably large number of values, there is no real efficient way to access them without creating lookup arrays as large as the data itself.
General Storage
Let us consider a general storage pattern as follows. The array for each component is considered as being separate. Each component has a list of blocks of data. (We will assume that the number of blocks is small and the size of the blocks is large. A general implementation can support large numbers of blocks, but it is OK to be inefficient to access or store.) Within each block we know the address of the first component and the distance between each component. It is then a simple formula to find the address of any component within a block.
Note that this general storage should work efficiently for the current VTK layout. All the component arrays would point to the same single contiguous block of memory (offset by the location of the component).
Access Timing
Before we can reasonably talk about the efficiency of data array iteration, we should understand how long it takes to use the different accessors. To do this, I a created a simple program that does a simple computation on a set of double arrays. It times how long it takes to access the data through Get/SetTuple1, Get/SetValue, or direct array access (from the array returned with GetPointer). Here is the program:
#include "vtkDoubleArray.h" #include "vtkMath.h" #include "vtkTimerLog.h" #define NUM_ARRAY_ENTRIES 10000000 #define NUM_REPETITIONS 100 #define TUPLE_ACCESS 0 #define VALUE_ACCESS 1 #define DIRECT_ACCESS 0 int main(int, char **) { vtkTimerLog *timer = vtkTimerLog::New(); vtkDoubleArray *a = vtkDoubleArray::New(); vtkDoubleArray *b = vtkDoubleArray::New(); vtkDoubleArray *c = vtkDoubleArray::New(); a->SetNumberOfTuples(NUM_ARRAY_ENTRIES); b->SetNumberOfTuples(NUM_ARRAY_ENTRIES); c->SetNumberOfTuples(NUM_ARRAY_ENTRIES); for (vtkIdType i = 0; i < NUM_ARRAY_ENTRIES; i++) { a->SetValue(i, vtkMath::Random()); b->SetValue(i, vtkMath::Random()); } #if DIRECT_ACCESS double *aarray = a->GetPointer(0); double *barray = b->GetPointer(0); double *carray = c->WritePointer(0, NUM_ARRAY_ENTRIES); #endif timer->StartTimer(); for (int j = 0; j < NUM_REPETITIONS; j++) { for (vtkIdType i = 0; i < NUM_ARRAY_ENTRIES; i++) { double aval, bval, cval; #if TUPLE_ACCESS aval = a->GetTuple1(i); bval = b->GetTuple1(i); #endif #if VALUE_ACCESS aval = a->GetValue(i); bval = b->GetValue(i); #endif #if DIRECT_ACCESS aval = aarray[i]; bval = barray[i]; #endif cval = aval + bval; #if TUPLE_ACCESS c->SetTuple1(i, cval); #endif #if VALUE_ACCESS c->SetValue(i, cval); #endif #if DIRECT_ACCESS carray[i] = cval; #endif } } timer->StopTimer(); cout << "Time to compute: " << timer->GetElapsedTime() << endl; timer->Delete(); a->Delete(); b->Delete(); c->Delete(); return 0; }
I compiled this in release mode (with compiler optimizations on) and linked it against a similarly compiled VTK. I ran the test for the Get/SetValue access twice: once for the current implementation and once for if those methods are declared virtual. The latter is more representative of the speed for accessing a subclasses array since we would have to declare them virtual for them to work anyway. I ran the program for arrays containing 10,000,000 and iterating over them 100 times. Here are the results.
Access Type | Time (sec) |
---|---|
GetTuple1/SetTuple1 | 36.5643 |
GetValue/SetValue - inline | 7.16258 |
GetValue/SetValue - virtual | 10.904 |
Direct array access | 7.29754 |
In summary, the inlined Get/SetValue is just as fast as direct memory access. The compiler seems to be doing its job in inline these methods. Making the methods virtual adds about a 50% overhead to the memory access. The Get/SetTuple methods are crap in terms of speed. They take over 5 times as long as direct memory access.
Minimal Changes
One of the problems with the current implementation is that a coder, when faced with a vtkDataArray of unknown type, is faced with two options. The first is to use the generic Get/SetTuple methods, which, as demonstrated above, are slow. The second option is to use the vtkTemplateMacro to cast a pointer and call a templated method. That results in code like this.
switch (dataArray->GetDataType()) { vtkTemplateMacro(myTemplatedFunction(static_cast<VTK_TT>(dataArray->GetVoidPointer(0)), dataArray->GetNumberOfTuples(), dataArray->GetNumberOfComponents())); }
So a coder either has to accept inefficiency or do raw pointer access. Thus, we are actually encouraging coders to call GetVoidPointer, which forces us to do deep memory copies in in-situ.
A better approach is to encourage users to down cast the array to the appropriate subclass and then use the more efficient typed accessors to the data. This has the double advantage of making code management easier for coders and giving us more control over how the arrays are accessed. The simplest way to implement this is to add another typedef to vtkTemplateMacro that defines the array type.
switch (dataArray->GetDataType()) { vtkTemplateMacro(myTemplatedFunction(VTK_ARRAYT::SafeDownCast(dataArray))); }
The only disadvantage to this approach that I can see is that the templated function is no longer templated on type for which the computation is happening. This can also be easily circumvented by adding something like typedef T ValueType
to vtkDataArrayTemplate. Then the templated function can get properly cast values as follows.
template<class T> void myTemplatedFunction(T *dataArray) { T::DataType value = dataArray->GetValue(0);
Possible Workarounds
In this section, we explore ways to support efficient access to different data layouts.
Support Multiple Layouts in vtkDataArrayTemplate
We could generalize the memory storage in vtkDataArrayTemplate. By default it would store data as it is now. However, it would also allow components in separate arrays and of different strides and split into blocks.
Part of this should also add better typed accessor methods. Basically, there should be equivalent Get/SetTuple and Get/SetComponent that are properly typed and efficient.
Pros
- Easiest for the in-situ coder. You no longer have to have a vtkDataArray subclass to support a new kind of data layout.
- Do not have to make methods virtual. Methods like GetValue can remain inline and therefore still have efficient access.
Cons
- Complicates the implementation of vtkDataArrayTemplate a lot.
- The generalized data layout may make all access slower. It's hard to tell how much until we test it.
- What will the interface to SetArray look like? How do we specify all of the different arrays? How do we specify memory management? How will it make sense to the coder?
Create a Templated Iterator
What if we left the current implementation of vtkDataArray alone but created a templated iterator that could efficiently iterate over data either in VTK layout or in some other layout. The interface could look something like this:
template<class arrayT> class vtkComponentIterator { public: typedef arrayT ArrayType; typedef ArrayType::ValueType ValueType; vtkComponentIterator(ArrayType *source, int component); inline int GetComponentIndex(); inline vtkIdType GetTupleIndex(); inline void SetTupleIndex(vtkIdType index); inline void NextTuple(); inline void PreviousTuple(); inline int OnValidTuple(); ValueType GetValue(); ... };
The ArrayType is assumed to be a subclass of vtkDataArrayTemplate. Note that the ValueType typedef requires the addition of a ValueType typedef in vtkDataArrayTemplate as suggested above.
This iterator could work efficiently if vtkDataArrayTemplate supported a virtual function that took a tuple and component index and returned a pointer where that component is stored in memory, a stride for the component, and a range of tuple values for which this array is valid.
For efficiency, the iterator would get a C array pointer and keep it around until it gets a tuple index outside of that range. Thus, it will only call that virtual function when absolutely necessary. If a user is simply marching through the data in order, the call will happen a minimal amount of times.
For convenience, vtkDataArrayTemplate should also declare typedef vtkComponentIterator<T> ComponentIterator
and a method ComponentIterator GetComponentIterator(int component);
It would also be easy and potentially useful to have a vtkTupleIterator that used component iterators under the covers.
Pros
- By abstracting away array iteration, it should make the code both efficient and easier to read.
- Should have a small effect on existing code. We are adding functionality, not changing it.
Cons
- Will the iterator access really be as fast as accessing the array directly (or close to it)? Hard to tell until it is implemented and tested.
- Requires in-situ coders to make their own subclasses of vtkDataArray.
Do Almost Nothing
Make accessor methods virtual (so that subclasses actually work). Also add better typed accessor methods. Basically, there should be equivalent Get/SetTuple and Get/SetComponent that are properly typed and efficient.
Pros
- Easy to implement and maintain.
- Will have the smallest effect on existing code.
Cons
- Changing methods virtual will make all access slower. In that case, why not just add the iterator for more direct access?
- Requires in-situ coders to make their own subclasses of vtkDataArray.
Acknowledgments
This work was done in part at Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.