Data Array Strides and Iteration

From ParaQ Wiki
Jump to navigationJump to search

One of the most important issues we have run into with our in-situ work is the ability to shallow copy arrays with strides that are different than those native to the vtkDataArrays. That is, the simulation stores its data in a different way than VTK does.

We want vtkDataArray to conform to data layouts that are provided by simulations. Our current solution is to subclass the appropriate data array and handle the different layout in the subclass. That works fine so long as the calling codes call the virtual accessor methods (such as GetTuple). However, using these methods is slow (see timings below), so many filters opt to grab a properly cast C pointer and access the array directly. If an application does this, then there is no choice but to deep copy the array, which is back to the bad situation we had to begin with. Another problem which may come up is that there are several accessor methods in vtkDataArrayTemplate and its subclasses (such as GetValue) that are not virtual. If something calls one of these methods on a modified data array subclass, something wrong will happen.

This document explores ways to handle different striding in data arrays while allowing algorithms efficient access to the data.

Data Layouts to Support

Let us briefly consider the different ways we expect do find data to be lain out. We can then postulate on a general storage mechanism that handles all of them.

VTK Layout

Everyone knows VTK's Layout, right. Components for each tuple are in contiguous areas of memory. Tuples are packed tightly in order.

T0C0 T0C1 T0C2
T1C0 T1C1 T1C2
...
TnC0 TnC1 TnC2

Spaced Out Tuples

It might be the case that each tuple is separated by some amount of space. For example, the components in a 3-tuple may actually be 6 spaces away from each other.

T0C0 T0C1 T0C2
T1C0 T1C1 T1C2
...
TnC0 TnC1

This type of storage can happen if a simulation is storing all its variables together as one large tuple.

Component Major Order

We have found that it is also common to have each component to be stored together. That is, there is a separate array for each component that stores the component for each tuple.

T0C0 T1C0 T2C0 ... TnC0
T0C1 T1C1 T2C1 ... TnC1
T0C2 T1C2 T2C2 ... TnC2

A simple example of this type of layout is the Exodus file format. Each component of a vector is stored in its own array.

Contiguous Chunks

It is sometimes the case that not all of the values are stored in contiguous arrays. It can be that a block of values are stored in one array, another block stored in another array, and so on. This is the case for the internal CTH data. The data for a 3D block is stored in contiguous planes. Each plane is stored in a separate area of memory.

Layouts Not Considered

Here are some of the data layouts that we will not necessarily support.

  • Variable spacing between components. If there is no constant distance between component values for a reasonably large number of values, there is no real efficient way to access them without creating lookup arrays as large as the data itself.

General Storage

Let us consider a general storage pattern as follows. The array for each component is considered as being separate. Each component has a list of blocks of data. (We will assume that the number of blocks is small and the size of the blocks is large. A general implementation can support large numbers of blocks, but it is OK to be inefficient to access or store.) Within each block we know the address of the first component and the distance between each component. It is then a simple formula to find the address of any component within a block.

Note that this general storage should work efficiently for the current VTK layout. All the component arrays would point to the same single contiguous block of memory (offset by the location of the component).

Access Timing

Before we can reasonably talk about the efficiency of data array iteration, we should understand how long it takes to use the different accessors. To do this, I a created a simple program that does a simple computation on a set of double arrays. It times how long it takes to access the data through Get/SetTuple1, Get/SetValue, or direct array access (from the array returned with GetPointer). Here is the program:

#include "vtkDoubleArray.h"
#include "vtkMath.h"
#include "vtkTimerLog.h"

#define NUM_ARRAY_ENTRIES	10000000
#define NUM_REPETITIONS		100

#define TUPLE_ACCESS  0
#define VALUE_ACCESS  1
#define DIRECT_ACCESS 0

int main(int, char **)
{
  vtkTimerLog *timer = vtkTimerLog::New();
  vtkDoubleArray *a = vtkDoubleArray::New();
  vtkDoubleArray *b = vtkDoubleArray::New();
  vtkDoubleArray *c = vtkDoubleArray::New();
  a->SetNumberOfTuples(NUM_ARRAY_ENTRIES);
  b->SetNumberOfTuples(NUM_ARRAY_ENTRIES);
  c->SetNumberOfTuples(NUM_ARRAY_ENTRIES);

  for (vtkIdType i = 0; i < NUM_ARRAY_ENTRIES; i++)
    {
    a->SetValue(i, vtkMath::Random());
    b->SetValue(i, vtkMath::Random());
    }

#if DIRECT_ACCESS
  double *aarray = a->GetPointer(0);
  double *barray = b->GetPointer(0);
  double *carray = c->WritePointer(0, NUM_ARRAY_ENTRIES);
#endif

  timer->StartTimer();
  for (int j = 0; j < NUM_REPETITIONS; j++)
    {
    for (vtkIdType i = 0; i < NUM_ARRAY_ENTRIES; i++)
      {
      double aval, bval, cval;

#if TUPLE_ACCESS
      aval = a->GetTuple1(i);
      bval = b->GetTuple1(i);
#endif
#if VALUE_ACCESS
      aval = a->GetValue(i);
      bval = b->GetValue(i);
#endif
#if DIRECT_ACCESS
      aval = aarray[i];
      bval = barray[i];
#endif

      cval = aval + bval;

#if TUPLE_ACCESS
      c->SetTuple1(i, cval);
#endif
#if VALUE_ACCESS
      c->SetValue(i, cval);
#endif
#if DIRECT_ACCESS
      carray[i] = cval;
#endif
      }
    }
  timer->StopTimer();

  cout << "Time to compute: " << timer->GetElapsedTime() << endl;

  timer->Delete();
  a->Delete();
  b->Delete();
  c->Delete();
  return 0;
}

I compiled this in release mode (with compiler optimizations on) and linked it against a similarly compiled VTK. I ran the test for the Get/SetValue access twice: once for the current implementation and once for if those methods are declared virtual. The latter is more representative of the speed for accessing a subclasses array since we would have to declare them virtual for them to work anyway. I ran the program for arrays containing 10,000,000 and iterating over them 100 times. Here are the results.

Access Type Time (sec)
GetTuple1/SetTuple1 36.5643
GetValue/SetValue - inline 7.16258
GetValue/SetValue - virtual 10.904
Direct array access 7.29754

In summary, the inlined Get/SetValue is just as fast as direct memory access. The compiler seems to be doing its job in inline these methods. Making the methods virtual adds about a 50% overhead to the memory access. The Get/SetTuple methods are crap in terms of speed. They take over 5 times as long as direct memory access.

Minimal Changes

One of the problems with the current implementation is that a coder, when faced with a vtkDataArray of unknown type, is faced with two options. The first is to use the generic Get/SetTuple methods, which, as demonstrated above, are slow. The second option is to use the vtkTemplateMacro to cast a pointer and call a templated method. That results in code like this.

switch (dataArray->GetDataType())
  {
  vtkTemplateMacro(myTemplatedFunction(static_cast<VTK_TT>(dataArray->GetVoidPointer(0)),
                                       dataArray->GetNumberOfTuples(),
                                       dataArray->GetNumberOfComponents()));
  }

So a coder either has to accept inefficiency or do raw pointer access. Thus, we are actually encouraging coders to call GetVoidPointer, which forces us to do deep memory copies in in-situ.

A better approach is to encourage users to down cast the array to the appropriate subclass and then use the more efficient typed accessors to the data. This has the double advantage of making code management easier for coders and giving us more control over how the arrays are accessed. The simplest way to implement this is to add another typedef to vtkTemplateMacro that defines the array type.

switch (dataArray->GetDataType())
  {
  vtkTemplateMacro(myTemplatedFunction(VTK_ARRAYT::SafeDownCast(dataArray)));
  }

The only disadvantage to this approach that I can see is that the templated function is no longer templated on type for which the computation is happening. This can also be easily circumvented by adding something like typedef T ValueType to vtkDataArrayTemplate. Then the templated function can get properly cast values as follows.

template<class T> void myTemplatedFunction(T *dataArray)
{
  T::DataType value = dataArray->GetValue(0);

Possible Workarounds

In this section, we explore ways to support efficient access to different data layouts.

Support Multiple Layouts in vtkDataArrayTemplate

We could generalize the memory storage in vtkDataArrayTemplate. By default it would store data as it is now. However, it would also allow components in separate arrays and of different strides and split into blocks.

Part of this should also add better typed accessor methods. Basically, there should be equivalent Get/SetTuple and Get/SetComponent that are properly typed and efficient.

Pros

  • Easiest for the in-situ coder. You no longer have to have a vtkDataArray subclass to support a new kind of data layout.
  • Do not have to make methods virtual. Methods like GetValue can remain inline and therefore still have efficient access.

Cons

  • Complicates the implementation of vtkDataArrayTemplate a lot.
  • The generalized data layout may make all access slower. It's hard to tell how much until we test it.
  • What will the interface to SetArray look like? How do we specify all of the different arrays? How do we specify memory management? How will it make sense to the coder?

Create a Templated Iterator

What if we left the current implementation of vtkDataArray alone but created a templated iterator that could efficiently iterate over data either in VTK layout or in some other layout. The interface could look something like this:

template<class arrayT>
class vtkComponentIterator
{
public:
  typedef arrayT ArrayType;
  typedef ArrayType::ValueType ValueType;
  vtkComponentIterator(ArrayType *source, int component);
  inline int GetComponentIndex();
  inline vtkIdType GetTupleIndex();
  inline void SetTupleIndex(vtkIdType index);
  inline void NextTuple();
  inline void PreviousTuple();
  inline int OnValidTuple();
  ValueType GetValue();
  ...
};

The ArrayType is assumed to be a subclass of vtkDataArrayTemplate. Note that the ValueType typedef requires the addition of a ValueType typedef in vtkDataArrayTemplate as suggested above.

This iterator could work efficiently if vtkDataArrayTemplate supported a virtual function that took a tuple and component index and returned a pointer where that component is stored in memory, a stride for the component, and a range of tuple values for which this array is valid.

For efficiency, the iterator would get a C array pointer and keep it around until it gets a tuple index outside of that range. Thus, it will only call that virtual function when absolutely necessary. If a user is simply marching through the data in order, the call will happen a minimal amount of times.

For convenience, vtkDataArrayTemplate should also declare typedef vtkComponentIterator<T> ComponentIterator and a method ComponentIterator GetComponentIterator(int component); It would also be easy and potentially useful to have a vtkTupleIterator that used component iterators under the covers.

Pros

  • By abstracting away array iteration, it should make the code both efficient and easier to read.
  • Should have a small effect on existing code. We are adding functionality, not changing it.

Cons

  • Will the iterator access really be as fast as accessing the array directly (or close to it)? Hard to tell until it is implemented and tested.
  • Requires in-situ coders to make their own subclasses of vtkDataArray.

Do Almost Nothing

Make accessor methods virtual (so that subclasses actually work). Also add better typed accessor methods. Basically, there should be equivalent Get/SetTuple and Get/SetComponent that are properly typed and efficient.

Pros

  • Easy to implement and maintain.
  • Will have the smallest effect on existing code.

Cons

  • Changing methods virtual will make all access slower. In that case, why not just add the iterator for more direct access?
  • Requires in-situ coders to make their own subclasses of vtkDataArray.

Acknowledgments

This work was done in part at Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.