[Paraview] 3.98 MPI_Finalize out of order in pvbatch

Kyle Lutz kyle.lutz at kitware.com
Mon Dec 10 09:22:49 EST 2012


On Fri, Dec 7, 2012 at 12:13 PM, Burlen Loring <bloring at lbl.gov> wrote:
> Hi Kyle et al.
>
> Below are stack traces from where PV is hung. I'm stumped by this and can
> get no foothold. I still have one option left if we can get valgrind to run
> with MPI on Nautilus, but it's a long shot: valgrinding pvbatch on my local
> system throws many hundreds of errors, and I'm not sure which of them are
> valid reports.
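>
> (If it comes to that, the usual pattern is to launch valgrind per rank
> under the MPI launcher, something like the line below; the exact launcher
> syntax on Nautilus is a guess on my part, and test.py just stands in for
> the failing script:
>
>   mpirun -np 2 valgrind --log-file=vg.%p.log pvbatch test.py
>
> where --log-file writes one report per process.)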
>
> PV 3.14.1 doesn't hang in pvbatch, so I'm wondering if anyone knows of a
> change in 3.98 that might account for the new hang.
>
> Burlen
>
> rank 0
> #0  0x00002b0762b3f590 in gru_get_next_message () from
> /usr/lib64/libgru.so.0
> #1  0x00002b073a2f4bd2 in MPI_SGI_grudev_progress () at grudev.c:1780
> #2  0x00002b073a31cc25 in MPI_SGI_progress_devices () at progress.c:93
> #3  MPI_SGI_progress () at progress.c:207
> #4  0x00002b073a3244eb in MPI_SGI_request_finalize () at req.c:1548
> #5  0x00002b073a2b8bee in MPI_SGI_finalize () at adi.c:667
> #6  0x00002b073a2e3c04 in PMPI_Finalize () at finalize.c:27
> #7  0x00002b073969d96f in vtkProcessModule::Finalize () at
> /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ClientServerCore/Core/vtkProcessModule.cxx:229
> #8  0x00002b0737bb0f9e in vtkInitializationHelper::Finalize () at
> /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ServerManager/SMApplication/vtkInitializationHelper.cxx:145
> #9  0x0000000000403c50 in ParaViewPython::Run (processType=4, argc=2,
> argv=0x7fff06195c88) at
> /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvpython.h:124
> #10 0x0000000000403cd5 in main (argc=2, argv=0x7fff06195c88) at
> /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvbatch.cxx:21
>
> rank 1
> #0  0x00002b07391bde70 in __nanosleep_nocancel () from
> /lib64/libpthread.so.0
> #1  0x00002b073a32c898 in MPI_SGI_millisleep (milliseconds=<value optimized
> out>) at sleep.c:34
> #2  0x00002b073a326365 in MPI_SGI_slow_request_wait (request=0x7fff061959f8,
> status=0x7fff061959d0, set=0x7fff061959f4, gen_rc=0x7fff061959f0) at
> req.c:1460
> #3  0x00002b073a2c6ef3 in MPI_SGI_slow_barrier (comm=1) at barrier.c:275
> #4  0x00002b073a2b8bf8 in MPI_SGI_finalize () at adi.c:671
> #5  0x00002b073a2e3c04 in PMPI_Finalize () at finalize.c:27
> #6  0x00002b073969d96f in vtkProcessModule::Finalize () at
> /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ClientServerCore/Core/vtkProcessModule.cxx:229
> #7  0x00002b0737bb0f9e in vtkInitializationHelper::Finalize () at
> /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ServerManager/SMApplication/vtkInitializationHelper.cxx:145
> #8  0x0000000000403c50 in ParaViewPython::Run (processType=4, argc=2,
> argv=0x7fff06195c88) at
> /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvpython.h:124
> #9  0x0000000000403cd5 in main (argc=2, argv=0x7fff06195c88) at
> /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvbatch.cxx:21

Hi Burlen,

Thanks for getting these. I'll take a closer look today and see what I can find.

-kyle

>
>
>
> On 12/04/2012 05:15 PM, Burlen Loring wrote:
>>
>> Hi Kyle,
>>
>> I was wrong about MPI_Finalize being invoked twice; I had misread the
>> code. I'm not sure why pvbatch is hanging in MPI_Finalize on Nautilus. I
>> haven't been able to find anything in the debugger. This is new in 3.98.
>>
>> Burlen
>>
>> On 12/03/2012 07:36 AM, Kyle Lutz wrote:
>>>
>>> Hi Burlen,
>>>
>>> On Thu, Nov 29, 2012 at 1:27 PM, Burlen Loring <bloring at lbl.gov> wrote:
>>>>
>>>> It looks like pvserver is also impacted; it hangs after the GUI
>>>> disconnects.
>>>>
>>>>
>>>> On 11/28/2012 12:53 PM, Burlen Loring wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> Some parallel tests have been failing for some time on Nautilus:
>>>>> http://open.cdash.org/viewTest.php?onlyfailed&buildid=2684614
>>>>>
>>>>> There are MPI calls made after MPI_Finalize which cause deadlocks on
>>>>> SGI MPT. It affects pvbatch for sure. The following snippet shows the
>>>>> bug; the bug report is here: http://paraview.org/Bug/view.php?id=13690
>>>>>
>>>>>
>>>>>
>>>>> //----------------------------------------------------------------------------
>>>>> bool vtkProcessModule::Finalize()
>>>>> {
>>>>>
>>>>>    ...
>>>>>
>>>>>    vtkProcessModule::GlobalController->Finalize(1); <--- MPI_Finalize called here
>>>
>>> This shouldn't be calling MPI_Finalize(), since the finalizedExternally
>>> argument is 1, and vtkMPIController::Finalize() guards on it:
>>>
>>>      if (finalizedExternally == 0)
>>>        {
>>>        MPI_Finalize();
>>>        }
>>>
>>> So my guess is that it's being invoked elsewhere.
>>>
>>>>>    ...
>>>>>
>>>>> #ifdef PARAVIEW_USE_MPI
>>>>>    if (vtkProcessModule::FinalizeMPI)
>>>>>      {
>>>>>      MPI_Barrier(MPI_COMM_WORLD); <--- barrier after MPI_Finalize
>>>>>      MPI_Finalize();              <--- second MPI_Finalize
>>>>>      }
>>>>> #endif
>>>
>>> I've made a patch which should prevent this section of code from ever
>>> being executed twice, by setting the FinalizeMPI flag to false after
>>> calling MPI_Finalize(). Can you take a look here:
>>> http://review.source.kitware.com/#/t/1808/ and let me know if that
>>> helps?
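>>>
>>> Roughly, the change amounts to something like this (just a sketch, not
>>> the exact diff; see the review link above. The MPI_Finalized() check is
>>> only an illustrative extra guard, not necessarily part of the patch):
>>>
>>> #ifdef PARAVIEW_USE_MPI
>>>   if (vtkProcessModule::FinalizeMPI)
>>>     {
>>>     int alreadyFinalized = 0;
>>>     MPI_Finalized(&alreadyFinalized);
>>>     if (!alreadyFinalized)
>>>       {
>>>       MPI_Barrier(MPI_COMM_WORLD);
>>>       MPI_Finalize();
>>>       }
>>>     // Clear the flag so a second call to vtkProcessModule::Finalize()
>>>     // does not try to finalize MPI again.
>>>     vtkProcessModule::FinalizeMPI = false;
>>>     }
>>> #endif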
>>>
>>> Otherwise, would you be able to set a breakpoint on MPI_Finalize() and
>>> get a backtrace of where it gets invoked for the second time? That
>>> would be very helpful in tracking down the problem.
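>>>
>>> For example, something along these lines on one rank (assuming gdb is
>>> available there; how you attach a debugger under MPT on Nautilus may
>>> differ, and test.py is just a placeholder for the failing script):
>>>
>>>   $ gdb --args pvbatch test.py
>>>   (gdb) break PMPI_Finalize
>>>   (gdb) run
>>>   (gdb) bt        <--- backtrace at the first call
>>>   (gdb) continue
>>>   (gdb) bt        <--- if it stopped again, this is the second call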
>>>
>>> Thanks,
>>> Kyle
>>
>>
>

