[Pw_forum] pw.x crash on LSF/mvapich
Kiss, Ioan
kissi at uni-mainz.de
Tue Oct 12 20:02:00 CEST 2010
Dear PWSCF users and developers,
I have a problem running pw.x in our computer center.
The MPI environment is mvapich_1.1, the queuing system is LSF, and I have
compiled PWSCF with the Intel compiler suite together with MKL libraries.
The threading via MKL is turned off by exporting OMP_NUM_THREADS=1.
The machines are 8-core Xeons with QDR InfiniBand and 48 GB of ECC memory per node.
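For completeness, the jobs are submitted roughly as follows (a minimal
sketch of the LSF script; the queue name, output file and core count are
placeholders, while the wrapper path and environment settings are the
ones actually used):

#!/bin/bash
#BSUB -n 24                      # e.g. 3 nodes x 8 cores
#BSUB -q normal                  # hypothetical queue name
#BSUB -o pwscf.%J.out            # hypothetical output file

# one MKL/OpenMP thread per MPI rank
export OMP_NUM_THREADS=1

/usr/local/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/mvapich_wrapper \
   MV2_CPU_MAPPING=0:1:2:3:4:5:6:7 ./pwTest.x -in INP-PWSCF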
I would like to perform some geometry optimizations on Cd-doped
CuInSe2 with PWSCF version 4.1.2.
The FFT grid for the slab in question is 150:150:144, and the job runs fine
on 24 CPUs (i.e. 3 nodes with 8 cores each).
However, with the same binary and input file, if I request 48, 72
or 144 CPU cores, the job crashes right after the wavefunction (WFC) initialization:
Self-consistent Calculation
iteration # 1 ecut= 25.00 Ry beta=0.70
Davidson diagonalization with overlap
Signal 15 received.
.
.
.
Signal 15 received.
Job /usr/local/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/mvapich_wrapper VIADEV_USE_SHMEM_ALLREDUCE=0
VIADEV_USE_SHMEM_REDUCE=0 VIADEV_USE_SHMEM_BARRIER=0 DISABLE_RDMA_ALLTOALL=1
DISABLE_RDMA_ALLGATHER=1 DISABLE_RDMA_BARRIER=1 MV2_CPU_MAPPING=0:1:2:3:4:5:6:7 ./pwTest.x -in INP-PWSCF
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00000 moment1 /usr/local/lsf/l Exit (1) 10/12/2010 19:20:36
.
.
.
00001 moment1 /usr/local/lsf/l Exit (174) 10/12/2010 19:20:36
As you can see, I have already tried to deactivate the shared-memory and RDMA
optimizations implemented in mvapich's collective routines, but that did not help either.
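For clarity, the settings passed through the wrapper above are equivalent to
exporting the following before the run (assuming the wrapper simply forwards
the VAR=VALUE pairs into the job environment):

# disable shared-memory collectives
export VIADEV_USE_SHMEM_ALLREDUCE=0
export VIADEV_USE_SHMEM_REDUCE=0
export VIADEV_USE_SHMEM_BARRIER=0
# disable RDMA-based collectives
export DISABLE_RDMA_ALLTOALL=1
export DISABLE_RDMA_ALLGATHER=1
export DISABLE_RDMA_BARRIER=1
# pin one MPI rank per core
export MV2_CPU_MAPPING=0:1:2:3:4:5:6:7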
Strangely, on the same machine I can run CPMD without any issues, so I am really wondering
what I am doing wrong, or what I should change to fix this problem. I have tried several
different MKL versions and so on, but honestly it seems that I simply cannot fix it.
Also, with the same input file and 48-72 CPUs, the job finishes without problems at the
Juelich supercomputer center and also on the department's small local cluster running OpenMPI.
Do you have any ideas why the machine under LSF/mvapich does not cooperate with
PWSCF above 24 CPU cores, or what should be done to remedy this issue?
Thanks in advance for any helpful comments,
Janos.
==========================================
Dr. Janos Kiss e-mail: kissi at uni-mainz.de
Johannes Gutenberg-Universitaet
Institut f. Anorg. u. Analyt. Chemie
AK Prof. Dr. Claudia Felser
Staudinger Weg 9 / Raum 01-230
55128 Mainz/ Germany
Phone: +49-(0)6131-39-22703
Fax: +49-(0)6131-39-26267
Web: http://www.superconductivity.de/
=========================================