[Pw_forum] PW taskgroups and a large run on a BG/P
David Farrell
davidfarrell2008 at u.northwestern.edu
Wed Feb 11 15:11:43 CET 2009
Yes, but only just in the last few days. I also CC'd this to the PWSCF
list.
I travelled out to a meeting at Argonne's LCF (where the BG is
hosted), and was able to begin to work more closely with Nichols
Romero on the issue.
I was able to try the following systems (a pw BOMD run on a Li4BN3H10
crystal with an initial temperature and periodic boundaries, no vacuum
in the supercell), all run in vn mode:
432 atoms (960 electrons) on 1024, 4096, 8192 procs
576 atoms (1280 electrons) on 1024, 2048, 4096 procs
864 atoms (1920 electrons) on 1024, 2048, 4096 procs
What I found (with -ntg 8 and -ndiag 1):
- In all cases, the 1024 proc case was able to get into the
electronic loop and complete at least one SCF step (before the runtime
got too long and it was killed)
- In all cases, the runs on more than 1024 procs died before
outputting any SCF step data. The complaint found in the stderr files was
the 'from n_plane_waves : error # 1' and 'no PWs!' error. In
the 8192 proc case (ntg 64 ndiag 1), the error was 'wrong ngm', but I
think that is actually the same error (data spread too thin), just
manifesting in a different way.
- All of the systems on all processors had the following complaints in
the beginning of stdout:
    Iterative solution of the eigenvalue problem
    Too few electrons for parallel algorithm
    we need at least as many bands as SQRT(nproc)
    a serial algorithm will be used
    Message from routine data_structure:
    some processors have no planes
    Message from routine data_structure:
    some processors have no smooth planes
... regardless of whether they were able to successfully enter the SCF
steps. I think the error about the number of bands may be
erroneous, but I would have to dig through the code a bit more to be
sure.
- ndiag > 1 doesn't work in any case, which points toward the idea that
the parallel orthogonalization isn't working. The errors spat out are
like this (from the 432 atom case on 1024 procs with -ntg 8 -ndiag 128):
    from pdpotf : error # 1
    problems computing cholesky decomposition
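As a rough check on the 'Too few electrons' message quoted above, the
arithmetic can be sketched in Python. This assumes the number of bands is
about half the number of electrons (doubly occupied bands, no extra empty
bands), which is an assumption and not something read from the PWscf
source. The function name is made up for illustration; the processor
counts are the ones listed for each system.

```python
import math

def bands_vs_sqrt_nproc(n_electrons, nproc):
    """Compare an estimated band count with SQRT(nproc).

    Assumption: n_bands = n_electrons / 2 (doubly occupied bands only).
    """
    n_bands = n_electrons // 2
    threshold = math.sqrt(nproc)
    return n_bands, threshold, n_bands >= threshold

# (atoms, electrons) -> processor counts tried, as listed in this post
runs = {
    (432, 960): (1024, 4096, 8192),
    (576, 1280): (1024, 2048, 4096),
    (864, 1920): (1024, 2048, 4096),
}

for (atoms, electrons), proc_counts in runs.items():
    for nproc in proc_counts:
        n_bands, threshold, ok = bands_vs_sqrt_nproc(electrons, nproc)
        print(f"{atoms} atoms, {nproc} procs: "
              f"bands~{n_bands}, sqrt(nproc)={threshold:.1f}, enough={ok}")
```

Under that assumption every run has far more bands than SQRT(nproc),
which is consistent with the suspicion that the message is not really
about the band count.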
I can try a 1008 atom case (2240 electrons) and the above system as a
1152 atom case (2560 electrons). I suspect those may break down
because of memory issues even at 1024 procs, but I think
getting to the bottom of why the above cases fail so spectacularly
will be required if I want to get to a larger system on more procs. I
also haven't yet looked into the taskgroups to see what, if any, trouble
they are causing.
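The 'data spread too thin' idea can be illustrated with a small Python
sketch. It assumes the FFT grid planes are dealt out to processors in a
simple block fashion, which may not match what PWscf actually does (in
particular, it ignores taskgroups), and the plane count of 360 is a
made-up placeholder, not a value from these runs.

```python
def planes_per_proc(n_planes, nproc):
    """Block-distribute n_planes over nproc ranks (illustrative only)."""
    base, extra = divmod(n_planes, nproc)
    # The first `extra` ranks get one additional plane each.
    return [base + 1 if rank < extra else base for rank in range(nproc)]

def idle_procs(n_planes, nproc):
    """Count ranks that end up with no planes at all."""
    return sum(1 for n in planes_per_proc(n_planes, nproc) if n == 0)

HYPOTHETICAL_N_PLANES = 360  # placeholder, not taken from the actual runs

for nproc in (1024, 2048, 4096, 8192):
    idle = idle_procs(HYPOTHETICAL_N_PLANES, nproc)
    print(f"{nproc} procs: {idle} with no planes")
```

Once the processor count exceeds the number of planes, every extra
processor is left without one, which is the situation the
data_structure messages describe.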
I also tried the above with the CVS version with SCALAPACK and seemed
to get the same behavior. If I am able to run, say, ~2500-3000
electrons on 4192 cores, I will consider that a win. But right now, it
seems that 1024 is about as high as I can go.
Dave
On Feb 11, 2009, at 7:18 AM, Paolo Giannozzi wrote:
> Hi, any news on your BG problem? Paolo
>
> --
> Paolo Giannozzi, Democritos and University of Udine, Italy
David E. Farrell
Post-Doctoral Fellow
Department of Materials Science and Engineering
Northwestern University
email: d-farrell2 at northwestern.edu