[Pw_forum] PW taskgroups and a large run on a BG/P

David Farrell davidfarrell2008 at u.northwestern.edu
Wed Feb 11 15:11:43 CET 2009


Yes, but only just in the last few days. I also CC'd this to the PWSCF  
list.

I travelled out to a meeting at Argonne's LCF (where the BG is  
hosted), and was able to begin to work more closely with Nichols  
Romero on the issue.

I was able to try the following systems (a pw BOMD run on a Li4BN3H10
crystal with an initial temperature and periodic boundaries, no vacuum
in the supercell), all run in vn mode.

432 atoms (960 electrons) on 1024, 4096, 8192 procs
576 atoms (1280 electrons) on 1024, 2048, 4096 procs
864 atoms (1920 electrons) on 1024, 2048, 4096 procs

What I found (with -ntg 8 and -ndiag 1):

- In all cases, the 1024 proc case was able to get into the
electronic loop and complete at least one SCF step (before the runtime
got too long and it was killed)

- In all cases, the runs on more than 1024 procs died before
outputting SCF step data. The complaint found in the stderr files was
the 'from n_plane_waves : error #         1' and 'no PWs!' error. In
the 8192 proc case (ntg 64 ndiag 1), the error was 'wrong ngm' - but I
think that is actually the same error (data spread too thin), just
manifesting in a different way.

- All of the systems, on all processor counts, had the following
complaints at the beginning of stdout:

      Iterative solution of the eigenvalue problem
      Too few electrons for parallel algorithm
      we need at least as many bands as SQRT(nproc)

      a serial algorithm will be used

      Message from routine data_structure:
      some processors have no planes
      Message from routine data_structure:
      some processors have no smooth planes

... regardless of whether they were able to successfully enter the SCF
steps. The error about the number of bands I think may be erroneous,
but I would have to dig through the code a bit more to be sure.

- ndiag > 1 doesn't work in any case, which points toward the idea
that the parallel orthogonalization isn't working. The errors spat out
look like this (from the 432 atom case on 1024 procs with -ntg 8
-ndiag 128):
       from  pdpotf  : error #         1
       problems computing cholesky decomposition
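
As a rough check on the band-count message quoted above, the condition it
states (at least as many bands as SQRT(nproc)) can be evaluated for the runs
listed in this message. This is only a sketch: it assumes a spin-unpolarized
calculation in which the number of occupied bands is half the electron count
(the actual number of bands used in the runs is not given here), and the
function names are made up for illustration.

```python
import math

# (atoms, electrons, processor counts) as listed in this message
RUNS = [
    (432, 960, [1024, 4096, 8192]),
    (576, 1280, [1024, 2048, 4096]),
    (864, 1920, [1024, 2048, 4096]),
]


def occupied_bands(electrons):
    # Assumption: spin-unpolarized, two electrons per band.
    return electrons // 2


def meets_band_condition(electrons, nproc):
    # Condition as worded in the message: bands >= SQRT(nproc).
    return occupied_bands(electrons) >= math.sqrt(nproc)


for atoms, electrons, proc_counts in RUNS:
    for nproc in proc_counts:
        verdict = "ok" if meets_band_condition(electrons, nproc) else "too few"
        print(f"{atoms} atoms on {nproc} procs: "
              f"{occupied_bands(electrons)} bands vs "
              f"sqrt(nproc) = {math.sqrt(nproc):.1f} -> {verdict}")
```

Under that assumption every case listed satisfies the condition as literally
worded (the smallest system has 480 occupied bands, while SQRT(8192) is about
90.5), which is consistent with the suspicion that the message is not really
about the band count.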


I can try a 1008 atom case (2240 electrons) and the above system as a
1152 atom case (2560 electrons). I have a feeling that may break down
because of memory issues even at 1024 procs, and I suspect that
getting to the bottom of why the above cases fail so spectacularly
will be required if I want to get to a larger system on more procs. I
also haven't touched the taskgroup thing to see what, if any, trouble
that is causing.

I also tried the above with the CVS version with SCALAPACK and seemed
to get the same behavior. If I am able to run, say, ~2500-3000
electrons on 4192 cores, I will consider that a win. But right now, it
seems that 1024 is about as high as I can go.
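
The 'some processors have no planes' messages can be illustrated with a toy
model. The sketch below is not how PWscf actually lays out its FFT data; it
simply assumes the planes along one grid dimension are dealt out to processes
as evenly as possible. The grid size nr3 = 360 is a made-up value, since the
real FFT grid for these runs is not given in this message, and the function
names are hypothetical.

```python
def planes_per_proc(nr3, nproc):
    """Deal nr3 planes over nproc processes as evenly as possible."""
    base, extra = divmod(nr3, nproc)
    # The first `extra` ranks get one additional plane.
    return [base + 1 if rank < extra else base for rank in range(nproc)]


def count_empty(nr3, nproc):
    """Number of processes that end up with no planes at all."""
    return sum(1 for n in planes_per_proc(nr3, nproc) if n == 0)


NR3 = 360  # hypothetical number of planes, for illustration only

for nproc in (1024, 2048, 4096, 8192):
    empty = count_empty(NR3, nproc)
    print(f"{nproc} procs: {empty} with no planes "
          f"({100.0 * empty / nproc:.0f}% idle in this model)")
```

In this model, any processor count larger than the number of planes leaves
some processes with nothing to hold, and the idle fraction grows with the
processor count.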

Dave


On Feb 11, 2009, at 7:18 AM, Paolo Giannozzi wrote:

> Hi, any news on your BG problem? Paolo
>
> -- 
> Paolo Giannozzi, Democritos and University of Udine, Italy

David E. Farrell
Post-Doctoral Fellow
Department of Materials Science and Engineering
Northwestern University
email: d-farrell2 at northwestern.edu
