[Pw_forum] Parallelization of diaghg

Sat Jan 6 21:50:39 CET 2007

On Sunday 07 January 2007 02:19, Axel Kohlmeyer wrote:
> On 1/6/07, Alexander Shaposhnikov <shaposh at isp.nsc.ru> wrote:
> > Hi,
> >
> > yet another parallelization issue. As far as i understand, the subroutine
> > cdiaghg for Davidson diagonalization is not parallelized
> > by default (in recent versions), and this is for the best, as enabling the
> > parallel algorithm ('david + para ') only increases computation time
> > (almost always for me).
> > Profiling shows this routine takes 1/2 to 2/3 of the execution time for big
> > jobs running in parallel on an 8-core dual Xeon Clovertown 2.66GHz machine,
> > so a working parallelization algorithm could give a sizeable performance
> > boost.
>
> well, on the other hand you have to consider the problem of memory
> contention, which must be a huge problem on a machine like yours.
>
> i suggest you first do a test by starting a series of 1, 2, 3, 4,...,8
> identical and not too small serial calculations simultaneously and compare
> the wall time of those jobs to get an impression of how much of a
> performance improvement from this kind of machine is maximally possible
> (this is an ideally parallel problem with no communication). i have no
> previous experience with that kind of hardware, but extrapolating from
> woodcrest performance numbers and the fact that your cpus are for all
> practical purposes two woodcrest cpus glued on one socket, i'd expect that
> you'd see a significant performance degradation when running 8 processes.
> i would not be surprised if the optimum would be to use only two thirds
> or even half the cores for larger problems.
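
The test suggested above could be scripted roughly as follows. This is only a
minimal sketch: the helper names run_copies and throughput_scaling are made up
for illustration, and DEMO_CMD is a harmless placeholder that would have to be
replaced by the real serial pw.x command line. It starts n identical copies at
once, waits for all of them, and reports the throughput relative to a single
job (n would be ideal scaling).

```python
import subprocess
import sys
import time

# Placeholder command (an assumption, not a real benchmark): replace it with
# the actual serial job, ideally giving each copy its own working directory.
DEMO_CMD = [sys.executable, "-c", "pass"]


def run_copies(cmd, n):
    """Start n identical copies of cmd at once; return wall time in seconds."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd, stdout=subprocess.DEVNULL) for _ in range(n)]
    for p in procs:
        p.wait()  # block until every copy has finished
    return time.perf_counter() - start


def throughput_scaling(cmd, max_copies):
    """Return {n: n * t1 / tn}, the throughput of n jobs relative to one job."""
    t1 = run_copies(cmd, 1)
    scaling = {1: 1.0}
    for n in range(2, max_copies + 1):
        tn = run_copies(cmd, n)
        scaling[n] = n * t1 / tn
    return scaling
```

Calling throughput_scaling(cmd, 8) with a real job would show where the gain
from adding more simultaneous processes starts to flatten out.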

The SMP speedup for relatively large pw.x jobs (such as an SCF calculation of a
192-atom ZrO2 supercell with an ecut of 50) is ~4.5. As I said, cdiaghg does
not parallelize, and it takes ~1/3 of the execution time of the whole 8-CPU
job. If it could be run efficiently in parallel to get, say, a 2X speedup on
8 CPUs, I could achieve a total SMP speedup of ~5.5X.
That's the difference.
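
The arithmetic behind this estimate can be written out as a small sketch. The
function name is made up for illustration; the inputs are the numbers quoted
above (current speedup 4.5, cdiaghg taking 1/3 of the 8-CPU run time, a
hypothetical 2X gain on that routine alone).

```python
def speedup_after_partial_fix(current_speedup, fraction, factor):
    """Overall speedup if a part taking `fraction` of the current run time
    becomes `factor` times faster and the rest is unchanged."""
    remaining = (1.0 - fraction) + fraction / factor
    return current_speedup / remaining


# cdiaghg made 2X faster: run time shrinks to 2/3 + 1/6 = 5/6 of the original.
estimate = speedup_after_partial_fix(4.5, 1.0 / 3.0, 2.0)

# Upper bound: cdiaghg costing nothing at all leaves 2/3 of the run time.
limit = speedup_after_partial_fix(4.5, 1.0 / 3.0, float("inf"))

print(round(estimate, 2), round(limit, 2))
```

This gives about 5.4, in line with the rough figure above, and a ceiling of
6.75 even if the routine took no time at all.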

On the other hand, memory contention is indeed a huge problem for cp/cpmd
codes, so this Xeon machine is barely faster than a dual Opteron 280 2.4GHz
(4 cores) for cpmd. For pw.x, however, it is ~2.5 times faster for large jobs,
and could be made even better with a working diaghg parallelization algorithm.

> cheers,
>     axel.

Best Regards,
Alexander Shaposhnikov