[Pw_forum] Nonlinear scaling with pool parallelization

Tue Apr 5 18:06:22 CEST 2011

Dear QE users and developers,

For some larger calculations I recently obtained a small Beowulf-type 
cluster, consisting of three machines with hexacore i7 CPUs, connected 
by gigabit ethernet. It runs 64-bit Ubuntu; QE 4.2.1 is compiled with 
GCC 4.4, and I compiled OpenMPI 1.4.3 myself. The code is linked against 
the pre-compiled libatlas-corei7sse3 from Ubuntu.

I want to perform calculations of quite large supercells for interface 
and surface studies of magnetic materials. Now, I'm testing the 
parallelization schemes. My understanding is that pool parallelization 
should scale approximately linearly with nodes. Indeed, the calculation 
of a bulk material scales nearly linearly with the number of nodes when 
I assign each node an individual pool. In contrast, if I do not assign 
pools, the bulk calculations slow down drastically because of the 
communication overhead.

Now we come to the strange part: the slab calculation. I did some quick 
timing tests which I would like to share with you. The times, given in 
seconds, are the numbers the code reports while it runs (I cross-checked 
them with htop).

WITH pools:
np	npool	setup	first iteration
6	1	108s	250s
12	2	78s	180s
18	3	69s	152s

WITHOUT pools:
np	setup	first iteration
6	108s	250s
12	75s	186s
18	59s	152s

Without pools I have heavy load on the ethernet, but the calculations 
are about as fast as the ones with pools. With pools, there's almost no 
communication, apart from a few bursts. More importantly, the scaling of 
the calculation with pools is far from linear. With three machines, I 
get less than a factor of two in speed. The gain when going from two to 
three machines is only about 18% for the first iteration (180 s to 152 s).
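
To put numbers on the scaling, the speedup and parallel efficiency can be computed directly from the "WITH pools" table. This is only a small sketch: the timings are copied from the table above, while the function name and the choice of the 6-process run as the baseline are my own assumptions for illustration.

```python
# Timings in seconds, copied from the "WITH pools" table:
# number of processes -> (setup, first iteration)
timings = {6: (108, 250), 12: (78, 180), 18: (69, 152)}


def speedup_table(timings, base_np=6):
    """Return (np, speedup, efficiency) for the first iteration.

    speedup    = t(base_np) / t(np)
    efficiency = speedup / (np / base_np), i.e. 1.0 means linear scaling.
    base_np=6 (one machine) is an assumed baseline, not something QE reports.
    """
    _, base_iter = timings[base_np]
    rows = []
    for nproc, (_, first_iter) in sorted(timings.items()):
        speedup = base_iter / first_iter
        ideal = nproc / base_np
        rows.append((nproc, round(speedup, 2), round(speedup / ideal, 2)))
    return rows


for nproc, speedup, efficiency in speedup_table(timings):
    print(f"np={nproc:2d}  speedup={speedup:.2f}  efficiency={efficiency:.2f}")
```

With these timings the first iteration reaches a speedup of about 1.39 on two machines and about 1.64 on three, i.e. a parallel efficiency of roughly 0.69 and 0.55.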

My program call is:
mpirun -np 18 -hostfile ~/.mpi_hostfile \
    ~/espresso/espresso-4.2.1/bin/pw.x -npool 3 -in pw.in | tee pw.out

The pw.x program interprets the call as intended:

      Parallel version (MPI), running on    18 processors
      K-points division:     npool     =    3
      R & G space division:  proc/pool =    6

Can you explain this behavior? Is there anything I can tune to get a 
better scaling? Is there a known bottleneck for a setup like this? Can 
this be associated with the choice of k-point meshes? For the bulk I use 
a shifted 8x8x8 mesh; for the slab I use an 8 8 1 1 1 0 setting.
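
Regarding the k-point question: pools divide the k-points among themselves, so one thing that can be checked is whether the number of k-points divides evenly by npool. The sketch below assumes the k-points are spread over the pools as evenly as possible and that every k-point costs the same; the function names and the k-point count of 10 are purely hypothetical (the real count has to be read from the pw.x output).

```python
def split_kpoints(nk, npool):
    """Distribute nk k-points over npool pools as evenly as possible.

    Assumption: the first (nk % npool) pools get one extra k-point.
    """
    base, extra = divmod(nk, npool)
    return [base + 1 if i < extra else base for i in range(npool)]


def pool_efficiency(nk, npool):
    """Upper bound on pool efficiency if every k-point costs the same.

    The run is as slow as the most heavily loaded pool.
    """
    return nk / (npool * max(split_kpoints(nk, npool)))


nk = 10  # hypothetical number of k-points, NOT taken from the actual run
for npool in (1, 2, 3):
    print(npool, split_kpoints(nk, npool), round(pool_efficiency(nk, npool), 2))
```

If the k-point count is not a multiple of npool, some pools idle while the busiest one finishes, which limits the scaling even without any communication cost.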

If you would like to have the input files to reproduce the problem, 
please tell me.

With kind regards,
Markus Meinert

-- 
Dipl.-Phys. Markus Meinert

Thin Films and Physics of Nanostructures
Department of Physics
Bielefeld University
Universitätsstraße 25
33615 Bielefeld

Room D2-118
e-mail: meinert at physik.uni-bielefeld.de
Phone: +49 521 106 2661