[Pw_forum] Nonlinear scaling with pool parallelization
Markus Meinert
meinert at physik.uni-bielefeld.de
Tue Apr 5 18:06:22 CEST 2011
Dear QE users and developers,
for some larger calculations I have just obtained a small Beowulf-type
cluster consisting of three machines with hexacore i7 CPUs, connected
via gigabit ethernet. It runs 64-bit Ubuntu; QE 4.2.1 is compiled with
GCC 4.4 and with OpenMPI 1.4.3, which I built myself. The code is linked
against the pre-compiled libatlas-corei7sse3 package from Ubuntu.
I want to perform calculations of quite large supercells for interface
and surface studies of magnetic materials. Now, I'm testing the
parallelization schemes. My understanding is that pool parallelization
should scale approximately linearly with nodes. Indeed, the calculation
of a bulk material scales nearly linearly with the number of nodes when
I assign each node an individual pool. In contrast, if I do not assign
pools, the calculations slow down dramatically because of the
communication overhead over the gigabit link.
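Schematically, the two kinds of runs differ only in the -npool flag
(paths shortened here; the full call for the slab case is given further
below):

  # one pool per node (3 nodes x 6 cores each)
  mpirun -np 18 -hostfile ~/.mpi_hostfile pw.x -npool 3 -in pw.in

  # no pools: all 18 processes share the R & G space division
  mpirun -np 18 -hostfile ~/.mpi_hostfile pw.x -in pw.in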
Now we come to the strange part: the slab calculation. I did some quick
timing tests which I would like to share with you. The times, given in
seconds, are the numbers the code itself reports while running (I
cross-checked them with htop).
WITH pools:

  np   npool   setup [s]   first iteration [s]
   6     1        108             250
  12     2         78             180
  18     3         69             152

WITHOUT pools:

  np   setup [s]   first iteration [s]
   6      108             250
  12       75             186
  18       59             152
Without pools I have heavy load on the ethernet, but the calculations
are about as fast as the ones with pools. With pools, there's almost no
communication, apart from a few bursts. More importantly, the scaling of
the calculation with pools is far from linear: with three machines I
get a speedup of less than a factor of two, and the gain when going
from two to three machines is only of the order of 25%.
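To put rough numbers on this, using the first-iteration times from the
table above:

  speedup, 3 nodes vs. 1 node:     250 s / 152 s ≈ 1.6
  ideal first iteration, 3 nodes:  250 s / 3     ≈  83 s
  parallel efficiency:              83 s / 152 s ≈ 55 %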
My program call is:
mpirun -np 18 -hostfile ~/.mpi_hostfile \
  ~/espresso/espresso-4.2.1/bin/pw.x -npool 3 -in pw.in | tee pw.out
pw.x picks this up correctly and reports:
Parallel version (MPI), running on 18 processors
K-points division: npool = 3
R & G space division: proc/pool = 6
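For completeness, my hostfile is along these lines (the host names here
are placeholders):

  node1 slots=6
  node2 slots=6
  node3 slots=6

If the first six ranks end up on the first machine and so on, each pool
should sit entirely on one physical node, but I have not explicitly
verified the rank placement.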
Can you explain this behavior? Is there anything I can tune to get a
better scaling? Is there a known bottleneck for a setup like this? Can
this be related to the choice of k-point meshes? For the bulk I use a
shifted 8x8x8 mesh; for the slab I use an 8 8 1 1 1 0 setting.
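For reference, the slab card in the input file (given with the
automatic option) reads:

  K_POINTS automatic
    8 8 1 1 1 0

i.e. an 8x8x1 mesh with an in-plane shift; the bulk card is the
analogous shifted 8x8x8 mesh.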
If you would like the input files to reproduce the problem, please let
me know.
With kind regards,
Markus Meinert
--
Dipl.-Phys. Markus Meinert
Thin Films and Physics of Nanostructures
Department of Physics
Bielefeld University
Universitätsstraße 25
33615 Bielefeld
Room D2-118
e-mail: meinert at physik.uni-bielefeld.de
Phone: +49 521 106 2661