No subject
Tue Feb 17 11:48:01 CET 2009
Hi, I found in the new, or maybe not so new, users guide that 1000 atoms
or so can be calculated, and new ways to parallelize. The example in the
manual is

  mpirun -np 4096 ./pw.x -nimage 8 -npool 2 -ntg 8 -ndiag 144 -input myinput.in
I have played a bit, but not with a massive computer, and I have found
that the default options are always better than my inexpert choices. So I
would like to see some hints, in addition to what is reproduced below (from
the users guide), about good choices of -ntg and -ndiag. Maybe a few
examples would be enough to understand it.
From the users guide:

This executes the PWscf code on 4096 processors, to simulate a system with
8 images, each of which is distributed across 512 processors. K-points are
distributed across 2 pools of 256 processors each, 3D FFT is performed
using 8 task groups (64 processors each, so the 3D real-space grid is cut
into 64 slices), and the diagonalization of the subspace Hamiltonian is
distributed to a square grid of 144 processors (12x12).
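Just to spell out the arithmetic behind those numbers, here is a quick
back-of-the-envelope check (plain shell arithmetic on the values from the
example above, not anything printed by pw.x itself):

  np=4096; nimage=8; npool=2
  echo "processes per image: $(( np / nimage ))"           # 4096 / 8 = 512
  echo "processes per pool : $(( np / nimage / npool ))"   # 512 / 2 = 256
  echo "ortho grid         : 12 x 12 = $(( 12 * 12 ))"     # matches -ndiag 144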
Default values are: -nimage 1 -npool 1 -ntg 1 ; ndiag is chosen by the
code as the fastest n^2 (n integer) that fits into the size of each pool.
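If "fits into the size of each pool" simply means n^2 <= pool size (my
reading of that sentence, nothing I have checked in the source), the
largest candidate for the 256-processor pools of the example could be
found like this:

  pool=256                                    # processors per pool in the example
  n=1
  while [ $(( (n+1)*(n+1) )) -le $pool ]; do n=$((n+1)); done
  echo "largest n^2 fitting the pool: $(( n*n )) (n = $n)"   # 256 (n = 16)

Of course the guide says the code picks the fastest such n^2, which need
not be the largest one.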
*Massively parallel calculations*: For very large jobs (i.e. O(1000) atoms
or so) or for very long jobs to be run on massively parallel machines (e.g.
IBM BlueGene) it is crucial to use in an effective way both the "task
group" and the "ortho group" parallelization. Without a judicious choice of
parameters, large jobs will find a stumbling block in either memory or CPU
requirements. In particular, the "ortho group" parallelization is used in
the diagonalization of matrices in the subspace of Kohn-Sham states (whose
dimension is as a strict minimum equal to the number of occupied states).
These are stored as block-distributed matrices (distributed across
processors) and diagonalized using custom-tailored diagonalization
algorithms that work on block-distributed matrices.
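To get a feel for the sizes involved: if the subspace matrix had, say,
nbnd = 2000 rows (a number I made up for illustration, not from any real
input) and were block-distributed over the 12x12 ortho grid of the
example, each processor would hold a block of roughly 166 x 166 elements,
which is what drives the per-process memory for this step:

  nbnd=2000   # hypothetical number of Kohn-Sham states, illustration only
  grid=12     # 12 x 12 ortho grid, i.e. -ndiag 144
  echo "local block is roughly $(( nbnd / grid )) x $(( nbnd / grid ))"   # 166 x 166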
Thanks

--
Eduardo Menendez
Departamento de Fisica
Facultad de Ciencias
Universidad de Chile
Phone: (56)(2)9787439
URL: http://fisica.ciencias.uchile.cl/~emenendez