[Pw_forum] How to use gamma-point calculations with high efficiency
Janos Kiss
janos.kiss at theochem.ruhr-uni-bochum.de
Sun Jun 29 13:40:51 CEST 2008
Dear Vega Lew,
>I compiled Q-E on my cluster with version 10.1.015 of the Intel compilers
>successfully and correctly. Now my cluster calculates
>very fast when doing the structure relaxation with 30-40 k-points. But
>on my cluster, which has 5 quad-core CPUs, I must
>use 20 pools to get the highest CPU usage (most of the time 90%+, but it's
>unstable; 70%+ on average was shown by the 'sar'
>command).
Does this mean that you have single-socket quad-core machines, and that you
have 5 nodes in your cluster? What is the interconnect between the machines?
I'm a novice user of PWSCF myself, but as far as I understand (it is actually
made relatively clear in the manual), if you do a calculation with k-points,
you have two ways of parallelizing: one over the G-space, and an additional
one over the k-points. Let me show with my own example how this can be
exploited:
I use dual-socket Xeon machines; each CPU has 4 cores (8 CPU
cores/machine). The nodes (machines) communicate via Gigabit LAN.
The mesh in the z direction of my supercell has 180 points.
If you look into your output file, at the beginning you will see something
like:
  Planes per process (thick) : nr3 = 180 npp = 30 ncplane = 6400

  Proc/  planes  cols     G     planes  cols     G     columns    G
  Pool    (dense grid)          (smooth grid)         (wavefct grid)
    1      30     711    83109    30     711    83109    189    11381
    2      30     711    83111    30     711    83111    189    11379
    3      30     711    83109    30     711    83109    190    11386
    4      30     711    83111    30     711    83111    189    11377
    5      30     712    83114    30     712    83114    189    11381
    6      30     711    83107    30     711    83107    189    11377
    0     180    4267   498661   180    4267   498661   1135    68281
This means that I use the parallelization over the G-space with 6 CPU cores
in each machine (180 planes divided by 6 cores), so each core handles
30 z-planes. The communication is more critical here than between k-points,
so you should check in your own output how many CPU cores per machine
are appropriate for your mesh.
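A quick worked example with my numbers (yours will differ): 180 planes
divided over 6 cores per pool gives 180/6 = 30 planes per core with no
remainder, whereas 8 cores per pool would give 180/8 = 22.5, so some cores
would get 23 planes and others 22, and the load would be slightly unbalanced.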
My supercell is relatively large, so with six k-points the binding energy
for my setup is well converged.
Therefore, I set npool=3. This means that I use the parallelization over the
k-points (over three separate machines, using 6 CPU cores from each machine
for the G-space parallelization). Gigabit LAN is acceptable for the
communication between the machines that k-point parallelization requires.
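To make this concrete, my invocation looks roughly like the sketch below
(the hostfile and input file names are placeholders, and your MPI launcher
may use different flags):

  export OMP_NUM_THREADS=1
  mpirun -np 18 -hostfile hosts.txt pw.x -npool 3 -input mysystem.in > mysystem.out

That is 18 MPI tasks in total, split by -npool 3 into three pools of 6 tasks
each, so the heavy G-space communication stays inside a machine and only the
lighter k-point communication crosses the Gigabit LAN.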
>Thanks to Axel's advice, I set the environment variable
>OMP_NUM_THREADS=1. The CPU usage on all 5 computers was always about the
>same. The calculation finishes fast.
>If I use 10 or 5 pools, the CPU usage can't reach that high. Is this up to
>snuff?
Now please have a look at what the right values should be for your setup,
both for G-space cores per machine and k-point pools across machines.
Hopefully your 'CPU usage' will then improve.
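If I read your setup correctly (5 machines with one quad-core CPU each,
20 cores in total; this is an assumption on my part), something along these
lines could be a starting point, provided the nr3 of your mesh is divisible
by 4:

  mpirun -np 20 pw.x -npool 5 -input relax.in > relax.out

That puts one pool of 4 cores on each machine, so the G-space traffic stays
inside a node; 'relax.in' is of course a placeholder for your input.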
>After testing the lattice optimizations, another question arises. I need to
>calculate the surface structure with the gamma point only, because the system
>is composed of ~80 atoms (scientists always do gamma-point-only optimization
>in my area of research).
This again depends on your system (and on the quantity you are interested
in). You have to keep in mind that calculations with proper k-point
sampling can be several times more expensive than a gamma-point-only
calculation.
If you get good enough results with the Gamma point only in a larger
supercell, then that is the happy case. If your system does not have a wide
band gap (metallic systems and so on), the Gamma-point-only results are just
far off, and you are forced to do k-point sampling.
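In the input file the difference is just the K_POINTS card; a minimal sketch
of the two variants (the 4x4x1 mesh is only an illustration, use whatever
converges for your slab):

  K_POINTS gamma

versus

  K_POINTS automatic
    4 4 1 0 0 0

The 'gamma' variant also lets pw.x use the specialized Gamma-only algorithms
(real wavefunctions, roughly half the memory and CPU time).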
>But when I calculate the surface structure with the
>gamma point only, I can't use many pools. Therefore the CPU usage for the
>gamma point calculation drops to about ~20% again. How could I
>calculate with a high CPU usage?
If you do Gamma point only, then you can exploit only the parallelization
over the G-space. If you want to use the G-space parallelization over the
whole cluster, then, as I mentioned, you need a fast interconnect (InfiniBand,
Myrinet, SCI, Quadrics and so on). Gigabit LAN is just too slow for that
(except for insanely large supercells). Even if you do have a fast
interconnect, you should still keep an eye on the right number of CPUs for
the mesh.
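So for your Gamma-only surface run, a sketch under my assumptions above (no
pools, all tasks doing G-space work, confined to one machine because of the
Gigabit LAN):

  mpirun -np 4 pw.x -npool 1 -input surface.in > surface.out

I would only spread the tasks over more machines if the memory of one node
is not enough, and even then check that the core count still divides your
nr3 reasonably.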
All the best,
Janos.
==================================================================
Janos Kiss e-mail: janos.kiss at theochem.ruhr-uni-bochum.de
Lehrstuhl fuer Theoretische Chemie Phone: +49 (0)234/32-26485
NC 03/297 +49 (0)234 32 26754
Ruhr-Universitaet Bochum Fax: +49 (0)234/32-14045
D-44780 Bochum http://www.theochem.ruhr-uni-bochum.de
==================================================================