[Pw_forum] openmp vs mpich performance with MKL 10.x
Axel Kohlmeyer
akohlmey at cmm.chem.upenn.edu
Wed May 7 23:47:43 CEST 2008
On Wed, 7 May 2008, Eduardo Ariel Menendez Proupin wrote:
EAM> Hi,
EAM> Please, find attached my best make.sys, to be run serially. Try this on your
EAM> system. My timings are close to yours. Below are the details.
ok, i tried running on my machine: with the intel wrapper i get a
wall time of 13m43s, and using a multi-threaded fftw3 i need a wall
time of 16m13s to complete the job. however, i have not yet added
the additional tunings that i added to CPMD and that finally made
fftw3 faster there.

in summary, it looks as if MPI is the winner on my hardware.
i would be interested to see whether you get different timings
with OpenMPI instead of MPICH.
EAM> However, it runs faster serially than using mpiexec -n 1.
[...]
EAM> >
EAM> > obviously, switching to the intel fft didn't help.
EAM>
EAM> FOR ME, IT HELPS ONLY WHEN RUNNING SERIAL.
in CPMD i found that using the multi-threaded fftw3 is actually
_even_ faster. you only need to add one function call to tell the
fftw3 planner that all future plans should be generated for
$OMP_NUM_THREADS threads.
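in fftw3's C interface that one call is fftw_plan_with_nthreads().
a minimal sketch of how it is used (the 64^3 grid is just a
placeholder, not a real QE mesh; compile with OpenMP enabled and
link with -lfftw3_threads -lfftw3 -lm):

    #include <fftw3.h>
    #include <omp.h>

    int main(void)
    {
        /* once, before creating any plan: start the threads layer and
           tell the planner that all future plans should use this many
           threads (omp_get_max_threads() honors OMP_NUM_THREADS) */
        fftw_init_threads();
        fftw_plan_with_nthreads(omp_get_max_threads());

        int n = 64;  /* placeholder grid size */
        fftw_complex *data = fftw_malloc(sizeof(fftw_complex) * n*n*n);
        fftw_plan p = fftw_plan_dft_3d(n, n, n, data, data,
                                       FFTW_FORWARD, FFTW_MEASURE);
        fftw_execute(p);   /* this 3d transform now runs multi-threaded */

        fftw_destroy_plan(p);
        fftw_free(data);
        fftw_cleanup_threads();
        return 0;
    }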
the fact that it helps in the serial code only is easy to understand
if you look at what QE's FFT modules do differently when running in
serial or in parallel.
if you run in serial, QE calls a 3d-FFT directly instead of a
sequence of 1d/2d-FFTs. with the 3d-FFT you have the chance to
parallelize in the same way as with MPI by using threads. if you
run in parallel, you already call many small 1d-FFTs, and those do
not parallelize well individually. instead, one would have to
distribute those calls across threads to get a similar gain, as in
the sketch below.
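a sketch of what that could look like in plain C with OpenMP (sizes
and layout are placeholders, not QE's actual data distribution): the
plan is created once, since planning is not thread-safe, and the
independent executes are spread over the threads via fftw3's
new-array execute interface, which may be called concurrently on
disjoint arrays:

    #include <stddef.h>
    #include <fftw3.h>

    /* run 'howmany' independent length-n 1d transforms, stored
       contiguously in 'data', distributed over the OpenMP threads */
    void batch_1d_ffts(fftw_complex *data, int n, int howmany)
    {
        /* plan once, on the first slice */
        fftw_plan p = fftw_plan_dft_1d(n, data, data,
                                       FFTW_FORWARD, FFTW_ESTIMATE);
        #pragma omp parallel for
        for (int i = 0; i < howmany; i++)
            fftw_execute_dft(p, data + (size_t)i * n,
                                data + (size_t)i * n);
        fftw_destroy_plan(p);
    }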
EAM> > your system with many states and only gamma point
EAM> > is definitely a case that benefits the most from
EAM> > multi-threaded BLAS/LAPACK.
EAM>
EAM> TYPICAL FOR BO MOLECULAR DYNAMICS.
EAM> I WOULD SAY, AVOID MIXING MPI AND OPENMP. ALSO AVOID INTEL FFTW WRAPPERS
EAM> WITH MPI, EVEN IF OMP_NUM_THREADS=1.
EAM> USE THREADED BLAS/LAPACK/FFTW2(3) FOR SERIAL RUNS.
i don't think this can be said in general, because your system is a
best-case scenario. in my experience, a serial executable is about
10% faster than a parallel one running a single task for plane-wave
pseudopotential calculations. the fact that you have a large system
with only the gamma point gives you the maximum benefit from
parallel LAPACK/BLAS and the multi-threaded FFT. however, if you
want to do BO-dynamics, i suspect you may lose that performance
advantage, since the wavefunction extrapolation will cut down the
number of SCF cycles needed, while at the same time the force
calculation is not multi-threaded at all.
to get a real benefit from a multi-core machine, additional OpenMP
directives would need to be added to the QE code. the fact that the
OpenMP-threaded libraries and the MPI parallelization are somewhat
comparable could indicate that there is some more room to improve
the MPI parallelization. luckily, for most QE users the first,
simple level of parallelization across k-points will apply and give
them a lot of speedup without much effort; only _then_ should
parallelization across the G-space, task groups, and finally
threads/libraries/OpenMP directives come into play.
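to make that hierarchy concrete, the first two levels are selected
on the pw.x command line, roughly like below (a sketch; -npool and
-ntg are the flag names in current QE versions, please check the
documentation of the version you use, and note that a gamma-only
job like yours has just one k-point, so pools won't help there):

    # level 1: distribute the 8 MPI tasks over 4 k-point pools
    mpiexec -n 8 pw.x -npool 4 < pwscf.in > pwscf.out
    # level 2: within each pool the G-space is split automatically;
    # task groups can help the parallel 3d-FFT scale further
    mpiexec -n 8 pw.x -npool 2 -ntg 2 < pwscf.in > pwscf.out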
cheers,
axel.
EAM>
EAM> ANYWAY, THE DIFFERENCE BETWEEN THE BEST MPI AND THE BEST OPENMP IS LESS THAN
EAM> 10% (11m30s vs 12m43s)
--
=======================================================================
Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.