[Pw_forum] Fwd: openmp vs mpich performance with MKL 10.x
Eduardo Ariel Menendez Proupin
eariel99 at gmail.com
Wed May 7 23:06:26 CEST 2008
---------- Forwarded message ----------
From: Eduardo Ariel Menendez Proupin <eariel99 at gmail.com>
Date: 2008/5/7
Subject: Re: [Pw_forum] openmp vs mpich performance with MKL 10.x
To: Axel Kohlmeyer <akohlmey at cmm.chem.upenn.edu>
Hi,
> with serial MKL, serial FFTW-2.1.5 and OpenMPI with 4 MPI tasks,
> i get a wall time of 12m12s and cpu time of 10m40s.
>
I GET:
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
DFLAGS = -D__INTEL -D__FFTW -D__USE_INTERNAL_FFTW -D__MPI -D__PARA
mpiexec -n 4 /home/emenendez/ChemUtils/Espresso/espresso4.0cvs3/bin4/pw.x <
cdteo0.2.md.in >> cdteo0.2.md.out
PWSCF : 12m33.73s CPU time, 12m43.31s wall time
> changing MKL to threaded MKL using 4 threads and 1 mpi task
> i get a wall time of 18m8s and cpu time of 28m30s
> (which means that roughly 40% of the time the code
> was running multi-threaded BLAS/LAPACK).
OMP_NUM_THREADS=4
MKL_NUM_THREADS=4
mpiexec -n 1 /home/emenendez/ChemUtils/Espresso/espresso4.0cvs3/bin4/pw.x <
cdteo0.2.md.in >> cdteo0.2.md.out
PWSCF : 27m40.02s CPU time, 17m 2.73s wall time
>
> with serial FFT, threaded MKL using 2 threads and 2 mpi tasks
> i get a wall time of 12m45s and cpu time of 14m42s
UNTESTED ON THIS MACHINE, BUT WORSE ON OTHERS.
>
> now when i swap the serial FFTW2 against the
> intel MKL FFTW2 wrapper i get with 2 threads and 2 MPI tasks
> a wall time of 15m2s and a cpu time of 24m11s.
UNTESTED ON THIS MACHINE, BUT WORSE ON OTHERS.
> and with 4 threads and 1 MPI task i get
> a wall time of 0h19m and a cpu time of 1h 2m
OMP_NUM_THREADS=4
MKL_NUM_THREADS=4
mpiexec -n 1 /home/emenendez/ChemUtils/Espresso/espresso4.0cvs3/bin5/pw.x <
cdteo0.2.md.in >> cdteo0.2.md.out
PWSCF : 58m50.42s CPU time, 17m55.90s wall time
> and finally when disabling threading and with
> 4 MPI tasks i get 12m38 wall time and 11m14s cpu time.
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
DFLAGS = -D__INTEL -D__FFTW -D__MPI -D__PARA (using fftw2_intel)
mpiexec -n 4 /home/emenendez/ChemUtils/Espresso/espresso4.0cvs3/bin5/pw.x <
cdteo0.2.md.in >> cdteo0.2.md.out
PWSCF : 13m 2.54s CPU time, 13m16.11s wall time
IT IS WORSE THAN USING THE INTERNAL FFTW.
HOWEVER, WHEN RUNNING SERIAL:
OMP_NUM_THREADS=4
MKL_NUM_THREADS=4
DFLAGS = -D__INTEL -D__FFTW
BLAS_LIBS = -lfftw2xf_intel -lmkl_em64t
/home/emenendez/ChemUtils/Espresso/espresso4.0cvs3/bin2/pw.x <
cdteo0.2.md.in >> cdteo0.2.md.out
PWSCF : 36m58.97s CPU time, 11m36.11s wall time
OMP_NUM_THREADS=4
MKL_NUM_THREADS=4
DFLAGS = -D__INTEL -D__FFTW3
BLAS_LIBS = -lfftw3xf_intel -lmkl_em64t
/home/emenendez/ChemUtils/Espresso/espresso4.0cvs3/bin3/pw.x <
cdteo0.2.md.in >> cdteo0.2.md.out
PWSCF : 36m44.64s CPU time, 11m29.59s wall time
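A SIDE NOTE ON THE WRAPPERS: -lfftw2xf_intel AND -lfftw3xf_intel ARE NOT
SHIPPED PREBUILT. IF I REMEMBER CORRECTLY, THEY ARE COMPILED FROM THE SOURCES
UNDER THE MKL interfaces/ DIRECTORY, SOMETHING LIKE THE FOLLOWING (A SKETCH
ASSUMING MKL 10.x ON EM64T; CHECK THE makefile THERE FOR THE EXACT TARGET
NAMES):
cd $MKLROOT/interfaces/fftw3xf
make libem64t
AND LIKEWISE UNDER interfaces/fftw2xf FOR THE FFTW2 WRAPPER.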
>
> obviously, switching to the intel fft didn't help.
FOR ME, IT HELPS ONLY WHEN RUNNING SERIAL.
>
>
> your system with many states and only gamma point
> is definitely a case that benefits the most from
> multi-threaded BLAS/LAPACK.
TYPICAL FOR BORN-OPPENHEIMER (BO) MOLECULAR DYNAMICS.
I WOULD SAY: AVOID MIXING MPI AND OPENMP. ALSO AVOID THE INTEL FFTW WRAPPERS
WITH MPI, EVEN IF OMP_NUM_THREADS=1.
USE THREADED BLAS/LAPACK/FFTW2(3) FOR SERIAL RUNS (SEE THE SKETCH BELOW).
IN ANY CASE, THE DIFFERENCE BETWEEN THE BEST OPENMP RUN AND THE BEST MPI RUN
IS LESS THAN 10% (11m30s vs 12m43s).
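FOR REFERENCE, A MINIMAL SKETCH OF THE TWO CONFIGURATIONS I WOULD CHOOSE
BETWEEN, ASSUMING THE SAME COMPILER AND MKL AS IN THE ATTACHED make.sys
(input AND output ARE GENERIC PLACEHOLDER NAMES):
BEST PURE MPI (SERIAL MKL, INTERNAL FFTW):
DFLAGS = -D__INTEL -D__FFTW -D__USE_INTERNAL_FFTW -D__MPI -D__PARA
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
mpiexec -n 4 pw.x < input > output
BEST SERIAL OPENMP (THREADED MKL + MKL FFTW3 WRAPPER):
DFLAGS = -D__INTEL -D__FFTW3
BLAS_LIBS = -lfftw3xf_intel -lmkl_em64t
OMP_NUM_THREADS=4
MKL_NUM_THREADS=4
pw.x < input > output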
>
>
> i'm curious to learn how these number match up
> with your performance measurements.
>
> cheers,
> axel.
>
>
>
--
Eduardo Menendez
Attachment: make.sys.txt
URL: http://www.democritos.it/pipermail/pw_forum/attachments/20080507/319e20ea/attachment.txt