[Pw_forum] Fwd: openmp vs mpich performance with MKL 10.x
Eduardo Ariel Menendez Proupin
eariel99 at gmail.com
Wed May 7 23:06:26 CEST 2008
---------- Forwarded message ----------
From: Eduardo Ariel Menendez Proupin <eariel99 at gmail.com>
Date: 2008/5/7
Subject: Re: [Pw_forum] openmp vs mpich performance with MKL 10.x
To: Axel Kohlmeyer <akohlmey at cmm.chem.upenn.edu>
Hi,
> with serial MKL, serial FFTW-2.1.5 and OpenMPI with 4 MPI tasks,
> i get a wall time of 12m12s and cpu time of 10m40s.
>
I GET:
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
DFLAGS = -D__INTEL -D__FFTW -D__USE_INTERNAL_FFTW -D__MPI -D__PARA
mpiexec -n 4 /home/emenendez/ChemUtils/Espresso/espresso4.0cvs3/bin4/pw.x <
cdteo0.2.md.in >> cdteo0.2.md.out
PWSCF : 12m33.73s CPU time, 12m43.31s wall time
> changing MKL to threaded MKL using 4 threads and 1 mpi task
> i get a wall time of 18m8s and cpu time of 28m30s
> (which means that roughly 40% of the time the code
> was running multi-threaded BLAS/LAPACK).
OMP_NUM_THREADS=4
MKL_NUM_THREADS=4
mpiexec -n 1 /home/emenendez/ChemUtils/Espresso/espresso4.0cvs3/bin4/pw.x <
cdteo0.2.md.in >> cdteo0.2.md.out
PWSCF : 27m40.02s CPU time, 17m 2.73s wall time
>
> with serial FFT, threaded MKL using 2 threads and 2 mpi tasks
> i get a wall time of 12m45s and cpu time of 14m42s
UNTESTED ON THIS MACHINE, BUT WORSE ON OTHERS.
>
> now when i swap the serial FFTW2 against the
> intel MKL FFTW2 wrapper i get with 2 threads and 2 MPI tasks
> a wall time of 15m2s and a cpu time of 24m11s.
UNTESTED ON THIS MACHINE, BUT WORSE ON OTHERS.
> and with 4 threads and 1 MPI task i get
> a wall time of 0h19m and a cpu time of 1h 2m
OMP_NUM_THREADS=4
MKL_NUM_THREADS=4
mpiexec -n 1 /home/emenendez/ChemUtils/Espresso/espresso4.0cvs3/bin5/pw.x <
cdteo0.2.md.in >> cdteo0.2.md.out
PWSCF : 58m50.42s CPU time, 17m55.90s wall time
> and finally when disabling threading and with
> 4 MPI tasks i get 12m38 wall time and 11m14s cpu time.
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
DFLAGS = -D__INTEL -D__FFTW -D__MPI -D__PARA (using fftw2_intel)
mpiexec -n 4 /home/emenendez/ChemUtils/Espresso/espresso4.0cvs3/bin5/pw.x <
cdteo0.2.md.in >> cdteo0.2.md.out
PWSCF : 13m 2.54s CPU time, 13m16.11s wall time
IT IS WORSE THAN USING THE INTERNAL FFTW.
HOWEVER, WHEN RUNNING SERIAL:
OMP_NUM_THREADS=4
MKL_NUM_THREADS=4
DFLAGS = -D__INTEL -D__FFTW
BLAS_LIBS = -lfftw2xf_intel -lmkl_em64t
/home/emenendez/ChemUtils/Espresso/espresso4.0cvs3/bin2/pw.x <
cdteo0.2.md.in >> cdteo0.2.md.out
PWSCF : 36m58.97s CPU time, 11m36.11s wall time
OMP_NUM_THREADS=4
MKL_NUM_THREADS=4
DFLAGS = -D__INTEL -D__FFTW3
BLAS_LIBS = -lfftw3xf_intel -lmkl_em64t
/home/emenendez/ChemUtils/Espresso/espresso4.0cvs3/bin3/pw.x <
cdteo0.2.md.in >> cdteo0.2.md.out
PWSCF : 36m44.64s CPU time, 11m29.59s wall time
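A SIDE NOTE ON THE WRAPPERS: -lfftw2xf_intel AND -lfftw3xf_intel ARE NOT
SHIPPED PREBUILT. IF I REMEMBER CORRECTLY, THEY ARE COMPILED FROM THE SOURCES
UNDER THE MKL interfaces/ DIRECTORY, SOMETHING LIKE THE FOLLOWING (A SKETCH
ASSUMING MKL 10.x ON EM64T; CHECK THE makefile THERE FOR THE EXACT TARGET
NAMES):
cd $MKLROOT/interfaces/fftw3xf
make libem64t
AND LIKEWISE UNDER interfaces/fftw2xf FOR THE FFTW2 WRAPPER.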
>
> obviously, switching to the intel fft didn't help.
FOR ME, IT HELPS ONLY WHEN RUNNING SERIAL.
>
>
> your system with many states and only gamma point
> is definitely a case that benefits the most from
> multi-threaded BLAS/LAPACK.
TYPICAL FOR BORN-OPPENHEIMER (BO) MOLECULAR DYNAMICS.
I WOULD SAY: AVOID MIXING MPI AND OPENMP. ALSO AVOID THE INTEL FFTW WRAPPERS
WITH MPI, EVEN IF OMP_NUM_THREADS=1.
USE THREADED BLAS/LAPACK/FFTW2(3) FOR SERIAL RUNS (SEE THE SKETCH BELOW).
IN ANY CASE, THE DIFFERENCE BETWEEN THE BEST OPENMP RUN AND THE BEST MPI RUN
IS LESS THAN 10% (11m30s vs 12m43s).
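FOR REFERENCE, A MINIMAL SKETCH OF THE TWO CONFIGURATIONS I WOULD CHOOSE
BETWEEN, ASSUMING THE SAME COMPILER AND MKL AS IN THE ATTACHED make.sys
(input AND output ARE GENERIC PLACEHOLDER NAMES):
BEST PURE MPI (SERIAL MKL, INTERNAL FFTW):
DFLAGS = -D__INTEL -D__FFTW -D__USE_INTERNAL_FFTW -D__MPI -D__PARA
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
mpiexec -n 4 pw.x < input > output
BEST SERIAL OPENMP (THREADED MKL + MKL FFTW3 WRAPPER):
DFLAGS = -D__INTEL -D__FFTW3
BLAS_LIBS = -lfftw3xf_intel -lmkl_em64t
OMP_NUM_THREADS=4
MKL_NUM_THREADS=4
pw.x < input > output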
>
>
> i'm curious to learn how these number match up
> with your performance measurements.
>
> cheers,
> axel.
>
>
>
--
Eduardo Menendez
Attachment: make.sys.txt
URL: http://www.democritos.it/pipermail/pw_forum/attachments/20080507/319e20ea/attachment.txt