[Pw_forum] openmp vs mpich performance with MKL 10.x
Nicola Marzari
marzari at MIT.EDU
Tue May 6 21:21:49 CEST 2008
Dear Eduardo,
our own experiences are summarized here:
http://quasiamore.mit.edu/pmwiki/index.php?n=Main.CP90Timings
It would be great if you could contribute your own data, either for
pw.x or cp.x under the conditions you describe.
I noticed indeed, informally, a few of the things you mention:
1) no improvement with the Intel fftw2 wrapper over the fftw2 sources
distributed with Q-E, when using MPI. I also never managed to run
successfully with the Intel fftw3 wrapper (or with fftw3 - that probably
says something about me).
2) great improvements for a serial code (different from Q-E) when using
the automatic parallelism of MKL on quad-cores.
3) btw, MPICH has always been the slower MPI implementation for us,
compared with LAM/MPI or OpenMPI
I actually wonder if the best solution on a quad-core would be, say,
to use two cores for MPI processes, and the other two for OpenMP threads.
I eagerly await Axel's opinion.
nicola
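
A minimal sketch of the two-by-two split suggested above, assuming a
quad-core node and an MPI build of pw.x linked against the threaded MKL.
The names NRANKS and NTHREADS are illustrative only, and whether this
layout beats pure MPI or pure threading has not been measured here. The
script only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical hybrid layout: a few MPI processes, each with several threads.
NCPU=4                        # cores on the node (assumption: one quad-core)
NRANKS=2                      # number of MPI processes (illustrative choice)
NTHREADS=$((NCPU / NRANKS))   # threads available to each MPI process

# Same variables as in the quoted message, set to the per-process thread count.
export OMP_NUM_THREADS=$NTHREADS
export MKL_NUM_THREADS=$NTHREADS

# Dry run: show the command instead of launching it.
CMD="mpiexec -n $NRANKS pw.x"
echo "would run: $CMD <input >output"
echo "with OMP_NUM_THREADS=$OMP_NUM_THREADS MKL_NUM_THREADS=$MKL_NUM_THREADS"
```

Changing NRANKS to 4 or 1 reproduces the pure-MPI and pure-threaded
settings from the quoted message, so the same script can cover all three
cases.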
Eduardo Ariel Menendez Proupin wrote:
> Hi,
> I have noticed recently that I can obtain faster binaries of pw.x
> using the OpenMP parallelism implemented in the Intel MKL libraries
> of version 10.xxx than using MPICH, on Intel CPUs. Previously I had
> always gotten better performance using MPI. I would like to know of
> others' experience on how to make the machines faster. Let me explain in
> more detail.
>
> Compiling using MPI means using mpif90 as linker and compiler, linking
> against mkl_ia32 or mkl_em64t, and using link flags -i-static -openmp.
> This is just what appears in make.sys after running configure
> in version 4cvs.
>
> At runtime, I set
> export OMP_NUM_THREADS=1
> export MKL_NUM_THREADS=1
> and run using
> mpiexec -n $NCPU pw.x <input >output
> where NCPU is the number of cores available in the system.
>
> The second choice is
> ./configure --disable-parallel
>
> and at runtime
> export OMP_NUM_THREADS=$NCPU
> export MKL_NUM_THREADS=$NCPU
> and run using
> pw.x <input >output
>
> I have tested it on quad-cores (NCPU=4) and on an old dual Xeon B.C.
> (before cores) (NCPU=2).
>
> Before April 2007, the first choice had always worked faster. After
> that, when I came to use MKL 10.xxx, the second choice has been
> faster. I have found no significant difference between version 3.2.3 and
> 4cvs.
>
> A special comment is for the FFT library. MKL has a wrapper for
> FFTW that must be compiled after installation (it is very easy). This
> creates additional libraries named like libfftw3xf_intel.a and
> libfftw2xf_intel.a.
> This improves the performance of the second choice, especially
> with libfftw3xf_intel.a.
>
> Using MPI, libfftw2xf_intel.a is as fast as using the FFTW source
> distributed with espresso, i.e., there is no gain in using
> libfftw2xf_intel.a. With libfftw3xf_intel.a and MPI, I have never been
> able to run pw.x successfully; it just aborts.
>
> I would like to hear of your experiences.
>
> Best regards
> Eduardo Menendez
> University of Chile
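
To collect comparable timings for the two choices described in the quoted
message, a small wrapper along these lines could be used. run_timed is a
hypothetical helper, not part of Q-E; it relies on `date +%s` (widely
available, whole-second resolution only), and the pw.x command lines in
the comments assume NCPU=4 and the appropriate build for each case.

```shell
#!/bin/sh
# run_timed LABEL COMMAND [ARGS...]
# Runs COMMAND, then prints its wall-clock time in seconds and exit status.
run_timed() {
    label=$1
    shift
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "$label: $((end - start)) s (exit status $status)"
    return $status
}

# Harmless demonstration with a stand-in command.
run_timed demo true

# Intended use (commented out because it needs pw.x and an input file):
# run_timed mpi env OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 \
#     sh -c 'mpiexec -n 4 pw.x <input >output'
# run_timed threads env OMP_NUM_THREADS=4 MKL_NUM_THREADS=4 \
#     sh -c 'pw.x <input >output'
```

For runs shorter than a few seconds, the time reported by pw.x itself at
the end of its output is a better figure than this whole-second estimate.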
--
---------------------------------------------------------------------
Prof Nicola Marzari Department of Materials Science and Engineering
13-5066 MIT 77 Massachusetts Avenue Cambridge MA 02139-4307 USA
tel 617.4522758 fax 2586534 marzari at mit.edu http://quasiamore.mit.edu