[Pw_forum] Re: Woodcrest vs Opteron performance in pwscf calc.
Axel Kohlmeyer
akohlmey at vitae.cmm.upenn.edu
Wed Aug 2 14:07:34 CEST 2006
On Wed, 2 Aug 2006, Alexander Shaposhnikov wrote:
hi,
AS> Thanks for the answer.
AS> I don't think this topic is relevant to the PW_forum goals.
actually, it is. some people here spend a lot of money
on new machines and figuring out what is the best deal
for a specific application needs a lot of testing. so
every contribution is important.
[...]
AS> time to move on to the new platform. But my experience with the prev.
AS> generation of Intel processors showed, that Opteron is faster in most
AS> cases, especially when it comes to multi-threaded calculations.
really?? are you talking about OpenMP multi-threaded or explicit
multi-threaded? in my experience so far, OpenMP on an opteron system
was a serious letdown, and it was usually much better to use
MPI parallelism, even within the nodes.
[...]
AS> > It seems that woodcrest and dempsey are much faster than opteron. The
AS> > scalability of dempsey is the best, woodcrest is the worst. Despite the
AS> > amazing performance per core of woodcrest, it drops to the same level as
AS> > its predecessor, dempsey, when taking the machine as a unit to evaluate
AS> > its performance.
one thing to check when using intel MKL is whether it is running in
multi-threaded mode and thus getting better results on a 'half-loaded'
machine. for that, you may want to re-run the jobs with the
environment variable OMP_NUM_THREADS set to 1. secondly, memory
contention is a problem, so it would be interesting to see the
performance if you run 4 serial jobs at the same time.
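
a minimal sketch of how the two checks above could be scripted, in
python. it starts several copies of a serial job at once with
OMP_NUM_THREADS fixed to 1 and reports the wall time. the function name
run_concurrent and the placeholder workload are made up for
illustration; substitute the real serial benchmark command.

```python
import os
import subprocess
import sys
import time


def run_concurrent(cmd, n_jobs, omp_threads=1):
    """Start n_jobs copies of cmd at once; return (wall seconds, exit codes)."""
    env = dict(os.environ)
    # Fix the thread count so a threaded BLAS such as MKL cannot
    # silently use idle cores on a partly loaded machine.
    env["OMP_NUM_THREADS"] = str(omp_threads)
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd, env=env) for _ in range(n_jobs)]
    codes = [p.wait() for p in procs]
    return time.perf_counter() - start, codes


if __name__ == "__main__":
    # Placeholder workload (an assumption, not the real benchmark).
    cmd = [sys.executable, "-c", "sum(i * i for i in range(10**6))"]
    for n in (1, 2, 4):
        elapsed, codes = run_concurrent(cmd, n)
        print(f"{n} concurrent job(s): {elapsed:.2f} s, exit codes {codes}")
```

if the wall time for 4 concurrent jobs is clearly longer than for 1,
the jobs are probably competing for memory bandwidth rather than for
cores.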
AS> > But remember one thing: the number for opteron may not be fair. I compiled
AS> > the program using Intel fortran, Intel MPI 2.0. However, I have used both
AS> > Intel and PathScale to compile FFTW and its test cases on an opteron
AS> > machine, and I didn't find any impressive differences.
the intel compiler usually does a good job on opteron, especially
since floating point intensive jobs don't benefit a lot, if at
all, from using the automated vectorization with SSE. one usually gets
the best performance on opteron and P4 with '-O2 -tpp6 -unroll',
not using any vectorization. that however is a different story
when it comes to BLAS/LAPACK: using ACML > 2.7 is essential to
get good performance on dual-core opteron machines.
there is a way to make (the gcc) ACML compatible with the intel
compiler (at least for packages that use only double precision
functions), see:
https://www.liniac.upenn.edu/wiki/tiki-index.php?page=acml+for+CMM
it would be nice to see how using ACML would affect
the performance in this case.
best regards,
axel.
--
=======================================================================
Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.