[Pw_forum] Re: Woodcrest vs Opteron performance in pwscf calc.

Wed Aug 2 14:07:34 CEST 2006

On Wed, 2 Aug 2006, Alexander Shaposhnikov wrote:

hi,

AS> Thanks for the answer. 
AS> I don't think this topic is relevant to the PW_forum goals. 

actually, it is. some people here spend a lot of money 
on new machines and figuring out what is the best deal
for a specific application needs a lot of testing. so
every contribution is important.

[...]
AS> time to move on to the new platform. But my experience with the prev.
AS> generation of Intel processors showed, that Opteron is faster in most
AS> cases, especially when it comes to multi-threaded calculations.

really?? are you talking about OpenMP multi-threaded or explicit
multi-threaded? in my experience so far, OpenMP on an opteron system
was a serious letdown, and it was usually much better to use 
MPI parallelism, even within the nodes.

[...]

AS> > It seems that woodcrest and dempsey are much faster than opteron. The
AS> > scalability of dempsey is the best, woodcrest is the worst. Despite the
AS> > amazing performance per core of woodcrest, it drops to the same level as
AS> > its predecessor, dempsey, when taking the machine as a unit to evaluate
AS> > its performance.

one thing to check when using intel MKL is whether it is running in
multi-threaded mode and thus getting better results on a 'half-loaded'
machine. to test that, you may want to re-run the jobs with the
environment variable OMP_NUM_THREADS set to 1. secondly, memory
contention is a problem, so it would be interesting to see the
performance when you run 4 serial jobs at the same time.
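
a minimal sketch of the two checks above, assuming python is available on
the test machine. the helper name run_concurrent and the placeholder
workload are made up for illustration; substitute the real serial benchmark
command. it starts N copies of a command at once with OMP_NUM_THREADS fixed,
so the wall time of 1 job can be compared with that of 4 simultaneous jobs.

```python
import os
import subprocess
import sys
import time


def run_concurrent(cmd, n_jobs, omp_threads=1):
    """Start n_jobs copies of cmd at once; return (wall_seconds, return_codes).

    run_concurrent is a hypothetical helper, not part of any package.
    """
    env = dict(os.environ)
    # Fix the OpenMP thread count so a threaded BLAS cannot quietly use idle cores.
    env["OMP_NUM_THREADS"] = str(omp_threads)
    start = time.perf_counter()
    procs = [
        subprocess.Popen(cmd, env=env, stdout=subprocess.DEVNULL)
        for _ in range(n_jobs)
    ]
    # Wait for every copy; the elapsed time is that of the slowest one.
    codes = [p.wait() for p in procs]
    return time.perf_counter() - start, codes


if __name__ == "__main__":
    # Placeholder workload (assumption): replace with the real serial job.
    cmd = [sys.executable, "-c", "sum(i * i for i in range(10**6))"]
    t1, _ = run_concurrent(cmd, 1)
    t4, _ = run_concurrent(cmd, 4)
    print(f"1 job: {t1:.2f}s  4 jobs: {t4:.2f}s  slowdown: {t4 / t1:.2f}x")
```

if the 4-job wall time is clearly larger than the 1-job time on a machine
with 4 cores, the jobs are most likely competing for memory bandwidth
rather than for cpu.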

AS> > But remember one thing: the number for opteron may not be fair. I compiled
AS> > the program using Intel fortran, Intel MPI 2.0. However, I once used both
AS> > Intel and PathScale to compile FFTW and its test cases on an opteron
AS> > machine, and I didn't find any impressive differences.

the intel compiler usually does a good job on opteron, especially
since floating point intensive jobs benefit little, if at all,
from the automated vectorization with SSE. the best performance
on opteron and P4 usually comes with '-O2 -tpp6 -unroll',
not using any vectorization. it is a different story, however,
when it comes to BLAS/LAPACK: using ACML > 2.7 is essential to
get good performance on dual-core opteron machines.

there is a way to make (the gcc) ACML compatible with the intel
compiler (at least for packages that use only double precision
functions), see:
https://www.liniac.upenn.edu/wiki/tiki-index.php?page=acml+for+CMM

it would be nice to see how using ACML would affect
the performance in this case.

best regards,
    axel.

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.