<br><div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
there are two issues that need to be considered.<br>
<br>
1) how large are your test jobs? if they are not large enough, timings are pointless.</blockquote><div> </div><div>About 15 minutes on an Intel quad-core. 66 atoms: Cd_30Te_30O_6, 576 electrons in total. <br>My test may be very particular. If you have a balanced benchmark, I would like to run it.<br>
<br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>
2) it is most likely, that you are still tricked by the<br>
auto-parallelization of intel MKL. the export OMP_NUM_THREADS<br>
will usually only work for the _local_ copy, for some<br>
MPI startup mechanisms not at all. thus your MPI jobs will<br>
be slowed down.</blockquote><div>I am using only SMP. Sorry, I don't have a cluster of quad-cores yet. <br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<br>
<br>
to make certain that you only link the serial version of<br>
MKL with your MPI executable, please replace -lmkl_em64t<br>
in your make.sys file with<br>
-lmkl_intel_lp64 -lmkl_sequential -lmkl_core</blockquote><div><br>Yes, I also tried that. With those libraries the test runs in 14m2s. Using only -lmkl_em64t, it runs in 14m31s. Using a serial compilation, it ran in 12m20s.<br><br><br><br></div>
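<div>For reference, here is a minimal sketch of the link-line change suggested above. It assumes a POSIX shell with sed, and that the flag appears literally as -lmkl_em64t in make.sys. The helper name use_sequential_mkl is hypothetical, and the OMP_NUM_THREADS export only affects the shell it is run in, which is exactly the limitation mentioned in the quoted text.<br></div>
<pre>
```shell
# Hypothetical helper: replace the threaded MKL flag with the sequential
# MKL libraries in the given make.sys file. A backup is kept as FILE.bak.
use_sequential_mkl() {
    sed -i.bak 's/-lmkl_em64t/-lmkl_intel_lp64 -lmkl_sequential -lmkl_core/g' "$1"
}

# Limit MKL/OpenMP to one thread in this shell only; processes started by
# some MPI launchers may not inherit this setting.
export OMP_NUM_THREADS=1
```
</pre>
<div>Usage would be: use_sequential_mkl make.sys, followed by a full rebuild so that the new libraries are actually linked.<br></div>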
Thanks,<br>Eduardo<br><br><br></div>