<html>

<head>

<style>

.hmmessage P

{

margin:0px;

padding:0px

}

body.hmmessage

{

FONT-SIZE: 10pt;

FONT-FAMILY:Tahoma

}

</style>

</head>

<body class='hmmessage'>

Dear Axel,<br><br>First, I want to thank you for your kind reply. I learned a lot from it.<br><br>I typed 'uname -a' and it displayed 'Linux node5 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux'. So it is an x86_64 OS.<br><br>&gt; VL&gt; Why I updated my MKL and compilers brings me less efficiency?<br>&gt; <br>&gt; that has most likely other reasons (runaway processes?, other users?)<br><br>I'm sure there are no other users; I'm the admin of that cluster.<br><br>Have you tested the Intel C++ and Fortran 10.1.017 compilers for Intel 64 and MKL 10.0.3.020? Do you think they are really suitable for my cluster and QE? Do you think they are better than the 10.1.008 compilers and MKL 10.0.011?<br><br>&gt; VL&gt; At last, I compile the QE using amd64 architecture schedule by intel <br>&gt; VL&gt; C++ and Fortran 10.1.017 vision and MKL 10.0.3.020 library, but I <br>&gt; VL&gt; find it less efficienct the the QE compiled by intel C++ and Fortran <br>&gt; VL&gt; 10.1.008 vision and 10.0.011 library. The efficiency of QE compiled <br>&gt; VL&gt; by 10.1.008 compiler and 10.0.011 is about 60% but the QE compiled <br>&gt; VL&gt; by 10.1.017compiler is 10% tested by input file like this:<br>&gt; <br>&gt; how do you determine this 'efficiency'? how do you run your job?<br><br>I run the input file with the MPI command 'mpiexec -n 20 pw.x &lt; inputfile &gt; outputfile', and the job starts immediately. 
The input file is like this:<br><pre> &amp;CONTROL<br>                       title = 'Anatase lattice BFGS' ,<br>                 calculation = 'vc-relax' ,<br>                restart_mode = 'from_scratch' ,<br>                      outdir = '/home/vega/tmp/' ,<br>                  pseudo_dir = '/home/vega/espresso-4.0/pseudo/' ,<br>                      prefix = 'Anatase lattice default' ,<br>               etot_conv_thr = 0.000000735 ,<br>               forc_conv_thr = 0.0011668141375 ,<br>                       nstep = 1000 ,<br> /<br> &amp;SYSTEM<br>                       ibrav = 6,<br>                   celldm(1) = 7.135605333,<br>                   celldm(3) = 2.5121822033898305084745762711864,<br>                         nat = 12,<br>                        ntyp = 2,<br>                     ecutwfc = 25 ,<br>                     ecutrho = 200 ,<br> /<br> &amp;ELECTRONS<br>                    conv_thr = 7.3D-8 ,<br> /<br> &amp;IONS<br>                ion_dynamics = 'bfgs' ,<br> /<br> &amp;CELL<br>               cell_dynamics = 'bfgs' ,<br>                 cell_dofree = 'xyz' ,<br> /<br>ATOMIC_SPECIES<br>   Ti   47.86700  Ti.pw91-sp-van_ak.UPF <br>    O   15.99940  O.pw91-van_ak.UPF <br>ATOMIC_POSITIONS angstrom <br>   Ti      0.000000000    0.000000000    0.000000000    <br>   Ti      1.888000000    1.888000000    4.743000000    <br>   Ti      0.000000000    1.888000000    2.372000000    <br>   Ti      1.888000000    0.000000000    7.115000000    <br>    O      0.000000000    0.000000000    1.973000000    <br>    O      1.888000000    1.888000000    6.716000000    <br>    O      0.000000000    1.888000000    4.345000000    <br>    O      1.888000000    0.000000000    9.088000000    <br>    O      1.888000000    0.000000000    5.141000000    <br>    O      0.000000000    1.888000000    0.398000000    <br>    O      1.888000000    1.888000000    2.770000000    <br>    O      0.000000000    0.000000000    7.513000000    <br>K_POINTS automatic <br>  7 7 3   1 1 1 
<span style="font-family: Tahoma,Helvetica,Sans-Serif;"><br><br></span>And I see 4 processes on each node with the 'top' command, so I think each core has one process. <br>Is there anything I am misunderstanding? <br><br>I thought I should also use the MPI command like this: 'mpiexec -n 20 pw.x -npool 5 &lt; inputfile &gt; outputfile'<br>I have tried <span style="font-family: Tahoma,Helvetica,Sans-Serif;">it, but there was no obvious improvement: the CPU usage reported by the 'sar' command is still about 60%, with 7% for the system and the rest idle.<br>&nbsp;<span style="font-family: monospace;"></span></span><br>&gt; it is much better to compiler fftw with the native gcc compiler.<br><br>&gt; but since QE actually contains FFTW there is no need to install<br><br>&gt; or compile it.<br><br>QE contains FFTW? Where is it? Should it be detected by the QE configure?<br>If I compile FFTW with gcc, or don't compile FFTW at all, the QE configure can't detect it and shows no information<br>about FFTW during the configure process, even though I have added the directories to the environment variables. So I think<br>only the FFTW compiled by ifort can be found by my QE.<br><br>Do you think Intel MKL is needed? When I configured QE without Intel MKL, the QE configure could still find <br>the BLAS and LAPACK in its own folder.<br><br>You mentioned OMP_NUM_THREADS again. I'm sorry, I know little about it.<br>Should I use an export command like 'export OMP_NUM_THREADS=1'?<br>If that command is enough, could you please tell me when I should type it: before configuring QE,<br>or before running the mpiexec command? <br><br>And could you give me some hints about optimizing the anatase lattice with the BFGS scheme?<br>Why does 'cell_dofree = 'xyz'' in the &amp;CELL section fail to prevent the lattice angles from changing, <br>and result in a 'not orthogonal operation' error?<br>Do you think I should never use BFGS to optimize the anatase lattice? 
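
For reference, a minimal sketch of the launch sequence discussed above, assuming a bash-like shell.
OMP_NUM_THREADS is a run-time setting that is read when the program starts, so it would be set
before the mpiexec command rather than before configure. NPROC, NPOOL and CMD are illustrative
names only. Whether the variable also reaches the processes on the other nodes depends on the MPI
launcher, and that is not verified here.

```shell
# Sketch only: NPROC, NPOOL and CMD are illustrative names, not part of QE or MPICH2.
# OMP_NUM_THREADS is read at program start, so set it before launching the job.
export OMP_NUM_THREADS=1

NPROC=20   # total MPI processes: 5 nodes x 4 cores, as in the run above
NPOOL=5    # 5 k-point pools of 4 processes each; whether a pool stays on one
           # node depends on how the launcher places the processes

# Build the same command line as above and print it (dry run, nothing is launched).
CMD="mpiexec -n $NPROC pw.x -npool $NPOOL"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
echo "$CMD"
```

The printed line can then be run with the same input and output redirection as in the command above.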
<br><br>Thank you again for your detailed reply.<br></pre><br>Vega Lew<br>Ph.D. Candidate in Chemical Engineering<br>College of Chemistry and Chemical Engineering<br>Nanjing University of Technology, 210009, Nanjing, Jiangsu, China<br><br>&gt; Date: Wed, 25 Jun 2008 23:50:38 -0400<br>&gt; From: akohlmey@cmm.chem.upenn.edu<br>&gt; To: vegalew@hotmail.com<br>&gt; CC: pw_forum@pwscf.org<br>&gt; Subject: Re: [Pw_forum] questions about intel CPU and vc-relax using bfgs cell optimization<br>&gt; <br>&gt; On Thu, 26 Jun 2008, vega lew wrote:<br>&gt; <br>&gt; VL&gt; <br>&gt; VL&gt; Dear all,<br>&gt; VL&gt; <br>&gt; <br>&gt; VL&gt; I built a cluster of 5 computers with intel Core TM 2 Q6600 CPU <br>&gt; VL&gt; (quadcore), and 40G memory total (8G each) on S3000AH system board. <br>&gt; VL&gt; The network is 1Gbit Ethernet. I also checked the em64t option in <br>&gt; VL&gt; BIOS is on, so I think Q6600 is a cpu using em64t technology. For <br>&gt; VL&gt; more information about my CPU, see <br>&gt; <br>&gt; more important is to determine whether you installed a 32-bit or<br>&gt; a 64-bit version of the OS. you can find that out with 'uname -a'.<br>&gt; for 32-bit you get (amongst others) i386 and i686 whereas for <br>&gt; 64-bit you get x86_64. regardless of bios options or what cpuinfo<br>&gt; shows, the cpu can handle both.<br>&gt; <br>&gt; [...]<br>&gt; <br>&gt; VL&gt; Therefore, I updated my intel C++ and Fortran Compilers from <br>&gt; VL&gt; 10.1.008 to latest vision 10.1.017 for Intel(R) 64 and MKL from <br>&gt; VL&gt; 10.0.011 to latest 10.0.3.020, file names displayed on website were <br>&gt; VL&gt; l_cc_p_10.1.017_intel64.tar.gz, l_cc_p_10.1.017_intel64.tar.gz and <br>&gt; VL&gt; l_mkl_p_10.0.3.020.tgz. 
After installation of the three, I compiled <br>&gt; VL&gt; for em64t vision of blas95 lapack95 in <br>&gt; <br>&gt; those are not needed.<br>&gt; <br>&gt; VL&gt; /opt/intel/mkl/10.0.3.020/interfaces/ using ifort under <br>&gt; VL&gt; /opt/intel/fce/10.1.017/bin/. Then compiled mpich2 using ifort and <br>&gt; VL&gt; icc. But when I compile fftw 2.1.5 an error occurred, so I compile <br>&gt; <br>&gt; it is much better to compiler fftw with the native gcc compiler.<br>&gt; but since QE actually contains FFTW there is no need to install<br>&gt; or compile it.<br>&gt; <br>&gt; VL&gt; the fftw 2.1.5 using 10.1.008 ifort and icc on other node with same <br>&gt; VL&gt; hardware, the scp it to master node. After all above done, I turned <br>&gt; VL&gt; to compile QE.<br>&gt; <br>&gt; VL&gt; But to my surprise, QE detected my architecture as amd64, not ia32 <br>&gt; VL&gt; or ia64. My first question is does QE support the intel EM64T <br>&gt; VL&gt; technology and take advantages from it ?<br>&gt; <br>&gt; it is neither ia32 nor ia64. amd actually invented this 64-bit mode<br>&gt; and then intel named it EM64t (to avoid having to call it amd64).<br>&gt; the official linux architecture is x86_64.<br>&gt; <br>&gt; VL&gt; At last, I compile the QE using amd64 architecture schedule by intel <br>&gt; VL&gt; C++ and Fortran 10.1.017 vision and MKL 10.0.3.020 library, but I <br>&gt; VL&gt; find it less efficienct the the QE compiled by intel C++ and Fortran <br>&gt; VL&gt; 10.1.008 vision and 10.0.011 library. The efficiency of QE compiled <br>&gt; VL&gt; by 10.1.008 compiler and 10.0.011 is about 60% but the QE compiled <br>&gt; VL&gt; by 10.1.017compiler is 10% tested by input file like this:<br>&gt; <br>&gt; how do you determine this 'efficiency'? how do you run your job?<br>&gt; <br>&gt; since MKL will automatically multi-thread across all cores,<br>&gt; you have to set the environment variable OMP_NUM_THREADS to 1 <br>&gt; or else you'll be oversubscribing each cpu 4x. 
secondly, for <br>&gt; efficient operation across gigabit ethernet (which is quite slow<br>&gt; and has very high latencies), you have to parallelize across <br>&gt; k-point pools, at least between nodes or else your performance<br>&gt; will be horrible. if you have not taken care of mkl multithreading<br>&gt; all performance data will be bogus.<br>&gt; <br>&gt; <br>&gt; VL&gt; My second question is about the efficiency: <br>&gt; VL&gt; Which compiler and MKL vision is the best one for my cluster?<br>&gt; <br>&gt; the one that runs corectly. the performance difference between<br>&gt; a different optimized implementations of BLAS are on average<br>&gt; of the order of 10% of the total time. compiler impact (e.g.<br>&gt; between g95/gfortran and intel 10) is of the same order.<br>&gt; <br>&gt; VL&gt; Why I updated my MKL and compilers brings me less efficiency?<br>&gt; <br>&gt; that has most likely other reasons (runaway processes?, other users?)<br>&gt; <br>&gt; VL&gt; What is the best efficiency of my cluster can reach ? 
60% is low or high for QE?<br>&gt; <br>&gt; you should be able to do better, if you run your job the right way.<br>&gt; please check the documentation on how to run QE properly in parallel.<br>&gt; <br>&gt; to determine the performance baseline, you should first run a test<br>&gt; with only one MPI task and set OMP_NUM_THREADS to 1 (read the intel<br>&gt; docs about this).<br>&gt; <br>&gt; cheers,<br>&gt;    axel.<br>&gt; <br>&gt; <br>&gt; -- <br>&gt; =======================================================================<br>&gt; Axel Kohlmeyer   akohlmey@cmm.chem.upenn.edu   http://www.cmm.upenn.edu<br>&gt;    Center for Molecular Modeling   --   University of Pennsylvania<br>&gt; Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323<br>&gt; tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425<br>&gt; =======================================================================<br>&gt; If you make something idiot-proof, the universe creates a better idiot.<br></body>

</html>