[Pw_forum] QE and mpich2, Linux

ac.rain at inbox.com ac.rain at inbox.com
Wed Dec 8 08:15:51 CET 2010


hi all,

I am having trouble getting MPICH2 working with Quantum Espresso. It seems to work to some extent, but with errors.

There may be an issue with my environment setup. Some of it is guesswork, as I couldn't find much documentation on using espresso with mpich2.

server4 is the system the user will submit from. server3 and server5 are extra nodes on the same subnet. mpich2 was built with gfortran support, and ssh keys are set up with no message of the day (MOTD), so the nodes can talk to each other properly.

The espresso directory /usr/local/espresso-4.2.1 is located on server4 and NFS shared out to server3 and server5 in read/write mode. (When I initially had NFS set to read-only, espresso was unable to write a temp file to the tests directory, so I made it a read-write NFS export.) This means all 3 systems have read-write access to the same espresso directory. Is this correct?
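A minimal sketch of how the read-write export could be checked on each node. The helper name can_write_dir is hypothetical (it is not part of espresso or MPICH2); it only creates and removes a uniquely named probe file, so it says nothing about whether several processes can safely share the directory.

```shell
#!/bin/sh
# can_write_dir: hypothetical helper, not part of espresso or MPICH2.
# Creates and removes a probe file named after this host and PID, and
# returns 0 only if both steps succeed.
can_write_dir() {
    dir=$1
    probe="$dir/.write_probe.$(hostname).$$"
    # The subshell keeps a failed redirection from aborting the caller.
    if ( : > "$probe" ) 2>/dev/null && rm -f "$probe"; then
        echo "writable: $dir"
        return 0
    fi
    echo "NOT writable: $dir"
    return 1
}
```

Running `can_write_dir /usr/local/espresso-4.2.1/tests` and `can_write_dir /usr/local/espresso-4.2.1/tmp` in a shell on each of server3, server4 and server5 would show whether every node can write to the shared directories.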

Extra info: inside the examples/environment_variables file I have the line "TMP_DIR=/home/mpiexec_espresso_tmp", however no files were written there (each system has its own instance of that dir). It seems to just use /usr/local/espresso-4.2.1/tmp, as I can see files there with today's date. This is fine with me if it prefers to use the NFS partition. I tried changing PARA_PREFIX and PARA_POSTFIX inside the variables file, but I only ran into worse issues.

here is my test command...

mpiexec -f ~/mpiMachinefile.txt -n 10 -wdir /usr/local/espresso-4.2.1/tests ./check-pw.x.j

here are the contents of my ~/mpiMachinefile.txt file...
server3:24
server4:24
server5:24

I have also tried with -n 30, and it seemed to put most of the processes on the first server in the list, as expected. However, I never saw 10 or more processes each using 100% of a core in 'top'. When there were many processes at once, they were only using a small percentage of resources, with one of the processes using 100% of a core/cpu. From my understanding there should be up to 24 different processes per machine at 100%, depending on what it's doing.
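To make the expected placement concrete, here is a rough sketch that estimates how many ranks land on each host for a given -n. It ASSUMES the launcher fills hosts in machinefile order up to each host:slots count and wraps round to the first host once all slots are used; I have not confirmed that this is exactly what mpiexec does. place_ranks is a hypothetical helper name.

```shell
#!/bin/sh
# place_ranks: hypothetical helper, not part of MPICH2.
# Estimates ranks per host for a "host:slots" machinefile, ASSUMING hosts
# are filled in file order and the list wraps round when slots run out.
place_ranks() {
    machinefile=$1
    nprocs=$2
    awk -F: -v n="$nprocs" '
        /^[ \t]*#/ { next }              # skip comment lines
        $1 != "" {
            host[++count] = $1
            # A missing or non-positive slot count is treated as 1.
            slots[count] = ($2 + 0 > 0) ? $2 + 0 : 1
        }
        END {
            if (count == 0) exit 1
            left = n + 0
            while (left > 0) {
                for (i = 1; i <= count && left > 0; i++) {
                    take = (left < slots[i]) ? left : slots[i]
                    ranks[i] += take
                    left -= take
                }
            }
            for (i = 1; i <= count; i++)
                printf "%s %d\n", host[i], ranks[i]
        }
    ' "$machinefile"
}
```

Under that assumption, `place_ranks ~/mpiMachinefile.txt 10` puts all 10 ranks on server3, and `place_ranks ~/mpiMachinefile.txt 30` puts 24 on server3 and 6 on server4. That would match what I saw with -n 30, and it means server5 is only reached at -n 49 or more.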

In summary: the errors are troubling, and I don't think the system(s) are using their full potential for simulating.

Below is the first part of the command-line output from mpiexec using the -n 10 option.
any advice appreciated, thanks,
Nick - Linux Administrator

$ mpiexec -f ~/mpiMachinefile.txt -n 10 -wdir /usr/local/espresso-4.2.1/tests ./check-pw.x.j 
Checking atom...Checking atom...Checking atom...Checking atom...Checking atom...Checking atom...Checking atom...Checking atom...Checking atom...Checking atom...passed
Checking atom-lsda...passed
Checking atom-pbe...discrepancy in pressure detected
Reference: -14.44, You got: -14.43
Checking atom-sigmapbe...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
passed
awk: cmd. line:6: fatal: cannot open file `atom.tmp' for reading (No such file or directory)
/bin/rm: cannot remove `atom.tmp': No such file or directory
Checking atom-lsda...Checking atom-lsda...STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
STOP 2
STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
discrepancy in number of scf iterations detected
Reference: 7, You got: 11
Checking atom-pbe...discrepancy in number of scf iterations detected
Reference: 7, You got: 11
Checking atom-pbe...discrepancy in pressure detected
Reference: -14.44, You got: -14.43
Checking atom-sigmapbe...discrepancy in pressure detected
Reference: -14.44, You got: -14.43
Checking atom-sigmapbe...STOP 2
FAILED with error condition!
Input: atom-sigmapbe.in, Output: atom-sigmapbe.out, Reference: atom-sigmapbe.ref
Aborting
discrepancy in number of scf iterations detected
Reference: 7, You got: 25
Checking atom-pbe...discrepancy in pressure detected
Reference: -14.44, You got: -14.43
Checking atom-sigmapbe...discrepancy in total energy detected
Reference:   -31.491047, You got:     0.000000
discrepancy in number of scf iterations detected
Reference: 16, You got: 
discrepancy in pressure detected
Reference: -15.02, You got: 
Checking berry...passed
Checking berry, step 2 ...discrepancy in number of scf iterations detected
Reference: 16, You got: 34
discrepancy in pressure detected
Reference: -15.02, You got: -15.11
Checking berry...STOP 2
FAILED with error condition!
Input: berry.in2, Output: berry.out2, Reference: berry.ref2
Aborting
passed
Checking berry, step 2 ...STOP 2
FAILED with error condition!
Input: berry.in2, Output: berry.out2, Reference: berry.ref2
Aborting
discrepancy in number of scf iterations detected
Reference: 16, You got: 32
discrepancy in pressure detected
Reference: -15.02, You got: -14.98
Checking berry...passed
Checking berry, step 2 ...

<output chopped/incomplete> end.


