[Pw_forum] nodes halt problem

Eduardo Ariel Menendez P emenendez at macul.ciencias.uchile.cl
Fri Apr 28 23:25:57 CEST 2006


Dear friends,
I apologize for asking a system administration question rather than one
about espresso. However, this is important for me to be able to run either
espresso or siesta.

I would like to know if anyone has had problems with this hardware:
CPU: Xeon 3.06 GHz EM64T (configured as 32 bits)
Chipset: Intel E7320
Network adapter: Intel 825401GI, integrated in the motherboard.
SATA hard disk in each node.
SATA controller 6300 ESB
My Linux is 2.4-27-2-686-smp Debian 1:3.3.6-10

When running large parallel jobs, some nodes halt without any message.
We have run tests to determine whether this is due to hardware or
software, but we are not sure yet.

The system halts happen with at least two codes and with several versions
of LAM-MPI and MPICH, so neither a particular code nor the MPI
implementation seems to be the problem.

A job using two CPUs in the same node runs to the end.
The same job using 2 CPUs in two nodes aborts because one node halts.
The moment of the halt seems random, and so does the node that halts.
The nodes are connected via a Gbit switch. This makes the switch and the
network cards suspects.

A job using 2 nodes and 4 CPUs, linked with crossover cables instead of
the switch, also aborts because of a halt. So it seems that the switch is
not guilty.

So it seems to be a problem with the kernel, the network
controller, or the network adapter. How can I find out which is guilty?
If the problem is software, is there a solution other than installing
another operating system? Can I rule out software problems?

Thanks
Eduardo


