Node Discovery Protocol
-----------------------

1 Motivation

Within an openMosix cluster, all participating nodes must have a
loosely synchronized map of all nodes. In other words, all nodes must
be aware of each other, but their maps need not be consistent at any
given instant. The purpose of the node discovery protocol is to fulfill
this requirement. It provides a mechanism that lets existing nodes
recognize new nodes that would like to join the cluster, and a method
that lets newly joining nodes build maps of the existing machines
within the cluster.

2 Design

2.1 Messaging

The initial implementation of the node discovery protocol is minimal.
When a new node is initialized (when node discovery is activated), the
following sequence of events occurs.

1. The new node is initialized, which means that its network
   interface(s) are configured up and openMosix is ready to operate.

2. The new node sends a "join" message to all other openMosix nodes
   signifying its existence. Receiving nodes can then add the sending
   node to their map.

   openMosix has the concept of interface aliases. If a host can send
   packets with different source addresses on the same network, and
   would like other nodes on the network to recognize them as the same
   host, then an alias entry mapping one node identifier to two
   interface addresses can be specified. As part of a join message, up
   to six alias entries can be specified. Hosts will keep track of
   these addresses in order to know how to set their "number of
   gateway" entries, as well as aliases.

3. Each receiving node can then respond by sending an "acknowledgment"
   broadcast message to all openMosix nodes signifying its existence.
   This broadcast helps nodes maintain more accurate maps. The same
   alias information can be passed with acknowledgments as with joins.

4. [ not implemented or decided ] Before a node becomes unavailable,
   it can send a "leaving" message to all other nodes.
   Other nodes can remove the departing node from their map, along
   with its aliases and gateway entries.

3 Implementation

3.1 Communication

All nodes in the openMosix cluster will join a multicast group to be
used for auto-discovery communication. In many clusters this will
effectively be a routable broadcast, because all nodes will join the
multicast group.

When auto-discovery is activated (the auto-discovery daemon is
started), the daemon sends a "join" message to the multicast group.
Upon receipt, nodes running the auto-discovery daemon send an
"acknowledgment" message to the multicast group.

Another approach would be for receiving nodes to send a single UDP
datagram to the sending node. This approach has two disadvantages:
(1) it does not necessarily reduce traffic, because of potential ARP
requests, and (2) it does not help correct the maps of other nodes,
which may be incomplete, perhaps due to a lost datagram.

3.1.1 Message Structure

The structure of all messages (the payload of each datagram) sent by
auto-discovery is as follows. (This is the planned structure; the mskX
fields are not yet present.)

     0000 0000 0011 1111 1111 2222 2222 2233 3333 3333 4444 4444 4455 5
     0123 4567 8901 2345 6789 0123 4567 8901 2345 6789 0123 4567 8901 2
    +----+----+----+----+----+----+----+----+----+----+----+----+----+-+
    |mgcn|src |msk |ifn1|msk1|ifn2|msk2|ifn3|msk3|ifn4|msk4|ifn5|msk5|t|
    +----+----+----+----+----+----+----+----+----+----+----+----+----+-+

Definition of fields:

1. mgcn: A magic number to aid in verifying the integrity of a
   message. If the preset magic number is not the first four bytes of
   the payload of a packet, the packet is discarded.

2. src: The source address of the message. This is used instead of the
   SRC in the IP header because it makes routing easier.

3. msk: The netmask for the source address.

4. ifnX: Interface alias fields. Each of these fields is an interface
   which nodes should consider an alias for the source address of the
   datagram.
   If there are no aliases, these fields are set to zero.

5. mskX: The respective netmask for each ifnX.

6. t: A type field describing the type of the message. Valid types are
   'j' for a join message and 'a' for an acknowledgment message.

3.2 Interaction with openMosix Kernel

The auto-discovery daemon communicates with the openMosix kernel by
reading and writing to /proc/mosix (/proc/hpc). [...]

4 Constraints:

1. Join messages are not retransmitted. If one is lost, the system
   relies on multicast acknowledgments to correct missing map entries.

2. Node identifier selection is based on the last two octets of the
   IPv4 address of the first specified interface. When interfaces are
   configured with certain netmasks (e.g. 0xffff0000), node identifier
   collisions can occur.

3. There is no routing loop detection. Using real multicast routing is
   recommended in complex networks.

5 Future:

5.1 New /proc Interface

Below is a proposal of what a new /proc interface for auto-discovery
might look like. I've started on it a bit, but it's up to the openMosix
team if they actually want it.

The existing /proc interface was not designed for dynamic node addition
and removal. For example, in order to add a single node to the kernel's
map, an entire array of structures must be written, and these
structures match structures within the kernel. A cleaner abstraction
and interface between auto-discovery and /proc should be established.
For example:

1. /proc/hpc/autodiscovery/add: A user application can write one or
   more new node entries to this interface, notifying the kernel to add
   them to its map.

2. /proc/hpc/autodiscovery/remove: A user application can write one or
   more node removal entries to this interface, notifying the kernel to
   remove them from its map.

3. /proc/hpc/autodiscovery/list: A user application can cat this
   interface to display the nodes which reside in the kernel's map.

Each of these interfaces would use ASCII to communicate.
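
As an illustration of the message layout described in section 3.1.1,
the following Python sketch packs and parses a 53-byte payload using
the planned structure (including the mskX fields). It is not the
daemon's code. The magic value, the use of network byte order, and the
names MAGIC, pack_message and unpack_message are assumptions made for
this example only; the document does not specify them.

```python
import socket
import struct

# Hypothetical values: the magic number and the byte order are not
# given in this document and are assumed here for illustration.
MAGIC = 0x6F4D4144
MAX_ALIASES = 5  # ifn1..ifn5 in the layout diagram

# mgcn, src, msk, five (ifnX, mskX) pairs, then the one-byte type field.
MSG_FORMAT = "!I4s4s" + "4s4s" * MAX_ALIASES + "c"
MSG_SIZE = struct.calcsize(MSG_FORMAT)  # 13 four-byte fields + 1 byte
ZERO = b"\x00" * 4


def pack_message(msg_type, src, mask, aliases=()):
    """Build a payload; aliases is a sequence of (address, netmask)."""
    if msg_type not in (b"j", b"a"):
        raise ValueError("type must be b'j' (join) or b'a' (ack)")
    if len(aliases) > MAX_ALIASES:
        raise ValueError("too many aliases")
    fields = [MAGIC, socket.inet_aton(src), socket.inet_aton(mask)]
    for addr, amask in aliases:
        fields += [socket.inet_aton(addr), socket.inet_aton(amask)]
    # Unused alias slots are set to zero.
    fields += [ZERO] * (2 * (MAX_ALIASES - len(aliases)))
    fields.append(msg_type)
    return struct.pack(MSG_FORMAT, *fields)


def unpack_message(payload):
    """Return (type, src, mask, aliases), or None if rejected."""
    if len(payload) != MSG_SIZE:
        return None
    values = struct.unpack(MSG_FORMAT, payload)
    if values[0] != MAGIC:
        return None  # wrong magic number: discard the packet
    msg_type = values[-1]
    if msg_type not in (b"j", b"a"):
        return None
    src = socket.inet_ntoa(values[1])
    mask = socket.inet_ntoa(values[2])
    aliases = []
    for i in range(3, 3 + 2 * MAX_ALIASES, 2):
        if values[i] != ZERO:  # zero means "no alias in this slot"
            aliases.append((socket.inet_ntoa(values[i]),
                            socket.inet_ntoa(values[i + 1])))
    return msg_type, src, mask, aliases
```

A receiver would call unpack_message on each datagram read from the
multicast group and ignore any payload for which it returns None.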