May 20

Working a particularly troublesome problem this week reminded me of two things. First, I realised (not for the first time in my career) that sometimes the unlikeliest root cause will be there to bite you in the ass.

It also reminded me of a phrase once used by the late John Peel:

I never make stupid mistakes. Only very, very clever ones.


You see, I'm currently working a plan to decommission a whole raft of legacy E10000 systems from the current client.

We moved a four-board domain last week from one datacentre to another, using a previously idle domain on the destination platform. Everything went smoothly: all three SANs were visible, the network appeared to come up, and we applied a new Veritas license (because the hostid had changed). However, after a final reboot before leaving for the night, I noticed that I was having trouble connecting to the public network interface. In fact, I was seeing in excess of 95% packet loss on inbound traffic.

Backing out the change was not a (pleasant) option, so instead I updated DNS and the defaultrouter to divert all traffic through the backup network as a workaround, and left for the night (it was getting very late, and this wasn't a production host).

The following day I rustled up a member of the comms team, and we spent an afternoon in the datacentre, trying to troubleshoot in a structured manner. The results of this testing were annoyingly inconclusive: the problem was showing up on multiple cables, multiple NICs, multiple hosts and multiple switch ports.

To try to ensure that the public and backup traffic didn't beat each other up, we decided to connect the public network interface into the backup subnet.

That was the very, very clever mistake. Normally this wouldn't be a problem, because normally we have "local-mac-address" set to true, which ensures that each NIC on the machine has a unique Ethernet address. Here it was not, so both interfaces on the backup subnet were presenting the same Ethernet address. Of course, this caused problems as soon as significant traffic was passed through the backup interface. But when the problems were reported and I looked at the configuration in "ifconfig", it struck me that the behaviour we were seeing was remarkably similar to the original problem with the public network.
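A rough sketch of how that "ifconfig" check could be automated: the Python below groups interfaces by the "ether" address they report and flags any address shared by more than one interface. The sample output, interface names and addresses are invented for illustration, and the parsing assumes the usual Solaris layout of an interface header line followed by indented detail lines.

```python
import re
from collections import defaultdict

# Hypothetical, trimmed `ifconfig -a` output; names and addresses are invented.
SAMPLE_IFCONFIG = """\
hme0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 10.1.1.20 netmask ffffff00 broadcast 10.1.1.255
        ether 8:0:20:aa:bb:cc
qfe0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 10.1.1.21 netmask ffffff00 broadcast 10.1.1.255
        ether 8:0:20:aa:bb:cc
"""


def interfaces_by_mac(ifconfig_text):
    """Map each reported 'ether' address to the interfaces that carry it."""
    macs = defaultdict(list)
    current = None
    for line in ifconfig_text.splitlines():
        # Header line, e.g. "hme0: flags=..." (also matches "hme0:1: flags=...").
        header = re.match(r"^(\S+?):\s+flags=", line)
        if header:
            current = header.group(1)
            continue
        # Indented detail line carrying the Ethernet address.
        ether = re.match(r"^\s+ether\s+(\S+)", line)
        if ether and current:
            macs[ether.group(1).lower()].append(current)
    return dict(macs)


def shared_macs(ifconfig_text):
    """Return only the addresses reported by more than one interface."""
    return {
        mac: names
        for mac, names in interfaces_by_mac(ifconfig_text).items()
        if len(names) > 1
    }


if __name__ == "__main__":
    for mac, names in shared_macs(SAMPLE_IFCONFIG).items():
        print(f"{mac} is shared by: {', '.join(names)}")
```

A shared address is only a problem when the interfaces concerned sit on the same subnet, as they did here, so treat a hit as a prompt to look closer rather than as a verdict.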

I quickly logged onto a machine with a connection to that public network, performed a broadcast ping ("ping -s 255.255.255.255"), then checked the ARP table and compared it with the Ethernet address of the problem domain. My suspicions were confirmed: we have two E10k domains with the same Ethernet address (and hostid), and both have "local-mac-address" set to false. This must have been a deliberate action, as Sun must have generated the necessary key to allow the domain to be created this way, but the real reason why this sysadmin trap was set up has been lost over the years.
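The ARP comparison can be sketched the same way. The Python below is a minimal illustration, not the procedure actually used: it takes text shaped like "arp -a" output, normalises each Ethernet address (Solaris tends to drop leading zeros, so "8:0:20" and "08:00:20" are the same thing), and reports any address claimed by more than one host. The sample table and every address in it are invented, and the column layout is an assumption.

```python
import re
from collections import defaultdict

# Six colon-separated groups of one or two hex digits.
MAC_RE = re.compile(r"[0-9a-fA-F]{1,2}(?::[0-9a-fA-F]{1,2}){5}")

# Hypothetical `arp -a` style output; every address here is invented.
SAMPLE_ARP = """\
Device   IP Address               Mask      Flags   Phys Addr
------ -------------------- --------------- ----- ---------------
hme0   10.1.1.20            255.255.255.255       08:00:20:aa:bb:cc
hme0   10.1.1.35            255.255.255.255       8:0:20:aa:bb:cc
hme0   10.1.1.1             255.255.255.255       00:10:7b:11:22:33
"""


def normalize_mac(mac):
    """Pad each group to two lowercase hex digits so spellings compare equal."""
    return ":".join(f"{int(part, 16):02x}" for part in mac.split(":"))


def duplicate_macs(arp_text):
    """Return {mac: [hosts]} for every address claimed by more than one host."""
    hosts = defaultdict(set)
    for line in arp_text.splitlines():
        fields = line.split()
        # Skip headers, rulers and anything whose last column is not a MAC.
        if len(fields) < 3 or not MAC_RE.fullmatch(fields[-1]):
            continue
        hosts[normalize_mac(fields[-1])].add(fields[1])
    return {mac: sorted(names) for mac, names in hosts.items() if len(names) > 1}


if __name__ == "__main__":
    for mac, names in duplicate_macs(SAMPLE_ARP).items():
        print(f"{mac} is claimed by: {', '.join(names)}")
```

A host with several addresses on one interface will legitimately show up here too, so each hit still needs checking against what the hosts themselves report.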

Sheesh...

Posted by Mike Scott

1 Comment

  1. Anonymous says:

    Thanks Mike,

    We've been having 30-40% ping packet loss since the system controller board was replaced on a Sun-Fire-V490, which changed the eeprom local-mac-address to false. This is the only page I googled that helped me find the problem after much frustration.

    Cheers,
    G