Apr 24
I recently found myself with an interesting problem to solve while jumpstarting some very large SPARC machines. Apart from some neat functionality that has been added since I last looked at it in detail, there is nothing new about jumpstarting machines. Although I've jumpstarted dozens, if not hundreds, of machines of varying sizes over the years, there was significant expectation (and risk) attached to upgrading these particular machines, and I wanted to make sure that I got it right first time, every time.


These particular machines are 32-way SPARC64 V based Fujitsu machines (actually 64-way, but each chassis is divided into two partitions). They have numerous I/O controllers (off the top of my head, I'd estimate about 16), with an even split of traditional SCSI and Fibre Channel controllers that were installed at various points during the lifecycle of the old OS. Most controllers have devices visible on them - either locally installed hard drives, or LUNs visible across the SAN.


My plan was to retain the pair of drives holding the old Operating System (Solaris 8), and install the new OS onto a fresh pair of drives which were already connected - this would allow an easy back-out, should it be required. I could not guarantee that the controllers would keep their numbers across the re-install, and having Jumpstart select the wrong devices could lead to the old OS image being overwritten, to data loss, or simply to the disks not being mirrored between separate system boards as intended.


Now, armed with the engineering manual of the PrimePOWER server in question, I could have researched the probe order of the PCI cards and correlated this with the existing installation, but to be frank, I'm way too lazy to do that. I'd rather find another, less error-prone method of forcing Jumpstart to select the correct devices for install. Additionally, given that a full boot/POST/reboot cycle on these machines can take in excess of an hour, I was keen to keep the outage window for the work to an absolute minimum - this meant that I had to be absolutely sure that I was going to get it right first time.


I think I found a good, simple solution. In fact, I'm surprised that this functionality isn't already allowed in the jumpstart profile.


Within Solaris, we already have a unique device identifier that will persist across reboots and OS re-installs - the physical device path. So, rather than specifying the intended devices using the traditional cXtYdZ notation, we should be able to specify a physical device path (on this particular machine type it looks something like this: /pci@86,4000/scsi@4/sd@0,0). In fact, we don't need to specify the entire physical path - the only requirement is that we specify enough to uniquely identify one device - but in most cases it's probably safer to specify the path in full.
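
To make the relationship concrete: on Solaris each /dev/dsk entry is a symbolic link into /devices, and the link target is the physical device path followed by a minor-name suffix such as ":a". The following is a minimal sketch of reducing such a link target to its physical path. The function name phys_path_from_link is made up for illustration, the sample target is modelled on the path format above, and the parameter expansions assume ksh or another POSIX shell rather than the old Bourne shell.

```shell
# Hypothetical helper: reduce a /dev/dsk symlink target to its physical path.
# Input:  ../../devices/pci@86,4000/scsi@4/sd@0,0:a
# Output: /pci@86,4000/scsi@4/sd@0,0
phys_path_from_link() {
    target=$1
    # Drop everything up to and including the "devices" directory.
    path=${target#*/devices}
    # Drop the trailing ":<minor name>" that identifies the slice.
    path=${path%:*}
    printf '%s\n' "$path"
}

# Example with a literal link target (no real devices are touched).
phys_path_from_link "../../devices/pci@86,4000/scsi@4/sd@0,0:a"
```

Because the physical path is embedded in the link target, any lookup from physical path to cXtYdZ name only needs to search those targets.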


In order to facilitate this, I used a jumpstart derived profile with a very basic "begin" script. The script checks the system hostname and, according to a case statement, identifies which physical devices are appropriate for this host. It then looks up the logical access path in cXtYdZ format and generates the necessary profile for it.
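
For readers who have not used derived profiles: the rules file entry names a begin script and puts "=" in the profile field, and the begin script is expected to write the profile to the file named by the SI_PROFILE environment variable. The following is a rough sketch of that last step only, not the actual script. The emit_layout function is a hypothetical stand-in that prints fixed example lines, and the /tmp fallback exists purely so the fragment can be tried outside an install.

```shell
# Sketch of the tail end of a begin script for a derived profile.
# JumpStart exports SI_PROFILE; the fallback below is an assumption made
# only so this fragment can be run by hand.
SI_PROFILE=${SI_PROFILE:-/tmp/derived_profile.$$}

# Hypothetical stand-in for the device lookup logic: prints fixed
# example lines instead of probing hardware.
emit_layout() {
    echo "filesys mirror c1t0d0s0 c5t0d0s0 10240 /"
    echo "filesys mirror c1t0d0s1 c5t0d0s1 4096 swap"
}

# Write the complete profile in one go to the file JumpStart will read.
{
    echo "install_type initial_install"
    emit_layout
} > "$SI_PROFILE"
```

A real profile would carry further keywords (cluster, packages and so on); they are left out here because they have no bearing on disk selection.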


I'll not fully reproduce the script here because it contains references to client config and other functionality that I may write about in the future, but the following code fragments ought to give you the idea (hostnames have been changed to protect the innocent):



case `hostname` in

pw2500a)
        SOL10_ROOT=/pci@86,4000/scsi@4/sd@0,0
        SOL10_MIRROR=/pci@a4,4000/scsi@4/sd@0,0
;;

pw2500b)
        SOL10_ROOT=/pci@8a,4000/scsi@4/sd@0,0
        SOL10_MIRROR=/pci@94,4000/scsi@4/sd@0,0
;;
*)
        gameOver "ERROR: `hostname` not found in config"
;;
esac
######################
# Translate a physical device path to a logical cXtYdZ name
translateDevice() {
        if [ $# -ne 1 ] ; then
                gameOver "ERROR: translateDevice() expects exactly one argument"
        fi
        # Each /dev/dsk entry is a symlink into /devices; match on the link target
        DEVICE=$(ls -l /dev/dsk/c*s0 | grep "$1" | awk '{print $(NF-2)}')
        if [ -z "$DEVICE" ] ; then
                gameOver "ERROR: $1 not found in /dev/dsk/"
        fi
        if [ $(echo $DEVICE |wc -w) -gt 1 ] ; then
                gameOver "ERROR: $1 found, but returned multiple disks ($DEVICE)"
        fi

        print $DEVICE | cut -d/ -f4 | sed 's/s0$//'
}
######################
# Main program
if [ -z "$SOL10_ROOT" ] ; then
        gameOver "ERROR: SOL10_ROOT not defined for `hostname`"
fi
SOL10_ROOT=$(translateDevice "$SOL10_ROOT")
if [ -z "$SOL10_ROOT" ] ; then
        exit 1
fi
SOL10_ROOT="${SOL10_ROOT}sXSECTIONX"
SOL10_DEVS=$SOL10_ROOT

if [ ! -z "$SOL10_MIRROR" ] ; then
        SOL10_MIRROR=$(translateDevice "$SOL10_MIRROR")
        if [ -z "$SOL10_MIRROR" ] ; then
                exit 1
        fi

        SOL10_MIRROR="${SOL10_MIRROR}sXSECTIONX"
        SOL10_DEVS="$SOL10_ROOT $SOL10_MIRROR"
fi

print "filesys mirror $SOL10_DEVS 10240 /"              |sed 's/XSECTIONX/0/g'
print "filesys mirror $SOL10_DEVS 4096 swap"            |sed 's/XSECTIONX/1/g'
print "filesys mirror $SOL10_DEVS 6144 /var"            |sed 's/XSECTIONX/3/g'
print "filesys mirror $SOL10_DEVS free /export" |sed 's/XSECTIONX/5/g'
print "metadb $SOL10_ROOT size 8192 count 3"            |sed 's/XSECTIONX/7/g'

if [ ! -z "$SOL10_MIRROR" ] ; then
        print "metadb $SOL10_MIRROR size 8192 count 3"  |sed 's/XSECTIONX/7/g'
fi




(Any errors or omissions in that script are likely due to my copy/pasting it into this entry.)
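
The fragments call a gameOver helper that is not shown. The following is a guess at a minimal definition, not the original one. It also illustrates why the main program re-checks for an empty result after each translateDevice call: inside $(...) an exit only leaves the subshell, so the caller has to notice the failure itself.

```shell
# Assumed definition of the gameOver helper; the original is not shown,
# so this is only a guess at its behaviour.
gameOver() {
    # Write to stderr so the message is never captured into a variable
    # or mixed into the generated profile.
    echo "$*" >&2
    exit 1
}

# Called inside $(...), gameOver ends only the subshell. The calling
# script keeps running with an empty result and a non-zero status.
if RESULT=$(gameOver "ERROR: example failure" 2>/dev/null); then
    STATUS=0
else
    STATUS=$?
fi
echo "result='$RESULT' status=$STATUS"
```

With this definition the example reports an empty result and a status of 1, which is the condition the "exit 1" checks in the main program are guarding against.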


Importantly, this method also allows you to run the script on the host before the jumpstart is actually performed, and it should still accurately identify the correct devices using their current logical access paths. This makes for a useful pre-install check:


 



$ ./getDiskLayout
filesys mirror c1t0d0s0 c5t0d0s0 10240 /
filesys mirror c1t0d0s1 c5t0d0s1 4096 swap
filesys mirror c1t0d0s3 c5t0d0s3 6144 /var
filesys mirror c1t0d0s5 c5t0d0s5 free /export
metadb c1t0d0s7 size 8192 count 3
metadb c5t0d0s7 size 8192 count 3
$
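
Since the aim was to keep the two halves of each mirror on separate system boards, the pre-install check could be extended with a rough sanity test on the configured physical paths. The following sketch only compares the first path component of each device. The function name on_different_nexus is hypothetical, and whether distinct top-level nodes really correspond to distinct system boards depends on the hardware, so treat it as a proxy rather than proof.

```shell
# Hypothetical sanity check: succeed only if two physical device paths
# start with different top-level nodes (e.g. pci@86,4000 vs pci@a4,4000).
on_different_nexus() {
    first=${1#/}          # strip the leading slash
    second=${2#/}
    first=${first%%/*}    # keep only the first path component
    second=${second%%/*}
    [ -n "$first" ] && [ -n "$second" ] && [ "$first" != "$second" ]
}

# Example using the pw2500a paths from the case statement.
if on_different_nexus /pci@86,4000/scsi@4/sd@0,0 /pci@a4,4000/scsi@4/sd@0,0
then
    echo "root and mirror sit behind different top-level nodes"
else
    echo "WARNING: root and mirror share a top-level node"
fi
```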



The end result - every single machine built correctly first time, without a significant amount of manual investigation into physical device paths and PCI probe order. With derived profiles there are many ways to implement the same functionality, and this illustrates just one of them.

Posted by Mike Scott
