Monday, December 31, 2007

Weekend Down in Flames

A page at 2:46 Saturday morning woke me up. Our Lawson server, an aging Digital Alpha 4100 running Tru64 Unix 5.1B, was complaining. Right away I knew that the list of chores I had planned for the weekend was going to take a hit.

A volume managed in LSM (Logical Storage Manager) had lost a plex and one of the secondary swap spaces was running unmirrored. It looked like a relatively simple disk failure, so I headed back to bed.

In the morning, I went to the office to gather up some paperwork, then drove to the computer room. The alarm in the disk cabinet was screaming like a banshee, and the cabinet was festooned with a number of amber lights. The cabinet contains two sets of redundant HSZ70 SCSI controllers. The top set is connected to an expansion unit on the back of the cabinet. There are about 70 disks in the cabinet. Four of them, all on channel 1 off the expansion unit, were blinking amber. Yikes!

I checked the CLI interface (Command Line Interface Interface) and could see that two raidsets were running reduced, the third disk held the swap mirror, and the fourth bad disk wasn't in use. Since everything was mirrored or raided, the system was still up and functioning. I filed a call with HP support at 1-800-354-9000 (1-800-DIAL-A-PRAYER). I tried for a while to convince the phone support tech that there was probably something wrong with the entire channel on the expansion unit, but the controller didn't have any errors on it, so he eventually sent the call to the local field tech and asked him to bring over four new drives. Uh huh. Fortunately, I know the field tech is better than that, and when he called, I explained the situation.

Later in the evening, the tech arrived at the computer room (I had since gone back home) and he unplugged the expansion unit, blew on it, and put it back in. That brought the disks back online. However, it also managed to take down all the disks on channel one on the front of the cabinet too, including the mirror of the root disk that I'd booted from. He rebuilt all the disks from the hardware side, but it left a mighty mess to clean up on the software side.

With the boot disk down and "replaced", that meant shutting down and booting from the other root mirror to free up the bad disk so it could be rebuilt.

1) Disassociate and remove the bad plexes from the root volume
volplex -o rm dis rootvol-01
volplex -o rm dis swapvol-01
volplex -o rm dis vol-rz16g-01

2) Remove the disks from the diskgroup
voldg rmdisk rz16a
voldg rmdisk rz16b
voldg rmdisk rz16f
voldg rmdisk rz16g

3) Remove the disks from LSM
voldisk rm rz16a
voldisk rm rz16b
voldisk rm rz16f
voldisk rm rz16g

4) Physically replace the disks using the HSZ commands (I didn't have to replace mine since it wasn't the disks that failed)

5) Label the new disk
disklabel -wr rz16

If that doesn't work, try:

disklabel -z rz16
disklabel -wr rz16

For me, trying to label the disk didn't work at all because it was the boot disk and the OS still had a partition open and claimed. I had to "shutdown -h now" and reboot from the >>> console prompt using the address of the mirrored (good) root drive. On my system, "show bootdef_dev" reported that the default boot disk was dkc0..., so I used "boot dkc1" to boot from the mirror.

When the system rebooted, I was then able to continue with the disklabel:

disklabel -wr rz16

6) Mirror the root drive
volrootmir -a rz16

That command will build a mirror on the disk specified.
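
Steps 1 through 3 above are the same few commands repeated per plex and per partition, so they lend themselves to a small dry-run script that prints the commands for review before anything gets run as root. This is only a sketch: the plex and partition names are the ones from my system, the function name is made up for illustration, and nothing here actually calls LSM.

```shell
#!/bin/sh
# Dry run only: print the LSM cleanup commands instead of executing them.
# DISK, PARTS and PLEXES mirror the names used in steps 1-3 above;
# change them to match your own configuration.
DISK=rz16
PARTS="a b f g"
PLEXES="rootvol-01 swapvol-01 vol-rz16g-01"

lsm_cleanup_plan() {
    # Step 1: disassociate and remove the bad plexes
    for plex in $PLEXES; do
        echo "volplex -o rm dis $plex"
    done
    # Step 2: remove the disks from the disk group
    for p in $PARTS; do
        echo "voldg rmdisk $DISK$p"
    done
    # Step 3: remove the disks from LSM
    for p in $PARTS; do
        echo "voldisk rm $DISK$p"
    done
}

lsm_cleanup_plan
```

Once the printed list looks right, the lines can be pasted in one at a time so that a failure at any step is obvious.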

After all that, I tried to remirror the additional swap volume, but ran into a roadblock.

The disks were set up in LSM when the system was installed using Tru64 4.0x. I had since upgraded to Tru64 5.1B. LSM used to use a private region of 1024, but the new version uses a private region of 4096. When I tried to add the new disk using voldiskadm, and then mirror it, it said there wasn't enough space to complete the mirror:

# volassist mirror swapvol02 swap01
lsm:volassist: ERROR: Cannot allocate space to mirror 8376988 block volume

I then tried to write a disklabel from the good mirror onto the disk before adding it with voldiskadm (dsk13 is the good disk, dsk12 is the new one):

disklabel -r dsk13 > dsk13.lab
disklabel -R dsk12 dsk13.lab

I then added the disk through voldiskadm and chose not to initialize it. But then I got this error:

# volassist mirror swapvol02 swap01
lsm:volplex: ERROR: Volume swapvol02, plex swapvol02-01, block 0: Plex write:
Error: Write failure
lsm:volplex: ERROR: sd swap01-01 in plex swapvol02-01 failed during attach
lsm:volplex: ERROR: changing plex swapvol02-01:
Record not in disk group
lsm:volplex: ERROR: Attempting to cleanup after failure ...
lsm:volassist: ERROR: Could not attach new mirror(s) to volume swapvol02
lsm:volassist: WARNING: Object swapvol02-01: Unexpectedly removed from the configuration

Ouch. I turned back to Google and found two tech forum posts from people who'd encountered the same thing, but no one had answered them.

I called HP support and they sent me these three simple little commands:

# voldisksetup -i dsk12 privlen=1024
# voldg adddisk swap01=dsk12
# volassist mirror swapvol02 swap01

Worked like a charm, and the system is back to normal with everything mirrored and raided properly.
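
The arithmetic behind the fix is roughly this: the private region comes out of the same disk as the space available for the mirror, so a bigger private region leaves less room for the volume. Here is a rough sketch of that check. The 8376988-block volume size is from the error message above, but the total disk size is a made-up number for illustration, the function name is hypothetical, and the real layout has more overhead than this simple subtraction.

```shell
#!/bin/sh
# Rough space check: does the volume still fit once the private region
# is carved out of the disk? All sizes are in blocks.
public_region_check() {
    disk_blocks=$1   # total usable blocks on the disk (hypothetical here)
    privlen=$2       # private region length
    vol_blocks=$3    # size of the volume to be mirrored
    public=$((disk_blocks - privlen))
    if [ "$public" -ge "$vol_blocks" ]; then
        echo "privlen=$privlen: fits"
    else
        echo "privlen=$privlen: short by $((vol_blocks - public)) blocks"
    fi
}

# 8378028 is an assumed disk size, chosen only to illustrate the problem.
public_region_check 8378028 1024 8376988
public_region_check 8378028 4096 8376988
```

With these example numbers the old 1024 private region leaves enough room and the 4096 one does not, which matches the "Cannot allocate space" error I got.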

However, there's still an amber warning light on the cabinet. Since the controllers aren't reporting any errors, the field tech's best guess is a problem with one of the many fans in the cabinet. He'll be coming by Wednesday with a few new fans to try some replacements to see if he can get that pesky amber light to go away.

Wednesday, December 19, 2007

Quote of the Day

Here's my favorite quote today, found while browsing to try to find an AIX hardware support matrix. Apparently IBM is stingy with that information too.

From "inferno" on

"I do not get it, I explained that I was interested in getting IBM certified. Companies like Sun Microsystems and Red Hat will constantly harrass you and damn near send a limo and a bunch of exotic dancers to your home in order to get you certified. So what is wrong with IBM?"

Thursday, December 13, 2007

X11 Forwarding over SSH

Holy crap. I just spent about five hours banging my head against a wall.

In an effort to try to secure connections, I've been trying to get more things tunneled through ssh to lock down some more ports to our DMZ network. Today, I've been working on getting X Windows applications to tunnel over ssh to my PC.

I read a couple online manuals.

I connected from PuTTY on my PC to a Solaris 9 box on our DMZ. The ssh session came right up. I turned on X11 forwarding and enabled it on the server and tried to log in again. No dice. It closed the connection right after I typed in the password. I must be doing something wrong.

I did some more web searching. Read several more tutorials about ssh and X11 forwarding. Still no dice. Still must be doing something wrong. Click this. Click that. Edit this config file, edit that. Nope. Passive, active, indirect. Nope. Port forwarding. No port forwarding. Nope. DISPLAY set. DISPLAY not set. Nope. Nope. Nothing in the PuTTY logs.

I really must not understand this ssh/X11 forwarding thing at all. Yet every document I read has virtually the same instructions. What could I be missing?

I finally happened to check the error log on the Solaris 9 box itself. Sure enough, there were errors that corresponded to each of my login attempts. I did some more web searching and finally found it: a bug report for a Solaris 9 patch that causes X11 forwarding to fail. Effin' A.

I tried one of my Solaris 10 servers. Worked the first time. Five hours gone up in smoke. No wonder I'm quiet at the dinner table. I just worked hard all day doing nothing.
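
For what it's worth, the server-side half of the setup is easy to confirm from a shell before burning hours on the client. Below is a minimal sketch that checks whether an sshd config file has X11 forwarding turned on. The default path /etc/ssh/sshd_config and the function name are assumptions, so point it at wherever your sshd actually reads its configuration.

```shell
#!/bin/sh
# Return success if the given sshd config file contains an uncommented
# "X11Forwarding yes" line (keyword match is case-insensitive).
x11_forwarding_enabled() {
    grep -Eiq '^[[:space:]]*X11Forwarding[[:space:]]+yes' "$1"
}

# Assumed default location; override by setting SSHD_CONFIG.
SSHD_CONFIG=${SSHD_CONFIG:-/etc/ssh/sshd_config}

if [ -r "$SSHD_CONFIG" ]; then
    if x11_forwarding_enabled "$SSHD_CONFIG"; then
        echo "X11 forwarding is enabled in $SSHD_CONFIG"
    else
        echo "X11 forwarding is NOT enabled in $SSHD_CONFIG"
    fi
fi
```

If this says forwarding is enabled and the session still drops, that is the cue to go read the server's own logs rather than clicking around in the client.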

Friday, December 7, 2007

Reason #682 Why I Get Frustrated with IBM

An application developer down the hall came in the office late yesterday asking if I could install the HP LaserJet 4000 printer drivers on one of our IBM AIX servers. "Sure, no problem," I said, knowing in my core I was about to embark on a painful adventure.

This morning, I set about trying to locate the printer drivers on IBM's website. Fifteen minutes of thrashing about... no deal. They link to a huge list of some Infoprint driver crap, but nothing for HP printers.

I checked out HP's website, which is almost as poorly designed as IBM's. IBM's is worse because not only is the design bad, but they don't let you download anything useful. HP has lots of useful stuff on their website, but it's just really hard to find.

Anyway, after a few minutes of thrashing on HP's website, I found they have some generic un*x drivers for the LaserJet 4000 series, but I knew that wasn't going to fly on the AIX server. I needed the *.rte files from IBM.

I turned to Google, and found many posts to technical forums, and each one went like this:

Question: Where on the web can I find HP printer drivers for IBM?

Answer: You can't. I think they're on the installation CDs somewhere.

I then dove into my storage cabinet and pulled out my big box o' AIX CDs. I located the AIX 5.2 install CDs (seven of them) for that server as well as numerous other randomly labeled media. I popped the first CD into my Mac and started searching for something that looked like printer drivers. About this time, the developer guy popped in and said that his vendor said it was on disc one. Hmmm... okay.

Longer story shorter, it's not. It's disc three. I'll say it again to try to make sure Google picks it up for you out there searching the web. The HP LaserJet 4000 series printer drivers for AIX 5.2 are on installation CD number 3 of 7.

Just pop that CD into your server, or remotely mount it via NFS like I did, and run:

# smitty printers
Printer/Plotter Devices
Install Additional Printer/Plotter Software
choose /cdrom or whatever mount point you used
hit list to choose the printer drivers you want

The trick is finding them. Once you find them, it's super easy.
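
Since finding them is the hard part, here is a small sketch for searching a mounted CD instead of eyeballing it. The file name pattern printers.hplj* is my assumption about how the HP LaserJet filesets are named on the media, and /cdrom and the function name are just placeholders, so adjust all three to your setup.

```shell
#!/bin/sh
# List files under a mount point whose names match a fileset pattern.
# $1 = mount point, $2 = optional name pattern (assumed default below).
find_printer_filesets() {
    find "$1" -type f -name "${2:-printers.hplj*}" -print
}

# Assumed mount point; override by setting MNT.
MNT=${MNT:-/cdrom}
if [ -d "$MNT" ]; then
    find_printer_filesets "$MNT"
fi
```

Run it against each disc (or each NFS mount) in turn; an empty result means it's time to swap to the next CD.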

Thursday, December 6, 2007

Update: Virtual Frame Buffer for use with Oracle Reports Server

Well, it turns out that the handy VFB I described a couple posts down doesn't work with SQR. I found a Hyperion document that had corrected resolution settings for the virtual buffer:

/usr/openwin/bin/Xvfb :5 -dev vfb screen 0 1152x900x8 &
/usr/openwin/bin/twm -display :5 -v &
DISPLAY=:5.0; export DISPLAY

Apparently, SQR is picky about the resolution. 1152x900x8 worked; 1600x1200x32 did not.