Computers: Can't Live With 'Em, Pass the Beer Nuts

Thursday, April 29, 2010

Cleaning up Solaris 10 Device Tree when LUNs Removed

The following was copied from Symantec (http://sfdoccentral.symantec.com/sf/5.0MP3/solaris/html/vxvm_admin/ch02s24s03.htm), but I added some notes because their instructions were not correct in some spots (at least on my system).

To clean up the device tree after you remove LUNs

1. The removed devices show up as drive not available (or drive type unknown) in the output of the format command:

413. c3t5006048ACAFE4A7Cd252
/pci@1d,700000/SUNW,qlc@1,1/fp@0,0/ssd@w5006048acafe4a7c,fc

2. After the LUNs are unmapped using array management or the command line, Solaris also displays the devices as either unusable or failing (or maybe unknown just like all the devices - make sure you have the right ones!).

bash-3.00# cfgadm -al -o show_SCSI_LUN
[...]
c2::5006048acafe4a73,256 disk connected configured unusable
c3::5006048acafe4a7c,255 disk connected configured unusable
[...]

3. If the removed LUNs show up as failing, you need to force a LIP on the HBA. This operation probes the targets again, so that the device shows up as unusable. Unless the device shows up as unusable, it cannot be removed from the device tree. Do a long listing of the /dev/rdsk directory to see which device to specify:

luxadm -e forcelip /devices/pci@1d,700000/SUNW,qlc@1,1/fp@0,0:devctl

4. To remove the device from the cfgadm database, run the following commands on the HBA:

cfgadm -c unconfigure -o unusable_SCSI_LUN c2::5006048acafe4a73

or this one if not unusable (no -o here, since -o expects an option string after it):

cfgadm -c unconfigure c3::5006048acafe4a7c

5. Repeat step 2 to verify that the LUNs have been removed.

6. Clean up the device tree. The following command removes the /dev/rdsk... links to /devices.

# devfsadm -Cv
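On a host with hundreds of LUNs, the listing in step 2 is long. Here is a minimal sketch for pulling out only the entries whose condition is unusable or failing. It assumes the five-column layout shown in step 2 (Ap_Id, Type, Receptacle, Occupant, Condition). The function name stale_luns is made up for this example, and the third sample line is invented to show a device that should be skipped.

```shell
# stale_luns: read `cfgadm -al -o show_SCSI_LUN` output on stdin and print the
# Ap_Id of every entry whose Condition (5th column) is "unusable" or "failing".
# Assumes the five-column layout shown in step 2.
stale_luns() {
    awk '$5 == "unusable" || $5 == "failing" { print $1 }'
}

# Sample input: the first two lines come from the listing in step 2,
# the third is a hypothetical healthy-looking device that must not match.
sample='c2::5006048acafe4a73,256 disk connected configured unusable
c3::5006048acafe4a7c,255 disk connected configured unusable
c3::5006048acafe4a7c,0 disk connected configured unknown'

printf '%s\n' "$sample" | stale_luns
```

On a live system the equivalent would be `cfgadm -al -o show_SCSI_LUN | stale_luns`. Check the result against the array side before unconfiguring anything, since a device reported as unknown is not caught by this filter.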

Friday, February 29, 2008

Working With and Cleaning Out wtmpx

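# Show the current login history (reads /var/adm/wtmpx)
% last

# Convert wtmpx to text, keep the newest records, and convert back to binary.
# This example keeps only the last 500 records. You might want more on a busy system.
% /usr/lib/acct/fwtmp < /var/adm/wtmpx | tail -500 | /usr/lib/acct/fwtmp -ic > /tmp/wtmpx

# Test it
% last -f /tmp/wtmpx

# Overwrite the contents in place so the file keeps its owner and permissions
% cat /tmp/wtmpx > /var/adm/wtmpx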

Thursday, February 28, 2008

Weekend Down in Flames Revisited

The error from the previous post came back, and all four drives went bad again. This time though, the field engineer replaced the I/O expansion boards and the cable connecting them.

When I went to reboot, it went to boot off of dkc0. You might remember that last time, dkc0 had gone bad during the field engineer's fiddling and I had booted off of dkc1 and used volrootmir to remirror dkc0.

Well, the boot didn't go so well. It said it found a valid boot block, but when LSM went to load, it spit out errors about a bad boot track and unmirrored something or other and then went into single user mode. Already running 10 minutes late getting the system back online, I just booted from dkc1 again and it worked fine.

I then removed rz16 (dkc0) from the LSM mirror, re-added it, and remirrored. This time, the volrootmir command went something like this:

# volrootmir -a rz16

INFO: The '-a' option was specified for the /usr/sbin/volrootmir command,
however there are partitions on dsk1 that are not
encapsulated and therefore can not be mirrored.

INFO: The '-a' option was specified for the /usr/sbin/volrootmir command,
however the volume on dsk11 is not in the rootdg disk
group and will not be mirrored.

INFO: The '-a' option was specified for the /usr/sbin/volrootmir command,
however the volume on dsk14 is not in the rootdg disk
group and will not be mirrored.

INFO: The '-a' option was specified for the /usr/sbin/volrootmir command,
however the volume on dsk15 is not in the rootdg disk
group and will not be mirrored.

INFO: The '-a' option was specified for the /usr/sbin/volrootmir command,
however the volume on dsk10 is not in the rootdg disk
group and will not be mirrored.

Mirroring system disk dsk1 to disk rz16.
Mirroring rootvol to rz16a.
Mirroring swapvol to rz16b.
Mirroring vol-rz16g to rz16g.

Hmmm. It appears to have worked fine, though there's that note about partitions not being encapsulated. I assume it's referring to the empty partitions or the LSMsimp partition.

I'm left with a sinking feeling though. If the dkc1 disk I booted from goes bad, will I be able to boot from dkc0, or will I face a world of hurt and effort trying to boot from CD and restore from tape? I'd better schedule some downtime soon to try and boot from dkc0 and file a call with HP if it doesn't work. Just four more months with this server before we retire it!

In the meantime, the users are sucking up disk space faster than a "Stand by Me" leech sucks balls. I grabbed a couple of the unused 4.3GB drives, added them to the LSM config in voldiskadm as prod09 and prod10, then did:

# volassist make prodvol-09 8373900s prod09
# volassist -g prod mirror prodvol-09 prod10
# addvol /dev/vol/prod/prodvol-09 gfs_prod
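The length given to volassist ends in "s", meaning sectors. As a quick sanity check that 8373900 sectors really is about one 4.3GB drive, here is a small conversion sketch. It assumes 512-byte sectors, which is an assumption on my part and worth checking against the disk label. The function name sectors_to_gb is made up for this example.

```shell
# sectors_to_gb SECTORS: convert a sector count to decimal gigabytes.
# Assumes 512-byte sectors (an assumption; check the disk label).
sectors_to_gb() {
    awk -v s="$1" 'BEGIN { printf "%.2f\n", s * 512 / 1000000000 }'
}

# Volume length used in the volassist make command above.
sectors_to_gb 8373900
```

This prints 4.29, which lines up with a nominal 4.3GB drive after a little overhead. On older Solaris or Tru64 systems the default awk may not accept -v; nawk or an XPG4 awk should.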

I have about 8GB more disk space left. I hope it's enough to last four months.

Monday, January 7, 2008

Eudora no longer <<Dominant>>

Late last week, I began to get this error from Eudora on my Mac:

"error while performing unknown task for <<Dominant>>"

My sponsored version of Eudora was trying every two minutes to contact adserver.eudora.com to fetch an updated advertisement. The problem is, Qualcomm no longer develops or supports Eudora, and apparently turned off the ad server. Without a response from the server, the Eudora client would pop up an error every couple of minutes.

As a quick-n-dirty patch, I put an entry in my Mac's /etc/hosts file for "adserver.eudora.com" with an IP address that belongs to one of our local web servers and restarted Eudora. I haven't received the error since then. The access log on the web server shows that my Mac is requesting:

"POST /adjoin/playlists HTTP/1.0" 404 285

The web server is obviously replying with a 404 error, but at least Eudora doesn't time out now. Not the best fix, but the best for now.
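For anyone wanting to repeat the hosts-file trick, here is a rough sketch that adds the entry only if the name is not already mapped. The function name add_host_entry is made up, and 192.0.2.10 is a placeholder from the documentation address range, not the address of the real web server. The demonstration runs against a scratch file; on a real Mac the target would be /etc/hosts, edited as root.

```shell
# add_host_entry FILE IP NAME: append "IP<tab>NAME" to FILE unless NAME already
# appears as a host name on a non-comment line.
add_host_entry() {
    file=$1
    ip=$2
    name=$3
    if awk -v n="$name" '
        !/^[ \t]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$file"
    then
        return 0    # already present, nothing to do
    fi
    printf '%s\t%s\n' "$ip" "$name" >> "$file"
}

# Demonstration on a scratch copy (placeholder IP, see above).
hosts=$(mktemp)
printf '127.0.0.1\tlocalhost\n' > "$hosts"
add_host_entry "$hosts" 192.0.2.10 adserver.eudora.com
add_host_entry "$hosts" 192.0.2.10 adserver.eudora.com   # second call is a no-op
cat "$hosts"
```

The second call leaves the file unchanged, so running it again later will not pile up duplicate lines.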

Wednesday, January 2, 2008

Back to that Pesky Virtual Frame Buffer

Sigh. While my procedures with the Solaris virtual frame buffer (see previous posts) have usually been working okay, I had to modify the kill command to make it definitely exclude the grep process even though it should have excluded it anyway - very weird:

/usr/bin/kill `ps -ef | grep Xsun | grep -v grep | grep ":5" | awk '{print $2}'`
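One possible explanation, and it is only a guess, is that when the pipeline runs from a script or cron, the parent "sh -c ..." process carries the whole command text (including both "Xsun" and ":5") in its argument list, so it matches too. A text match on ":5" can also hit a timestamp in the ps output. Here is a sketch of a stricter filter that looks at the command name and arguments as separate fields. It expects input in the form produced by `ps -e -o pid= -o args=`. The function name xsun_pids and the sample process lines are invented for illustration.

```shell
# xsun_pids DISPLAY: read "PID COMMAND ARGS..." lines on stdin and print the PID
# of every process whose command name is Xsun (with or without a path) and
# which has DISPLAY (e.g. ":5") as a separate argument.
xsun_pids() {
    awk -v d="$1" '
        $2 ~ /(^|\/)Xsun$/ {
            for (i = 3; i <= NF; i++)
                if ($i == d) { print $1; break }
        }'
}

# Hypothetical sample of `ps -e -o pid= -o args=` output.
sample='  412 /usr/openwin/bin/Xsun :5 +nkeyboard +nmouse -dev vfb
  977 /usr/openwin/bin/Xsun :0 -nobanner
 1310 grep Xsun
 1311 sh -c ps -ef | grep Xsun | grep ":5"'

printf '%s\n' "$sample" | xsun_pids :5
```

Only PID 412 is printed. Real use would look something like /usr/bin/kill `ps -e -o pid= -o args= | xsun_pids :5`. On Solaris the default /usr/bin/awk may not accept -v, so nawk or /usr/xpg4/bin/awk is probably needed there.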

The user reports that they're occasionally getting this error in their logs:

fstat: Bad file number
(failed to stat vfb)
stat: No such file or directory
(failed to stat vfb)
fstat: Bad file number
(failed to stat vfb)
stat: No such file or directory
(failed to stat vfb)
X connection to localhost:5.0 broken (explicit kill or server shutdown).

This is due to the VFB running in a local Solaris zone. According to a post I found at http://forum.java.sun.com/thread.jspa?threadID=5233796&tstart=120 one can add /dev/winlock to the local zone this way to solve those errors:

"Steps to add the pseudo device '/dev/winlock' to the local zone:

As a superuser in the global zone,

1. Add the '/dev/winlock' pseudo device to the local zone:
global# zonecfg -z zonename
zonecfg:zonename> add device
zonecfg:zonename:device> set match=/dev/winlock
zonecfg:zonename:device> end
zonecfg:zonename> exit

With this, the local zone will have access to '/dev/winlock' device.

2. Reboot the local zone
global# zoneadm -z zonename reboot

Once the local zone is rebooted, '/dev/winlock' will now be available in the local zone."

I've made the change in the zone configuration so it will take effect the next time I reboot. I can see it now - I'll reboot, the instructions above will have been wrong, the zone won't come up, and it'll take me a half hour to remember what changes I made... hopefully I've made enough notes that I'll remember it.

Monday, December 31, 2007

Weekend Down in Flames

A page at 2:46 Saturday morning woke me up. Our Lawson server, an aging Digital Alpha 4100 running Tru64 Unix 5.1B, was complaining. Right away I knew that the list of chores I had planned for the weekend was going to take a hit.

A volume managed in LSM (Logical Storage Manager) had lost a plex and one of the secondary swap spaces was running unmirrored. It looked like a relatively simple disk failure, so I headed back to bed.

In the morning, I went to the office to gather up some paperwork, then drove to the computer room. The alarm in the disk cabinet was screaming like a banshee, and it was festooned with a number of amber lights. The cabinet contains two sets of redundant HSZ70 SCSI controllers. The top set is connected to an expansion unit on the back of the cabinet. There are about 70 disks in the cabinet. Four of them, all on channel 1 off the expansion unit, were blinking amber. Yikes!

I checked the CLI interface (Command Line Interface Interface) and could see that two raidsets were running reduced, the third disk was the swap mirror, and the fourth bad disk wasn't in use. Since everything was mirrored or raided, the system was still up and functioning. I filed a call with HP support at 1-800-354-9000 (1-800-DIAL-A-PRAYER). I tried for a while to convince the phone support tech that there was probably something wrong with the entire channel on the expansion unit, but the controller didn't have any errors on it, so he eventually sent the call to the local field tech and asked them to bring over four new drives. Uh huh. Fortunately, I know the field tech is better than that, and when he called, I explained the situation.

Later in the evening, the tech arrived at the computer room (I had since gone back home) and he unplugged the expansion unit, blew on it, and put it back in. That brought the disks back online. However, it also managed to take down all the disks on channel one on the front of the cabinet too, including the mirror of the root disk that I'd booted from. He rebuilt all the disks from the hardware side, but it left a mighty mess to clean up on the software side.

With the boot disk down and "replaced", that meant a shutdown and a boot from the other root mirror to free it up so it could be rebuilt.

1) Disassociate and remove the bad plexes from the root volume
volplex -o rm dis rootvol-01
volplex -o rm dis swapvol-01
volplex -o rm dis vol-rz16g-01

2) Remove the disks from the diskgroup
voldg rmdisk rz16a
voldg rmdisk rz16b
voldg rmdisk rz16f
voldg rmdisk rz16g

3) Remove disks from LSM:
voldisk rm rz16a
voldisk rm rz16b
voldisk rm rz16f
voldisk rm rz16g

4) Physically replace the disks using the HSZ commands (I didn't have to replace mine since it wasn't the disks that failed)

5) Label the new disk
disklabel -wr rz16

If that doesn't work, try:

disklabel -z rz16
disklabel -wr rz16

For me, trying to label the disk didn't work at all because it was the boot disk and the OS still had a partition open and claimed. I had to "shutdown -h now" and reboot from the >>> console prompt using the address of the mirrored (good) root drive. On my system a show bootdef_dev showed that the default boot disk was dkc0..., so I used "boot dkc1" to boot from the mirror.

When the system rebooted, I was then able to continue with the disklabel:

disklabel -wr rz16

6) Mirror the root drive
volrootmir -a rz16

That command will build a mirror on the disk specified.

After all that, I tried to remirror the additional swap volume, but ran into a roadblock.

The disks were set up in LSM when the system was installed using Tru64 4.0x. I had since upgraded to Tru64 5.1B. LSM used to use a private region of 1024, but the new version uses a private region of 4096. A bigger private region leaves less room for data on a disk of the same size, so when I tried to add the new disk using voldiskadm and then mirror it, it said there wasn't enough space to complete the mirror:

# volassist mirror swapvol02 swap01
lsm:volassist: ERROR: Cannot allocate space to mirror 8376988 block volume
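The arithmetic behind that error can be sketched roughly as follows. The volume size comes from the error message. The disk size is hypothetical: it assumes the old layout filled the disk exactly, and it assumes the private region length is counted in the same blocks as the volume. Real disks carry a bit of other overhead too, so treat this as an illustration, not a measurement. The function name public_blocks is made up.

```shell
# public_blocks DISK_BLOCKS PRIVLEN: blocks left for data once the private
# region has been carved out (simplified; ignores any other overhead).
public_blocks() {
    echo $(( $1 - $2 ))
}

vol=8376988                  # volume size from the error message above
disk=$(( vol + 1024 ))       # hypothetical disk that the old layout filled exactly

public_blocks "$disk" 1024   # old private region: the volume just fits
public_blocks "$disk" 4096   # new private region: 3072 blocks short
```

Under those assumptions the first call prints 8376988 and the second prints 8373916, which is smaller than the volume being mirrored.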

I then tried to write a disklabel from the good mirror onto the disk before adding it with voldiskadm (dsk13 is the good disk, dsk12 is the new one):

disklabel -r dsk13 > dsk13.lab
disklabel -R dsk12 dsk13.lab

I then added the disk through voldiskadm and chose not to initialize it. But then I got this error:

# volassist mirror swapvol02 swap01
lsm:volplex: ERROR: Volume swapvol02, plex swapvol02-01, block 0: Plex write:
Error: Write failure
lsm:volplex: ERROR: sd swap01-01 in plex swapvol02-01 failed during attach
lsm:volplex: ERROR: changing plex swapvol02-01:
Record not in disk group
lsm:volplex: ERROR: Attempting to cleanup after failure ...
lsm:volassist: ERROR: Could not attach new mirror(s) to volume swapvol02
lsm:volassist: WARNING: Object swapvol02-01: Unexpectedly removed from the configuration

Ouch. I turned back to Google and found two tech forum posts from people who'd encountered the same thing, but no one had answered them.

I called HP support and they sent me these three simple little commands:

# voldisksetup -i dsk12 privlen=1024
# voldg adddisk swap01=dsk12
# volassist mirror swapvol02 swap01

Worked like a charm, and the system is back to normal with everything mirrored and raided properly.

However, there's still an amber warning light on the cabinet. Since the controllers aren't reporting any errors, the field tech's best guess is a problem with one of the many fans in the cabinet. He'll be coming by Wednesday with a few new fans to try some replacements to see if he can get that pesky amber light to go away.

Wednesday, December 19, 2007

Quote of the Day

Here's my favorite quote today, found while browsing to try to find an AIX hardware support matrix. Apparently IBM is stingy with that information too.

From "inferno" on www.pseriestech.org/forum/aix:

"I do not get it, I explained that I was interested in getting IBM certified. Companies like Sun Microsystems and Red Hat will constantly harrass you and damn near send a limo and a bunch of exotic dancers to your home in order to get you certified. So what is wrong with IBM?"