Monday, December 31, 2007

Weekend Down in Flames

A page at 2:46 Saturday morning woke me up. Our Lawson server, an aging Digital Alpha 4100 running Tru64 Unix 5.1B, was complaining. Right away I knew that the list of chores I'd planned for the weekend was going to take a hit.

A volume managed in LSM (Logical Storage Manager) had lost a plex and one of the secondary swap spaces was running unmirrored. It looked like a relatively simple disk failure, so I headed back to bed.

In the morning, I went to the office to gather up some paperwork, then drove to the computer room. The alarm in the disk cabinet was screaming like a banshee, and it was festooned with a number of amber lights. The cabinet contains two sets of redundant HSZ70 SCSI controllers. The top set is connected to an expansion unit on the back of the cabinet. There are about 70 disks in the cabinet. Four of them, all on channel 1 off the expansion unit, were blinking amber. Yikes!

I checked the CLI interface (Command Line Interface Interface) and could see that two raidsets were running reduced, the third bad disk was the swap mirror, and the fourth wasn't in use at all. Since everything was mirrored or raided, the system was still up and functioning. I filed a call with HP support at 1-800-354-9000 (1-800-DIAL-A-PRAYER). I tried for a while to convince the phone support tech that there was probably something wrong with the entire channel on the expansion unit, but since the controller wasn't logging any errors, he eventually sent the call to the local field tech and asked him to bring over four new drives. Uh huh. Fortunately, I know the field tech is better than that, and when he called, I explained the situation.

Later in the evening, the tech arrived at the computer room (I had since gone back home) and he unplugged the expansion unit, blew on it, and put it back in. That brought the disks back online. However, it also managed to take down all the disks on channel one on the front of the cabinet too, including the mirror of the root disk that I'd booted from. He rebuilt all the disks from the hardware side, but it left a mighty mess to clean up on the software side.

With the boot disk down and "replaced", I needed to shut down and boot from the other root mirror to free up the failed disk so it could be rebuilt. Here's the procedure:

1) Disassociate and remove the bad plexes from the root volume
volplex -o rm dis rootvol-01
volplex -o rm dis swapvol-01
volplex -o rm dis vol-rz16g-01

2) Remove the disks from the diskgroup
voldg rmdisk rz16a
voldg rmdisk rz16b
voldg rmdisk rz16f
voldg rmdisk rz16g

3) Remove disks from LSM:
voldisk rm rz16a
voldisk rm rz16b
voldisk rm rz16f
voldisk rm rz16g

4) Physically replace the disks using the HSZ commands (I didn't have to replace mine since it wasn't the disks that failed)

5) Label the new disk
disklabel -wr rz16

If that doesn't work, try:

disklabel -z rz16
disklabel -wr rz16

For me, trying to label the disk didn't work at all because it was the boot disk and the OS still had a partition open and claimed. I had to "shutdown -h now" and reboot from the >>> console prompt using the address of the mirrored (good) root drive. On my system, "show bootdef_dev" showed that the default boot disk was dkc0..., so I used "boot dkc1" to boot from the mirror.

When the system rebooted, I was then able to continue with the disklabel:

disklabel -wr rz16

6) Mirror the root drive
volrootmir -a rz16

That command will build a mirror on the disk specified.

After all that, I tried to remirror the additional swap volume, but ran into a roadblock.

The disks were set up in LSM when the system was installed under Tru64 4.0x, and I had since upgraded to Tru64 5.1B. The old LSM used a private region of 1024 blocks, but the new version uses 4096. When I tried to add the new disk using voldiskadm and then mirror it, volassist said there wasn't enough space to complete the mirror:

# volassist mirror swapvol02 swap01
lsm:volassist: ERROR: Cannot allocate space to mirror 8376988 block volume

I then tried to write a disklabel from the good mirror onto the disk before adding it with voldiskadm (dsk13 is the good disk, dsk12 is the new one):

disklabel -r dsk13 > dsk13.lab
disklabel -R dsk12 dsk13.lab

I then added the disk through voldiskadm and chose not to initialize it. But then I got this error:

# volassist mirror swapvol02 swap01
lsm:volplex: ERROR: Volume swapvol02, plex swapvol02-01, block 0: Plex write:
Error: Write failure
lsm:volplex: ERROR: sd swap01-01 in plex swapvol02-01 failed during attach
lsm:volplex: ERROR: changing plex swapvol02-01:
Record not in disk group
lsm:volplex: ERROR: Attempting to cleanup after failure ...
lsm:volassist: ERROR: Could not attach new mirror(s) to volume swapvol02
lsm:volassist: WARNING: Object swapvol02-01: Unexpectedly removed from the configuration

Ouch. I turned back to Google and found two tech forum posts from people who'd encountered the same thing, but no one had answered them.

I called HP support and they sent me these three simple little commands:

# voldisksetup -i dsk12 privlen=1024
# voldg adddisk swap01=dsk12
# volassist mirror swapvol02 swap01

Worked like a charm, and the system is back to normal with everything mirrored and raided properly.

However, there's still an amber warning light on the cabinet. Since the controllers aren't reporting any errors, the field tech's best guess is a problem with one of the many fans in the cabinet. He'll be coming by Wednesday with a few new fans to try some replacements to see if he can get that pesky amber light to go away.

Wednesday, December 19, 2007

Quote of the Day

Here's my favorite quote today, found while trying to track down an AIX hardware support matrix. Apparently IBM is stingy with that information too.

From "inferno" on www.pseriestech.org/forum/aix:

"I do not get it, I explained that I was interested in getting IBM certified. Companies like Sun Microsystems and Red Hat will constantly harrass you and damn near send a limo and a bunch of exotic dancers to your home in order to get you certified. So what is wrong with IBM?"

Thursday, December 13, 2007

X11 Forwarding over SSH

Holy crap. I just spent about five hours banging my head against a wall.

In an effort to try to secure connections, I've been trying to get more things tunneled through ssh to lock down some more ports to our DMZ network. Today, I've been working on getting X Windows applications to tunnel over ssh to my PC.

I read a couple online manuals.

I connected from PuTTY on my PC to a Solaris 9 box on our DMZ. The SSH session came right up. I turned on X11 forwarding in PuTTY, enabled it on the server, and tried to log in again. No dice. It closed the connection right after I typed in the password. I must be doing something wrong.

I did some more web searching. Read several more tutorials about ssh and X11 forwarding. Still no dice. Still must be doing something wrong. Click this. Click that. Edit this config file, edit that. Nope. Passive, active, indirect. Nope. Port forwarding. No port forwarding. Nope. DISPLAY set. DISPLAY not set. Nope. Nope. Nothing in the PuTTY logs.

I really must not understand this ssh/X11 forwarding thing at all. Yet every document I read has virtually the same instructions. What could I be missing?

I finally happened to check the error log on the Solaris 9 box itself. Sure enough, there were errors that corresponded to each of my login attempts. I did some more web searching and finally found it: a bug report for a Solaris 9 patch that causes X11 forwarding to fail. Effin' A.

I tried one of my Solaris 10 servers. Worked the first time. Five hours gone up in smoke. No wonder I'm quiet at the dinner table. I just worked hard all day doing nothing.

Friday, December 7, 2007

Reason #682 Why I Get Frustrated with IBM

An application developer down the hall came in the office late yesterday asking if I could install the HP LaserJet 4000 printer drivers on one of our IBM AIX servers. "Sure, no problem," I said, knowing in my core I was about to embark on a painful adventure.

This morning, I set about trying to locate the printer drivers on IBM's website. Fifteen minutes of thrashing about... no deal. They link to a huge list of some Infoprint driver crap, but nothing for HP printers.

I checked out HP's website, which is almost as poorly designed as IBM's. IBM's is worse because not only is the design bad, but they don't let you download anything useful. HP has lots of useful stuff on their website, but it's just really hard to find.

Anyway, after a few minutes of thrashing on HP's website, I found they have some generic un*x drivers for the LaserJet 4000 series, but I knew that wasn't going to fly on the AIX server. I needed the *.rte files from IBM.

I turned to Google, and found many posts to technical forums, and each one went like this:

Question: Where on the web can I find HP printer drivers for IBM?

Answer: You can't. I think they're on the installation CDs somewhere.

I then dove into my storage cabinet and pulled out my big box o' AIX CDs. I located the AIX 5.2 install CDs (seven of them) for that server as well as numerous other randomly labeled media. I popped the first CD into my Mac and started searching for something that looked like printer drivers. About this time, the developer guy popped in and said that his vendor said it was on disc one. Hmmm... okay.

Longer story shorter, it's not. It's disc three. I'll say it again to try to make sure Google picks it up for you out there searching the web: the HP LaserJet 4000 series printer drivers for AIX 5.2 are on installation CD number 3 of 7.

Just pop that cd into your server, or remotely mount it via NFS like I did, and run:

# smitty printers
Printer/Plotter Devices
Install Additional Printer/Plotter Software
choose /cdrom or whatever mount point you used
hit list to choose the printer drivers you want

The trick is finding them. Once you find them, it's super easy.
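
If you'd rather skip the smitty menus, the same install can be driven with installp directly. A hedged sketch, not something I ran verbatim: the actual fileset name is whatever the listing shows, so <fileset_name> below is a placeholder.

```
# installp -ld /cdrom | grep -i hp
  (lists the installable filesets on the CD; find the HP LaserJet one)
# installp -acgXd /cdrom <fileset_name>
  (installs it: apply, commit, pull in prereqs, expand filesystems as needed)
```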

Thursday, December 6, 2007

Update: Virtual Frame Buffer for use with Oracle Reports Server

Well, it turns out that the handy VFB I described a couple posts down doesn't work with SQR. I found a Hyperion document that had corrected resolution settings for the virtual buffer:

/usr/openwin/bin/Xvfb :5 -dev vfb screen 0 1152x900x8 &
/usr/openwin/bin/twm -display :5 -v &
DISPLAY=:5.0; export DISPLAY

Apparently, SQR is picky about the resolution. 1152x900x8 works, 1600x1200x32 did not.

Wednesday, November 21, 2007

X Server Basics

We have been using a commercial X Windows server on our PCs to get GUI access to our unix boxes for years. Recently though, a new set of people here need to get access to a new AIX system we have. They didn't want to spring for commercial licenses, so I introduced them to the Cygwin/X server software from http://x.cygwin.com, available at no cost under a modified GNU license.

Download the software, making sure to choose to install the inetutils and xorg-x11 portions. I also installed the openssh piece so I could use SSH to connect to servers if I wanted to.

Once installed, launch Cygwin and it'll give you a unix-like terminal interface to your PC files.

Make sure dtlogin is running on the remote unix host. If it isn't, run this on the host:

# /usr/dt/bin/dtlogin &

Then on your PC in the Cygwin window, run:

XWin -query <remote_hostname> -from <my_pc_hostname_or_ip>

That will launch the nice GUI Xwindows login for the remote host.
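
Since the openssh piece is installed anyway, here's an alternative to the XDMCP-style login above: run the X server locally and tunnel individual clients over SSH instead. A sketch only; the hostname is a placeholder and your setup will differ.

```
XWin -multiwindow &        # local X server; remote clients appear as normal windows
export DISPLAY=:0.0
ssh -X <remote_hostname>   # -X enables X11 forwarding for this session
```

Any X client you start in that SSH session (xterm, dtterm, etc.) then pops up on your PC, without needing an open X11 port through the network.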

Friday, November 16, 2007

Virtual Frame Buffer for use with Oracle Reports Server

Our DBA set up Oracle Reports Server on one of my Solaris unix servers so that our applications folks can make pretty graphs and send them out to administration. Oracle Reports Server requires a connection to an X-Server to draw the pretty graphs, even though it's running in batch. Wonderful. There's no graphics console on the server, so...

Our DBA went over to a unix workstation he runs testing on, logged in, ran xhost + to allow the process (well, everyone, really) to connect, and then set the DISPLAY variable in the application script to point the X-Server access at the workstation. A pretty kludgy solution, but it worked.

All that worked fine until someone stepped on the switch on the power strip for the workstation and it was down over the weekend without anyone noticing. I started it back up, but didn't log in and had no idea about the DISPLAY setting on the other production server. Fast forward about a week, imagine applications developers running around screaming about their graphing not working, and you've got a good picture.

Our DBA eventually remembered the DISPLAY connection he'd set up, and we got the workstation logged back in. Then I went to work finding an alternative.

I located these documents:

http://www.sun.com/bigadmin/content/submitted/virtual_buffer.html
http://www.idevelopment.info/data/Unix/General_UNIX/GENERAL_XvfbWithOracle9iAS.shtml

They were helpful but, of course, slightly incorrect. Based on their recommendations, with a tweak to the Xvfb command to get the syntax correct, I came up with the following:

In the script that does the graphics, these three lines start up the virtual frame buffer, twm, and set up the DISPLAY variable correctly:

/usr/openwin/bin/Xvfb :5 -dev vfb screen 0 1600x1200x32 &
/usr/openwin/bin/twm -display :5 -v &
DISPLAY=:5.0; export DISPLAY

At the end of the script, this line kills the vfb to make sure it's not hanging around doing nothing:

/usr/bin/kill `ps -ef | grep Xsun | grep :5 | awk '{print $2}'` > /dev/null 2>&1

Starting up the virtual frame buffer gives the Oracle Reports Server process something to connect to, and it runs on the local host even though there's no graphics console. Nice!
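
The PID-extraction part of that kill line is just text filtering, and you can see what it picks out by running it against canned ps-style output (the printf is a stand-in so this runs anywhere; the real script pipes live "ps -ef" instead):

```shell
# Demo of the grep/awk PID extraction from the kill line above.
# The printf fakes two lines of `ps -ef` output: the Xsun line for
# display :5 matches both greps, and awk prints its PID (field 2).
printf '%s\n' \
  '    root  4242     1  0 10:01:02 ?  0:01 /usr/openwin/bin/Xsun :5 -dev vfb' \
  '    root  5555     1  0 10:01:03 ?  0:00 /usr/openwin/bin/twm -display :5' |
grep Xsun | grep :5 | awk '{print $2}'
# → 4242
```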

Now I'll go log out of that workstation...

Friday, November 9, 2007

Customize Mailman Messages

To create a customized welcome message in Mailman 2.1.5 that's sent to users when they subscribe, follow these steps:

Create a directory mailman/data/lists/yourlist/en (for English language) and copy subscribeack.txt from /usr/local/mailman/templates into that directory. Customize it to whatever you like. Mailman will use this custom template for the welcome message.
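
Concretely, the copy looks like the sketch below. It builds a throwaway prefix with mktemp so it's safe to run anywhere; on a real install, MM would be the Mailman base directory described above, and the echo line stands in for the stock template that's already there.

```shell
# Self-contained sketch of the per-list template override described above.
# MM stands in for the real Mailman base directory (/usr/local/mailman);
# mktemp and echo just fake enough of an install to demonstrate the copy.
MM=$(mktemp -d)
LIST=yourlist
mkdir -p "$MM/templates" "$MM/data/lists/$LIST/en"
echo 'Welcome to the list!' > "$MM/templates/subscribeack.txt"  # stand-in stock template
cp "$MM/templates/subscribeack.txt" "$MM/data/lists/$LIST/en/subscribeack.txt"
ls "$MM/data/lists/$LIST/en"
# → subscribeack.txt
```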

Schweet. This is particularly handy for distribution-only lists where the "To post to this list..." instructions in the welcome message are confusing since regular subscribers would get rejected were they to follow those instructions.

Purge Those MySQL Binary Logs

I'm sure this is old hat to real MySQL people, but I'm pretty new to MySQL, especially replication, and our web server is usually pretty quiet, so I was a little surprised when I got a disk space warning because the binary logs had grown so large.

Signing onto the slave server, I ran "show slave status;" at the mysql> prompt to show that the server was reading from the binary log called "mysql-bin.004" on the master.

Logging onto the master, I ran "show master logs;" at the mysql> prompt (show binary logs; is supposed to work but did not - probably a version thing) to display the current logs saved in the mysql/var directory:

mysql> show master logs;
+---------------+
| Log_name |
+---------------+
| mysql-bin.002 |
| mysql-bin.003 |
| mysql-bin.004 |
+---------------+
3 rows in set (0.00 sec)

I then ran the purge master logs command to get rid of the deadwood:

mysql> purge master logs to 'mysql-bin.004';
Query OK, 0 rows affected (0.02 sec)

mysql> show master logs;
+---------------+
| Log_name |
+---------------+
| mysql-bin.004 |
+---------------+
1 row in set (0.00 sec)

It deleted the big files and we're out of the woods for disk space. I should probably set the max_binlog_size variable a little lower so the server creates more, smaller logs and I don't end up with a monstrous active log file and no old ones to purge.
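
A hedged my.cnf sketch of that idea; the variable names are from the MySQL manual, but the values are illustrative, not recommendations:

```ini
# my.cnf sketch: roll binlogs at a smaller size and age them out
[mysqld]
max_binlog_size  = 100M   # start a new binlog once the current one hits ~100MB
expire_logs_days = 7      # automatically purge binlogs older than 7 days
```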

Wednesday, October 24, 2007

Perfect Storm Disk Replacement

I recently had a drive go bad in a Sun StorEdge 3510 FC JBOD array connected to a V490 running Solaris 10 with Solaris Volume Manager. The disk was part of a five-disk stripeset that was mirrored with another stripeset.

It was *not* easy finding documentation for getting this done. Tools I'd used on other systems with SCSI-attached arrays, and on systems with FC-attached RAID arrays, did not work. The combination of a JBOD 3510 on FC, managed by Solaris Volume Manager, with an active hot spare made it interesting. So without further ado...

How To Replace a Failed Drive on a JBOD Sun StorEdge 3510 FC Array That Has Been Failed Over to a Hot Spare Managed by Volume Manager in Solaris 10 (whew!)

Here's the device with the bad disk c1t10d0s0 that was replaced with the hot spare from c1t11d0s0:

# metastat d15
d15: Mirror
Submirror 0: d16
State: Okay
Submirror 1: d17
State: Okay
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 716634624 blocks (341 GB)

d16: Submirror of d15
State: Okay
Hot spare pool: hsp000
Size: 716634624 blocks (341 GB)
Stripe 0: (interlace: 256 blocks)
Device Start Block Dbase State Reloc Hot Spare
c1t4d0s0 20352 Yes Okay Yes
c1t3d0s0 20352 Yes Okay Yes
c1t2d0s0 20352 Yes Okay Yes
c1t1d0s0 20352 Yes Okay Yes
c1t0d0s0 20352 Yes Okay Yes

d17: Submirror of d15
State: Okay
Hot spare pool: hsp000
Size: 716634624 blocks (341 GB)
Stripe 0: (interlace: 256 blocks)
Device Start Block Dbase State Reloc Hot Spare
c1t9d0s0 20352 Yes Okay Yes
c1t8d0s0 20352 Yes Okay Yes
c1t7d0s0 20352 Yes Okay Yes
c1t6d0s0 20352 Yes Okay Yes
c1t10d0s0 20352 No Okay Yes c1t11d0s0

Device Relocation Information:
Device Reloc Device ID
c1t4d0 Yes id1,ssd@n20000011c6968cf9
c1t3d0 Yes id1,ssd@n20000011c6967f16
c1t2d0 Yes id1,ssd@n20000011c6968c7c
c1t1d0 Yes id1,ssd@n20000011c68baaed
c1t0d0 Yes id1,ssd@n20000011c6968ca1
c1t9d0 Yes id1,ssd@n20000011c6967e6e
c1t8d0 Yes id1,ssd@n20000011c68b0388
c1t7d0 Yes id1,ssd@n20000011c68deaaf
c1t6d0 Yes id1,ssd@n20000011c6969259
c1t11d0 Yes id1,ssd@n20000011c68bbb2d

I removed the state database replicas that were on c1t10d0, but I'm not convinced I had to do that before continuing.

The cfgadm command can show the attachment point for the disk.

# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c0 scsi-bus connected configured unknown
c0::dsk/c0t0d0 CD-ROM connected configured unknown
c1 fc-private connected configured unknown
c1::22000011c68b0388 disk connected configured unknown
c1::22000011c68b5cb3 disk connected configured unknown
c1::22000011c68baaed disk connected configured unknown
c1::22000011c68bbb2d disk connected configured unknown
c1::22000011c68deaaf disk connected configured unknown
c1::22000011c6967e6e disk connected configured unknown
c1::22000011c6967f16 disk connected configured unknown
c1::22000011c6968c7c disk connected configured unknown
c1::22000011c6968ca1 disk connected configured unknown
c1::22000011c6968cf9 disk connected configured unknown
c1::22000011c6969259 disk connected configured unknown
c1::22000011c696a895 disk connected configured unknown
c1::225000c0ff086290 ESI connected configured unknown
c2 fc-private connected configured unknown
c2::500000e01127c191 disk connected configured unknown
c2::500000e01127c8a1 disk connected configured unknown
usb0/1 unknown empty unconfigured ok
usb0/2 unknown empty unconfigured ok
usb0/3 unknown empty unconfigured ok
usb0/4 unknown empty unconfigured ok

However, both the cfgadm and luxadm commands are unable to remove the drive since it's on a fiber loop and is a JBOD array.

# cfgadm -x replace_device c1::22000011c68b5cb3
cfgadm: Configuration operation not supported

# luxadm remove_device 22000011c68b5cb3

WARNING!!! Please ensure that no filesystems are mounted on these device(s).
All data on these devices should have been backed up.

Error: Invalid path. Device is not a SENA subsystem. - 22000011c68b5cb3.

Instead, use luxadm to offline the bad disk:

# luxadm -e offline /dev/rdsk/c1t10d0s2

Then devfsadm to remove the dev entries:

# devfsadm -Cv
devfsadm[3915]: verbose: removing file: /dev/dsk/c1t10d0s0
devfsadm[3915]: verbose: removing file: /dev/dsk/c1t10d0s1
devfsadm[3915]: verbose: removing file: /dev/dsk/c1t10d0s2
devfsadm[3915]: verbose: removing file: /dev/dsk/c1t10d0s3
devfsadm[3915]: verbose: removing file: /dev/dsk/c1t10d0s4
devfsadm[3915]: verbose: removing file: /dev/dsk/c1t10d0s5
devfsadm[3915]: verbose: removing file: /dev/dsk/c1t10d0s6
devfsadm[3915]: verbose: removing file: /dev/dsk/c1t10d0s7
devfsadm[3915]: verbose: removing file: /dev/rdsk/c1t10d0s0
devfsadm[3915]: verbose: removing file: /dev/rdsk/c1t10d0s1
devfsadm[3915]: verbose: removing file: /dev/rdsk/c1t10d0s2
devfsadm[3915]: verbose: removing file: /dev/rdsk/c1t10d0s3
devfsadm[3915]: verbose: removing file: /dev/rdsk/c1t10d0s4
devfsadm[3915]: verbose: removing file: /dev/rdsk/c1t10d0s5
devfsadm[3915]: verbose: removing file: /dev/rdsk/c1t10d0s6
devfsadm[3915]: verbose: removing file: /dev/rdsk/c1t10d0s7

The output from cfgadm now shows the device as unusable:

# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c0 scsi-bus connected configured unknown
c0::dsk/c0t0d0 CD-ROM connected configured unknown
c1 fc-private connected configured unknown
c1::22000011c68b0388 disk connected configured unknown
c1::22000011c68b5cb3 disk connected configured unusable
c1::22000011c68baaed disk connected configured unknown
c1::22000011c68bbb2d disk connected configured unknown
[...snip...]

Physically replace the device. In the 3510 JBOD array with the default boxid of zero (check the button hidden under the left plastic ear tab), the disk layout looks like this:

0 3 6 9
1 4 7 10
2 5 8 11

(0 to 11 counting down columns first then over rows)
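
Since the slots run down each 3-disk column before moving to the next column, a slot number maps to a grid position by simple division. A quick arithmetic sanity check, runnable in any POSIX shell:

```shell
# Map a 3510 slot number (0-11) to its physical position in the
# 3-row by 4-column grid above.
slot=10
col=$((slot / 3))   # which column, 0-3, counting from the left
row=$((slot % 3))   # which row, 0-2, counting from the top
echo "slot $slot -> row $row, col $col"
# → slot 10 -> row 1, col 3
```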

When the disk is replaced, the devfsadm daemon should pick up the disk immediately and configure the dev entries. If not, try this to see what the problem is:

# luxadm -e port
/devices/pci@9,600000/SUNW,qlc@2/fp@0,0:devctl CONNECTED
/devices/pci@8,600000/SUNW,qlc@1/fp@0,0:devctl CONNECTED

Note: If you get a "NOT CONNECTED" error on the 3510 path, check cfgadm to see if the fiber connection is connected.

# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c0 scsi-bus connected configured unknown
c0::dsk/c0t0d0 CD-ROM connected configured unknown
c1 fc-private connected configured unknown
c1::22000011c68b0388 disk connected configured unknown
c1::22000011c68baaed disk connected configured unknown
c1::22000011c68bbb2d disk connected configured unknown
c1::22000011c68deaaf disk connected configured unknown
c1::22000011c6967e6e disk connected configured unknown
c1::22000011c6967f16 disk connected configured unknown
c1::22000011c6968c7c disk connected configured unknown
c1::22000011c6968ca1 disk connected configured unknown
c1::22000011c6968cf9 disk connected configured unknown
c1::22000011c6969259 disk connected configured unknown
c1::22000011c696a895 disk connected configured unknown
c1::225000c0ff086290 ESI connected configured unknown
c1::500000e014cb0282 disk connected configured unknown
c2 fc-private connected configured unknown
c2::500000e01127c191 disk connected configured unknown
c2::500000e01127c8a1 disk connected configured unknown
usb0/1 unknown empty unconfigured ok
usb0/2 unknown empty unconfigured ok
usb0/3 unknown empty unconfigured ok
usb0/4 unknown empty unconfigured ok

If the controller isn't there or is unconfigured, try the following (substituting your controller for cx):

# cfgadm -c configure cx

If the drives appear with a condition of "unusable", do the following, using the path name from the luxadm -e port command above:

# luxadm -e forcelip /devices/pci@9,600000/SUNW,qlc@2/fp@0,0:devctl

Once the dev devices for the replaced drive are back in, use format to partition the new drive like the old one used to be. You can use the partition map from the hot spare as a template.

Once the drive is partitioned, add any database replicas that may have been on the original device (I should mention that I forgot to do that, so I'm not 100% sure that works), then do a metareplace to trigger the hot spare to go back to available and the replaced drive to start resyncing:

# metareplace -e d17 c1t10d0s0

Show progress with:

# metastat | grep %

Resync in progress: 73 % done

and see that the hot spare is available again with:

# metahs -i
hsp000: 2 hot spares
Device Status Length Reloc
c1t11d0s0 Available 143349312 blocks Yes
c1t5d0s0 Available 143349312 blocks Yes

Device Relocation Information:
Device Reloc Device ID
c1t11d0 Yes id1,ssd@n20000011c68bbb2d
c1t5d0 Yes id1,ssd@n20000011c696a895

keywords: 3510 storedge storagetek solaris volume manager hot spare fc fiber channel jbod

Wednesday, August 8, 2007

Solaris Link Aggregation Update

The network guys set up the Cisco switch with LACP active for my two ports, and the connection came right up. However, the load balancing is quite far from the 50/50 split across the interfaces that I expected to see. My research continues today. See my previous post for details of how this project started.

Tuesday, August 7, 2007

Solaris Link Aggregation

I'm setting up a server that will have quite a bit of network traffic. It's a Sun Microsystems V245 with four built-in bge network interfaces. I've connected two of them and am hoping to aggregate them to combine their bandwidth into a logically bigger pipe.

Link aggregation used to be called Trunking in earlier versions of Solaris. Fortunately, I'm running Solaris 10 1/06 or later, the first release to natively support aggregation. Before that, one needed separate Sun Trunking software.

The Solaris System Administration Guide: IP Services contains the information you'll need to do this, though there are a couple of typos in the manual to work around.

Quick and dirty:

If you want to include a live network connection in the aggregate, you have to unplumb it first with "ifconfig bge0 unplumb" for example. You need to be on the console since that will drop your connection, of course.

Do "eeprom local-mac-address?" to make sure it's true. If it's not, do "eeprom local-mac-address?=true".

Your interfaces must be of the type bge, e1000g, or xge, and must run at the same speed and in full duplex mode (check with "dladm show-link").

Next, set up the aggregate interface with "dladm create-aggr -d bge0 -d bge1 1". That will set up an interface called "aggr1" with both physical interfaces, as shown with "dladm show-aggr".

Finally, do a "dladm modify-aggr -l passive 1", assuming that you'll be making the switch you're connected to (see below) "active" for LACP. At least one side has to be active: you can make both sides active, or one side active and the other passive, but if both sides were passive no negotiation would take place.

For IPv4 addresses, create /etc/hostname.aggr1 (not /etc/hostname.aggr.1 as shown in the manual) with the hostname of the server in the file, matching the hostname to ip definition in /etc/hosts. Touch /reconfigure and reboot or "reboot -- -r" to do a reconfiguration reboot.

Do an "ifconfig -a" to show that aggr1 is now the defined interface with your correct mask.

If your interfaces are connected to a switch, as mine are, you need to configure the switch ports as an aggregation, and if the switch supports LACP, it must be configured in either active or passive mode (either one, but not off).
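
Pulled together, the host-side sequence above looks like this on my setup (the interface names and the aggregation key "1" are mine, "myhostname" is a placeholder, and you need to be on the console since the unplumb drops your network connection):

```
# ifconfig bge0 unplumb
# ifconfig bge1 unplumb
# eeprom local-mac-address?=true
# dladm create-aggr -d bge0 -d bge1 1
# dladm modify-aggr -l passive 1
# echo myhostname > /etc/hostname.aggr1
# touch /reconfigure
# reboot -- -r
```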

I configured my aggregation and the links are up and running. However, since my network guy hasn't configured the switch yet, I'm getting "WARNING: IP: Hardware address 'xx.xx.xx.xx.xx.xx' trying to be our address xxx.xxx.xxx.xxx!" messages in the messages log. They should go away when the switch is configured properly. Hopefully that won't be too much of a hassle for our guys.

Monday, July 30, 2007

IBM p660 7026-6H1 RAM Installation

We have an IBM p660 model 7026-6H1 that required 2GB more RAM. The system started out with 2GB in it. A couple years ago, we had IBM install 2GB more. At that time, IBM was still selling the new DIMMs. We bought it from them and the CE came in to do the installation.
Due to application upgrades, we needed 2GB more. IBM no longer sells new memory for that box. (By the way, have you ever seen a company retire hardware faster than IBM?) Off to the refurb market, I purchased 2GB more (4 x 512MB DIMMs FRU 0000033P3584). The DIMMs must be installed in quads.

I searched the 'net for DIY installation instructions. Ha! IBM keeps a tight lid on such documents. Sun Microsystems, for contrast, keeps an online library of every document under the sun, no pun intended, and ships CDs with their servers containing animations of how to install whatever you want. Not IBM. No sir-ee, that there computer is far too complicated for anyone except a $300/hour IBM engineer to work on. We could show you the documents, but you'd only hurt yourself.

Anyway, I was watching when the CE installed the RAM a couple years ago, so I figured I could take a pretty good whack at it. I scheduled the downtime and grabbed my toolkit and static strap.

Our p660 is a multi-processor box. There's a special rule about those boxes with single processors and how much memory they can hold, so I can't help you there. Our system is a four-way and the CPU shelf (not the I/O shelf) contains a 16-slot memory expansion board.

I had to unplug a couple of items like the keyboard and mouse to get the CPU shelf to slide out the back of the rack far enough to get the top off. Two easy screws, no problem. The RAM expansion board is on the right, looking at it from the back. I think there are two in there, actually. I pulled up on the tabs for the leftmost one. It had sixteen slots, eight already occupied with 512MB DIMMs: two fore-and-aft slots in each of columns 1, 2, 7 and 8, counting left to right as viewed from the rear. The memory has to be installed in sets of four, symmetrically left and right about the center line. (I know this because I first installed the DIMMs symmetrically forward and back, and it no worky.) I put a DIMM in the fore and aft slots of columns 3 and 6 and booted, confirming with "bootinfo -r" and "lsattr -El sys0 -a realmem" that the system was now showing 6GB installed.

DIY is much cheaper than purchasing an MES to have IBM do the installation. This of course doesn't address the issue of how IBM will feel when/if the RAM goes bad, I call them for contract service, and they find refurbished DIMMs installed by "unqualified" personnel.

Gotta' Start Somewhere

I depend on Google and the Internet to do my job. I'm a systems admin and it's hard to remember how I ever did my job without the ability to search the Internet and find people having problems just like me and their solutions to those problems. I suppose I did it more slowly. In fact, I was not a system administrator when the web was born, so that's probably the reason I can't remember what it was like.

Around 1993, when I started playing with Mosaic and installed my employer's first web server and homepage, I was just the help desk guy and also helped out with some systems stuff, some network stuff, and some database programming stuff. Now, fourteen years later (yikes!) our web page has long been in the hands of others, but the web server itself is still mine. I'm now the Unix (and sometimes VMS for what we have left) guy around here.

This blog will have some of my notes from work in it. You will likely see the more interesting, difficult, or entertaining computer problems I encounter. I have found countless solutions to problems on the net, in tech forums, help pages, and blogs like this. I've always felt like I should give back a little. This blog is my modest attempt to do that.

Of course, it's actually a selfish endeavor as it will help me remember what experience I've had and will give me an easy-to-search resource of my own past solutions if the problems pop up again.