Thursday, May 15, 2014

AIX - Replacing Faulty Disk in ROOTVG



Analyzing Disk Fault

The first signs that a hard disk is failing are temporary error entries in the AIX error log (viewed with errpt). Occasional, isolated temporary errors are not an immediate problem, but if you start to see a cluster of them the disk will need replacing. The worst-case scenario is a permanent error against a hard disk, together with stale partitions.

Check how many errors have been logged, and whether they are permanent or temporary, with:

errpt | more
1581762B   0727203502 T H hdisk0         DISK OPERATION ERROR
1581762B   0727203502 P H hdisk0         DISK OPERATION ERROR

The first error log entry shows a temporary (T) disk problem on hdisk0, whilst the second shows a permanent (P) error on the same disk. The procedures for replacing hdisk0 and hdisk1 (both part of rootvg) are slightly different. See the steps below.
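When the error log is long, it can help to tally the entries per disk rather than read them by eye. The following is only a sketch: the helper name count_disk_errors is made up for illustration, and it assumes the default errpt summary layout (IDENTIFIER, TIMESTAMP, T, C, RESOURCE_NAME, DESCRIPTION) shown in the sample above.

```shell
# count_disk_errors DISK
# Reads errpt summary output on stdin and counts the temporary (T) and
# permanent (P) hardware-class (H) entries logged against DISK.
# Hypothetical helper; field positions assume the default errpt summary layout.
count_disk_errors() {
    awk -v disk="$1" '
        $4 == "H" && $5 == disk {
            if ($3 == "T") temp++
            else if ($3 == "P") perm++
        }
        END { printf "%s temporary=%d permanent=%d\n", disk, temp, perm }
    '
}

# On AIX:  errpt | count_disk_errors hdisk0
```

A non-zero permanent count is the signal to plan the replacement straight away.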

To check for stale partitions, run the command: lsvg -l rootvg
rootvg:
LV NAME     TYPE      LPs   PPs   PVs   LV STATE       MOUNT POINT
hd5         boot      1     2     2     closed/syncd   N/A
hd6         paging    64    128   2     open/syncd     N/A
hd8         jfslog    1     2     2     open/stale     N/A
hd4         jfs       4     8     2     open/stale     /
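On a volume group with many logical volumes, the stale ones can be picked out of the listing automatically. This is a minimal sketch; the helper name stale_lvs is an assumption, and it relies on the LV STATE column being the sixth field, as in the output above.

```shell
# stale_lvs
# Reads `lsvg -l VG` output on stdin and prints the name of every logical
# volume whose LV STATE contains "stale". The numeric test on the LPs
# column skips the "rootvg:" line and the column header.
stale_lvs() {
    awk '$3 ~ /^[0-9]+$/ && $6 ~ /stale/ { print $1 }'
}

# On AIX:  lsvg -l rootvg | stale_lvs
```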

Steps for replacing faulty disks in other volume groups are much simpler than replacing disks in rootvg. I have written a procedure for this below also.

For procedures on replacing faulty SSA disk, refer to the link

Replacing hdisk0 in rootvg

Change bootlist

bosboot -a -d hdisk1                               Make sure hdisk1 has a boot image
bootlist -m normal hdisk1 hdisk0            Change the bootlist so the system will use hdisk1 before hdisk0
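It is worth confirming that the change took effect before going further. bootlist -m normal -o displays the current normal-mode boot list; the sketch below pulls out the first device. The helper name first_boot_device is hypothetical, and it assumes one device per line with the device name as the first field (some AIX levels append extra fields such as blv=hd5).

```shell
# first_boot_device
# Reads `bootlist -m normal -o` output on stdin and prints the device
# named on the first non-empty line, i.e. the disk tried first at boot.
first_boot_device() {
    awk 'NF > 0 { print $1; exit }'
}

# On AIX:
#   [ "$(bootlist -m normal -o | first_boot_device)" = hdisk1 ] && echo "hdisk1 boots first"
```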

Removing Primary Dump Device

sysdumpdev -l         Lists the dump devices. The primary dump device is normally on hdisk0, so it must be moved before the disk is removed

                primary              /dev/pdumplv
                secondary            /dev/sdumplv
                copy directory       /var/adm/dump
                forced copy flag     FALSE
                always allow dump    TRUE
                dump compression     ON

sysdumpdev -Pp /dev/hd6                       Changes primary dump device

                primary              /dev/hd6
                secondary            /dev/sdumplv
                copy directory       /var/adm/dump
                forced copy flag     FALSE
                always allow dump    TRUE
                dump compression     ON

rmlv pdumplv                                          Remove the logical volume pdumplv, the primary dump device
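Because rmlv is destructive, a cautious approach is to confirm that the primary dump device really has moved before removing pdumplv. The following sketch reads the sysdumpdev -l output shown above; the helper name primary_dump_device is an assumption made for illustration.

```shell
# primary_dump_device
# Reads `sysdumpdev -l` output on stdin and prints the device listed on
# the "primary" line.
primary_dump_device() {
    awk '$1 == "primary" { print $2 }'
}

# On AIX, only remove the old dump LV once the primary has moved to hd6:
#   [ "$(sysdumpdev -l | primary_dump_device)" = /dev/hd6 ] && rmlv pdumplv
```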

Un-Mirroring Hard Disk from VG

Now you need to un-mirror the volume group so the disk can be removed. There are two ways to do this: at the disk level with unmirrorvg, or one logical volume at a time with rmlvcopy. The outcome is the same with both, but the second gives you more control.

Method One

unmirrorvg rootvg hdisk0                               Unmirrors the disk.

NB: Sometimes this is unstable, especially if you have stale partitions. I have also noticed that this command fails if pdumplv is mirrored (it shouldn't be by default). In that case, either unmirror that logical volume first and then run unmirrorvg, or follow the method below.

Method Two

lsvg -l rootvg                                   Lists all logical volumes in rootvg
rootvg:
LV NAME     TYPE      LPs   PPs   PVs   LV STATE       MOUNT POINT
hd5         boot      1     2     2     closed/syncd   N/A
hd6         paging    64    128   2     open/syncd     N/A
hd8         jfslog    1     2     2     open/syncd     N/A
hd4         jfs       4     8     2     open/syncd     /

rmlvcopy LVNAME 1 hdisk0  Run this command for each logical volume
e.g: rmlvcopy hd5 1 hdisk0
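Typing one rmlvcopy per logical volume is error-prone on a busy rootvg, so the command list can be generated from the lsvg -l output and reviewed before anything is run. This is a sketch under stated assumptions: the helper name print_rmlvcopy_cmds is made up, and an LV is treated as mirrored when its PPs count is greater than its LPs count.

```shell
# print_rmlvcopy_cmds DISK
# Reads `lsvg -l VG` output on stdin. For every logical volume that has
# more physical partitions than logical partitions (more than one copy),
# prints the rmlvcopy command that would leave one copy by removing the
# copy on DISK. Nothing is executed; the output is for review.
print_rmlvcopy_cmds() {
    awk -v disk="$1" '
        $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && ($4 + 0) > ($3 + 0) {
            printf "rmlvcopy %s 1 %s\n", $1, disk
        }
    '
}

# On AIX:  lsvg -l rootvg | print_rmlvcopy_cmds hdisk0
```

Once the list looks right, the same pipeline can be fed to sh to run the commands.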


Check the disk has been unmirrored with: lsvg -l rootvg. For each LV, the PVs column should show 1 and the PPs count should equal the LPs count.
rootvg:
LV NAME     TYPE      LPs   PPs   PVs   LV STATE       MOUNT POINT
hd5         boot      1     1     1     closed/syncd   N/A
hd6         paging    64    64    1     open/syncd     N/A
hd8         jfslog    1     1     1     open/syncd     N/A
hd4         jfs       4     4     1     open/syncd     /
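The same check can be scripted so that nothing is missed before the disk is taken out of the volume group. A minimal sketch, with a hypothetical helper name (lvs_with_pv_count_not) and the assumption that PVs is the fifth column of the lsvg -l output:

```shell
# lvs_with_pv_count_not N
# Reads `lsvg -l VG` output on stdin and prints every logical volume whose
# PVs column is not N. Exits 0 when all LVs match, 1 when any do not.
lvs_with_pv_count_not() {
    awk -v want="$1" '
        $3 ~ /^[0-9]+$/ && ($5 + 0) != (want + 0) { print $1; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# On AIX, after un-mirroring:
#   lsvg -l rootvg | lvs_with_pv_count_not 1 && echo "rootvg is on a single disk"
```

Passing 2 instead of 1 gives the equivalent check after re-mirroring, although dump logical volumes that are deliberately left unmirrored will then be listed.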

Make a note of the SCSI ID and serial number, which will make the CE's life easier when the disk has to be removed. In the example below the SCSI ID is 8 (the digit before the comma in the location code 10-88-00-8,0) and the serial number is 4DFJY156. The command you need to run is: lscfg -vl hdisk0

  DEVICE            LOCATION          DESCRIPTION
  hdisk0            10-88-00-8,0      16 Bit LVD SCSI Disk Drive (9100 MB)
        Manufacturer................IBM
        Machine Type and Model......DDYS-T09170M
        FRU Number..................00P1517
        ROS Level and ID............53394841
        Serial Number...............4DFJY156
        EC Level....................F79924
        Part Number.................07N3852
        Device Specific.(Z0)........000003029F00013A
        Device Specific.(Z1)........07N4925
        Device Specific.(Z2)........0933
        Device Specific.(Z3)........00315
        Device Specific.(Z4)........0001
        Device Specific.(Z5)........22
        Device Specific.(Z6)........F79924
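To save copying the two values by hand, they can be extracted from the listing. The sketch below is illustrative only: disk_ids is a hypothetical helper name, and it assumes the layout shown above, where the location code is the second field on the disk's own line and the serial number follows a run of dots.

```shell
# disk_ids DISK
# Reads `lscfg -vl DISK` output on stdin and prints the location code and
# serial number on one line, ready to hand to the engineer.
disk_ids() {
    awk -v disk="$1" '
        $1 == disk { loc = $2 }
        /Serial Number/ {
            serial = $0
            sub(/^.*\.\.+/, "", serial)   # drop the label and the dot leader
        }
        END { printf "location=%s serial=%s\n", loc, serial }
    '
}

# On AIX:  lscfg -vl hdisk0 | disk_ids hdisk0
```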

Remove the Disk from VG

reducevg rootvg hdisk0           Remove hdisk0 from the volume group
rmdev -l hdisk0 -d                   Remove the definition of hdisk0 from the system

lsvg rootvg                              Ensure disk is removed
lspv hdisk0                              Ensure disk is removed

Now physically remove the disk and fit the new disk.

Add the New Disk to the System

cfgmgr     Now run configuration Manager to add the new disk to the system
diag         Then go into diagnostics to update the system log so the system is aware that hdisk0 has been replaced
        Task Selection ->
        Log Repair Action ->
        hdisk0                                    
Esc 0                       To exit diagnostics after Log Repair Action has completed.

errpt | more                              Check Log Repair Action has taken place. You should see an entry like the one below (this sample was logged against hdisk2; yours should name the disk you replaced):

                2F3E09A4   0819110902 I H hdisk2         REPAIR ACTION
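If the error log is busy, the repair entry can be looked for directly instead of paging through errpt. A small sketch, assuming the same errpt summary layout as above; has_repair_action is a hypothetical helper name.

```shell
# has_repair_action DISK
# Reads errpt summary output on stdin. Exits 0 if a REPAIR ACTION entry
# exists for DISK, 1 otherwise.
has_repair_action() {
    awk -v disk="$1" '
        $5 == disk && $0 ~ /REPAIR ACTION/ { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# On AIX:  errpt | has_repair_action hdisk0 && echo "repair action logged"
```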

diag                 Go back into diagnostics and certify this disk. This will indicate whether the new disk is ok
        Task Selection ->
        Certify the disk ->
        hdisk0                                     Commit the changes and exit by pressing F3
Esc 0                                                       To exit diagnostics after Certifying the new disk

Add disk into the Volume Group
extendvg rootvg hdisk0                           Add disk into the volume group rootvg

Now you need to re-mirror the disk. Again, you can mirror at the disk level or one logical volume at a time.

Re-Mirroring Hard Disk

Method One
mirrorvg rootvg hdisk0                           Mirrors the disk
syncvg -v rootvg                                     Synchronizes the volume group and the data contained within it

NB: This method will mirror the logical volume pdumplv. Unmirror the logical volume by:
rmlvcopy pdumplv 1 hdisk1

Method Two
lsvg -l rootvg                                           Lists all the logical volumes to re-mirror
mklvcopy -k LVNAME 2 hdisk0             Run this command for each logical volume. The -k flag also synchronizes the data
e.g: mklvcopy -k hd5 2 hdisk0
NB: Do not mirror the logical volume pdumplv
syncvg -v rootvg                                     Synchronizes the volume group and the data contained within it
lsvg -l rootvg                                           Check rootvg has been mirrored and status is open/syncd
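As with un-mirroring, the per-LV commands can be generated from the lsvg -l listing and checked before they are run. This is a sketch, not a definitive procedure: print_mklvcopy_cmds is a made-up helper name, an LV is treated as unmirrored when PPs equals LPs, and pdumplv and sdumplv are skipped by name to follow the note above.

```shell
# print_mklvcopy_cmds DISK
# Reads `lsvg -l VG` output on stdin. For every single-copy logical volume
# (PPs equal to LPs), except the dump LVs pdumplv and sdumplv, prints the
# mklvcopy command that would add a second, synchronized copy on DISK.
# Nothing is executed; the output is for review.
print_mklvcopy_cmds() {
    awk -v disk="$1" '
        $1 == "pdumplv" || $1 == "sdumplv" { next }
        $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && ($4 + 0) == ($3 + 0) {
            printf "mklvcopy -k %s 2 %s\n", $1, disk
        }
    '
}

# On AIX:  lsvg -l rootvg | print_mklvcopy_cmds hdisk0
```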

Check the volume group has been completely re-mirrored with: lsvg -l rootvg. The PVs column should show 2 for each logical volume apart from pdumplv and sdumplv
rootvg:
LV NAME     TYPE      LPs   PPs   PVs   LV STATE       MOUNT POINT
hd5         boot      1     2     2     closed/syncd   N/A
hd6         paging    64    128   2     open/syncd     N/A
hd8         jfslog    1     2     2     open/syncd     N/A
hd4         jfs       4     8     2     open/syncd     /

mklv -y 'pdumplv' rootvg 40 hdisk0                       Re-create the logical volume for your primary dump device
sysdumpdev -Pp /dev/pdumplv                               Re-allocate your primary dump device.

                primary              /dev/pdumplv
                secondary            /dev/sdumplv
                copy directory       /var/adm/dump
                forced copy flag     FALSE
                always allow dump    TRUE
                dump compression     ON

bosboot -a -d hdisk0                                               Update the boot image on hdisk0
bootlist -m normal hdisk0 hdisk1                            Change your boot list back.