
HOWTO: Faulty Drive Replacement on SERVERware 4 Mirror/Storage Edition

If one of the disks in the storage pool is damaged, follow the procedure below:


If a disk fails, the zpool on the primary server will be in the DEGRADED state:

~# zpool status
  pool: NETSTOR
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0 in 0h0m with 0 errors on Tue Dec  6 15:10:59 2016
config:
        NAME                    STATE     READ WRITE CKSUM
        NETSTOR                 DEGRADED     0     0     0
          mirror-0              DEGRADED     0     0     0
            SW3-NETSTOR-SRV1-1  ONLINE       0     0     0
            SW3-NETSTOR-SRV2-1  FAULTED      3     0     0  too many errors
errors: No known data errors
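
As a quick health check, zpool status -x reports only pools that are not healthy, which makes a faulted pool easy to spot:

~# zpool status -x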


First, we have to make sure the damaged disk is on the secondary server, not the primary. In this case, we can tell from the output above:

SW3-NETSTOR-SRV2-1  FAULTED


The SRV2 part of the label means that Server 2 has the damaged disk.


If this is the case, we can proceed to the next step.
If the damaged disk is on the primary server (SRV1), we should first perform a manual takeover to switch the primary role to the secondary server. To switch manually, SSH to the secondary server and run the following command:


killall -SIGUSR1 sysmonit
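
Once the takeover completes, the NETSTOR pool should be imported on the node that has taken over the primary role. As a sanity check (assuming the pool name NETSTOR from this example), you can list it there:

~# zpool list NETSTOR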


Next, we need to physically replace the damaged disk in the server.

In the zpool status output above we can see that SW3-NETSTOR-SRV2-1 is faulted:


 SW3-NETSTOR-SRV2-1  FAULTED      3     0     0  too many errors


We therefore need to replace the disk labeled SW3-NETSTOR-SRV2-1 with a new one and add it to the zpool mirror.

First, physically remove the faulty disk from the server and replace it with a new disk.

After the replacement, we should see the new disk in /dev/disk/by-id/:


~# ls -lah /dev/disk/by-id
total 0
drwxr-xr-x 2 root root 480 Jul 27 08:57 .
drwxr-xr-x 7 root root 140 Jul 27 08:13 ..
lrwxrwxrwx 1 root root   9 Jul 27 08:13 ata-INTEL_SSDSC2CW060A3_CVCV308402M3060AGN -> ../../sde
lrwxrwxrwx 1 root root  10 Jul 27 08:13 ata-INTEL_SSDSC2CW060A3_CVCV308402M3060AGN-part1 -> ../../sde1
lrwxrwxrwx 1 root root  10 Jul 27 08:13 ata-INTEL_SSDSC2CW060A3_CVCV308402M3060AGN-part2 -> ../../sde2
lrwxrwxrwx 1 root root  10 Jul 27 08:13 ata-INTEL_SSDSC2CW060A3_CVCV308402M3060AGN-part9 -> ../../sde9
lrwxrwxrwx 1 root root   9 Jul 27 08:13 ata-ST31000520AS_5VX0BZN0 -> ../../sda
lrwxrwxrwx 1 root root  10 Jul 27 08:13 ata-ST31000520AS_5VX0BZN0-part1 -> ../../sda1
lrwxrwxrwx 1 root root   9 Jul 27 08:13 ata-WDC_WD10JFCX-68N6GN0_WD-WX61A465TH1Y -> ../../sdc
lrwxrwxrwx 1 root root  10 Jul 27 08:13 ata-WDC_WD10JFCX-68N6GN0_WD-WX61A465TH1Y-part1 -> ../../sdc1
lrwxrwxrwx 1 root root   9 Jul 27 08:13 ata-WDC_WD10JFCX-68N6GN0_WD-WX81EC512Y4H -> ../../sdd
lrwxrwxrwx 1 root root  10 Jul 27 08:13 ata-WDC_WD10JFCX-68N6GN0_WD-WX81EC512Y4H-part1 -> ../../sdd1
lrwxrwxrwx 1 root root   9 Jul 27 08:57 ata-WDC_WD10JFCX-68N6GN0_WD-WXK1E6458WKX -> ../../sdb
lrwxrwxrwx 1 root root   9 Jul 27 08:13 wwn-0x10076999618641940481x -> ../../sdd
lrwxrwxrwx 1 root root  10 Jul 27 08:13 wwn-0x10076999618641940481x-part1 -> ../../sdd1
lrwxrwxrwx 1 root root   9 Jul 27 08:13 wwn-0x11689569317835657217x -> ../../sdc
lrwxrwxrwx 1 root root  10 Jul 27 08:13 wwn-0x11689569317835657217x-part1 -> ../../sdc1
lrwxrwxrwx 1 root root   9 Jul 27 08:57 wwn-0x11769037186453098497x -> ../../sdb
lrwxrwxrwx 1 root root   9 Jul 27 08:13 wwn-0x12757853320186451405x -> ../../sde
lrwxrwxrwx 1 root root  10 Jul 27 08:13 wwn-0x12757853320186451405x-part1 -> ../../sde1
lrwxrwxrwx 1 root root  10 Jul 27 08:13 wwn-0x12757853320186451405x-part2 -> ../../sde2
lrwxrwxrwx 1 root root  10 Jul 27 08:13 wwn-0x12757853320186451405x-part9 -> ../../sde9
lrwxrwxrwx 1 root root   9 Jul 27 08:13 wwn-0x7847552951345238016x -> ../../sda
lrwxrwxrwx 1 root root  10 Jul 27 08:13 wwn-0x7847552951345238016x-part1 -> ../../sda1
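
If it is not obvious which entry belongs to the new disk, sorting by modification time can help; in the listing above, the entries created at 08:57 point to the newly inserted drive (ata-WDC_WD10JFCX-68N6GN0_WD-WXK1E6458WKX -> ../../sdb), which has no -part entries yet:

~# ls -ltr /dev/disk/by-id | tail -n 5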


Now that we have the block device name, we can create a partition table and a partition, and prepare the drive for use.

To create the partition table, use parted:

~# parted /dev/<new_disk> --script -- mktable gpt
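
For example, assuming the new disk appeared as sdb (as in the listing above), the command would be:

~# parted /dev/sdb --script -- mktable gpt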


Create a new label.

IMPORTANT: the label must be named in the following format: SW3-NETSTOR-SRVx-y

where "SRVx" is the server number and "y" is the disk number.


So, in our example (SW3-NETSTOR-SRV2-1):
  • SW3-NETSTOR-SRV2 - the virtual disk on SERVER 2
  • -1 - the number of the disk (disk 1)

Now add the label to the new drive.

Create the partition with a name that matches the faulted partition on the server. We have this name from the output above:

SW3-NETSTOR-SRV2-1  FAULTED      3     0     0  too many errors

Our command in this case will be:

~# parted /dev/<new_disk> --script -- mkpart "SW3-NETSTOR-SRV2-1" 1 -1


We have now added a new partition and created a label.
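
To verify, we can print the new partition table (again assuming the new disk is sdb):

~# parted /dev/sdb --script -- print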

To replace the drive, we use the sw-nvme commands listed below.

Replace the old drive with the new one using:

~# sw-nvme replace-disk --old /dev/disk/by-id/old_disk_id --new /dev/disk/by-id/new_disk_id

To find the old disk ID, run the sw-nvme show command.

Example:

~# sw-nvme show
{
 "config": "/sys/kernel/config/nvmet",
 "hosts": [
  "3cc5c2aa47825e608570a938971bcd7c"
 ],
 "subsystems": {
  "sw-mirror": {
   "acl": [
    "3cc5c2aa47825e608570a938971bcd7c"
   ],
   "namespaces": [
    {
     "id": 1,
     "device": "/dev/disk/by-id/ata-KINGSTON_SA400S37120G_50026B73804B902A",
     "enabled": true
    }
   ],
   "allow_any_host": false
  }
 },
 "ports": {
  "1": {
   "address": "1.1.1.31",
   "port": 4420,
   "address_family": "ipv4",
   "trtype": "tcp",
   "subsystems": "sw-mirror"
  }
 }
}

Now that we have the old and the new disk IDs, our disk replace command will be:


~# sw-nvme replace-disk --old /dev/disk/by-id/ata-KINGSTON_SA400S37120G_50026B73804B902A --new /dev/disk/by-id/ata-WDC_WD10JFCX-68N6GN0_WD-WXK1E6458WKX
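
To confirm the replacement, we can run sw-nvme show again; the "device" field of the namespace should now point to the new disk ID:

~# sw-nvme show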


This ends our procedure on the secondary server.

Next, on the primary server, add the newly created virtual disk to the ZFS pool.

First, rescan the partition tables:

~# partprobe
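
partprobe asks the kernel to re-read the partition tables so the new partition label becomes visible. The label should now appear under /dev/disk/by-partlabel/:

~# ls -l /dev/disk/by-partlabel/ | grep SW3-NETSTOR-SRV2-1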


Then check the zpool status:

~# zpool status
  pool: NETSTOR
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0 in 0h0m with 0 errors on Tue Dec  6 15:10:59 2016
config:


        NAME                    STATE     READ WRITE CKSUM
        NETSTOR                 DEGRADED     0     0     0
          mirror-0              DEGRADED     0     0     0
            SW3-NETSTOR-SRV1-1  ONLINE       0     0     0
            SW3-NETSTOR-SRV2-1  FAULTED      3     0     0  too many errors


errors: No known data errors

From the output, we can see that the secondary disk SW3-NETSTOR-SRV2-1 still has the FAULTED status.

Now we need to replace the old disk, identified by its GUID, with the new disk, so that the zpool can start using it.

To do that, we first need to find the GUID that the pool has recorded for the faulted disk.

We can use the zdb command to find it:

~# zdb
NETSTOR:
    version: 5000
    name: 'NETSTOR'
    state: 0
    txg: 15
    pool_guid: 14112818788567273316
    errata: 0
    hostname: 'HydraA-1'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 14112818788567273316
        children[0]:
            type: 'mirror'
            id: 0
            guid: 17350955661294397060
            metaslab_array: 34
            metaslab_shift: 33
            ashift: 12
            asize: 1000164294656
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 11541101181530606692
                path: '/dev/disk/by-partlabel/SW3-NETSTOR-SRV1-1'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 12365645279327980714
                path: '/dev/disk/by-partlabel/SW3-NETSTOR-SRV2-1'
                whole_disk: 1
                create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data

The important lines from the zdb output are:

guid: 12365645279327980714
path: '/dev/disk/by-partlabel/SW3-NETSTOR-SRV2-1'
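
A quick way to pull just these two lines out of the long zdb listing (grep -B1 prints each matching line plus the line before it):

~# zdb | grep -B1 "SW3-NETSTOR-SRV2-1"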


This GUID is what we pass to zpool replace, together with the new device:

~# zpool replace NETSTOR <old_guid> <new_device> -f

Example:

~# zpool replace NETSTOR 12365645279327980714 /dev/disk/by-partlabel/SW3-NETSTOR-SRV2-1 -f


Now check zpool status:


~# zpool status
  pool: NETSTOR
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Dec  6 16:12:53 2016
        591M scanned out of 728M at 65,6M/s, 0h0m to go
        590M resilvered, 81,14% done
config:

        NAME                        STATE     READ WRITE CKSUM
        NETSTOR                     DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            SW3-NETSTOR-SRV1-1      ONLINE       0     0     0
            replacing-1             UNAVAIL      0     0     0
              old                   UNAVAIL      0     0     0  corrupted data
              SW3-NETSTOR-SRV2-1    ONLINE       0     0     0  (resilvering)

errors: No known data errors

You need to wait for zpool to finish resilvering.
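
Resilvering can take a while on larger pools; you can re-check progress periodically, for example with the watch utility if it is available:

~# watch -n 10 zpool status NETSTOR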


This ends our replacement procedure.