Physical Disk Failures

Richard Siemers
Site Admin
Posts: 1333
Joined: Tue Aug 18, 2009 10:35 pm
Location: Dallas, Texas

Re: Physical Disk Failures

Post by Richard Siemers »

I was able to confirm that the "Used Fail" chunklet count is a good one to watch until it reaches 0. Just had a 1 TB NL drive fail:

ESFWT800-1 cli% showpd -c 362
                              ------- Normal Chunklets -------- ---- Spare Chunklets ----
                              - Used - -------- Unused -------- - Used - ---- Unused ----
 Id CagePos Type State  Total OK  Fail Free Uninit Unavail Fail OK  Fail Free Uninit Fail
362 0:5:2   NL   failed  3724  0  1078    0   1046       0 1586  0     0    0      0   14
-----------------------------------------------------------------------------------------
  1 total                3724  0  1078    0   1046       0 1586  0     0    0      0   14


That number of failed chunklets is slowly ticking down over time as they either move or rebuild from parity (I can't tell which).
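For anyone who wants to track that count from a script, here is a minimal Python sketch that pulls the numbers out of one data row of the "showpd -c" output above. The field names are my own labels, assumed from the three header lines, not anything the CLI defines. As a sanity check on the mapping, the eleven chunklet counts in the sample row add up to the Total column (3724).

```python
# Field labels are assumptions read off the three header lines of `showpd -c`;
# the CLI itself does not name the columns this way.
FIELDS = [
    "total",
    "normal_used_ok", "normal_used_fail",
    "normal_unused_free", "normal_unused_uninit",
    "normal_unused_unavail", "normal_unused_fail",
    "spare_used_ok", "spare_used_fail",
    "spare_unused_free", "spare_unused_uninit", "spare_unused_fail",
]


def parse_showpd_c_row(line):
    """Split one data row of `showpd -c` into a dict of labelled values."""
    parts = line.split()
    if len(parts) != 4 + len(FIELDS):
        raise ValueError("unexpected column count: %d" % len(parts))
    row = {
        "id": int(parts[0]),
        "cagepos": parts[1],
        "type": parts[2],
        "state": parts[3],
    }
    row.update(zip(FIELDS, (int(value) for value in parts[4:])))
    return row


def chunklets_accounted_for(row):
    """True when the per-category counts sum to the Total column."""
    return sum(row[name] for name in FIELDS[1:]) == row["total"]


# The data row for PD 362 from the output above.
SAMPLE = ("362 0:5:2   NL   failed  3724  0  1078    0   1046"
          "       0 1586  0     0    0      0   14")

row = parse_showpd_c_row(SAMPLE)
print(row["normal_used_fail"], chunklets_accounted_for(row))
```

Polling the CLI and feeding each new row through this would show the "Used Fail" value heading toward 0; how you capture the CLI output is left out on purpose.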

"showpdch -sync" did not show anything.

"showpdch -mov" showed all the chunklets from the failed PD that had already been relocated, and 2 that were actively moving.

"showpdch 362" showed all the chunklets left on the drive, including the 2 that were currently moving. This list keeps getting shorter; each chunklet only takes a short time.

"showpdch -mov 362" shows just the 2 chunklets being moved off the failed drive.

What is interesting is that "showpd -c 362" shows all the remaining chunklets as "failed", and that number is shrinking over time. However, "showpdch 362" shows all the chunklets as "normal", even though it's clearly evacuating them to other disks 2 at a time.
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.
afidel
Posts: 216
Joined: Tue May 07, 2013 1:45 pm

Re: Physical Disk Failures

Post by afidel »

Hmm, only 2 chunklets at a time? That seems like a rather slow way to restore availability. I was led to believe that recovery operations were done on a many-to-many basis like XIV, but 2 concurrent chunklets sounds much closer to RID's on EVA.
Richard Siemers
Site Admin
Posts: 1333
Joined: Tue Aug 18, 2009 10:35 pm
Location: Dallas, Texas

Re: Physical Disk Failures

Post by Richard Siemers »

This 2 chunklets moving at a time thing *seems* to be a new feature since we upgraded from 2.3.1 to 3.1.1. With 2.3.1, rebuilds would go fast enough to trigger our IOPS/PD alerts every 5 minutes for about 30 minutes total... then the rebuild would complete.

I suspect there is more to it than that... I *think* these chunks on the failed drive were still online/readable, so it may have chosen a low-priority move since availability was not impacted... I hope that's the case.

Would be nice to have some documentation of how drive errors are dealt with.
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.
afidel
Posts: 216
Joined: Tue May 07, 2013 1:45 pm

Re: Physical Disk Failures

Post by afidel »

Ah, that makes sense. If it sees the drive as online but degraded, it's logical to do a low-priority evacuation.
3parlrn
Posts: 13
Joined: Mon Jun 15, 2015 11:34 am

Re: Physical Disk Failures

Post by 3parlrn »

What happens if someone pulls the wrong disk out and wants to put it back?

1. Does it move chunklets from the removed disk to other PDs?
2. How do you bring the PD back online after reinserting it?
3. How do you restore those chunklets to the disk that was pulled out?


Thanks for your recommendations and expert views
AMINHETFIELD
Posts: 6
Joined: Tue May 01, 2018 4:27 am

Re: Physical Disk Failures

Post by AMINHETFIELD »

Hello

I have a problem with an HP 3PAR 7200 with 900 GB FC HDDs.
One of the HDDs failed about 1 month ago. The disk is PD 19 (cage 0, magazine 19). I replaced it using the servicemag procedure and everything was OK.
After 1 day the new HDD was normal and the failed disk was gone, but the next day the new disk failed. I replaced the failed disk again, and after 1 day everything was OK.
After 1 month the HDD in 0 19 failed again and I replaced it, but after 2 days the new HDD failed again.


cli% showpd

                           ----Size(MB)---- ----Ports----
Id CagePos Type RPM State     Total    Free A      B      Cap(GB)
 0 0:0:0   FC    10 normal   838656  146432 1:0:1* 0:0:1      900
 1 0:1:0   FC    10 normal   838656  143360 1:0:1  0:0:1*     900
 2 0:2:0   FC    10 normal   838656  585728 1:0:1* 0:0:1      900
 3 0:3:0   FC    10 normal   838656  136192 1:0:1  0:0:1*     900
 4 0:4:0   FC    10 normal   838656  147456 1:0:1* 0:0:1      900
 5 0:5:0   FC    10 normal   838656  117760 1:0:1  0:0:1*     900
 6 0:6:0   FC    10 normal   838656  148480 1:0:1* 0:0:1      900
 7 0:7:0   FC    10 normal   838656  129024 1:0:1  0:0:1*     900
 8 0:8:0   FC    10 normal   838656  148480 1:0:1* 0:0:1      900
 9 0:9:0   FC    10 normal   838656  105472 1:0:1  0:0:1*     900
10 0:10:0  FC    10 normal   838656       0 1:0:1* 0:0:1      900
11 0:11:0  FC    10 normal   838656       0 1:0:1  0:0:1*     900
12 0:12:0  FC    10 normal   838656       0 1:0:1* 0:0:1      900
13 0:13:0  FC    10 normal   838656       0 1:0:1  0:0:1*     900
14 0:14:0  FC    10 normal   838656       0 1:0:1* 0:0:1      900
15 0:15:0  FC    10 normal   838656    1024 1:0:1  0:0:1*     900
16 0:16:0  FC    10 normal   838656       0 1:0:1* 0:0:1      900
17 0:17:0  FC    10 normal   838656       0 1:0:1  0:0:1*     900
18 0:18:0  FC    10 normal   838656       0 1:0:1* 0:0:1      900
19 0:19:0  FC    10 failed   838656       0 1:0:1  0:0:1*     900
20 0:21:0  FC    10 normal   838656       0 1:0:1  0:0:1*     900
21 0:22:0  FC    10 normal   838656    5120 1:0:1* 0:0:1      900
22 0:23:0  FC    10 normal   838656    2048 1:0:1  0:0:1*     900
23 0:20:0  FC    10 normal   838656       0 1:0:1* 0:0:1      900


cli% checkhealth

Checking alert
Checking cabling
Checking cage
Checking dar
Checking date
Checking ld
Checking license
Checking network
Checking node
Checking pd
Checking port
Checking rc
Checking snmp
Checking task
Checking vlun
Checking vv
Component ---------------Description--------------- Qty
Network Too few working admin network connections 1
PD PDs that are failed 1


cli% showcage

Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
0 cage0 1:0:1 0 0:0:1 0 24 26-30 320e 320e DCN1 n/a



cli% showversion

Release version 3.1.2 (MU2)
Patches: P10

Component Name Version
CLI Server 3.1.2 (MU2)
CLI Client 3.1.2 (MU2)
System Manager 3.1.2 (MU2)
Kernel 3.1.2 (MU2)
TPD Kernel Code 3.1.2 (MU2)

cli% servicemag start -pdid 19 -seucceeded

Expecting integer pdid, got: -succeeded

SAN.SER cli% servicemag start -pdid 19 -succeeded

Are you sure you want to run servicemag?
select q=quit y=yes n=no: y
servicemag start -pdid 19

... servicing disks in mag: 0 19

... normal disks:

... not normal disks: WWN [XXXXXXXXXXXXXXXX] Id [19] diskpos [0]



The servicemag start operation will continue in the background.

cli% showpd -space 19

                        -----------------(MB)------------------
Id CagePos Type -State-   Size Volume Spare Free Unavail Failed
19 0:19:0  FC   failed  838656      0     0    0       0 838656
---------------------------------------------------------------
 1 total                838656      0     0    0       0 838656
SAN.SER cli% servicemag resume 0 19

Are you sure you want to run servicemag?
select q=quit y=yes n=no: y

servicemag status 0 19

The magazine is being brought online due to a servicemag resume.
The last status update was at Tue May 1 10:27:04 2018.
Chunklets relocated: 6 in 4 minutes and 45 seconds
Chunklets remaining: 2232
Chunklets marked for moving: 2232
Estimated time for relocation completion based on 47 seconds per chunklet is: 1 days, 5 hours, 8 minutes and 24 seconds
servicemag resume 0 19 -- is in Progress
cli% exit
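
The completion estimate in that servicemag status output can be reproduced by hand, which is handy when deciding how long to wait. A short Python sketch of the arithmetic follows. It assumes the CLI truncates the observed rate to whole seconds (6 chunklets in 4 minutes 45 seconds is 47.5 seconds each, reported as 47); that rounding is my guess from the numbers, not documented behavior.

```python
def split_duration(total_seconds):
    """Break a number of seconds into (days, hours, minutes, seconds)."""
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return days, hours, minutes, seconds


# Figures taken from the servicemag status output above.
relocated = 6
elapsed_seconds = 4 * 60 + 45          # 4 minutes and 45 seconds
remaining = 2232

# Assumption: the rate is truncated to whole seconds per chunklet.
seconds_per_chunklet = elapsed_seconds // relocated
eta = split_duration(remaining * seconds_per_chunklet)

print(seconds_per_chunklet, eta)
```

With these inputs the rate comes out at 47 seconds per chunklet and the estimate at 1 day, 5 hours, 8 minutes and 24 seconds, matching what the array reported.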

Could the OS version be my problem?

Please help me with this problem.

Thank you
MammaGutt
Posts: 1578
Joined: Mon Sep 21, 2015 2:11 pm
Location: Europe

Re: Physical Disk Failures

Post by MammaGutt »

Could be OS.

Could also be the cage slot.

How are you getting your replacement drives? If they are from eBay or some third party, they may have been used and have some SMART counters just waiting to fail the drive.
The views and opinions expressed are my own and do not necessarily reflect those of my current or previous employers.
AMINHETFIELD
Posts: 6
Joined: Tue May 01, 2018 4:27 am

Re: Physical Disk Failures

Post by AMINHETFIELD »

Thank you for the reply.

I buy my HDDs from HP.
If the slot were my problem, I would expect a new HDD to fail right after I insert it into the slot.
But the HDD only fails after chunklet relocation has finished and the HDD state has been normal for anywhere from 3 days to 1 month.
ailean
Posts: 392
Joined: Wed Nov 09, 2011 12:01 pm

Re: Physical Disk Failures

Post by ailean »

Slot could be causing some intermittent errors that add up over time, reaching a threshold that fails the disk.

Next time check the slot for any debris or pin damage just in case.

Maybe run showpd -e occasionally to see if any error counts are climbing.
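
One way to spot climbing counts is to keep two snapshots and diff them. The Python sketch below assumes you have already turned each "showpd -e" run into a dict keyed by PD id; the counter names and the numbers are made-up placeholders for illustration, not real output from the array.

```python
def climbing_counters(previous, current):
    """Return {pd_id: {counter: increase}} for counters that went up."""
    rising = {}
    for pd_id, counters in current.items():
        before = previous.get(pd_id, {})
        deltas = {
            name: value - before.get(name, 0)
            for name, value in counters.items()
            if value > before.get(name, 0)
        }
        if deltas:
            rising[pd_id] = deltas
    return rising


# Hypothetical snapshots taken some time apart (values are invented).
snapshot_before = {
    18: {"corrected": 0, "uncorrected": 0},
    19: {"corrected": 2, "uncorrected": 0},
}
snapshot_after = {
    18: {"corrected": 0, "uncorrected": 0},
    19: {"corrected": 2, "uncorrected": 3},
}

print(climbing_counters(snapshot_before, snapshot_after))
```

A PD that keeps showing up in the result across several snapshots, always in the same slot even after a drive swap, would point toward the slot rather than the drives.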
sanjac
Posts: 96
Joined: Thu Oct 26, 2017 1:21 am

Re: Physical Disk Failures

Post by sanjac »

On one of my drives I discovered 3 GiB failed. When is it time for concern? How many failed chunklets do I need before I have the right to call support to change the drive?