HPE Storage Users Group https://www.3parug.com/ |
Physical Disk Failures https://www.3parug.com/viewtopic.php?f=18&t=582 |
Page 1 of 3 |
Author: | Richard Siemers [ Mon Feb 10, 2014 12:58 pm ] |
Post subject: | Physical Disk Failures |
PD failures are common... I wanted to discuss/share/learn how to properly audit/verify the PD failure and recovery process. I have seen drives fail, then come back online, then fail again a week later. I am curious what the workflow is for a PD failure: where in that workflow does it try to re-spin up the drive, or move readable chunklets off the drive vs. rebuild them from parity? At what point will it stop trying to re-spin up the drive and just rebuild everything from parity?

How many different ways does a PD fail, and how does the system react differently to each? I can think of a few: a single-port failure (port A or port B), both ports failing, more than 5 chunklets going bad on the drive (media errors), and a failed drive that can no longer be read from at all.

So at the point of a drive failure, after an alert is sent out to the customer and HP... what is next?

To see which drives are failed or degraded:
showpd -failed -degraded

To show chunklets that have moved, are scheduled to move, or are moving:
showpdch -mov

To show chunklets that have moved, or are moving, from a specific PD:
showpdch -from <pdid>

It appears that "showpdch -sync" may reveal which chunklets are being rebuilt from parity. It appears that "showpdch -log" may show which chunklets are offline but being serviced through parity reads and logged writes, as happens to the other 3 drives on a 4-drive magazine during a servicemag.

One thing I would like to be able to do is confirm for the field tech that the system is ready for him to come onsite. What I currently do to "check" this is a couple of things, because I am not 100% confident the first few are absolute.
1: showpd -space <failed pd #>
Code:
ESFWT800-1 cli% showpd -space 285
                         -----------------(MB)------------------
 Id CagePos Type -State-   Size Volume Spare Free Unavail Failed
285 7:9:1   FC   failed  285440      0     0    0       0 285440
----------------------------------------------------------------
  1 total                285440      0     0    0       0 285440

If I don't see Volume at 0, then I assume the drive evac/rebuild is not complete yet.

2: showpdch -mov <failed pd #>
Code:
ESFWT800-1 cli% showpd -space 285
                         -----------------(MB)------------------
 Id CagePos Type -State-   Size Volume Spare Free Unavail Failed
285 7:9:1   FC   failed  285440      0     0    0       0 285440
----------------------------------------------------------------
  1 total                285440      0     0    0       0 285440
ESFWT800-1 cli% showpdch -mov
Pdid Chnk LdName        LdCh State  Usage Media Sp Cl From    To
  42  584 tp-2-sd-0.144  514 normal ld    valid  Y  N 285:793 ---
  42  792 tp-2-sd-0.69   726 normal ld    valid  Y  N 285:488 ---
  42 1084 tp-2-sd-0.86   478 normal ld    valid  Y  N 285:521 ---
 102  574 tp-2-sd-0.140  917 normal ld    valid  Y  N 285:785 ---
 102  771 tp-5-sd-0.31   181 normal ld    valid  Y  N 285:190 ---
 102 1085 tp-2-sd-0.41   438 normal ld    valid  Y  N 285:418 ---
 109  580 tp-5-sa-0.3     42 normal ld    valid  Y  N 285:47  ---
 109  771 tp-2-sd-0.130  696 normal ld    valid  Y  N 285:697 ---
...
...
---------------------------------------------------------------------------
Total chunklets: 824

If I see any chunklets still on PDID 285 (the failed one), or any that have data in the To field, I assume the rebuild/evac is not done yet.

Is there any way to view the tasks that relocate/rebuild these chunklets? I don't see anything in my showtask history. |
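The two manual checks above could be scripted roughly as follows. This is only a sketch: it assumes the column layouts shown in the pasted output (10 fields per `showpd -space` row, 11 per `showpdch -mov` row), and the function names are hypothetical helpers, not part of the 3PAR CLI. It works on saved command output rather than talking to the array.

```python
# Sample text copied from the output pasted in the post above.
SAMPLE_SPACE = """\
 Id CagePos Type -State-   Size Volume Spare Free Unavail Failed
285 7:9:1   FC   failed  285440      0     0    0       0 285440
"""

SAMPLE_MOV = """\
Pdid Chnk LdName        LdCh State  Usage Media Sp Cl From    To
  42  584 tp-2-sd-0.144  514 normal ld    valid  Y  N 285:793 ---
 109  580 tp-5-sa-0.3     42 normal ld    valid  Y  N 285:47  ---
"""


def volume_mb(showpd_space_output, pd_id):
    """Return the Volume (MB) value for pd_id from `showpd -space` text."""
    for line in showpd_space_output.splitlines():
        fields = line.split()
        # Assumed row layout:
        # Id CagePos Type State Size Volume Spare Free Unavail Failed
        if len(fields) == 10 and fields[0] == str(pd_id):
            return int(fields[5])
    raise ValueError("PD %s not found in output" % pd_id)


def pending_chunklets(showpdch_mov_output, failed_pd):
    """Return rows suggesting the evacuation is still in progress."""
    pending = []
    for line in showpdch_mov_output.splitlines():
        fields = line.split()
        # Assumed row layout:
        # Pdid Chnk LdName LdCh State Usage Media Sp Cl From To
        if len(fields) != 11 or not fields[0].isdigit():
            continue  # header, separator, or summary line
        still_on_failed_pd = fields[0] == str(failed_pd)
        move_target_set = fields[10] != "---"
        if still_on_failed_pd or move_target_set:
            pending.append(line.strip())
    return pending


def evacuation_looks_complete(space_output, mov_output, failed_pd):
    """Apply both checks from the post: Volume is 0 and nothing is pending."""
    return (volume_mb(space_output, failed_pd) == 0
            and not pending_chunklets(mov_output, failed_pd))


if __name__ == "__main__":
    print(evacuation_looks_complete(SAMPLE_SPACE, SAMPLE_MOV, 285))
```

Like the manual version, this only reflects the heuristics described in the post; it is not a confirmed definition of "ready for replacement".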
Author: | ailean [ Tue Feb 11, 2014 7:00 am ] |
Post subject: | Re: Physical Disk Failures |
I tend to see three patterns:
1) Disk fails with little warning, and the system automatically rebuilds from parity.
2) Disk is failing; sometimes you get warnings, and the system automatically moves the data elsewhere.
3) Disk is not happy: maybe a few warnings, or it is not available for allocations, but it requires manual servicing to start the data rebuild/move process before the engineer arrives.

Support are typically aware of whether the disk is ready for replacement, but I'm not sure what info from the SP uploads they check for that. I suspect the estimates they sometimes give are generic, based on disk type/size.

The fun tends to begin when the extra load from the rebuild fails another disk and/or when inserting the new disk doesn't go to plan. Three different service companies and over a dozen different engineers in 5 years have led to random events during replacements, but no data loss. |
Author: | eve [ Fri Feb 14, 2014 3:26 am ] |
Post subject: | Re: Physical Disk Failures |
Hi,

Run showpd -c <pdid> (e.g. showpd -c 285) and check that the following columns are all zero:
* NORMAL USED OK, NORMAL USED FAIL, NORMAL UNUSED FREE
* SPARE USED OK, SPARE USED FAIL and SPARE UNUSED FREE

If any one is not zero, the drive is not ready to be swapped.

Cheers |
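As a rough illustration of the all-zero rule above, here is a small Python sketch. It does not parse real `showpd -c` output; it assumes the six counts have already been read off the screen into a dictionary, and the key names are hypothetical labels for the columns listed in the post.

```python
# Hypothetical labels for the six columns named in the post.
REQUIRED_ZERO = (
    "normal_used_ok",
    "normal_used_fail",
    "normal_unused_free",
    "spare_used_ok",
    "spare_used_fail",
    "spare_unused_free",
)


def blocking_columns(counts):
    """Return the names of the listed columns that are still non-zero."""
    return [name for name in REQUIRED_ZERO if counts[name] != 0]


def ready_to_swap(counts):
    """True only when every listed column is zero, per the rule in the post."""
    return not blocking_columns(counts)


if __name__ == "__main__":
    # Made-up example values, not taken from a real array.
    example = dict.fromkeys(REQUIRED_ZERO, 0)
    example["spare_used_ok"] = 12
    print(ready_to_swap(example), blocking_columns(example))
```

Returning the list of non-zero columns as well as the yes/no answer makes it easier to tell the field tech what is still outstanding.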
Author: | Richard Siemers [ Mon Feb 24, 2014 5:57 pm ] |
Post subject: | Re: Physical Disk Failures |
Thanks for that feedback. What determines what sort of servicemag they will do? I have seen cases where they used logging, in which servicemag wasn't initiated until the tech and the part were onsite, and other cases where they did a full servicemag several hours before the tech arrived. I presume it's based on the activity of the system. How does one determine which to use and when? |
Author: | ailean [ Tue Feb 25, 2014 5:52 am ] |
Post subject: | Re: Physical Disk Failures |
It's been a while since I've seen a full evac of a mag; I'd guess performance, load and % full would be considered. Logging seems to be the norm now. I know they used to have concerns about how long you could run with logging on, but we've had failed inserts of new disks that left us running with logging for several hours, until the engineer was able to get hold of someone who knew enough about 3PAR under the hood to work around the problem.

Where I have seen a full evac, it may only have been when replacing an entire mag (there was a time when FC450 disks weren't available, so any failures were replaced with FC600 disks, and all the disks in a mag had to be the same size), or in some early disk replacements where the mag was only maybe 10% full and logging was still a new feature.

That said, I have had to start the servicemag manually and tell the engineer to come back in a few hours when Support had forgotten, or they have asked me to do it because certain Support staff were using some remote portal that broke often (other teams appeared to have access to better tools at the time and didn't have a problem). |
Author: | corge [ Tue Feb 25, 2014 9:48 am ] |
Post subject: | Re: Physical Disk Failures |
eve wrote:
  Hi, Run a showpd -c pdid (e.g. showpd -c 285) and check for the following columns to be zero:
  * NORMAL USED OK, NORMAL USED FAIL, NORMAL UNUSED FREE
  * SPARE USED OK, SPARE USED FAIL and SPARE UNUSED FREE
  If one is not zero, drive is not ready to be swapped. Cheers

For a drive to be ready for replacement, only Used OK and Used FAIL need to be 0. HP doesn't care about the others. |
Author: | corge [ Tue Feb 25, 2014 9:53 am ] |
Post subject: | Re: Physical Disk Failures |
Our reporting shows that a drive has failed. I double-check on the Inserv via the command line.
Code:
showpd -failed -degraded

I check servicemag to make sure nothing is currently running.
Code:
servicemag status

I issue the command below to get the model number of the drive, since HP will not have it for the company I work for.
Code:
showpd -i <PD#>

I issue the following to get the drive position, drive state and chunklet status.
Code:
showpd -c <PD#>

The following two commands are also needed by support.
Code:
showversion
Code:
showsys

If replacing a single drive, I issue
Code:
servicemag start -log -pdid <PD#>

The Inserv will begin preparing to take the magazine offline and log the chunklets normally bound for this magazine to other magazines in the system. Verify the magazine is ready to be pulled by issuing
Code:
servicemag status

You should see SUCCEEDED when it's ready to be pulled. The orange indicator light on the magazine will be lit. Replace the drive in the magazine, put the magazine back in the Inserv, and wait for the orange light to go out, or make sure all of the lights on the magazine are green and NOT blinking. Blinking lights indicate the drives are still spinning up after initial insertion of the magazine.

Back at the command line, type in
Code:
cmore showpd

You should see the drive placement at the top, and its state should say NEW. This just shows you that the Inserv sees the new drive and is ready to go. Issue the following to have the servicemag script resume the magazine.
Code:
servicemag resume <CAGE#> <MAGAZINE#>

That is how it is done here. |
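For anyone who wants the walkthrough above as a printable checklist, here is a minimal Python sketch that fills in the PD, cage and magazine numbers. The commands are exactly the ones listed in the post; the function name and the example values (PD 285 in cage 7, magazine 9) are only illustrative. It builds command strings and does not run anything against the array.

```python
def replacement_checklist(pd_id, cage, magazine):
    """Return (purpose, command) pairs in the order given in the post."""
    return [
        ("confirm the failure", "showpd -failed -degraded"),
        ("make sure no servicemag is running", "servicemag status"),
        ("get the drive model for support", "showpd -i %d" % pd_id),
        ("get position, state and chunklet status", "showpd -c %d" % pd_id),
        ("collect version info for support", "showversion"),
        ("collect system info for support", "showsys"),
        ("start the logged servicemag", "servicemag start -log -pdid %d" % pd_id),
        ("wait for SUCCEEDED before pulling the magazine", "servicemag status"),
        ("after the swap, check the new drive is seen", "cmore showpd"),
        ("resume the magazine", "servicemag resume %d %d" % (cage, magazine)),
    ]


if __name__ == "__main__":
    # Illustrative values only.
    for number, (purpose, command) in enumerate(replacement_checklist(285, 7, 9), 1):
        print("%2d. %-45s %s" % (number, purpose, command))
```

Keeping the steps as data rather than running them automatically leaves the physical swap and the light checks as deliberate manual pauses between steps 8 and 9.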
Author: | Richard Siemers [ Thu Mar 06, 2014 2:41 pm ] |
Post subject: | Re: Physical Disk Failures |
Excellent, thank you for the step-by-step write-up! |
Author: | eve [ Tue Mar 11, 2014 6:07 am ] |
Post subject: | Re: Physical Disk Failures |
corge wrote:
  eve wrote:
    Hi, Run a showpd -c pdid (e.g. showpd -c 285) and check for the following columns to be zero:
    * NORMAL USED OK, NORMAL USED FAIL, NORMAL UNUSED FREE
    * SPARE USED OK, SPARE USED FAIL and SPARE UNUSED FREE
    If one is not zero, drive is not ready to be swapped. Cheers
  For drive ready to be replaced, only Used OK and Used FAIL need to be 0. HP doesn't care about the others.

HP does care about the others. I have been servicing 3PAR for quite some years, and I know from experience that you may get issues if they are not all zero. |
Author: | eve [ Tue Mar 11, 2014 6:18 am ] |
Post subject: | Re: Physical Disk Failures |
corge wrote:
  Our reporting shows that a drive has failed. I double check on the Inserv via command line. Code: showpd -failed -degraded I check servicemag to make sure nothing is running currently Code: servicemag status I issue the command below to get the model number of the drive, since HP will not have it for the company I work for Code: showpd -i <PD#> I issue the following to get the drive position, drive state, chunklet status. Code: showpd -c <PD#> The following two commands are also needed by support Code: showversion Code: showsys If replacing a single drive, I issue Code: servicemag start -log -pdid <PD#> The Inserv will begin preparing to take the magazine offline and log the chunklets normally bound for this magazine to other magazines in the system. Verify the magazine is ready to be pulled by issuing Code: servicemag status You should see SUCCEEDED when it's ready to be pulled. The orange indicator light on the magazine will be lit. Replace the drive in the magazine, put the magazine back in the Inserv and please wait for the orange light to go away or make sure all of the lights on the magazine are green and NOT blinking. Blinking lights indicate the drives are still spinning up upon initial insertion of the magazine. Back at command-line, type in Code: cmore showpd You should see the drive placement at the top and says NEW. This just shows you the Inserv sees the new drive and is ready to go. Issue the following to have the servicemag script resume the magazine. Code: servicemag resume <CAGE#> <MAGAZINE#> That is how it is done here.

A few things to add:
* showpd -c
  Be sure to check that the following columns are zero:
  - NORMAL USED OK, NORMAL USED FAIL, NORMAL UNUSED FREE
  - SPARE USED OK, SPARE USED FAIL and SPARE UNUSED FREE
  If any one is not zero, the drive is not ready to be swapped.
* servicemag start -log -pdid xx
  - The -log option is only needed on S, T and V class systems, where you have four drives on a magazine.
  - -log diverts write IO for the three remaining drives on that mag to other disks; the logged writes are played back to those drives during resume (read IO is served from parity for all four drives).
  - -log is the 3PAR-recommended option for large drives.
  - If the -log option is left out on S, T and V class systems, you will issue a full servicemag, which copies all data from the three remaining drives on that mag to other disks. This will take hours.
* If you run showpd after replacing the failed drive, the new drive may show status "degraded" instead of "new". This means that the drive is running old firmware. Just continue with the servicemag resume, as the drive will first be upgraded during the resume. If the drive shows "failed", try a reseat. |
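The two rules of thumb above (when to add -log, and what to do depending on the state the new drive reports) can be sketched as a small Python lookup. This is a hedged illustration only: the function names are hypothetical, the class letters come from the post, and the command without -log for other system classes is an assumption inferred from the post rather than something confirmed here.

```python
# Classes the post describes as having four drives per magazine.
FOUR_DRIVE_MAGAZINE_CLASSES = {"S", "T", "V"}


def servicemag_start_command(pd_id, system_class):
    """Build the start command, adding -log only for S, T and V class."""
    if system_class.upper() in FOUR_DRIVE_MAGAZINE_CLASSES:
        return "servicemag start -log -pdid %d" % pd_id
    # Assumption: other classes use the same command without -log.
    return "servicemag start -pdid %d" % pd_id


def action_after_insert(drive_state):
    """Map the state shown by showpd after the swap to the advice in the post."""
    state = drive_state.lower()
    if state == "new":
        return "run servicemag resume"
    if state == "degraded":
        return "run servicemag resume (old firmware is upgraded during resume)"
    if state == "failed":
        return "reseat the drive and check showpd again"
    return "state not covered by the post"


if __name__ == "__main__":
    print(servicemag_start_command(285, "T"))
    print(action_after_insert("degraded"))
```

Any state outside the three mentioned in the post is deliberately left unanswered instead of guessed.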
All times are UTC - 5 hours |