node rescue fails

cheese2
Posts: 13
Joined: Thu Oct 05, 2017 5:24 am

Re: node rescue fails

Post by cheese2 »

Well I have no idea where our purchasing department sourced the replacement from, but it definitely wouldn't have been eBay. It appears to have been "refurbished" but apparently not enough.

I have checked our original serial number against the public warranty checker and it shows as expired - the dates correspond to when it would have been sold to its original owner. It was a proper factory refurb when it was supplied to us, but the records weren't updated when that happened (which is consistent with all the other HP/E hardware we acquired in the same way).
MammaGutt
Posts: 1578
Joined: Mon Sep 21, 2015 2:11 pm
Location: Europe

Re: node rescue fails

Post by MammaGutt »

The HP/HPE system serial number of the node you are booting is: CZ34029001
The 3PAR system serial number of your system is: 1610528

On all systems I've seen, the last few numbers should match so my guess is that you have the issue others are mentioning, that the node you got isn't a "clean" spare part but simply a node pulled from another working system.
The views and opinions expressed are my own and do not necessarily reflect those of my current or previous employers.
cheese2
Posts: 13
Joined: Thu Oct 05, 2017 5:24 am

Re: node rescue fails

Post by cheese2 »

Hmmmm... I'm not seeing where the HP/E serial of our good node matches the 3PAR serial either. However, looking at one of our other 3PARs, both nodes do indeed have matching HP/E serials, and they match the 3PAR serial. Weird.
MammaGutt
Posts: 1578
Joined: Mon Sep 21, 2015 2:11 pm
Location: Europe

Re: node rescue fails

Post by MammaGutt »

Okay....

Does the serial of the other node in your Frankenstein3PAR match the serial of the replacement?


[ 38.528755] Assembly Serial Number: PCMBUA8TM5M0ZV <--- This should be unique per node
[ 38.538626] Assembly Part Number: QR483-63001 <--- This needs to be the same for the entire cluster
[ 38.548148] Saleable Serial Number: CZ34029001 <--- This needs to be the same for the entire cluster
[ 38.557323] Saleable Product Number: QR483A <--- This needs to be the same for the entire cluster
[ 38.567360] Spare Part Number: 683246-001 <--- This needs to be the same for the entire cluster
cheese2
Posts: 13
Joined: Thu Oct 05, 2017 5:24 am

Re: node rescue fails

Post by cheese2 »

From the good node:

Assembly Serial Number: PCMBUA3TM3K03U <--- This should be unique per node It is.
Assembly Part Number: QR483-63001 <--- This needs to be the same for the entire cluster It is.
Saleable Serial Number: 2MD25201SL <--- This needs to be the same for the entire cluster It is not.
Saleable Product Number: QR483A <--- This needs to be the same for the entire cluster It is.
Spare Part Number: 683246-001 <--- This needs to be the same for the entire cluster It is.

So apparently it's the Saleable Serial Number that needs changing or clearing. I understand there is a way to do this from the whack prompt, but I'm reluctant to try it without some documentation and/or guidance.

In the meantime we are chasing our purchasing department to see where they got the replacement from. We did not request or authorize an unrefurbished part so hopefully they can be convinced to source a clean one from somewhere.

I am on a three week tour of sites and facilities so I may be slow updating this thread, but I appreciate all the help.
sjm
Posts: 62
Joined: Mon Jul 29, 2013 9:01 pm

Re: node rescue fails

Post by sjm »

We had a similar fault on our 7400, but it was under warranty.

The HPE tech brought a switch and hard-cabled his laptop and node0 (the good working node) to node1, and then did the rescue; he said some network switches etc. have issues.

It was just a cheap switch from a tech shop, but it allowed him to set speed and duplex manually.

Worked very well.
cheese2
Posts: 13
Joined: Thu Oct 05, 2017 5:24 am

Re: node rescue fails

Post by cheese2 »

Just a quick update on this. We have received advice from the seller of the replacement node that running Node Rescue from the SP will reset the serial number, whereas node-to-node rescue doesn't. However, only a hardware SP can perform a rescue, because it requires a serial connection to the node as well as network - or is there a way to do it from the virtual SP?

Another line of enquiry has yielded this document:
https://support.hpe.com/hpsc/doc/public/display?docId=mmr_kc-0111944 which addresses a different issue but contains a procedure for initiating node rescue manually, and it includes the set perm sys_serial command. It makes no mention of requiring a serial port connection so I'm wondering whether it could work with a virtual SP.

Has anyone tried a node rescue from a virtual SP?
cheese2
Posts: 13
Joined: Thu Oct 05, 2017 5:24 am

Re: node rescue fails

Post by cheese2 »

Further information: Entering whack and running set perm sys_serial=2MD25201SL rejects the alphanumeric serial. It will accept set perm sys_serial=1610528 but it already had this number and still won't join the cluster. Watching the serial output of the new node as noderescue runs you can see the good node running set perm sys_serial=1610528 on it as part of the automated process. Noderescue still fails in the same way - this doesn't seem to be what we need.
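For reference, the two attempts looked roughly like this (paraphrased from my console session, so treat the exact prompts and wording as approximate):

Code:

Whack>set perm sys_serial=2MD25201SL   <--- rejected: the alphanumeric value is not accepted
Whack>set perm sys_serial=1610528      <--- accepted, but the node already had this value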

So I took a stab in the dark and tried set perm saleable_serial=2MD25201SL and it seemed to run and accept the value; however, later in the boot process the new node reports its Saleable Serial as CZ34029001, and it still won't join the cluster, so that isn't what we need either. I wish there were a way to read the value of these parameters from within whack rather than setting them blindly, but I don't know enough about the syntax to take a guess.
cheese2
Posts: 13
Joined: Thu Oct 05, 2017 5:24 am

Re: node rescue fails

Post by cheese2 »

Minor Progress!

I have succeeded in changing the saleable serial number of the replacement node to match the existing good node! The whack command prom hp displays fields from the EEPROM containing identifying information, and the command prom hp edit lets you edit them line by line. This is where some information from an internal support document I read about two years ago came back to me (oh, if only I still had access to it now...). On boot the system verifies the integrity of the EEPROM data with a checksum and will halt if it doesn't match up, so I believe any time you make a change to the EEPROM you need to follow it with prom checksum to update it. I did this and it resulted in an immediate fatal error; however, after a hard reset the system booted and reported the correct saleable serial during the boot process.
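For anyone following along, the sequence was roughly this (from memory, so the exact prompts and field names may differ on your system):

Code:

Whack>prom hp        <--- displays the identifying fields stored in the EEPROM
Whack>prom hp edit   <--- steps through the fields; I set Saleable Serial to 2MD25201SL
Whack>prom checksum  <--- updates the EEPROM checksum so the edited data passes verification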

The node still fails to join the cluster after a rescue. Nothing has changed in that respect. However what I'm counting as a tiny bit of progress is this:

Code:

cli% shownode
                                                               Control    Data        Cache
Node --Name--- -State- Master InCluster -Service_LED ---LED--- Mem(MB) Mem(MB) Available(%)
   0 1610528-0 OK      Yes    Yes       Off          GreenBlnk    8192    8192          100
   1           Failed  No     No        Unknown      Unknown         0       0            0

It now recognizes that node 1 exists! Previously only node 0 was listed.
Last edited by cheese2 on Tue Dec 12, 2017 8:57 am, edited 1 time in total.
cheese2
Posts: 13
Joined: Thu Oct 05, 2017 5:24 am

Re: node rescue fails

Post by cheese2 »

Success!

Analysis of the serial port boot logs of the replacement node revealed the following error:

Code:

Prom Node ID Value and Slot ID mismatch. NodeID: 0 SlotID: 1

Which leads me to think there is a node ID value stored in the EEPROM along with the serial numbers. Running prom edit steps through a different set of values than the prom hp edit I tried yesterday, and sure enough Node ID is one of them:

Code:

Whack>prom edit
Board Spin:       04
Size * 256 bytes: 04
Board Class:      920
Board Base:       200040
Board Rev:        A8
Assembly Vendor:  FXN
Assembly Year:    2013
Assembly Week:    44
Assembly Day:     04
Assembly Serial:  02021391
System Serial:    1610528
Node ID:          00
Midplane Type:    1b
Node Type:        40
W19:              0fffffff
Whack>

Now, this node is supposed to be node 1, not node 0, so I changed Node ID to 01, ran a quick prom checksum, which returned PASS instead of a fatal error, and then ran go to complete the boot process. The node came up and after a few moments joined the cluster!

Code:

cli% shownode -i -svc
-------------------------------------------------------------------------Nodes--------------------------------------------------------------------------
Node --Name--- -Manufacturer- ---Assem_Part--- --Assem_Serial-- -Saleable_Serial- --Saleable_PN--- ----Spare_PN---- -------Model_Name------- -Assem_Rev-
   0 1610528-0 FXN            QR483-63001      PCMBUA3TM3K03U   2MD25201SL        QR483A           683246-001       HP_3PAR 7400                    004
   1 1610528-1 FXN            QR483-63001      PCMBUA8TM5M0ZV   2MD25201SL        QR483A           683246-001       HP_3PAR 7400                    004
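So, to summarize the whole fix for anyone who finds this thread later (the values shown are from my system, yours will differ, and I make no promises this is safe on other models or software versions):

Code:

Whack>prom hp edit   <--- set Saleable Serial to match the surviving node (2MD25201SL here)
Whack>prom checksum  <--- update the EEPROM checksum
Whack>prom edit      <--- set Node ID to match the slot the node sits in (01 here)
Whack>prom checksum  <--- should return PASS
Whack>go             <--- continue the boot; the node then joined the cluster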

There is still an error in the management console:

Code:

Cage 0, Interface Card 1 Failed (Interface Card Firmware Unknown {0x0} )
that needs further troubleshooting, but that may be a separate issue. Needless to say, I am *very* relieved to have this up and running. Thanks, everyone, for your ideas and suggestions throughout this process.