The Woes of CAM mps0: SCSI Sense & mpssas_prepare_remove Errors on FreeNAS

Howdy folks, I have been having some problems with old Telsa recently. As you know, Telsa is a 22TB storage server which acts as a SAN for two Type 1 hypervisors in an HA cluster, so reliability is a must. A while back I bought three Rosewill RSV-SATA-Cage-34 cages to go in my 4U Codegen chassis, but they have been giving me a lot of trouble with CAM SCSI sense errors, so I decided to investigate.

I bought a brand new Seagate IronWolf 3TB from https://www.box.co.uk/ which was giving me a lot of trouble, and even when I swapped drives around it still gave me problems. I off-lined the drive and ran short and long SMART tests on it and could not find anything wrong with the disk, so I put it back in. It was fine for a good month, then I had the same problem again but on another drive. I changed the forward breakout SAS cable but still had the problem, and I even changed the Dell Perc H310I SAS controller, but still had problems. The errors hit random drives, as if the system was being picky.

Here are the SCSI sense errors, if you are interested:

(da6:mps0:0:16:0): Retrying command
(da6:mps0:0:16:0): READ(16). CDB: 88 00 00 00 00 01 09 0c 53 d0 00 00 00 40 00 00 
(da6:mps0:0:16:0): CAM status: CCB request completed with an error
(da6:mps0:0:16:0): Retrying command
(da6:mps0:0:16:0): READ(16). CDB: 88 00 00 00 00 01 09 0c 52 d0 00 00 01 00 00 00 
(da6:mps0:0:16:0): CAM status: SCSI Status Error
(da6:mps0:0:16:0): SCSI status: Check Condition
(da6:mps0:0:16:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
(da6:mps0:0:16:0): Retrying command (per sense data)
(da6:mps0:0:16:0): READ(16). CDB: 88 00 00 00 00 01 09 23 0a f8 00 00 00 40 00 00 
(da6:mps0:0:16:0): CAM status: Data Overrun error
(da6:mps0:0:16:0): Retrying command
(da6:mps0:0:16:0): READ(16). CDB: 88 00 00 00 00 01 09 23 0b 78 00 00 00 40 00 00 
(da6:mps0:0:16:0): CAM status: Data Overrun error
(da6:mps0:0:16:0): Retrying command
(da6:mps0:0:16:0): READ(16). CDB: 88 00 00 00 00 01 09 23 0b 38 00 00 00 40 00 00 
(da6:mps0:0:16:0): CAM status: Data Overrun error
(da6:mps0:0:16:0): Retrying command
(da6:mps0:0:16:0): READ(16). CDB: 88 00 00 00 00 01 09 23 0a f8 00 00 00 40 00 00 
(da6:mps0:0:16:0): CAM status: SCSI Status Error
(da6:mps0:0:16:0): SCSI status: Check Condition
(da6:mps0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da6:mps0:0:16:0): Retrying command (per sense data)
(da6:mps0:0:16:0): READ(16). CDB: 88 00 00 00 00 01 09 24 30 c8 00 00 00 40 00 00 
(da6:mps0:0:16:0): CAM status: SCSI Status Error
(da6:mps0:0:16:0): SCSI status: Check Condition
(da6:mps0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da6:mps0:0:16:0): Retrying command (per sense data)
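When a log like this runs to hundreds of lines, it helps to tally which errors hit which disk, since the pattern (one disk or many, CRC errors or resets) hints at whether the drive, the cabling or the power is at fault. Below is a minimal sketch in Python, assuming the log has been saved as plain text; the function name `tally_cam_errors` and the regular expression are my own illustration, not part of FreeNAS.

```python
import re
from collections import Counter

# Matches lines such as "(da6:mps0:0:16:0): CAM status: Data Overrun error".
# The pattern is an assumption based on the log format shown above.
LINE_RE = re.compile(r"^\((da\d+):[^)]*\): (CAM status|SCSI sense): (.+?)\s*$")


def tally_cam_errors(log_text):
    """Count CAM status and SCSI sense messages per device."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            device, kind, detail = match.groups()
            counts[(device, kind, detail)] += 1
    return counts


sample = """\
(da6:mps0:0:16:0): CAM status: Data Overrun error
(da6:mps0:0:16:0): Retrying command
(da6:mps0:0:16:0): CAM status: Data Overrun error
(da6:mps0:0:16:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
"""

for (device, kind, detail), n in tally_cam_errors(sample).most_common():
    print(f"{n:3d}  {device}  {kind}: {detail}")
```

Errors spread across several unrelated disks would point away from any single drive, which matches what I saw.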

I decided to buy an X-Case RM212 Pro 2U 12-bay hot-swap chassis from eBay, which came with a Seasonic SS-600H2U power supply. The case came with two SAS backplanes, one with a SAS expander and the other without. I moved Telsa to the new chassis and installed the backplane with the SAS expander, as I'm only using one SAS controller. That was great for a month, and then the problem came back, but this time it was related to da10, a Seagate 1TB ST1000DM010-2EP1 CC43 drive in the iSCSI array with the 6-disk mirror. This was annoying: my inbox was bombarded with alert emails, and this time the drive was actually being detached, which is scary. A drive dropping out of a high disk I/O array that backs my two hypervisors is not good. I off-lined this drive and could not find any problems with it, so I put it back in. It was fine for 10 hours, then the same thing happened again. Output of dmesg:

mps0: mpssas_prepare_remove: Sending reset for target ID 15
da10 at mps0 bus 0 scbus0 target 15 lun 0
mps0: da10: <ATA ST1000DM010-2EP1 CC43> s/n Z9AR1MFP detached
(da10:mps0:0:15:0): WRITE(10). CDB: 2a 00 02 2b 5a c0 00 00 60 00
Unfreezing devq for target ID 15
(da10:mps0:0:15:0): CAM status: CCB request aborted by the host
(da10:mps0:0:15:0): Error 5, Periph was invalidated
(da10:mps0:0:15:0): Periph destroyed
mps0: SAS Address for SATA device = 3c2f56516485484e
mps0: SAS Address from SATA device = 3c2f56516485484e
da10 at mps0 bus 0 scbus0 target 15 lun 0
da10: <ATA ST1000DM010-2EP1 CC43> Fixed Direct Access SPC-4 SCSI device
da10: Serial Number Z9AR1MFP
da10: 600.000MB/s transfers
da10: Command Queueing enabled
da10: 953869MB (1953525168 512 byte sectors)
da10: quirks=0x8<4K>
ses0: da10,pass11: Element descriptor: 'ArrayDevice04'
ses0: da10,pass11: SAS Device Slot Element: 1 Phys at Slot 4
ses0:  phy 0: SATA device
ses0:  phy 0: parent 500605b0000274bf addr 500605b0000274a4
mps0: mpssas_prepare_remove: Sending reset for target ID 15
da10 at mps0 bus 0 scbus0 target 15 lun 0
mps0: da10: Unfreezing devq for target ID 15
<ATA ST1000DM010-2EP1 CC43> s/n Z9AR1MFP detached
(da10:mps0:0:15:0): Periph destroyed
mps0: SAS Address for SATA device = 3c2f56516485484e
mps0: SAS Address from SATA device = 3c2f56516485484e
da10 at mps0 bus 0 scbus0 target 15 lun 0
da10: <ATA ST1000DM010-2EP1 CC43> Fixed Direct Access SPC-4 SCSI device
da10: Serial Number Z9AR1MFP
da10: 600.000MB/s transfers
da10: Command Queueing enabled
da10: 953869MB (1953525168 512 byte sectors)
da10: quirks=0x8<4K>
ses0: da10,pass11: Element descriptor: 'ArrayDevice04'
ses0: da10,pass11: SAS Device Slot Element: 1 Phys at Slot 4
ses0:  phy 0: SATA device
ses0:  phy 0: parent 500605b0000274bf addr 500605b0000274a4
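To see how often a disk is being dropped and re-attached, the reset and detach messages can be counted straight from the dmesg text. This is a rough sketch in Python under the assumption that the messages look exactly like the ones above; `count_drop_events` and both regular expressions are hypothetical helpers of mine, not anything shipped with the mps driver.

```python
import re
from collections import Counter

# Patterns assumed from the dmesg output above.
RESET_RE = re.compile(r"mpssas_prepare_remove: Sending reset for target ID (\d+)")
DETACH_RE = re.compile(r"s/n (\S+) detached")


def count_drop_events(dmesg_text):
    """Return (resets per target ID, detaches per drive serial number)."""
    resets = Counter(int(target) for target in RESET_RE.findall(dmesg_text))
    detaches = Counter(DETACH_RE.findall(dmesg_text))
    return resets, detaches


sample = """\
mps0: mpssas_prepare_remove: Sending reset for target ID 15
mps0: da10: <ATA ST1000DM010-2EP1 CC43> s/n Z9AR1MFP detached
mps0: mpssas_prepare_remove: Sending reset for target ID 15
<ATA ST1000DM010-2EP1 CC43> s/n Z9AR1MFP detached
"""

resets, detaches = count_drop_events(sample)
print("resets by target:", dict(resets))
print("detaches by serial:", dict(detaches))
```

Matching on the serial number rather than the device name keeps the count correct even when the kernel interleaves two messages on one line, as it did in the second detach above.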

But here is what I find strange: zpool status and camcontrol devlist show no problems at all,

zpool status iSCSI

  pool: iSCSI
 state: ONLINE
  scan: resilvered 124M in 0 days 00:00:03 with 0 errors on Fri Jan 15 15:40:31 2021
config:

    NAME                                            STATE     READ WRITE CKSUM
    iSCSI                                           ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/7c39221d-ea2c-11e9-b7f4-0017087d2e9a  ONLINE       0     0     0
        gptid/7cea36c6-ea2c-11e9-b7f4-0017087d2e9a  ONLINE       0     0     0
        gptid/470b7803-36c0-11eb-80e1-00074309b09a  ONLINE       0     0     0
      mirror-1                                      ONLINE       0     0     0
        gptid/7d87604a-ea2c-11e9-b7f4-0017087d2e9a  ONLINE       0     0     0
        gptid/7e3eae42-ea2c-11e9-b7f4-0017087d2e9a  ONLINE       0     0     0
        gptid/f5b603a3-36d6-11eb-80e1-00074309b09a  ONLINE       0     0     0

errors: No known data errors

camcontrol devlist

<ATA ST1000DM003-1ER1 CC45>        at scbus0 target 0 lun 0 (pass0,da0)
<ATA ST1000DM003-1ER1 CC45>        at scbus0 target 1 lun 0 (pass1,da1)
<ATA ST1000DM010-2EP1 CC43>        at scbus0 target 2 lun 0 (pass2,da2)
<ATA ST1000DM010-2EP1 CC43>        at scbus0 target 3 lun 0 (pass3,da3)
<ATA ST3000DM003-2AE1 0001>        at scbus0 target 8 lun 0 (pass4,da4)
<ATA ST3000VN007-2AH1 SC60>        at scbus0 target 9 lun 0 (pass5,da5)
<ATA ST3000VN007-2E41 SC60>        at scbus0 target 10 lun 0 (pass6,da6)
<ATA ST3000DM008-2DM1 CC26>        at scbus0 target 11 lun 0 (pass7,da7)
<ATA ST3000VN000-1HJ1 SC60>        at scbus0 target 12 lun 0 (pass8,da8)
<ATA ST3000VN007-2AH1 SC60>        at scbus0 target 13 lun 0 (pass9,da9)
<GOOXI Bobcat 0d00>                at scbus0 target 14 lun 0 (ses0,pass10)
<ATA ST1000DM010-2EP1 CC43>        at scbus0 target 15 lun 0 (da10,pass11)
<ATA ST1000DM010-2EP1 CC43>        at scbus0 target 16 lun 0 (pass12,da11)

But I found the problem. When I first started using this chassis with its power supply, I noticed that the fan on the power supply was always screaming and stayed at full RPM, which I thought was a bit strange because Seasonic's documentation shows that the fan is thermally controlled. I investigated and noticed that the case came with a single Molex feeding a 3-way splitter, which is scary: all those amps were being pulled through a single 12V 15A rail. This power supply has three 12V rails: 12V1 at 15A, 12V2 at 15A and 12V3 at 17A. The six 3TB Seagate IronWolf drives pull around 2 amps each, plus peaks on top of that, and the six 1TB Seagate drives in the 6-disk mirror pull 2.5 amps each. That is scary.
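To put numbers on it, here is a back-of-the-envelope sketch in Python using the figures above. It assumes all twelve drives sat behind the single splitter and that the per-drive figures are 12V current; it ignores spin-up peaks, and the even three-way split at the end is only a rough guess at how the load spreads once each backplane connector has its own feed.

```python
# Rating of one 12V rail, as quoted for this power supply.
RAIL_LIMIT_A = 15.0

# (number of drives, approximate amps per drive), figures from the post.
drive_groups = {
    "3TB drives": (6, 2.0),
    "1TB drives": (6, 2.5),
}


def total_current(groups):
    """Sum the current drawn by every drive group, in amps."""
    return sum(count * amps for count, amps in groups.values())


total = total_current(drive_groups)
print(f"Total draw: {total:.1f} A on a {RAIL_LIMIT_A:.1f} A rail "
      f"({total / RAIL_LIMIT_A:.0%} of its rating)")

# Assumption: three dedicated feeds share the load evenly.
per_feed = total / 3
print(f"Per feed with three dedicated feeds: {per_feed:.1f} A")
```

On those assumptions the splitter was asking one rail for roughly 27A, well beyond its 15A rating, while three separate feeds bring each one down to about 9A.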

I should have noticed the adapter, but it didn't look like an adapter, as you can see here:

I removed that adapter, and the backplane now has three dedicated Molex power feeds from the rails. It is a lot happier, and the fan is no longer screaming like it was.

https://www.truenas.com/community/attachments/1611103563033-png.44429/

So the adapter was causing the problem. My guess is that the drive was spinning down and then back up before the OS detected it.

Here are some pics of the new chassis.

Overall, Telsa is back up and running, and I do not recommend these Rosewill RSV-SATA-Cage-34 cages.
