Using smartmontools-5.38 with series 2/5/5Z controllers with firmware 17380 onwards
Posted in General by Phil Wilson
Hi folks, in the last few days I have come across an issue running smartmontools-5.38 with the series 2, 5 and 5Z controllers on firmware 17380 onwards with SATA drives.
Just a bit of background ..
By default (configurable using the expose_physicals module options parameter), the aacraid driver exposes the physical disks attached to the controller to the operating system but prevents the disk driver from attaching them by setting the no_uld_attach flag. The disks can be accessed via the SCSI generic driver though dangerous operations are blocked (e.g. writes).
Programs like smartctl are then able to send SMART commands to a physical disk via the associated SCSI generic device node. If the disk is SATA (-d sat specified), smartctl makes use of SAT (SCSI ATA Translation) to “wrap up” the SMART command in a SCSI CDB (command descriptor block), which is then sent via the SCSI generic and aacraid drivers to the controller firmware.
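To make the “wrap up” step concrete, here is a minimal sketch of how a non-data ATA command could be placed into a 16-byte ATA PASS-THROUGH CDB as laid out in the SAT specification. The type and function names are my own for illustration (they are not smartmontools or aacraid identifiers), and the transfer-direction and length bits are simply left at zero on the assumption that they do not matter when no data is transferred.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Illustrative names only; nothing here is taken from smartmontools or aacraid.
using Cdb16 = std::array<std::uint8_t, 16>;

// Build an ATA PASS-THROUGH (16) CDB for an ATA command that moves no data.
Cdb16 build_nondata_passthrough(std::uint8_t command, std::uint8_t feature,
                                std::uint8_t lba_mid, std::uint8_t lba_high,
                                bool ck_cond) {
    constexpr std::uint8_t kOpcode = 0x85;        // ATA PASS-THROUGH (16)
    constexpr std::uint8_t kProtocolNonData = 3;  // SAT PROTOCOL field value

    Cdb16 cdb{};                                  // all bytes start as zero
    cdb[0] = kOpcode;
    cdb[1] = static_cast<std::uint8_t>(kProtocolNonData << 1);   // PROTOCOL is bits 4:1
    cdb[2] = static_cast<std::uint8_t>(ck_cond ? 0x20 : 0x00);   // CK_COND is bit 5
    cdb[4] = feature;     // FEATURES (7:0)
    cdb[10] = lba_mid;    // LBA mid (7:0)
    cdb[12] = lba_high;   // LBA high (7:0)
    cdb[14] = command;    // ATA command register
    return cdb;
}

// SMART RETURN STATUS: command B0h, feature DAh, SMART signature 4Fh/C2h.
Cdb16 build_smart_return_status(bool ck_cond) {
    return build_nondata_passthrough(0xB0, 0xDA, 0x4F, 0xC2, ck_cond);
}

// Usage example: request the ATA registers back by setting CK_COND.
void example_build_cdb() {
    const Cdb16 cdb = build_smart_return_status(true);
    assert(cdb[0] == 0x85 && cdb[2] == 0x20 && cdb[14] == 0xB0);
}
```

In a real program the resulting bytes would be handed to the SCSI generic node through the SG_IO ioctl; that part is left out here to keep the sketch self-contained.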
The problem
From controller firmware 17380 onwards, the SAT layer (in the controller firmware) more completely implements the SAT specification, particularly with regard to commands that are sent where the host application (for example smartctl) requires the ATA register information to be read back. Typically the commands that would require this are ATA non-data protocol.
Specifically with smartctl, there are two commands used where the application needs to see the ATA registers to determine the status of the drive.
ATA_CHECK_POWER_MODE (ATA command E5h - smartctl needs to see COUNT in the register information to determine the power mode)
ATA_RETURN_STATUS (SMART command B0h with feature DAh - smartctl needs to see the LBA mid and high values to determine if the drive is reporting a SMART threshold exceeded error - for example using smartctl -H …).
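As a rough illustration of why those registers matter, the sketch below maps the returned values to a result. The register values are the standard ATA ones (COUNT 00h/80h/FFh for CHECK POWER MODE, LBA mid/high 4Fh/C2h or F4h/2Ch for SMART RETURN STATUS); the enum and function names are made up for this example and any other register values are simply reported as unknown.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative names, not smartmontools identifiers.
enum class PowerMode { Standby, Idle, ActiveOrIdle, Unknown };
enum class SmartStatus { Passed, ThresholdExceeded, Unknown };

// CHECK POWER MODE reports the current mode in the COUNT register.
PowerMode power_mode_from_count(std::uint8_t count) {
    switch (count) {
        case 0x00: return PowerMode::Standby;
        case 0x80: return PowerMode::Idle;
        case 0xFF: return PowerMode::ActiveOrIdle;
        default:   return PowerMode::Unknown;  // other codes are not handled in this sketch
    }
}

// SMART RETURN STATUS reports its result in the LBA mid and LBA high registers.
SmartStatus smart_status_from_lba(std::uint8_t lba_mid, std::uint8_t lba_high) {
    if (lba_mid == 0x4F && lba_high == 0xC2) return SmartStatus::Passed;
    if (lba_mid == 0xF4 && lba_high == 0x2C) return SmartStatus::ThresholdExceeded;
    return SmartStatus::Unknown;
}

// Usage example.
void example_decode_registers() {
    assert(power_mode_from_count(0xFF) == PowerMode::ActiveOrIdle);
    assert(smart_status_from_lba(0xF4, 0x2C) == SmartStatus::ThresholdExceeded);
}
```

Without the registers coming back from the controller, neither function has anything to work on, which is why these two commands are the ones affected.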
When a SAT CDB is issued and the host application wishes to see the ATA register information, it sets the CK_COND (check condition) bit in the SAT passthrough CDB. This indicates to the SATL (the SCSI/ATA translation layer) that the ATA register information should be returned in an ATA status return descriptor. Firmware 17380 now implements this as per the SAT specification, which requires the SATL to generate a check condition. In response to a request sense (or in this case using autosense), the sense key is set to RECOVERED ERROR and the additional sense code to ATA PASS THROUGH INFORMATION AVAILABLE, and the ATA status return descriptor is included in the sense data.
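For anyone wanting to recognise this particular response in their own code, a minimal sketch follows. It assumes descriptor-format sense data (response code 72h) and checks for sense key 1h (RECOVERED ERROR) with ASC/ASCQ 00h/1Dh (ATA PASS THROUGH INFORMATION AVAILABLE). The function name is hypothetical and this is not how smartctl itself is structured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when the sense data is the "not really an error" response that
// a SAT layer sends back to deliver the ATA registers after CK_COND was set.
// Hypothetical helper for illustration; only descriptor-format sense is handled.
bool is_ata_passthrough_info(const std::uint8_t* sense, std::size_t len) {
    if (sense == nullptr || len < 8) return false;   // need the 8-byte header
    if ((sense[0] & 0x7F) != 0x72) return false;     // 72h: current, descriptor format
    const std::uint8_t sense_key = sense[1] & 0x0F;  // sense key is the low nibble
    const std::uint8_t asc = sense[2];
    const std::uint8_t ascq = sense[3];
    return sense_key == 0x01 && asc == 0x00 && ascq == 0x1D;
}

// Usage example: an 8-byte header carrying RECOVERED ERROR, 00h/1Dh.
void example_classify_sense() {
    const std::uint8_t header[] = {0x72, 0x01, 0x00, 0x1D, 0x00, 0x00, 0x00, 0x0E};
    assert(is_ata_passthrough_info(header, sizeof(header)));
}
```

A check like this distinguishes the expected check condition from a genuine failure, which a blanket test on the host status cannot do.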
Unfortunately, the CHECK CONDITION causes the aacraid driver to also return a “host_status” indicating a controller internal error occurred (DID_ERROR). This propagates back to smartctl which exits. My colleague is currently looking at the aacraid driver to see how to avoid this but in case it helps anyone here is a temporary workaround (modification of smartctl behaviour) - note this will ignore host_status SG_ERR_DID_ERROR which isn’t desirable long-term. From the smartmontools-5.38 source dir …
patch -p1 < diff (where diff contains the changes below, against os_linux.cpp)
662a663
> #define LSCSI_DID_ERROR 0x7 /* Need to work around aacraid driver quirk */
787c788,791
< return -EIO; /* catch all */
---
> /* Check for DID_ERROR - workaround for aacraid driver quirk */
> if (LSCSI_DID_ERROR != io_hdr.host_status) {
> return -EIO; /* catch all if not DID_ERR */
> }
… and recompile
I’ll post back as soon as I know more from my colleague.
thanks, Phil.
August 7th, 2009 at 5:14 pm
I guess this is why we’ve had issues with smartctl on our 5805 raid-card.
After we flashed from the stock retail firmware (Adaptec RAID 5805 Firmware/BIOS Update Ver. 5.2.0 Build 16501 - 18.feb09) to the 24.jun09 17380 firmware (Adaptec RAID 5805 Firmware/BIOS Update Ver. 5.2.0 Build 17380), we had issues reading SMART values from our SATA disks.
dmesg shows us:
[877834.182094] scsi 0:1:4:0: [sg5] Sense Key : Recovered Error [current] [descriptor]
[877834.182100] Descriptor sense data with sense descriptors (in hex):
[877834.182103] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
[877834.182112] 00 4f 00 c2 00 50
[877834.182116] scsi 0:1:4:0: [sg5] Add. Sense: ATA pass through information available
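For reference, the bytes in that dmesg output can be decoded by hand. The sketch below walks the descriptor-format sense data, looks for the ATA status return descriptor (descriptor code 09h) and pulls out the registers, using the SAT layout of that descriptor. The struct and function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Low-order ATA registers carried in the ATA status return descriptor.
// Illustrative type, not a smartmontools structure.
struct AtaRegisters {
    std::uint8_t error, count, lba_low, lba_mid, lba_high, device, status;
};

// Walk the sense descriptors and fill *out from descriptor 09h if present.
bool find_ata_status_descriptor(const std::uint8_t* sense, std::size_t len,
                                AtaRegisters* out) {
    assert(sense != nullptr && out != nullptr);
    if (len < 8 || (sense[0] & 0x7F) != 0x72) return false;  // descriptor format only
    std::size_t end = 8 + static_cast<std::size_t>(sense[7]); // header + additional length
    if (end > len) end = len;
    std::size_t pos = 8;
    while (pos + 2 <= end) {
        const std::uint8_t* d = sense + pos;
        if (d[0] == 0x09 && d[1] >= 0x0C && pos + 14 <= end) {
            out->error = d[3];
            out->count = d[5];
            out->lba_low = d[7];
            out->lba_mid = d[9];
            out->lba_high = d[11];
            out->device = d[12];
            out->status = d[13];
            return true;
        }
        pos += 2 + static_cast<std::size_t>(d[1]);  // skip to the next descriptor
    }
    return false;
}

// Usage example with the 22 bytes from the dmesg output above.
void example_from_dmesg() {
    const std::uint8_t sense[] = {
        0x72, 0x01, 0x00, 0x1d, 0x00, 0x00, 0x00, 0x0e,
        0x09, 0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x4f, 0x00, 0xc2, 0x00, 0x50};
    AtaRegisters regs{};
    const bool found = find_ata_status_descriptor(sense, sizeof(sense), &regs);
    assert(found);
    assert(regs.lba_mid == 0x4F && regs.lba_high == 0xC2);
    assert(regs.status == 0x50);
}
```

If the command behind this log entry was SMART RETURN STATUS (an assumption, since the log does not show the CDB), LBA mid/high of 4Fh/C2h would mean the drive reported no threshold exceeded, so the sense data is carrying good news rather than an error.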
I guess this relates to the problem described above.
Do you suggest we flash back to an older firmware? Would that be fine?
Should we patch smartmontools and use a local copy until a new kernel driver is released? (I assume your firmware now actually implements the standard correctly, and we don’t want to go back to older and “hacky” firmware that actually worked.)
ETA on new kernel driver?
Thanks
Yet Another Tech
August 12th, 2009 at 1:01 pm
Hi Norrs, sorry for the very long delay in coming back to you on this. I wanted to try this with the older firmware to see what was happening and have just been able to get into my lab after vacation. With the earlier firmware, it appears that the CK_COND bit in the SAT CDB is ignored and no sense data is returned. In other words, SMART RETURN STATUS doesn’t work.
I think, under these circumstances, smartctl relies on the attribute values (first 362 bytes of the device SMART data structure) returned in response to SMART READ DATA and does a comparison for attributes flagged as pre-fail against the thresholds (returned in response to SMART READ ATTRIBUTE THRESHOLDS). If SMART RETURN STATUS had worked, the drive would have done this by itself.
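A rough sketch of that kind of comparison is below. It assumes the usual ATA layout for both 512-byte structures (a 2-byte revision number followed by 30 entries of 12 bytes, which is where the 362 bytes come from), treats bit 0 of the attribute flags as the pre-fail flag, and flags an attribute as failed when its normalised value is at or below the threshold. The function name is my own, thresholds are matched by attribute ID, and this is a simplification rather than a copy of what smartctl does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTableOffset = 2;   // 2-byte revision number comes first
constexpr std::size_t kEntrySize = 12;    // bytes per attribute / threshold entry
constexpr std::size_t kNumEntries = 30;   // 2 + 30 * 12 = 362 bytes in total

// Returns the ID of the first pre-fail attribute whose normalised value is at
// or below its threshold, or 0 if none is. Hypothetical helper for illustration.
// Both buffers must hold at least 362 bytes.
std::uint8_t first_failed_prefail(const std::uint8_t* values,
                                  const std::uint8_t* thresholds) {
    assert(values != nullptr && thresholds != nullptr);
    for (std::size_t i = 0; i < kNumEntries; ++i) {
        const std::uint8_t* attr = values + kTableOffset + i * kEntrySize;
        const std::uint8_t id = attr[0];
        const bool prefail = (attr[1] & 0x01) != 0;  // flags bit 0
        if (id == 0 || !prefail) continue;           // unused slot or advisory attribute
        const std::uint8_t current = attr[3];        // normalised current value
        for (std::size_t j = 0; j < kNumEntries; ++j) {
            const std::uint8_t* thr = thresholds + kTableOffset + j * kEntrySize;
            if (thr[0] != id) continue;              // match entries by attribute ID
            if (current <= thr[1]) return id;        // at or below threshold: failed
            break;
        }
    }
    return 0;
}
```

Called with the buffers returned by SMART READ DATA and SMART READ ATTRIBUTE THRESHOLDS, a non-zero result would correspond to the case where the drive itself would have reported a threshold exceeded condition.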
I would suggest using the newer firmware and a modified local copy of smartmontools. It would take some time to get this fixed in the driver and then it would need to be approved and merged with the driver at kernel.org before filtering through to distributions. I’ll post back here once I have some more news on this.
thanks, Phil
August 13th, 2009 at 4:35 pm
Thanks for your reply Phil.
I’ll keep watching this nice blog for an update so we know when we can fetch a patch
Thanks in advance.
November 18th, 2009 at 11:57 pm
I get this message when trying to patch:
Hunk #2 FAILED at 788.
1 out of 2 hunks FAILED — saving rejects to file os_linux.cpp.rej
Anyone know what to do?
November 19th, 2009 at 9:58 am
Hi Ondhest, sorry about that - would you be able to try this. If it doesn’t help (works on my machine) let me know and I can email you directly.
thanks, Phil
November 24th, 2009 at 8:20 am
Hi folks, just to add, the latest driver on the Adaptec support website, 1.1.5-24900, fixes the driver issue (returning DID_ERROR instead of DID_OK when a SAT command with CK_COND set causes a target to return a check condition).
thanks, Phil