Using smartmontools-5.38 with series 2/5/5Z controllers with firmware 17380 onwards

Posted in General by Phil Wilson

Hi folks, in the last few days I have come across an issue running smartmontools-5.38 with the series 2, 5 and 5Z controllers on firmware 17380 onwards with SATA drives.

Just a bit of background ..

By default (configurable using the expose_physicals module parameter), the aacraid driver exposes the physical disks attached to the controller to the operating system but prevents the disk driver from attaching to them by setting the no_uld_attach flag. The disks can still be accessed via the SCSI generic driver, although dangerous operations (e.g. writes) are blocked.

Programs like smartctl are then able to send SMART commands to a physical disk via the associated SCSI generic device node. If the disk is SATA (-d sat specified) then smartctl makes use of SAT (SCSI ATA translation) to “wrap up” the SMART command in a SCSI CDB (command descriptor block), which is then sent via the SCSI generic and aacraid drivers to the controller firmware.
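To make the “wrapping” a bit more concrete, here is a minimal Python sketch (not taken from smartctl) of how a non-data ATA command such as SMART RETURN STATUS could be packed into a 16-byte ATA PASS-THROUGH CDB. The function names are made up for illustration and the field layout is my reading of the SAT specification, so please check it against the spec before relying on it. The ck_cond argument sets the CK_COND bit that is discussed further down.

```python
ATA_PASS_THROUGH_16 = 0x85  # SCSI opcode of the 16-byte SAT pass-through CDB
PROTOCOL_NON_DATA = 3       # SAT PROTOCOL field value for non-data commands

ATA_SMART_CMD = 0xB0        # ATA SMART command
SMART_RETURN_STATUS = 0xDA  # SMART sub-command, carried in the features register
SMART_LBA_MID = 0x4F        # signature values every SMART command carries
SMART_LBA_HIGH = 0xC2


def build_sat16_non_data_cdb(command, features=0, count=0, lba_low=0,
                             lba_mid=0, lba_high=0, device=0, ck_cond=False):
    """Pack a 28-bit, non-data ATA command into an ATA PASS-THROUGH (16) CDB.

    Hypothetical helper: only the non-data case is handled, so the
    transfer direction and length fields in byte 2 are left at zero.
    """
    cdb = bytearray(16)
    cdb[0] = ATA_PASS_THROUGH_16
    cdb[1] = PROTOCOL_NON_DATA << 1     # bits 4:1 = protocol, bit 0 (EXTEND) clear
    cdb[2] = 0x20 if ck_cond else 0x00  # bit 5 = CK_COND
    cdb[4] = features                   # low byte of each 16-bit register pair
    cdb[6] = count
    cdb[8] = lba_low
    cdb[10] = lba_mid
    cdb[12] = lba_high
    cdb[13] = device
    cdb[14] = command
    return bytes(cdb)                   # byte 15 (CONTROL) stays zero


def smart_return_status_cdb():
    """CDB for SMART RETURN STATUS with the ATA registers requested back."""
    return build_sat16_non_data_cdb(ATA_SMART_CMD,
                                    features=SMART_RETURN_STATUS,
                                    lba_mid=SMART_LBA_MID,
                                    lba_high=SMART_LBA_HIGH,
                                    ck_cond=True)
```

Calling smart_return_status_cdb().hex(" ") gives 85 06 20 00 da 00 00 00 00 00 4f 00 c2 00 b0 00. The real smartctl code may set further bits in byte 2, so treat this only as an illustration of where the ATA registers end up in the CDB.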

The problem

From controller firmware 17380 onwards, the SAT layer (in the controller firmware) more completely implements the SAT specification, particularly with regard to commands that are sent where the host application (for example smartctl) requires the ATA register information to be read back. Typically the commands that would require this are ATA non-data protocol.

Specifically with smartctl, there are two commands used where the application needs to see the ATA registers to determine the status of the drive.

ATA_CHECK_POWER_MODE (ATA command E5h - smartctl needs to see COUNT in the register information to determine the power mode)

ATA_RETURN_STATUS (SMART command B0h with features DAh - smartctl needs to see the LBA mid and high values to determine whether the drive is reporting a SMART threshold exceeded error - for example using smartctl -H …).

When a SAT CDB is issued and the host application wishes to see the ATA register information, it sets the CK_COND (check condition) bit in the SAT passthrough CDB. This indicates to the SATL (the SCSI/ATA translation layer) that the ATA register information should be returned in an ATA status return descriptor. Firmware 17380 now implements this as per the SAT specification, which requires the SATL to generate a CHECK CONDITION. In response to a REQUEST SENSE (or, in this case, via autosense), the sense data comes back with the sense key set to RECOVERED ERROR and the additional sense code set to ATA PASS THROUGH INFORMATION AVAILABLE, and the ATA status return descriptor is included in that sense data.
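As an illustration of what comes back, here is a rough Python sketch that pulls the ATA registers out of descriptor-format sense data. The sample bytes are the ones from the dmesg output quoted in the first response below; the function names are hypothetical and the descriptor layout is my understanding of the SAT specification rather than code from smartctl.

```python
SENSE_KEY_RECOVERED_ERROR = 0x01
ASC_ASCQ_ATA_INFO_AVAILABLE = (0x00, 0x1D)  # ATA pass through information available
ATA_STATUS_RETURN_DESCRIPTOR = 0x09
ATA_DESCRIPTOR_LEN = 14                     # 2 header bytes + 0x0c payload bytes


def parse_ata_return_descriptor(sense):
    """Return the ATA registers as a dict, or None if they are not present."""
    if len(sense) < 8 or (sense[0] & 0x7F) not in (0x72, 0x73):
        return None  # not descriptor-format sense data
    if (sense[1] & 0x0F) != SENSE_KEY_RECOVERED_ERROR:
        return None
    if (sense[2], sense[3]) != ASC_ASCQ_ATA_INFO_AVAILABLE:
        return None
    end = min(len(sense), 8 + sense[7])  # byte 7 = additional sense length
    pos = 8
    while pos + 2 <= end:
        desc_type = sense[pos]
        desc_len = sense[pos + 1] + 2    # length byte excludes the 2-byte header
        if desc_type == ATA_STATUS_RETURN_DESCRIPTOR and pos + ATA_DESCRIPTOR_LEN <= end:
            d = sense[pos:pos + ATA_DESCRIPTOR_LEN]
            return {"error": d[3], "count": d[5], "lba_low": d[7],
                    "lba_mid": d[9], "lba_high": d[11],
                    "device": d[12], "status": d[13]}
        pos += desc_len
    return None


def smart_status(regs):
    """Interpret LBA mid/high after SMART RETURN STATUS."""
    signature = (regs["lba_mid"], regs["lba_high"])
    if signature == (0x4F, 0xC2):
        return "ok"
    if signature == (0xF4, 0x2C):
        return "threshold exceeded"
    return "unknown"


# Sense bytes as logged by the kernel in the first response below.
SAMPLE_SENSE = bytes.fromhex(
    "72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00 00 4f 00 c2 00 50")
```

With SAMPLE_SENSE, the parser finds LBA mid 4Fh and LBA high C2h, so smart_status() reports "ok". In other words, that drive answered SMART RETURN STATUS without reporting a threshold exceeded condition.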

Unfortunately, the CHECK CONDITION causes the aacraid driver to also return a “host_status” indicating that a controller internal error occurred (DID_ERROR). This propagates back to smartctl, which exits. My colleague is currently looking at the aacraid driver to see how to avoid this but, in case it helps anyone, here is a temporary workaround (a modification of smartctl behaviour) - note that this will ignore host_status SG_ERR_DID_ERROR, which isn’t desirable long-term. From the smartmontools-5.38 source dir …

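Save the diff below to a file (aacraid.diff here is just a placeholder name) and apply it to os_linux.cpp:

patch os_linux.cpp < aacraid.diff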


662a663
> #define LSCSI_DID_ERROR 0x7 /* Need to work around aacraid driver quirk */
787c788,791
< return -EIO; /* catch all */
---
> /* Check for DID_ERROR - workaround for aacraid driver quirk */
> if (LSCSI_DID_ERROR != io_hdr.host_status) {
> return -EIO; /* catch all if not DID_ERR */
> }

… and recompile
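For anyone who wants to see the effect of the change without reading the C++, here is a small Python model of the patched branch. It is only a sketch: the function names are hypothetical, only the host_status values visible in the diff are modelled, and the optional narrow mode (tolerating DID_ERROR only when the sense data carries the ATA pass-through information) is my own suggestion, not part of the patch above.

```python
import errno

DID_TIME_OUT = 0x3  # host_status values as defined in os_linux.cpp
DID_ERROR = 0x7


def has_ata_passthrough_info(sense):
    """True for descriptor sense with RECOVERED ERROR and ASC/ASCQ 00h/1Dh."""
    return (len(sense) >= 4
            and (sense[0] & 0x7F) in (0x72, 0x73)
            and (sense[1] & 0x0F) == 0x01
            and (sense[2], sense[3]) == (0x00, 0x1D))


def host_status_result(host_status, sense=b"", narrow=False):
    """Return 0 to carry on processing the response, or a negative errno.

    narrow=False mirrors the workaround: every DID_ERROR is ignored.
    narrow=True is a hypothetical stricter variant.
    """
    if host_status == 0:
        return 0
    if host_status == DID_TIME_OUT:
        return -errno.ETIMEDOUT
    if host_status == DID_ERROR:
        if not narrow or has_ata_passthrough_info(sense):
            return 0
    return -errno.EIO  # catch all
```

With narrow=False a genuine controller error would also be ignored, which is exactly why this is only a stopgap until the driver is fixed.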

I’ll post back as soon as I know more from my colleague.

thanks, Phil.

6 Responses to “Using smartmontools-5.38 with series 2/5/5Z controllers with firmware 17380 onwards”

  1. norrs Says:

    I guess this is why we’ve had issues with smartctl on our 5805 RAID card.
    After we flashed from the stock retail firmware (Adaptec RAID 5805 Firmware/BIOS Update Ver. 5.2.0 Build 16501 - 18.feb09) to the 24.jun09 17380 firmware (Adaptec RAID 5805 Firmware/BIOS Update Ver. 5.2.0 Build 17380), we had issues reading SMART values from our SATA disks.

    dmesg shows us:

    [877834.182094] scsi 0:1:4:0: [sg5] Sense Key : Recovered Error [current] [descriptor]
    [877834.182100] Descriptor sense data with sense descriptors (in hex):
    [877834.182103] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
    [877834.182112] 00 4f 00 c2 00 50
    [877834.182116] scsi 0:1:4:0: [sg5] Add. Sense: ATA pass through information available

    I guess this relates to the problem described above.

    Do you suggest we flash back to an older firmware? Will this be fine?

    Should we patch smartmontools and use a local copy until a new kernel driver is released? (I assume your firmware now actually implements the standard correctly, and we don’t want to go back to an older and “hacky” firmware just because it worked.)

    ETA on new kernel driver?

    Thanks
    Yet Another Tech :-)

  2. Phil Wilson Says:

    Hi Norrs, sorry for the very long delay in coming back to you on this. I wanted to try this with the older firmware to see what was happening and have just been able to get into my lab after vacation. With the earlier firmware, it appears that the CK_COND bit in the SAT CDB is ignored and no sense data is returned so in other words, SMART RETURN STATUS doesn’t work.

    I think, under these circumstances, smartctl relies on the attribute values (first 362 bytes of the device SMART data structure) returned in response to SMART READ DATA and does a comparison for attributes flagged as pre-fail against the thresholds (returned in response to SMART READ ATTRIBUTE THRESHOLDS). If SMART RETURN STATUS had worked, the drive would have done this by itself.
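A rough Python sketch of that fallback comparison, for illustration only. The table layout assumed here (a two-byte revision followed by 30 entries of 12 bytes, with bit 0 of the flags word marking a pre-fail attribute) is my understanding of the usual SMART data format, the function names are made up, and smartctl's own logic handles more special cases than this.

```python
import struct

TABLE_OFFSET = 2       # two-byte revision number precedes the entries
ENTRY_SIZE = 12
NUM_ENTRIES = 30       # 2 + 30 * 12 = 362 bytes
PREFAIL_FLAG = 0x0001  # bit 0 of the attribute flags word


def parse_attributes(smart_data):
    """Map attribute id -> (flags, current normalized value)."""
    attrs = {}
    for i in range(NUM_ENTRIES):
        offset = TABLE_OFFSET + i * ENTRY_SIZE
        attr_id, flags, current = struct.unpack_from("<BHB", smart_data, offset)
        if attr_id:  # id 0 marks an unused slot
            attrs[attr_id] = (flags, current)
    return attrs


def parse_thresholds(threshold_data):
    """Map attribute id -> threshold."""
    thresholds = {}
    for i in range(NUM_ENTRIES):
        offset = TABLE_OFFSET + i * ENTRY_SIZE
        attr_id, threshold = struct.unpack_from("<BB", threshold_data, offset)
        if attr_id:
            thresholds[attr_id] = threshold
    return thresholds


def failing_prefail_attributes(smart_data, threshold_data):
    """Ids of pre-fail attributes whose value is at or below the threshold."""
    attrs = parse_attributes(smart_data)
    thresholds = parse_thresholds(threshold_data)
    return [attr_id
            for attr_id, (flags, current) in sorted(attrs.items())
            if flags & PREFAIL_FLAG
            and attr_id in thresholds
            and current <= thresholds[attr_id]]
```

An empty list from failing_prefail_attributes() corresponds to the drive passing; any id in the list is a pre-fail attribute that has reached its threshold.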

    I would suggest using the newer firmware and a modified local copy of smartmontools. It would take some time to get this fixed in the driver and then it would need to be approved and merged with the driver at kernel.org before filtering through to distributions. I’ll post back here once I have some more news on this.

    thanks, Phil

  3. norrs Says:

    Thanks for your reply Phil.
    I’ll keep watching this nice blog for an update so we know when we can fetch a patch ;-)

    Thanks in advance.

  4. ondhest Says:

    I get this message when trying to patch:
    Hunk #2 FAILED at 788.
    1 out of 2 hunks FAILED — saving rejects to file os_linux.cpp.rej

    Anyone know what to do?

  5. Phil Wilson Says:

    Hi Ondhest, sorry about that - would you be able to try the unified diff below instead? If it doesn’t help (it works on my machine), let me know and I can email you directly.

    
    --- os_linux.cpp.orig   2009-11-19 10:12:48.000000000 +0000
    +++ os_linux.cpp        2009-11-19 10:52:24.000000000 +0000
    @@ -660,6 +660,7 @@
     #define SG_IO_RESP_SENSE_LEN 64 /* large enough see buffer */
     #define LSCSI_DRIVER_MASK  0xf /* mask out "suggestions" */
     #define LSCSI_DRIVER_SENSE  0x8 /* alternate CHECK CONDITION indication */
    +#define LSCSI_DID_ERROR 0x7 /* Need to work around aacraid driver quirk */
     #define LSCSI_DRIVER_TIMEOUT  0x6
     #define LSCSI_DID_TIME_OUT  0x3
     #define LSCSI_DID_BUS_BUSY  0x2
    @@ -784,7 +785,10 @@
                     (LSCSI_DID_TIME_OUT == io_hdr.host_status))
                     return -ETIMEDOUT;
                 else
    -                return -EIO;    /* catch all */
    +               /* Check for DID_ERROR - workaround for aacraid driver quirk */
    +               if (LSCSI_DID_ERROR != io_hdr.host_status) {
    +                       return -EIO;    /* catch all if not DID_ERR */
    +               }
             }
             if (0 != masked_driver_status) {
                 if (LSCSI_DRIVER_TIMEOUT == masked_driver_status)
    

    thanks, Phil

  6. Phil Wilson Says:

    Hi folks, just to add, the latest driver on the Adaptec support website, 1.1.5-24900, fixes the driver issue (returning DID_ERROR instead of DID_OK when a SAT command with CK_COND set causes a target to return a CHECK CONDITION).

    thanks, Phil
