Any advice/info for vdump failures to DLT?

From: George Michaelson <ggm_at_dstc.edu.au>
Date: Wed, 03 Dec 1997 11:18:44 +1000 (EST)

We are having severe problems using TZ88 DLT devices as backup media for
our main RAID server. Any suggestions or advice on how to resolve this
would be very gratefully received. Right now, apart from the first filestore
being dumped, we have NO backup coverage on a 40Gb RAID server for 50 people
and, as you can imagine, the systems group is nervous beyond all recovery!

System:

        AlphaServer 800 5/333
        128Mb memory
        internal systems disk on scsi0 RZ28
                wide SCSI
        tz13 and tz14 TZ88 on scsi1 off psiop0/pci0 slot 12
                wide SCSI
        xcr0 RAID controller on pci0 slot 14
        re0/1/2/3/4 off xcr0. 9Gb barracudas.
                raid is fast/wide SCSI.

backup:

        /sbin/vdump -0 -u -b 64 -f /dev/nrmt0l $PART
          for PART in various filesystems off internal and RAID filestores.

        all filesystems are advfs. Some are >10GB occupied.
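
The dump loop described above might look roughly like the following. This is
a sketch only, not the actual /sysadm/backup script: the vdump flags are the
ones quoted above, while DUMP, TAPE and dump_all are names introduced here
for illustration, and the mount points in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch of a per-filesystem level-0 dump loop.
# DUMP and TAPE are illustrative variables, not part of the original script.
DUMP=${DUMP:-/sbin/vdump}
TAPE=${TAPE:-/dev/nrmt0l}

dump_all() {
    failed=0
    for PART in "$@"; do
        # Same flags as quoted above: level 0, update dump dates, 64k blocks.
        if "$DUMP" -0 -u -b 64 -f "$TAPE" "$PART"; then
            echo "OK $PART"
        else
            # In the else branch, $? still holds the dump command's status.
            echo "ERROR $PART dump return code = $?"
            failed=1
        fi
    done
    return $failed
}

# Hypothetical usage: dump_all / /usr /raid/home
```

Carrying on after a failure, rather than stopping, keeps one log line per
filesystem, which shows whether only the first saveset on a tape succeeds.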

        first backup to DLT usually works. subsequent backups fail with:

 vdump: unable to write to device </dev/nrmt0l>; [5] I/O error
 vdump: unable to prompt input for retry on device; [25] Not a typewriter
 /sysadm/backup/backup.1: ERROR dump return code = 1

This happens after some amount of data has been written. It's not an immediate
fail. It's not clear what the DLT does on failure, or what tape positioning
exists after the failure when the next vdump pass begins.
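
One way to find out what state the drive is in between passes would be to log
the output of mt status before each vdump. The sketch below assumes the
standard mt command with -f; tape_status and the MT variable are names made
up for this example.

```shell
# Sketch: record the drive state before the next dump pass.
# MT is an illustrative override for the mt command.
tape_status() {
    tape=${1:-/dev/nrmt0l}
    echo "== $(date) $tape"
    if "${MT:-mt}" -f "$tape" status 2>&1; then
        rc=0
    else
        rc=$?
    fi
    echo "== mt exit code $rc"
}

# Hypothetical usage, before each vdump call: tape_status /dev/nrmt0l
```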

Backups are being done on a multi-user system with live mounted filesystems.

I have been able to get vdump to work piping into dd bs=64k, which makes me
suspect there is some problem with the datarate mismatch coming off advfs
on RAID/fast-wide into wide SCSI. However, since this is always being done
on a 'fresh' DLT state, it may just be mimicking the ability to write to
the start of the DLT ok.
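
For reference, the dd workaround mentioned above could be written along these
lines. It is a sketch which assumes that vdump writes the saveset to standard
output when given "-f -"; dump_via_dd and the DUMP variable are illustrative
names.

```shell
# Sketch: write the saveset to stdout and let dd do the tape writes.
# DUMP is an illustrative override for the vdump path.
dump_via_dd() {
    part=$1
    tape=${2:-/dev/nrmt0l}
    "${DUMP:-/sbin/vdump}" -0 -u -b 64 -f - "$part" | dd of="$tape" bs=64k
}

# Hypothetical usage: dump_via_dd /usr /dev/nrmt0l
```

Note that a plain sh pipeline returns only dd's exit status, so a vdump
failure would go unnoticed here unless it is checked separately.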

We've tried reordering the SCSI chain to see if a specific DLT is the problem,
and changing from passive to active termination on the SCSI. We're about to try
a shorter cable (it's on 1m SCSI from the CPU with 1/3m between the DLTs, and
a notional 1m internal to each, I suppose). None of this seems to make any
difference.

The vdump man page mentions -N for norewind and -U for nounload. Surely with
/dev/nrmt<x> this is an irrelevancy? (sigh) I guess I'll try them too.

        -George
--
George Michaelson         |  DSTC Pty Ltd
Email: ggm_at_dstc.edu.au    |  University of Qld 4072
Phone: +61 7 3365 4310    |  Australia
  Fax: +61 7 3365 4311    |  http://www.dstc.edu.au

uerf -R -o full shows:
********************************* ENTRY     1. *********************************
----- EVENT INFORMATION -----
EVENT CLASS                             ERROR EVENT 
OS EVENT TYPE                  199.     CAM SCSI 
SEQUENCE NUMBER                 24.
OPERATING SYSTEM                        DEC OSF/1 
OCCURRED/LOGGED ON                      Wed Dec  3 04:46:11 1997
OCCURRED ON SYSTEM                      foxtail 
SYSTEM ID                 x0007001B
SYSTYPE                   x00000000
----- UNIT INFORMATION -----
CLASS                         x0001     TAPE 
SUBSYSTEM                     x0000     DISK 
BUS #                         x0001
                              x0068     LUN x0
                                        TARGET x5
----- CAM STRING -----
ROUTINE NAME                            ctape_iodone 
----- CAM STRING -----
                                        Unexpected CCB status 
----- CAM STRING -----
ERROR TYPE                              Hard Error Detected 
----- CAM STRING -----
DEVICE NAME                             DEC     TZ88     (C) DEC.TZ88 
----- CAM STRING -----
                                        Active CCB at time of error 
----- CAM STRING -----
                                        Command timed out 
ERROR - os_std, os_type = 11, std_type = 10
----- ENT_CCB_SCSIIO -----
*MY ADDR                  x07F8F580
CCB LENGTH                    x00C0
FUNC CODE            x01
CAM_STATUS                    x004B     CAM_CMD_TIMEOUT 
                                        SIM QFRZN 
PATH ID              1.
TARGET ID            5.
TARGET LUN           0.
CAM FLAGS                 x00000080
                                        CAM_DIR_OUT 
*PDRV_PTR                 x07F8F228
*NEXT_CCB                 x00000000
*REQ_MAP                  x07F80200
VOID (*CAM_CBFCNP)()      x00496F40
*DATA_PTR                 x400BA020
DXFER_LEN                 x00010000
*SENSE_PTR                x07F8F250
SENSE_LEN            x48
CDB_LEN              x06
SGLIST_CNT                    x0000
CAM_SCSI_STATUS               x0000     SCSI_STAT_GOOD 
SENSE_RESID          x00
RESID                     x00010000
CAM_CDB_IO           x00000000000000000001000A
CAM_TIMEOUT               x00000132
MSGB_LEN                      x0000
VU_FLAGS                      x0000
TAG_ACTION           x00
********************************* ENTRY     2. *********************************
----- EVENT INFORMATION -----
EVENT CLASS                             ERROR EVENT 
OS EVENT TYPE                  199.     CAM SCSI 
SEQUENCE NUMBER                 23.
OPERATING SYSTEM                        DEC OSF/1 
OCCURRED/LOGGED ON                      Wed Dec  3 04:46:11 1997
OCCURRED ON SYSTEM                      foxtail 
SYSTEM ID                 x0007001B
SYSTYPE                   x00000000
----- UNIT INFORMATION -----
CLASS                         x0022     DEC SIM 
SUBSYSTEM                     x0000     DISK 
BUS #                         x0001
                              x0068     LUN x0
                                        TARGET x5
----- CAM STRING -----
ROUTINE NAME                            ss_abort_done 
----- CAM STRING -----
                                        SCSI abort has been performed 
********************************* ENTRY     3. *********************************
----- EVENT INFORMATION -----
EVENT CLASS                             ERROR EVENT 
OS EVENT TYPE                  199.     CAM SCSI 
SEQUENCE NUMBER                 22.
OPERATING SYSTEM                        DEC OSF/1 
OCCURRED/LOGGED ON                      Wed Dec  3 04:46:11 1997
OCCURRED ON SYSTEM                      foxtail 
SYSTEM ID                 x0007001B
SYSTYPE                   x00000000
----- UNIT INFORMATION -----
CLASS                         x0022     DEC SIM 
SUBSYSTEM                     x0000     DISK 
BUS #                         x0001
                              x0068     LUN x0
                                        TARGET x5
----- CAM STRING -----
ROUTINE NAME                            ss_perform_timeout 
----- CAM STRING -----
                                        timeout on disconnected request 
----- UNSUPPORTED ENTRY -----
CAM ENTRY                 x0000040E     SIM_WS 
********************************* ENTRY     4. *********************************
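
The hex fields in ENTRY 1 can be read off with a little shell arithmetic. The
sketch below assumes the usual CAM layout, in which the low six bits of
CAM_STATUS hold the status code and bit 0x40 is the "SIM queue frozen" flag;
decode_ccb is a name made up for this example, and the unit of CAM_TIMEOUT is
not stated in the log.

```shell
# Sketch: decode CAM_STATUS, DXFER_LEN, RESID and CAM_TIMEOUT from a uerf entry.
decode_ccb() {
    # $1 = CAM_STATUS, $2 = DXFER_LEN, $3 = RESID, $4 = CAM_TIMEOUT (all hex)
    printf 'status code: 0x%02X\n' $(( $1 & 0x3F ))
    if [ $(( $1 & 0x40 )) -ne 0 ]; then
        echo "queue frozen: yes"
    else
        echo "queue frozen: no"
    fi
    echo "requested: $(( $2 )) bytes"
    echo "transferred: $(( $2 - $3 )) bytes"
    echo "timeout: $(( $4 ))"
}

# Values copied from ENTRY 1 above.
decode_ccb 0x4B 0x00010000 0x00010000 0x132
# status code: 0x0B
# queue frozen: yes
# requested: 65536 bytes
# transferred: 0 bytes
# timeout: 306
```

Read this way, the failing request is a single 64k write (matching -b 64)
that timed out with none of its data transferred.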
Received on Wed Dec 03 1997 - 02:29:24 NZDT
