SUMMARY: slow backups over the network

From: Roy, Kim <kroy_at_ss.ca.gov>
Date: Tue, 28 Mar 2000 09:54:07 -0800

Hello,

Thanks to everyone who responded to my question. All suggestions were very
helpful.

I ended up doing a number of things. I installed patch 3 for 4.0F, installed
the special kzpba patch, and upgraded Legato to 5.5.2. I also ensured all
100mb/fd settings were hardcoded on my systems and related network
equipment. So far my backups have been considerably faster. The 65gb volume
that was taking 15 hrs to back up is now taking 5 hrs! It's only been about
4 days since all the changes, but so far so good! Thanks again!

Some suggestions from the list below:

From Tom Webster:

I need some additional clarification -- Is your problem with the total
amount of time that NSR takes to backup a number of servers, or is it
with the amount of time that NSR takes to backup large volumes on
specific servers?

If it is the former, and you have a lot of medium sized servers, it
is a matter of tweaking server parallelism and the number of sessions
per device until you get better through-put. You are going to want to
increase the number of clients and the parallelism until you can
saturate the server's link. You will also need to adjust the number
of sessions per device to spread the load across all of your drives
as best you can. This will maximize the through-put of multiple
clients (with multiple filesystems).

If your problem is that you have a couple of systems, which have
large disk arrays configured as large logical disks, then it is
a slightly different problem. NSR is only going to run a single
backup stream from any filesystem. If the filesystems are huge,
it is then a function of how fast the single NSR process can
move the data to tape.
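
The two cases above can be put side by side with a rough model. This is only a sketch: the function name, the 1.5MB/s per-stream rate, and the 6.0MB/s aggregate ceiling are illustrative assumptions, not measured values or NSR settings.

```python
def backup_window_hours(filesystem_gb, stream_mb_s, aggregate_mb_s):
    """Lower bound on the backup window, in hours.

    filesystem_gb  -- sizes of the filesystems to save, in GB
    stream_mb_s    -- assumed rate of one save stream, in MB/s
    aggregate_mb_s -- assumed ceiling for all streams together (link or drives)
    """
    sizes_mb = [gb * 1024 for gb in filesystem_gb]
    # One stream per filesystem: the largest filesystem sets a floor.
    largest_fs_s = max(sizes_mb) / stream_mb_s
    # All of the data must also fit through the shared link and tape drives.
    total_s = sum(sizes_mb) / aggregate_mb_s
    return max(largest_fs_s, total_s) / 3600


# Same total data, split two ways (hypothetical layouts).
many_small = backup_window_hours([6.5] * 10, stream_mb_s=1.5, aggregate_mb_s=6.0)
one_large = backup_window_hours([65], stream_mb_s=1.5, aggregate_mb_s=6.0)
print(f"ten 6.5GB filesystems: {many_small:.1f} h")
print(f"one 65GB filesystem:   {one_large:.1f} h")
```

With many filesystems, raising parallelism helps until the aggregate ceiling is reached. With one huge filesystem, only a faster single stream shortens the window.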

You mention that you have backups that are running 15 hours for a
65GB volume.

   65GB    66560MB
   ---- == ------- == 1.23MB/s
   15hr     54000s

This is pretty slow for a TZ88 if nothing else is being backed up.
You should see something around 1.5-1.7MB/s if memory serves.
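
The same arithmetic as a small helper, in case you want to rerun it for other volumes. The function name is made up for illustration; the 15 hour and 5 hour figures are the ones reported in this thread for the 65GB volume.

```python
def throughput_mb_s(size_gb, hours):
    """Average rate in MB/s for size_gb moved in the given number of hours."""
    return (size_gb * 1024) / (hours * 3600)


before = throughput_mb_s(65, 15)  # the 15 hour run from the original post
after = throughput_mb_s(65, 5)    # the 5 hour run reported after the changes
print(f"before: {before:.2f} MB/s, after: {after:.2f} MB/s")
```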

Things to look at:

1. The I/O subsystem on the backup server.

   a. Try to make sure that the tape drives are on their own bus.
      Away from disk drives.

   b. On a TL812, don't have any more than two drives on a bus.
      One drive per bus is better, but two should work well.

   c. Check OS patches. I think there are patches out against
      both the KZPSA adapters and the FWD Qlogic adapters.

   d. Check your adapter and drive firmware revs.

   How fast can you spool data to tape from the local system?

2. If the remote systems are slower than the local system
   and you have plenty of bandwidth, check the I/O subsystem on
   the client.

   a. Is the client properly tuned? Are you trying to backup
      multiple RAID sets that share spindles at the same time?
      This can increase seek times and slow the process.

   b. How fast can you move data on and off of the array?

3. If both systems' I/O seems OK, are you sure you aren't having
   a network problem? Try FORCING all of the systems to 100FD
   rather than relying on auto-negotiation. A duplex mismatch
   can be hard to spot, but will hose up xfer rates. Check
   how fast you can FTP data from the client to the server.
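
If FTP is not handy, a bulk TCP transfer can be timed with a few lines of Python. This is a rough sketch, not a benchmark tool: `run_sink` and `measure_mb_s` are names invented here, the demo runs over loopback only, and the timing stops when the last chunk is handed to the kernel, so short runs will read slightly high.

```python
import socket
import threading
import time


def run_sink(server_sock):
    """Accept one connection and discard everything it sends."""
    conn, _ = server_sock.accept()
    with conn:
        while conn.recv(65536):
            pass


def measure_mb_s(host, port, total_mb=64):
    """Send total_mb of zeros to host:port and return the rate in MB/s."""
    chunk = bytes(1024 * 1024)
    start = time.perf_counter()
    with socket.create_connection((host, port)) as sock:
        for _ in range(total_mb):
            sock.sendall(chunk)
    elapsed = time.perf_counter() - start
    return total_mb / elapsed


if __name__ == "__main__":
    # Loopback demo. For a real test, run the sink on the backup server
    # and the sender on the client.
    server = socket.create_server(("127.0.0.1", 0))
    port = server.getsockname()[1]
    sink = threading.Thread(target=run_sink, args=(server,), daemon=True)
    sink.start()
    print(f"{measure_mb_s('127.0.0.1', port):.1f} MB/s")
    sink.join()
    server.close()
```

Because the payload never touches disk on either end, a slow result here points at the network rather than at the client's or server's I/O subsystem.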

If none of that fixes the problem, there are some other things
you can do:

1. Upgrade the TZ88's to TZ89's, but like I noted your speed
   seems low already.

2. Rather than a separate dedicated SAN style network, you
   should be able to add interfaces onto an additional VLAN
   on the CISCOs if you have the ports and the NICs. If the
   switches aren't saturated, but you are getting a lot of
   broadcast traffic this could help.

3. Make the systems with large arrays storage nodes, with
   locally attached tape drives.

4. In the case of really large arrays that need to be backed
   up quickly, there are a number of vendors that make
   tape RAID arrays -- i.e. five tape drives acting as a
   single drive, which allow you to spool 3-4 times the data
   to the logical drive. I don't know if NSR supports them.

From Stan Horwitz:

I recently started maintaining a Legato NetWorker mailing list. The topic
you raised was discussed there quite heavily last week. You might want to
subscribe to the list and check out the archived postings. You can do so
on the Web at http://listserv.temple.edu/archives/networker.html

From William H. Magill:

Make DAMN certain that the switch and the CPU are both HARD configured for
100/FDX - do not trust the autoconfig. The auto speed detection works, but
the auto-config (duplex detection) does not. Not on Cisco, not on 3Com, not
on Alpha... we've got all of them. The problem is with the "mix and
match."

The dirty secret of this industry auto-config problem is that there are
no detectable symptoms EXCEPT - large file transfers take forever.
Interactive response seems normal. Any kind of short transaction oriented
activity seems normal. No errors are logged in any counters that anybody
has been able to show me. The only "indicator" is that large transfers take
forever.
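
Since the only indicator is a slow bulk transfer, one way to make that concrete is to compare a measured large-transfer rate against the nominal line rate. This is a sketch under loud assumptions: the function names are invented, the 25% threshold is arbitrary, and a low figure only says the link is worth checking, not that duplex is the cause.

```python
def link_utilisation(bulk_mb_s, link_mbit_s=100):
    """Fraction of the nominal line rate that a bulk transfer achieved."""
    # 100 Mbit/s is 12.5 MB/s (decimal megabytes) before protocol overhead.
    line_rate_mb_s = link_mbit_s / 8
    return bulk_mb_s / line_rate_mb_s


def worth_checking_duplex(bulk_mb_s, link_mbit_s=100, threshold=0.25):
    """True when a bulk transfer falls below an arbitrary share of line rate."""
    return link_utilisation(bulk_mb_s, link_mbit_s) < threshold


# Hypothetical measurements from two large file transfers.
print(worth_checking_duplex(1.0))  # well under a quarter of 12.5 MB/s
print(worth_checking_duplex(9.0))  # most of the line rate
```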

Other than that, you need to keep the DLTs streaming.
We have a StorageTek TimberWolf with DLT7000s (89's, I think). We have
8 "Target Sessions" configured on the media, and 4 or 8 "Parallelism"
configured on each client (only visible with "details" selected).

FDDI is not worth the time and effort. We just use a dedicated 100meg
switched ethernet network for 24 of our 30 servers. The other 6 run over
our production network (desktop Unix boxes).

Our daily incrementals run about 4 hours (3am-7:30am) and our once a week
total dump takes about 10-12 hours. Each system has 2-10 gig being backed
up, but only the totals see that full load. I don't know what the real
total numbers are for byte loads - I haven't counted them in a while.

From Bennet Fauber:

There was a patch for the SCSI CAM layered product that addressed degraded
service with tape units. Is it possible that there are timeouts and/or
errors being written to your system log that Legato is "recovering" from
and not telling you about? Here is the URL for the description of the
problem and the patch.

http://ftp.service.digital.com/patches/public/unix/v4.0d/kzpba_v40d_bl13.README

From Mike Iglesias:

If you're not using Legato 5.5.2, get it. 5.5.1 runs slower on some
Alphas because it's using newer instructions that have to be emulated
on the older systems, slowing down saves. We saw that on some of
our systems.

From John Ziomek:

I would suggest that you definitely upgrade those tape drives from tz88's to
tz89's. I believe that takes you from 2.2mb/sec to 5.0mb/sec. You also will
want to make sure that you are running all the latest and greatest patches
for 5.5 of legato; there have been many patches since the original release.
Also, check your parallelism in legato. You should be optimized for backup
with your parallelism set to at least 16, I believe. You can also put in a
gigabit switch and cards in your servers. We just did this for about $12k for
4 servers, including the switch. Also, double check which controllers are in
your backup servers. If you're using kzpsa's you should switch to kzpba's, and
use only 2 drives per controller....
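
To see where faster drives or a gigabit link would actually pay off, the drive rates quoted above can be set against the 100mb network. This is a back-of-envelope sketch: the 2.2 and 5.0 figures are the ones quoted in this reply, 12.5MB/s is the raw Fast Ethernet line rate before overhead, and the `hours_for` helper is invented for illustration.

```python
TZ88_MB_S = 2.2                # per-drive rate quoted in the reply above
TZ89_MB_S = 5.0                # per-drive rate quoted in the reply above
FAST_ETHERNET_MB_S = 100 / 8   # 12.5 MB/s raw line rate, before overhead


def hours_for(size_gb, drives, drive_mb_s, network_mb_s):
    """Hours to move size_gb when the slower of tape and network sets the pace."""
    rate = min(drives * drive_mb_s, network_mb_s)
    return size_gb * 1024 / rate / 3600


for name, rate in (("TZ88", TZ88_MB_S), ("TZ89", TZ89_MB_S)):
    for drives in (1, 4):
        h = hours_for(65, drives, rate, FAST_ETHERNET_MB_S)
        print(f"{drives} x {name}: {h:.1f} h")
```

Under these assumptions a single drive of either type stays below the 100mb line rate, and only four TZ89s running together would exceed it. That is the case where a gigabit link starts to matter, and it requires enough parallel streams to keep all four drives busy.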

There are more responses, but that covers just about all the contents of the
responses. Thanks again to everyone and to this list!

Original postings
*************************************************
I am experiencing some severe lag time with my network backups using Legato
5.5 with Tru64 UNIX 4.0d through 4.0f. I am using a 48 slot TL812 with 4
tape drives, all TZ88s. Would upgrading to faster tape drives help? Which
tape drives are recommended? Management won't buy off on implementing a
dedicated fddi ring for backup purposes, so I have to think of something
else quick. Some backups run up to 15 hours (for about 65GB of data). If
anyone could suggest ways of speeding up these backups, I would be most
appreciative.

I'm being asked for more detail and I apologize for not providing more in my
original question. My network is a 100mb/full-duplex. All of the servers I
am backing up, including my Legato server are directly connected to a
100mb/full-duplex Cisco switch. My network statistics show very little or
no collisions or errors. I've scheduled my backups so as not to have more
than 2 backups running at a time. The Legato server configs are mostly
defaults.
***************************************************

Kim Roy
UNIX Consultant, ITD
Received on Tue Mar 28 2000 - 17:54:29 NZST
