unbearable DNS time outs recently.

From: chas <panda_at_peace.com.my>
Date: Thu, 30 Apr 1998 13:51:29 +0800 (SGT)

I've had named running on an Alpha 500 for 2 years without
problem. For the past 2 weeks, it has got slower and slower
and today it is unbearable.

A typical nslookup session shows that our DNS server
(duke.neuronet.com.my - 202.184.153.3) is timing out whereas
our upstream provider's DNS server (relay1.jaring.my) can
resolve the very same domain names :

# nslookup
Default Server: duke.neuronet.com.my <--- my DNS server
Address: 202.184.153.3

> infoseek.com
Server: duke.neuronet.com.my
Address: 202.184.153.3

DNS request timed out.
    timeout was 2 seconds.
DNS request timed out.
    timeout was 4 seconds.
DNS request timed out.
    timeout was 8 seconds.
*** Request to duke.neuronet.com.my timed-out

> server relay1.jaring.my <---- change to upstream DNS
Default Server: relay1.jaring.my
Address: 192.228.128.11

> infoseek.com
Server: relay1.jaring.my
Address: 192.228.128.11

Non-authoritative answer:
Name: infoseek.com
Address: 204.162.96.2

The same happens with most US sites. Even when resolving local
(xxx.com.my) domains, my own DNS server sometimes times out
once before it answers.
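
To put numbers on the comparison above rather than eyeballing nslookup, here is a rough Python sketch that sends a single A-record query over UDP straight to a chosen server and times the reply. The function names (build_query, time_query), the fixed query ID and the 2-second default timeout are assumptions made for illustration; they are not part of named or nslookup.

```python
import socket
import struct
import time


def build_query(name, qid):
    """Build a minimal DNS query packet asking for the A record of `name`."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, 0 other records.
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 (A), QCLASS 1 (IN).
    return header + qname + struct.pack(">HH", 1, 1)


def time_query(server, name, timeout=2.0, qid=0x1234):
    """Return seconds until `server` sends any reply, or None on timeout."""
    packet = build_query(name, qid)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        sock.sendto(packet, (server, 53))
        try:
            while True:
                reply, _addr = sock.recvfrom(512)
                # Ignore anything that does not echo our query ID.
                if len(reply) >= 2 and reply[:2] == packet[:2]:
                    return time.monotonic() - start
        except socket.timeout:
            return None
```

Calling time_query("202.184.153.3", "infoseek.com") and then time_query("192.228.128.11", "infoseek.com") a few times each would show whether the gap between the two servers is consistent. Note that it only measures whether a reply arrives, not whether the reply contains an answer.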

----------------------------------------------

Notes :

1. Our DNS only serves primary for about 40 domains
   and secondary for 1 domain.
2. In the past 2 weeks, we have added .com, .org and
   .ec domains for the first time. Prior to that, all
   domains were local to Malaysia (.com.my)
3. 'top' shows that named's memory usage has grown from
   about 6 MB to 12 MB over the past 2 weeks. Only 5 new
   domains were added during that period.
4. The international line from Malaysia to the US has
   apparently been deteriorating over the past 3 weeks, but
   this would not explain why my upstream provider's DNS
   can resolve domain names that I time out on. They're
   just one hop away from us.
5. I only have 50 users whose workstations use our name
   server.
6. External traffic to our websites has not substantially
   increased in the past 2 weeks.
7. The DEC 500 has 96 MB of RAM and only runs 2 webservers,
   DNS and mail. Top claims that 30 MB are free. It has
   been running DU3.2c and named for 2 years without a problem.
8. Some US sites (eg. oracle.com) resolve at a reasonable
   speed ... so my initial thought was that it was just a
   bottleneck on one segment of the internet. However, that
   wouldn't explain why my DNS times out when my upstream
   can resolve, say, dejanews.com
  
I cannot see why the addition of non-local domains would
cause this, so I presume points 2 & 3 are a coincidence.

I do hope that our name server is answering queries for the
domains that we host (eg. tanjungrhu.com.my, mdc.com.my).

I intend to :
- set up a caching DNS server to query our upstream
   provider and tell my internal users to use it.
- move the webservers to a different machine. DNS and mail
   only on the DEC500.
But I don't think that this is going to cure things.

Anyone have any ideas ?

Is there any way to monitor just how many queries named is
answering and to track down the bottleneck ?

A typical traceroute to infoseek.com is :

# traceroute infoseek.com

Tracing route to infoseek.com [204.162.96.2]
over a maximum of 30 hops:

  1 * * * Request timed out.
  2 111 ms 110 ms 110 ms 161.142.32.25
  3 100 ms 100 ms 110 ms e0.ttk7.jaring.my [161.142.219.8]
  4 130 ms 110 ms 110 ms e0.ttk15.jaring.my [161.142.219.16]
  5 170 ms 170 ms 60 ms h0-0.bkj15.jaring.my [161.142.0.81]
  6 110 ms 110 ms 121 ms fe0-0.bkj16.jaring.my [161.142.78.16]
  7 351 ms 290 ms 561 ms 205.174.74.205
  8 381 ms 390 ms 381 ms Hssi4-1-0.GW2.SFO1.ALTER.NET [157.130.193.45]
  9 * 400 ms * 113.ATM10-0-0.XR2.SCL1.ALTER.NET [146.188.145.62]
 10 * 391 ms * 194.ATM2-0-0.GW2.PAO1.ALTER.NET [146.188.144.65]
 11 * * * Request timed out.
 12 * * * Request timed out.
 13 391 ms 400 ms 381 ms corp-bbn.infoseek.com [204.162.96.2]

Trace complete.

Yes, the first hop to my router does look dodgy... I
don't know why it does that either, but it seems to
pass traffic fine so I've never worried about it.
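
To see where along a trace like the one above the latency climbs, a small Python sketch can parse the hop lines and report the largest rise in best round-trip time between consecutive responding hops. The names parse_hops and biggest_jump are hypothetical, and the parser assumes the three-column "N ms" / "*" layout shown in this post.

```python
import re

# Hop number, three probe columns ("N ms" or "*"), then the host text.
HOP_RE = re.compile(r"^\s*(\d+)\s+((?:(?:\*|\d+ ms)\s+){3})(.*)$")


def parse_hops(text):
    """Return [(hop_number, best_rtt_ms or None, host)] for each hop line."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if not m:
            continue  # skip headers, blank lines and wrapped address lines
        rtts = [int(t) for t in re.findall(r"(\d+) ms", m.group(2))]
        best = min(rtts) if rtts else None  # None when all probes timed out
        hops.append((int(m.group(1)), best, m.group(3).strip()))
    return hops


def biggest_jump(hops):
    """Return (hop_number, increase_ms) for the largest rise in best RTT
    between consecutive responding hops, or None if fewer than two respond."""
    best = None
    prev = None
    for num, rtt, _host in hops:
        if rtt is None:
            continue
        if prev is not None:
            delta = rtt - prev
            if best is None or delta > best[1]:
                best = (num, delta)
        prev = rtt
    return best
```

Run over the trace above, the biggest rise by this measure is at hop 7 (best time 110 ms at hop 6, 290 ms at hop 7), which is where the path leaves the jaring.my hosts. Using the minimum of the three probes is a design choice to damp one-off spikes; it says nothing about packet loss.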

Sorry for the long post... but I can't pinpoint the problem and the
more evidence the better. Thank you for any ideas,

chas
Received on Thu Apr 30 1998 - 08:13:47 NZST
