System clock slows down from Tony McConnell on 1999-11-01 (tru64-unix-managers)

From: Tony McConnell <anthony_at_creaker.demon.co.uk>
Date: Sun, 31 Oct 1999 21:24:59 +0000

Please send any replies, with a cc to tonym_at_datel-technology.co.uk
Thanks in advance.

We have a problem with time on the DEC Alpha machines that we are using. Our
application runs in a hot standby configuration on two identical machines.
The machines run Digital Unix 4.0d on a Single Board Computer 21604A 233MHz
with 256MB. The same kernel build is used on both machines; in fact, exactly
the same /vmunix is used.

We use NTP (we are currently using ntp4.98b) synchronised to a number of
radio clocks, to keep the machines normally within 10ms of GMT.

The system behaviour is as follows:

1. Run our harness+comms applications standalone on MACHINE_1.
2. Cause a system load resulting in, on average, 30% processor
    utilisation for harness, and 10% for comms.
3. Watch the behaviour of ntpd over 24 hours.

The system and ntpd behave perfectly well.

Switch so that the harness+comms applications run standalone on
MACHINE_2, with the same system load. Again, there are no problems
synchronising the machines.

Ok, now:

1. Run the master harness+comms applications on MACHINE_1.
2. Run the standby harness+comms applications on MACHINE_2.
3. Watch the behaviour of ntpd on the master + standby machines
    over 12 hours.

The clock on MACHINE_2 _slows_ down by ~10 seconds per hour.

Halt + power cycle machines and change mastership, so that the
master runs on MACHINE_2, and the standby runs on MACHINE_1.
The weird drifting behaviour now occurs on MACHINE_1!

So, OK, it's easy to blame NTP, so we'll disable it on MACHINE_2,
disable the call to 'ntpdate' on boot up, and set the time from the
radio clock by hand:

1. Stop using NTP completely on MACHINE_2.
2. Reboot both machines. MACHINE_1 syncs to the time and
    remains in sync.
3. Set the time of MACHINE_2 by hand, using 'date' and the
    radio clock.
4. Wait 1 hour. The times are still the same.
5. Start master app with same system load. Wait 1 hour.
    Times are still the same.
6. Start standby application on MACHINE_2. MACHINE_2 is immediately
    ~2 seconds behind.
7. Wait 1 hour. MACHINE_2 is now ~10 seconds behind.
8. Wait 6 hours, MACHINE_2 is now ~60 seconds behind.

Again, this behaviour follows the standby. Note that the system load just
hastens the process.
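
To put the figures above on the scale NTP uses, here is a rough sketch that
converts the observed lag into parts per million. It assumes the approximate
readings from steps 7 and 8 (10 seconds after 1 hour, 60 seconds after 6
hours); the function name drift_ppm is just an illustrative helper.

```python
# Convert an observed clock lag into a drift rate in parts per million (ppm).
SECONDS_PER_HOUR = 3600.0


def drift_ppm(lag_seconds, elapsed_hours):
    """Return drift as (seconds lost per elapsed second) * 1e6."""
    return lag_seconds / (elapsed_hours * SECONDS_PER_HOUR) * 1e6


# Approximate readings from steps 7 and 8: (seconds behind, hours elapsed).
observations = [(10.0, 1.0), (60.0, 6.0)]

for lag, hours in observations:
    print(f"{lag:.0f} s over {hours:.0f} h = {drift_ppm(lag, hours):.0f} ppm")
```

Both readings work out to the same rate, a little under 2800 ppm, which
suggests a steady loss rather than a one-off jump.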

Looking at process statistics, the only difference between master + standby
machines is that the standby comms uses 10MB more memory, and the standby
harness uses 13MB more memory.

The CPU usage on the master is always higher than the CPU usage on the
standby.

vmstat statistics show everything more or less equal, but slightly more page
faults on the standby.

[ Let me know if you want to see these, and I'll send them to you ]

Note: we cannot step the time periodically, due to the nature of the hot
standby mechanism. And I don't think we can slew the time by 10 seconds per
hour.
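
As a rough check on the slewing question, the sketch below compares the
observed loss with what ntpd could correct by slewing alone. The 500 ppm
limit is an assumption (it is the usual maximum slew rate for ntpd; a given
build may differ), and the names are illustrative only.

```python
# Compare the observed clock loss with the most that slewing could correct.
MAX_SLEW_PPM = 500.0             # ASSUMPTION: usual ntpd maximum slew rate
OBSERVED_LOSS_S_PER_HOUR = 10.0  # approximate figure reported in this post


def max_slew_seconds_per_hour(slew_ppm):
    """Seconds of correction available per hour at the given slew rate."""
    return slew_ppm * 1e-6 * 3600.0


correctable = max_slew_seconds_per_hour(MAX_SLEW_PPM)
shortfall = OBSERVED_LOSS_S_PER_HOUR - correctable

print(f"slew can correct about {correctable:.1f} s per hour")
print(f"uncorrected shortfall is about {shortfall:.1f} s per hour")
```

Under that assumption slewing recovers about 1.8 seconds per hour, well short
of the ~10 seconds per hour being lost.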

Anyone got any ideas? It looks like it's related to memory usage, but does
anybody have any experience of this kind of situation? Something not servicing
clock interrupts quickly enough?
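
If missed clock interrupts are the cause, the loss can be expressed in ticks.
The sketch below assumes the kernel advances the clock by one tick per
interrupt and that the interrupt rate is 1024 per second; that HZ value is an
assumption and should be checked against the actual kernel.

```python
# Estimate how many clock ticks would have to be missed to explain the lag.
HZ = 1024                # ASSUMPTION: clock interrupts per second on this kernel
LOSS_S_PER_HOUR = 10.0   # approximate figure reported in this post


def lost_ticks(loss_seconds, hz):
    """Ticks that must go uncounted for the clock to fall loss_seconds behind."""
    return loss_seconds * hz


per_hour = lost_ticks(LOSS_S_PER_HOUR, HZ)
per_second = per_hour / 3600.0
fraction = per_second / HZ

print(f"{per_hour:.0f} ticks lost per hour")
print(f"{per_second:.2f} ticks lost per second ({fraction:.2%} of all ticks)")
```

On those assumptions only about three ticks per second would need to be
dropped, and the immediate ~2 second lag in step 6 would correspond to roughly
2048 ticks lost in one go.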

Thanks for any help you can give.
Received on Sun Oct 31 1999 - 21:24:00 NZDT
