SUMMARY: DU4.0D malloc problem from Timothy W. Berger on 1998-12-17 (tru64-unix-managers)

From: Timothy W. Berger <berger_at_seas.ucla.edu>
Date: Wed, 16 Dec 1998 10:21:17 -0800 (PST)

Hello everyone,

I would like to thank the following people for their prompt, helpful
replies:

Bryan at Compaq
Serguei Patchkovskii <patchkov_at_ucalgary.ca>
Kjell Andresen <kjell.andresen_at_usit.uio.no>
Marie-Claude Vialatte <MC.Vialatte_at_cust.univ-bpclermont.fr>

I will try all that they suggest and let the newsgroup know how everything
turned out next week by posting another summary. I have too much to do
to be able to try everything suggested until the end of this week. I
hope that this is satisfactory to everyone. If not, please let me know
the proper operating procedure as I am new to this group. By the end of
next Monday, I will post again.

Below, I have included the original request followed by a summary of
replies.

Thanks to all again,

Tim Berger
berger_at_seas.ucla.edu
(310)825-0875

Original Post:
---------------------------------------------------------------------------
Hi everybody,

I'm currently encountering a problem with running big jobs on an ALPHA 600
5/333 workstation. The jobs require 400Mb-800Mb of RAM and we only have
512Mb available so we use the swap space to help out. However, we get
very slow console response (expected) but then we get console lockup and
the following error in the Console log:

Dec 10 18:10:05 turb12 vmunix: malloc failed: bucket size = 524288, #of
failures = 1, ra 0xfffffc00003a2f8c
Dec 10 20:08:45 turb12 vmunix: malloc failed: bucket size = 131072, #of
failures = 1, ra 0xfffffc00003a2f8c
Dec 10 20:39:12 turb12 vmunix: malloc failed: bucket size = 65536, #of
failures = 1, ra 0xfffffc00003a2f8c
Dec 10 20:55:26 turb12 vmunix: malloc failed: bucket size = 32768, #of
failures = 1, ra 0xfffffc00003a2f8c
Dec 10 21:01:35 turb12 vmunix: malloc failed: bucket size = 12288, #of
failures = 1, ra 0xfffffc00003a2f8c

At the same time, we get RPC connection timeout errors on other workstations
connected to this ALPHA via NFS when other users want to access their
files.

Anyone have any insight as to whether the errors listed above are serious
and/or what they exactly mean?

Thanks in advance,

Tim
berger_at_seas.ucla.edu
(310)825-0875

---------------------------------------------------------------------------
---------------------------------------------------------------------------
Here's a summary of replies:

---------------------------------------------------------------------------
From: Bryan <noreturn_at_all.to_me>

Tim

Saw your message to the Alpha manager list. If you have a software
contract with Compaq, you should call them; there are some patches for
this sort of thing. If you have no contract, they will service you on a
per-call billing basis.

Bryan
Compaq
---------------------------------------------------------------------------

From: "Serguei Patchkovskii" <patchkov_at_ucalgary.ca>

Hi,

> Anyone have any insight as to whether the errors listed above are serious
> and/or what they exactly mean?

Unless you have already received a better answer to your question (I haven't
seen a summary, but then I might have missed it), here are my 2 cents' worth:

I *think* (I would like to know for sure, but the Digital kernel documentation
is a bit sketchy, and I do not have access to the kernel sources) this error
means that your kernel is unable to allocate memory needed by one of its
subsystems. Depending on how gracefully that subsystem handles the failure,
this may (or may not) lead to a panic or lockup. I have certainly seen repeated
malloc failures lock up one of our machines (which is a heavily used NFS server).
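
To see which bucket sizes fail and whether the failures share a caller, the
console log can be tallied with a short script. The following is only a
minimal Python sketch: the function name `summarize_malloc_failures` is made
up for this example, the regular expression is modelled solely on the five
messages in the original post, and reading `ra` as a return address is an
assumption rather than something the kernel documentation confirms here.

```python
import re
from collections import Counter

# Pattern modelled on the messages in the original post. \s+ lets it match
# entries that a mail client wrapped after "#of" (unquoted text only).
MALLOC_RE = re.compile(
    r"malloc failed: bucket size = (\d+),\s*#of\s+failures = \d+,\s*ra\s+(0x[0-9a-fA-F]+)"
)

def summarize_malloc_failures(log_text):
    """Return (messages per bucket size, set of distinct 'ra' values)."""
    per_bucket = Counter()
    addresses = set()
    for match in MALLOC_RE.finditer(log_text):
        per_bucket[int(match.group(1))] += 1
        addresses.add(match.group(2).lower())
    return per_bucket, addresses

if __name__ == "__main__":
    # Two of the entries from the original post, wrapped as they were mailed.
    sample = (
        "Dec 10 18:10:05 turb12 vmunix: malloc failed: bucket size = 524288, #of\n"
        "failures = 1, ra 0xfffffc00003a2f8c\n"
        "Dec 10 20:08:45 turb12 vmunix: malloc failed: bucket size = 131072, #of\n"
        "failures = 1, ra 0xfffffc00003a2f8c\n"
    )
    buckets, addresses = summarize_malloc_failures(sample)
    for size in sorted(buckets, reverse=True):
        print(f"bucket {size:>7} bytes: {buckets[size]} message(s)")
    print("distinct ra values:", ", ".join(sorted(addresses)))
```

If every message carries the same `ra` value, as in the log above, that would
suggest a single code path is requesting the memory, which may be worth
mentioning when reporting the problem to support.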

Unless there is a memory leak in the kernel (in which case you can only
delay, but not prevent, this error), increasing the vm-heappercent parameter
(which determines the fraction of physical memory available for the kernel
heap, with a default of 7 percent) should avoid the problem. To adjust
it, add the following lines:

vm:
   vm-heappercent=nn

to your /etc/sysconfigtab, and reboot. I would try increasing this value
in increments of 1 percent and see what happens.
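
As a rough illustration of what those 1 percent steps amount to on the 512 MB
machine from the original post, here is a small Python sketch. It assumes, as
described above, that vm-heappercent is a straight percentage of physical
memory; the helper name `kernel_heap_bytes` is invented for this example, and
the real kernel may round or cap the value differently.

```python
MIB = 1024 * 1024  # assumption: "Mb" in the post means 2**20 bytes

def kernel_heap_bytes(physical_mib, heap_percent):
    """Physical memory (in bytes) covered by a given vm-heappercent setting."""
    return physical_mib * MIB * heap_percent // 100

if __name__ == "__main__":
    physical_mib = 512               # RAM reported in the original post
    largest_failed_bucket = 524288   # largest bucket size in the posted log
    for percent in range(7, 11):     # the default of 7, then 1 percent steps
        heap = kernel_heap_bytes(physical_mib, percent)
        print(f"vm-heappercent={percent}: {heap / MIB:.2f} MiB "
              f"({heap // largest_failed_bucket} buckets of {largest_failed_bucket} bytes)")
```

Under these assumptions each step adds about 5 MiB on a 512 MB system, so a few
steps are a modest change relative to total memory.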

Known sources of kernel memory leaks are 'monitor' and 'syd', both available
on the Freeware CD. If you run either of these continuously, your kernel
*will* eventually run out of memory and lock up or panic. (It takes about
24 hours of continuous execution of 'monitor' to lock up a 64Mb system.)
Quitting and restarting these programs seems to free the tied-up memory.

The above is 50% based on going through Digital's kernel tuning manual,
and 50% pure speculation on my part, so please do not take it too seriously.
If you get any answers from someone who really knows the reason for this
message, rather than guessing at it as I do, I'd like to hear about it...

Regards,

/Serge.P

---------------------------------------------------------------------------
From: Kjell Andresen <kjell.andresen_at_usit.uio.no>

> I'm currently encountering a problem with running big jobs on an ALPHA 600
> 5/333 workstation. The jobs require 400Mb-800Mb of RAM and we only have
> 512Mb available so we use the swap space to help out. However, we get
> very slow console response (expected) but then we get console lockup and
> the following error in the Console log:
>
> Dec 10 18:10:05 turb12 vmunix: malloc failed: bucket size = 524288, #of
> failures = 1, ra 0xfffffc00003a2f8c
> Dec 10 20:08:45 turb12 vmunix: malloc failed: bucket size = 131072, #of
> failures = 1, ra 0xfffffc00003a2f8c
> Dec 10 20:39:12 turb12 vmunix: malloc failed: bucket size = 65536, #of
> failures = 1, ra 0xfffffc00003a2f8c
> Dec 10 20:55:26 turb12 vmunix: malloc failed: bucket size = 32768, #of
> failures = 1, ra 0xfffffc00003a2f8c
> Dec 10 21:01:35 turb12 vmunix: malloc failed: bucket size = 12288, #of
> failures = 1, ra 0xfffffc00003a2f8c
>
> At the same time, we get RPC connection timeout errors on other workstations
> connected to this ALPHA via NFS when other users want to access their
> files.
>
> Anyone have any insight as to whether the errors listed above are serious
> and/or what they exactly mean?

There should have been a patchset #3 released some weeks ago to fix,
among other things, errors in nfsd (as far as I've understood).

Take a look at
http://www.uio.no/~kjell/du/malloc.txt and
http://www.uio.no/~kjell/du/malloc-sum.txt

I'm waiting for the NFS patch(es) from my local support agent and have
been doing so since Nov. 20th.

Good luck!

Kjell

PS: Additions to the above mentioned URLs are welcome

---------------------------------------------------------------------------
From: MC.Vialatte_at_cust.univ-bpclermont.fr

On 13-Dec-98 Timothy W. Berger wrote:

>
> Yes, we have 512Mb RAM with an additional 1.5 Gb of swap space and when we
> check how much swap space is available we have more than enough for
> everything. I'm suspecting that with DU4.0D, we may need a patch to help
> out with this problem.

        I have no other idea.

> The bad thing is we don't have software support
> and we don't know which patch to use.

        You can get patches from the public area of the DEC service FTP
        site at:
                ftp://ftp.service.digital.com/pub/osf
        There is a directory for each DU version, and there you will
        find
        - a README which gives some information on the patches,
        - a PS file which explains how to apply the patches,
        - and a "jumbo" patch which contains all patches, along with
          detailed information about each patch.

___________________________________________________________________
Marie-Claude Vialatte | Telephone : +33 4 73 40 77 08
  CUST | Fax : +33 4 73 40 75 10
  BP 206 | Email : mc.vialatte_at_cust.univ-bpclermont.fr
  63174 AUBIERE Cedex | WWW : http://cust.univ-bpclermont.fr

---------------------------------------------------------------------------------------------------
Received on Wed Dec 16 1998 - 18:22:46 NZDT
