From: Martin.Dusek@pregis.cz
Sent: Friday, September 17, 1999 12:09 AM
To: Hibberd, Jeremy
Subject: RE: System freeze - memory bottleneck

Hi Jeremy,

We have a similar configuration, i.e. Tru64 V4.0D, TruCluster v1.5 with two AS4100 5/600 with 4GB memory, SAP R/3 3.0F, Oracle 7.3.4 and later 8.0.5. We had serious problems with lack of memory; the system very often "hung". There were two main problems:

First, everything nearly hung during generation of huge files, such as core dumps or adding a datafile to a tablespace. Nevertheless, the system doesn't fully freeze in these situations, so we have not solved this.

Second, the default UNIX kernel memory parameters are unusable with large memory usage. Main reasons:

1. The system begins to page when the number of free pages drops below vm-page-free-target (default 128), but this paging is very "slow" - not aggressive - and it cannot keep pace with memory demands. So free pages immediately drop below vm-page-free-min.

2. Below vm-page-free-min (default 20) the paging becomes very aggressive, but it is too late. Free memory often reaches vm-page-free-reserved.

3. Below vm-page-free-reserved (default 10) only privileged tasks can get memory until some is freed.

Another problem is with swapping - you should never allow SAP workprocesses to be swapped out, because it terribly increases response time (in transaction st03 you can see wait times in hundreds of milliseconds). You should set the memory parameters so that free memory is reclaimed by paging and not by swapping. Swapping is acceptable only for tasks that are launched in the morning and then sleep all the time. This is not the case for Oracle processes or SAP workprocesses:

    dw.sapPP0_DVEBMGS02 pf=/usr/sap/PP0/SY...
    oraclePP0 (DESCRIPTI...

So now the other reasons for memory problems:

4. Further, when free pages drop below vm-page-free-swap (default 74), the "soft" swap begins - the system swaps out all tasks that have been idle for 30 seconds or more. As mentioned above, this is not the worst problem. But!

5. When free pages drop below vm-page-free-optimal (default 74, the same as vm-page-free-swap - don't ask me why, it doesn't make any sense) the hard swapping begins - so your SAP workprocesses are doing nothing but moving back and forth between memory and the swap disks. Typically you can see hundreds or thousands of page outs in vmstat.

Solution? Increase vm-page-free-target so that the VM system has enough time to free inactive pages. Increase vm-page-free-min so that the system has enough time to free memory by aggressive paging and never (NEVER!!) goes below vm-page-free-reserved. Increase vm-page-free-swap, but set it much lower than vm-page-free-target so that the system first pages, then swaps. Set vm-page-free-swap higher than vm-page-free-optimal so that the hard swap "never" begins but the soft swap is possible (long-idle tasks can be swapped out). It is up to you whether you allow idle task swapping or not. With my parameters the swapping never (or nearly never) begins because vm-page-free-min=192 and vm-page-free-swap=128. You can change this.

How to do it? Create the file /etc/sapr3.stanza.vm with contents:

    vm:
        vm-page-prewrite-target=384
        vm-page-free-target=512
        vm-page-free-min=192
        vm-page-free-swap=128
        vm-page-free-optimal=74
        vm-page-free-hardswap=2048

Then run:

    cd /etc
    sysconfig -m -f ./sapr3.stanza.vm

Then I relinked the kernel with doconfig, but I'm not sure whether that is necessary. Perhaps rebooting the system would be enough.

Don't be afraid to increase all parameters much more. Mr. Kejzlar from Pilsen University has set the limits in the thousands and it works.
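
The ordering rules in this message can be checked mechanically before a stanza is merged. The following is a minimal Python sketch, assuming only the relationships stated here (reserved < min < target, and optimal < swap < target). The function names check_vm_thresholds and render_stanza are hypothetical, the reserved value of 10 is the default quoted in this message, and nothing in the sketch calls sysconfig or touches the running system.

```python
def check_vm_thresholds(params, reserved=10):
    """Return a list of violated ordering rules (empty if all hold).

    The rules are the ones described in the message, not an
    official Tru64 specification.
    """
    target = params["vm-page-free-target"]
    free_min = params["vm-page-free-min"]
    swap = params["vm-page-free-swap"]
    optimal = params["vm-page-free-optimal"]

    problems = []
    if not reserved < free_min:
        problems.append("vm-page-free-min must be above vm-page-free-reserved")
    if not free_min < target:
        problems.append("vm-page-free-min must be below vm-page-free-target")
    if not swap < target:
        problems.append("vm-page-free-swap must be below vm-page-free-target")
    if not optimal < swap:
        problems.append("vm-page-free-swap must be above vm-page-free-optimal")
    return problems


def render_stanza(subsystem, params):
    """Render a stanza in the same layout as the file shown in the message."""
    lines = [subsystem + ":"]
    lines += ["    {}={}".format(name, value) for name, value in params.items()]
    return "\n".join(lines) + "\n"


# Values quoted in the message.
DEFAULTS = {
    "vm-page-free-target": 128,
    "vm-page-free-min": 20,
    "vm-page-free-swap": 74,
    "vm-page-free-optimal": 74,
}
TUNED = {
    "vm-page-free-target": 512,
    "vm-page-free-min": 192,
    "vm-page-free-swap": 128,
    "vm-page-free-optimal": 74,
}

if __name__ == "__main__":
    for label, params in (("defaults", DEFAULTS), ("tuned", TUNED)):
        print(label, check_vm_thresholds(params) or "ok")
    print(render_stanza("vm", TUNED), end="")
```

With the defaults quoted here, the only rule that fails is swap versus optimal, since both are 74; the tuned values pass all four checks.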

You can find a good manual at:
http://www.unix.digital.com/faqs/publications/base_doc/DOCUMENTATION/V40D_HTML/V40D_HTML/AQ0R3FTE/CHVMXXXX.HTM

Regards,
Martin
_______________________________________________
Martin Dusek                 martin.dusek@pregis.cz
IT BC department manager     tel: +420-428-359571
PREGIS a.s., Smetanova 45         +420-602-437235
46621 Jablonec nad Nisou     fax: +420-428-317844
Czech Republic

============================================================================

From: Partin, Kevin S [Kevin.Partin@SW.Boeing.com]
Sent: Friday, September 17, 1999 6:37 AM
To: Hibberd, Jeremy
Subject: RE: System freeze - memory bottleneck

I have seen this problem on a PW433au running 4.0D. What is happening is that all of the swap space is being used up. Once the machine runs out of swap space, it will 'freeze' until the process causing the problem finishes or crashes. The 'freeze' is the machine not being able to get any free memory pages.

You are probably running with strict swap allocation. In reality, you are probably not using any, or much, swap space, but the strict allocation policy will allocate all swap space even though nothing is swapping.

In my situation, it was a very badly behaved X-Window application, with a huge memory requirement, that caused my system to freeze.

Kevin
------------------------------------------
Kevin S. Partin
The Boeing Company
13100 Space Center Blvd.
Mail Code: JHOU-2230
Houston, TX 77059
Phone: 281-244-4088
Pager: 713-549-0713
Facsimile: 281-244-4984
Email: mailto:kevin.s.partin@boeing.com

============================================================================

One really needs to take a look at your sysconfigtab settings. If you are using defaults, I would guess that the large report uses up all the UBC. This forces out all the other processes. As Oracle is using shared memory segments, they all get swapped out and nothing can run. It's amazing you can't even get a login shell though.
It definitely sounds as if you are getting thrashing of the UBC. Take a look at your UBC settings and the section on memory tuning in the system and admin guide under http://www.unix.digital.com. You might want to monitor the UBC with dbx -k /vmunix /dev/mem from a high priority root shell. That should help get output while the problem is occurring.

Marco

Marco Luchini
Unix support
Acco-UK

============================================================================

From: Davis, Alan [Davis@Tessco.Com]
Sent: Friday, September 17, 1999 6:23 AM
To: Hibberd, Jeremy
Subject: RE: System freeze - memory bottleneck

Without more performance info it's hard to tell, but my first guess is that you should look at the vm parameter ubc-maxpercent and related parameters. This type of freeze can be seen when a single process takes the majority of physical RAM for I/O buffering. By reducing ubc-maxpercent you limit the amount of RAM stolen by I/O buffers and keep enough memory to run the other processes.

Alan Davis

============================================================================

You don't have enough physical memory for your workload when the large report is running. Once it starts running and hogging all available physical memory, all the pages for everything else in user space quickly get paged out, and in particular all the presently inactive stuff gets paged out. Once that happens it's extremely unlikely that anything will get paged in enough to actually run for long until the hog goes away.

You need to tune the system to limit the working set of this huge application or you need to get more physical memory. Buying more memory may turn out to be both easier in the short run and cheaper in the long run; attempting to tune a multi-user system to deliver good performance when it also has to work with very demanding applications is a thankless and hopeless task.

Tom

Dr. Thomas P. Blinn + UNIX Software Group + Compaq Computer Corporation
110 Spit Brook Road, MS ZKO3-2/W17   Nashua, New Hampshire 03062-2698
Technology Partnership Engineering   Phone: (603) 884-0646
Internet: tpb@zk3.dec.com            Digital's Easynet: alpha::tpb
ACM Member: tpblinn@acm.org          PC@Home: tom@felines.mv.net

======================================================================

From: alan@nabeth.cxo.dec.com
Sent: Friday, September 17, 1999 10:09 AM
To: Hibberd, Jeremy
Subject: Re: System freeze - memory bottleneck

The report generation is probably doing a lot of sequential I/O, which causes the file system to go into read-ahead mode. This quickly fills the buffer cache with data from the file being read, and flushes most everything else out. Dirty data will be written, which may cause a short burst of I/O. Since the buffer cache uses the same memory as the processes, the kernel will try to free up more memory for the cache, which causes the high paging load.

Part of the freeze is probably related to the high I/O load. If you're using a small number of medium speed devices for the page/swap space, it can take a long time to page a few GB of memory. If the Oracle data space is pageable, it is probably a good candidate for paging out. Paging it back in, when the report is through with the file(s), will be another long I/O load.

You might try running sys_check to see if it can offer any advice on how to tune the virtual memory subsystem. The V4.0D documentation should have a tuning guide that can help explain what the suggestions are doing.
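
To get a rough feel for why paging "a few GB" through a small number of devices takes so long, here is a back-of-the-envelope sketch in Python. The device count and per-device throughput are illustrative assumptions, not measurements from this system, and the function name paging_seconds is hypothetical.

```python
def paging_seconds(gigabytes, devices, mb_per_sec_per_device):
    """Optimistic estimate of the time needed to move `gigabytes` of
    memory to or from page/swap space spread evenly over `devices`
    disks, each sustaining `mb_per_sec_per_device` megabytes/second."""
    total_mb = gigabytes * 1024
    return total_mb / (devices * mb_per_sec_per_device)


if __name__ == "__main__":
    # Assumed figures: 2 GB to page, 2 swap devices, 5 MB/s each.
    seconds = paging_seconds(2, 2, 5)
    print("roughly {:.0f} seconds ({:.1f} minutes)".format(seconds, seconds / 60))
```

Under these assumed figures the transfer alone takes a few minutes in each direction, and real paging I/O is rarely purely sequential, so actual times would tend to be longer.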