SUMMARY: force direct access IO in clusters?

From: Udo Grabowski <udo.grabowski_at_imk.fzk.de>
Date: Fri, 10 Aug 2001 16:55:01 +0200

Hello !

I'm giving up. Thanks to LHERCAUD_at_bouyguestelecom.fr, Udo de Boer,
Arnold Sutter and Hannes Visagie for trying to help me.

I made several attempts to get direct access IO working or
to find a workaround:

1. read all manpages and docs I could get on this direct IO
    stuff and all related commands. No hint on how to make
    this work.

2. multiple read-only mounts of a disk (fileset) on a shared SCSI bus
    onto a CDSL on different hosts cannot be done (why ? how ?);
    LHERCAUD and I agree that this ought to work.

3. chfile -l on <file> enables synchronous updates, but still
    only via the CFS server.

4. last resort (suggested by Udo de Boer): writing a test C program
    with open(file, O_DIRECTIO | ..., ...) and writing some 100 Megs
    to the file changes the performance somehow, but I don't know
    what it actually does; it does not do what the name suggests,
    namely direct IO, and the traffic still goes to the CFS server !
    As I understand the open() manpage, I consider this to be a BUG.
    Also tried some combinations with O_NONBLOCK and O_SYNC, all
    without success. (A sketch of the test program follows below.)
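
For reference, here is roughly what that test looked like. This is only
a minimal sketch: the file name, chunk size and total amount written
are made up for illustration, and error handling is cut to the bone.

/* dio_test.c -- minimal sketch of the direct IO write test.
 * O_DIRECTIO is the flag documented in the open(2) manpage;
 * the path and sizes below are arbitrary examples.
 * Writes ~100 MB in 1 MB chunks; watch where the traffic goes
 * while it runs. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK   (1024 * 1024)   /* 1 MB per write()  */
#define NCHUNKS 100             /* ~100 MB in total  */

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "testfile.dat";
    char *buf;
    int fd, i;

    buf = malloc(CHUNK);
    if (buf == NULL) { perror("malloc"); return 1; }
    memset(buf, 0xAA, CHUNK);

    /* O_DIRECTIO requests unbuffered (direct) IO; O_SYNC and
     * O_NONBLOCK can be OR'ed in here to reproduce the other
     * combinations mentioned above. */
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECTIO, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (i = 0; i < NCHUNKS; i++)
        if (write(fd, buf, CHUNK) != CHUNK) { perror("write"); return 1; }

    close(fd);
    free(buf);
    return 0;
}

Run from a CFS client member, the writes done this way still showed up
as traffic to the CFS server, exactly as described in point 4 above.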

There's obviously no way to circumvent the CFS server.
Seems that direct IO on Tru64 is just vapourware. Udo suggested
switching to Linux' Global File System, and Selden E. Ball Jr. has
already pointed me to VMS, where everything I need seems to work as
expected. But it's too late for us to replace the whole operating system.
Hannes Visagie has a working environment where all this caching
and direct IO is managed and performed by Interbase (certainly
using raw disks and managing its own filesystem), but we cannot rework
all our applications yet and need standard Unix access to the files.

As a consequence, I will be much more careful and critical when
reading Compaq's and their salesmen's advertisements and announcements
of their products and alleged capabilities.

Thanks for all your help !
===========================
Original post:
> Here are the promised questions emerging from my previous
> summary today (see
>http://www.ornl.gov/its/archives/mailing-lists/tru64-unix-managers/2001/08/msg00133.html
> ).
>
> But first I'd like to award the prize for the all-time quickest reply
> to Tachyon-Man Udo de Boer, who answered this question even before
> I posted it (seems you are one of those very few people who really
> know how to deal with that quantum relativity stuff [:-)]) !!!
> Evidently this is the fastest expert group in the universe.
>
> After all that mess with the DRD layer (whose documentation and
> advertising, as Bard Tesaker and Dan Goetzman confirmed after my
> summary, are easily misunderstood; both have heard rumours that the
> desired functionality may be implemented in 5.2 or later), I'm now
> looking for an alternative approach. The situation is as follows:
>
> 8 cluster members (ES40/TC5.1 PK3) will access a couple of HSG80 units
> concurrently on a shared fibre channel SCSI 2 bus (so we need not
> use NFS, which is much slower). Half of an EMA 12000 consists of
> read-only data (2 TB) in about a dozen units, large files (tens of
> MBs); the rest is read-write, smaller chunks (100 kB to a few MB) in
> another couple of units. The accessors are self-written standard
> Fortran 90 and C++ programs and a few perl scripts, which will read
> the near-future ESA Envisat satellite data and produce atmospheric gas
> profiles from them. There are between 32 and 64 simultaneous accessors
> on all hosts; some of them will read a file concurrently, but writing
> is always exclusive, although to the same filesystem (domain, fileset).
> The directory contents must stay coherent across the members, as some
> processes need the results of other processes as input. So at least
> the read-write units must be managed by the CFS. But the read-only
> units obviously need no cache coherence. With the current server/client
> model, all processes that happen to read from the same unit will direct
> their requests to the server for that unit, blocking that machine and
> leaving all the expensive fibre channel equipment on the accessing
> hosts idle. Balancing the load by distributing the servers among the
> hosts is of course a must, but it does not eliminate the problem, since
> we cannot guarantee that a process accesses only a specific unit (so
> that the process could simply be relocated to the server for that
> unit). Instead, the access paths will go criss-cross through the memory
> channels and will now and then pile up on some unlucky servers (which
> they would not if we had direct access IO...).
>
> For the read-only disks, the first thing I tried (simple-minded as I
> am) was making a CDSL and mounting the same device read-only on that
> CDSL on every machine. This works for NFS mounts of home directories,
> even read-write. Obviously, for disks this fails at the second mount
> with a 'device busy' (namely, the disk). How can one mount a disk
> read-only on several machines without having the requests go to the
> CFS server ?
>
> The read-write stuff is a bit more complicated. I noticed a behaviour
> of CFS which could essentially kill our project. When writing larger
> amounts, essentially nothing is written through the fibre channel
> ports at first; the data hangs around somewhere in the CFS cache. The
> actual writing is done when a) someone accesses the file or directory
> (or fileset?), or b) the update daemon's caching interval has expired
> (usually 30 seconds). Typical speeds are 16 MB/sec. The accessing
> process has often already returned or even exited by the time the
> writing starts. While the synchronization is in progress, all other
> accessing processes hang waiting for the contents to be updated. Now,
> if many processes write and access, this could end up in total
> blocking, with the CFS server synchronizing all the time (which is
> even slower since everything must go through that server...). My tests
> with only 6 processes showed a slowdown of up to a factor of 10 for a
> 300 MB file. Are there any countermeasures to avoid such an IO
> catastrophe ?
>
> Udo de Boer, in his answer from the near future, has pointed (?will
> point?) me to the 'chfile -l on <file>' command, which enables
> synchronous writing to a file, circumventing the cache. Sounds like
> this is essentially the same as direct IO, which should then use the
> direct access IO feature. I will try that, and in my summary you will
> then read that this unluckily fails to work with DA IO. Every access
> there will not only go through the CFS, it will also be cached there.
> This feature only works if the writing command is issued on the CFS
> server for that file. Seems it needs a direct channel to the server's
> DRD. The same behaviour holds for Fortran programs opening files with
> the 'buffered_io="NO"' qualifier (= -nobuffered_io option to f90).
>
> What I'm looking for is a way to mark a whole HSG80 unit or
> advfs domain as directly host-attached at the OS level. But so far
> I've only seen application-level solutions to that, which are not
> feasible for us.
>
> Does anyone have more reliable information on whether and when the
> CFS will get a distributed cache, which could then use the DRD direct
> access IO to provide the missing functionality (which is already
> implemented in VMS, as Selden E. Ball Jr. pointed out)?
>
-- 
Dr. Udo Grabowski                           email: udo.grabowski_at_imk.fzk.de
Institut f. Meteorologie und Klimaforschung II, Forschungszentrum Karlsruhe
Postfach 3640, D-76021 Karlsruhe, Germany           Tel: (+49) 7247 82-6026
http://www.fzk.de/imk/imk2/ame/grabowski/           Fax:         "    -6141