Hello again !
Here are the promised questions emerging from my previous summary today (see
http://www.ornl.gov/its/archives/mailing-lists/tru64-unix-managers/2001/08/msg00133.html ).
But first I'd like to award the prize for the all-time quickest reply
to Tachyon-Man Udo de Boer, who answered this question even before
I posted it (it seems you are one of those very few people who really
know how to deal with that quantum relativity stuff :-) !!!).
Evidently this is the fastest expert group in the universe.
After all that mess with the DRD layer (which Bard Tesaker and
Dan Goetzman confirmed after my summary to have documentation and
advertising that are easy to misunderstand; both heard rumours that
the desired functionality may be implemented in 5.2 or later), I'm
now seeking an alternative approach. The situation is as follows:
Eight cluster members (ES40/TC5.1 PK3) will access a couple of HSG80 units
concurrently on a shared Fibre Channel SCSI-2 bus (so we need not
use NFS, which is much slower). Half of an EMA 12000 consists of read-only
data (2 TB) in about a dozen units with large files (tens of MB); the rest
is read-write with smaller chunks (100 kB to a few MB) in another couple of
units. The accessors are self-written standard Fortran 90 and C++
programs and a few Perl scripts, which will read the data from the upcoming
ESA Envisat satellite and produce atmospheric gas profiles from them.
There are between 32 and 64 simultaneous accessors across all hosts;
some of them will read a file concurrently, but writing is always
exclusive, although into the same filesystem (domain, fileset). The
directory contents must stay coherent across the members, as some
processes need the results of other processes as input. So at least the
read-write units must be managed by the CFS. The read-only units,
however, obviously need no cache coherence. With the current
server/client model, all processes that happen to read from the same
unit will direct their requests to the server for that unit, bogging
down that machine and leaving all the expensive Fibre Channel equipment
idle on the accessing hosts. Balancing the load by distributing the
servers among the hosts is of course a must, but it does not eliminate
the problem, since we cannot guarantee that a process accesses only one
specific unit (otherwise the process could simply be relocated to the
server for that unit). Instead, the access paths will criss-cross the
Memory Channel and will now and then pile up on some unlucky server
(which they would not if we had direct-access I/O...).
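For reference, the way I distribute the servers at the moment is with
cfsmgr, roughly like this (the mount point and member name are made up,
and please correct me if I misremember the syntax):

   # show which member currently serves which fileset
   cfsmgr

   # relocate the CFS server of one fileset to another member
   cfsmgr -a server=member3 /data/rw1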
For the read-only disks, the first thing I tried (simple-minded as I am)
was making a CDSL and mounting the same device read-only on that CDSL
on every machine. This works for NFS mounts of home directories, even
read-write. Obviously, for disks this fails at the second mount with
a 'device busy' (the disk, that is). How can one disk be mounted
read-only on several machines without the requests going through the
CFS server?
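For clarity, what I tried was roughly the following (domain and path
names invented for the example):

   # once, on one member: create the CDSL to be used as mount point
   mkcdsl /data/ro1

   # then, on every member: mount the same fileset read-only on it
   mount -r rodata_dom1#ro1 /data/ro1

   # -> works on the first member, 'device busy' on all the others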
The read-write side is a bit more complicated. I noticed a behaviour
of the CFS which could essentially kill our project. When writing larger
amounts of data, essentially nothing goes out through the Fibre Channel
ports at first; it hangs around somewhere in the CFS cache. The actual
writing happens when a) someone accesses the file or directory (or
fileset?), or b) the update daemon's caching interval has expired
(usually 30 seconds). Typical speeds are 16 MB/s. The writing process
has usually returned, or often even exited, by the time the writing
starts. While the synchronization is in progress, all other accessing
processes hang waiting for the updated contents. Now, if many processes
write and read, this could end in total blocking, with the CFS server
synchronizing all the time (which is even slower, since everything
must go through that server...). My tests with only 6 processes
already showed a slowdown of up to a factor of 10 for a 300 MB file.
Are there any countermeasures to avoid such an I/O catastrophe?
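One workaround I'm considering for our own C++ writers (assuming Tru64
honours the usual POSIX semantics here, which I have not verified on the
cluster, and which may well hit the same server bottleneck as the chfile
trick below) is to push the data out ourselves with O_SYNC or fsync(),
so nothing can pile up in the CFS cache for 30 seconds. A minimal sketch:

   #include <fcntl.h>
   #include <unistd.h>
   #include <cstdio>

   // Write a buffer synchronously so the CFS cannot sit on the data.
   // Path and error handling are only illustrative.
   int write_profile(const char *path, const char *buf, size_t len)
   {
       // O_SYNC: each write() returns only after the data is on stable storage
       int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
       if (fd < 0) { perror("open"); return -1; }

       ssize_t n = write(fd, buf, len);
       if (n < 0) perror("write");

       // Alternative: keep the file buffered and flush only at
       // well-defined points with fsync(fd);

       close(fd);
       return (n == (ssize_t)len) ? 0 : -1;
   }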
Udo de Boer, in his answer from the near future, has pointed (?will point?)
me to the 'chfile -l on <file>' command, which enables synchronous
writing to a file, circumventing the cache. Sounds like this is
essentially the same as direct I/O, which should then use the
direct-access I/O feature. I will try that, and in my summary you will
then read that it unfortunately fails to work with DA I/O: every access
there not only goes through the CFS, it is also cached there. The
feature only works if the writing command is issued on the CFS server
for that file; it seems to need a direct channel to the server's DRD.
The same behaviour shows up for Fortran programs that open files with
the 'buffered_io="NO"' qualifier (= the -nobuffered_io option to f90).
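For completeness, the exact invocations in question (file and program
names are just examples):

   # per-file synchronous writes, as suggested by Udo
   chfile -l on /data/rw1/results.dat

   # Fortran side: unbuffered I/O for the whole program
   f90 -nobuffered_io -o retrieve retrieve.f90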
What I'm looking for is a way to mark a whole HSG80 unit or AdvFS
domain as directly host-attached at the OS level. So far I've only
seen application-level solutions for that, which are not feasible
for us.
Does anyone have more reliable information on whether and when the CFS
will get a distributed cache, which could then use the DRD direct-access
I/O to provide the missing functionality (already implemented in VMS,
as Selden E. Ball Jr. pointed out)?
Sorry for all these (a bit too) long postings, but this stuff is really
complicated to explain, and we are desperately in a hurry to finally get
this into production. I will appreciate any help and will summarize.
--
Dr. Udo Grabowski email: udo.grabowski_at_imk.fzk.de
Institut f. Meteorologie und Klimaforschung II, Forschungszentrum Karlsruhe
Postfach 3640, D-76021 Karlsruhe, Germany Tel: (+49) 7247 82-6026
http://www.fzk.de/imk/imk2/ame/grabowski/ Fax: " -6141