It looks like this is true, but it depends on the level of I/O that the
files in those directories are subject to. In our case, the files
are mostly application executables and/or scripts. They get used maybe
10-12 times a day for the most part, so I think the administrators here are
panicking.
Here is the original question:
The guys in charge of UNIX system administration here are telling us that
they want to move stuff around because we could have a performance penalty
if we keep placing things (files and directories) in a certain way.
They want to move around a lot of directory structures in order to work
around a potential performance problem. The issue here is that this would
force our division to rewrite our in-house configuration management tools
completely, which incurs a MAJOR amount of work.
Here is what they say that I would like you gurus to validate:
They are telling us that it is not recommended to have more than 320
directory entries under a particular directory branch (I assume they
mean starting from root). They are telling us that if there are more than
320 entries, the system has to use what they call "extensions" and that the
performance problem grows exponentially with the number of extensions used!
I never saw any recommendations like that, or maybe they misunderstood
something.
Does anyone know what they might be talking about? Most of our servers are
DEC Alphas with AdvFS; we also have a couple of HP-SUX machines and a
couple of Sun servers....
----------------------------------------------------------
Thanks to the following people for their excellent technical input (Sorry
if I forgot someone)
alan_at_nabeth.cxo.dec.com: (Alan Rollow)
Maybe, but it depends on the file system.
A directory is just a special sort of file which contains
file name/address combinations. The detailed organization
of an entry will depend on the file system.
How file space is organized and addressed depends on the file
system. On AdvFS, files prefer to be organized as a single
extent with a starting LBN and a length. When a file
can't be stored as a single extent it isn't contiguous, and
some performance penalty may apply when you have to seek
between two different parts of the disk to read all the data.
On UFS, space is organized in a four level structure. The
basic allocation size of the file system is the file system
block which is 8 KB of data. This 8 KB is always contiguous.
The first 12 blocks of a file have their addresses stored in
the inode for the file. These blocks may be scattered widely
across the disk or they may be contiguous. But, their addresses
are immediately available from the inode, which will be locked
in memory while the file is open.
The inode has space for 15 block addresses. The remaining three
are used to store the addresses of indirect blocks. The first
indirect block is a file system block which contains more block
addresses of the file. Since each address is a 32-bit number,
an 8 KB block can store 2048 addresses (16 MB of data). The
2nd indirect block contains the address of a block which
contains the addresses of more first-level indirect blocks:
2048 * 2048 addresses. The 3rd indirect block contains the
address of a block which contains the addresses of 2nd-level
indirect blocks, and so on down the line. The 2nd indirect
block can address 32 GB of disk space. The 3rd indirect block
is needed for larger files.
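
To make that arithmetic concrete, here is a small C sketch (illustrative
only, not actual UFS code) assuming the 8 KB blocks and 4-byte block
addresses described above:
****************************************************************
#include <stdio.h>

/* UFS addressing arithmetic, assuming 8 KB file system blocks and
 * 32-bit (4-byte) block addresses as described above. */
int main(void)
{
    const unsigned long long blk  = 8192;      /* file system block size    */
    const unsigned long long nptr = blk / 4;   /* addresses per block: 2048 */

    unsigned long long direct = 12ULL * blk;        /* 12 direct blocks     */
    unsigned long long single = nptr * blk;         /* 1st indirect block   */
    unsigned long long dbl    = nptr * nptr * blk;  /* 2nd indirect block   */

    printf("direct blocks:   %llu KB\n", direct / 1024);                 /* 96 KB */
    printf("single indirect: %llu MB\n", single / (1024 * 1024));        /* 16 MB */
    printf("double indirect: %llu GB\n", dbl / (1024ULL * 1024 * 1024)); /* 32 GB */
    return 0;
}
****************************************************************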
In a naive file system implementation, sequentially reading a
large file would require re-reading the indirect blocks to get the
address of each new block read. But this UFS implementation
has been around for years and is hardly naive. Notably, all the inode
data and indirect blocks will be cached. Since the lowest level
indirect blocks will be touched the most, they will also tend to
be in the buffer cache when needed. The 2nd and 3rd level indirect
blocks, when used, may not be in the cache, but they're only needed
every 16 MB of the file. The extra one or two 8 KB reads needed
to get the lower level indirect blocks are insignificant compared
to the amount of data they describe.
The concern of your system managers probably comes from the fact
that large directories are large files and therefore use indirect
blocks. The particular limit probably assumes that each directory
entry uses a fixed amount of space, and that with that many
entries the directory file grows larger than the 96 KB described by
the direct blocks. This particular assumption is wrong since directory
entries are variable length and only need 12 bytes of overhead
per entry; the rest is used to store the name. You could store
over 6000 names in 96 KB if each name was only four characters long
(whether you can come up with 6000 distinct four-character names is
a different issue).
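
As a rough check of that figure, here is the packing arithmetic as a
C sketch; the 12 bytes of overhead comes from the description above,
and the 4-byte padding of each entry is my assumption:
****************************************************************
#include <stdio.h>

/* Rough capacity estimate: 12 bytes of overhead per entry plus the name,
 * padded to a 4-byte boundary (the padding is an assumption for this
 * sketch).  96 KB is the space covered by the 12 direct blocks. */
static unsigned long entry_size(unsigned long namelen)
{
    unsigned long sz = 12 + namelen;
    return (sz + 3) & ~3UL;            /* pad to 4-byte boundary */
}

int main(void)
{
    unsigned long dirsize = 12UL * 8192;   /* 96 KB of direct blocks */
    printf("4-char names:  %lu entries\n", dirsize / entry_size(4));   /* 6144 */
    printf("12-char names: %lu entries\n", dirsize / entry_size(12));  /* 4096 */
    return 0;
}
****************************************************************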
When a directory is over 96 KB, one extra block needs to be
read to determine where the rest of the data is. This will be
less than 8% of the total I/O to sequentially read the file,
and that's the highest it will ever be. That single 8 KB block will
give access to as much as 16 MB more data and shouldn't have to
be read more than once.
Large directories on UFS do cause performance problems, but they
have to be much larger than a mere 96 KB. File lookup is done
sequentially. The kernel maintains a cache of file name
entries, but large directories will need more entries than
the cache has, so the cache won't be effective. When this
happens, file name lookup time reverts to the time needed to
sequentially read the directory file. Reasonably large directories
should be cached just like ordinary files, so such a lookup may run
at close to memory speeds, but when the file becomes too large
it won't be cached and you're back to disk speed.
Other issues show up when directories become large. The best
example is ls(1). Just a basic ls(1) command with no options
reads all the file names and sorts them. This sort is probably
done entirely in memory, so memory will have to be allocated
for the data, and the names sorted and then printed. In a large
directory this could be a significant amount of memory, which
could cause the particular ls(1) command to page, or cause the
system to page so that ls has the memory it needs. When
other options are used (-l especially) even more work is
needed because it takes more memory for the additional information,
and each file needs a stat(2) system call performed on it, which may
require another sequential search of the directory. I think this
makes ls(1) with -l have O(n*n) performance.
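
A rough sketch of what "ls -l" has to do per entry, which is where the
extra work comes from (this is an illustration, not the real ls):
****************************************************************
#include <stdio.h>
#include <dirent.h>
#include <sys/stat.h>

/* Read every name in a directory and stat(2) each one, the way "ls -l"
 * must.  On a file system that searches directories sequentially, each
 * stat is itself a lookup, so total work grows roughly as n*n.
 * (Illustration only; the real ls also sorts and formats the output.) */
int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";
    DIR *d = opendir(path);
    struct dirent *e;
    struct stat st;
    char buf[4096];
    unsigned long entries = 0, stats = 0;

    if (d == NULL)
        return 1;
    while ((e = readdir(d)) != NULL) {
        entries++;
        snprintf(buf, sizeof buf, "%s/%s", path, e->d_name);
        if (stat(buf, &st) == 0)       /* one name lookup per entry */
            stats++;
    }
    closedir(d);
    printf("%lu entries, %lu stat(2) calls\n", entries, stats);
    return 0;
}
****************************************************************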
AdvFS behaves differently at large directory sizes. It uses
a different storage method, so it won't have the indirect block
problem. I have observed that at large sizes file lookup
becomes CPU bound, with all the CPU time going to kernel mode.
This mis-feature has been recognized and I'm told will be
fixed in a future release. I wasn't told what release.
In summary, yes, you may have a problem when directories get
large. But it won't necessarily happen at 320 entries.
It may be at a few hundred, a few thousand or 100,000. Where
the problem appears depends on the file system, the directory size
and how the directory is accessed. You want to study your boundary
conditions for how large an implementation can be, and then see
if the time to perform a file lookup dominates the time to process
a file. When this is the case, then you want to think about
changing the design.
----------------------------------------------
J. Dean Brock <brock_at_cs.unca.edu>
At some point the size of the directory file will exceed one page and
additional pages must be allocated for the directory. It is
possible that those additional pages will be allocated in new AdvFS
file extents.
Run showfile -x on a directory to see if this has happened:
****************************************************************
showfile -x /users
Id Vol PgSz Pages XtntType Segs SegSz Log Perf File
2.8001 1 16 2 simple ** ** ftx 100% users
extentMap: 1
pageOff pageCnt vol volBlock blockCnt
0 2 1 3321248 32
extentCnt: 1
****************************************************************
This particular directory (which has 423 entries) has a directory
file which occupies two pages but one extent.
Because the size of an AdvFS directory entry is about sixteen bytes plus
the length of a component name, you would expect new pages to be allocated
about every 300 files, so that might be where the 320 came from.
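
Here is that estimate as a back-of-the-envelope C sketch; the 8 KB page
comes from the PgSz of 16 (512-byte blocks) in the showfile output, and
the average name lengths are just guesses:
****************************************************************
#include <stdio.h>

/* Entries per 8 KB AdvFS page, assuming roughly 16 bytes of overhead
 * plus the component name.  Name lengths are illustrative guesses. */
int main(void)
{
    unsigned long page = 16 * 512;     /* one AdvFS page = 8 KB */
    unsigned long namelen;

    for (namelen = 8; namelen <= 16; namelen += 4)
        printf("avg %2lu-char names: ~%lu entries per page\n",
               namelen, page / (16 + namelen));
    return 0;
}
****************************************************************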
> the system has to use what they call "extensions" and that the
> performance problem grows exponentially with the number of extensions used
In the worst case, sequential search is used to find a directory entry,
so directory lookup does grow linearly with the number of directory
entries.
Yeah, looking up files in 10,000-entry directories could get
slow. You'd be better off using 100 100-entry subdirectories.
However, a few large directories really isn't much of a problem.
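
One common way to make that kind of split is to hash the file name into a
fixed set of subdirectories; the hash function and the bucket count below
are arbitrary choices for this sketch, not anything Dean suggested:
****************************************************************
#include <stdio.h>

/* Pick a subdirectory for a file by hashing its name, so no single
 * directory grows large.  The hash (djb2-style) and the 100 buckets
 * are arbitrary choices for this illustration. */
static unsigned bucket(const char *name, unsigned nbuckets)
{
    unsigned h = 5381;
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h % nbuckets;
}

int main(void)
{
    const char *file = "whatever.log";
    printf("store %s under subdirectory %02u/\n", file, bucket(file, 100));
    return 0;
}
****************************************************************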
--------------------------------------
Berry Kercheval <berry_at_join.com>
This is, well, essentially correct. Under Unix filesystems, directories are
stored as files in the filesystem, and map names to 'inodes'.
Inodes are a data structure that lives on the disk and contains things like
the permissions, last-modified time, size, and the disk blocks that belong
to this file. The inode lives in a portion of a block and usually contains
the block numbers of the first few blocks in a file; thus most small files'
block pointers can live entirely in the inode structure.
If the file is large enough that more disk blocks are needed to hold its
data, then a further pointer in the inode points to a block that contains
block pointers. This is called an 'indirect block' and is what I think
your admins are referring to when they say 'extensions'.
If one indirect block is not enough, the next inode pointer points to a
block of pointers to indirect blocks, called a double indirect block, and
a further pointer points to a triple indirect block (if needed).
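
For reference, a simplified picture of the classic BSD-style on-disk inode
being described; the field names and types here are illustrative, not the
Tru64 headers:
****************************************************************
#include <stdint.h>
#include <sys/types.h>

/* Simplified sketch of a classic UFS-style on-disk inode.  Field names
 * are illustrative; the real structure has more bookkeeping. */
struct disk_inode {
    mode_t  mode;          /* permissions and file type              */
    uid_t   uid;           /* owner                                  */
    gid_t   gid;           /* group                                  */
    off_t   size;          /* file size in bytes                     */
    time_t  mtime;         /* last-modified time                     */
    int32_t direct[12];    /* addresses of the first 12 data blocks  */
    int32_t indirect;      /* -> block of data block addresses       */
    int32_t dindirect;     /* -> block of indirect block addresses   */
    int32_t tindirect;     /* -> block of double indirect addresses  */
};
****************************************************************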
Now, since directories themselves are files, their data is stored in blocks
pointed to by these data structures. If a directory gets too large, its
later entries must be located through indirect blocks, and each time a file
is accessed (technically, when its name is resolved -- old Unix systems used
a kernel routine called 'namei' to map names to inodes) the structure must
be traversed. Thus the bigger the directory, the worse the problem.
Modern Unix systems should implement 'name translation caches' to avoid
this penalty after the first access to a file, but like all caches the size
is limited.
I suspect they are NOT saying no more than 320 from root, but no more than
320 in any one directory; generally a good idea if possible.
-----------------------------------
John Hascall
I presume they are talking about the inode data block pointers.
+ A directory consists of directory entries.
+ A directory entry consists of a filename, a 4-byte inode# (see
the inode# with "ls -i") and 2 short ints.
+ An inode consists of a "stat" structure (what you see with "ls -l")
and 12 direct block pointers (plus indirect pointers).
If your file (directory file in this case) is 12 blocks
long or less, the 12 block pointers point directly to data
blocks (e.g., the contents of the directory). If the file
is longer, then you have to use the indirect block pointers,
which point to a block of pointers pointing to the actual data
blocks. If the file gets really big, then double and triple
indirection are used.
I'm not sure where they got the specific number 320, because
the size of a directory entry varies with the length of the
file name (e.g. the entry for "SuperDuperSpecialFrobozer.c"
takes more room than the entry for "sdsf.c"). If I am doing
the math correctly:
8192 bytes per block * 12 direct blocks = 98304 bytes
then 96 KB is the largest file (or directory) which can stay
with direct data pointers. And the maximum directory entry
size is:
4-byte inode# + (2 * 2-byte shorts) + 256 bytes max filename len = 264 bytes
and that gives:
98304 / 264 = 372 entries (best case).
Note that directory entries are inserted into the directory with a
"first fit" algorithm, so 372 is a best case (assuming no wastage).
And, indeed, there is some overhead in traversing the indirect
pointers.
Besides this consideration, note that searching for an entry
in a directory (for open, create, remove, etc) is a sequential
operation, so that is another (stronger, in my opinion) reason
to limit the number of entries in a directory. Personally,
I like to keep my directories to 8KB or less.
There is also good reason to use filenames of 16 characters
or less since that is the maximum length filename which can
go in the name-to-inode, "namei", cache. Note by filename
here I mean each component of the filename. That is:
/usr/var/adm/spool/whatever.log
is ok as each part (e.g., "spool") is 16 chars or less,
but this:
/tmp/my-this-is-a-long-filename.txt
would have "tmp" in the namei cache, but not
"my-this-is-a-long-filename.txt".
Hope this is some help.
John Hascall
--------------------------------------------------
Guy Dallaire
Groupe Conseil DMR
Tel.:
MRQ: 652-5703
DMR: 653-6882 ext 6072
E-mail: guy.dallaire_at_riq.qc.ca, guy_dallaire_at_dmr.ca
Received on Wed Nov 26 1997 - 22:48:26 NZDT