NSR, tr, and locales

From: Didier Wagenknecht, Security Officer, ADM-SIC, EPFL <wagen_at_slalpha1.epfl.ch>
Date: Fri, 26 May 1995 14:12:11 +0200

Hi everybody,

I had problems with the saveindex function of Networker. It sometimes outputs
garbage, complaining that there is no index to save for "&^%&%scrambled*&%".

While looking into NSR to find out why I was getting these messages, I found
that saveindex is a script, and that the garbled output comes from a conversion
with 'tr'. In fact, tr prints scrambled output only if the locale is not
English. Here are some tests to show this:

LANG=
LC_COLLATE="fr_CH.ISO8859-1"
LC_CTYPE="fr_CH.ISO8859-1"
LC_MONETARY="fr_CH.ISO8859-1"
LC_NUMERIC="fr_CH.ISO8859-1"
LC_TIME="fr_CH.ISO8859-1"
LC_MESSAGES="fr_CH.ISO8859-1"
LC_ALL=fr_CH.ISO8859-1

osftag.geo.dec.com> echo "Slallpha1.epfl.ASD" | tr '[A-Z]' '[a-z]'
sKaKKXGa1.ZXKK.asd
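
For comparison, here is a minimal sketch that runs the same conversion with
the locale forced to C for that one tr invocation only. In the C locale the
ranges A-Z and a-z cover exactly the 26 ASCII letters, so the result should
be the plain lowercase string. The function name lower_c is only an
illustration; it is not part of saveindex or Networker.

```shell
# Hypothetical helper (name is an assumption, not from Networker):
# lowercase ASCII letters with the locale pinned to C for tr only,
# so that [A-Z] and [a-z] mean exactly the 26 ASCII letters.
lower_c() {
    printf '%s\n' "$1" | LC_ALL=C tr '[A-Z]' '[a-z]'
}

lower_c "Slallpha1.epfl.ASD"
```

Setting LC_ALL on the command itself leaves the locale of the rest of the
script untouched.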

As more or less all of us are managers, there is a good chance that sooner or
later you will write a script with a 'tr' command in it. BEWARE!
Here is the explanation from Mr. Agassis, at DEC:
"
Believe it or not, this is the correct behavior.

        To understand why, remember that a character range like [a-z]
        represents all the characters that collate between a and z.
        In the default system configuration (the C locale), those
        characters are a, b, c, d, e, ..., z, just as you would
        expect. In the de_CH.ISO8859-1 locale though, [a-z] includes
        many more than 26 lowercase letters. That's because the
        locale has a collation order like this:

            <a>
            <A>
            <feminine>
            <a-acute>
            <A-acute>
            ...
            <z>
            <Z>

        By looking at this order, and by thinking a little about how
        tr must work, it's not too hard to understand why the funny
        output in the second example is indeed correct.

        The moral of the story is to avoid using character ranges.
        The hip, internationalized way to do what you want to do is as
        follows:

            $ echo ABCDEFGHIJKLMNOPQRSTUVWXYZ | tr '[:lower:]' '[:upper:]'
            ABCDEFGHIJKLMNOPQRSTUVWXYZ

        Hope this helps.
"
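
The example quoted above maps lowercase to uppercase. A minimal sketch of the
same character-class approach in the other direction (uppercase to lowercase,
as in the test earlier in this message) could look like this. The function
name to_lower is an assumption made for illustration, and this is not
necessarily how saveindex does it.

```shell
# Hypothetical helper (name is an assumption): lowercase a string using
# POSIX character classes instead of ranges, so the result does not
# depend on where letters fall in the locale's collation order.
to_lower() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]'
}

to_lower "Slallpha1.epfl.ASD"
```

Which characters count as upper or lower still depends on LC_CTYPE, so
accented letters may be converted as well in a non-C locale.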

I corrected the saveindex command of Networker, and everything is fine now.
I hope other people won't be caught by this not-so-obvious behavior of 'tr',
especially in countries with a locale other than en_US.ISO8859-1.

Cheers,

Didier

_____________________________________________________________
  Didier Wagenknecht, Service Informatique Central (SIC/SL)
  Ecole Polytechnique Federale de Lausanne
  CH-1015 Lausanne (Switzerland)
  E-mail : wagen_at_sic.adm.epfl.ch
  Phone : (+41.21) 693.22.18, Fax : (+41.21) 693.22.20
Received on Fri May 26 1995 - 08:12:47 NZST
