[lug] grep question
jeffrey.haemer at gmail.com
Mon Jun 11 07:08:36 MDT 2007
Okay, now I have time. Here's a little more background, in five, easy
(1) In Unix, collation (the order of characters) and expressions built on
collation order ("[A-Z]") originally used ASCII collating order.
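A quick way to see what ASCII collating order means in practice is the sketch below. It assumes a POSIX-ish sort(1) and grep(1), and it forces the C locale with LC_ALL=C on each command so the result doesn't depend on how the system happens to be set up. The sample words are made up.

```shell
# Force the C locale per command, so plain byte (ASCII) order applies.
sorted=$(printf '%s\n' b A a B | LC_ALL=C sort)
printf '%s\n' "$sorted"
# Every uppercase letter sorts ahead of every lowercase one: A, B, a, b.

# In the same locale, [A-Z] means exactly the 26 uppercase ASCII letters.
matched=$(printf '%s\n' apple Banana cherry | LC_ALL=C grep '[A-Z]')
printf '%s\n' "$matched"
# Only "Banana" contains a character in that range.
```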
A few things made people re-think that assumption.
The obvious thing was character sets with more than 128 characters. Only a
few languages can be written without funny letters. Of modern languages, I
think the list is English, Indonesian, Hawaiian, and Swahili. If you do an
ls(1), where should files that start with Thai characters sort, and what
order should they come in? What should sort(1) do with a list of Danish
first names? The Germans and Japanese finally got enough money that Unix
Different, but in the same category, was EBCDIC. If you wanted to make a
Unix work-alike -- say, grep(1) -- for an old, IBM mainframe, how should it
behave? IBM had always had enough money, but finally started caring about
Different, but in a different category, was the desktop market. MS-DOS had
case-insensitive filenames, and everyone's marketing department thought that
they could finally sell Unix to some people who'd gotten used to Windows.
(2) To address these, POSIX invented a mechanism to specify a collating
order that's separate from the character-set order. Used to be that if you
wanted to sort backwards, you'd say "sort -r". Today, you can create a new
collating sequence, install it, tell the system to use that order, and then
call "sort" without a flag. See how much better that solution is? Me
neither. And when's the last time anyone asked us, anyway?
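For the curious, here is a rough side-by-side of the two approaches. It doesn't build or install a new collating sequence (that part is left out); it only assumes the C locale, which every POSIX-ish system should have, and made-up input. The subshell unsets LC_ALL because, when that variable is set, it wins over LC_COLLATE.

```shell
# The old way: ask for a different order with a flag.
reversed=$(printf '%s\n' a b c | LC_ALL=C sort -r)
printf '%s\n' "$reversed"

# The locale way: pick the collating order through the environment and
# call sort with no flag at all.
by_locale=$(printf '%s\n' a B c | (unset LC_ALL; LC_COLLATE=C sort))
printf '%s\n' "$by_locale"
# In C order the uppercase B comes first: B, a, c.
```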
(3) This mechanism was one of several innovations that came to Unix around
the same time, all for similar reasons. For example, your keyboard has a
dollar sign; some keyboards have pound signs or Euro symbols; some even have
more than one. Some places, they write ten thousand as "10000", some as
"10,000", some as "10.000", some as "1,0000". Don't you want to be able to
tell a system how to print prices in Saudi Riyals or Kuwaiti Dinars? Yeah,
me neither. People who make really a lot of money selling computers all do.
(4) On systems that approximate POSIX-conformance, these behaviors are
governed by environment variables called things like LC_MONETARY and LC_TIME
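and LC_COLLATE. There is, however, one ring that rules them all. Okay,
two rings: LANG and LC_ALL. They differ in subtle but boring ways: LANG is
the default for any category that isn't set individually, while LC_ALL
overrides everything, LANG included. Use LANG: it's fewer characters to
type. (If it seems to be ignored, check whether LC_ALL or LC_COLLATE is
set.) If you try "echo $LANG" you'll see what rules someone has told your
system you want.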
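To watch the rings at work, locale(1) prints the setting in force for each category. The sketch below assumes a POSIX-ish locale command. The name en_US.UTF-8 is only an example and doesn't need to be installed for this to run, since LC_ALL takes over in that case.

```shell
# LANG by itself: with nothing more specific set, it supplies the
# collation setting. The subshell keeps the unset from leaking out.
lang_only=$( (unset LC_ALL LC_COLLATE; LANG=C locale) | grep '^LC_COLLATE=' )
printf '%s\n' "$lang_only"

# LC_ALL on top: it overrides LANG (and every LC_* variable).
overridden=$(LC_ALL=C LANG=en_US.UTF-8 locale | grep '^LC_COLLATE=')
printf '%s\n' "$overridden"
# Both lines report C as the collation locale.
```

Running plain "locale" with no arguments shows the whole set at once, which is usually the fastest way to find out which variable is actually in charge.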
(5) To provide normal, predictable, sane behavior -- or, as it's known in
marketing circles, "traditional Unix behavior" -- say LANG=C. You can say
other stuff that works, too, like LANG=POSIX or LANG=XOPEN or even (I'm
pretty sure -- all of this is from memory) unset LANG.
The first of these, LANG=C, is the fewest characters to type.
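C and POSIX are the two locale names that POSIX itself promises will exist, and they are supposed to behave the same. A minimal check, assuming a POSIX-ish sort(1) and made-up input; the subshells unset LC_ALL and LC_COLLATE first, because either one would beat LANG.

```shell
# Sort the same made-up list under LANG=C and under LANG=POSIX.
with_c=$(printf '%s\n' Zebra apple Mango |
  (unset LC_ALL LC_COLLATE; LANG=C sort))
with_posix=$(printf '%s\n' Zebra apple Mango |
  (unset LC_ALL LC_COLLATE; LANG=POSIX sort))

printf '%s\n' "$with_c"
# Byte order again: Mango, Zebra, apple.

# The two spellings should agree.
[ "$with_c" = "$with_posix" ] && echo "C and POSIX agree"
```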
On 6/11/07, karl horlen <horlenkarl at yahoo.com> wrote:
> --- Jeffrey Haemer <jeffrey.haemer at gmail.com> wrote:
> > export LANG=C
> > will cure this problem.
> > (If you want a long explanation, let me know and
> > I'll write one tomorrow.
> > Right now, I'm, uh, otherwise occupied.)
> i'm not having the problem, but i'd be curious to hear
> your explanation...
> Web Page: http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
Jeffrey Haemer <jeffrey.haemer at gmail.com>