This patch keeps regular expression ranges, such as /[A-Z]/, from
breaking when internationalization is enabled.

The problem is this: many international locales place uppercase letters
next to lowercase letters in their collation order. While this results
in a more sensible "ls" output, it also breaks scripts that assume
/[A-Z]/ matches only upper case and /[a-z]/ matches only lower case.

The patch works around this issue by using the traditional ASCII
ordering of characters if both endpoints of a range are ASCII
characters. If either endpoint of a range is not ASCII, as in /[Á-Z]/
(the first letter in this range is an A with an acute accent), this
code uses the wcscoll() routine to determine the range.

Some issues before this can become a part of Gawk:

1) I have to sign the paperwork assigning copyright to the FSF. For
   legal reasons, I have to physically sign a paper and give it to them.

2) This may break on non-ASCII systems (as I recall, Gawk still has
   support for non-ASCII systems).

3) Maybe have an environment variable which reenables the old Gawk
   behavior. I'll have to use a static variable so we don't do an
   expensive getenv() call every time we look at a character.

- Sam

*** gawk-3.1.5/dfa.c.orig 2005-07-26 13:07:43.000000000 -0500
--- gawk-3.1.5/dfa.c 2006-11-02 15:32:41.000000000 -0600
***************
*** 2638,2646 ****
        wcbuf[2] = work_mbc->range_sts[i];
        wcbuf[4] = work_mbc->range_ends[i];
  
!       if (wcscoll(wcbuf, wcbuf+2) >= 0 &&
!           wcscoll(wcbuf+4, wcbuf) >= 0)
!         goto charset_matched;
      }
  
    /* match with a character? */
--- 2638,2663 ----
        wcbuf[2] = work_mbc->range_sts[i];
        wcbuf[4] = work_mbc->range_ends[i];
  
!       /* If both characters are ASCII characters, we use the ASCII
!        * ordering of the characters to determine the range. This way,
!        * i18n doesn't break regexes like /[A-Z]/ (which is supposed to
!        * mean "upper case only", and should never match lower-case) */
!       if (wcbuf[2] < 128 && wcbuf[4] < 128)
!         {
!           if (wcbuf[0] >= wcbuf[2] &&
!               wcbuf[4] >= wcbuf[0])
!             {
!               goto charset_matched;
!             }
!         }
!       else
!         {
!           if (wcscoll(wcbuf, wcbuf+2) >= 0 &&
!               wcscoll(wcbuf+4, wcbuf) >= 0)
!             {
!               goto charset_matched;
!             }
!         }
      }
  
    /* match with a character? */
*** gawk-3.1.5/doc/gawk.texi.orig 2006-11-02 15:40:43.000000000 -0600
--- gawk-3.1.5/doc/gawk.texi 2006-11-02 16:26:02.000000000 -0600
***************
*** 3830,3876 ****
  @section Where You Are Makes A Difference
  
  Modern systems support the notion of @dfn{locales}: a way to tell
! the system about the local character set and language. The current
! locale setting can affect the way regexp matching works, often
! in surprising ways. In particular, many locales do case-insensitive
! matching, even when you may have specified characters of only
! one particular case.
!
! The following example uses the @code{sub} function, which
! does text replacement
! (@pxref{String Functions}).
! Here, the intent is to remove trailing uppercase characters:
!
! @example
! $ echo something1234abc | gawk '@{ sub("[A-Z]*$", ""); print @}'
! @print{} something1234
! @end example
!
! @noindent
! This output is unexpected, since the @samp{abc} at the end of @samp{something1234abc}
! should not normally match @samp{[A-Z]*}. This result is due to the
! locale setting (and thus you may not see it on your system).
! There are two fixes. The first is to use the POSIX character
! class @samp{[[:upper:]]}, instead of @samp{[A-Z]}.
! The second is to change the locale setting in the environment,
! before running @command{gawk},
! by using the shell statements:
!
! @example
! LANG=C LC_ALL=C
! export LANG LC_ALL
! @end example
!
! The setting @samp{C} forces @command{gawk} to behave in the traditional
! Unix manner, where case distinctions do matter.
! You may wish to put these statements into your shell startup file,
! e.g., @file{$HOME/.profile}.
!
! Similar considerations apply to other ranges. For example,
! @samp{["-/]} is perfectly valid in ASCII, but is not valid in many
! Unicode locales, such as @samp{en_US.UTF-8}. (In general, such
! ranges should be avoided; either list the characters individually,
! or use a POSIX character class such as @samp{[[:punct:]]}.)
  
  For the normal case of @samp{RS = "\n"}, the locale is largely irrelevant.
  For other single byte record separators, using @samp{LC_ALL=C} will give you
--- 3830,3858 ----
  @section Where You Are Makes A Difference
  
  Modern systems support the notion of @dfn{locales}: a way to tell
! the system about the local character set and language. In particular,
! many locales do case-insensitive matching, even when you may have
! specified characters of only one particular case.
!
! In order to be compatible with traditional AWK scripts that
! assume an ASCII ordering of letters, if both characters in a
! regular expression range, such as @samp{[A-Z]}, are ASCII, Gawk will
! use ASCII ordering to determine the characters in the range. This, in
! particular, preserves the case sensitivity that
! traditional AWK scripts have utilized.
!
! This behavior is different than the behavior in earlier versions of
! Gawk. In earlier versions of Gawk, the current locale always
! determined what characters to put in a regular expression
! range. This behavior gave surprising results: Previously case-sensitive
! character ranges became case-insensitive, breaking AWK scripts.
!
! One consequence of this change is that @samp{[A-Za-z]} no longer
! matches accented letters in non-English locales. If this behavior
! is needed, use the POSIX character class @samp{[[:alpha:]]}, which
! matches all alphabetic characters. Another option is to use an accented
! character in the regular expression range, which will reinstate
! Gawk's older behavior.
  
  For the normal case of @samp{RS = "\n"}, the locale is largely irrelevant.
  For other single byte record separators, using @samp{LC_ALL=C} will give you