AWK annoyances

Awk is a simple yet powerful language for processing text files and generating reports. Originally implemented in the late 1970s, the language was greatly extended in the mid-1980s and is described in the 1988 book The AWK Programming Language. For people who do not wish to purchase this book, the language is also fully described in Effective AWK Programming, the freely downloadable manual for Gawk.

Awk is a standard part of UNIX and Unix clones. Over the years, no fewer than five different C-language AWK interpreters, a Java-language AWK interpreter, and two AWK compilers have been made. Here is a chart comparing the five C-language AWK interpreters:

Name          Primary Maintainer    Last release
Original AWK  Brian Kernighan       April 24, 2005
Mawk          Michael D. Brennan    September 1996
Gawk          Arnold D. Robbins     July 26, 2005
Busybox AWK   Dmitry Zakharov       October 29, 2006
MKS AWK       The OpenSolaris team  October 2006

All five of these AWK implementations are open source, and can be freely downloaded here. Just click on the name to download the file in question. Note that the AWK in Busybox is just a small part of a much bigger toolkit for embedded systems; this particular Awk, when compiled, results in the smallest AWK binary of any of the above five Awks. Gawk creates the largest binary.

The original AWK is just that: the very first AWK implementation, with code going back to the late 1970s. It has been updated to compile and run on modern systems, such as Linux, Windows, and FreeBSD. This version of AWK became open-source code in 1996. This is the default AWK that comes with FreeBSD.

The next AWK implementation to appear was MKS AWK, which dates back to the mid-1980s. Its source code recently became public when OpenSolaris was released. OpenSolaris uses both this version of AWK and earlier versions of the original AWK. I have ported this version to Linux; click on "MKS AWK" to download the Linux port.

The next AWK implementations to appear were the open-source Gawk and Mawk. Both the Free Software Foundation and Michael Brennan wanted a free version of AWK in the late 1980s and early 1990s; neither was aware of the other's work, so these two independent free implementations of AWK were written around the same time. Mawk is the default AWK that comes with Debian and Debian-derived distributions, such as Ubuntu. Gawk is the default AWK that comes with most other Linux distributions.

The next AWK implementation to appear was the AWK that comes with Busybox. Busybox is a project to make the standard UNIX tools available using as little memory and disk space as possible. Dmitry Zakharov implemented AWK for Busybox starting in 2002. While earlier versions had a number of bugs and incompatibilities, Mr. Zakharov has been actively maintaining this version of AWK; more recent versions are both POSIX-compliant and able to run legacy AWK scripts. This is, not surprisingly, the smallest AWK implementation.

There are some other interesting AWK implementations out there: xgawk extends AWK with database connectivity and other features missing from the traditional AWKs. awka is a project that allows one to make C programs from AWK scripts. jawk is a project to implement AWK in Java.

Annoyances

With the number of implementations of AWK out there, it is no surprise that there are some incompatibilities between versions.

Internationalization breaks AWK

One of the biggest annoyances is that using a non-C/POSIX locale breaks AWK scripts in both Gawk and the current version of MKS AWK. The problem is this: A case-sensitive regular expression range, such as /[A-Z]/, loses its case sensitivity in just about any non-English locale. There are two solutions which are portable across currently used AWK implementations:
  1. Set the LC_ALL and LANG environment variables to have a locale of "C" before running AWK. This only works in shell scripts that call Gawk/MKS AWK.
  2. Use the ugly regular expressions /[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/ and /[abcdefghijklmnopqrstuvwxyz]/.
In particular, the regular expressions /[[:upper:]]/ and /[[:lower:]]/ are not a portable option, for reasons I will detail later on in this article.
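
As a minimal sketch of the two workarounds above, the following shell functions strip everything except uppercase ASCII letters from a string. The function names upper_only_c_locale and upper_only_explicit are made up for this illustration, and the sketch assumes an "awk" on the PATH that supports gsub(), which all five implementations discussed here should.

```shell
#!/bin/sh

# Workaround 1: force the C locale for this one awk invocation only.
# In the C (POSIX) locale, [A-Z] covers just the 26 uppercase ASCII letters.
upper_only_c_locale() {
    printf '%s\n' "$1" | LC_ALL=C LANG=C awk '{ gsub(/[^A-Z]/, ""); print }'
}

# Workaround 2: spell out every letter so that no range is involved at all.
# This does not depend on the locale's collation order.
upper_only_explicit() {
    printf '%s\n' "$1" |
        awk '{ gsub(/[^ABCDEFGHIJKLMNOPQRSTUVWXYZ]/, ""); print }'
}

upper_only_c_locale 'Hello World'   # prints HW
upper_only_explicit 'Hello World'   # prints HW
```

The first form keeps the regular expression readable but relies on the caller controlling the environment; the second works even when the script is run directly with "awk -f" under whatever locale the user happens to have.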

There is a perception that POSIX requires these regular expressions to break in non-C/English locales. This is not true; the standard merely states that ranges may break. For example, POSIX section 9.3.5, item 7, says: "In other [non-POSIX] locales, a range expression has unspecified behavior". (The "POSIX" locale is also known as the "C" locale.)
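
Because the behavior is unspecified rather than required, the only way to know what a given AWK does is to ask it. The following is a rough probe, not a definitive test: the function name range_folds_case is invented for this sketch, and it checks a single letter under whatever locale is currently in effect.

```shell
#!/bin/sh

# Succeeds (exit status 0) if the given awk lets /[A-Z]/ match a lowercase
# letter in the current locale, fails otherwise.
# The probe uses "b" rather than "a": in many dictionary-style collation
# orders "a" sorts before "A" and so would fall outside the range anyway.
range_folds_case() {
    awk_bin=${1:-awk}
    printf 'b\n' | "$awk_bin" '/^[A-Z]$/ { folded = 1 } END { exit !folded }'
}

if range_folds_case awk; then
    echo "ranges are locale-sensitive here; [A-Z] matched a lowercase letter"
else
    echo "ranges are case-sensitive here"
fi
```

Running the same probe with LC_ALL set to "C" should always report case-sensitive ranges, which makes it a quick way to confirm that the locale, and not the script, is the culprit.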

Dr. Kernighan (the "K" in AWK), when dealing with this issue, said:

strcoll is meant for sorting, where merging upper and lower case may make sense (though note that unix sort does not do this by default either). it is not appropriate for regular expressions
(See the "FIXES" file included with his implementation of AWK)

Since POSIX allows an internationalized AWK to remain compatible with legacy AWK scripts, and one of the three original implementors of AWK feels that such scripts must not be broken, I have a patch for Gawk that fixes this problem. This patch maximizes compatibility and minimizes the number of scripts that will break; ranges with non-ASCII characters still have the internationalization-aware behavior.

POSIX character classes are not universal

POSIX character classes are not universal across AWK implementations. In particular, the Mawk implementation of AWK does not have POSIX character class support. Since this is the default AWK that comes with Debian and Ubuntu, one cannot use POSIX character classes in regular expressions like /[[:upper:]]/ and /[[:lower:]]/ without breaking AWK scripts on these widely-used Linux distributions.
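
A script that has to run on unknown systems can check for character class support before relying on it. The sketch below is one hedged way to do so; awk_has_posix_classes is a name invented for this example, and the probe only checks [[:upper:]] against a single letter.

```shell
#!/bin/sh

# Succeeds (exit status 0) if the given awk understands [[:upper:]].
# An awk without character class support will most likely read the
# pattern as an ordinary bracket expression followed by a literal "]",
# which cannot match a line consisting of just "A".
awk_has_posix_classes() {
    awk_bin=${1:-awk}
    printf 'A\n' | "$awk_bin" '/^[[:upper:]]$/ { ok = 1 } END { exit !ok }'
}

if awk_has_posix_classes awk; then
    echo "character classes supported"
else
    echo "character classes missing; use explicit letter lists instead"
fi
```

The function takes the interpreter name as an argument, so the same check can be pointed at "mawk", "gawk", or "busybox awk" wrappers when several are installed.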

POSIX character classes seem to be pretty rare in AWK scripts; there is only one bug reported in the Ubuntu bug database where someone had a problem with this.

I have a patch that adds POSIX character class support to Mawk. Considering Debian's speed of development and Ubuntu's seeming lack of interest in updating their core utilities, it will probably take years for this patch to become a part of these distributions.

Indeed, I am not the first person to try to update Mawk's regex engine. Aleksey Cheusov, in the summer of 2005, patched Mawk to use an external regular expression engine. I have made a copy of the patch which people can download. His approach is different; instead of updating Mawk to have more features in its own regular expression engine, he simply has Mawk use an external engine.

His patch allows a variety of external regular expression engines to be used. For example, to use libc's regex engine:

./configure && make

Or, to use the "tre" regex engine:

CFLAGS='-O3 -I/usr/include/tre' LDFLAGS='-ltre' ./configure && make

Note that, after applying his patch, autoconf needs to be run to create a new 'configure' script.