Nasty Perl Unicode bug

Sam Trenholme's webpage


This document describes a nasty Unicode bug in Perl. In the transcripts below, bold text is a command I typed in; non-bold text is what the computer wrote to the terminal. The accented character is a 2-byte UTF-8 sequence.

$ /usr/bin/perl --version

This is perl, v5.8.0 built for i386-linux-thread-multi
(with 1 registered patch, see perl -V for more detail)

Full Perl license text removed for brevity

$ /usr/local/bin/perl --version

This is perl, v5.8.8 built for i686-linux

Full Perl license text removed for brevity

$ echo á | /usr/bin/perl -pe 's/á/aye/'
á
$ echo á | /usr/local/bin/perl -pe 's/á/aye/'
aye

Perl 5.8.0 leaves the character alone; Perl 5.8.8 replaces it as expected. So, is there any way to work around this problem? Nope. You might think "use utf8" will fix this issue. It doesn't.

$ cat unicode.char
á
$ cat unicode.script
use utf8;

open(A,"< unicode.char");

while(<A>) {

        $_ =~ s/á/aye/;
        print;

}
$ /usr/bin/perl unicode.script
aye
$ /usr/local/bin/perl unicode.script
á

As you can see, "use utf8" makes Perl 5.8.0 do the right thing, yet breaks Perl 5.8.8. So maybe we can fix this with a conditional statement.

$ cat unicode.script.2
$vers=sprintf("%vd",$^V);

if($vers =~ /5.8.0/) {
 use utf8;
}

open(A,"< unicode.char");

while(<A>) {

 $_ =~ s/á/aye/;
 print;

}
$ /usr/bin/perl unicode.script.2
á
$ /usr/local/bin/perl unicode.script.2
aye

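No luck: both versions behave exactly as they did with no "use utf8" at all. ("use" takes effect at compile time and only inside its enclosing block, so wrapping it in an "if" does not make it conditional.) At this point, I gave up. These days, I write in either awk (for simple stuff) or Python (for complicated stuff). For example, none of the four freely downloadable awk interpreters has this problem: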

$ echo á | busybox awk '{gsub(/á/,"aye");print}'
aye
$ echo á | gawk '{gsub(/á/,"aye");print}'
aye
$ echo á | mawk '{gsub(/á/,"aye");print}'
aye
$ echo á | bwk-awk '{gsub(/á/,"aye");print}'
aye

The nice thing about awk is that there is a POSIX standard for it; this means I can write my awk scripts in a manner that will work on any modern system with an awk interpreter.

The nice thing about Python is that there is a strong commitment from the Python community not to arbitrarily break things or make changes that break scripts between bugfix releases.
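
For comparison with the awk one-liners above, here is a minimal sketch of the same substitution in Python. It assumes Python 3 and UTF-8 input; the names A_ACUTE and replace_a_acute are made up for illustration and do not come from any library. The idea is to decode the bytes to text explicitly, do the replacement on text, and encode back to bytes, so the result does not depend on locale settings.

```python
# Sketch only: assumes Python 3 and that the input bytes are UTF-8.
# A_ACUTE and replace_a_acute are hypothetical names for this example.

A_ACUTE = "\u00e1"  # the accented character; in UTF-8 it is the two bytes C3 A1


def replace_a_acute(raw: bytes) -> bytes:
    """Decode UTF-8 bytes, replace every a-acute with 'aye', re-encode."""
    text = raw.decode("utf-8")            # bytes -> text, explicitly as UTF-8
    replaced = text.replace(A_ACUTE, "aye")
    return replaced.encode("utf-8")       # text -> bytes for output


if __name__ == "__main__":
    # Same input as "echo á": the two-byte sequence followed by a newline.
    sample = b"\xc3\xa1\n"
    print(replace_a_acute(sample).decode("utf-8"), end="")
```

Working on bytes at the edges and on text in the middle is a deliberate choice in this sketch: it keeps the encoding assumption in one visible place rather than leaving it to whatever the interpreter picks up from the environment.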