Debugging Encoding Problems

Encoding problems can be difficult to debug: every layer of the software involved must do the right thing, or mojibake results. Time must be spent ensuring that the display terminal and text editors are configured correctly, and that any software you write properly handles input from and output to the filesystem, databases, and other software components. If possible, standardize on a particular encoding. On my home systems, I use UTF-8 everywhere possible:

$ perl -E 'say "\x{0123}"'
ģ
$ set | egrep -ai 'utf|unicode'
LANG=en_US.UTF-8
PERL_UNICODE=ASL
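
Incidentally, PERL_UNICODE takes the same flags as perl's -C switch: S applies the UTF-8 layer to stdin, stdout, and stderr, A expects @ARGV to be UTF-8, and L makes both conditional on a UTF-8 locale. Without that setting, the same one-liner should instead warn about a wide character and emit raw UTF-8 bytes, which a UTF-8 terminal happens to display correctly anyway:

$ env -u PERL_UNICODE perl -E 'say "\x{0123}"'
Wide character in say at -e line 1.
ģ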

Deciding where the encoding should take place is also important. When dealing with remote web services, or a database owned by a different group, the encoding should ideally be agreed on, and filtering done as soon as the data arrives from the remote service. This way, poorly encoded data cannot sneak deep into an unsuspecting software stack. In other cases, the decision may be between letting a particular library handle the encoding, as opposed to a higher level of the software stack: these pages are encoded as UTF-8 by the XSL <xsl:output … encoding="utf-8"/> statement; the Perl involved uses output_as_bytes and :raw to preserve the XSL library output.
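
As a minimal sketch of that boundary filtering ($octets and the error message here are placeholders for whatever a remote service actually returns), the core Encode module can decode octets the moment they arrive, with FB_CROAK failing loudly on malformed input rather than letting it through:

use strict;
use warnings;
use Encode qw(decode);

# $octets stands in for bytes just received from a remote service.
my $octets = "\xe3\x83\x90\xe3\x82\xab";

# Decode at the boundary; Encode::FB_CROAK dies on malformed input
# instead of silently substituting replacement characters, so poorly
# encoded data is rejected before it gets deeper into the stack.
my $text = eval { decode('UTF-8', $octets, Encode::FB_CROAK) };
die "malformed UTF-8 from remote service: $@" if $@;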

Testing output will depend on the OS and software involved; on Mac OS X, I use BBEdit, UnicodeChecker, Hex Fiend, and various Unix utilities such as xxd(1), od(1) (nope, never figured out or liked the options to hexdump), or perl(1).

$ echo バカ | perl -ple 's/([^ -~])/sprintf "\\x{%x}", ord $1/ge'
\x{30d0}\x{30ab}
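
To see the actual octets rather than the codepoints, xxd(1) works just as well; assuming the shell and terminal are sending UTF-8, the two katakana above should dump as three bytes apiece, plus the trailing newline:

$ echo バカ | xxd
00000000: e383 90e3 82ab 0a                        .......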