Python and Java internationalization.

Victor Ng and I had a discussion about the differences between Java and Python concerning internationalization. The conversation ran on two levels: a general, high-level comparison, and low-level examples shot back and forth.

Victor’s example is to take a Unicode string and extract all of the valid US-ASCII/ISO-8859 characters out of it. In Python, you’d write:


someUnicodeString.encode('ascii', 'ignore')

That’s nice, concise and simple. I like it, and Python definitely is elegant. In Java, you’d do:


someUnicodeString.getBytes("US-ASCII")

That’s where it gets a little trickier in Java. The charset parameter is the canonical name of a character-to-byte encoder, which is responsible for mapping each Unicode character to one or more bytes. When you use the character set “US-ASCII”, you get a byte array in which every character that cannot fit into that set is returned as “?”. If you want to discard those unmappable characters instead, there’s no obvious fast way I can see to do that in Java.

Looking at the example more closely, the character set parameter gives you a nice amount of flexibility in how the characters are converted. You can use language-specific encoders, such as Shift_JIS or X-EUC-JP, or actual Unicode formats like UTF-8. In relation to this example, though, if I wanted to ignore unknown characters in Java I would probably have to iterate over them in some horrible loop that checks the validity of each one (which is presumably what Python is doing behind the scenes).
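To make the two behaviours concrete, here is a minimal sketch that puts the "ignore" and "replace" error handlers side by side with the kind of hand-written loop described above. It assumes Python 3 (where every str is Unicode and encode returns bytes); the helper name ascii_only and the sample string are made up for illustration.

```python
def ascii_only(text):
    """Hand-rolled version: keep only code points below 128, then encode."""
    kept = "".join(ch for ch in text if ord(ch) < 128)
    return kept.encode("ascii")


# Hypothetical sample: "cafe" with an accented e, a space, two Japanese characters.
sample = "caf\u00e9 \u65e5\u672c"

ignored = sample.encode("ascii", "ignore")    # unmappable characters are dropped
replaced = sample.encode("ascii", "replace")  # unmappable characters become "?"
by_hand = ascii_only(sample)                  # the explicit loop, for comparison

print(ignored)   # b'caf '
print(replaced)  # b'caf? ??'
print(by_hand)   # b'caf '
```

The "replace" result matches what the Java getBytes call described above produces for US-ASCII, while "ignore" gives the same bytes as the explicit loop.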

On the Python side, the first parameter is the encoder name. I cannot find a list of the encoder names, though; I’ll keep looking.

On the same note, documentation is one of the hurdles I’ve faced when trying to use Python for non-personal projects. The Java API documentation for the String.getBytes method has this blurb:

Parameters:
charsetName - the name of a supported charset

Clicking on the Charset link gives me information about how to programmatically determine which character encoders are available at runtime. In the Python string methods documentation, by contrast, we have:

encode([encoding[,errors]])
Return an encoded version of the string. Default encoding is the current default string encoding. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a ValueError. Other possible values are 'ignore' and 'replace'. New in version 2.0.
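As a rough illustration of the 'strict' default described in that blurb, the sketch below encodes a string containing one non-ASCII character and catches the resulting error. It assumes Python 3, where the exception raised is UnicodeEncodeError, a subclass of the ValueError the documentation mentions; the sample string is invented for the example.

```python
text = "na\u00efve"  # hypothetical sample: "naive" with a diaeresis on the i

try:
    text.encode("ascii")  # errors defaults to "strict"
    raised = None
except ValueError as exc:
    raised = exc

# The exception records which slice of the input could not be encoded.
offending = raised.object[raised.start:raised.end]
print(type(raised).__name__, repr(offending))
```

Catching ValueError here keeps the code in line with the documented contract while still exposing the more specific exception details.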

The encode documentation is no help in finding the encoder names. Digging around more, I found the list of codecs included with Python. Missing from the list are a ton of character sets, most notably (from my perspective) any Asian character sets such as Shift_JIS, ISO-2022-JP and X-EUC-JP. I did find some discussions on the Python development mailing lists about adding Japanese codecs, but I wasn’t sure whether they had made it in. Firing up Python, I started poking around with the codecs library. First, I made sure I knew what I was doing:


>>> import codecs
>>> codecs.lookup("us-ascii")

Python then spit out the function information about that codec object - very nice. Also, the Python codec database gives me the same runtime access to which codecs are available - also very nice. Moving on, I then tried:


>>> codecs.lookup("shift_jis")

This resulted in:


Traceback (most recent call last):
  File "<stdin>", line 1, in ?
LookupError: unknown encoding: shift_jis

Nothing. I checked for the others, including Chinese encoders/decoders, and still nothing. Messing around more, I found (in the Debian distribution) python-japanese-codecs. Pulling that in actually gave me the codecs that were required, and I was pleasantly surprised. Unfortunately, I don’t think there are any plans to bundle these codecs with Python:

All other encodings such as the CJK ones to support Asian scripts should be implemented in separate packages which do not get included in the core Python distribution and are not a part of this proposal.

That’s a pain in the ass, but I can live with it.
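The lookups above can be wrapped into a small runtime check. This is only a sketch, assuming Python 3; the helper name codec_available and the candidate list are made up for illustration, and which names come back as available depends on the Python build and on any extra codec packages installed.

```python
import codecs


def codec_available(name):
    """Return True if codecs.lookup can resolve the given encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# Hypothetical candidates; "no-such-codec" is deliberately bogus.
candidates = ["us-ascii", "utf-8", "shift_jis", "iso-2022-jp", "euc-jp", "no-such-codec"]

for name in candidates:
    print(name, codec_available(name))
```

Relying on LookupError keeps the check identical to what the interactive session did by hand, so it reports whatever the running interpreter can actually load.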

Anyways, that’s it for now. I’m going to keep looking into Python, and I’d like to investigate how byte streams can be converted to Unicode streams and vice-versa. Note this isn’t an attack - it’s just me looking into Python’s internationalization support (which I’m learning about) and comparing it to Java (which I have previously learned about).
