2006-03-31

When you deal with katakana in your program

Katakana is one of several sets of characters used to write Japanese. Katakana is a phonetic alphabet whereas kanji are ideograms -- each kanji character has a meaning and a set of pronunciations whereas each katakana character has only a pronunciation.

It's good to know that there is another phonetic alphabet in Japanese -- hiragana. Hiragana and katakana are like lower and upper case. Each hiragana character has a corresponding katakana character. Since hiragana is off-topic, I won't go any further into it here.

You can see a table of katakana characters here. For historical reasons, there is another set of katakana characters called half width katakana. You can see a half width katakana table here.

When you need to refer specifically to katakana that is not half width, call it full width katakana. The names half width and full width come from how those sets of characters are displayed and printed. Typically, half width katakana characters occupy half the width of kanji characters whereas full width katakana characters occupy the same width as kanji.

When computers were much less capable, half width katakana was the only way to represent Japanese on computers. Consisting of 63 characters and needing no more display and print resolution than Latin letters, half width katakana was easy enough to handle even in the old days. Handling thousands of kanji characters, which require much higher display and print resolution, was not practical until the mid 80's. Representing Japanese only with katakana is somewhat like representing English only with upper case letters, which was common at the dawn of computing.

There are two pronunciation modifier symbols used in katakana -- the voiced sound mark and the half voiced sound mark. For example, using Unicode code points and character names, U+30AC (KATAKANA LETTER GA) is U+30AB (KATAKANA LETTER KA) with a voiced sound mark attached. When half width katakana was designed and implemented, its designers decided not to include precomposed characters that combine a katakana character with a pronunciation modifier.

Rather, such a character is represented, displayed, and printed as two consecutive characters: the katakana character followed by the modifier. This is because it takes more dots and/or screen resolution to represent a katakana character with a pronunciation modifier as a single precomposed character.
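To make the difference concrete, here is a small Python sketch that lists the code points behind the same syllable in both forms. The helper name `describe` is mine, not part of any library; the half width code points U+FF76 and U+FF9E come from the Unicode half width katakana block.

```python
import unicodedata

# Full width GA is one precomposed character.
full_ga = "\u30AC"
# Half width GA is two characters: KA followed by the voiced sound mark.
half_ga = "\uFF76\uFF9E"

def describe(text):
    """Return 'U+XXXX NAME' for every character in text (illustrative helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

if __name__ == "__main__":
    for label, text in (("full width", full_ga), ("half width", half_ga)):
        print(label, len(text), describe(text))
```

The full width string has length 1 while the half width string has length 2, even though both are read as the same syllable.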

Time went by and computers became powerful enough to represent around 6000 Japanese characters, which require higher display and print resolution than Latin letters. Full width katakana was introduced in addition to half width katakana so that katakana characters could be displayed and printed more properly.

As you can imagine, there is no need to use half width katakana now. However, for backward compatibility, half width katakana is still available and many people end up using it simply because it's available. And unfortunately, a typical user doesn't care about the difference between half width and full width. This creates a need for conversion between half width and full width katakana -- you have to normalize Japanese input data to either half width or full width. Otherwise, a search for a word typed in one form won't match the same word stored in the other form.
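One possible way to do this normalization, sketched here in Python, is Unicode normalization form NFKC, which maps half width katakana to full width and composes a base character with a following sound mark. This is an assumption about tooling on my part rather than the only approach, and the function name `normalize_for_search` is made up for illustration. Note that NFKC also changes other compatibility characters (for example, it turns full width Latin letters into ASCII), which may or may not be what you want.

```python
import unicodedata

def normalize_for_search(text):
    """Fold half width katakana into full width using NFKC (illustrative)."""
    return unicodedata.normalize("NFKC", text)

# The same word typed in half width (5 characters) and full width (3 characters).
half = "\uFF76\uFF9E\uFF72\uFF84\uFF9E"
full = "\u30AC\u30A4\u30C9"

if __name__ == "__main__":
    print(half == full)                        # the raw strings differ
    print(normalize_for_search(half) == full)  # they match after normalization
```

Applying the same function to both the stored data and the query keeps the two sides comparable.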

There is another occasion where half width to full width conversion is necessary -- email. The vast majority of Japanese email uses the ISO-2022-JP charset, which lacks half width katakana. You may come across text data labeled ISO-2022-JP that contains half width katakana, but ISO-2022-JP as defined by RFC 1468 doesn't include it.
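As a rough illustration in Python: the standard `iso2022_jp` codec follows this restriction and refuses half width katakana, so one option is to convert before encoding. The sketch below uses NFKC for the conversion, which is my assumption and may be more aggressive than you want for mail bodies since it also rewrites other compatibility characters. The function name `encode_for_iso2022jp` is hypothetical.

```python
import unicodedata

def encode_for_iso2022jp(text):
    """Convert half width katakana to full width, then encode (illustrative)."""
    return unicodedata.normalize("NFKC", text).encode("iso2022_jp")

def can_encode_directly(text):
    """Report whether text encodes to ISO-2022-JP without any conversion."""
    try:
        text.encode("iso2022_jp")
        return True
    except UnicodeEncodeError:
        return False

if __name__ == "__main__":
    half_ga = "\uFF76\uFF9E"  # half width KA + voiced sound mark
    print(can_encode_directly(half_ga))
    print(encode_for_iso2022jp(half_ga).decode("iso2022_jp"))
```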

As described so far, the basic difference between full width and half width katakana is that the former has precomposed characters whereas the latter doesn't. When you convert a half width katakana string into full width, you have to recognize each sequence of a katakana character followed by a pronunciation modifier and convert it into a single precomposed character.
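The pair recognition described above can be sketched as follows. This is a minimal Python illustration, not a definitive implementation: the function and constant names are my own, the per-character mapping is borrowed from `unicodedata` instead of a hand-written table, and a sound mark that cannot be merged with the preceding character is simply emitted as a standalone full width mark.

```python
import unicodedata

HALF_VOICED_MARK = "\uFF9E"       # HALFWIDTH KATAKANA VOICED SOUND MARK
HALF_SEMI_VOICED_MARK = "\uFF9F"  # HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

# Combining forms, used to try to build a precomposed character.
COMBINING = {HALF_VOICED_MARK: "\u3099", HALF_SEMI_VOICED_MARK: "\u309A"}
# Spacing forms, used when a mark has nothing to attach to.
SPACING = {HALF_VOICED_MARK: "\u309B", HALF_SEMI_VOICED_MARK: "\u309C"}

def is_half_width(ch):
    # U+FF61..U+FF9F: half width katakana letters, marks, and punctuation.
    return "\uFF61" <= ch <= "\uFF9F"

def to_full_width(text):
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in SPACING:
            # A sound mark that was not consumed by the preceding character.
            out.append(SPACING[ch])
            i += 1
            continue
        if not is_half_width(ch):
            out.append(ch)
            i += 1
            continue
        full = unicodedata.normalize("NFKC", ch)
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt in COMBINING:
            # Try to merge base + mark into one precomposed character.
            composed = unicodedata.normalize("NFC", full + COMBINING[nxt])
            if len(composed) == 1:
                out.append(composed)
                i += 2  # consumed both the base and the mark
                continue
        out.append(full)
        i += 1
    return "".join(out)

if __name__ == "__main__":
    print(to_full_width("\uFF83\uFF9E\uFF70\uFF80"))
```

The key step is the one-character lookahead: a base character and a following mark are consumed together only when a precomposed full width character exists for that pair.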
