2006-03-31

When you deal with katakana in your program

Katakana is one of several sets of characters used to write Japanese. Katakana is a phonetic alphabet whereas kanji is ideographic -- each kanji character has a meaning and a set of pronunciations whereas each katakana character has only a pronunciation.

It's good to know that there is another kind of phonetic alphabet in Japanese -- hiragana. Hiragana and katakana are like lower and upper case. Each hiragana character has a corresponding katakana character. Since hiragana is off-topic, I won't go any further into it here.

You can see a table of katakana characters here. For historical reasons, there is another set of katakana characters called half width katakana. You can see a half width katakana table here.

When you talk specifically about katakana that is not half width, you should call it full width katakana. The names half width and full width come from how those sets of characters are displayed and printed. Typically, half width katakana characters occupy half the width of kanji characters whereas full width katakana characters occupy the same width as kanji.

When computers were much less capable, half width katakana was the only way to represent Japanese on computers. Consisting of 63 characters and requiring no more resolution to display and print than Latin letters, half width katakana was easy enough to handle even in the old days. Handling thousands of kanji characters, which require much higher display and print resolution, was not practical until the mid 80's. Representing Japanese only with katakana is somewhat like representing English only with upper case letters, which was common at the dawn of computing.

There are two pronunciation modifier symbols used in katakana -- the voiced sound mark and the half voiced sound mark. For example, using Unicode code points and character names, U+30AC (KATAKANA LETTER GA) is U+30AB (KATAKANA LETTER KA) with a voiced sound mark attached. When half width katakana was designed and implemented, the designers decided not to include precomposed characters that combine a katakana character with a pronunciation modifier.

Rather, such a character is represented, displayed, and printed as two consecutive characters: the base katakana character followed by the modifier. This is because it takes more dots and/or screen resolution to represent a katakana character with a pronunciation modifier as a single precomposed character.
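To make the two representations concrete, here is a small Python sketch that uses the standard unicodedata module to list the code points involved. The helper name describe is made up for this illustration.

```python
import unicodedata

def describe(text):
    """Return the Unicode character name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]

# Full width GA is one precomposed code point.
full_width_ga = "\u30AC"

# Half width GA is two code points: half width KA followed by the
# half width voiced sound mark.
half_width_ga = "\uFF76\uFF9E"

# Canonical decomposition (NFD) shows that full width GA is KA plus a
# combining voiced sound mark.
decomposed_ga = unicodedata.normalize("NFD", full_width_ga)

for sample in (full_width_ga, half_width_ga, decomposed_ga):
    print(len(sample), describe(sample))
```

The half width string has length 2 even though a reader sees it as one syllable, which is exactly what makes naive string comparison fail.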

Time went by and computers became powerful enough to represent around 6000 Japanese characters, which require higher display and print resolution than Latin letters. Full width katakana was introduced in addition to half width katakana so that katakana characters could be displayed and printed properly.

As you can imagine, there is no need to use half width katakana now. However, for backward compatibility, half width katakana is still available and many people end up using it simply because it's available. And unfortunately, a typical user doesn't care about the difference between half width and full width. This creates a need for conversion between half width and full width katakana -- you have to normalize Japanese input data to either half width or full width. Otherwise, searches won't yield the expected results.

There is another occasion where half width to full width conversion is necessary -- email. The vast majority of Japanese email is in the ISO-2022-JP charset, which lacks half width katakana. There are cases where half width katakana appears in text labeled as ISO-2022-JP, but ISO-2022-JP as defined by RFC 1468 doesn't include half width katakana.
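One way to check whether a string can go into an ISO-2022-JP email as-is is to try encoding it. The sketch below assumes that Python's built-in iso2022_jp codec matches the RFC 1468 character set closely enough for this purpose (Python ships a separate iso2022_jp_ext codec that adds JIS X 0201 katakana). The function name fits_iso2022jp is made up for this illustration.

```python
def fits_iso2022jp(text):
    """Return True if text can be encoded with Python's iso2022_jp codec.

    Assumption: the codec's repertoire is treated here as a stand-in for
    ISO-2022-JP as defined by RFC 1468.
    """
    try:
        text.encode("iso2022_jp")
        return True
    except UnicodeEncodeError:
        return False

full_width_ka = "\u30AB"  # KATAKANA LETTER KA
half_width_ka = "\uFF76"  # HALFWIDTH KATAKANA LETTER KA

print(fits_iso2022jp(full_width_ka))
print(fits_iso2022jp(half_width_ka))
```

A mail program could run a check like this before sending and convert the text to full width when it fails.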

As described so far, the basic difference between full width and half width katakana is that the former has precomposed characters whereas the latter doesn't. When you convert a half width katakana string into full width, you have to recognize each sequence of a katakana character followed by a pronunciation modifier and convert it into a single precomposed character.
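Here is a minimal sketch of that conversion in Python. Instead of a hand-written mapping table, it relies on Unicode compatibility normalization (NFKC), which maps each half width character to its full width counterpart and then composes a base character with a following sound mark. NFKC also changes many unrelated characters, so the sketch applies it only to runs within U+FF61..U+FF9F, the half width katakana and punctuation range. The function name to_full_width_katakana is made up for this illustration.

```python
import re
import unicodedata

# Runs of half width katakana, sound marks, and half width Japanese
# punctuation (U+FF61..U+FF9F).
_HALF_WIDTH_RUN = re.compile("[\uFF61-\uFF9F]+")

def to_full_width_katakana(text):
    """Convert half width katakana in text to full width.

    A base character followed by a voiced or half voiced sound mark
    becomes one precomposed character where Unicode defines one.
    Everything outside the half width range is left untouched.
    """
    return _HALF_WIDTH_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )

# Half width KA + voiced sound mark becomes full width GA (U+30AC).
ga = to_full_width_katakana("\uFF76\uFF9E")
# Half width HA + half voiced sound mark becomes full width PA (U+30D1).
pa = to_full_width_katakana("\uFF8A\uFF9F")
print(ga == "\u30AC", pa == "\u30D1")
```

Normalizing both the stored data and the query with the same function is one way to make search treat the two forms as equal.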

2006-03-06

Japanese Spams Gmail Cannot Filter

About a month ago, I started seeing Japanese spam that Gmail fails to filter much more frequently than before. Until then, Gmail's spam filter had been doing a decent job with Japanese spam, but that's no longer the case. I'm now receiving dozens of unfiltered Japanese spam messages every day, which is quite annoying.

The distinguishing characteristic of these spam messages is that their subjects claim to be in the ISO-2022-JP charset but are actually in the SHIFT_JIS charset. They are also encoded in base 64, e.g.
Subject: =?ISO-2022-JP?B?kWaQbJBsjciSsouzk6+NRInvgqmC54LMgqiSbYLngrmBQg==?=
The problem, according to my observation, is that because of the false claim, Gmail interprets the subject as a random string and hence its spam filter doesn't work as it should. Here's how Gmail appears to interpret the subject:


The false claim is the result of a sloppy understanding of how to compose a Japanese email. It's ironic that the sloppiness works in the spammers' favor against Gmail's spam filter.
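The effect of the false claim can be reproduced with a short Python sketch. It handles only a single base 64 ("B") encoded word in the style of the subject above, and lets the caller override the declared charset. The function name decode_b_word and the sample text are made up for this illustration; this is not how Gmail works internally, which I don't know.

```python
import base64
import re

# One encoded word of the form =?charset?B?payload?=
_ENCODED_WORD = re.compile(r"=\?([^?]+)\?[Bb]\?([^?]*)\?=")

def decode_b_word(word, charset=None):
    """Decode one base 64 encoded word.

    If charset is given, it is used instead of the declared one.
    Undecodable bytes are replaced rather than raising an error.
    """
    match = _ENCODED_WORD.fullmatch(word.strip())
    if match is None:
        raise ValueError("not a base 64 encoded word")
    declared, payload = match.groups()
    raw = base64.b64decode(payload)
    return raw.decode(charset or declared, errors="replace")

# Build a mislabeled word: SHIFT_JIS bytes declared as ISO-2022-JP.
original = "\u30AB\u30BF\u30AB\u30CA"  # "katakana" written in katakana
payload = base64.b64encode(original.encode("shift_jis")).decode("ascii")
mislabeled = "=?ISO-2022-JP?B?" + payload + "?="

print(decode_b_word(mislabeled) == original)
print(decode_b_word(mislabeled, "shift_jis") == original)
```

Trusting the declared charset yields garbage, while overriding it with shift_jis recovers the text. A filter that sees only the garbage has nothing meaningful to match against.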

I'd really like Gmail to cope with it soon. Let me point out that this spamming technique is not Japanese-specific; it can be employed for other languages as well.

Added on 2006-03-20:
At the time I published this posting, I notified Google about it. I don't know how much that contributed, but Gmail's spam filter now seems to be able to cope with spam of this kind to some extent.

2006-03-03

Why Firefox's Share Is Small in Japan

This posting is inspired by a Slashdot Japan story of the same title. Some of the points made in the story are elaborated here.

According to a presentation at Mozilla Japan Seminar, Firefox's share is substantially smaller in Japan than in other regions. At this point, Firefox has a 12% share world-wide, 20% in Europe, 15% in North America, 10% across Asia, and as little as 4% in Japan. Let me consider why.

Japanese-Unfriendly
I think the prime reason is that Firefox is so Japanese-unfriendly. Firefox's web site design, download instructions, documentation, default font, and character parameters have, or used to have, room for improvement for Japanese speakers. Average computer users in Japan are not comfortable with them. Even though a fair number of Japanese speakers are involved with Firefox, the Japanese localization is not enough.

For example, because Japanese characters have many strokes on average, serif fonts are noticeably less readable than sans serif fonts on computer screens. Screen resolution is still not enough to display the serifs of a complex character at 10 point or so. Think about the dawn of the personal computer around 1980. With characters displayed at 5 dots by 7 dots, serif font design was impractical.

Firefox had long employed a serif font (Mincho) as the default Japanese font, whereas Internet Explorer's default Japanese font is a sans serif font (Gothic). From version 1.5, though, Firefox's default Japanese font is sans serif.

I suspect that in Japan, the English proficiency rate among computer users is quite low -- probably the lowest in the world. There are regions where the general English proficiency rate is lower than in Japan. But in those regions, computer users belong to the privileged class and tend to speak English.

Japanese Friendly Alternatives
There is another noteworthy factor. Before Firefox became popular, several free tab browsers using the IE component emerged in Japan. Since Firefox's biggest appeal is (arguably) the tab feature, computer users in Japan had good IE based alternatives before Firefox. Because those free tab browsers are written by Japanese programmers, they are comfortable for Japanese users in terms of documentation, default parameters, and the feature set.

Postscript on 2006-09-28
Originally, I mentioned that one reason for low English proficiency in Japan was education. I meant not the quality of English education, but the fact that education is done in Japanese all the way, whereas in many non English speaking countries, college education is done in English.

A reader pointed out that in most countries, education is done in the local language up to graduate school. I asked a Peruvian friend and an Italian friend about it. The reader proved to be right.