Thursday, October 16, 2008

Why did I get "??????" ?

Recently, I saw two i18n-related questions about the JSR 223 scripting engine for JRuby. Reading those questions, I thought the posters were confused about how to handle non-ASCII Strings in the Java way. Java’s i18n mechanism is not so complicated, but it has several ways of encoding and decoding characters to/from Unicode. As many Java programmers know, the Java VM represents characters internally as arrays of Unicode code points and converts them to a specific encoding such as UTF-8, Cp1250, or Shift_JIS when conversion is needed. Java programmers might not care about the i18n mechanism since Java has the brilliant idea of a default encoding. However, default encodings are not platform independent. I wrote this entry for those who don’t want to see “?????” in the output of their Java programs anymore.

1. Basics of Unicode and encodings

First of all, programmers should read Joel Spolsky’s great article, “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) [http://www.joelonsoftware.com/articles/Unicode.html].” This article explains the concept of Unicode, encodings, and the history of those features. You will surely get a better idea of Unicode, which is “just a theoretical concept.” As the article says, every single character on earth has its own “magic number” called a code point. For example, the code points for the Greek characters α, β, γ are \u03B1, \u03B2, \u03B3, and those for the Japanese characters あ, い, う are \u3042, \u3044, \u3046. These code points must be converted to a familiar encoding such as UTF-8 when a program needs to show human-readable characters in the output window of an IDE, a terminal, or a web browser. How these Unicode code points get converted from/to one of the character encodings is the likely culprit behind the confusion.
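To see code points concretely, consider the tiny sketch below (a made-up example, not from any of the questions). The Unicode escapes denote the code points directly, so the source file’s own encoding doesn’t matter here:

public class CodePointDemo {
    public static void main(String[] args) {
        // Unicode escapes denote code points directly in the source
        String greek = "\u03B1\u03B2\u03B3"; // αβγ
        String kana = "\u3042\u3044\u3046";  // あいう
        System.out.println(greek + " " + kana);
    }
}

Whether those characters survive the trip to your terminal depends, of course, on the encoding used for the output, which is exactly what the rest of this entry is about.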

2. Typical ways of conversion

[Chart: eight typical ways of conversion between Unicode code points and character encodings]
Java has several mechanisms for converting Unicode from/to character encodings. Look at the chart above, which shows eight typical ways of conversion. The chart doesn’t depict all of them, but it probably covers the common conversions in which programmers run into i18n trouble. Among these, I’ll pick up (1), (2), and (3) in this entry because they often cause trouble while using the JRuby engine.


(1) How do Strings in Java programs get converted?

The first one is the most basic and most frequently used conversion, performed by the Java compiler. The Java compiler converts human-readable characters in Java programs to Unicode code points. Programmers don’t need to care about this conversion as long as *.java files are saved using the platform’s default encoding and are never compiled on another platform whose default encoding differs. Suppose someone saves a Java program with non-ASCII characters on Windows XP and compiles it on Ubuntu. What would happen? Because the default encodings of the two systems are not the same, he or she would get “?????” outputs if the “-encoding” option is not used while compiling. A straightforward way of specifying the encoding is, for example:

javac -encoding Cp1250 foo.java

These days, people don’t use the javac command directly but compile in an IDE. As far as I know, NetBeans and Eclipse can do this. In NetBeans, we get the same effect by selecting a project in the Projects window, right-clicking, selecting Properties, then the Sources category, and setting the Encoding of the *.java files. In Eclipse, we select a project in the Explorer window, right-click, select Properties, then Resource, check Other in the Text file encoding section, and set the appropriate encoding for the *.java files.
Ant and Maven have options that work exactly like javac’s -encoding option. The Ant javac task lets us set the character encoding of *.java files with its encoding attribute.
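For example, a minimal Ant target might look like this sketch (the srcdir and destdir values are placeholders):

<target name="compile">
    <!-- encoding tells javac how the *.java source files are encoded -->
    <javac srcdir="src" destdir="build" encoding="UTF-8"/>
</target>

In Maven, we can set the encoding in pom.xml as follows: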

<build>
  <plugins>
    <plugin>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>RELEASE</version>
      <configuration>
        <encoding>UTF-8</encoding>
      </configuration>
    </plugin>
  </plugins>
</build>



(2)/(3) How to read/write non-ASCII characters from/to files?

When Java programmers read scripts, text, or other data from files, and write information back out to files, they should care about encodings in some cases. Again, think of the default encoding. Input files might be written on a Windows platform whose default encoding is not UTF-8, and then read on a Linux platform whose default encoding is UTF-8. How can the Java VM convert non-ASCII characters from/to Unicode code points correctly? One answer is to use java.io.InputStreamReader and java.io.OutputStreamWriter. Java classes that extend the Reader/Writer classes are Unicode aware and convert characters automatically; however, classes other than InputStreamReader/OutputStreamWriter can’t change the encoding from the default one by themselves. When programmers want to set non-default encodings, the following would work:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;

public class I18nFileReaderWriter {
    public static void main(String[] args)
            throws FileNotFoundException, UnsupportedEncodingException, IOException {
        System.out.println(System.getProperty("file.encoding"));

        String inputname = "input.txt";
        String outputname = "output.txt";
        // decodes Shift_JIS bytes into Unicode while reading
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(new FileInputStream(inputname), "Shift_JIS"));
        // encodes Unicode into EUC-JP bytes while writing
        BufferedWriter writer =
            new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputname), "EUC-JP"));
        String str;
        while ((str = reader.readLine()) != null) {
            System.out.println(str);
            writer.write(str, 0, str.length());
            writer.newLine();
        }
        reader.close();
        writer.close();
    }
}


Although my PC’s default encoding is UTF-8, I saved the input file in the Shift_JIS encoding. After reading it with the conversion above, the program wrote the Strings out to a file in the EUC-JP encoding. See? I checked each file’s encoding in Firefox, using the View menu as in Figures 1 and 2.
[Figure 1: input file encoding as shown in Firefox’s View menu]
[Figure 2: output file encoding as shown in Firefox’s View menu]
Another good way to read a non-ASCII file is to use java.util.Scanner, which was added in JDK 5. If a Reader object is not required afterward, the Scanner class can make the code simpler than InputStreamReader, as shown below:
// Scanner's constructor accepts a charset name directly
Scanner scanner = new Scanner(new File(inputname), "Shift_JIS");
while (scanner.hasNextLine()) {
    System.out.println(scanner.nextLine());
}
scanner.close();
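On the writing side, java.io.PrintWriter (also added in JDK 5) offers a similar shortcut: it has a constructor that takes a file and a charset name. A minimal sketch, reusing the hypothetical outputname from above:

// PrintWriter encodes Unicode Strings as EUC-JP bytes while writing
PrintWriter writer = new PrintWriter(new File(outputname), "EUC-JP");
writer.println("\u3042\u3044\u3046"); // あいう
writer.close();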



3. JRuby vs. JRuby script engine

Strings in JRuby are not arrays of Unicode code points, even though JRuby is a Java application. Because JRuby is an implementation of Ruby, JRuby keeps Strings as byte arrays to meet the Ruby specification. I learned about this mechanism about a year ago on the jruby-dev mailing list, when I posted a question to figure out why non-ASCII characters passed from the JRuby engine were not handled correctly. The answer is here: http://www.nabble.com/I18n-problem-in-StrNode-and-ByteList-to13431845.html#a13431845. Since it works at the byte level, JRuby basically uses InputStream/OutputStream for its I/O, while ordinary Java programs use Reader/Writer.
The JSR 223 JRuby script engine follows the ordinary Java way and always uses Reader/Writer classes. Thus, the JRuby engine is responsible for bridging between Reader/Writer and InputStream/OutputStream. When a Reader object is passed to the ScriptEngine.eval() method, the JRuby engine reads the whole script and passes it to JRuby as a String. When a filename is set in the ScriptEngine’s context using the ScriptEngine.FILENAME key, the JRuby engine doesn’t use the Reader object at all; it creates an InputStream based on the filename and passes that to JRuby. This implementation looks funny and has overhead, but it is necessary to fulfill the javax.script API and fit it to JRuby. The JRuby engine also wraps OutputStream since Writer objects are required. That’s why I added the WriterOutputStream class to the JRuby engine.
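Putting sections 2 and 3 together: when we hand the engine a Reader, the encoding conversion happens in our own InputStreamReader before JRuby ever sees the script. Here is a minimal sketch, assuming the engine is registered under the name "jruby" and that greet.rb is a hypothetical script file saved in Shift_JIS:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class JRubyEngineDemo {
    public static void main(String[] args) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("jruby");
        // decode the Shift_JIS script into Unicode ourselves; the engine
        // then reads the whole Reader and passes a String to JRuby
        Reader reader = new InputStreamReader(new FileInputStream("greet.rb"), "Shift_JIS");
        engine.eval(reader);
        reader.close();
    }
}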


What I wrote about here is only a part of the i18n mechanisms of Java and the JRuby engine. I’ll add more later, since this entry has become pretty long.
