Wednesday, October 29, 2008

The sound of PyPy

PyPy is, of course, an implementation of Python, but its name is a bit embarrassing for Japanese speakers to say in public. It sounds exactly like an informal word for breasts, especially in baby talk. If someone pronounced PyPy with a Japanese accent, listeners might picture breasts even while knowing the topic was a programming language. Japanese people certainly understand it is just a foreign name, but a different name would have been easier to use around the office.

Tuesday, October 28, 2008

Why did my browser display "??????" ?


In the previous post, I wrote about some of Java's i18n mechanisms, especially those involved when Java programs are compiled and when they read and write files. These are (1), (2), and (3) in Figure 1, and everything there can be done in a pure Java way. However, when we think about (6) and (7), we need to understand how web containers and web browsers communicate, in addition to Java's i18n. As you may know, that communication follows HTTP, which is defined by RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt). HTTP has a header to carry a user agent's preferred languages, such as English or Spanish, but browsers generally do not tell the server which character encoding they used for submitted form data. To read and show characters correctly, we need to apply correct character set names in the right places. In this post, I'm going to write about (6) and (7) in Figure 1, which are Servlet and JSP programming.


Typical ways of conversion – (6)/(7) Servlet and JSP

I’m going to start with a very simple combination of Tomcat and a Servlet to make the problem and its solutions clear. The first file is a simple HTML file shown below:
[test0.html]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="ja">
<head>
  <title>Simple Test 0</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <script type="text/javascript">
    function escapeText() {
      str = encodeURI(document.forms[0].message.value);
      document.forms[0].action = "/ginkgo/SimpleFormServlet0?message=" + str;
      document.forms[0].submit();
    }
  </script>
</head>
<body>
  <h3>てすと0</h3>
  <div>
    <form onsubmit="escapeText()" method="get">
      [HTTP GET] Message: <input name="message" type="TEXT" size="20"/>
    </form>
  </div>
  <div>
    <form action="/ginkgo/SimpleFormServlet0" method="post">
      [HTTP POST] Message: <input name="message" type="TEXT" size="20"/>
    </form>
  </div>
</body>
</html>

This HTML file has two input fields: one for the HTTP GET method, the other for the HTTP POST method. Even though Java-based web containers can handle it, we should encode non-ASCII text typed into the GET field, because that text is sent to the web server as part of the URI, and the URI specification allows only ASCII characters there. That is why I added the escapeText() function for HTTP GET submission. JavaScript's encodeURI function percent-encodes each character's UTF-8 bytes, exactly as the RFC defines. For example, the Japanese characters "あいう" become "%E3%81%82%E3%81%84%E3%81%86." The HTTP POST method has no such requirement, so we can simply send parameters without escaping them.
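The same percent-encoding that escapeText() performs in the browser can be reproduced on the Java side with java.net.URLEncoder, which is handy for checking what the servlet should expect. This is just a quick sketch for verification, not part of the web application (note that URLEncoder also turns spaces into "+", which encodeURI does not, but the two agree for the characters used here):

```java
import java.net.URLEncoder;

public class EncodeCheck {
    public static void main(String[] args) throws Exception {
        // URLEncoder percent-encodes the UTF-8 bytes of each character,
        // giving the same result encodeURI gives for these characters.
        System.out.println(URLEncoder.encode("あいう", "UTF-8"));
        // prints %E3%81%82%E3%81%84%E3%81%86
    }
}
```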

Submitted parameters from the HTML file are received by the Servlet, SimpleFormServlet0.java, shown below. It does no conversion, so that we can see the incorrect output first. After receiving the form parameter, the servlet forwards the input text to a JSP, result.jsp:
[SimpleFormServlet0.java]
package yellow;

import java.io.IOException;
import java.net.URLDecoder;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SimpleFormServlet0 extends HttpServlet {
    public static final String paramName = "message";
    public static final String charset = "UTF-8";

    protected void processRequest(HttpServletRequest request, HttpServletResponse response, String text)
            throws ServletException, IOException {
        request.setAttribute("servletName", getServletName());
        request.setAttribute("inputText", text);
        getServletContext().getRequestDispatcher("/result.jsp").forward(request, response);
    }

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String text = URLDecoder.decode(request.getParameter(paramName), charset);
        processRequest(request, response, text);
    }

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        processRequest(request, response, request.getParameter(paramName));
    }

    @Override
    public String getServletInfo() {
        return "This servlet outputs incorrect characters.";
    }
}

[result.jsp]
<%@page contentType="text/html" pageEncoding="UTF-8"%>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<%
    String context = request.getContextPath();
    String servletName = (String)request.getAttribute("servletName");
%>
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>What did I get?</title>
</head>
<body>
  <div>
    <%= context + "/" + servletName %> の実行結果<hr/>
    Method: <%= request.getMethod() %><br/>
    Message: <%= request.getAttribute("inputText") %><br/>
  </div>
</body>
</html>


When I typed Japanese characters into the input text field and submitted them, I got incorrect output, as I expected.

What happened inside the web container? This result shows that the Java application failed to convert the characters from a native encoding to Unicode. In communication between web browsers and Java applications (web containers), the application cannot find out the correct character encoding of the given parameters, because HTTP does not carry that information. Some web containers, such as IBM's WebSphere, guess an appropriate encoding for the communication and convert characters automatically; however, Tomcat, Glassfish, and probably most other web containers do not.
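The mechanism behind the garbage can be reproduced in a few lines. The container decodes the browser's UTF-8 bytes as ISO-8859-1, producing the wrong String; but because ISO-8859-1 maps every byte to a character one-to-one, the original bytes survive inside that wrong String and can be re-decoded. A minimal sketch of the round trip (a standalone demo, not container code):

```java
import java.io.UnsupportedEncodingException;

public class MojibakeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "あいう";
        // The browser sends the UTF-8 bytes of the text...
        byte[] sent = original.getBytes("UTF-8");
        // ...but the container decodes them as ISO-8859-1 by default,
        // yielding a garbled String.
        String garbled = new String(sent, "ISO-8859-1");
        // ISO-8859-1 decoding is lossless at the byte level, so the
        // bytes can be extracted and re-decoded with the right charset.
        String recovered = new String(garbled.getBytes("ISO-8859-1"), "UTF-8");
        System.out.println(recovered.equals(original)); // prints true
    }
}
```

This byte-level round trip is exactly what the first option below relies on.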

To convert submitted parameters into correct Unicode strings, we have three options. The first is to convert them in the Servlet: since the given characters were decoded with the wrong encoding, we turn them back into a byte array and then decode that again with the correct character encoding. I added one method to SimpleFormServlet0 to perform this reconversion:
[SimpleFormServlet1.java]
package yellow;

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SimpleFormServlet1 extends HttpServlet {
    public static final String paramName = "message";
    public static final String charset = "UTF-8";

    protected void processRequest(HttpServletRequest request, HttpServletResponse response, String text)
            throws ServletException, IOException {
        String inputText = getUnicodeString(text, charset);
        request.setAttribute("inputText", inputText);
        request.setAttribute("servletName", getServletName());
        getServletContext().getRequestDispatcher("/result.jsp").forward(request, response);
    }

    protected String getUnicodeString(String s, String charsetName)
            throws UnsupportedEncodingException {
        return new String(s.getBytes("8859_1"), charsetName);
    }

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String text = URLDecoder.decode(request.getParameter(paramName), charset);
        processRequest(request, response, text);
    }

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        processRequest(request, response, request.getParameter(paramName));
    }

    @Override
    public String getServletInfo() {
        return "This servlet outputs correct characters.";
    }
}

When I changed the servlet name in document.forms[0].action of test0.html from /ginkgo/SimpleFormServlet0 to /ginkgo/SimpleFormServlet1 and tried again, I got the correct result.

The second option for converting submitted characters into correct ones is the ServletRequest.setCharacterEncoding() method, shown below:
[SimpleFormServlet2.java]
package yellow;

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SimpleFormServlet2 extends HttpServlet {
    public static final String paramName = "message";
    public static final String charset = "UTF-8";

    protected void processRequest(HttpServletRequest request, HttpServletResponse response, String text)
            throws ServletException, IOException {
        request.setAttribute("servletName", getServletName());
        request.setAttribute("inputText", text);
        getServletContext().getRequestDispatcher("/result.jsp").forward(request, response);
    }

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        request.setCharacterEncoding("UTF-8");
        String text = URLDecoder.decode(request.getParameter(paramName), charset);
        processRequest(request, response, text);
    }

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        request.setCharacterEncoding("UTF-8");
        processRequest(request, response, request.getParameter(paramName));
    }

    @Override
    public String getServletInfo() {
        return "Correct outputs by using HttpServletRequest.setCharacterEncoding().";
    }
}
This answer seems simple, but we need to pay attention to what the API documentation says. According to it, this method "must be called prior to reading request parameters or reading input using getReader(). Otherwise, it has no effect." In other words, we may not be able to use this method when our web application runs on a framework that internally accesses the request object before our code does. Suppose I wrote an application on Struts and let Struts populate an ActionForm object with the form parameters: by the time my Action object got executed, the request object would already have been read, so setCharacterEncoding() would have no effect. In such a case, the first option is the more feasible choice.

The third option is to call the ServletRequest.setCharacterEncoding() method in a Servlet filter, as below:
[CharacterEncodingFilter.java]
package yellow;

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class CharacterEncodingFilter implements Filter {

    private FilterConfig filterConfig = null;

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        request.setCharacterEncoding("UTF-8");
        try {
            chain.doFilter(request, response);
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }

    public void destroy() {
    }

    public void init(FilterConfig filterConfig) {
        this.filterConfig = filterConfig;
    }
}
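For the filter to take effect, it also has to be declared and mapped in the web application's web.xml. A minimal registration might look like the following; the filter name and URL pattern here are my assumptions, so adjust them for your application:

```xml
<filter>
    <filter-name>CharacterEncodingFilter</filter-name>
    <filter-class>yellow.CharacterEncodingFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>CharacterEncodingFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>
```

Mapping the filter to /* applies the encoding to every request; the order of filter-mapping elements in web.xml determines the filter chain order.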
This solution works well unless the web application or framework requires some other filter to be first in the chain. In fact, filtering worked well with Struts. However, as far as I tried, setting the character encoding in the filter never worked for the HTTP GET method, whereas I got correct output with HTTP POST.


So far I have written about how to avoid "?????" by programming, but we have more choices. Both Tomcat and Glassfish offer a way to set encodings by configuration. In the case of Tomcat, we can set the encoding with the URIEncoding option of the Connector element in server.xml:
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />

As its name suggests, this option affects only parameters sent by the HTTP GET method, so it is a good complement to Servlet filtering. Glassfish gives us a more flexible and convenient option: we can set the encoding in sun-web.xml of each web application, as below:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sun-web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Application Server 9.0 Servlet 2.5//EN"
    "http://www.sun.com/software/appserver/dtds/sun-web-app_2_5-0.dtd">
<sun-web-app error-url="">
  <context-root>/poplar</context-root>
  <class-loader delegate="true"/>
  <jsp-config>
    <property name="keepgenerated" value="true">
      <description>Keep a copy of the generated servlet class' java code.</description>
    </property>
  </jsp-config>
  <parameter-encoding default-charset="UTF-8"/>
</sun-web-app>
This configuration worked perfectly for both the HTTP GET and POST methods of SimpleFormServlet0. Besides, we can set a different character encoding for each web application on a single Glassfish server, and the file is editable in an IDE since it lives under the WEB-INF directory. This option would be the best when we can choose Glassfish as our web container.

Let me add a brief explanation about JSP. As in result.jsp above, JSP files usually have both
<%@page contentType="text/html" pageEncoding="UTF-8"%>
and
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
The pageEncoding attribute in the former tells the container how to read the JSP source when the page is compiled, while the latter tells the web browser how to render the response. That is why JSP pages have both.

Determining an appropriate character encoding in communication between web containers and web browsers is complicated. Web containers may have features for it, and web frameworks may give us options. But whatever we choose, the conversion of non-ASCII characters typed into HTML form fields must happen exactly once. If the conversion is done more than once, "?????" will show up in the web browser. Don't forget that!

Thursday, October 16, 2008

Why did I get "??????" ?

Recently, I saw two i18n-related questions about the JSR 223 scripting engine for JRuby. Reading them, I thought the posters were confused about how to handle non-ASCII Strings the Java way. Java's i18n mechanism is not so complicated, but it has several ways of encoding and decoding characters to and from Unicode. As many Java programmers know, the Java VM represents characters as arrays of Unicode code points and converts them to a specific encoding such as UTF-8, Cp1250, or Shift-JIS when conversion is needed. Java programmers often don't have to care about the i18n mechanism, since Java has the brilliant idea of a default encoding. However, default encodings are not platform independent. I wrote this entry for those who don't want "?????" output from Java programs anymore.

1. Basics of Unicode and encodings

First of all, programmers should read Joel Spolsky's great article, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" [http://www.joelonsoftware.com/articles/Unicode.html]. This article explains the concept of Unicode, encodings, and the history of those features. You will surely get a better idea of Unicode, which is "just a theoretical concept." As the article says, every single character on earth has its own "magic number" called a code point. For example, the code points for the Greek characters α, β, γ are \u03B1, \u03B2, \u03B3, and those for the Japanese characters あ, い, う are \u3042, \u3044, \u3046. These code points must be converted to a familiar encoding such as UTF-8 when a program needs to show human-readable characters in an IDE's output window, a terminal, or a web browser. How these Unicode code points are converted to and from a particular character encoding is the usual culprit behind the confusion.
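The code points quoted above can be checked directly in Java, where \uXXXX escapes in source code denote them. A tiny verification sketch:

```java
public class CodePointCheck {
    public static void main(String[] args) {
        // The escapes and the literal characters are the very same String.
        String kana = "\u3042\u3044\u3046";   // あいう
        System.out.println(kana.equals("あいう"));            // prints true
        String greek = "\u03B1\u03B2\u03B3";  // αβγ
        System.out.println(Integer.toHexString(greek.codePointAt(0))); // prints 3b1
    }
}
```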

2. Typical ways of conversion



Java has several mechanisms to convert between Unicode and character encodings. Look at the chart, which shows eight typical ways of conversion. The chart doesn't depict all of them, but it probably covers the common conversions in which programmers run into i18n trouble. Among these, I'll pick up (1), (2), and (3) in this entry, because they often cause trouble when using the JRuby engine.


(1) How do Strings in Java programs get converted?

The first one is the most basic and most frequently used conversion, performed by the Java compiler. The Java compiler converts human-readable characters in Java source files to Unicode code points. Programmers don't need to care about this conversion as long as *.java files are saved using the platform's default encoding and are never compiled on another platform whose default encoding differs. Suppose someone saves a Java program containing non-ASCII characters on Windows XP and compiles it on Ubuntu. What would happen? Because the default encodings of the two systems differ, he or she would get "?????" output if the -encoding option were not used while compiling. A straightforward way of specifying the encoding is, for example:

javac -encoding Cp1250 foo.java

These days, people don't use the javac command directly but compile in an IDE. As far as I know, NetBeans and Eclipse can both do this. On NetBeans, select a project in the Projects window, right-click, choose Properties, then the Sources category, and set the encoding of the *.java files. On Eclipse, select a project in the Explorer window, right-click, choose Properties, then Resource, check Other in the Text file encoding section, and set the appropriate encoding for the *.java files.
Ant and Maven have options that work exactly like javac's -encoding option. Ant's javac task lets us set the character encoding of *.java files with an encoding attribute. In Maven, we can set the encoding in pom.xml as follows:

<build>
  <plugins>
    <plugin>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>RELEASE</version>
      <configuration>
        <encoding>UTF-8</encoding>
      </configuration>
    </plugin>
  </plugins>
</build>
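For comparison, the Ant version of the same setting uses the encoding attribute on the javac task. The directory names here are illustrative, not from a real build file:

```xml
<javac srcdir="src" destdir="build" encoding="UTF-8"/>
```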



(2)/(3) How to read/write non-ASCII characters from/to files?

When Java programmers read scripts, text, or other data from files, and write information out to files, they should care about encodings in some cases. Again, think of the default encoding. An input file might be written on a Windows platform whose default encoding is not UTF-8 and read on a Linux platform whose default encoding is UTF-8. How can the Java VM convert non-ASCII characters to and from Unicode code points correctly? One answer is to use java.io.InputStreamReader and java.io.OutputStreamWriter. Java classes that extend Reader/Writer are Unicode aware and convert characters automatically; however, other than InputStreamReader and OutputStreamWriter, they cannot change the encoding from the default to another by themselves. When programmers want a non-default encoding, the following would work:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;

public class I18nFileReaderWriter {
    public static void main(String[] args)
            throws FileNotFoundException, UnsupportedEncodingException, IOException {
        System.out.println(System.getProperty("file.encoding"));

        String inputname = "input.txt";
        String outputname = "output.txt";
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(new FileInputStream(inputname), "Shift-JIS"));
        BufferedWriter writer =
            new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputname), "EUC-JP"));
        String str;
        while ((str = reader.readLine()) != null) {
            System.out.println(str);
            writer.write(str, 0, str.length());
            writer.newLine();
        }
        reader.close();
        writer.close();
    }
}


Although my PC's default encoding is UTF-8, I saved an input file in the Shift-JIS encoding. After reading it with the conversion, the program wrote the Strings out to a file in the EUC-JP encoding. See? I checked each file's encoding in Firefox, using the View menu, as in Figures 1 and 2.
Figure 1
Figure 2
Another good way to read a non-ASCII file is java.util.Scanner, which was added in JDK 5. When a Reader type object is not required afterward, the Scanner class can make the code simpler than InputStreamReader, as shown below:
Scanner scanner = new Scanner(new File(inputname), "Shift_JIS");
while (scanner.hasNextLine()) {
    System.out.println(scanner.nextLine());
}
scanner.close();



3. JRuby vs. JRuby script engine

Strings in JRuby are not arrays of Unicode code points, even though JRuby is a Java application. Because JRuby is an implementation of Ruby, it keeps Strings as byte arrays to meet the Ruby specification. About a year ago, I learned about this mechanism on the jruby-dev mailing list when I posted a question to figure out why non-ASCII characters passed through the JRuby engine were not handled correctly. The answer is here: http://www.nabble.com/I18n-problem-in-StrNode-and-ByteList-to13431845.html#a13431845. At the byte level, JRuby basically uses InputStream/OutputStream for its I/O, while ordinary Java programs use Reader/Writer.
The JSR 223 JRuby script engine follows the ordinary Java way and always uses Reader/Writer classes. Thus, the JRuby engine is responsible for bridging Reader/Writer and InputStream/OutputStream. When a Reader type object is passed to the ScriptEngine.eval() method, the JRuby engine reads the whole script and passes it to JRuby as a String. When a filename is set in the ScriptEngine's context via the ScriptEngine.FILENAME key, the JRuby engine doesn't use the Reader object at all; it creates an InputStream based on the filename and passes that to JRuby instead. This implementation looks odd and has overhead, but it is necessary to fulfill the javax.script API and fit it to JRuby. The JRuby engine also wraps OutputStream, since Writer type objects are required. That's why I added the WriterOutputStream class to the JRuby engine.
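The real WriterOutputStream in the JRuby engine is more elaborate, but the core idea of adapting a Writer to the OutputStream interface can be sketched as follows. This class is my own illustration, not the engine's actual code, and it has a known limitation: it decodes buffered bytes only on flush, so flushing in the middle of a multi-byte sequence would corrupt a character.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringWriter;
import java.io.Writer;

// Buffers incoming bytes and, on flush, decodes them with a fixed
// charset and hands the resulting characters to the wrapped Writer.
class WriterOutputStream extends OutputStream {
    private final Writer writer;
    private final String charsetName;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    WriterOutputStream(Writer writer, String charsetName) {
        this.writer = writer;
        this.charsetName = charsetName;
    }

    @Override
    public void write(int b) {
        buffer.write(b);  // collect raw bytes until flushed
    }

    @Override
    public void flush() throws IOException {
        writer.write(buffer.toString(charsetName));  // decode and forward
        buffer.reset();
        writer.flush();
    }

    @Override
    public void close() throws IOException {
        flush();
        writer.close();
    }

    public static void main(String[] args) throws IOException {
        StringWriter sw = new StringWriter();
        OutputStream out = new WriterOutputStream(sw, "UTF-8");
        out.write("あいう".getBytes("UTF-8"));  // byte-oriented producer
        out.close();
        System.out.println(sw.toString());      // prints あいう
    }
}
```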


What I wrote about here is only part of the i18n mechanisms of Java and the JRuby engine. I'll add more later, since this entry has become pretty long.