Tuesday, October 28, 2008

Why did my browser display "??????"?


In the previous post, I wrote about some of Java’s i18n mechanisms, especially those that apply when Java programs are compiled and when they read and write files. These are (1), (2), and (3) in Figure 1, and everything there can be done in a pure Java way. However, when we think about (6) and (7), we need to know how web containers and web browsers communicate, in addition to Java’s i18n. As you may know, this communication must follow HTTP, which is defined by RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt). HTTP has a header field, Accept-Language, that tells the server the user agent's preferred languages such as English or Spanish, but in practice browsers do not tell the server which character encoding they used for submitted form parameters. To read and show characters correctly, we need to specify the correct character set names in the right places. In this post, I'm going to write about (6) and (7) in Figure 1, which concern Servlet and JSP programming.


Typical ways of conversion – (6)/(7) Servlet and JSP

I’m going to start with a very simple combination of Tomcat and a servlet to make the problem and its solutions clear. The first file is a simple HTML file, shown below:
[test0.html]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="ja">
<head>
<title>Simple Test 0</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script type="text/javascript">
function escapeText() {
str = encodeURI(document.forms[0].message.value)
document.forms[0].action="/ginkgo/SimpleFormServlet0?message=" + str
document.forms[0].submit()
}
</script>
</head>
<body>
<h3>てすと0</h3>
<div>
<form onsubmit="escapeText()" method="get">
[HTTP GET] Message: <input name="message" type="TEXT" size="20"/>
</form>
</div>
<div>
<form action="/ginkgo/SimpleFormServlet0" method="post">
[HTTP POST] Message: <input name="message" type="TEXT" size="20"/>
</form>
</div>
</body>
</html>

This HTML file has two input fields: one for the HTTP GET method and another for the HTTP POST method. Even though Java-based web containers can handle raw non-ASCII text, we should encode non-ASCII text typed in the input field for the HTTP GET method because it is sent to the web server as part of the URI. Since the URI syntax allows only ASCII characters, I added the escapeText() function for HTTP GET submission. JavaScript's encodeURI function converts each character into its UTF-8 bytes and writes each byte as % followed by two hexadecimal digits, exactly as the RFC defines. For example, the Japanese characters "あいう" result in "%E3%81%82%E3%81%84%E3%81%86". On the other hand, the HTTP POST method has no such requirement, so we can simply send parameters without escaping them.
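The same percent-encoding can be reproduced on the Java side. The following is a minimal sketch, not part of the original test application: the class name PercentEncodingDemo and its encode() helper are made up for illustration. Note that java.net.URLEncoder is not identical to JavaScript's encodeURI (for example, it turns a space into "+" and also escapes reserved characters), but for non-ASCII text such as "あいう" both produce the same UTF-8 based escapes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only.
final class PercentEncodingDemo {

    // Converts the text to UTF-8 bytes and writes each non-ASCII byte as %XX.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hiragana a, i, u written as Unicode escapes so that the
        // source file itself stays plain ASCII.
        String text = "\u3042\u3044\u3046";
        System.out.println(encode(text)); // %E3%81%82%E3%81%84%E3%81%86
    }
}
```

Each Japanese character here takes three bytes in UTF-8, so three characters become nine %XX escapes.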

Submitted parameters from the HTML file are received by the servlet, SimpleFormServlet0.java, shown below, which does no conversion so that we can see incorrect output first. After receiving the form parameter, the servlet forwards the input text to a JSP, result.jsp:
[SimpleFormServlet0.java]
package yellow;

import java.io.IOException;
import java.net.URLDecoder;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SimpleFormServlet0 extends HttpServlet {
public static final String paramName = "message";
public static final String charset = "UTF-8";

protected void processRequest(HttpServletRequest request, HttpServletResponse response, String text)
throws ServletException, IOException {
request.setAttribute("servletName", getServletName());
request.setAttribute("inputText", text);
getServletContext().getRequestDispatcher("/result.jsp").forward(request, response);
}

@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
String text = URLDecoder.decode(request.getParameter(paramName), charset);
processRequest(request, response, text);
}

@Override
protected void doPost(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
processRequest(request, response, request.getParameter(paramName));
}

@Override
public String getServletInfo() {
return "This servlet outputs incorrect characters.";
}

}

[result.jsp]
<%@page contentType="text/html" pageEncoding="UTF-8"%>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<%
String context = request.getContextPath();
String servletName = (String)request.getAttribute("servletName");
%>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>What did I get?</title>
</head>
<body>
<div>
<%= context + "/" + servletName %> の実行結果<hr/>
Method: <%= request.getMethod() %><br/>
Message: <%= request.getAttribute("inputText") %><br/>
</div>
</body>
</html>


When I typed Japanese characters in the input text field and submitted them, I got incorrect output, as I expected.

What happened inside the web container? This result shows that the Java application failed to convert characters from their native encoding to Unicode. In communication between web browsers and Java applications (web containers), the Java application can't determine the correct character encoding of the given parameters because the HTTP request doesn't carry it. Some web containers, such as IBM's WebSphere, guess which encoding is appropriate for the communication and convert characters automatically; however, Tomcat, Glassfish, and probably most web containers do not.

To convert submitted parameters into correct Unicode strings, we have three options. The first one is to convert them in the servlet. Since the given characters were converted incorrectly, we revert them to a byte array and then convert them again by using the correct character encoding. I added one method to SimpleFormServlet0 to perform these reconversion steps:
[SimpleFormServlet1.java]
package yellow;

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SimpleFormServlet1 extends HttpServlet {
public static final String paramName = "message";
public static final String charset = "UTF-8";

protected void processRequest(HttpServletRequest request, HttpServletResponse response, String text)
throws ServletException, IOException {
String inputText = getUnicodeString(text, charset);
request.setAttribute("inputText", inputText);
request.setAttribute("servletName", getServletName());
getServletContext().getRequestDispatcher("/result.jsp").forward(request, response);
}

protected String getUnicodeString(String s, String charsetName)
throws UnsupportedEncodingException {
return new String(s.getBytes("8859_1"), charsetName);

}

@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
String text = URLDecoder.decode(request.getParameter(paramName), charset);
processRequest(request, response, text);
}

@Override
protected void doPost(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
processRequest(request, response, request.getParameter(paramName));
}

@Override
public String getServletInfo() {
return "This servlet outputs correct characters.";
}

}

When I changed the servlet name in document.forms[0].action of test0.html from /ginkgo/SimpleFormServlet0 to /ginkgo/SimpleFormServlet1 and tried again, I got the correct result.
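The reconversion done by getUnicodeString() can also be tried outside a servlet. The following is a minimal sketch, assuming the container decoded the submitted UTF-8 bytes as ISO-8859-1 (the encoding the "8859_1" argument above implies). The class and method names (ReconversionDemo, misdecode, repair) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class ReconversionDemo {

    // Models a container that wrongly assumes ISO-8859-1:
    // every byte becomes one char in the range U+0000..U+00FF.
    static String misdecode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Same idea as getUnicodeString(): get the original bytes back,
    // then decode them with the encoding the browser really used.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u3042\u3044\u3046"; // hiragana a, i, u
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);

        String garbled = misdecode(wire);
        System.out.println(garbled.length());             // 9: one char per byte
        System.out.println(repair(garbled).equals(text)); // true
    }
}
```

The trick works because ISO-8859-1 maps every byte value to exactly one character, so the wrong decoding loses no information and can be undone.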

The second option for converting submitted characters into correct ones is to use the ServletRequest.setCharacterEncoding() method, as shown below:
[SimpleFormServlet2.java]
package yellow;

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SimpleFormServlet2 extends HttpServlet {
public static final String paramName = "message";
public static final String charset = "UTF-8";

protected void processRequest(HttpServletRequest request, HttpServletResponse response, String text)
throws ServletException, IOException {
request.setAttribute("servletName", getServletName());
request.setAttribute("inputText", text);
getServletContext().getRequestDispatcher("/result.jsp").forward(request, response);
}

@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
request.setCharacterEncoding("UTF-8");
String text = URLDecoder.decode(request.getParameter(paramName), charset);
processRequest(request, response, text);
}

@Override
protected void doPost(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
request.setCharacterEncoding("UTF-8");
processRequest(request, response, request.getParameter(paramName));
}

@Override
public String getServletInfo() {
return "Correct outputs by using HttpServletRequest.setCharacterEncoding().";
}
}
This answer seems simple, but we need to pay attention to what the API document says. According to the API document, this method "must be called prior to reading request parameters or reading input using getReader(). Otherwise, it has no effect." In other words, we might not be able to use this method when our web application runs on a web framework that internally accesses the request object before our own code does. Suppose I wrote an application on Struts and let Struts populate an ActionForm object with the form parameters. The setCharacterEncoding() method would then have no effect, because the request object had already been read by the time my Action object was executed. In this case, the first option would be the more feasible choice.
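Why the call order matters can be pictured with a toy model. The sketch below is not the servlet API; LazyRequestModel is a hypothetical class that only imitates one behavior: parameters are decoded once, on first access, and cached. It also assumes the container falls back to ISO-8859-1 when no encoding has been set.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical toy model of a request object; this is NOT the servlet API.
final class LazyRequestModel {
    private final byte[] rawBody;
    // Assumption: the fallback encoding is ISO-8859-1 when nothing is set.
    private Charset encoding = StandardCharsets.ISO_8859_1;
    private String cachedParameter; // stays null until the first read

    LazyRequestModel(byte[] rawBody) {
        this.rawBody = rawBody.clone();
    }

    void setCharacterEncoding(Charset charset) {
        // Only affects decoding that has not happened yet.
        this.encoding = charset;
    }

    String getParameter() {
        if (cachedParameter == null) {
            // Decoded once, with whatever encoding is set at this moment.
            cachedParameter = new String(rawBody, encoding);
        }
        return cachedParameter;
    }

    public static void main(String[] args) {
        byte[] body = "\u3042\u3044\u3046".getBytes(StandardCharsets.UTF_8);

        LazyRequestModel early = new LazyRequestModel(body);
        early.setCharacterEncoding(StandardCharsets.UTF_8);
        System.out.println(early.getParameter().length()); // 3: decoded correctly

        LazyRequestModel late = new LazyRequestModel(body);
        late.getParameter(); // e.g. a framework reads the parameter first
        late.setCharacterEncoding(StandardCharsets.UTF_8); // too late
        System.out.println(late.getParameter().length()); // 9: still garbled
    }
}
```

In the second case the encoding is set correctly, but the parameter was already decoded and cached, which mirrors the situation where a framework touches the request before our code runs.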

The third option is to use the ServletRequest.setCharacterEncoding() method in a Servlet filter, as shown below:
[CharacterEncodingFilter.java]
package yellow;

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class CharacterEncodingFilter implements Filter {

private FilterConfig filterConfig = null;

public void doFilter(ServletRequest request, ServletResponse response,
FilterChain chain)
throws IOException, ServletException {
request.setCharacterEncoding("UTF-8");
try {
chain.doFilter(request, response);
} catch (Throwable t) {
t.printStackTrace();
}
}

public void destroy() {
}

public void init(FilterConfig filterConfig) {
this.filterConfig = filterConfig;
}
}
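This solution works well unless the web application or framework requires some other filter to run first. In fact, filtering worked well with Struts. However, as far as I tried, setting the character encoding in the filter never worked for the HTTP GET method, whereas I got correct output with the HTTP POST method. This matches the API document, which describes setCharacterEncoding() as overriding the encoding used in the body of the request; GET parameters travel in the query string of the URI, not in the body.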
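What goes wrong with GET parameters can be sketched with java.net.URLDecoder, which performs the same kind of percent-decoding a container applies to the query string. This is a minimal sketch under the assumption that the container decodes the query string as ISO-8859-1 unless told otherwise; QueryDecodingDemo and its decode() helper are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, for illustration only.
final class QueryDecodingDemo {

    // Percent-decodes a query string value with the given charset.
    static String decode(String escaped, String charsetName) {
        try {
            return URLDecoder.decode(escaped, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String escaped = "%E3%81%82"; // hiragana "a" as sent by encodeURI

        // Decoded as UTF-8: the three bytes form a single character.
        System.out.println(decode(escaped, "UTF-8").length());      // 1

        // Decoded as ISO-8859-1: each byte becomes its own character.
        System.out.println(decode(escaped, "ISO-8859-1").length()); // 3
    }
}
```

Since this decoding is applied to the URI rather than the request body, a body-level setting made in a filter does not reach it.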


So far, I have written about how to avoid getting "?????" by programming, but we have more choices. Both Tomcat and Glassfish offer a way of setting encodings by configuration. In the case of Tomcat, we can set the encoding with the URIEncoding option of the Connector configuration in server.xml:
<Connector port="8080" protocol="HTTP/1.1" 
connectionTimeout="20000"
redirectPort="8443"
URIEncoding="UTF-8" />

As its name suggests, this option affects only parameters sent by the HTTP GET method. Thus, this option complements Servlet filtering. Glassfish gives us a more flexible and convenient option. We can set the encoding in sun-web.xml of each web application, as shown below:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sun-web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Application Server 9.0 Servlet 2.5//EN"
"http://www.sun.com/software/appserver/dtds/sun-web-app_2_5-0.dtd">
<sun-web-app error-url="">
<context-root>/poplar</context-root>
<class-loader delegate="true"/>
<jsp-config>
<property name="keepgenerated" value="true">
<description>Keep a copy of the generated servlet class' java code.</description>
</property>
</jsp-config>
<parameter-encoding default-charset="UTF-8"/>
</sun-web-app>
This configuration worked perfectly for both the HTTP GET and POST methods of SimpleFormServlet0. Besides, we can set a different character encoding for each web application on a single Glassfish server, and the file is editable in an IDE since it is located under the WEB-INF directory. This option would be the best when we can choose Glassfish as our web container.

Let me add a brief explanation about JSP. As in result.jsp above, a JSP page usually has both
<%@page contentType="text/html" pageEncoding="UTF-8"%>
and
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
The pageEncoding attribute in the former is used on the server side, when the JSP file is read and compiled, while the latter is used by the web browser to render the page. That is why JSP pages have both.

Determining an appropriate character encoding for communication between web containers and web browsers is complicated. Web containers might have features for it, and web frameworks might give us options. However, when we need to handle non-ASCII characters entered in HTML form fields, we must make sure the conversion is done exactly once. If the conversion is done more than once, ????? will show up in the web browser. Don't forget that!
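The "exactly once" rule can be checked with a few lines of plain Java. The following is a minimal sketch; DoubleConversionDemo and reconvert() are hypothetical names, and reconvert() repeats the same steps as getUnicodeString() in SimpleFormServlet1 with UTF-8 as the target encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class DoubleConversionDemo {

    // Same steps as getUnicodeString(): back to bytes via ISO-8859-1,
    // then decode those bytes as UTF-8.
    static String reconvert(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u3042\u3044\u3046"; // hiragana a, i, u
        // What the servlet receives when UTF-8 bytes were read as ISO-8859-1.
        String garbled = new String(text.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        String once = reconvert(garbled);
        System.out.println(once.equals(text)); // true: converted exactly once

        // Japanese characters do not exist in ISO-8859-1, so getBytes()
        // replaces each of them with '?' on the second pass.
        String twice = reconvert(once);
        System.out.println(twice); // ???
    }
}
```

The second pass is destructive: once the characters have been replaced with '?', the original text cannot be recovered.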
