Friday, November 27, 2009

JRuby Embed (Red Bridge) Gotchas: __FILE__

About a month ago, in the thread "Load path issues inside jar / external app", I wrote that __FILE__ didn't work when Ruby code was loaded from the classpath. Stepping through JRuby with a debugger, I found one solution: a combination of setting a suitable current directory and using File.expand_path.

Here's the test code:

# file_check.rb [Birch]

puts "__FILE__: #{__FILE__}"
puts "dirname: #{File.dirname(__FILE__)}"
puts "expanded path: #{File.expand_path(File.dirname(__FILE__))}"
puts "joined path 1: #{File.join(File.dirname(__FILE__), "abc.rb")}"
puts "joined path 2: #{File.join(File.expand_path(File.dirname(__FILE__)), "abc.rb")}"

// FileCheck.java
package vanilla;

import org.jruby.embed.LocalContextScope;
import org.jruby.embed.PathType;
import org.jruby.embed.ScriptingContainer;

public class FileCheck {

    private FileCheck() {
        //String userDir = System.getProperty("user.dir");
        //System.setProperty("user.dir", userDir+"/src/ruby");
        ScriptingContainer container = new ScriptingContainer(LocalContextScope.SINGLETHREAD);
        System.out.println("currentDirectory: " + container.getProvider().getRubyInstanceConfig().getCurrentDirectory());
        container.getProvider().getRubyInstanceConfig().setCurrentDirectory(System.getProperty("user.dir")+"/src/ruby");
        System.out.println("currentDirectory: " + container.getProvider().getRubyInstanceConfig().getCurrentDirectory());
        container.runScriptlet(PathType.CLASSPATH, "file_check.rb");
    }

    public static void main(String[] args) {
        new FileCheck();
    }
}

The absolute path to file_check.rb was /Users/yoko/NetBeansProjects/Birch/src/ruby/file_check.rb, so I added its directory, /Users/yoko/NetBeansProjects/Birch/src/ruby, to the "-cp" option of the java command (as ./src/ruby, relative to the project directory). In the Java code, I set the same directory as the current directory. The result was:

yoko$ java -cp build/classes:/Users/yoko/DevSpace/jruby~main/lib/jruby.jar:./src/ruby vanilla.FileCheck
currentDirectory: /Users/yoko/NetBeansProjects/Birch
currentDirectory: /Users/yoko/NetBeansProjects/Birch/src/ruby
__FILE__: file_check.rb
dirname: .
expanded path: /Users/yoko/NetBeansProjects/Birch/src/ruby
joined path 1: ./abc.rb
joined path 2: /Users/yoko/NetBeansProjects/Birch/src/ruby/abc.rb

As you can see, the path wrapped in File.expand_path is correct, while File.dirname alone didn't work. The combination of setting the current directory and using File.expand_path should be the solution for this kind of case. If you are using JSR223 or BSF, setting the user.dir system property works, as in the lines I commented out in the Java code. This is because JRuby uses the current directory when it expands a path, and the current directory is based on the user.dir system property. In a web application, we can perhaps set the current directory using ServletContext#getRealPath().
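
The difference between the two joined paths can also be reproduced in plain Ruby, without any embedding. The following is only a minimal sketch: sibling_path is a hypothetical helper name, not part of JRuby or Red Bridge, and the script name is passed in as a string to stand in for what __FILE__ returned above.

```ruby
# Build the path of a file that sits next to the given script.
# File.expand_path resolves a relative directory against the current
# working directory, so the result is absolute even when script_path is
# a bare file name such as "file_check.rb".
# (sibling_path is a hypothetical helper, used here for illustration only.)
def sibling_path(script_path, name)
  File.join(File.expand_path(File.dirname(script_path)), name)
end

# Without expand_path the result stays relative: "./abc.rb"
puts File.join(File.dirname("file_check.rb"), "abc.rb")

# With expand_path the result is absolute and rooted at Dir.pwd
puts sibling_path("file_check.rb", "abc.rb")
```

The second path is only useful if the current directory is the one that actually contains the scripts, which is why the current directory had to be set on the Java side first.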

Wednesday, November 25, 2009

JRuby Embed (Red Bridge) Update: global vars, loading java, and more

Over the past few weeks, I made a couple of changes to Red Bridge (JRuby Embed) that should improve performance a bit and reduce problems caused by global variables. These changes are available from 161d0fe in master (1.5.0.dev).

First, I changed the internal implementation of global variable sharing. Red Bridge injects all variables in its variable map just before an evaluation, and tries to retrieve all local, instance, and global variables and constants used in Ruby just after the evaluation. This behavior is really greedy and results in poor performance. However, it is necessary because Red Bridge discards all state, including variable values, right after the evaluation is done, in order to save resources. Unless it retrieves all variables and constants, Red Bridge can't return requested variables to a Java program. For example, users can do the following with Red Bridge:

ScriptingContainer container = new ScriptingContainer();
container.runScriptlet("$theta = Math::PI / 6.0");
container.runScriptlet("$value = Math.sin($theta)");
System.out.println(container.get("$theta") + ", " + container.get("$value"));

The above outputs: 0.5235987755982988, 0.49999999999999994


Local variables, instance variables, and constants (except global constants) need to be saved before they disappear at termination, but global variables remain on the Ruby runtime. So, I changed Red Bridge to get global variables lazily: only when a global variable is requested does Red Bridge take it out of the runtime.

This new behavior should also reduce trouble caused by global variables. Before the change, Red Bridge retrieved as many global variables as possible from the Ruby runtime, except predefined ones. Then, for each successive evaluation, Red Bridge injected all global variables in its variable map back into the runtime with the values from the previous evaluation. This behavior occasionally caused unexpected results and warnings. After the change, Red Bridge doesn't grab unnecessary global variables and doesn't inject them for the next evaluation, so unexpected results related to global variables should be reduced. This new behavior is not available when the global local variable behavior, JSR223's default behavior, is chosen, since that mode is tailored to behave exactly the same as the reference implementation.

Some of you might already know that clearing the variable map before a successive evaluation helps performance. I added two shortcut methods to ScriptingContainer:

org.jruby.embed.ScriptingContainer#remove(String key)
org.jruby.embed.ScriptingContainer#clear()

The remove method removes the specified key-value pair from the variable map and the runtime. The clear method removes all key-value pairs from the variable map and the runtime. The smaller the variable map, the shorter the injection time, so don't forget to remove redundant key-value pairs.


I made one more change: Red Bridge no longer loads the java library during initialization. Loading a library in JRuby is quite a cumbersome job: looking up the loaded-library tables to see that it is not already loaded, deciding how and from where to load it, loading it, and then caching it to avoid duplication. Yet not all Ruby scripts need the java library. If people run a Fibonacci script written in pure Ruby on Red Bridge, they don't need the java library at all. When people do want the java library, adding the line "require 'java'" to the Ruby code works fine. Moreover, people already add "require 'java'" when they run scripts with the jruby command if the scripts need the java library. The advantage of pre-loading the java library seems small, so I stopped loading it during initialization. The initialization time has perhaps been shortened a bit.
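
For scripts that do need Java classes, the explicit require looks like the sketch below. The on_jruby? helper and its guard are my own additions so that the snippet also runs on a non-JRuby interpreter; a script that only ever runs on Red Bridge would simply start with require 'java'.

```ruby
# True when running on JRuby, where RUBY_PLATFORM is "java".
# (on_jruby? is a hypothetical helper, added only so the sketch is
# runnable on other Ruby implementations as well.)
def on_jruby?
  RUBY_PLATFORM == "java"
end

if on_jruby?
  require 'java'                      # loaded on demand by the script itself
  list = java.util.ArrayList.new
  list.add("loaded on demand")
  puts list.get(0)
else
  puts "not running on JRuby, so the java library is not needed"
end
```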

Wednesday, November 04, 2009

A Japanese Teenage Boy Improved Ruby 1.9 Performance Up to 63%

The Japanese online magazine @IT Jibun Senryaku Lab. (an information site for IT engineers who want to educate and develop themselves) published an interview with a Japanese teenage boy, Masahiro Kanai, who improved the performance of several methods in Ruby 1.9. He is the age of a high school freshman (the third year of junior high school in the Japanese school system). The article (written in Japanese) is here.

According to the article, Masahiro Kanai joined “the Security and Programming Camp 2009” this summer and chose Ruby performance improvement as his subject. His mentor was Koichi Sasada (ko1). The performance of the methods he worked on improved by up to 63%, and by 8% on average. His patches were applied to Ruby trunk on Oct. 5 this year.

What Masahiro Kanai did was fundamental performance tuning: he moved unnecessary macro references out of loops. Masahiro spotted that the macros below, in array.c, string.c, and struct.c, were referenced every time Ruby checked whether data was held in a structure. Even though the data were constant, Ruby evaluated the macros on every iteration of the loop to check for the data's presence.

- RARRAY_PTR, RARRAY_LEN
- RSTRING_PTR, RSTRING_LEN
- RSTRUCT_PTR, RSTRUCT_LEN

He optimized the loops by eliminating the macro references when the data were constant.
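
The general technique is hoisting a loop-invariant lookup out of the loop. The sketch below is only an analogy written in Ruby; the real patches are in C and are not reproduced here, and sum_naive and sum_hoisted are hypothetical names. The call to items.size stands in for a macro such as RARRAY_LEN.

```ruby
# Looks up the length on every iteration, like a macro evaluated in the
# loop condition.
def sum_naive(items)
  total = 0
  i = 0
  while i < items.size
    total += items[i]
    i += 1
  end
  total
end

# Looks up the length once before the loop. This is only valid when the
# collection does not change size while the loop runs.
def sum_hoisted(items)
  total = 0
  len = items.size
  i = 0
  while i < len
    total += items[i]
    i += 1
  end
  total
end

puts sum_naive([1, 2, 3])
puts sum_hoisted([1, 2, 3])
```

Both versions return the same result; the second simply avoids repeating work whose answer cannot change inside the loop.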

The interviewer praised him for achieving this at his age.

Monday, November 02, 2009

JRuby Embed (Red Bridge) Update

Since my last post about JRuby Embed (Red Bridge), a lot has changed. The JRuby Embed codebase has been merged into JRuby! JRuby 1.5.0 will include Red Bridge in both its binary and source archives. Along with this, the JRuby Embed wiki pages have also been merged into JRuby's wiki, in the Embedding JRuby section.

Now, the JRuby Embed project is almost at its end of life. I'll soon close the jruby-embed users mailing list, since it is natural to talk on jruby-users/jruby-dev; besides, most discussions have already taken place on the jruby-users list. Jira is also going to be merged into JRuby's, but this will be done after JRuby's Jira has finished moving from codehaus to kenai. In any case, JRuby Embed users, please use JRuby's mailing lists and Jira. The embedding section of JRuby's Jira would be a good place for us to file issues. However, I'll keep the source code repository for JRuby 1.4. JRuby 1.4 has the JRuby Embed binary but doesn't have its sources. The binary in JRuby 1.4 is built from this project's codebase, so the repository still has a reason to be there.

One of the biggest changes is that JRuby Embed 0.1.3 has been released from the JRuby Embed project. It will be included in the upcoming JRuby 1.4 release. In this release, the default local context type has been switched from threadsafe to singleton. See the discussion about it. Please make sure your choice is the best one for your case. Read through the Context Instance Type section to learn which you should choose.

Version 0.1.3 is identical to the one in JRuby trunk (1.5.0.dev) and also includes a fix for JRUBY_EMBED-10. Give it a try. If you find something, file it at "JRuby Jira" and ask about it on JRuby's mailing list.

Monday, October 05, 2009

What's the embedding API of JRuby 1.4.0RC1?

JRuby 1.4.0RC1 was released on Oct. 2, and it was a big release with a lot of bug fixes and new features. Among them was JRuby Embed (aka Red Bridge). The name Red Bridge means a bridge from Java to Ruby and, of course, the bridge has the color of a ruby. However, many people probably thought, “What’s the new embedding API?” when they saw Tom’s announcement. In this blog post, I’m going to answer that question so that people can understand Red Bridge better.

Red Bridge is a Java API for running Ruby scripts in a Java program, and the project is hosted at http://kenai.com/projects/jruby-embed. Red Bridge has two layers: Embed Core, and implementations of scripting APIs built on Embed Core. Currently, JSR223 (javax.script: http://jcp.org/en/jsr/detail?id=223) and Jakarta BSF 2.4 (Bean Scripting Framework: http://jakarta.apache.org/bsf/) are implemented on top of Embed Core. Embed Core is a totally different API from JRuby’s JavaEmbedUtils, which offers a similar but much smaller API. Embed Core has a lot of useful methods and features for embedders, and users of this new embedding API don’t need to use JavaEmbedUtils anymore. Besides, unlike scripting APIs that are common to many languages, Embed Core focuses on leveraging JRuby’s power. For example, Embed Core allows users to configure the Ruby runtime easily, and its parse method accepts a JRuby-friendly argument, an InputStream, to read scripts from.

Red Bridge was originally a solo project that I started last winter at Google Code to solve issues in Sun’s JSR223 JRuby engine reference implementation. I was a contributor to the JRuby engine at scripting.dev.java.net but felt reluctant to rewrite it extensively since I’m not a Sun employee. In particular, the license of the reference implementation was a big obstacle to distributing it with JRuby. JSR223 JRuby engine users wanted the implementation to be bundled in JRuby, so I tried to get permission from Sun and, if possible, to modify the license to fit JRuby. But I couldn’t get any answer from Sun at all. Apart from the license issue, the reference implementation’s bug-prone mechanism for sharing global variables was a headache to me. That part was repeatedly affected by JRuby’s internal API changes and grew into ugly, literally patchwork-like code. It was also a problem that global variables were the only variable type that could be shared between Java and Ruby. In languages such as PHP and JavaScript, a name starting with ‘$’ is an ordinary variable. To Ruby, however, the name $something means not just a variable but a globally referenced variable. Some people were eager to share other variable types. The reference implementation’s global variable sharing also had a problem when the JRuby engine was used in a multi-threaded environment such as a Servlet container (a Java based web application server). The reference implementation could set the ThreadLocal option of the Ruby runtime to true using a system property. However, relying on JVM-wide system properties caused another problem, especially on web application servers: a web application server might host multiple web applications (wars), and system property settings affect all of them.

In light of these issues, Red Bridge has exactly the same license as JRuby and a new mechanism for sharing variables that enables sharing global, local, and instance variables. Users can choose a ThreadLocal model for context-local values such as the Ruby runtime, shared variables, and other instances. Embed Core provides methods for configuring the Ruby runtime. However, the JSR223 and BSF engines still rely on JVM-wide system properties since those APIs don’t define such methods. See the wiki, http://kenai.com/projects/jruby-embed/pages/Home, for details.

At kenai.com, you might find a project named “Red Bridge.” When I moved my project to kenai.com right after I got the invitation from Charles Oliver Nutter, its name was Red Bridge, the same as at Google Code. A couple of weeks later, I talked with Charles and Thomas Enebo about Red Bridge. They liked the Core part of Red Bridge’s two layers and wanted to have just the core layer bundled in JRuby. Following their choice, I started the “JRuby Embed” project just for Embed Core. After that, JSR223 and BSF were added to the list of things to be bundled in JRuby, and Red Bridge was merged into the JRuby Embed project. Merging into Red Bridge was certainly another option. However, I chose JRuby Embed because people were more interested in the Embed Core part than in the JSR223 implementation, and more members had subscribed to JRuby Embed. Besides, the package name is org.jruby.embed, with no redbridge in it. Since the name “Red Bridge” is nice and easier to remember than the banal name “JRuby Embed,” I’ll keep using Red Bridge. The BSF implementation, on the other hand, never had its own project. It was added after JSR223 was merged in and took a week or so.

With JRuby 1.4.0RC1, users might be confused about JRuby’s JavaEmbedUtils and Red Bridge, and about which one they should use. New users should definitely use Red Bridge since it is easy to use and powerful. (I’m working hard to update the documents, but some of them are still old. Sorry!) Right now, there is a discussion about how JavaEmbedUtils and the other embedding-related interfaces can be made obsolete. The APIs of JavaEmbedUtils and the others are used in many packages, including JRuby Rack, so making them obsolete would have a wide impact. Red Bridge will probably need bug fixes and performance improvements, and its API probably needs to be reviewed and modified. I think it will take time to eliminate JavaEmbedUtils.

So, what will be next? I want to add a feature to run compiled Ruby scripts on Red Bridge. Currently, the JIT and Force compile modes are supported, but those are different from executing *.class files generated from *.rb files. “Rails on Red Bridge” will be an exciting challenge for me. It might also be interesting if people could write Struts actions in Ruby using Red Bridge. I don’t have a clear roadmap right now, but I want to keep going.

Have fun with Red Bridge!

Sunday, September 06, 2009

Splitting jruby-complete.jar up for Google App Engine

When we write a web application using JRuby, we need jruby-complete.jar included in the war to use the builtin libraries. The builtin libraries are supposed to be located under jruby.home, so the jruby.home system property is expected to be set correctly. However, setting jruby.home in a web application doesn't make sense, so we need an alternative to setting the property. The answer is jruby-complete.jar, which contains the builtin libraries under its META-INF/jruby.home directory.

When we use Google App Engine, another problem pops up. GAE has a 10MB limit on each uploaded file (http://googleappengine.blogspot.com/2009/02/skys-almost-limit-high-cpu-is-no-more.html). Unfortunately, jruby-complete.jar is larger than 10MB. A shell script that splits jruby-complete.jar into two jar archives was introduced at http://olabini.com/blog/2009/04/jruby-on-rails-on-google-app-engine/. However, when I filed JRUBY-3949, I learned that the approach in that blog was already obsolete; a smarter way was already out there. The latest JRuby, that is, JRuby 1.4.0dev at git HEAD, has a Rakefile to create the jars.

This is what I actually did in a terminal:

$ git clone git://kenai.com/jruby~main
$ cd jruby~main
$ export JRUBY_HOME=`pwd`
$ PATH=$JRUBY_HOME/bin:$PATH
$ ant
$ gem install rake
$ gem install hoe
$ cd gem
$ rake update

Then, jruby-core-1.4.0dev.jar and jruby-stdlib-1.4.0dev.jar were built in the jruby~main/gem/lib directory.

Wednesday, September 02, 2009

Finally yaml worked on Google App Engine

In my previous post, I wrote about my struggle with an application on Google App Engine that uses JRuby's builtin library, yaml. I found the cause of the error at JRUBY-3892. A builtin library needs the jruby.home system property to be set correctly, which jruby-complete.jar is supposed to take care of. But the jruby-complete.jar built from the older JRuby 1.4.0dev had a somewhat broken path. Since the issue was resolved at the end of August, I tried again using the latest JRuby 1.4.0dev cloned from the git repo. It worked. So, the final release of JRuby 1.4.0 won't have this problem.

Now, my sample Servlet, ParsenRunServlet, is working at http://servletgarden-in-red.appspot.com/. If you are interested in the code, it is in the JRuby Embed API Wiki.

Saturday, August 29, 2009

Yaml doesn't work on Google App Engine

While I was testing the JRuby Embed API on Google App Engine, I encountered an awkward problem: yaml never worked on GAE, although exactly the same Servlet worked on GlassFish. The Servlet and Ruby code were:

# yaml_snippet.rb

require 'yaml'

content = YAML::load @text

def format element
  case element
  when String then print "<p>#{element}</p>"
  when Array
    print "<ul>"
    element.each do |child|
      print "<li>"
      format child
      print "</li>"
    end
    puts "</ul>"
  when Hash
    element.each do |key, value|
      print "<ul><li>#{key}"
      format value
      print "</li></ul>"
    end
  end
end

content.each do |heading, paragraph|
  puts "<h4>#{heading}</h4>"
  paragraph.each do |element|
    format element
  end
end

package olive.jruby.example;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.jruby.embed.ScriptingContainer;
import org.jruby.javasupport.JavaEmbedUtils.EvalUnit;

public class YamlSampleServlet extends HttpServlet {
    private ScriptingContainer container;
    private EvalUnit yaml_unit;
    private String text =
        "Trees:\n" +
        "- This is a small example to general HTML.\n" +
        "- - Quince\n" +
        "  - flower: Red\n" +
        "- - Apple\n" +
        "  - fruit: Red\n" +
        "- - Maple\n" +
        "  - leaf: Red";

    @Override
    public void init() {
        String classpath = getServletContext().getRealPath("/WEB-INF/classes");
        List<String> loadPaths = Arrays.asList(classpath.split(File.pathSeparator));
        container = new ScriptingContainer();
        container.getProvider().setLoadPaths(loadPaths);
        String filename = "ruby/yaml_snippet.rb";
        InputStream istream = container.getRuntime().getJRubyClassLoader().getResourceAsStream(filename);
        yaml_unit = container.parse(istream, filename);
    }

    protected void processRequest(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html;charset=UTF-8");
        PrintWriter out = response.getWriter();
        container.setWriter(out);
        try {
            out.println("<html>");
            out.println("<head>");
            out.println("<title>Servlet YamlSampleServlet</title>");
            out.println("</head>");
            out.println("<body>");
            out.println("<h3>Servlet YamlSampleServlet at " + request.getContextPath() + "</h3>");
            container.put("@text", text);
            yaml_unit.run();
            out.println("</pre></body>");
            out.println("</html>");
        } finally {
            out.close();
        }
    }

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        processRequest(request, response);
    }

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        processRequest(request, response);
    }

    @Override
    public String getServletInfo() {
        return "Yaml Sample";
    }
}

And the programs produced:

<html>
<head>
<title>Servlet YamlSampleServlet</title>
</head>
<body>
<h3>Servlet YamlSampleServlet at /Olive</h3>
<h4>Trees</h4>
<p>This is a small example to general HTML.</p><ul><li><p>Quince</p></li><li><ul><li>flower<p>Red</p></li></ul></li></ul>
<ul><li><p>Apple</p></li><li><ul><li>fruit<p>Red</p></li></ul></li></ul>
<ul><li><p>Maple</p></li><li><ul><li>leaf<p>Red</p></li></ul></li></ul>
</pre></body>
</html>
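
As an aside, the nesting in that output follows directly from the structure that YAML.load returns, which can be inspected in plain Ruby. This is only a sketch: it assumes the nested list items in the text are indented by two spaces, and TREES_YAML and load_trees are hypothetical names used just for this illustration.

```ruby
require 'yaml'

# The same text the servlet passes in as @text, assuming two-space
# indentation for the nested list items.
TREES_YAML = "Trees:\n" \
             "- This is a small example to general HTML.\n" \
             "- - Quince\n" \
             "  - flower: Red\n" \
             "- - Apple\n" \
             "  - fruit: Red\n" \
             "- - Maple\n" \
             "  - leaf: Red"

# Parse the text into nested Hash / Array / String objects.
def load_trees
  YAML.load(TREES_YAML)
end

load_trees.each do |heading, paragraph|
  puts heading
  paragraph.each { |element| puts "  #{element.inspect}" }
end
```

The top level is a Hash with one key, "Trees", whose value is an Array holding one String followed by three Arrays. Each inner Array contains a String and a one-entry Hash, which is why format needs its String, Array, and Hash branches.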

However, this test servlet never worked on GAE because of the exception:

[java] yaml/constants:15:in `const_missing': uninitialized constant YAML::Yecht::Resolver (NameError)
[java] from yaml:88
[java] from yaml:2:in `require'
[java] from ruby/yaml_snippet.rb:2
[java] ...internal jruby stack elided...
[java] from Module.const_missing(yaml:88)
[java] from (unknown).(unknown)(yaml:2)
[java] from (unknown).(unknown)(yaml:2)
[java] from Kernel.require(ruby/yaml_snippet.rb:2)
[java] from (unknown).(unknown)(:1)

JRuby has the yaml library under the "builtin" directory in its source tree, so all of the necessary scripts are in jruby.jar as well as in jruby-complete.jar. But the yaml library needs the jruby.home system property to be set correctly, so jruby-complete.jar is the only choice for a web application. In a standalone application, I could set jruby.home explicitly as below and got the result I expected:

package brick;

import java.io.InputStream;
import org.jruby.embed.ScriptingContainer;
import org.jruby.javasupport.JavaEmbedUtils.EvalUnit;

public class YamlSample {

    private String filename = "ruby/yaml_snippet.rb";
    private String text =
        "Trees:\n" +
        "- This is a small example to general HTML.\n" +
        "- - Quince\n" +
        "  - flower: Red\n" +
        "- - Apple\n" +
        "  - fruit: Red\n" +
        "- - Maple\n" +
        "  - leaf: Red";

    private YamlSample() {
        System.setProperty("jruby.home", "/Users/yoko/Works/080909-jruby/jruby~main");
        ScriptingContainer container = new ScriptingContainer();
        InputStream istream = container.getRuntime().getJRubyClassLoader().getResourceAsStream(filename);
        EvalUnit yaml_unit = container.parse(istream, filename);
        container.put("@text", text);
        yaml_unit.run();
    }

    public static void main(String[] args) {
        new YamlSample();
    }
}

I tried copying the whole yaml library into the source tree of my web application, then did a clean & build, redeployed, and requested again... with no luck. Sigh... I also tried setting the load path to the yaml library inside the jar archive,

String classpath = getServletContext().getRealPath("/WEB-INF/lib/jruby-complete.jar!/builtin");
List<String> loadPaths = Arrays.asList(classpath.split(File.pathSeparator));
container = new ScriptingContainer();
container.getProvider().setLoadPaths(loadPaths);

... no luck. The same exception, as ever.

Since the program worked on GlassFish, I guess a difference in class loading mechanisms between GlassFish and GAE might have caused the exception on GAE. But I haven't found the culprit so far. Any ideas?

Wednesday, August 26, 2009

NekoBean Fall Version


NetBeans' mascot, NekoBean, is enjoying the cool fall air, surrounded by colorful foliage.

More at:

http://nekobean.net/2009/08/post-18.html.

Tuesday, August 25, 2009

JRuby Embed API Update: Servlet Examples

I added a Servlet Examples section to the JRuby Embed API Wiki. Right now, there are just three examples in that section. (I'll add more examples later.) They are:

  • HelloWorldServlet
    Simple "Hello World" example, but helpful to get started.

  • GreetingServlet
    Two methods written in Ruby are called from Servlet.

  • SortableServlet
    Java interface is implemented in two ways in Ruby.


I tested these Servlets on Google App Engine and was relieved that all three worked well. I had wanted to verify that the Embed API works on GAE, which places some restrictions on programs. The Embed API doesn't use any unsupported API, so there should not be any problem. However, I realized that I had to specify a classpath explicitly, and that the classpath must not include appengine-tools-api.jar. If no classpath is given, the Embed API looks at the java.class.path system property, which contains a path to appengine-tools-api.jar. This means JRuby tries to load appengine-tools-api.jar into the Ruby runtime, and the result is simply an exception. Thus, when the Embed API is used in a Servlet, especially on Google App Engine, setting the classpath is really important.

The Embed API has two ways of setting a classpath. One is to use the org.jruby.embed.class.path system property. This is easy, but not the preferred way in a web application. A system property is shared across the whole Java VM, so its value is seen by every Servlet in every war archive and web application on a single web application server. One servlet might set classpath "A" using org.jruby.embed.class.path while, at the same time, another servlet tries to set classpath "B" using the same property. We can't know which classpath is actually used.

The other way of setting the classpath is to use the setLoadPaths method of the API. For example,

public class HelloWorldServlet extends HttpServlet {
    private ScriptingContainer container;

    @Override
    public void init() {
        String classpath = getServletContext().getRealPath("/WEB-INF/classes");
        List<String> loadPaths = Arrays.asList(classpath.split(File.pathSeparator));
        container =
            new ScriptingContainer(LocalContextScope.SINGLETHREAD);
        container.getProvider().setLoadPaths(loadPaths);
    }
    ...

Technically, we don't need to set the classpath to /WEB-INF/classes, since the server has already set it. But some harmless path is needed, so I chose that one.

The JRuby Embed API has a public method to set the classpath, but a JSR 223 implementation can't have such a method because the specification doesn't define one. The only way to set the classpath for a JSR 223 implementation is to use a system property. This is also true of RedBridge. So, be careful to choose a classpath that is harmless to all Servlets on the web application server.

Saturday, August 15, 2009

RedBridge and JRuby Embed API update

I updated both JRuby Embed API and RedBridge, and the latest version is now 0.0.1.1. With this update, three types of local variable behavior were added in light of the discussion at http://www.nabble.com/Call-for-discussion-about-embed-API-tc24528478.html. Before this update, Ruby's local variables always survived across multiple evaluations. Thus, local variables used in the first script evaluation were always reused in the second, third, or fourth evaluation even when the scripts had no relation to each other. Of course, users could delete unwanted local variables explicitly before the following evaluations, but this didn't happen by default. This feature was useful especially for ex-BSF users; however, it was not semantically correct, and Tom Enebo was concerned about it. During the discussion, a nice idea of a toggle-able local variable was suggested (Thank you, Adam ;) ), which seemed to satisfy the conflicting needs. The latest version supports the toggle-able local variable.

The new local variable behavior has three options: transient, persistent, and global. The first, transient, is the default and is faithful to Ruby semantics: local variables vanish after each evaluation, and Java programs can't get local variables used in Ruby scripts. If you want to use the same value or object as a local variable in more than one script, you need to set it again each time. However, instance and global variables survive across evaluations as they did in the previous version.

The second behavior, persistent, is how the previous version behaved. The same local variables can be used in multiple script evaluations, and they can also be retrieved from Ruby and used in Java. Like local variables, instance variables, global variables, and constants persist over multiple evaluations.
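
The difference between transient and persistent can be mimicked in plain Ruby with eval and Binding objects. This is only an analogy, not how the Embed API is implemented, and fresh_scope, run_transient, and run_persistent are hypothetical names. The variable names x and y are used instead of p and q because a bare p would fall back to Kernel#p when no local variable exists.

```ruby
# Return a brand-new, empty local scope each time it is called.
def fresh_scope
  binding
end

# "Transient"-style: every script gets its own scope, so a local set by
# one script is gone when the next one runs. A NameError is recorded
# as the error class instead of being raised.
def run_transient(scripts)
  scripts.map do |script|
    begin
      eval(script, fresh_scope)
    rescue NameError => e
      e.class
    end
  end
end

# "Persistent"-style: all scripts share one scope, so locals survive
# from one evaluation to the next.
def run_persistent(scripts)
  scope = fresh_scope
  scripts.map { |script| eval(script, scope) }
end

scripts = ["x = 9.0", "y = Math.sqrt(x)"]
p run_transient(scripts)   # [9.0, NameError]
p run_persistent(scripts)  # [9.0, 3.0]
```

In the transient case the second script cannot see x, which matches the description above: the value has to be set again for each evaluation.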

Example for JRuby Embed API:

package brick;

import java.util.Map;
import java.util.Set;
import org.jruby.embed.ScriptingContainer;
import org.jruby.embed.LocalVariableBehavior;

public class Sample1 {

    private Sample1() {

        ScriptingContainer container = new ScriptingContainer(LocalVariableBehavior.PERSISTENT);
        container.runScriptlet("p=9.0");
        container.runScriptlet("q = Math.sqrt p");
        container.runScriptlet("puts \"square root of #{p} is #{q}\"");
        Map m = container.getVarMap();
        Set<String> keys = container.getVarMap().keySet();
        for (String key : keys) {
            System.out.println(key + ", " + m.get(key));
        }
        System.out.println("Ruby used: p = " + container.get("p") +
            ", q = " + container.get("q"));
    }

    public static void main(String[] args) {
        new Sample1();
    }
}

Example for RedBridge (JSR223):

package redbridge;

import java.util.Set;
import javax.script.Bindings;
import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptException;
import org.jruby.embed.jsr223.JRubyScriptEngineManager;

public class EvalStringSample {

    private EvalStringSample() throws ScriptException {
        System.out.println("[" + getClass().getName() + "]");
        System.setProperty("org.jruby.embed.localvariable.behavior", "persistent");
        JRubyScriptEngineManager manager = new JRubyScriptEngineManager(Thread.currentThread().getContextClassLoader());
        ScriptEngine engine = manager.getEngineByName("jruby");
        engine.eval("p=9.0");
        engine.eval("q = Math.sqrt p");
        engine.eval("puts \"square root of #{p} is #{q}\"");

        Bindings bindings = engine.getBindings(ScriptContext.ENGINE_SCOPE);
        Set<String> keys = bindings.keySet();
        for (String key : keys) {
            System.out.println(key + ", " + bindings.get(key));
        }
        System.out.println("Ruby used: p = " + engine.get("p") +
            ", q = " + engine.get("q"));
    }

    public static void main(String[] args) throws ScriptException {
        new EvalStringSample();
    }
}

Output:

square root of 9.0 is 3.0
MANT_DIG, 53
MAX_10_EXP, 308
DIG, 15
MIN_EXP, -1021
ROUNDS, 1
MAX, 1.7976931348623157E308
RADIX, 2
EPSILON, 2.220446049250313E-16
MIN, 4.9E-324
q, 3.0
p, 9.0
MIN_10_EXP, -307
MAX_EXP, 1024
Ruby used: p = 9.0, q = 3.0


The third behavior, global, is a backward compatibility option for users who have used the JSR223 reference implementation released from scripting.dev.java.net. The reference implementation (RI) uses Ruby's global variables to share variables between Java and Ruby, and the variable names used in Java have the same form as local variable names in Ruby. That is, Java sees "message" while Ruby sees "$message." In contrast, Embed API and RedBridge enable sharing not only global variables but also instance variables, local variables, and constants. On RedBridge, when people use the name "message" in Java, they also use "message" in Ruby; when it is "$message" in Java, it is also "$message" in Ruby. I added this local variable behavior so that RI users can move to RedBridge easily.

Example for RedBridge (JSR223):

# greetings_globalvars.rb

def greet
message = "How are you? #{$who}."
end

def sayhi
$, = ","
$\ = "\n"
print "Hi", $people
$, = ""
$\ = nil
end

def count
$people.size + 1
end

// OldVariableBehaviorSample.java
package redbridge;

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptException;
import org.jruby.embed.jsr223.JRubyScriptEngineManager;

public class OldVariableBehaviorSample {
private final static String basedir = "/Users/yoko/NetBeansProjects/Birch";

private OldVariableBehaviorSample()
throws ScriptException, FileNotFoundException, NoSuchMethodException {
System.setProperty("org.jruby.embed.localvariable.behavior", "old");
JRubyScriptEngineManager manager = new JRubyScriptEngineManager();
ScriptEngine engine = manager.getEngineByName("jruby");
String filename = basedir + "/src/ruby/greetings_globalvars.rb";
Reader reader = new FileReader(filename);
engine.put("who", "Anakin");
List<String> people = new ArrayList<String>();
people.add("Obi-Wan");
people.add("C-3PO");
people.add("R2-D2");
engine.put("people", people);
engine.eval(reader);
Object[] args = null;
Object result = ((Invocable)engine).invokeFunction("greet", args);
System.out.println(result.toString());
((Invocable)engine).invokeFunction("sayhi", args);
result = ((Invocable)engine).invokeFunction("count", args);
System.out.println("counted: " + result.toString());
}

public static void main(String[] args)
throws ScriptException, FileNotFoundException, NoSuchMethodException {
new OldVariableBehaviorSample();
}
}

Output:

How are you? Anakin.
Hi,[Obi-Wan, C-3PO, R2-D2]
counted: 4


See wiki pages for details.
JRuby Embed API: http://kenai.com/projects/jruby-embed/pages/Home
RedBridge: http://kenai.com/projects/redbridge/pages/Home

Friday, August 07, 2009

RedBridge Update: JRubyScriptEngineManager

Today, I added two classes, JRubyScriptEngineManager and ServiceFinder, to RedBridge (JSR 223 JRuby engine). ServiceFinder is used internally by JRubyScriptEngineManager and is not meant for users. This update will be helpful especially for OS X users. Now, RedBridge works on both JDK 1.5 and 1.6 on OS X Java Update 4.

Since its first release, RedBridge hasn't had JRubyScriptEngineManager, mainly because of copyright. The JSR 223 JRuby Engine released from the Scripting Project at dev.java.net has a class with the same name and behavior. Although I wrote that class, I couldn't simply include it in RedBridge since Sun holds the copyright. Thus, I tested RedBridge on JDK 1.6 even though RedBridge itself was compiled on JDK 1.5. However, after OS X's Java was updated last June, JDK 1.6's service discovery failed to locate RedBridge. So, I decided to write it. Like the other classes of RedBridge, I rewrote JRubyScriptEngineManager from scratch so that RedBridge won't suffer from unexpected legal issues. The new JRubyScriptEngineManager isn't just a modified version of the old one. I wrote it as simply as possible because, I think, JSR 223 is in many cases used with frameworks. Keeping it vanilla would be better for users. Less headache.

Now, the snippet will be:

package redbridge;

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import org.jruby.embed.jsr223.JRubyScriptEngineManager;

public class EvalStringSample {

private EvalStringSample() throws ScriptException {
System.out.println("[" + getClass().getName() + "]");
System.setProperty("org.jruby.embed.localcontext.scope", "singlethread");
//ScriptEngineManager manager = new ScriptEngineManager();
JRubyScriptEngineManager manager = new JRubyScriptEngineManager();
ScriptEngine engine = manager.getEngineByName("jruby");
engine.eval("p=9.0");
engine.eval("q = Math.sqrt p");
engine.eval("puts \"square root of #{p} is #{q}\"");
System.out.println("q = " + engine.get("q"));
}

public static void main(String[] args) throws ScriptException {
new EvalStringSample();
}
}

Result:

[redbridge.EvalStringSample]
square root of 9.0 is 3.0
q = 3.0


JRubyScriptEngineManager can take a classloader as its constructor argument. When no classloader is given, JRubyScriptEngineManager uses the system classloader. For example, to use the current context classloader:

JRubyScriptEngineManager manager =
new JRubyScriptEngineManager(Thread.currentThread().getContextClassLoader());
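For comparison, here is a minimal sketch of the same idea with the standard javax.script.ScriptEngineManager, which also accepts a classloader in its constructor. It only lists whatever engine factories are visible through the given loader. The class name ListEnginesSketch and the method discover are made up for this illustration, and the output depends entirely on which engines are on your classpath.

```java
import java.util.List;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;

// Hypothetical helper class, not part of RedBridge.
public class ListEnginesSketch {

    // Returns the engine factories that the standard manager discovers
    // through the given classloader.
    static List<ScriptEngineFactory> discover(ClassLoader loader) {
        return new ScriptEngineManager(loader).getEngineFactories();
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        for (ScriptEngineFactory factory : discover(loader)) {
            // Prints the engine name and the names usable with getEngineByName.
            System.out.println(factory.getEngineName() + " " + factory.getNames());
        }
    }
}
```

If a JRuby engine jar is visible through the loader, its factory should show up in this list; if the list is empty, discovery itself is failing.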

See the wiki at the RedBridge project for other usages.

Tuesday, July 28, 2009

Start Over: JSR 223 JRuby engine on OSGi container

I got a comment from Neil Bartlett about my previous post. Yes, it was hard for OSGi people to understand from it what was wrong. Originally, my blog entry was to answer the question, "how can I create an osgi bundle using maven which uses jruby-engine to execute a jruby script?" So, I pasted the entire pom.xml in it. As Neil commented, I should have pasted the MANIFEST.MFs that all the bundles used. Hopefully, this entry will let OSGi people figure out the culprits and help me out.


  • What I want to do is ...

    I'm a committer of the JSR 223 JRuby engine. I want to provide a painless OSGi bundle of the JSR 223 JRuby engine to users. The JRuby engine works on top of JRuby; consequently, users need at least three bundles, JRuby, JRuby engine, and their own bundle, to get it to work on OSGi containers. As far as I tested, the current MANIFEST.MF of JRuby engine or JRuby, or both, might have a flaw, but I'm not sure. I want to fix JRuby engine's flaw, if it exists, as well as JRuby's.

  • Initial Problems were ...

    There were two basic problems. The first one was that JSR 223's discovery mechanism didn't work on OSGi. The mechanism was officially introduced in JDK 1.6, but works on JDK 1.5, too. The mechanism works like this:
    1. It looks for a META-INF/services/javax.script.ScriptEngineFactory file in every jar file found on the classpath.
    2. It instantiates the JSR 223 engine factory class specified in the javax.script.ScriptEngineFactory file.

    Probably because of a classloading issue, this mechanism doesn't work on Apache Felix. However, we can avoid this problem by instantiating a JRuby engine factory directly, bypassing the discovery mechanism.
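    The two discovery steps can be sketched in plain Java as below. This is only an illustration of the mechanism, not the code the JDK or the JRuby engine actually uses; the class name DiscoverySketch and its helper methods are made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical illustration of JAR service discovery.
public class DiscoverySketch {
    static final String SERVICE_FILE =
            "META-INF/services/javax.script.ScriptEngineFactory";

    // Strips a trailing "# comment" and whitespace; returns null for empty lines.
    static String parseLine(String line) {
        int hash = line.indexOf('#');
        String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
        return name.isEmpty() ? null : name;
    }

    // Step 1: collect class names from every service file the loader can see.
    static List<String> findFactoryClassNames(ClassLoader loader) throws IOException {
        List<String> names = new ArrayList<String>();
        Enumeration<URL> files = loader.getResources(SERVICE_FILE);
        while (files.hasMoreElements()) {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(files.nextElement().openStream(), "UTF-8"));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String name = parseLine(line);
                    if (name != null) {
                        names.add(name);
                    }
                }
            } finally {
                reader.close();
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        ClassLoader loader = ClassLoader.getSystemClassLoader();
        for (String name : findFactoryClassNames(loader)) {
            // Step 2: instantiate each listed factory class through the same loader.
            Object factory = Class.forName(name, true, loader)
                    .getDeclaredConstructor().newInstance();
            System.out.println(name + " -> " + factory);
        }
    }
}
```

    Both steps depend on a single classloader being able to see the service file and the class it names, which is the kind of assumption an OSGi container does not guarantee across bundles.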

    The second problem is the one for which I'm seeking the best solution. While instantiating the JRuby engine factory, com.sun.script.jruby.JRubyScriptEngineFactory (line 13 in the snippet of Take One), the JRuby engine, com.sun.script.jruby.JRubyScriptEngine, is also instantiated. These two are in the same bundle, JRuby engine's. While instantiating the JRuby engine, the Ruby runtime is instantiated, too. The Ruby runtime is in a different bundle, JRuby's. Up to here, no problem exists. At the same time, the JRuby engine tries to load the instance of org.jruby.javasupport.Java onto the Ruby runtime using JRuby's custom classloader. The class org.jruby.javasupport.Java is in JRuby's bundle. This ends up raising an exception.
    org.jruby.exceptions.RaiseException: library `java' could not be loaded: java.lang.ClassNotFoundException: org.jruby.javasupport.Java
    I don't think I have the choice of using another classloader to load org.jruby.javasupport.Java, since this is the JRuby way to use Java classes in Ruby scripts.

    JRuby's MANIFEST.MF used for this sample code is here. (It is too long to paste.)

    JRuby engine's MANIFEST.MF

    Manifest-Version: 1.0
    Built-By: yoko
    Created-By: Apache Maven Bundle Plugin
    Import-Package: com.sun.script.jruby,javax.script,org.jruby,org.jruby.
    exceptions,org.jruby.internal.runtime,org.jruby.javasupport,org.jruby
    .runtime,org.jruby.runtime.builtin,org.jruby.runtime.load,org.jruby.u
    til,org.jruby.util.io
    Bnd-LastModified: 1247081259404
    Export-Package: com.sun.script.jruby;uses:="javax.script,org.jruby.run
    time.builtin,org.jruby.runtime,org.jruby,org.jruby.internal.runtime,o
    rg.jruby.exceptions,org.jruby.javasupport,org.jruby.util,org.jruby.ru
    ntime.load,org.jruby.util.io"
    Bundle-Version: 1.0
    Bundle-Name: JRuby JSR223 Engine
    Build-Jdk: 1.5.0_19
    Private-Package: com.sun.script.jruby,
    Bundle-ManifestVersion: 2
    Bundle-SymbolicName: com.sun.script.jruby
    Tool: Bnd-0.0.311

    And the MANIFEST.MF of the snippet:

    Manifest-Version: 1.0
    Built-By: yoko
    Created-By: Apache Maven Bundle Plugin
    Bundle-Activator: hickory.example.Activator
    Import-Package: com.sun.script.jruby,hickory.example,javax.script,org.
    osgi.framework;version="1.4"
    Bnd-LastModified: 1248816762637
    Export-Package: hickory.example;uses:="javax.script,com.sun.script.jru
    by,org.osgi.framework"
    Bundle-Version: 1.0.0.SNAPSHOT
    Bundle-Name: Hickory
    Build-Jdk: 1.5.0_19
    Private-Package: .
    Bundle-ManifestVersion: 2
    Bundle-SymbolicName: hickory.example.Hickory
    Tool: Bnd-0.0.311


  • The Workaround is ...

    Hasan found the workaround of the problem (see Using JRuby in OSGi).
    Using Hasan's workaround, I wrote the second snippet.

    MANIFEST.MFs of JRuby and JRuby engine are the same as the first try. The differences of the MANIFEST.MF of the second snippet are just Bundle-Activator and Bnd-LastModified lines.

    Manifest-Version: 1.0
    Built-By: yoko
    Created-By: Apache Maven Bundle Plugin
    Bundle-Activator: hickory.example.Activator1
    Import-Package: com.sun.script.jruby,hickory.example,javax.script,org.
    osgi.framework;version="1.4"
    Bnd-LastModified: 1248818750858
    Export-Package: hickory.example;uses:="javax.script,com.sun.script.jru
    by,org.osgi.framework"
    Bundle-Version: 1.0.0.SNAPSHOT
    Bundle-Name: Hickory
    Build-Jdk: 1.5.0_19
    Private-Package: .
    Bundle-ManifestVersion: 2
    Bundle-SymbolicName: hickory.example.Hickory
    Tool: Bnd-0.0.311

    This worked well although I'm not sure this is the best. Then, another problem came.

  • Further problem is ...

    JRuby users use Java classes in their Ruby scripts very often. Those classes are usually in different jar archives or on a classpath that JRuby knows. This is where a further problem happened.

    The third snippet raised an exception when I tried to instantiate my Java class in a Ruby script.
    org.jruby.exceptions.RaiseException: cannot load Java class hickory.example.YellOut
    In the program, the Ruby script, "include Java\nputs Java::hickory.example.YellOut.new.whats," is passed to the Ruby runtime to be evaluated. Typically, JRuby finds the hickory.example.YellOut class on the classpath and instantiates it using JRuby's classloader. But this process failed on the OSGi container.

    Again, the only differences in MANIFEST.MF used here are just Bundle-Activator and Bnd-LastModified lines.

    Manifest-Version: 1.0
    Built-By: yoko
    Created-By: Apache Maven Bundle Plugin
    Bundle-Activator: hickory.example.Activator2
    Import-Package: com.sun.script.jruby,hickory.example,javax.script,org.
    osgi.framework;version="1.4"
    Bnd-LastModified: 1248820140956
    Export-Package: hickory.example;uses:="javax.script,com.sun.script.jru
    by,org.osgi.framework"
    Bundle-Version: 1.0.0.SNAPSHOT
    Bundle-Name: Hickory
    Build-Jdk: 1.5.0_19
    Private-Package: .
    Bundle-ManifestVersion: 2
    Bundle-SymbolicName: hickory.example.Hickory
    Tool: Bnd-0.0.311


  • One more workaround might be ...

    I tried the failed example after adding "DynamicImport-Package: *" to the JRuby bundle. With that line in JRuby's new MANIFEST.MF, the third snippet worked.

    However, Tommy strongly opposed adding "DynamicImport-Package: *" to JRuby's bundle, and added a comment to http://jira.codehaus.org/browse/JRUBY-3792.

    According to Neil, Tommy's workaround works only on SpringSource's dm Server. Then, what is the best way to get this code to work on other OSGi containers, for example on Apache Felix?

  • If JSR 223 engine has a flaw in its MANIFEST.MF, I'll fix it to provide a painless API.
    I wrote this entry because I couldn't get any relevant information by googling.

Monday, July 27, 2009

What's the ideal way to get JSR223 work on OSGi?

After I wrote the entry, JSR 223 JRuby Engine won't work on OSGi platform, a workaround and an objection to the workaround were posted to the jruby-users mailing list, archived at http://www.nabble.com/running-jruby-in-an-osgi-container-td24379565.html. The workaround Hasan found worked well. But Tommy objected because the workaround would cause a tangle of references on a container that hosts multiple types of applications. To avoid this, Tommy advised me to add "Import-Bundle: org.jruby.jruby" to my bundle configuration. I tried Tommy's advice, but it didn't work for me. What's wrong with it?

I still haven't figured out how to get the JSR 223 JRuby engine to work on an OSGi platform "ideally." I still need help, suggestions, advice, and anything else that leads me to the ideal way. To make the discussion easier to follow, I'm going to write down what I did.

Versions:

  • Java for Mac OS X 10.5 Update 4

  • java version "1.5.0_19" for making a bundle

    java version "1.6.0_13" for starting the bundles

  • JRuby 1.3.1

  • JSR 223 JRuby Engine 1.1.7

  • Apache Felix 1.8.0


Take one: Original Test Program

The first one is the original test program that raises java.lang.ClassNotFoundException: org.jruby.javasupport.Java. I simplified it from the old one after seeing Hasan's sample.

- pom.xml

1 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
2 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
3 <modelVersion>4.0.0</modelVersion>
4 <groupId>hickory.example</groupId>
5 <artifactId>Hickory</artifactId>
6 <packaging>bundle</packaging>
7 <version>1.0-SNAPSHOT</version>
8 <name>Hickory</name>
9 <url>http://maven.apache.org</url>
10 <build>
11 <plugins>
12 <plugin>
13 <groupId>org.apache.felix</groupId>
14 <artifactId>maven-bundle-plugin</artifactId>
15 <extensions>true</extensions>
16 <configuration>
17 <instructions>
18 <Bundle-Activator>hickory.example.Activator</Bundle-Activator>
19 </instructions>
20 </configuration>
21 </plugin>
22 <plugin>
23 <groupId>org.apache.maven.plugins</groupId>
24 <artifactId>maven-compiler-plugin</artifactId>
25 <configuration>
26 <source>1.5</source>
27 <target>1.5</target>
28 </configuration>
29 </plugin>
30 </plugins>
31 </build>
32 <repositories>
33 <repository>
34 <id>maven2-repository.dev.java.net</id>
35 <name>Java.net Repository for Maven</name>
36 <url>http://download.java.net/maven/2/</url>
37 <layout>default</layout>
38 </repository>
39 </repositories>
40 <dependencies>
41 <dependency>
42 <groupId>org.apache.felix</groupId>
43 <artifactId>org.osgi.core</artifactId>
44 <version>1.3.0-SNAPSHOT</version>
45 </dependency>
46 <dependency>
47 <groupId>org.livetribe</groupId>
48 <artifactId>livetribe-jsr223</artifactId>
49 <version>2.0.5</version>
50 </dependency>
51 <dependency>
52 <groupId>com.sun.script.jruby</groupId>
53 <artifactId>jruby-engine</artifactId>
54 <version>1.1.7</version>
55 </dependency>
56 <dependency>
57 <groupId>junit</groupId>
58 <artifactId>junit</artifactId>
59 <version>3.8.1</version>
60 <scope>test</scope>
61 </dependency>
62 </dependencies>
63 </project>

- Snippet

1 package hickory.example;
2
3 import com.sun.script.jruby.JRubyScriptEngineFactory;
4 import javax.script.ScriptEngine;
5 import javax.script.ScriptEngineFactory;
6 import org.osgi.framework.BundleActivator;
7 import org.osgi.framework.BundleContext;
8
9 public class Activator implements BundleActivator {
10
11 public void start(BundleContext context) throws Exception {
12 System.out.println("Poor Activator");
13 ScriptEngineFactory factory = (ScriptEngineFactory) new JRubyScriptEngineFactory();
14 ScriptEngine engine = factory.getScriptEngine();
15
16 System.out.println("Everything should be ready.");
17 engine.eval("puts \"Yeaaaaaah! See?\"");
18 }
19
20 public void stop(BundleContext context) {
21 System.out.println("Bye!");
22 }
23 }

- On Apache Felix

cd felix-1.8.0
java -jar bin/felix.jar

Welcome to Felix.
=================

-> ps
START LEVEL 1
ID State Level Name
[ 0] [Active ] [ 0] System Bundle (1.8.0)
[ 1] [Active ] [ 1] Apache Felix Shell Service (1.2.0)
[ 2] [Active ] [ 1] Apache Felix Shell TUI (1.2.0)
[ 3] [Active ] [ 1] Apache Felix Bundle Repository (1.4.0)
-> start http://repo1.maven.org/maven2/org/jruby/jruby-complete/1.3.1/jruby-complete-1.3.1.jar
-> start http://download.java.net/maven/2/com/sun/script/jruby/jruby-engine/1.1.7/jruby-engine-1.1.7.jar
-> start file:///Users/yoko/NetBeansProjects/Hickory/target/Hickory-1.0-SNAPSHOT.jar
Poor Activator
Warning: JRuby home "/4.0:1/META-INF/jruby.home" does not exist, using /var/folders/xY/xYuRYl0RHjy7p6SeA0nHVU+++TI/-Tmp-/
org.osgi.framework.BundleException: Activator start error in bundle hickory.example.Hickory [6].
at org.apache.felix.framework.Felix.startBundle(Felix.java:1506)
at org.apache.felix.framework.BundleImpl.start(BundleImpl.java:779)
at org.apache.felix.shell.impl.StartCommandImpl.execute(StartCommandImpl.java:105)
at org.apache.felix.shell.impl.Activator$ShellServiceImpl.executeCommand(Activator.java:291)
at org.apache.felix.shell.tui.Activator$ShellTuiRunnable.run(Activator.java:177)
at java.lang.Thread.run(Thread.java:637)
Caused by: org.jruby.exceptions.RaiseException: library `java' could not be loaded: java.lang.ClassNotFoundException: org.jruby.javasupport.Java
at (unknown).initialize(:1)
at (unknown).(unknown)(:1)
org.jruby.exceptions.RaiseException: library `java' could not be loaded: java.lang.ClassNotFoundException: org.jruby.javasupport.Java
->

Take two: Applying Hasan's workaround

According to Hasan's analysis, the snippet above doesn't work because ...

JRuby could not find the class org.jruby.javasupport.Java if run within OSGi environment. So, this is a class loading problem. Tracing the log "could not be loaded" took us from org/jruby/ext/LateLoadingLibrary.java to org/jruby/RubyInstanceConfig.java. In this class we found:

private ClassLoader contextLoader = Thread.currentThread().getContextClassLoader();
private ClassLoader loader = contextLoader == null ? RubyInstanceConfig.class.getClassLoader() : contextLoader;

In an OSGi environment, the Thread.currentThread().getContextClassLoader() of our bundle cannot find the abovementioned java class of JRuby.

So, my second version became below:

- pom.xml
I changed the Activator's class name from Activator to Activator1.

18 <Bundle-Activator>hickory.example.Activator1</Bundle-Activator>

- Snippet

1 package hickory.example;
2
3 import com.sun.script.jruby.JRubyScriptEngineFactory;
4 import javax.script.ScriptEngine;
5 import javax.script.ScriptEngineFactory;
6 import org.osgi.framework.BundleActivator;
7 import org.osgi.framework.BundleContext;
8
9 public class Activator1 implements BundleActivator {
10
11 public void start(BundleContext context) throws Exception {
12 System.out.println("Activator1");
13 ScriptEngineFactory factory = (ScriptEngineFactory) new JRubyScriptEngineFactory();
14 final ClassLoader oldClassLoader = Thread.currentThread().getContextClassLoader();
15 Thread.currentThread().setContextClassLoader(null);
16 ScriptEngine engine = factory.getScriptEngine();
17 Thread.currentThread().setContextClassLoader(oldClassLoader);
18
19 System.out.println("Everything should be ready.");
20 engine.eval("puts \"Yeaaaaaah! See?\"");
21 }
22
23 public void stop(BundleContext context) {
24 System.out.println("Bye!");
25 }
26 }

Lines 14, 15, and 17 were added to the first one.
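One caveat with swapping the context classloader by hand is that it is not restored if the call in between throws. A hedged sketch of the same swap wrapped in try/finally follows. The class ContextLoaderSketch and the method withContextClassLoader are hypothetical names for this illustration, and the sketch uses java.util.function.Supplier, so it assumes Java 8 or later rather than the JDK 1.5 used for the bundle above.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of JRuby or the JSR 223 engine.
public class ContextLoaderSketch {

    // Runs body with the given context classloader and always restores the original.
    static <T> T withContextClassLoader(ClassLoader temporary, Supplier<T> body) {
        Thread thread = Thread.currentThread();
        ClassLoader original = thread.getContextClassLoader();
        thread.setContextClassLoader(temporary);
        try {
            return body.get();
        } finally {
            // Restored even when body throws.
            thread.setContextClassLoader(original);
        }
    }

    public static void main(String[] args) {
        ClassLoader before = Thread.currentThread().getContextClassLoader();
        ClassLoader seenInside = withContextClassLoader(null,
                () -> Thread.currentThread().getContextClassLoader());
        System.out.println("inside: " + seenInside);
        System.out.println("restored: "
                + (Thread.currentThread().getContextClassLoader() == before));
    }
}
```

In the Activator, the body would be the call to factory.getScriptEngine(), with null passed as the temporary loader as in Hasan's workaround.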

- On Apache Felix
After recreating the bundle, I tried this.

-> shutdown

rm -rf felix-cache
java -jar bin/felix.jar

Welcome to Felix.
=================

-> start http://repo1.maven.org/maven2/org/jruby/jruby-complete/1.3.1/jruby-complete-1.3.1.jar
-> start http://download.java.net/maven/2/com/sun/script/jruby/jruby-engine/1.1.7/jruby-engine-1.1.7.jar
-> start file:///Users/yoko/NetBeansProjects/Hickory/target/Hickory-1.0-SNAPSHOT.jar
Activator1
Warning: JRuby home "/4.0:1/META-INF/jruby.home" does not exist, using /var/folders/xY/xYuRYl0RHjy7p6SeA0nHVU+++TI/-Tmp-/
Everything should be ready.
Yeaaaaaah! See?
->

It worked!
I deliberately remove Apache Felix's cache and restart it every time before I try modified bundles. The cache seems to remember what worked before, so I got different results before and after removing the cache. It is a bit annoying, but it is needed to get accurate results.

Take three: Using a defined Java class in Ruby

Everything seems fine, but Tommy raised another problem: the workaround does not work when a Java class is used in a Ruby script. To try this, I defined the class hickory.example.YellOut in the same package as the Activator. Now, the test programs are as below:

- pom.xml

18 <Bundle-Activator>hickory.example.Activator2</Bundle-Activator>

- Snippet

1 package hickory.example;
2
3 import com.sun.script.jruby.JRubyScriptEngineFactory;
4 import javax.script.ScriptEngine;
5 import javax.script.ScriptEngineFactory;
6 import org.osgi.framework.BundleActivator;
7 import org.osgi.framework.BundleContext;
8
9 public class Activator2 implements BundleActivator {
10
11 public void start(BundleContext context) throws Exception {
12 System.out.println("Activator2");
13 ScriptEngineFactory factory = (ScriptEngineFactory) new JRubyScriptEngineFactory();
14 final ClassLoader oldClassLoader = Thread.currentThread().getContextClassLoader();
15 Thread.currentThread().setContextClassLoader(null);
16 ScriptEngine engine = factory.getScriptEngine();
17 Thread.currentThread().setContextClassLoader(oldClassLoader);
18
19 System.out.println("Everything should be ready.");
20 engine.eval("include Java\nputs Java::hickory.example.YellOut.new.whats");
21 }
22
23 public void stop(BundleContext context) {
24 System.out.println("Bye!");
25 }
26 }

package hickory.example;

public class YellOut {
public String whats() {
return "I made it!!!";
}
}

- On Apache Felix

-> shutdown
-> Bye!

rm -rf felix-cache
java -jar bin/felix.jar

Welcome to Felix.
=================

-> start http://repo1.maven.org/maven2/org/jruby/jruby-complete/1.3.1/jruby-complete-1.3.1.jar
-> start http://download.java.net/maven/2/com/sun/script/jruby/jruby-engine/1.1.7/jruby-engine-1.1.7.jar
-> start file:///Users/yoko/NetBeansProjects/Hickory/target/Hickory-1.0-SNAPSHOT.jar
Activator2
Warning: JRuby home "/4.0:1/META-INF/jruby.home" does not exist, using /var/folders/xY/xYuRYl0RHjy7p6SeA0nHVU+++TI/-Tmp-/
Everything should be ready.
org.osgi.framework.BundleException: Activator start error in bundle hickory.example.Hickory [6].
at org.apache.felix.framework.Felix.startBundle(Felix.java:1506)
at org.apache.felix.framework.BundleImpl.start(BundleImpl.java:779)
at org.apache.felix.shell.impl.StartCommandImpl.execute(StartCommandImpl.java:105)
at org.apache.felix.shell.impl.Activator$ShellServiceImpl.executeCommand(Activator.java:291)
at org.apache.felix.shell.tui.Activator$ShellTuiRunnable.run(Activator.java:177)
at java.lang.Thread.run(Thread.java:637)
Caused by: javax.script.ScriptException: org.jruby.exceptions.RaiseException: cannot load Java class hickory.example.YellOut
at com.sun.script.jruby.JRubyScriptEngine.evalNode(JRubyScriptEngine.java:509)
at com.sun.script.jruby.JRubyScriptEngine.eval(JRubyScriptEngine.java:184)
at javax.script.AbstractScriptEngine.eval(AbstractScriptEngine.java:247)
at hickory.example.Activator2.start(Activator2.java:20)
at org.apache.felix.framework.util.SecureAction.startActivator(SecureAction.java:589)
at org.apache.felix.framework.Felix.startBundle(Felix.java:1458)
... 5 more
Caused by: org.jruby.exceptions.RaiseException: cannot load Java class hickory.example.YellOut
at (unknown).(unknown)(/builtin/java/ast.rb:49)
at (unknown).get_proxy_or_package_under_package(/builtin/javasupport/java.rb:51)
at #.method_missing(:2)
at (unknown).(unknown)(:1)
javax.script.ScriptException: org.jruby.exceptions.RaiseException: cannot load Java class hickory.example.YellOut
->

JRuby failed to load hickory.example.YellOut even though this class is in the same bundle as the Activator. It is a class loading issue again. Somehow, I need to let JRuby know where hickory.example.YellOut.class resides.
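The general effect can be reproduced in plain Java without OSGi. The following sketch is only an analogy, not what Felix or JRuby does internally: a classloader whose parent chain does not reach the application's classes cannot load them, even though the class is right there for the code that created the loader. VisibilitySketch and canSee are hypothetical names, and the nested YellOut stands in for hickory.example.YellOut.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical illustration of class visibility per classloader.
public class VisibilitySketch {

    // Stand-in for hickory.example.YellOut.
    public static class YellOut {
        public String whats() {
            return "I made it!!!";
        }
    }

    // True if the loader can resolve the class name, false otherwise.
    static boolean canSee(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = YellOut.class.getName();
        ClassLoader own = VisibilitySketch.class.getClassLoader();
        // No URLs and a null parent: only bootstrap classes are visible.
        ClassLoader isolated = new URLClassLoader(new URL[0], null);

        System.out.println("own loader sees YellOut: " + canSee(own, name));
        System.out.println("isolated loader sees YellOut: " + canSee(isolated, name));
        System.out.println("isolated loader sees String: "
                + canSee(isolated, "java.lang.String"));
    }
}
```

On an OSGi container each bundle has its own classloader, so JRuby's loader is in a position similar to the isolated loader here unless the container is told to wire it to the application bundle's packages.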

Take four: Applying the fix reported in http://jira.codehaus.org/browse/JRUBY-3792.

The filed issue came to my mind. I thought its fix might solve this sort of problem. So, I recompiled JRuby.

- JRuby recompilation
I added "DynamicImport-Package: *" at the end of jruby.bnd.template. The following is the entire jruby.bnd.template file.

Export-Package: org.jruby.*;version="@JRUBY_VERSION@"
Import-Package: !org.jruby.*, *;resolution:=optional
Bundle-Version: @JRUBY_VERSION@
Bundle-Description: JRuby @JRUBY_VERSION@ OSGi bundle
Bundle-Name: JRuby @JRUBY_VERSION@
Bundle-SymbolicName: org.jruby.jruby
DynamicImport-Package: *

Then, recompiled JRuby by running "ant jar-complete."

- On Apache Felix

-> shutdown

rm -rf felix-cache
java -jar bin/felix.jar

Welcome to Felix.
=================

-> start file:///Users/yoko/Tools/jruby-1.3.1/lib/jruby-complete.jar
-> start http://download.java.net/maven/2/com/sun/script/jruby/jruby-engine/1.1.7/jruby-engine-1.1.7.jar
-> start file:///Users/yoko/NetBeansProjects/Hickory/target/Hickory-1.0-SNAPSHOT.jar
Activator2
Warning: JRuby home "/4.0:1/META-INF/jruby.home" does not exist, using /var/folders/xY/xYuRYl0RHjy7p6SeA0nHVU+++TI/-Tmp-/
Everything should be ready.
I made it!!!
->


Worked!

However, Tommy pointed out a problem with this way of fixing bundles, because "DynamicImport-Package: *" would cause linkage problems when multiple applications and bundles are deployed on a single OSGi container, especially when different versions of JRuby bundles exist on it. The suggestion was:

The better solution, IMHO, is for the script bundle to actually declare its dependency on the jruby-complete bundle either by using Import-Package to bring in all of the packages it needs, or using Import-Bundle to pull in everything exported from the jruby-complete bundle. The first is pretty clearly a non-starter, since there is no way for me to tell which packages from jruby-complete the script bundle is going to need. The second works fine, though, and even allows the script bundle to ask for a particular version of the jruby-complete bundle, which is one of the weaknesses of DynamicImport-Package.


Take five: Trying "Import-Bundle: org.jruby.jruby"

Adding "Import-Bundle: org.jruby.jruby" to application's bundle configuration is Tommy's advice. So, I also tried this.

- pom.xml

Line 19 is added to the existing pom.xml. As in line 20, I also tried giving the Import-Bundle configuration from a separate file, osgi.bnd. The osgi.bnd file has just one line, Import-Bundle: org.jruby.jruby, in it.


1 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
2 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
3 <modelVersion>4.0.0</modelVersion>
4 <groupId>hickory.example</groupId>
5 <artifactId>Hickory</artifactId>
6 <packaging>bundle</packaging>
7 <version>1.0-SNAPSHOT</version>
8 <name>Hickory</name>
9 <url>http://maven.apache.org</url>
10 <build>
11 <plugins>
12 <plugin>
13 <groupId>org.apache.felix</groupId>
14 <artifactId>maven-bundle-plugin</artifactId>
15 <extensions>true</extensions>
16 <configuration>
17 <instructions>
18 <Bundle-Activator>hickory.example.Activator2</Bundle-Activator>
19 <Import-Bundle>org.jruby.jruby</Import-Bundle>
20 <!--<_include>src/main/resources/osgi.bnd</_include>-->
21 </instructions>
22 </configuration>
23 </plugin>
24 <plugin>
25 <groupId>org.apache.maven.plugins</groupId>
26 <artifactId>maven-compiler-plugin</artifactId>
27 <configuration>
28 <source>1.5</source>
29 <target>1.5</target>
30 </configuration>
31 </plugin>
32 </plugins>
33 </build>
34 <repositories>
35 <repository>
36 <id>maven2-repository.dev.java.net</id>
37 <name>Java.net Repository for Maven</name>
38 <url>http://download.java.net/maven/2/</url>
39 <layout>default</layout>
40 </repository>
41 </repositories>
42 <dependencies>
43 <dependency>
44 <groupId>org.apache.felix</groupId>
45 <artifactId>org.osgi.core</artifactId>
46 <version>1.3.0-SNAPSHOT</version>
47 </dependency>
48 <dependency>
49 <groupId>org.livetribe</groupId>
50 <artifactId>livetribe-jsr223</artifactId>
51 <version>2.0.5</version>
52 </dependency>
53 <dependency>
54 <groupId>com.sun.script.jruby</groupId>
55 <artifactId>jruby-engine</artifactId>
56 <version>1.1.7</version>
57 </dependency>
58 <dependency>
59 <groupId>junit</groupId>
60 <artifactId>junit</artifactId>
61 <version>3.8.1</version>
62 <scope>test</scope>
63 </dependency>
64 </dependencies>
65 </project>

- On Apache Felix

-> shutdown
-> Bye!

rm -rf felix-cache
java -jar bin/felix.jar

Welcome to Felix.
=================

-> start http://repo1.maven.org/maven2/org/jruby/jruby-complete/1.3.1/jruby-complete-1.3.1.jar
-> start http://download.java.net/maven/2/com/sun/script/jruby/jruby-engine/1.1.7/jruby-engine-1.1.7.jar
-> start file:///Users/yoko/NetBeansProjects/Hickory/target/Hickory-1.0-SNAPSHOT.jar
Activator2
Warning: JRuby home "/4.0:1/META-INF/jruby.home" does not exist, using /var/folders/xY/xYuRYl0RHjy7p6SeA0nHVU+++TI/-Tmp-/
Everything should be ready.
org.osgi.framework.BundleException: Activator start error in bundle hickory.example.Hickory [6].
at org.apache.felix.framework.Felix.startBundle(Felix.java:1506)
at org.apache.felix.framework.BundleImpl.start(BundleImpl.java:779)
at org.apache.felix.shell.impl.StartCommandImpl.execute(StartCommandImpl.java:105)
at org.apache.felix.shell.impl.Activator$ShellServiceImpl.executeCommand(Activator.java:291)
at org.apache.felix.shell.tui.Activator$ShellTuiRunnable.run(Activator.java:177)
at java.lang.Thread.run(Thread.java:637)
Caused by: javax.script.ScriptException: org.jruby.exceptions.RaiseException: cannot load Java class hickory.example.YellOut
at com.sun.script.jruby.JRubyScriptEngine.evalNode(JRubyScriptEngine.java:509)
at com.sun.script.jruby.JRubyScriptEngine.eval(JRubyScriptEngine.java:184)
at javax.script.AbstractScriptEngine.eval(AbstractScriptEngine.java:247)
at hickory.example.Activator2.start(Activator2.java:20)
at org.apache.felix.framework.util.SecureAction.startActivator(SecureAction.java:589)
at org.apache.felix.framework.Felix.startBundle(Felix.java:1458)
... 5 more
Caused by: org.jruby.exceptions.RaiseException: cannot load Java class hickory.example.YellOut
at (unknown).(unknown)(/builtin/java/ast.rb:49)
at (unknown).get_proxy_or_package_under_package(/builtin/javasupport/java.rb:51)
at #.method_missing(:2)
at (unknown).(unknown)(:1)
javax.script.ScriptException: org.jruby.exceptions.RaiseException: cannot load Java class hickory.example.YellOut
->


Unfortunately, it didn't work for me though Tommy said it worked.


Tommy also talked about JSR 223 JRuby engine's bundle.

It might be better for the JSR223 bundle to import all of the jruby-complete packages and then re-export them, so bundles like the script bundle would just have to specify a dependency on the JSR223 engine.


So, what is the ideal solution in terms of OSGi?

Friday, July 17, 2009

The design and implementation of Ruby M17N - Translation


This is a translation of the article posted to Rubyist Magazine vol. 0025, published in February 2009. The original article was written in Japanese by Yui Naruse. The article is not new, so many people have probably read it via an online translation service. However, the article is really long and is not easy to understand even in my first language. To learn Ruby's M17N and practice my English, I'm working on translating it. I hope this translation will help Ruby programmers who have not yet read the article.

Last updated:05/23/11



The Design and Implementation of Ruby M17N







  • Preface

    Ruby 1.9.1 was finally released on January 31, 2009 (JST). Development of the new version took almost a year since the release of Ruby 1.9.0-0 on December 25, 2007 (UTC).


    Ruby 1.9 has many new features and changes, some of which are not compatible with Ruby 1.8. For example, YARV is now the Ruby VM. In addition, Ruby 1.9 has Oniguruma as its regular expression engine and Enumerator in its core. Among them, the introduction of Ruby multilingualization (M17N) is probably the biggest change.


    Ruby 1.9 multilingualization (M17N) uses the code set independent (CSI) model, while many other languages use the UCS Normalization model. To make this distinctive system work, an encoding conversion engine called transcode has been newly added to Ruby 1.9. In this article, I will explain what multilingualization is, what Ruby M17N supports, and how to write code in the Ruby 1.9 way.


  • Introduction to Multilingualization


    • What is M17N?


      M17N is short for multilingualization. As many of you know, a computer can handle only bits, bytes (groups of bits), and arrays of bytes. Single-byte character sets like US-ASCII are easy for software to process; multi-byte character sets, however, are not, and handling them well takes some care. Further care is needed to use more than one encoding at the same time in a single piece of software. I’ll start with a brief explanation of the typical forms of internationalization, and then move on to M17N.


      • L10N

        L10N is short for localization, the idea of adapting computer software to a particular locale (cf. national language support (NLS)). With localized software, users can read and write in their native languages and see appropriate region-specific information. Historically, Japanese people have used a lot of software made in the U.S. or European countries. In general, those imported programs were poorly localized for Japanese users; therefore, Japanese developers worked to translate the messages output by the software into Japanese. In addition, the software architecture itself had to change to handle multi-byte encodings such as Shift_JIS or EUC-JP, since the original software was often designed only for single-byte character sets. Otherwise, problems would occur when the software determined word boundaries or the spaces between words or characters. To resolve this sort of problem, Japanese programmers made great efforts in the early days to make imported software handle multi-byte encodings.


      • I18N

        I18N is short for internationalization, the idea of internationalizing computer software. More precisely, the ideas are:
        - Software should support multi-byte character sets/encodings.
        - A mechanism for easily converting languages, currency symbols, and other region-specific notations should be provided.


        At the beginning, efforts to support multi-byte character sets were focused on creating frameworks based on ISO 2022 (one of the multi-byte encodings). Later, internationalization came to mean, more or less, supporting Unicode. Support was also improved by abstracting the handling of messages and region-specific notations; the gettext library is an example. Rails’ internationalization also builds on the Unicode support discussed here.


      • M17N

        As I mentioned earlier, M17N is short for multilingualization. The term covers two ideas:
        - Localization for more than one language should be available in a single piece of software.
        See http://www.jpnic.net/ja/research/200605-dom/chapter1-2.pdf
        - More than one language should be usable at the same time.
        See http://www.m17n.org/m17n-lib-en/index.html

        Ruby 1.9 chose the second idea of multilingualization.



      • Derivation of I18N, L10N, and M17N

        I18N, L10N, and M17N are numeronyms of the form “the first letter + the number of letters between the first and last letters + the last letter.” This convention was originally coined at DEC (Digital Equipment Corporation).
        * http://www.i18nguy.com/origini18n.html
        * http://q.hatena.ne.jp/1159582709
        * http://blog.miraclelinux.com/yume/2007/01/i18n_8bc0.html
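
        The rule is simple enough to check mechanically. The following is a minimal sketch; the method name numeronym and the handling of very short words are my own assumptions, not part of the original article.

```ruby
# Builds a numeronym: first letter + number of inner letters + last letter.
def numeronym(word)
  # Assumption for this sketch: words shorter than 3 letters are returned as-is.
  return word if word.length < 3
  "#{word[0]}#{word.length - 2}#{word[-1]}"
end

%w[internationalization localization multilingualization].each do |word|
  puts "#{word} -> #{numeronym(word).upcase}"
end
```

        The three words abbreviate to I18N, L10N, and M17N respectively.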



    • Differences between the UCS Normalization and CSI Models

      The UCS Normalization and CSI models are well-known models for multilingualization. The two models use different mechanisms for the internal character encoding of computer software. Naturally, each has pros and cons.



      • The UCS Normalization Model

        The UCS Normalization model uses a single character set, called the Universal Character Set, to handle characters internally. The most remarkable advantage of this model is that people don’t need to modify their programs to adopt it. Thus, in most cases, people can keep using old software that is only localized (not internationalized) even after this model has been introduced. When this model works in the back end, every input/output operation encodes to or decodes from internal code points. In other words, all input characters are converted to an array of internal code points before processing, and all internal code points are converted back to a byte array before output. Since this approach standardizes on a single internal code set and does not require programmers to modify old software, many languages and operating systems, such as Perl, Python, Java, .NET, Windows, and Mac OS X, use the UCS Normalization model internally. In fact, almost all languages and operating systems use it.



        • Perl's case (UTF-8)

          Perl uses the UCS Normalization model with Unicode. The character sets supported in this model depend on the Unicode version the system has chosen. Since UTF-8, one of the Unicode encodings, is a variable-width encoding, language implementers have trouble mapping character positions to byte positions and vice versa. Perl implementers are known to have put a lot of effort into handling multi-byte characters, for example, by using caches to remember the position of each character. The next challenge is probably how well the idea of the grapheme cluster can be implemented. A grapheme cluster is a user-perceived character that consists of more than one Unicode code point but expresses a single character, for example, “G” + acute accent. The definition of the Unicode scalar value is also important when thinking about what a character is in a language implementation that uses Unicode.

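          use utf8;   # the source below contains a UTF-8 string literal
          use Encode; # provides decode() and encode()
          $str = decode("UTF-8", "\xE3\x81\x82"); #=> "あ"
          $bytes = encode("UTF-8", "あ"); #=> "\xE3\x81\x82"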
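
          To make the grapheme cluster example concrete, here is a small Ruby sketch of “G” + acute accent. It assumes Ruby 2.5 or later, where String#grapheme_clusters is available; the helper name character_counts is hypothetical.

```ruby
# Counts the same text as code points and as user-perceived characters.
def character_counts(str)
  {
    code_points: str.codepoints.map { |cp| format("U+%04X", cp) },
    length: str.length,                      # number of code points
    graphemes: str.grapheme_clusters.length  # requires Ruby 2.5 or later
  }
end

# "G" followed by U+0301 COMBINING ACUTE ACCENT
p character_counts("G\u0301")
```

          The string has two code points (U+0047 and U+0301) but forms a single grapheme cluster.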



        • Java's case (UTF-16)

          Java 1.5 and later versions use UTF-16 as the internal character encoding, while Java 1.4 and older versions used a fixed 16-bit width to store code points in the Basic Multilingual Plane (BMP) of Unicode or ISO/IEC 10646. Consequently, old versions could handle only the range between U+0000 and U+FFFF. Java changed after a serious Unicode problem was revealed: 16 bits were not enough to assign code points to all the characters in the world. Unicode 2.0 resolved this flaw by inventing the surrogate pair, which represents a code point beyond the BMP with a pair of 16-bit code units. In light of this background, all Java programmers should know that a single char may be only one half of a surrogate pair. The .NET Framework is implemented in the same way, so its programmers have to deal with the same tricky issue.


          In addition, Python also uses UTF-16 as its internal character encoding by default. However, Python can be built to use UTF-32 instead by passing --enable-unicode=ucs4 (version 2.x) or --with-wide-unicode (version 3.0) at configure time. The UTF-32 build has been adopted by some versions of Fedora and Ubuntu.
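
          The surrogate pair can be observed from Ruby by converting a string to UTF-16BE and unpacking it into 16-bit units. This is only an illustrative sketch; the helper name utf16_code_units and the sample characters are my own choices.

```ruby
# Returns the UTF-16 code units (16-bit values) of a string.
def utf16_code_units(str)
  str.encode("UTF-16BE").unpack("n*")
end

bmp = "\u3042"        # HIRAGANA LETTER A, inside the BMP
non_bmp = "\u{20BB7}" # a CJK ideograph outside the BMP

[bmp, non_bmp].each do |s|
  puts utf16_code_units(s).map { |unit| format("%04X", unit) }.join(" ")
end
```

          The BMP character is a single unit (3042), while U+20BB7 becomes the pair D842 DFB7. A program that treats each 16-bit unit as a character would see two “characters” there.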


        • Mosh's case (UTF-32)

          The internal character encoding of Mosh, a Scheme interpreter, is UTF-32. Unlike the other Unicode encodings, UTF-32 is a fixed-length encoding, which means that every Unicode code point is stored in exactly 32 bits. The system has to encode and decode characters on every input and output, because UTF-32 is not used for communication; however, the architecture for converting characters is simple since UTF-32 is fixed-length.
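
          The difference between fixed-length and variable-length encodings can be seen by counting the bytes each character occupies. A minimal Ruby sketch follows; the helper name bytes_per_character and the sample string are assumptions made for illustration.

```ruby
# Returns the byte size of each character after converting to the given encoding.
def bytes_per_character(str, encoding_name)
  str.encode(encoding_name).each_char.map(&:bytesize)
end

sample = "a\u3042\u{20BB7}" # ASCII letter, Hiragana, non-BMP ideograph
%w[UTF-8 UTF-16LE UTF-32LE].each do |name|
  puts "#{name}: #{bytes_per_character(sample, name).inspect}"
end
```

          UTF-8 gives 1, 3, and 4 bytes, UTF-16LE gives 2, 2, and 4 bytes, and UTF-32LE gives 4 bytes for every character.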


        • TRON's case (TRON code)

          TRON uses the UCS Normalization model, but does not use Unicode as its internal character code. The TRON project defined TRON code, which includes Unicode 2.0, as the internal code. Another example that uses TRON code is soopy.




      • The CSI Model

        Unlike the UCS Normalization model, the Code Set Independent (CSI) model does not have a common internal character code set. Under the CSI model, all encodings are handled equally, which means that Unicode is just one of the character sets. The most remarkable feature of the CSI model is that it does not require character code conversion, since the external and internal character codes are identical. Thus, the cost of conversion is eliminated. In addition, we can avoid unexpected information loss caused by conversion, especially loss from truncated bits or bytes. Ruby uses the CSI model, and so do Solaris, Citrus, and other systems based on a C library that does not define __STDC_ISO_10646__. If the C library does not define __STDC_ISO_10646__, the same character is not always stored as the same value in wchar_t. On the other hand, when __STDC_ISO_10646__ is defined, a value stored in wchar_t is always mapped to the same character; for example, 0x3042 is mapped to the Japanese Hiragana “あ”. Therefore, when the system uses the CSI model, programmers should be careful not to judge character codes just by looking at the data in memory. To avoid character-related bugs, programmers should use the functions defined for characters when they handle strings.
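
        As a rough illustration of why the data in memory cannot be judged on its own, the following Ruby sketch prints the bytes of the same character, Hiragana “A” (U+3042), in three encodings. The helper name hex_bytes is hypothetical.

```ruby
# Formats the raw bytes of a string as hexadecimal.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

hiragana_a = "\u3042"
%w[UTF-8 Shift_JIS EUC-JP].each do |name|
  encoded = hiragana_a.encode(name)
  puts "#{encoded.encoding.name}: #{hex_bytes(encoded)}"
end
```

        The character is stored as E3 81 82 in UTF-8, 82 A0 in Shift_JIS, and A4 A2 in EUC-JP. Under the CSI model each string keeps its own bytes, so the encoding has to travel with the string.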




  • The Ideas of Ruby M17N


    • The CSI Model

      As I explained before, Ruby uses the Code Set Independent (CSI) model, while many other languages use the UCS Normalization model. By using the CSI model, Ruby reduces the computational overhead that comes from unnecessary encoding conversions. Moreover, it can handle various character sets even if they are not based on Unicode.


    • Strings' Built-in Encodings

      Since Ruby M17N uses the CSI model, no single encoding can be assumed for a given string; each string might have a different encoding. In light of this complexity, Ruby’s String object is designed to carry its own encoding. Consequently, all string processing is done based on the encoding the String object has.

      # coding: UTF-8
      "あいうえお".encoding #=> #<Encoding:UTF-8>



    • Script Encodings

      Basically, the script encoding determines the encoding of string literals in source code. Each source file has its own script encoding, which can be obtained with the __ENCODING__ keyword at runtime. We can specify any ASCII compatible encoding as the script encoding by setting it in a magic comment line (I’m going to talk about the magic comment later). When no magic comment is present, Ruby applies the US-ASCII encoding to the source code. Thus, the magic comment explained below is necessary to write non-ASCII strings in a Ruby script.


      However, when we give a Ruby script to the runtime through standard input or on the command line with the -e option, the system locale is used as the script encoding if the magic comment is missing. Thus, we don’t need to add the magic comment just for a one-line Ruby program.

      The priority order of the script encoding for .rb files:
      magic comment > command-line -K option > RUBYOPT -K > shebang -K > US-ASCII

      The priority order of the script encoding for a script given via the command line or standard input:
      magic comment > command-line -K option > RUBYOPT -K > system locale



    • Magic Comment

      The magic comment is used to specify the script encoding of a Ruby script file. It is similar to the encoding attribute of the XML declaration in an XML file. As I explained, US-ASCII is applied if the magic comment is missing from a file. The magic comment must be on the first line unless the script file has a shebang line; when the file has a shebang line, the magic comment goes on the second line. The magic comment must match the regular expression /coding[:=]\s*[\w.-]+/, which is generally the style of an Emacs or Vim modeline. And, as its name indicates, the magic comment must be a comment.

      #!/bin/env ruby
      # -*- coding: utf-8 -*-
      puts "Emacs Style"

      # vim:fileencoding=utf-8
      puts "Vim Style 1"

      # vim:set fileencoding=utf-8 :
      puts "Vim Style 2"

      #coding:utf-8
      puts "Simple"

      We get the “invalid multibyte char” error when we write non-ASCII string literals in a Ruby script with no magic comment. The error tells us that non-US-ASCII characters are written in the script. Raising the error helps keep source code platform independent. Only a script’s author knows what encoding he or she used to write the script; usually, people don’t know the encoding of a script written by somebody else. Although NKF.guess or other utilities can figure out Japanese encodings, it is very hard to guess which European encoding was used for a script that someone wrote somewhere in the past. In light of this difficulty, Ruby 1.9 requires the magic comment if programmers want to write non-ASCII characters in a script file. The magic comment is valid only in the file that contains it; there is no feature for specifying a script encoding once and having it apply to files loaded afterwards with require.


    • External and Internal Encodings

      The Ruby 1.9 IO object has a feature to set an appropriate encoding on input strings and to convert encodings. We can also let the IO object convert the encoding of output automatically. The external and internal encodings of the IO object determine this behavior.


      Rather than invoking String#force_encoding on a string read from an IO object, we should first consider whether we can set the external encoding. Likewise, we should consider setting the internal encoding rather than using the String#encode method.

      I’m going to talk about IO in detail later.
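
      As a preview, here is a minimal sketch of setting the external and internal encodings when opening a file, instead of calling force_encoding or encode afterwards. The helper name read_sjis_sample is hypothetical, and Tempfile.create assumes Ruby 2.1 or later.

```ruby
require "tempfile"

# Writes HIRAGANA LETTER A (U+3042) as Shift_JIS bytes to a temporary
# file and reads it back with the given open mode.
def read_sjis_sample(mode)
  Tempfile.create("m17n_io") do |file|
    file.binmode
    file.write([0x82, 0xA0].pack("C*"))
    file.flush
    File.open(file.path, mode) { |io| io.read }
  end
end

tagged = read_sjis_sample("r:Shift_JIS")          # external encoding only
converted = read_sjis_sample("r:Shift_JIS:UTF-8") # external:internal

p [tagged.encoding, tagged.bytes.to_a]
p [converted.encoding, converted.bytes.to_a]
```

      With only the external encoding, the string keeps its bytes (0x82 0xA0) and is tagged as Shift_JIS. With an internal encoding as well, the IO object converts the input to UTF-8 (0xE3 0x81 0x82).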



      • default_external and default_internal

        Encoding.default_external returns the default external encoding of IO objects, while Encoding.default_internal returns the default internal encoding of IO objects. These encodings are used only when no encoding is specified explicitly for standard I/O, command-line arguments, or a file opened in a script.

        When Encoding.default_internal is defined, the encoding of every input string is supposed to be the value returned by Encoding.default_internal. In the same way, strings returned from libraries are expected to be encoded in Encoding.default_internal.

        However, using Encoding.default_external as the initial encoding of strings returned from libraries is not recommended. Since Encoding.default_external is only the default value of the external encoding, it tells us nothing about the internal encoding. In addition, we should be careful with the internal encoding because the default value of Encoding.default_internal is nil. Don’t assume that Encoding.default_internal is always suitable as the initial value.

        default_external
        -E/-U/-K command-line options > -E of RUBYOPT > -E in shebang > locale

        default_internal
        -E/-U command-line options > -E of RUBYOPT > -E in shebang > nil



      • Command Line options: -E and -U

        We can set the values of Encoding.default_external and Encoding.default_internal with the -E command-line option, whose format is -Eex[:in]. The -U command-line option sets both values to UTF-8. When we use the -U option, the script encoding of a script given from standard input or on the command line with the -e option is assumed to be UTF-8, too. However, the -U option has no effect on the script encoding of a script given as a file.
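
        The -E option can be tried out by starting a child Ruby process and printing its defaults. This is only a sketch: the helper name default_encodings_with is hypothetical, and it assumes that the current interpreter (RbConfig.ruby) may spawn a child process and that RUBYOPT does not override the settings.

```ruby
require "rbconfig"

# Runs a child Ruby with the given command-line options and returns the
# names of its default external and internal encodings.
def default_encodings_with(*options)
  script = 'print [Encoding.default_external, Encoding.default_internal]' \
           '.map { |e| e ? e.name : "nil" }.join(",")'
  output = IO.popen([RbConfig.ruby, *options, "-e", script]) { |io| io.read }
  output.split(",")
end

p default_encodings_with("-EEUC-JP:UTF-8") # external and internal
p default_encodings_with("-EShift_JIS")    # external only
```

        The first call reports EUC-JP and UTF-8. In the second call only the external encoding is given, so the internal encoding stays nil.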



    • Locale Encoding

      Here’s how the locale encoding is determined. The Ruby runtime tries to read the $LANG environment variable on both Unix and Windows to decide locale_charmap. If $LANG exists, locale_charmap is the encoding that $LANG specifies. Otherwise, the runtime tries to obtain it with GetConsoleCP *4 on Windows or Cygwin. Once the value of locale_charmap has been fixed, we can get it from Encoding.locale_charmap. Remember that miniruby always returns ASCII-8BIT, and that an environment without nl_langinfo returns nil.

      After locale_charmap has been fixed, the locale encoding is determined from it. The locale encoding is identical to the value of Encoding.find(Encoding.locale_charmap), but it will be US-ASCII when locale_charmap is nil, or ASCII-8BIT when locale_charmap is a name unknown to Ruby. We can get the resulting locale encoding with Encoding.find("locale").

      The locale encoding is mainly used as the default value of default_external, as I discussed here. Since default_external is the default value of an IO object’s external encoding, it is applied to $stdin, $stdout, and $stderr, which are always ready to use. On the other hand, we should explicitly set the encoding when we open files or other streams. Ruby developers concluded that the encoding for standard I/O should be the same as the one used on the console, since standard I/O is available only on the console. Thus, they agreed to use the $LANG environment variable on Unix and Windows platforms, or GetConsoleCP on Windows, for the encodings of IO objects. There was an idea that UTF-16LE could be used instead on Windows through the Unicode-compliant API. However, the idea was not adopted since it would cause a compatibility problem between Ruby 1.9 and 1.8. *5

      Given this background of the locale encoding, a programmer on the Windows platform may get an incorrect encoding when using the locale encoding for anything other than standard I/O or the default value of default_external. I think you’d better report to the ruby-dev mailing list if you want to use the locale encoding for anything other than the default value of default_external. Be aware that future versions of Ruby might use GetACP instead of GetConsoleCP.


    • Filesystem Encodings

      The filesystem encoding is used to handle characters on the file system. For example, the character encoding of filenames obtained from the file system falls under the filesystem encoding. Thus, the filesystem encoding is totally different from the locale encoding described here. Currently, no Ruby API is provided to get the filesystem encoding. *6



      • Windows Platform

        Windows stores filenames in the UTF-16LE *7 encoding on FAT32, NTFS, and other file systems that support long filenames. Also, on Windows NT, filenames on a FAT file system are handled in UTF-16LE after the system reads them. Consequently, I can say that Windows, especially Windows NT, uses UTF-16LE to handle filenames.

        Since Ruby 1.9.1 uses the ANSI API, Windows gives Ruby filenames after converting them to the ANSI or OEM code page *8. This means that filenames are always encoded in the ANSI or OEM code page when Ruby 1.9.1 gets them. Thus, Ruby’s filesystem encoding is invariably the ANSI or OEM code page on the Windows platform. Ruby sets the encoding of these strings to the ANSI or OEM code page and, if necessary, converts the strings into the encoding specified by command-line options.


      • Unix Platform

        On the Unix platform, we are unable to detect the encoding of filenames that reside in the file system. Because of this difficulty, Ruby regards the locale encoding as the filesystem encoding, and sets the locale encoding on the byte array that is made from a given filename.


      • Mac OS X Platform

        On HFS+ of Mac OS X, filenames are saved in UTF-16, in a variant of Normalization Form D modified by Apple. When we use the POSIX API, we get filenames encoded in UTF-8. This means the encoding of a filename will be UTF8-MAC if the filename was saved using the Carbon libraries of OS X. Thus, Ruby assigns UTF8-MAC as the filesystem encoding on Mac OS X. However, when the POSIX API has been used to save the filenames, Ruby handles them exactly the same way as on Unix.





        Windows Platform: encoding of filesystem -> converting into the internal encoding of system (UTF-16LE) (Windows) -> converting into the Ruby filesystem encoding (Ruby)
        Unix Platform: saving byte arrays in filesystem -> setting Ruby filesystem encoding

        Read the section, “File Path Encoding" for further details.




  • Ruby M17N Implementation

    Ruby M17N implements the ideas described in the previous section, although the implementation was not simple for various reasons, for example, a lack of development resources, the pursuit of higher usability, and the preservation of backwards compatibility.



    • All Encodings Ruby Handles

      As you may know, Encoding.list.length returns 83 on Ruby 1.9.1 (Encoding.list.length #=> 83), which shows that Ruby currently supports 83 encodings. However, not all 83 encodings are supported equally, mainly because of a lack of development resources. Encodings are grouped into three categories according to the level of support Ruby assures: ASCII compatible encodings, ASCII incompatible encodings, and dummy encodings. Since Ruby M17N uses the CSI model, Ruby has to know how to handle strings in each encoding, rather than having a conversion table between external and internal encodings as in the UCS Normalization model.



      • ASCII Compatible Encodings

        Ruby fully supports strings encoded in the encodings of this category. An ASCII compatible encoding is one in which every character of US-ASCII is mapped to the range \x00-\x7F. This is the only category that we can use for Ruby source code. The most remarkable feature of this category is that we can compare or concatenate a pair of strings even when their encodings differ, under one condition: at least one of the strings must consist only of ASCII characters, so that String#ascii_only? returns true for it. In this way, Ruby gets over the barrier between encodings.



        • The Range of ASCII

          Ruby assumes that code points 0x00-0x7F are mapped to US-ASCII in every ASCII compatible encoding. For example, even in the Shift_JIS encoding, Ruby M17N treats the code points 0x00-0x7F as US-ASCII, although by definition the range is mapped to JIS X 0201 Roman. Conversion by transcode is an exception and follows the ordinary definition of Shift_JIS.


        • ASCII ONLY

          When we say a string is ASCII ONLY, the string consists of only ASCII characters and is encoded in one of the ASCII compatible encodings. ASCII ONLY strings can be compared with, concatenated with, and matched by regular expression against strings encoded in other ASCII compatible encodings.

          # coding: UTF-8
          a = "いろは".encode("Shift_JIS") # converting into Shift_JIS
          a.ascii_only? #=> false
          b = "ABC".encode("EUC-JP") # converting into EUC-JP
          b.ascii_only? #=> true
          c = a + b #=> "いろはABC" # a and b may be encoded in different encodings
          c.encoding #=> #<Encoding:Shift_JIS>



        • ASCII-8BIT

          ASCII-8BIT is one of the ASCII compatible encodings, and is applied to an ASCII compatible octet sequence. Such a string is ASCII compatible, but it is neither a “string” in the typical sense nor purely binary data. Because this encoding is not ASCII incompatible, its strings can be compared and concatenated with strings of ASCII characters. Please understand that Ruby 1.9.1 does not have an ASCII incompatible binary encoding since it is considered unnecessary for Ruby.
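
          A short sketch of this behavior, assuming that Array#pack("C*") is used to build the octet sequence (it returns an ASCII-8BIT string). The helper name binary_string and the sample bytes are hypothetical.

```ruby
# Returns an ASCII-8BIT (binary) string built from raw byte values.
def binary_string(*byte_values)
  byte_values.pack("C*")
end

marker = binary_string(0xFF, 0xD8)  # two non-ASCII octets
text = "JFIF".encode("US-ASCII")    # an ASCII ONLY string
joined = marker + text

p marker.encoding == Encoding::ASCII_8BIT
p joined.encoding == Encoding::ASCII_8BIT
p joined.bytes.to_a
```

          The concatenation succeeds because ASCII-8BIT is ASCII compatible and the second string is ASCII ONLY; the result stays ASCII-8BIT.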


        • Emacs-Mule

          The Emacs-Mule encoding is the internal encoding that Emacs/Mule uses. Its approach is multilingualization in the manner of ISO 2022, as a stateless, variable-length encoding. The stateless-ISO-2022-JP encoding is a similar example.



      • ASCII Incompatible Encodings

        An ASCII incompatible encoding is one in which characters of US-ASCII are mapped to something other than the byte values \x00-\x7F. UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE fall into this category in Ruby 1.9.1. Ruby supports this sort of encoding only partially. We can’t use ASCII incompatible encodings for a Ruby script; besides, strings encoded in ASCII incompatible encodings can’t be concatenated with ASCII strings.



        • UTF-16 & UTF-32

          As I mentioned, Ruby 1.9.1 partially supports UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE, none of which has a BOM (Byte Order Mark) *9. In these encodings, U+FEFF is treated as ZERO WIDTH NO-BREAK SPACE rather than as a BOM, so please delete a leading U+FEFF yourself if you don’t need it. Be aware that Ruby 1.9.1 does not support UTF-16 and UTF-32, the variants without an explicit byte order.

          A lack of development resources is the reason UTF-16 and UTF-32 are unsupported. Ruby developers once tried to support these encodings; however, they found it was not easy. To support them, Ruby needs to calculate byte positions while paying attention to the BOM, provide endian-sensitive methods for each encoding, and tackle complicated IO-related processing. In light of these difficulties, Ruby 1.9.1 gave up supporting UTF-16 and UTF-32. No one objects to supporting these two encodings, so please provide a patch; it will probably be accepted.

          Since ASCII incompatible encodings are supported only partially, it is recommended to convert strings into UTF-8 when various string operations are expected.
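
          The recommendation can be illustrated with a small sketch: concatenating a UTF-16LE string with an ASCII string fails, while the same operation works after converting to UTF-8. The helper name try_concat is hypothetical.

```ruby
# Tries to concatenate two strings. Returns the resulting encoding, or
# the exception class when the encodings are incompatible.
def try_concat(left, right)
  (left + right).encoding
rescue Encoding::CompatibilityError => e
  e.class
end

utf16 = "abc".encode("UTF-16LE")

p try_concat(utf16, "def")                 # incompatible
p try_concat(utf16.encode("UTF-8"), "def") # fine after conversion
```

          The first call ends in Encoding::CompatibilityError even though both strings contain only the letters a to f; the second returns UTF-8.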

          Although Ruby defines UCS-2BE as an alias of UTF-16BE, Ruby 1.9.1 does not support UCS-2BE. The alias name, UCS-2BE, is used to read data encoded in UCS-2BE for users’ convenience.



      • Dummy Encodings

        Dummy encodings are encodings of which Ruby knows only the names. Ruby regards strings in dummy encodings as byte sequences and does not treat them as strings. Even when a string has only ASCII characters, comparison, concatenation, and other string operations are not supported for dummy encodings. ISO-2022-JP, UTF-7, and other stateful encodings are in this category. We should convert such strings into stateless-ISO-2022-JP or UTF-8 before we use them.

        Encoding::ISO_2022_JP.dummy? #=> true
        a = "いろは".encode("ISO-2022-JP") # converting into ISO-2022-JP
        b = "ABC".encode("EUC-JP") # converting into EUC-JP
        b.ascii_only? #=> true
        c = a + b
        #=> Encoding::CompatibilityError: incompatible character encodings: ISO-2022-JP and EUC-JP



      • Adding a New Encoding

        Support for a new encoding can be added to Ruby as an extension library. One of the C API functions, rb_enc_replicate, helps you define a new encoding by creating a replica of an already supported encoding. Alternatively, rb_define_dummy_encoding helps you create a dummy encoding. (The idea of the “replica” is that we can manage encodings from the C API, but cannot do anything with them from Ruby.) It is not easy to define a new encoding from scratch, so I encourage you to request standard support for the encodings you want.

        If, in spite of my recommendation, you need your own implementation of a new encoding, you should create a replica of one of the supported encodings. In addition, you should think carefully before defining a new dummy encoding, since strings in dummy encodings cannot be concatenated with ASCII ONLY strings. Make sure a dummy encoding is truly the right choice.


      • Special Encoding Names

        Ruby defines the names “locale,” “external,” and “internal” to refer to three special encodings: the locale encoding, the default external encoding, and the default internal encoding. When we want to know the script encoding of a source file, we can use the __ENCODING__ keyword.

        # coding: UTF-8
        locale = Encoding.find("locale")
        external = Encoding.find("external") # Encoding.default_external is equivalent
        internal = Encoding.find("internal") # Encoding.default_internal is equivalent
        __ENCODING__ #=> #<Encoding:UTF-8>




    • Encoding

      The Encoding class provides utility methods to access encodings, such as getting a list of encodings or managing the special encodings. In Ruby, the Encoding class does not have conversion tables for encodings; instead, the class keeps the byte structure of each encoding and character information, using the Oniguruma encoding module. The encoding conversion tables are maintained by Encoding::Converter, which is part of transcode.



      • Getting Encodings

        To get a list of supported encodings, we can use Encoding.list, Encoding.name_list, or Encoding.aliases.

        p Encoding.list # prints an array of supported encodings
        #=> [#<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>, #<Encoding:US-ASCII>, ...]

        p Encoding.name_list # prints an array of supported encoding names and alias names
        #=> ["ASCII-8BIT", "UTF-8", "US-ASCII", ..., "locale", "external", "internal"]

        p Encoding.aliases # prints a hash that maps alias names to encoding names
        #=> {"BINARY"=>"ASCII-8BIT", "SJIS"=>"Shift_JIS", "CP932"=>"Windows-31J", ...}

        To get an object of the specified encoding, we can use Encoding.find.

        p Encoding.find("UTF-8") #=> #<Encoding:UTF-8>

        p Encoding.find("eucJP") #=> #<Encoding:EUC-JP>

        p Encoding.find("locale") #=> #<Encoding:Windows-31J> # Japanese Windows Platform

        p Encoding.find("jis") #=> ArgumentError: unknown encoding name - jis

        We can use predefined constants for encodings. A constant name is formed by capitalizing all characters of an encoding name or alias and replacing “-” with “_”.

        p Encoding::UTF_8 #=> #<Encoding:UTF-8>

        p Encoding::EUC_JP #=> #<Encoding:EUC-JP>

        p Encoding::EUCJP #=> #<Encoding:EUC-JP>

        To get default external and internal encodings, we can use Encoding.default_external or other methods. We can also use Encoding.find.

        p Encoding.default_external #=> #<Encoding:Windows-31J> # a default value of Japanese Windows

        p Encoding.find("external")

        p Encoding.default_internal #=> nil

        p Encoding.find("internal")



      • Getting information about each encoding

        Other than Encoding#name and Encoding#inspect, we have the method Encoding#dummy? to check whether a given encoding is a dummy encoding.

        To check whether a given encoding is ASCII compatible, we’d better use Encoding::Converter.asciicompat_encoding instead of the methods in the Encoding class. This method returns nil for ASCII compatible encodings and non-existing encodings. For ASCII incompatible encodings, and for dummy encodings such as ISO-2022-JP, it returns the corresponding ASCII compatible encoding whose character set is equivalent to that of the given encoding.
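
        A quick way to see what Encoding::Converter.asciicompat_encoding returns is to call it for a few names. The sketch below reflects the behavior documented for current Ruby versions; the helper name asciicompat_report is hypothetical.

```ruby
# Maps each encoding name to the result of
# Encoding::Converter.asciicompat_encoding for that name.
def asciicompat_report(names)
  names.each_with_object({}) do |name, report|
    report[name] = Encoding::Converter.asciicompat_encoding(name)
  end
end

report = asciicompat_report(%w[UTF-8 EUC-JP UTF-16BE ISO-2022-JP])
report.each { |name, result| puts "#{name}: #{result.inspect}" }
```

        UTF-8 and EUC-JP are already ASCII compatible, so the method returns nil for them. UTF-16BE maps to UTF-8, and the dummy encoding ISO-2022-JP maps to stateless-ISO-2022-JP.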


      • Others

        The Encoding class has more methods, for example, Encoding.compatible?(str1, str2). The method judges whether two String or Encoding objects can be compared or concatenated, and returns the encoding the result of concatenation would have. Please see the documentation of the Encoding class for further details.
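
        Here is a minimal sketch of Encoding.compatible? with strings in different encodings. The helper name concat_encoding is hypothetical; the sample strings are written with \u escapes so that the example does not depend on the script encoding.

```ruby
# Returns the name of the encoding a concatenation would have, or nil
# when the two strings cannot be concatenated.
def concat_encoding(a, b)
  encoding = Encoding.compatible?(a, b)
  encoding && encoding.name
end

sjis = "\u3044\u308D\u306F".encode("Shift_JIS") # non-ASCII, Shift_JIS
eucjp = "\u3042".encode("EUC-JP")               # non-ASCII, EUC-JP
ascii = "ABC".encode("EUC-JP")                  # ASCII ONLY, EUC-JP

p concat_encoding(sjis, ascii)
p concat_encoding(sjis, eucjp)
```

        The first pair is compatible and would produce a Shift_JIS string, because the EUC-JP string is ASCII ONLY. The second pair returns nil, since both strings contain non-ASCII characters in different encodings.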



    • String

      In Ruby 1.8 and older versions, a String was just a byte array. This design gave us high flexibility and cheap encoding conversions. In older versions, Ruby could successfully convert byte arrays to the appropriate encoding, whether that was Shift_JIS, EUC-JP, or UTF-8. However, the design also had a negative effect: encoding-aware string operations were limited to regular expressions with $KCODE and the methods provided by jcode.rb. Because of this limitation, some people were dissatisfied with Ruby.

      With the introduction of Ruby M17N, a String object in Ruby 1.9 is still a byte array, but unlike in older versions, the array has an encoding associated with it. This change made every string operation encoding aware. As long as a String object is encoded in one of the encodings Ruby supports, we can use it as a string of characters, literally, whatever the encoding is. In this section, I’ll explain which parts of the String class have changed in Ruby 1.9.



      • Encodings of String Objects

        As mentioned earlier, each String object has its own encoding in Ruby 1.9. We can get the Encoding object of a given String with String#encoding.

        # coding: EUC-JP
        "あいうえお".encoding #=> #<Encoding:EUC-JP>
        "\u{3042}".encoding #=> #<Encoding:UTF-8>



      • Character Class

        Since strings are arrays of characters, I’m going to start with the character class. Ruby M17N does not have a class that represents a single character; instead, a character is a String object whose content is a single character.

        In the early days of Ruby M17N development, a Character class was on the table. It was meant to be a class of its own, not a special case of the String class. However, in the course of designing it, Ruby developers figured out that they did not need a class definition just for a character, since the String class can cover every feature of a Character class. That is, the code point, the encoding, and the data stored in a byte array are all elements of the String class as well. In terms of Ruby-ism, a character is simply a String whose content is just one character.

        This design has a couple of advantages that cover various string operations. For example, a byte array read from an external resource is ready to use as a string as soon as an appropriate encoding is set on it. Also, we can change what counts as one character just by replacing the encoding. On the other hand, the design has a downside: performance suffers because it is hard to locate a given character position in a string of a variable-length encoding.


      • String#[]

        When we want to get a character from a given string, we can use String#[]. In Ruby 1.8, the String indexer returns the byte value at the specified index, that is, a Fixnum. Look at the examples below. In the first example, Ruby 1.8 returns the value of the 0th byte when the index is zero, and its value is 0xE3.

        On the other hand, Ruby 1.9 returns a character from the string indexer. As mentioned, a character means a String whose content is just a single character. The second example shows this. The first letter “あ” of the given string “あいう” is returned when index 0 is specified. This example shows well that Ruby 1.9 literally sees a String object as an array of characters.

        Ruby 1.8

        String#[] #=> Fixnum (1 byte)
        "あいう"[0] #=> 0xE3 # in UTF-8

        Ruby 1.9

        String#[] #=> one character String
        "あいう"[0] #=> "あ"



      • Character Literal

        Ruby 1.9 does not define a Character class, but we can write a character literal in a Ruby program. Not only the conventional style like “?a” but also a new multilingualized style using a non-ASCII character like “?あ” is available. In addition, Unicode notation can now be used. It resembles the existing backslash escapes such as “?\t”. The encoding of the character is UTF-8 when Unicode notation is used; otherwise, it is the same as the script encoding of the source code.

        ?a
        ?\t
        ?あ
        ?\u3042



      • String#ord and Integer#chr

        Ruby has a String#ord method to convert a character to a code point. What if we try this with a Hiragana character, “あ”.ord? The result depends on the encoding tied to the character. For example, we get 12354 when the encoding is UTF-8. Then, what if we try the inverse method “chr” on the 12354 we got from “あ”.ord? We get an exception instead of the expected character, because this method needs to know which encoding to convert into. Thus, 12354.chr("UTF-8") gives us the Hiragana character “あ” as we intend.

        # coding: utf-8
        "あ".ord #=> 12354
        12354.chr("UTF-8") #=> "あ" in UTF-8



      • String Literal

        In most cases, a String literal remains the same as it was in Ruby 1.8. One of the few new features is the addition of Unicode escapes. In Ruby 1.9, we can use \uXXXX and \u{XXXX} to express a character in addition to the traditional forms \OOO and \xHH.

        When a String object is created from a String literal, the encoding of that String object is usually the same as the script encoding. The exception is a String literal that contains Unicode escapes. In this case, UTF-8 is applied to the String literal whatever the script encoding is. If a non-ASCII String object is created using byte escapes while the script encoding is US-ASCII, the resulting encoding will be ASCII-8BIT.

        # coding: EUC-JP
        "あ".encoding #=> #<Encoding:EUC-JP>
        "\u3042".encoding #=> #<Encoding:UTF-8>
        "\u{3042 3044 3046}" #=> "あいう"
        "abc".encoding #=> #<Encoding:US-ASCII>
        "\x82\xA0".encoding #=> #<Encoding:ASCII-8BIT>



      • String#length

        The “String#length” method has been changed to be character aware. In Ruby 1.8, this method returned the length of the byte array that represents the String. In Ruby 1.9, however, “String#length” returns the number of characters in the String. When we want the byte length of a given String, we use the newly added “String#bytesize” method.

        Ruby 1.8
        * String#length #=> byte length
        "あいう".length #=> 9 (UTF-8)

        Ruby 1.9
        * String#length #=> character length
        * String#bytesize #=> byte length
        "あいう".length #=> 3
        "あいう".bytesize #=> 9 (UTF-8)



      • String#each_*

        The “String#each” method was removed in Ruby 1.9. Since a String object is no longer Enumerable, it became unclear what this method should iterate over.

        To make this clear, Ruby 1.9 has four methods to enumerate a String: “String#each_byte” to iterate over bytes, “String#each_codepoint” over code points, “String#each_char” over characters, and “String#each_line” over lines. When a block is given to these methods, we get the same kind of iteration the old each method provided; when no block is given, we get an Enumerator object. Ruby 1.9 also has plural forms of these methods: “String#bytes,” “String#codepoints,” “String#chars,” and “String#lines.” These plurals make it clear that a String object is not only an array of characters but also still an array of bytes, code points, or lines.

        The methods explained here, except each_codepoint, have been backported to Ruby 1.8 and are available since Ruby 1.8.7.
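
        To make the four views concrete, here is a small sketch that enumerates the same String by bytes, code points, characters, and lines. The helper string_views is a hypothetical name; the sample string is written with Unicode escapes, so it is UTF-8 regardless of the script encoding.

```ruby
# string_views is a hypothetical helper that collects each view of a String.
def string_views(str)
  {
    :bytes      => str.each_byte.to_a,
    :codepoints => str.each_codepoint.to_a,
    :chars      => str.each_char.to_a,
    :lines      => str.each_line.to_a
  }
end

# Hiragana "a", a newline, Hiragana "i" (UTF-8)
views = string_views("\u3042\n\u3044")
p views[:bytes].size
p views[:codepoints]
p views[:chars].size
p views[:lines].size
```

        Calling each method without a block returns an Enumerator, which is why to_a can be chained directly.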


      • String Comparison and Concatenation

        String comparison and concatenation have changed considerably in Ruby 1.9. Even if two strings are identical as byte arrays, String#== returns false when their encodings are not the same. A comparison results in true only when both the byte arrays and the encodings match. However, one exception exists. Even though the encodings of the two strings are different, we get true if both encodings are ASCII compatible and the strings consist only of ASCII characters.

        In the case of string concatenation, Ruby raises Encoding::CompatibilityError when the encodings of the two strings to be concatenated are not the same. However, when both encodings are ASCII compatible and at least one of the two strings is ASCII only, the concatenation is possible. Also, concatenation with an empty string is always possible whatever the encodings are.
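
        The rules above can be tried out with a sketch like the following. The helper concat_result is a hypothetical name; it returns the encoding of the concatenated string, or :incompatible when Ruby raises Encoding::CompatibilityError.

```ruby
# concat_result is a hypothetical helper: it returns the encoding of a + b,
# or :incompatible when Ruby refuses to concatenate.
def concat_result(a, b)
  (a + b).encoding
rescue Encoding::CompatibilityError
  :incompatible
end

# Same bytes, different encodings: not equal
same_bytes_euc  = "\xA4\xA2".dup.force_encoding("EUC-JP")
same_bytes_sjis = "\xA4\xA2".dup.force_encoding("Shift_JIS")
p same_bytes_euc == same_bytes_sjis

# ASCII-only strings in ASCII-compatible encodings: equal
p "abc".encode("EUC-JP") == "abc".encode("UTF-8")

utf8_kana = "\u3042"
p concat_result(utf8_kana, "\u3042".encode("EUC-JP"))  # both non-ASCII
p concat_result(utf8_kana, "abc".encode("EUC-JP"))     # one side ASCII only
p concat_result(utf8_kana, "".encode("EUC-JP"))        # one side empty
```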


      • String as a Byte Array

        So far, I have talked about the String as an array of characters. Now, I’m going to pick up another side of a String here: a String as a byte array. Unlike in Ruby 1.8, byte array support in Ruby 1.9’s String is limited, and just three methods are provided. Those are "String#getbyte(index)" to read a byte, "String#setbyte(index, value)" to write a byte, and “String#bytesize” to get the byte length. Proposals for more sophisticated features are welcome.
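
        The three byte-oriented methods can be sketched as follows. The helper with_first_byte is a hypothetical name; it works on a copy because String#setbyte modifies its receiver.

```ruby
# with_first_byte is a hypothetical helper; it works on a copy so the
# original String is left untouched.
def with_first_byte(str, value)
  copy = str.dup
  copy.setbyte(0, value)
  copy
end

s = "abc"
p s.getbyte(0)               # byte value of "a"
p s.bytesize                 # length in bytes
p with_first_byte(s, 0x41)   # 0x41 is ASCII "A"
```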


      • String#force_encoding

        "String#force_encoding" is a destructive method that forces a String to change its encoding without modifying the byte array. Combined with "Object#dup", it is useful for creating a new String that has the same bytes but another desired encoding.

        "String#force_encoding" should be used sparingly, since Strings already have appropriate encodings assigned when they are created or read from files with the encoding specified. The method might be useful when we work with a network library or an XML library in which encodings are managed outside the Ruby world, for example, encodings set in HTTP headers or XML declarations. Also, we can use str.force_encoding("ASCII-8BIT") when we want to start treating a String as a byte array after it was treated as an array of characters.

        If you need to use "String#force_encoding" in your library, you should reconsider your library design. You should not use this method thoughtlessly. A correct design does not need this method. To warn people not to use this method casually, it was named force_encoding rather than set_encoding or encoding=, which also gives the impression that it is destructive.
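
        For the legitimate cases mentioned above, the combination with Object#dup might look like this sketch. The helper reinterpret is a hypothetical name; it only relabels a copy of the bytes and never converts them.

```ruby
# reinterpret is a hypothetical helper: same bytes, different encoding label.
def reinterpret(str, encoding)
  str.dup.force_encoding(encoding)
end

text  = "\u3042\u3044\u3046"              # three Hiragana characters in UTF-8
bytes = reinterpret(text, "ASCII-8BIT")   # view the same data as a byte array

p text.length                           # counted in characters
p bytes.length                          # counted in bytes
p text.bytes.to_a == bytes.bytes.to_a   # the byte array itself is unchanged
p text.encoding                         # the original keeps its encoding
```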


      • String#valid_encoding?

        "String#valid_encoding?" judges whether a String has a correct byte structure for the encoding assigned to it. This method tells us that the byte structure is valid in the String’s encoding; however, it does not tell us whether every character in the String is actually defined in that encoding. To find that out, we should try a conversion using Encoding::Converter.


      • String#gsub(pattern, hash)

        "String#gsub(pattern, hash)" is one of the new methods of Ruby 1.9.1. Before this method was added, we needed to use a block when we wanted to replace each matched string with a value that depends on the match. The problem was that using a block is costly.

        Internally, "str.gsub(pattern, hash)" works in the same way as "str.gsub(pattern){hash[$&]}". However, the new form is much faster than the block form because it works mainly in the C layer. Thus, the method is thought to be especially effective for escaping specified characters.
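
        As an illustration of the escaping use case, here is a minimal sketch. HTML_ESCAPES and escape_html are hypothetical names chosen for this example, and the table covers only three characters.

```ruby
# HTML_ESCAPES and escape_html are hypothetical names used for illustration.
HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }

def escape_html(text)
  # Each matched character is looked up in the Hash; no block is needed.
  text.gsub(/[&<>]/, HTML_ESCAPES)
end

p escape_html("a < b && c > d")
```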


      • String#inspect and String#dump

        I’m going to add information about the "String#inspect" and "String#dump" methods here to clarify their usage, although the methods have not changed in Ruby 1.9. "String#inspect" is meant for checking what a String is at a glance. When we want to escape or dump Strings, "String#dump" and "Marshal.dump" work as we expect.

        * String#inspect #=> an easy way to check it using p
        * String#dump #=> dump use (str == eval(str.dump) is guaranteed)

        # coding: UTF-8
        "あいう".inspect #=> "あいう" (UTF-8)
        "あいう".dump #=> "\u{3042}\u{3044}\u{3046}" (UTF-8)

        "あいう".encode("EUC-JP").inspect #=> "あいう" (EUC-JP)
        "あいう".encode("EUC-JP").dump #=> "\xA4\xA2\xA4\xA4\xA4\xA6" (EUC-JP)

        "あいう".encode("UTF-16LE").inspect #=> "B0D0F0" (US-ASCII)
        "あいう".encode("UTF-16LE").dump #=> "B0D0F0".force_encoding("UTF-16LE") (ASCII-8BIT)

        "あいう".encode("ISO-2022-JP").inspect #=> "\e$B$\"$$$&\e(B" (US-ASCII)
        "あいう".encode("ISO-2022-JP").dump #=> "\e$B$\"$$$&\e(B".force_encoding("ISO-2022-JP") (ASCII-8BIT)

        One more brief note about "Kernel#p": the method returned nil in Ruby 1.8, but returns its argument as it is in Ruby 1.9.



    • Regexp

      Many people might not be aware that Ruby’s regular expressions have been encoding sensitive since older versions. For example, in Ruby 1.8, /a/e.kcode returns “euc.” However, the regular expression implementation in Ruby 1.8 was behind those of other languages because it was based on GNU regex. The old implementation could handle only the SJIS, EUC, and UTF8 encodings; besides, it did not support lookbehind.

      Ruby 1.9 uses a regex engine equivalent to Oniguruma 5.9.1. This new engine offers richer syntax than before, lookbehind, and named capture groups; moreover, subexpression calls, which allow matching with context-free grammars, are also available.

      * Regular Expression@RURIMA
      * Oniguruma Regular Expression
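
      Two of the features named above, lookbehind and named capture groups, can be sketched as follows. Both helper methods are hypothetical and exist only to show the syntax.

```ruby
# Both helpers are hypothetical and exist only to show the syntax.
def amount_after_label(text)
  # (?<=...) is a lookbehind: "price: " must precede the digits
  # but is not part of the match.
  text[/(?<=price: )\d+/]
end

def parse_year_month(text)
  # Named groups can be read back from the MatchData by name.
  m = /(?<year>\d{4})-(?<month>\d{2})/.match(text)
  m && [m[:year], m[:month]]
end

p amount_after_label("price: 100 yen")
p parse_year_month("posted 2009-11")
```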



      • Regular Expression and Encodings

        Regular expression matching is strictly encoding sensitive. That is, “/./” matches any character except newline if the encodings of the regular expression and the string are the same; otherwise, we get Encoding::CompatibilityError.

        Since a Regexp is immutable, there is no Regexp#force_encoding; use Regexp.new(reg.source.force_encoding(enc)) instead. When you want to change the encoding of a regular expression, use Regexp.new(reg.source.encode(enc)) in place of the non-existing Regexp#encode. We can use Regexp.new for ASCII-incompatible encodings, too. Be careful, because you might get unexpected results if the regular expression relies on code points. This happens when the regular expression depends on the order of code points, as in a character range, and is converted into a different encoding.

        # coding: UTF-8
        kanji_of_jis_lv1_and_lv2_in_euc_jp = Regexp.new('[亜-煕]'.encode("EUC-JP"))
        broken_regexp = Regexp.new(kanji_of_jis_lv1_and_lv2_in_euc_jp.source.encode("UTF-8"))
        # no fun if matches with U+4E9C-U+7155



      • ASCII ONLY

        Ruby’s regular expressions have an idea similar to ASCII ONLY. When a Regexp object has an ASCII-compatible encoding and an ASCII-only expression, the object can match a String object of any ASCII-compatible encoding. In this case, “Regexp#fixed_encoding?” returns false.
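
        The difference between an ASCII-only Regexp and one with a fixed encoding can be observed with a sketch like this. The helper match_position is a hypothetical name; it rescues both Encoding::CompatibilityError and ArgumentError, since which of the two is raised is assumed to depend on the Ruby version.

```ruby
# match_position is a hypothetical helper that reports an incompatible
# encoding instead of raising.
def match_position(regexp, str)
  regexp =~ str
rescue Encoding::CompatibilityError, ArgumentError
  :incompatible
end

euc = "\u3042a".encode("EUC-JP")   # Hiragana "a" followed by ASCII "a", in EUC-JP

p(/a/.fixed_encoding?)             # ASCII-only expression
p(/\u3042/.fixed_encoding?)        # expression containing a non-ASCII character
p match_position(/a/, euc)
p match_position(/\u3042/, euc)
```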


      • Capture Syntax

        In old Ruby versions, we used $&, Regexp.last_match, $1, and $2 to access matched strings in a program. We also assigned the return value of String#match to a variable.

        The capture syntax of regular expressions makes this possible without those traditional measures. We can directly assign a matched value to a local variable by using named captures. When a regular expression literal on the left side of “=~” has named capture groups and contains no dynamic expansion such as #{}, each captured string is assigned to a local variable after matching. The name of the local variable is the name of the named capture group, so the group name must be a valid local variable name.

        /(?<foo>f\w+)(?<bar>b\w+)/i =~ "foobar2000"
        p [foo, bar] #=> ["foo", "bar2000"]

        /(?<foo>f\w+)(?<bar>b\w+)/i =~ "FreeBSD"
        p [foo, bar] #=> ["Free", "BSD"]

        /(?<foo>f\w+)(?<bar>b\w+)/i =~ "Firebug"
        p [foo, bar] #=> ["Fire", "bug"]

        If you worry about overwriting the value of an existing local variable with the result of matching, you can use the -w command-line option to get a warning.


      • Matching with Byte Arrays

        It is not common to match strings against a regular expression that contains escaped byte sequences. As the first example below shows, matching an ASCII-8BIT regular expression against a UTF-8 string does not work because the encodings are incompatible. The reason is that this matching is not a string operation but a byte operation. The regular expression expects to be compared with byte arrays. In addition, the byte sequence in it might be illegal as a string. Thus, we should convert both sides of the expression into ASCII-8BIT before the operation.

        The second example below works because ASCII-8BIT is an ASCII-compatible encoding. We can regard its last match as a regular expression matching of ASCII ONLY characters. If the encodings are not compatible, the result is an ArgumentError, as in the first example.

        # coding: UTF-8
        /\xE3\x81\x82/n =~ "あ"
        #=> ArgumentError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
        # This is the operation for byte arrays, not for strings

        #Both sides should be converted into ASCII-8BIT before the operation
        bytes = "aあ".force_encoding("ASCII-8BIT")
        /\xE3\x81\x82/n =~ bytes #=> 1
        /a/ =~ bytes #=> 0





    • IO Class

      The IO class is encoding sensitive, so we should be careful about whether returned values are strings or byte arrays. To use the IO class, we should understand the ideas of “external encoding” and “internal encoding,” which determine how encodings are set on strings and how strings are converted.



      • External and Internal Encodings

        The external and internal encodings determine which encoding the IO class sets and which conversion it performs. When the internal encoding is not given, the external encoding is set on the input String object. See the table in “Setting and Converting Encodings” below for details.

        We can set the external and internal encodings in the mode string given as the second argument of open, in an option Hash given as the third argument of open, or with IO#set_encoding after opening a stream.

        p [Encoding.default_external, Encoding.default_internal]
        #=> [#<Encoding:UTF-8>, nil] # in case of UTF-8

        open(path, "r") {|f| p [f.external_encoding, f.internal_encoding] }
        #=> [#<Encoding:UTF-8>, nil]

        open(path, "r:Shift_JIS") {|f| p [f.external_encoding, f.internal_encoding] }
        #=> [#<Encoding:Shift_JIS>, nil]
        open(path, "r:Shift_JIS:EUC-JP") {|f| p [f.external_encoding, f.internal_encoding] }
        #=> [#<Encoding:Shift_JIS>, #<Encoding:EUC-JP>]

        open(path, "r", :encoding => "Shift_JIS") {|f| p [f.external_encoding, f.internal_encoding] }
        #=> [#<Encoding:Shift_JIS>, nil]
        open(path, "r", :encoding => "Shift_JIS:EUC-JP") {|f| p [f.external_encoding, f.internal_encoding] }
        #=> [#<Encoding:Shift_JIS>, #<Encoding:EUC-JP>]

        open(path, "r", :external_encoding => "Shift_JIS", :internal_encoding => "EUC-JP") \
        {|f| p [f.external_encoding, f.internal_encoding] }
        #=> [#<Encoding:Shift_JIS>, #<Encoding:EUC-JP>]

        open(path, "r", :encoding => "Shift_JIS:EUC-JP") do |f|
          p [f.external_encoding, f.internal_encoding]
          #=> [#<Encoding:Shift_JIS>, #<Encoding:EUC-JP>]

          f.set_encoding(nil)
          p [f.external_encoding, f.internal_encoding]
          #=> [#<Encoding:UTF-8>, nil]
        end



      • Setting and Converting Encodings

        An IO object sets an appropriate encoding on input or output strings depending on whether external and internal encodings are given. If necessary, the IO object converts the encoding.








        External  | Internal  | default_internal | Behavior of Input                                    | Behavior of Output
        ----------+-----------+------------------+------------------------------------------------------+---------------------------------------
        not given | not given | nil              | sets default_external                                | outputs byte arrays without conversion
        not given | not given | given            | converts from default_external to default_internal   | outputs byte arrays without conversion
        given     | not given | nil              | sets external encoding                               | converts into external encoding
        given     | not given | given            | converts from external encoding to default_internal  | converts into external encoding
        given     | given     | nil              | converts from external encoding to internal encoding | converts into external encoding
        given     | given     | given            | converts from external encoding to internal encoding | converts into external encoding
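
        Two of the input behaviors in this table, "sets external encoding" and "converts from external encoding to internal encoding", can be observed with the sketch below. The helpers read_as and sjis_sample_file are hypothetical names, the temporary file is created only for the demonstration, and Encoding.default_internal is assumed to be nil.

```ruby
require "tempfile"

# read_as is a hypothetical helper: it opens path with the given external
# and (optional) internal encoding and returns everything it reads.
def read_as(path, external, internal = nil)
  mode = internal ? "r:#{external}:#{internal}" : "r:#{external}"
  File.open(path, mode) { |f| f.read }
end

# sjis_sample_file is also hypothetical: it writes three Hiragana characters
# as Shift_JIS bytes into a temporary file and yields its path.
def sjis_sample_file
  Tempfile.create("m17n") do |tmp|
    tmp.binmode
    tmp.write("\u3042\u3044\u3046".encode("Shift_JIS"))
    tmp.flush
    yield tmp.path
  end
end

sjis_sample_file do |path|
  p read_as(path, "Shift_JIS").encoding            # external only: bytes are just tagged
  p read_as(path, "Shift_JIS", "UTF-8").encoding   # external and internal: bytes are converted
end
```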


      • Byte, Character, Byte Array, and String

        The methods of the IO class fall into four categories based on the kind of data they handle: byte, character, byte array, and string.

        The character operation methods, IO#getc, IO#ungetc, and IO#readchar, have been changed to handle characters as Strings, unlike old versions in which characters were Fixnums. IO#each_char also handles characters. The return values of these methods follow the encoding conversion rules shown in the table above.

        Because IO#getc now handles characters, the IO#getbyte and IO#readbyte methods were added to the IO class. IO#each_byte is also available for byte handling.

        When we want to operate on byte arrays, IO#binread, IO#read(size), IO#read_nonblock, IO#readpartial, and IO#sysread are the methods to use. The encoding of a byte array is always ASCII-8BIT.

        The encoding applied by the string operation methods depends on a combination of factors. An example of this kind of method is IO#read without a size argument.
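
        The four categories can be compared side by side with a sketch like this one. The helper read_samples is a hypothetical name; it reads the same small UTF-8 file through one method of each category.

```ruby
require "tempfile"

# read_samples is a hypothetical helper that reads the same UTF-8 file
# through one method of each category.
def read_samples
  Tempfile.create("m17n") do |tmp|
    tmp.binmode
    tmp.write("\u3042\u3044")            # two Hiragana characters as UTF-8 bytes
    tmp.flush
    File.open(tmp.path, "r:UTF-8") do |f|
      samples = {}
      samples[:byte]       = f.getbyte   # byte: an Integer
      f.rewind
      samples[:char]       = f.getc      # character: a one-character String
      f.rewind
      samples[:byte_array] = f.read(3)   # byte array: an ASCII-8BIT String
      f.rewind
      samples[:string]     = f.read      # string: follows the encoding rules
      samples
    end
  end
end

p read_samples
```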


      • IO#external_encoding and IO#internal_encoding

        IO#external_encoding and IO#internal_encoding return the external encoding and the internal encoding respectively. The returned encodings are the ones used to decide conversions. Be careful: these methods do not simply return the external and internal encodings that the IO object was given, as the table below shows.












        default_internal | External  | Internal  | Mode: w / w+ / r+                   | Mode: r
        -----------------+-----------+-----------+-------------------------------------+------------------------------------
        not given        | not given | not given | external_encoding: nil              | external_encoding: default_external
                         |           |           | internal_encoding: nil              | internal_encoding: nil
        not given        | not given | given     | external_encoding: default_external | external_encoding: default_external
                         |           |           | internal_encoding: internal         | internal_encoding: internal
        not given        | given     | not given | external_encoding: external         | external_encoding: external
                         |           |           | internal_encoding: nil              | internal_encoding: nil
        not given        | given     | given     | external_encoding: external         | external_encoding: external
                         |           |           | internal_encoding: internal         | internal_encoding: internal
        given            | not given | not given | external_encoding: default_external | external_encoding: default_external
                         |           |           | internal_encoding: default_internal | internal_encoding: default_internal
        given            | not given | given     | external_encoding: default_external | external_encoding: default_external
                         |           |           | internal_encoding: internal         | internal_encoding: internal
        given            | given     | not given | external_encoding: external         | external_encoding: external
                         |           |           | internal_encoding: default_internal | internal_encoding: default_internal
        given            | given     | given     | external_encoding: external         | external_encoding: external
                         |           |           | internal_encoding: internal         | internal_encoding: internal


      • File Path Encodings

        A file path encoding is normally determined by the filesystem encoding of the platform, so how and which encoding is applied to a file path varies from platform to platform. Note that Ruby does not provide any API to get the filesystem encoding; thus, there is no Ruby API to get the file path encoding either.



        • Unix Platform

          On Unix operating systems, we can’t determine the filesystem encoding in general. Ruby returns the byte array of a filename after setting on it the filesystem encoding or the encoding specified by a command-line option. Ruby does not convert the byte array at this time; likewise, when Ruby hands a filename over to the system, the byte array of the String object is passed as it is.


        • Mac OS X Platform

          The filesystem encoding of Mac OS X is UTF8-MAC. Consequently, Ruby returns a filename encoded in UTF8-MAC or in the encoding specified by a command-line option. When Ruby gives OS X a filename, Ruby converts it into UTF8-MAC if the encoding is other than ASCII-8BIT, but does not do any conversion if the encoding is ASCII-8BIT. Please keep in mind that this behavior may change in the future since the design is not stable (it may become like the behavior on Unix).


        • Windows Platform

          When the ANSI API is used by the Ruby implementation, strings handed from the system to the Ruby runtime are encoded in the ANSI or OEM code pages. Thus, Ruby returns a filename as a string encoded in the filesystem encoding by default. If a specific encoding is given to the Ruby runtime by a command-line option, Ruby returns the filename string encoded in the specified encoding. Ruby behaves in one of two ways when it passes a filename to the operating system. If the filename string has ASCII-8BIT as its encoding, Ruby gives the system the string without any conversion since it should be a byte array. Otherwise, Ruby gives the system the filename after converting it into the filesystem encoding.

          For lack of development resources, Ruby 1.9.1 behaves just as I described here. However, Ruby 1.9.2 plans to use the Unicode API instead of the ANSI API. Once the Unicode API is used by the Ruby implementation, Ruby will return filenames encoded in Unicode when a command-line option specifies one of the Unicode encodings. In this case, filenames are never converted into the ANSI or OEM code pages. In the same way, when one of the Unicode encodings is specified, Ruby will give the system the filename encoded in Unicode, without converting it into a code page. The advantage of this new design is that Ruby can cover a wider range of characters, which could not be handled correctly by an encoding derived from the system locale.




    • Deprecated Features in 1.9.1


      • $KCODE

        $KCODE has been deprecated in Ruby 1.9. If you have programs that depend on $KCODE, you need to modify them to work under the new Ruby M17N design. The substitute, Encoding.default_internal, is ready in Ruby 1.9.1; however, you should be careful, because the default value of Encoding.default_internal is nil. Besides, IO objects convert encodings automatically once a value is set to Encoding.default_internal.


      • Replica and Base Encodings

        The ideas of “replica encoding” and “base encoding” in Ruby 1.9.0 have been deprecated in Ruby 1.9.1. These ideas were originally invented to share implementations between encodings that have the same byte structure. Ruby developers thought the ideas could also be used to define supersets and subsets of character sets.

        However, the flaw of the two ideas came to light. Defining supersets and subsets of encodings based on the similarity of byte structure worked well only for the EUC-JP and Shift_JIS encodings. Many other encodings needed to be newly defined. Finally, Ruby developers admitted that the ideas were insufficient and decided to remove them from Ruby 1.9.1.

        Since the C API still has the ideas of replica and base encodings, we can see there what those ideas were.


      • Command Line Option -K

        The -K command-line option is not recommended anymore, although it is still available. When we use the -K option in Ruby 1.9, it sets the given encoding as the default value of both the script encoding and Encoding.default_external. If no encoding is given by the -K option, Ruby applies US-ASCII as the default script encoding. Encoding.default_internal is not related to the -K option, so it remains nil.

        The -K command-line option has survived for backwards compatibility. The option helps when we want to run Ruby 1.8 code on Ruby 1.9 without any modification. Since -K is not recommended, we should write the magic comment in each script file. The magic comment is the best way to make scripts run on both 1.8 and 1.9, especially for new scripts.




  • Encoding Conversion

    Every Ruby version has been released with the character code conversion libraries Kconv, NKF (the implementation behind Kconv), and Iconv. However, these libraries have a flaw: the encodings they support are limited. In addition, the libraries depend on the platform. To fix these issues, the transcode library written by Martin Dürst is bundled in Ruby 1.9. Using the transcode library, the String#encode method and the Encoding::Converter class are newly defined.

    Ruby 1.9’s encoding conversion library rewrites both the byte array of a String and the encoding assigned to it. When we want to change only the encoding of the String, String#force_encoding is the method to use.

    Martin explained the details of the transcode library in his presentation at RubyKaigi. If you want to know about the old conversion API, “Introduction to Standard Library (3): Kconv, NKF, Iconv” is a good article.



    • String#encode Method

      The String#encode and String#encode! methods are the most basic encoding conversion tools. These methods accept options as a Hash. With them, we can specify the behavior when an illegal byte sequence is found in the original String (:invalid => nil | :replace) or when a character undefined in the destination encoding is found (:undef => nil | :replace). We can also use the options to escape converted strings for embedding in XML documents (:xml => :text | :attr) and to replace non-LF newline characters with a line feed (LF).

      # coding: UTF-8
      u = "いろは"
      puts u.dump #=> "\u{3044}\u{308d}\u{306f}"
      p u.encoding #=> #<Encoding:UTF-8>

      e = u.encode("EUC-JP") # Equivalent to u.encode("EUC-JP", u.encoding)
      puts e.dump #=> "\xA4\xA4\xA4\xED\xA4\xCF"

      The Encoding::Converter#convert method, explained later, is useful for tailoring conversions in more ways than this.
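
      The options described in this section might be used as in the following sketch. The helper to_euc_jp_lossy is a hypothetical name, and the :universal_newline option name is taken from current Ruby releases; treat it as an assumption for Ruby 1.9.1.

```ruby
# to_euc_jp_lossy is a hypothetical helper: characters that cannot be
# converted are replaced with "?" instead of raising an exception.
def to_euc_jp_lossy(str)
  str.encode("EUC-JP", :invalid => :replace, :undef => :replace, :replace => "?")
end

# U+20BB7 is used here as a character with no EUC-JP mapping.
p to_euc_jp_lossy("a\u{20bb7}b")

# :xml => :text escapes &, < and > for use in XML text.
p "<a & b>".encode("UTF-8", :xml => :text)

# Assumption: option name as in current Ruby releases.
p "one\r\ntwo\rthree".encode("UTF-8", :universal_newline => true)
```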


    • Encoding::Converter Class

      The Encoding::Converter class was created by pulling the converter out of the transcode library, and it allows us to convert in a more sophisticated way. For example, we can use the Encoding::Converter#convert and Encoding::Converter#primitive_convert methods. The latter gives us really fine-grained control.

      When we create an instance of the Encoding::Converter class, we can give its constructor either source and destination encodings or an array that specifies a conversion path. In addition, the following constants, as well as the conversion options used in the String#encode method, can be set in the option field of the constructor.

      * Encoding::Converter::INVALID_REPLACE
      * Encoding::Converter::UNDEF_REPLACE
      * Encoding::Converter::UNDEF_HEX_CHARREF
      * Encoding::Converter::UNIVERSAL_NEWLINE_DECORATOR
      * Encoding::Converter::CRLF_NEWLINE_DECORATOR
      * Encoding::Converter::CR_NEWLINE_DECORATOR
      * Encoding::Converter::XML_TEXT_DECORATOR
      * Encoding::Converter::XML_ATTR_CONTENT_DECORATOR
      * Encoding::Converter::XML_ATTR_QUOTE_DECORATOR




      • Encoding::Converter#convert Method

        The Encoding::Converter#convert method has the unique feature of treating the given String as one part of a longer source and converting only that part. Thus, when we convert strings read chunk by chunk from a stream with Encoding::Converter#convert, we don’t need to worry about byte sequences that look invalid only because they are cut off at a chunk boundary.

        ec = Encoding::Converter.new("EUC-JP", "UTF-8")
        dst = ec.convert("あいうえお")

        If an invalid byte sequence is found while executing the Encoding::Converter#convert method, Ruby raises Encoding::InvalidByteSequenceError. In the case of an undefined character, Ruby raises Encoding::UndefinedConversionError. The Encoding::Converter#convert method is unable to resume the conversion once an exception is raised. Instead, we can use the Encoding::Converter#primitive_convert method described next to escape invalid and undefined characters or to specify detailed behaviors.
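
        A sketch of converting a stream piece by piece with Encoding::Converter#convert follows. The helper convert_chunks is a hypothetical name, and the chunks are cut by hand with String#byteslice (assumed to be available, as in current Ruby releases) to imitate reads that end in the middle of a character.

```ruby
# convert_chunks is a hypothetical helper that feeds chunks to one converter.
def convert_chunks(chunks, from, to)
  ec = Encoding::Converter.new(from, to)
  converted = chunks.map { |chunk| ec.convert(chunk) }
  converted << ec.finish          # flush whatever the converter still holds
  converted.join
end

euc = "\u3042\u3044\u3046".encode("EUC-JP")   # 6 bytes, 2 per character
# Cut in the middle of the second character to imitate a stream boundary.
chunks = [euc.byteslice(0, 3), euc.byteslice(3, 3)]

p convert_chunks(chunks, "EUC-JP", "UTF-8")
```

        The same converter object must be reused for every chunk, because it is the converter that remembers the unfinished byte between calls.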


      • Encoding::Converter#primitive_convert Method

        The Encoding::Converter#primitive_convert method is the best, and only, way to specify fine-grained behaviors for a converter. Using this method, we can keep portability and specify how to handle invalid and undefined characters during the conversion.

        ec = Encoding::Converter.new("UTF-8", "EUC-JP")
        src = "abc\x81あいう\u{20bb7}\xe3"
        dst = ''

        begin
          ret = ec.primitive_convert(src, dst)
          p [ret, src, dst, ec.primitive_errinfo]
          case ret
          when :invalid_byte_sequence
            # primitive_errinfo[3] holds the offending bytes; insert them escaped
            ec.insert_output(ec.primitive_errinfo[3].dump[1..-2])
            redo
          when :undefined_conversion
            # primitive_errinfo[1] is the source encoding of the failed conversion
            c = ec.primitive_errinfo[3].dup.force_encoding(ec.primitive_errinfo[1])
            ec.insert_output('\x{%X:%s}' % [c.ord, c.encoding])
            redo
          when :incomplete_input
            ec.insert_output(ec.primitive_errinfo[3].dump[1..-2])
          when :finished
          end
          break
        end while nil

        The example above converts a string while escaping invalid byte sequences and undefined characters in the destination. As the code shows, we branch on the return value and continue using the information returned by Encoding::Converter#primitive_errinfo.



    • Kconv

      Kconv has not changed since Ruby 1.8, except that it now attaches an appropriate encoding to the converted string. Changes made to NKF, which works as the back end of Kconv, should not affect this library.


    • NKF

      NKF is a wrapper of a traditional Kanji code conversion library, nkf. The library equivalent to nkf 2.0.9 is bundled in Ruby 1.9.1.

      * NKF
      * The nkf Project


    • Iconv

      Iconv is a wrapper of the character conversion library bundled with Unix platforms. The supported encodings and the behavior of this library depend on the Unix distribution, so in Ruby 1.9.1 we should use the transcode-based methods instead.

      * Iconv
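
      As a rough sketch of that switch, assuming the old code called Iconv.conv("EUC-JP", "UTF-8", text), the same conversion can be written with String#encode. The helper name to_euc_jp is hypothetical.

```ruby
# Hypothetical replacement for Iconv.conv("EUC-JP", "UTF-8", text).
# String#encode takes the destination encoding first, then the source.
def to_euc_jp(text)
  text.encode("EUC-JP", "UTF-8")
end
```

      Because String#encode is built on Ruby's own transcoder, its result does not depend on which iconv implementation the platform ships.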



  • Into Action


    • Designing Libraries

      If you are planning to create a library that works on Ruby 1.9, you should write its source code in US-ASCII only. Although non-ASCII String literals can be used with the magic comment, you should keep your library from relying on a specific encoding or character set. When you need a specific encoding in your library, check the encoding of the given method argument. For example, when you write a Hiragana to/from Katakana conversion library, you should avoid writing a fixed Hiragana/Katakana map. Instead, you should convert using the encoding that corresponds to the encoding of the method argument. I recommend not modifying the original strings.
      Your library should choose an output encoding according to the priority below:

      1. The encoding of a user-supplied argument, if it is given

      2. Encoding.default_internal. String#encode will do this by default, like so:

         def output target_encoding = Encoding.default_internal
           library_string.encode target_encoding
         end

      3. The encoding of the original string
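
      The priority above can be sketched as a single fallback chain. The names choose_output_encoding and output, and the requested parameter, are hypothetical and only illustrate the order. The sketch also covers the case where Encoding.default_internal is nil.

```ruby
# Hypothetical helper: pick the output encoding in priority order.
#   1. the encoding the caller asked for, if any
#   2. Encoding.default_internal, if it is set
#   3. the encoding of the original string
def choose_output_encoding(original, requested = nil)
  requested || Encoding.default_internal || original.encoding
end

# Returns a new String; the original is left untouched.
def output(library_string, requested = nil)
  library_string.encode(choose_output_encoding(library_string, requested))
end
```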


      Basically, you should use the Code Set Independent (CSI) model in your library unless it is huge or defined to use Unicode such as XML or YAML.


    • 1.9 CSI

      * You should write your source code in US-ASCII only.
      * You should handle strings literally as “String” objects, and process them one character at a time.
      * You should not modify bytes or code points in your string processing.
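
      As an illustration of these rules, here is a minimal sketch of the Hiragana to Katakana case mentioned earlier. The constant and method names are hypothetical, and this is one possible approach rather than a recommended implementation: the source stays US-ASCII, the map is built in the argument's encoding, and the string is processed one character at a time.

```ruby
# Code points are used instead of kana literals so this source stays US-ASCII.
HIRAGANA = (0x3041..0x3093).map { |cp| cp.chr(Encoding::UTF_8) }
KATAKANA = (0x30A1..0x30F3).map { |cp| cp.chr(Encoding::UTF_8) }

# Hypothetical helper: returns a new String in the same encoding as str.
def to_katakana(str)
  enc = str.encoding
  # Build the map in the argument's encoding rather than in a fixed one.
  table = {}
  HIRAGANA.zip(KATAKANA) { |h, k| table[h.encode(enc)] = k.encode(enc) }
  # Work character by character; bytes are never touched directly.
  str.each_char.map { |ch| table.fetch(ch, ch) }.join.force_encoding(enc)
rescue Encoding::UndefinedConversionError, Encoding::ConverterNotFoundError
  str.dup # this encoding cannot represent kana, so there is nothing to convert
end
```

      Rebuilding the table on every call keeps the sketch short; a real library would probably cache one table per encoding.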


    • 1.9 UCS

      * Choose one encoding for the UCS model (UTF-8, EUC-JP, or whatever you like).
      * Set Encoding.default_internal to the chosen encoding.
      * Convert characters into the UCS only if they are not already in it.
      * Handle characters and bytes separately, as in the CSI model.
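
      A minimal sketch of the "convert only if needed" step, assuming UTF-8 was chosen as the UCS. INTERNAL_ENCODING and to_internal are hypothetical names; the sketch converts explicitly at the library boundary and does not itself set Encoding.default_internal.

```ruby
# Hypothetical choice of the single internal encoding (the UCS).
INTERNAL_ENCODING = Encoding::UTF_8

# Hypothetical helper: bring a string into the internal encoding.
def to_internal(str)
  # Already in the UCS: return it as is, with no conversion.
  return str if str.encoding == INTERNAL_ENCODING
  # Otherwise transcode; String#encode returns a new String.
  str.encode(INTERNAL_ENCODING)
end
```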


    • 1.8 Compatible UCS

      * Ruby 1.8 could have only one encoding for its internal processing. Thus, the UCS model is the only choice.
      * Since automatically converted results often cause trouble in later processing, Encoding.default_internal should be nil.
      * You should use Ruby 1.9’s string processing methods. If you want to use methods of Ruby 1.8.7 or older, you need to redefine them by yourself.
      * Also, you should use Ruby 1.9’s byte processing methods. If you want to use methods of Ruby 1.8.7 or older, you need to redefine them by yourself.


    • 1.8 Compatible CSI

      This is very hard to achieve, since the regular expression engine can handle only SJIS, EUC, and UTF8.




  • Revision History


    • February 12, 2009

      - Added Ruby 1.9.1’s behavior towards UCS-2
      - Added more description about Unicode
      - Added filesystem encoding section
      - Modified locale encoding section based on Usa-san’s opinion
      - Changed characters of magic comment examples to lower case letters reflecting Mamamoto-san’s suggestion


    • February 8, 2009

      - Added Index
      - Added further details to BOM preceded UTF-16 and UTF-32
      - Added explanation to Locale Encoding and Filesystem Encoding sections
      - Added brief information about Encoding constant
      - Added a few more descriptions about a replica encoding
      - Revised String#inspect section and added information about Kernel#p
      - Added brief information about IO#binread
      - Revised many words and phrases



  • About the Author

    Yui Naruse. nkf is in the author's field of work.