Friday, April 23, 2010

Finally, pure Java Nokogiri worked on Google App Engine

The pure Java version of Nokogiri is on its way to its very first release. Because it is "pure Java," not backed by libxml2, people will expect Nokogiri to work on Google App Engine. In fact, running Nokogiri on GAE is one of the purposes of the Java port. So, does it really work? Yes, it does! I managed to get it working on GAE. However, it took much longer than I expected, so I'm going to write down how I did it, hoping this will help *pure Java* Nokogiri users.

1. Create pure Java Nokogiri gem

See Nokogiri Java Port: Help Us Finish It! for details.
$ git clone git://
$ cd nokogiri
$ git checkout -b java origin/java
(use C Ruby for the following two commands)
$ sudo gem install racc rexical rake-compiler hoe
$ rake gem:dev:spec
$ jruby -S rake java:gem

You'll have the pure Java Nokogiri gem in the "pkg" directory. The name is something like nokogiri-

Then, install this gem into your JRuby.
$ jruby -S gem install [path to nokogiri]/pkg/nokogiri-

2. Create a web app project for GAE

I used the Google Plugin for Eclipse to create a web application project. This is a Java web application project; I chose a Java project because I wrote a servlet using JRuby Embed (RedBridge). Testing Nokogiri with JRuby Embed (RedBridge) is a good way to learn how it works.

3. Create one jar archive including Nokogiri gem

I could have put the Ruby libraries and jar archives of Nokogiri under WEB-INF, but instead I created a single jar archive, because GAE limits the number of files as well as the size of each file. Before creating the jar archive, you need to run "gem bundle" so that JRuby can load the Ruby files. Gem bundler is a good tool for gathering gems together to create a jar archive. See Using the New Gem Bundler Today to learn how to bundle gems.

First, install gem bundler.

$ jruby -S gem install bundler08

Go to the top directory of your web application. If you created the web application project with the Eclipse plugin, go to [project's top directory]/war. Then, create a "Gemfile." For example,

$ cd /Users/yoko/workspace/Dahlia/war
$ vi Gemfile

Here's an example Gemfile.

# Critical default settings:
bundle_path ".gems/bundler_gems"

# List gems to bundle here:
gem "nokogiri", ""

Make sure the nokogiri gem is installed in your JRuby by running "jruby -S gem list."

$ jruby -S gem bundle

This command creates the .gems directory and puts the gems listed in the Gemfile under the .gems/bundler_gems directory.

Next, edit .gems/bundler_gems/environment.rb as shown below:
require 'rbconfig'
engine = defined?(RUBY_ENGINE) ? RUBY_ENGINE : 'ruby'
version = Config::CONFIG['ruby_version']
#require File.expand_path("../#{engine}/#{version}/environment", __FILE__)
require File.dirname(__FILE__) + "/#{engine}/#{version}/environment"

File.expand_path tries to expand the path to a full path starting from "/" (root). Such a full path doesn't work as a load path in a Java-based web application. This environment.rb file will be inside a jar archive, so the path should be something like "file:/base/data/home/apps/servletgarden-dahlia/17.341465334854239323/WEB-INF/lib/gems.jar!/bundler_gems/jruby/1.8/environment." A path starting from "/" never works.
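
The effect is easy to reproduce outside Ruby. The following is a minimal sketch in plain Java, assuming that java.io.File's absolute-path resolution is a fair stand-in for File.expand_path here. The class name JarPathDemo and the shortened jar path are made up for illustration. It shows that resolving to an absolute path buries the "file:" prefix, while plain string concatenation (what File.dirname(__FILE__) + "..." does) preserves it.

```java
import java.io.File;

// Hypothetical demo class; not part of JRuby, Nokogiri, or bundler.
final class JarPathDemo {
    // Illustrative, shortened jar-internal location (not the real GAE path).
    static final String JAR_DIR = "file:/base/WEB-INF/lib/gems.jar!/bundler_gems";

    // Analogous to File.dirname(__FILE__) + "/...": plain concatenation.
    static String byConcatenation(String relative) {
        return JAR_DIR + "/" + relative;
    }

    // Analogous to File.expand_path: resolve against the file system.
    // The jar location is treated as a relative file name, so the current
    // working directory is prepended and the "file:" prefix is buried.
    static String byAbsoluteResolution(String relative) {
        return new File(JAR_DIR, relative).getAbsolutePath();
    }

    public static void main(String[] args) {
        System.out.println(byConcatenation("jruby/1.8/environment"));
        System.out.println(byAbsoluteResolution("jruby/1.8/environment"));
    }
}
```

The second printed path depends on the directory the program runs in, which is exactly why it is useless as a load path inside a jar.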

Now that you have everything under the .gems/bundler_gems directory, the next thing to do is create a jar archive. For example, I did the following:

$ cd .gems
$ jar -J-Duser.language=en cfv ../WEB-INF/lib/gems.jar bundler_gems/

In this case, gems.jar will be created in the WEB-INF/lib directory. If you change the top directory in the jar archive, you will need a load path setting different from the one I used in my servlet.

You need to do one more job here. The pure Java Nokogiri gem has six jar archives in it. However, you need to move or copy these Java archives into the WEB-INF/lib directory so that they are loaded by your web application. The jar archives are in the .gems/bundler_gems/jruby/1.8/gems/nokogiri- directory. So, assuming you are in the .gems directory,

$ cp bundler_gems/jruby/1.8/gems/nokogiri-*.jar ../WEB-INF/lib/.
$ cp bundler_gems/jruby/1.8/gems/nokogiri- ../WEB-INF/lib/.

4. Have patched JRuby archive

JRuby 1.5.0RC2, released on April 28, 2010, fixed all the problems described here. Get JRuby 1.5.0RC2! Then all you need to do is create the two jar archives with the rake task.

This part was the most painful in making Nokogiri work on GAE. Unfortunately, you can't do this on JRuby 1.5.0RC1. You need the latest JRuby from the git repo because of the following problems.
The first problem is that JRuby 1.5.0RC1's source archive doesn't include the gem directory, which contains the tool that splits jruby-complete.jar into two jar archives. Because jruby-complete.jar is too big to upload to GAE, this jar file needs to be split into smaller jars. GAE has the --enable_jar_splitting option, but jruby-complete.jar is not just a bunch of .class files. It includes .rb files, which should be found under jruby.home. So, I don't think the --enable_jar_splitting option will work. This problem has already been fixed in master. If you have the latest JRuby,

$ ant dist
$ cd gem
$ jruby -S rake

will do the job.

The second problem is that JRuby Embed raises a NullPointerException from SystemPropertyCatcher. JRuby Embed didn't anticipate that the "java.class.path" system property might be null, but it is on GAE. This bug is also fixed in master.
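
To make the failure mode concrete, here is a small sketch of reading a path-like system property defensively. It is not the actual JRuby fix; SafePathProperty and its method are hypothetical names, and the sketch only assumes that System.getProperty returns null for an unset property.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; this is not the actual JRuby fix.
final class SafePathProperty {
    // Returns the entries of a path-like system property, or an empty list
    // when the property is unset (as "java.class.path" is on GAE).
    static List<String> entries(String propertyName) {
        String value = System.getProperty(propertyName);
        List<String> result = new ArrayList<String>();
        if (value == null || value.isEmpty()) {
            return result; // avoids the NullPointerException on value.split(...)
        }
        for (String entry : value.split(Pattern.quote(File.pathSeparator))) {
            if (!entry.isEmpty()) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(entries("java.class.path"));
    }
}
```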

However, the third problem is still being worked on. The third one, maybe a serious one, is that "require 'rbconfig'" fails on GAE. When RbConfigLibrary is loaded in JRuby, it raises a NullPointerException because Platform.ARCH is null on GAE. I filed this in JIRA with a patch. After that, Charles attached a new patch, which is supposed to be applied to the jruby-1_5 branch. It probably won't take long to resolve this issue. If you want to give Nokogiri a try on GAE now, apply the patch, build JRuby, and split jruby-complete.jar into two jars.

5. Write a Servlet

Here's a very simple Servlet that uses Nokogiri.
package com.servletgarden.dahlia;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import javax.servlet.http.*;

import org.jruby.embed.LocalContextScope;
import org.jruby.embed.ScriptingContainer;

public class DahliaServlet extends HttpServlet {
    private ScriptingContainer container;
    private String script =
        "doc = Nokogiri::XML \"\"\n" +
        "puts doc.to_xml";

    public void init() {
        // Point JRuby's load path at the top directory inside gems.jar.
        String basepath = getServletContext().getRealPath("/WEB-INF");
        String[] paths = {"file:" + basepath + "/lib/gems.jar!/bundler_gems"};
        List<String> loadPaths = Arrays.asList(paths);
        container = new ScriptingContainer(LocalContextScope.SINGLETHREAD);
        container.setLoadPaths(loadPaths);
    }

    public void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.setContentType("text/plain; charset=UTF-8");
        synchronized (container) {
            // Send Ruby's standard output to the HTTP response.
            container.setWriter(resp.getWriter());
            container.runScriptlet("require 'environment'");
            container.runScriptlet("require 'nokogiri'");
            container.runScriptlet(script);
        }
    }
}

The load path setting in this servlet depends on which top directory you chose for gems.jar. Don't forget "require 'environment'", since this app uses gem bundler instead of rubygems.
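
How the load path depends on the jar layout can be sketched as a tiny helper. GemJarLoadPath and its parameter names are hypothetical; only the "file:...jar!/..." form and the names gems.jar and bundler_gems come from the steps above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the names are illustrative, not part of RedBridge.
final class GemJarLoadPath {
    // Builds "file:<WEB-INF>/lib/<jar>!/<topDir>", the form used in init().
    static String build(String webInfPath, String jarName, String topDirInJar) {
        return "file:" + webInfPath + "/lib/" + jarName + "!/" + topDirInJar;
    }

    static List<String> loadPaths(String webInfPath) {
        // "bundler_gems" is the top directory chosen when gems.jar was created.
        return Arrays.asList(build(webInfPath, "gems.jar", "bundler_gems"));
    }

    public static void main(String[] args) {
        System.out.println(loadPaths("/app/WEB-INF"));
    }
}
```

If the jar had been created from inside bundler_gems instead of .gems, the third argument would have to change accordingly.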

If this servlet works successfully on GAE, you'll get this simple response in your browser.
<?xml version="1.0"?>

You can see this at

Friday, April 16, 2010

RedBridge: what is new and improved in JRuby 1.5.0RC1

As you may know, Tom Enebo announced the release of JRuby 1.5.0RC1 on Apr. 15, saying "aged like a fine wine." @headius tweeted "Over 1250 commits for JRuby 1.5, our largest amount of work ever for any individual release." The same goes for RedBridge. It has been improved since the last release based on user input. Its API went through many changes to become more useful and better organized. Although I've already written about all of these changes in this blog, I'm going to put them together here for convenience.

New and Deprecated Configuration API
RedBridge in JRuby 1.5.0 has a lot of Ruby runtime configuration methods. Before, those were available through the getProvider().getRubyInstanceConfig() method; however, this was not a good idea. Since that method exposes JRuby's internal API, users' code might be affected by internal API changes. This goes against the purpose of RedBridge. RedBridge should cover JRuby's internal API and absorb internal changes so that users don't need to fix their code themselves. Avoid using the getProvider().getRubyInstanceConfig() method as much as possible. If you want more runtime configuration methods, please ask us.

New runtime configuration methods of ScriptingContainer:

  • get/setInput
  • get/setOutput
  • get/setError
  • get/setCompileMode
  • get/setRunRubyInProcess
  • get/setCompatVersion
  • get/setObjectSpaceEnabled
  • get/setEnvironment
  • get/setCurrentDirectory
  • get/setHomeDirectory
  • get/setClassCache
  • get/setClassLoader
  • get/setProfile
  • get/setLoadServiceCreator
  • get/setArgv
  • get/setScriptFileName
  • get/setRecordSeparator
  • get/setKCode
  • get/setJITLogEvery
  • get/setJITThreshold
  • get/setJITMax
  • get/setJITMaxSize

Deprecated configuration methods:

  • getRuntime()
  • getProvider().setLoadPaths()
  • getProvider().setClassCache()

Usage example:

[JRuby 1.4.0]
ScriptingContainer container = new ScriptingContainer();

[JRuby 1.5.0]
ScriptingContainer container = new ScriptingContainer();

New Options
Also, RedBridge got two new options: SHARING_VARIABLES and TERMINATION. The first option, SHARING_VARIABLES, turns the variable sharing feature on or off. This is an essential feature for some users but useless for others. For the latter, sharing variables is just a source of performance degradation. When the feature is turned off, performance will be a bit better.
Usage example:

[Embed Core]
container.setAttribute(AttributeName.SHARING_VARIABLES, false);

[JSR223]
engine.getContext().setAttribute("org.jruby.embed.sharing.variables", false, ScriptContext.ENGINE_SCOPE);
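
The JSR223 form stores the option as an ordinary attribute in the engine scope of a ScriptContext. The sketch below shows only that mechanism, using the JDK's javax.script classes and no JRuby engine, so it says nothing about how RedBridge reads the value. SharingVariablesAttribute is a made-up class name; the attribute key is the one shown above.

```java
import javax.script.ScriptContext;
import javax.script.SimpleScriptContext;

// Hypothetical demo class; only the attribute key comes from the text above.
final class SharingVariablesAttribute {
    static final String KEY = "org.jruby.embed.sharing.variables";

    // Returns a context with variable sharing switched off in engine scope.
    static ScriptContext contextWithSharingOff() {
        ScriptContext context = new SimpleScriptContext();
        context.setAttribute(KEY, false, ScriptContext.ENGINE_SCOPE);
        return context;
    }

    public static void main(String[] args) {
        ScriptContext context = contextWithSharingOff();
        // Reads the flag back from the same scope it was stored in.
        System.out.println(context.getAttribute(KEY, ScriptContext.ENGINE_SCOPE));
    }
}
```

With a real engine, such a context could presumably be handed over with engine.setContext(context), which is equivalent in spirit to setting the attribute on engine.getContext() directly.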

The second option, TERMINATION, lets JSR223 users call terminate. This option was added in light of a change in RedBridge's behavior. This is a big change, so I'll discuss it in more detail.

Changed Behaviors

* No termination in each evaluation and method call

RedBridge in JRuby 1.4.0 always called the Ruby runtime's terminate method at the end of each evaluation and method call. This was to execute at_exit blocks and release resources automatically. The idea came from the JSR223 API, which doesn't define a terminate method. However, I eliminated this behavior in JRuby 1.5.0 for three reasons. The first is performance. The terminate() method is very slow and was the biggest culprit of bad performance. The more Ruby files, instance variables, and so on the Ruby code uses, the longer it takes. The second is to make RedBridge's behavior more natural. Since the terminate() method fired at_exit blocks automatically, users might get unexpected results when they used third-party libraries. Users should have a chance to fire at_exit blocks themselves. The third is that JRuby's memory leak was fixed, so RedBridge no longer needs to invoke the terminate method just to release resources. Make sure you call the terminate method in the right place.

Usage examples:

[Embed Core]

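ScriptingContainer container = null;
try {
    container = new ScriptingContainer();
    container.runScriptlet(PathType.CLASSPATH, testname);
} catch (Throwable t) {
    t.printStackTrace();
} finally {
    // Fire at_exit blocks and release resources explicitly.
    if (container != null) {
        container.terminate();
    }
}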


[JSR223]
ScriptEngineManager manager = new ScriptEngineManager();
ScriptEngine engine = manager.getEngineByName("jruby");
engine.eval("at_exit { puts \"#{$x} in an at_exit block\" }"); // nothing is printed here
engine.getContext().setAttribute(AttributeName.TERMINATION.toString(), true, ScriptContext.ENGINE_SCOPE);
engine.eval(""); // prints "GVar in an at_exit block"

* Global runtime

When a global Ruby runtime exists on a single JVM, the singleton model of RedBridge uses that global runtime in JRuby 1.5.0. This works behind the scenes and may not seem very attractive, but it is interesting. Here's a slightly tricky usage:

$ pwd
$ jruby --1.9 -Ctest -S irb
irb(main):001:0> require 'java'
=> true
=> org.jruby.embed.ScriptingContainer@771eb1
irb(main):003:0> container.compat_version <---- --1.9 option
=> RUBY1_9
irb(main):004:0> container.current_directory <---- -Ctest option
=> "/Users/yoko/Tools/jruby-1.5.0.RC1/test"
irb(main):009:0> container.home_directory <---- jruby home
=> "/Users/yoko/Tools/jruby-1.5.0.RC1"
irb(main):010:0> container.run_scriptlet "at_exit { puts \"see you, later\" }"
=> #<Proc:0x88a1b@<script>:1>
irb(main):011:0> container.terminate
see you, later
=> nil

This new behavior might be useful in a complicated application.
However, you should be aware that setting a runtime configuration doesn't work if the global runtime already exists. This is because the runtime configuration is read only when the runtime is instantiated. Be careful not to miss the right time to set the configuration.

* Lazy Runtime Initialization

RedBridge (in this case, I mean Embed Core) delays Ruby runtime initialization as long as possible. This is to improve ScriptingContainer's startup time. As you may know, loading the Ruby runtime is a huge job and takes a long time. This might cause frustration if it happened right after the ScriptingContainer was started. The question, then, is when the runtime is up and running. Some of ScriptingContainer's methods will wake the Ruby runtime up. Here's a list:

  • put()
  • runScriptlet()
  • setWriter()
  • resetWriter()
  • setErrorStream()
  • resetErrorStream()
  • setReader()

Thus, if you want configuration settings to take effect, you need to set them before calling any of these methods.
Meanwhile, the JSR223 implementation doesn't delay Ruby runtime initialization. Delaying it was not easy without breaking JSR223's requirements.
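
The ordering rule for Embed Core can be pictured with a toy model. This is not RedBridge's actual implementation; LazyRuntimeModel and the "default-home" value are invented for illustration. It only demonstrates that a setting read once at initialization is ignored when it is changed afterwards.

```java
// Toy model of lazy initialization; not RedBridge's actual implementation.
final class LazyRuntimeModel {
    private String homeDirectory = "default-home"; // pending configuration
    private String runtimeHome;                    // null until the runtime starts

    void setHomeDirectory(String dir) {
        homeDirectory = dir; // only affects a runtime that has not started yet
    }

    // Stands in for methods such as put() or runScriptlet() that start the runtime.
    // Returns the home directory the (modelled) runtime is actually using.
    String runScriptlet(String script) {
        if (runtimeHome == null) {
            runtimeHome = homeDirectory; // configuration is read once, here
        }
        return runtimeHome;
    }

    public static void main(String[] args) {
        LazyRuntimeModel late = new LazyRuntimeModel();
        late.runScriptlet("puts 1");          // runtime starts with the default
        late.setHomeDirectory("/opt/jruby");  // too late to take effect
        System.out.println(late.runScriptlet("puts 2"));
    }
}
```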

* Lazy Java Library Loading

RedBridge in JRuby 1.5.0 doesn't load the Java library while the Ruby runtime is initialized. This is also a performance improvement. Loading libraries onto the Ruby runtime is quite a cumbersome job: looking up the loaded-library tables to see whether the specified library has already been loaded, deciding how to load the library, then loading, caching, and so on. Even though the Java library is not loaded during initialization, it will be loaded internally if necessary. Or you can load the Java library explicitly:
container.runScriptlet("require 'java'");

Performance Tuning Tips
RedBridge's performance has improved compared to the older version, but you can tweak it a bit more. For example, you can remove variables from sharing or clear the sharing variable table at some point:
org.jruby.embed.ScriptingContainer#remove(String key)

RedBridge retrieves as many instance variables and constants as possible at the end of each evaluation and method call. All retrieved values are injected into the runtime when the next script or method is evaluated. You can cut down the injection time by removing unnecessary values.
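
As a rough mental model only, the sharing variable table can be thought of as a map whose size determines how much work the next injection has to do. SharedVariableTableModel below is a hypothetical stand-in, not RedBridge code; it simply mirrors the remove and clear operations mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-in for the sharing variable table; not RedBridge's real data structure.
final class SharedVariableTableModel {
    private final Map<String, Object> table = new LinkedHashMap<String, Object>();

    void put(String key, Object value) {
        table.put(key, value);
    }

    // Mirrors the idea of ScriptingContainer#remove(String key).
    Object remove(String key) {
        return table.remove(key);
    }

    void clear() {
        table.clear();
    }

    // Number of values that would have to be injected before the next evaluation.
    int pendingInjections() {
        return table.size();
    }
}
```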

Remaining jobs
RedBridge couldn't resolve every issue and still has work remaining. Among the remaining jobs, OSGi and configuration in the JSR223 implementation are the two big issues. I want to improve these by the final release of JRuby 1.5.0.

Finally, your input will help us make RedBridge a better API. Give it a try and report back to us!