Friday, April 23, 2010

Finally, pure Java Nokogiri worked on Google App Engine

The pure Java version of Nokogiri is on its way to its very first release. It is "pure Java," not backed by libxml2, so people will expect Nokogiri to work on Google App Engine. In fact, running Nokogiri on GAE is one of the goals of the Java port. So... does it really work? Yes, it does! I managed to get it working on GAE. However, it took much longer than I expected, so I'm going to write down how I did it, hoping this will help *pure Java* Nokogiri users.

1. Create pure Java Nokogiri gem

See Nokogiri Java Port: Help Us Finish It! for details.
$ git clone git://
$ cd nokogiri
$ git checkout -b java origin/java
(use C Ruby for the following two commands)
$ sudo gem install racc rexical rake-compiler hoe
$ rake gem:dev:spec
$ jruby -S rake java:gem

You'll have a pure Java Nokogiri gem in the "pkg" directory. The name is something like nokogiri-

Then, install this gem into your JRuby.
$ jruby -S gem install [path to nokogiri]/pkg/nokogiri-

2. Create a web app project for GAE

I used the Google Plugin for Eclipse to create a web application project. This is a Java web application project; I chose a Java project because I wrote a servlet using JRuby Embed (RedBridge). Testing Nokogiri with JRuby Embed (RedBridge) is a good way to learn how it works.

3. Create one jar archive including Nokogiri gem

I could have put the Ruby libraries and jar archives of Nokogiri under WEB-INF, but instead I created a single jar archive, because GAE limits the number of files as well as the size of each file. Before creating the jar archive, you need to run "gem bundle" so that JRuby can load the Ruby files. Gem bundler is a good tool for gathering gems to put into a jar archive. See Using the New Gem Bundler Today to learn how to bundle gems.
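To see why a single jar helps, it can be useful to check how many files a gem tree contains and how large the biggest one is. The following is only a minimal sketch in plain Ruby: the helper name summarize_tree is hypothetical, and no specific GAE limits are assumed, so compare the numbers against whatever limits apply to your application.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not part of Nokogiri, bundler, or any GAE tooling):
# counts the regular files under dir and finds the largest file size in bytes.
# Note that this glob pattern skips dotfiles below dir.
def summarize_tree(dir)
  files = Dir.glob(File.join(dir, "**", "*")).select { |f| File.file?(f) }
  { :count => files.size, :largest => files.map { |f| File.size(f) }.max || 0 }
end

# Demo on a throwaway directory so the sketch runs anywhere.
Dir.mktmpdir do |tmp|
  FileUtils.mkdir_p(File.join(tmp, "lib"))
  File.open(File.join(tmp, "lib", "a.rb"), "w") { |f| f.write("x" * 10) }
  File.open(File.join(tmp, "b.jar"), "w") { |f| f.write("y" * 25) }
  summary = summarize_tree(tmp)
  puts "files: #{summary[:count]}, largest: #{summary[:largest]} bytes"
end
```

After bundling, the same helper could be pointed at the .gems/bundler_gems directory to see how many files would otherwise be uploaded one by one.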

First, install gem bundler.

$ jruby -S gem install bundler08

Go to the top directory of your web application. If you created the web application project using the Eclipse plugin, go to [project's top directory]/war. Then, create a "Gemfile." For example,

$ cd /Users/yoko/workspace/Dahlia/war
$ vi Gemfile

Here's an example of a Gemfile.

# Critical default settings:
bundle_path ".gems/bundler_gems"

# List gems to bundle here:
gem "nokogiri", ""

Make sure the nokogiri gem is installed in your JRuby (check with "jruby -S gem list"), then run:

$ jruby -S gem bundle

This command creates the .gems directory and puts the gems listed in the Gemfile under the .gems/bundler_gems directory.

Next, edit .gems/bundler_gems/environment.rb as follows:
require 'rbconfig'
engine = defined?(RUBY_ENGINE) ? RUBY_ENGINE : 'ruby'
version = Config::CONFIG['ruby_version']
#require File.expand_path("../#{engine}/#{version}/environment", __FILE__)
require File.dirname(__FILE__) + "/#{engine}/#{version}/environment"

File.expand_path tries to expand the path into a full path starting from "/" (root). Such a full path doesn't work as a load path in a Java-based web application. This environment.rb file will be inside a jar archive, so the path should be something like "file:/base/data/home/apps/servletgarden-dahlia/17.341465334854239323/WEB-INF/lib/gems.jar!/bundler_gems/jruby/1.8/environment." A path starting from "/" never works.
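The difference between the two lines can be tried out in isolation. This is a small sketch, not the actual bundler code: the helper names are hypothetical, the in-jar path is shortened for illustration, and "jruby" and "1.8" stand in for the values of RUBY_ENGINE and Config::CONFIG['ruby_version'].

```ruby
# Builds the path the way the edited environment.rb does: plain string
# concatenation, so a "file:...jar!/" prefix is passed through untouched.
def env_path_by_dirname(file, engine, version)
  File.dirname(file) + "/#{engine}/#{version}/environment"
end

# Builds the path the way the commented-out line does. File.expand_path
# normalizes to an absolute filesystem path, so the result depends on the
# interpreter and on the current directory.
def env_path_by_expand(file, engine, version)
  File.expand_path("../#{engine}/#{version}/environment", file)
end

# Hypothetical in-jar location of environment.rb (shortened for illustration).
file = "file:/base/WEB-INF/lib/gems.jar!/bundler_gems/environment.rb"

puts env_path_by_dirname(file, "jruby", "1.8")
puts env_path_by_expand(file, "jruby", "1.8")
```

The first line printed keeps the "file:" prefix and the "!/" jar separator exactly as given. The second is rewritten by the interpreter, and the exact result varies between Ruby implementations, which is why I avoided it here.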

Now that you have everything under the .gems/bundler_gems directory, the next step is to create a jar archive. For example, I did it like this:

$ cd .gems
$ jar -J-Duser.language=en cfv ../WEB-INF/lib/gems.jar bundler_gems/

In this case, gems.jar will be created in the WEB-INF/lib directory. If you change the top directory in the jar archive, you will need a load path setting different from the one I used in my servlet.
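The servlet later in this post builds its load path entry as "file:" + [WEB-INF path] + "/lib/gems.jar!/bundler_gems". As a rough illustration of how that entry follows from the jar layout, here is a sketch with a hypothetical helper, jar_load_path; the WEB-INF path and the alternative top directory name "vendor_gems" are made up for the example.

```ruby
# Hypothetical helper: composes a load path entry that points inside a jar.
#   web_inf - absolute path of the WEB-INF directory
#   jar     - jar file name under WEB-INF/lib
#   top_dir - top directory stored inside the jar
def jar_load_path(web_inf, jar, top_dir)
  "file:#{web_inf}/lib/#{jar}!/#{top_dir}"
end

# Matches the jar created above, where bundler_gems/ is the top directory.
puts jar_load_path("/path/to/war/WEB-INF", "gems.jar", "bundler_gems")

# With a different (made-up) top directory in the jar, the entry changes too.
puts jar_load_path("/path/to/war/WEB-INF", "gems.jar", "vendor_gems")
```

Whatever follows "!/" has to match the directory that "jar cfv" stored at the top of the archive.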

You need to do one more job here. The pure Java Nokogiri gem has six jar archives in it, and you need to move or copy them into the WEB-INF/lib directory so that they are loaded by your web application. The jar archives are in the .gems/bundler_gems/jruby/1.8/gems/nokogiri- directory. So, assuming you are in the .gems directory,

$ cp bundler_gems/jruby/1.8/gems/nokogiri-*.jar ../WEB-INF/lib/.
$ cp bundler_gems/jruby/1.8/gems/nokogiri- ../WEB-INF/lib/.

4. Have patched JRuby archive

JRuby 1.5.0RC2, released on April 28, 2010, fixed all the problems described here. Get JRuby 1.5.0RC2! Then all you need to do is create the two jar archives with the rake task.

This part was the most painful step in making Nokogiri work on GAE. Unfortunately, you can't do this on JRuby 1.5.0RC1. You need the latest JRuby from the git repo because of the problems below.
The first problem is that JRuby 1.5.0RC1's source archive doesn't have the gem directory, which contains the tool for splitting jruby-complete.jar into two jar archives. Because jruby-complete.jar is too big to upload to GAE, this jar file needs to be split into smaller jars. GAE has the --enable_jar_splitting option, but jruby-complete.jar is not just a bunch of .class files. It includes .rb files, which must be found under jruby.home. So, I don't think the --enable_jar_splitting option will work. This problem has already been fixed in master. If you have the latest JRuby,

$ ant dist
$ cd gem
$ jruby -S rake

will do the job.

The second problem is that JRuby Embed raises a NullPointerException from SystemPropertyCatcher. JRuby Embed didn't expect that the "java.class.path" system property might be null, but it is on GAE. This bug is also fixed in master.

However, the third problem is still being worked on. The third one, possibly a serious one, is that "require 'rbconfig'" fails on GAE. When RbConfigLibrary is loaded on JRuby, it raises a NullPointerException because Platform.ARCH is null on GAE. I filed this in JIRA with a patch. After that, Charles attached a new patch, which is supposed to be applied to the jruby-1_5 branch. It probably won't take long to resolve this issue. If you want to give Nokogiri a try on GAE now, apply the patch, build JRuby, and split jruby-complete.jar into two jars.

5. Write a Servlet

Here's a very simple Servlet that uses Nokogiri.
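package com.servletgarden.dahlia;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import javax.servlet.http.*;

import org.jruby.embed.LocalContextScope;
import org.jruby.embed.ScriptingContainer;

public class DahliaServlet extends HttpServlet {
    private ScriptingContainer container;
    // The XML string passed to Nokogiri::XML is empty in this listing.
    private String script =
        "doc = Nokogiri::XML \"\"\n" +
        "puts doc.to_xml";

    public void init() {
        // Point JRuby's load path at the bundler_gems directory inside gems.jar.
        String basepath = getServletContext().getRealPath("/WEB-INF");
        String[] paths = {"file:" + basepath + "/lib/gems.jar!/bundler_gems"};
        List<String> loadPaths = Arrays.asList(paths);
        container = new ScriptingContainer(LocalContextScope.SINGLETHREAD);
        // Assumed step: the listing is cut off before loadPaths is used.
        container.setLoadPaths(loadPaths);
    }

    public void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.setContentType("text/plain; charset=UTF-8");
        synchronized (container) {
            // Assumed step: send the output of Ruby's "puts" to the response.
            container.setWriter(resp.getWriter());
            container.runScriptlet("require 'environment'");
            container.runScriptlet("require 'nokogiri'");
            // Assumed step: the listing is cut off before the script is run.
            container.runScriptlet(script);
        }
    }
}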

The load path setting in this servlet depends on which top directory you chose for gems.jar. Don't forget "require 'environment'", since this app uses gem bundler instead of rubygems.

If this servlet works successfully on GAE, you'll get this simple response in your browser.
<?xml version="1.0"?>

You can see this at

1 comment:

Shih-gian Lee said...

Hello Yoko,

I tried to build nokogiri by following Charles Nutter's instruction (jruby -S rake java:build) but received the following error:

(in /Users/shihgianlee/git/nokogiri)
warning: couldn't activate the debugging plugin, skipping
javac -g -cp /usr/local/jruby-1.5.0/lib/jruby.jar:../../lib/nekohtml.jar:../../lib/nekodtd.jar:../../lib/xercesImpl.jar:../../lib/isorelax.jar:../../lib/jing.jar nokogiri/*.java nokogiri/internals/*.java
nokogiri/internals/ method does not override a method from its superclass
nokogiri/internals/ method does not override a method from its superclass
nokogiri/internals/ method does not override a method from its superclass
3 errors
rake aborted!
Command failed with status (1): [javac -g -cp /usr/local/jruby-1.5.0/lib/jr...]
(See full trace by running task with --trace)

Do you know what I may be missing? I am running Java 1.5.0 on my Mac. Any help is much appreciated!