Thursday, April 14, 2011

Nokogiri on Google App Engine

Nokogiri 1.5.0 is on its way and should be out soon. This version is also the first release of pure Java Nokogiri. We call it *pure Java*, but the name is not quite precise: since it is written half in Ruby and half in Java, *pure JRuby* (as pragdave called it) would be the better name. This pure JRuby version implements the methods that the CRuby version implements in C, using Xerces, NekoHTML, Jing, and a couple more Java tools, while the CRuby version uses libxml and libxslt. When people use Nokogiri 1.5.0 on JRuby, they use the pure Java version.
What's the beauty of pure Java Nokogiri? It works smoothly on any platform where Java runs. On OS X, Linux, Windows, and even Google App Engine, Nokogiri starts working painlessly. The most frequently asked questions about Nokogiri are "I can't install Nokogiri" and "Nokogiri doesn't work." Pure Java Nokogiri avoids these problems, since there is no native library to build.
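Since which Nokogiri you get depends on the interpreter, it can help to confirm which one is running. Here is a minimal sketch in plain Ruby; the helper name ruby_runtime is made up for illustration, and it relies only on the JRUBY_VERSION constant that JRuby defines.

```ruby
# Hypothetical helper (not part of Nokogiri): reports which interpreter
# is running this script. JRuby defines the JRUBY_VERSION constant.
def ruby_runtime
  if defined?(JRUBY_VERSION)
    "JRuby #{JRUBY_VERSION}"
  else
    "CRuby or another interpreter, Ruby #{RUBY_VERSION}"
  end
end

puts ruby_runtime
```

On JRuby this prints the JRuby version, which is the case where the pure Java Nokogiri is used.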


To see that pure Java Nokogiri works fine, I gave it a try on Google App Engine (GAE). As you know, GAE supports only Python and Java, so using libxml is out of the question. In short, pure Java Nokogiri just worked. Easy. (Unexpectedly, I struggled to get GAE itself working, so I'll write down how I did it.) Although I don't have much to write about, I'm going to note what I did for people who don't know they can use Nokogiri on GAE.


First, I installed the gems following the instructions at https://gist.github.com/825451. The instructions say, "Do not use rvm," but I used rvm anyway. Using rvm is not the problem; the RubyGems version is. After I installed Ruby 1.8.7 using rvm, I downgraded RubyGems to 1.3.7. Don't forget: the google-appengine gem needs RubyGems 1.3.7 or earlier. Otherwise, bundler08 will fail to install the gem command *bundle*, which ends in an error when the appengine gem tries to install gems into the .gems/bundler_gems/jruby/1.8/gems directory. Make sure *bundle* is listed when you type "gem help commands." See http://groups.google.com/group/appengine-jruby/browse_thread/thread/2db62b1a51896098 for details.
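The RubyGems version requirement can be checked from Ruby itself before installing anything. This is only a sketch: the helper name rubygems_ok_for_appengine? and the constant MAX_RUBYGEMS are made up for illustration, and the 1.3.7 limit is simply the one described above.

```ruby
require 'rubygems'

# Assumption taken from the text above: google-appengine needs
# RubyGems 1.3.7 or earlier.
MAX_RUBYGEMS = Gem::Version.new('1.3.7')

# Hypothetical helper: true when the given RubyGems version is at or
# below the limit. Defaults to the version of the running RubyGems.
def rubygems_ok_for_appengine?(version = Gem::VERSION)
  Gem::Version.new(version) <= MAX_RUBYGEMS
end

if rubygems_ok_for_appengine?
  puts "RubyGems #{Gem::VERSION} should be fine for google-appengine"
else
  puts "RubyGems #{Gem::VERSION} is newer than 1.3.7; consider downgrading"
end
```

Gem::Version compares versions segment by segment, so "1.3.10" is correctly treated as newer than "1.3.7", which a plain string comparison would get wrong.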

You do need CRuby, but you don't need to install JRuby. The appengine gem will install the jruby-jars gem when it is needed. The jruby-jars gem has JRuby's stdlib in a jar archive, and JRuby gets started from this jar archive. So, the google-appengine gem mostly works on CRuby and uses the jruby-jars gem when JRuby is needed. Therefore, all gems should be installed on CRuby. Below is what I did.

rvm 1.8.7
sudo gem install google-appengine   # since I installed rvm to /usr/local, I need sudo
sudo gem install rails -v 2.3.11
sudo gem install rails_dm_datastore
sudo gem install activerecord-nulldb-adapter
mkdir rails_app; cd rails_app
curl -O http://appengine-jruby.googlecode.com/hg/demos/rails2/rails2311_appengine.rb
ruby rails2311_appengine.rb

Then the Rails app is ready to run. To start the app on a development server:

./script/server.sh

This should start Jetty with the Rails app running on it.

However, I was among the unlucky people. I got a segmentation fault because my Java was Java SE 6 Update 4 for Mac OS X. After some googling, I followed "Comment 39" of http://code.google.com/p/googleappengine/issues/detail?id=4712. I didn't want to downgrade the JDK, but there seemed to be no better choice. Anyway, the Rails app worked successfully on Update 3.


Next, I added Nokogiri to the Gemfile. Currently, 1.5.0.beta.4 is the latest.

gem 'nokogiri', '1.5.0.beta.4'

One more thing. The latest version of the jruby-jars gem is 1.6.1 but, sadly, the jar archive in the gem is too big to upload; JRuby 1.6.1 grew bigger. As far as I remember, 1.6.0 is also too big to upload. Again, a downgrade came in. I used version 1.5.6, and my Gemfile ended up as below:

# Critical default settings:
disable_system_gems
disable_rubygems
bundle_path '.gems/bundler_gems'

# List gems to bundle here:
gem 'rails_dm_datastore'
gem 'jruby-jars', '1.5.6'
gem 'jruby-openssl'
gem 'jruby-rack', '1.0.5'
gem 'rails', '2.3.11'
gem 'nokogiri', '1.5.0.beta.4'



OK, my platform is ready. Let's create a simple Nokogiri sample. In this sample, I fetched the RSS feed from cnn.com (http://rss.cnn.com/rss/cnn_topstories.rss), parsed it using Nokogiri, and displayed a news list. Since this is just a simple Nokogiri sample, I generated only a controller.

./script/generate controller newsfeeds index

The RSS I used looked like https://gist.github.com/921058. From this XML document, I collected the item elements using XPath. Then I extracted the pubDate, title, link, and description child elements of each item, also using XPath.

# newsfeeds_controller.rb
require 'nokogiri'
require 'open-uri'

class Entry
  attr_reader :title, :url, :description, :pubdate

  def initialize(title, url, description, pubdate)
    @title = title
    @url = url
    @description = description
    @pubdate = pubdate
  end
end

class NewsfeedsController < ApplicationController
  def index
    doc = Nokogiri::XML(open("http://rss.cnn.com/rss/cnn_topstories.rss"))
    items = doc.xpath("//item")
    @entries = []
    items.each do |item|
      title = item.xpath("title").text
      url = item.xpath("link").text
      description = item.xpath("description").text
      pubdate = item.xpath("pubDate").text
      @entries << Entry.new(title, url, description, pubdate)
    end
  end
end

# newsfeeds/index.html.erb
<h1>Newsfeeds#index</h1>
<% @entries.each do |entry| %>
  <dl>
    <dt><%= entry.pubdate %></dt>
    <dt><b><%= entry.title %></b> [<%= link_to("Read", entry.url) %>]</dt>
    <dt><%= entry.description %></dt>
  </dl>
<% end %>

When I restarted the server with ./script/server.sh and requested http://localhost:8080/newsfeeds/, I could see the news list.



The last thing I did was upload the app. I set my application id on the "application:" line in WEB-INF/app.yaml, then uploaded it with ./script/publish.sh. Now my Nokogiri sample is running at http://4.latest.servletgarden-in-red.appspot.com/newsfeeds/.


Finally, here is a link to a blog post that talks about Nokogiri on Google App Engine. It should be helpful, too.

- Google App Engine, JRuby, Sinatra and some fun!


So far, pure Java Nokogiri has worked just fine on Google App Engine. Give it a try!
