Wednesday, September 03, 2008

Ruby 1.9's Unicode Regular Expression

Ruby 1.9 has greatly improved its M17N (multilingualization) features, and Unicode regular expressions are among the most improved. Ruby 1.9 uses Oniguruma as its regular expression engine, so patterns can match by Unicode codepoint or by property name, as described in Oniguruma's documentation at http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt.

When I tested some Unicode regular expressions with ruby 1.9.0 (2008-08-26 revision 18849) [i386-darwin9.4.0], they were processed correctly. For example, this Ruby script,

# encoding: UTF-8

p 'abcアイウαβγ'.scan(/[a-z]/) # lower case alphabetical characters
p 'abcアイウαβγ'.scan(/\p{Katakana}/) # Katakana characters
p 'abcアイウαβγ'.scan(/\p{^Greek}/) # negation: other than Greek characters
p 'abcアイウαβγ'.scan(/[\u0370-\u30FF]/) # unicode codepoints from Greek to Katakana blocks

outputs this:

["a", "b", "c"]
["ア", "イ", "ウ"]
["a", "b", "c", "ア", "イ", "ウ"]
["ア", "イ", "ウ", "α", "β", "γ"]
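
The last pattern picks up both Katakana and Greek because its range runs from U+0370 (the start of the Greek block) to U+30FF (the end of the Katakana block). Here is a small sketch to check where each character falls. The constant and method names are made up for illustration.

```ruby
# encoding: UTF-8

# Same bounds as the character class /[\u0370-\u30FF]/.
GREEK_TO_KATAKANA = (0x0370..0x30FF)

# Pair each character with its codepoint and whether it falls in the range.
# codepoint_report is a hypothetical helper, not a Ruby built-in.
def codepoint_report(text)
  text.each_char.map do |ch|
    [ch, format('U+%04X', ch.ord), GREEK_TO_KATAKANA.cover?(ch.ord)]
  end
end

codepoint_report('abcアイウαβγ').each { |row| p row }
```

Since the range is defined purely by codepoints, it also covers everything between the two blocks, such as Cyrillic and Hiragana. The property names are the tighter choice when only one script is wanted.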


The first line of the Ruby script is a magic comment, which specifies the encoding of the script file. We can use any one of

# coding: UTF-8
# encoding: UTF-8
# -*- coding: UTF-8 -*-
# vim:set fileencoding=UTF-8:

to tell Ruby what encoding the script file uses. If the file starts with a shebang (#!), the magic comment goes on the second line of the file. (I found this information at http://i.loveruby.net/svn/rubydoc/doctree/trunk/refm/doc/spec/m17n.rd, which is written in Japanese; I don't know where to find an English version of this document.)
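
To see what the magic comment actually does, a script can ask Ruby which source encoding it picked. This is only a minimal sketch and assumes the file really is saved as UTF-8:

```ruby
# encoding: UTF-8

# __ENCODING__ reports the source encoding Ruby chose for this file.
puts __ENCODING__

# String literals in the file are tagged with that encoding.
sample = 'アイウ'
puts sample.encoding
puts sample.valid_encoding?
```

On Ruby 2.0 and later the default source encoding is already UTF-8, so there the comment only matters for files saved in some other encoding.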

JRuby 1.1.4 was released recently and has started to support Ruby 1.9; however, Unicode regular expressions are not on the list of supported features. When I ran this script on JRuby 1.1.4 with the --1.9 flag, I got an "invalid character property name {Katakana}: /\p{Katakana}/ (RegexpError)" error. JRuby's regular expression engine, JONI, is a Java implementation of Oniguruma, but it doesn't seem to work exactly the same as Ruby 1.9's.
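
One way to find out which property names an interpreter accepts, without letting the script die on the RegexpError, is to compile the pattern at run time and rescue the error. This is just a sketch: property_supported? is a hypothetical helper of my own, and I have only reasoned about its behavior on C Ruby, not on JRuby.

```ruby
# encoding: UTF-8

# Returns true when the running interpreter can compile \p{name}
# as a UTF-8 pattern, false when it raises RegexpError.
# property_supported? is a hypothetical helper, not part of Ruby or JRuby.
def property_supported?(name)
  Regexp.new("\\p{#{name}}".encode(Encoding::UTF_8))
  true
rescue RegexpError
  false
end

%w[Katakana Greek NoSuchProperty].each do |name|
  puts "#{name}: #{property_supported?(name)}"
end
```

Building the pattern with Regexp.new matters here: a regexp literal with an unknown property name is rejected when the file is parsed, before any rescue can run.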

9 comments:

lopex said...

JOni currently has the USE_UNICODE_PROPERTIES flag turned off, since the Unicode tables would make the JRuby distribution a bit bulkier and 1.9 support is just beginning. With this flag on, you should never see any differences between Oniguruma and JOni (well, the truth is that the Oniguruma integrated into 1.9 has diverged a good bit from its original). JOni can currently match both, using the VANILLA flag in places where they differ. To check that Oniguruma syntax works under JRuby now, with USE_UNICODE_PROPERTIES off, you can choose the EUCJP encoding: /\p{Katakana}/e.
Also, JOni supports all the encodings 1.9/Oniguruma supports (just the Encoding ruby class, transcoders and related API are not done yet, we're working on it).
All this stuff is still a moving target in 1.9 which makes it all more difficult to trace.

yokolet said...

Thanks for commenting, but I got exactly the same RegexpError after compiling jcodings, joni and jruby with USE_UNICODE_PROPERTIES set to true. I used joni 1.0.3 since Config.java in joni's trunk doesn't have the USE_UNICODE_PROPERTIES option. Do I need to do something more?

Anyway, I got a correct result when I tried /\p{Katakana}/e, but this is kind of a pain. These days, EUC-JP is not a platform default encoding. I have Ubuntu and Mac OS X, both of which use UTF-8 as their default encoding. So I needed to do something unusual to save a script file in EUC-JP. Besides, what about Greek?

JRuby has an alternative for Unicode regular expressions: doing it the Java way. This alternative is not as nice as Ruby's regular expressions, but it gives correct results.

I hope JRuby's regular expressions will work like Ruby 1.9's in the future.

lopex said...

First, you need to check out the joni 1.0 branch: http://svn.codehaus.org/jruby/joni/branches/joni-1_0/ (joni trunk contains the extracted encoding framework with changed packages, and jruby trunk is not compatible with that change yet).
Then set USE_UNICODE_PROPERTIES in src/org/joni/Config.java, then (under the joni source dir):

mvn clean package

Then copy target/joni.jar into build_lib under jruby source tree, then:

ant clean jar

As the JRuby parser is not yet aware of 1.9 preambles like 'coding: utf-8', you'll need to specify the UTF-8 encoding the 1.8 way. There are two ways, either:

$KCODE = "utf8"
/\p{Greek}/

or use 'u' regex option:
/\p{Greek}/u

Unfortunately, 1.8 doesn't allow specifying encodings other than EUCJP, SJIS, ASCII or UTF-8. But it's just a matter of exposing those other encodings to Ruby.

yokolet said...

Thank you! Finally, I got Unicode regular expressions running on JRuby the Ruby way!

But some didn't work as you described. I'll write another entry about this.

lopex said...

If there's any incompatibility, you can consider it as a bug. Feel free to file an issue: http://jira.codehaus.org/browse/JRUBY.

yokolet said...

All right, thanks.
In this case, I wasn't sure what part of Ruby 1.9 was supported in JRuby 1.1.4, so I hesitated to file this in JIRA.

Wolf said...

Regular expressions are really wonderful for parsing HTML or matching patterns. I use them a lot when I code. Actually, when I learn any new language, the first thing I check is whether it supports regex. I feel at ease when I find that it does.

Here is a post about Ruby regex. I wrote it when I first learned Ruby regex, so it should be helpful for new coders.

http://icfun.blogspot.com/2008/04/ruby-regular-expression-handling.html

naruse said...

Recently, Run Paint Run Run wrote a document on Ruby's fork of Oniguruma.
http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/doc/re.rdoc?view=markup

And I'm changing the parts around Unicode properties.

yokolet said...

naruse-san, thanks for the new info. I'll write about this in a new post.