Thursday, September 04, 2008

JRuby's unicode regular expression

In my previous post, I wrote that JRuby's regular expression engine, Joni, didn't handle Unicode regular expressions the way Ruby's does, even though both engines are based on Oniguruma. But the truth is that Joni deliberately turns off the flag that enables the Unicode regular expression syntax described in Oniguruma's documentation, "since unicode tables would make jruby distribution a bit more boilerplate" (lopex). Lopex, who is an implementor of Joni, commented on my post. I followed what lopex wrote and got Unicode regular expressions running on JRuby 1.1.4 as if it were Ruby 1.9.

Here is what I did to enable the Unicode flag and get correct output.

1. Check out jcodings from http://svn.codehaus.org/jruby/jcodings/ because joni needs it.
2. cd jcodings; mvn clean install
3. Check out joni-1_0 from http://svn.codehaus.org/jruby/joni/branches/joni-1_0/. (It needs to be exactly this version.)
4. cd joni-1_0
5. Edit src/org/joni/Config.java and set USE_UNICODE_PROPERTIES to true.
6. mvn clean package
7. cp target/joni.jar <somewhere>/jruby-1.1.4/build_lib/.
8. cd <somewhere>/jruby-1.1.4
9. ant clean jar

Then I could build a customized version of JRuby, which should be compliant with Unicode regular expressions. When I ran this UTF-8 encoded Ruby script,


p 'abcアイウαβγ'.scan(/[a-z]/)
p "abcアイウαβγ".scan(/\p{Katakana}/u)
print "abcアイウαβγ".scan(/\p{Katakana}/u), "\n\n"
p "abcアイウαβγ".scan(/\p{^Greek}/u)
print "abcアイウαβγ".scan(/\p{^Greek}/u), "\n\n"
p "abcアイウαβγ".scan(/[\u0370-\u30FF]/u)
print "abcアイウαβγ".scan(/[\u0370-\u30FF]/u), "\n"

$KCODE="utf8"
p "abcアイウαβγ".scan(/\p{Greek}/)


it printed out:


["a", "b", "c"]
["\343\202\242", "\343\202\244", "\343\202\246"]
アイウ

["a", "b", "c", "\343\202\242", "\343\202\244", "\343\202\246"]
abcアイウ

["a", "b", "c"]
abc
["α", "β", "γ"]


Although the Unicode codepoint range from Greek to Katakana ([\u0370-\u30FF]) didn't work and matched only "a", "b", "c", the others were good. (Ruby 1.9 showed readable characters with both p and print, but JRuby's p didn't.)
Of course, I got the error "unicode_regex.rb:2: invalid character property name {Katakana}: /\p{Katakana}/u (RegexpError)" when I ran this script on regular JRuby 1.1.4.
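
About the codepoint range that returned only "a", "b", "c": one plausible reading is that the engine took "\u" as a plain "u", so the class became the literal characters u, 0, 3, 7, F plus the range "0".."u", which happens to contain the lowercase letters up to "u". This is only my guess at the cause. The sketch below, meant for Ruby 1.9 or later, compares that literal class with the range I actually intended, built from codepoints so it does not rely on \u support. Both method names are my own.

```ruby
# encoding: utf-8

# The class as it would look if "\u" were read as a plain "u"
# (an assumption about the cause, not confirmed from Joni's source).
def scan_as_literal(text)
  text.scan(Regexp.new('[u0370-u30FF]'))
end

# The intended class: codepoints U+0370..U+30FF, built with pack("U")
# so that no \u escape is needed inside the pattern.
def scan_as_codepoints(text)
  lo = [0x0370].pack("U")
  hi = [0x30FF].pack("U")
  text.scan(Regexp.new("[#{lo}-#{hi}]"))
end

text = "abcアイウαβγ"
p scan_as_literal(text)     # only a, b, c
p scan_as_codepoints(text)  # the Katakana and Greek characters
```

The first call reproduces the result I saw from the customized JRuby, and the second returns the six non-ASCII characters.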

Following lopex's comment, I wrote this Ruby script in EUC-JP encoding and ran it on regular JRuby 1.1.4.


p "abcアイウαβγ".scan(/\p{Katakana}/e)
print "abcアイウαβγ".scan(/\p{Katakana}/e),"\n"
print "abcアイウαβγ".scan(/\p{Greek}/e),"\n"


Naturally, the last line caused the error "unicode_regexp_eucjp.rb:6: invalid character property name {Greek}: /\p{Greek}/e (RegexpError)" whatever the encoding option of the regular expression was. However, the first two lines worked and output:


["\245\242", "\245\244", "\245\246"]
アイウ
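
The escaped strings that p prints are the same characters, shown byte by byte in octal. As a quick check, here is a small sketch for Ruby 1.9 or later that prints the bytes of "ア" in both encodings. The helper name octal_escapes is my own and is not part of Ruby or JRuby.

```ruby
# encoding: utf-8

# Render every byte of a string as a backslash-octal escape,
# similar to how p displayed non-ASCII bytes in the output above.
def octal_escapes(str)
  str.bytes.map { |b| format("\\%o", b) }.join
end

puts octal_escapes("ア")                   # \343\202\242 (UTF-8)
puts octal_escapes("ア".encode("EUC-JP"))  # \245\242     (EUC-JP)
```

The two lines match the first element of each array printed by p in the UTF-8 and EUC-JP runs.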


JRuby already has the ability to handle Unicode regular expressions in the Ruby way, but this feature is just turned off. Since Unicode regular expressions are useful for speakers of non-ASCII languages, I hope this feature will be turned on in the near future.
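
Until then, a script that has to run on both a stock build and a customized one could probe for property support before using it. This is a minimal sketch, assuming Ruby 1.9 or later semantics. On a 1.8-mode runtime I would presumably also need $KCODE set as in the script above. The names property_supported? and katakana_pattern are hypothetical helpers of my own, and the fallback covers only the Katakana block U+30A0..U+30FF, which is an approximation of the property.

```ruby
# encoding: utf-8

# True if this runtime can resolve the property name in \p{name}.
# The pattern is built with Regexp.new so that a lookup failure is raised
# at run time as RegexpError instead of stopping the script while it loads.
def property_supported?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  false
end

# Use the property when available, else fall back to an explicit range.
def katakana_pattern
  if property_supported?("Katakana")
    Regexp.new('\p{Katakana}')
  else
    # Rough fallback: the Katakana block only, not the full property.
    Regexp.new("[#{[0x30A0].pack('U')}-#{[0x30FF].pack('U')}]")
  end
end

p "abcアイウαβγ".scan(katakana_pattern)
```

A regexp literal such as /\p{Katakana}/u would not do here, because the error is raised as soon as the file is loaded, as the "unicode_regex.rb:2" message above shows.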

5 comments:

lopex said...

Let me clarify the whole thing. JRuby's regexp engine supports utf-8 just as 1.9/Oniguruma does. It is utf-8 multibyte character aware when the unicode option is set. It also supports full Oniguruma syntax (the \p{...} etc.). The only feature that is turned off by default now is the code range tables (namely USE_UNICODE_PROPERTIES), nothing more.

yokolet said...

As far as I tested, no version of JRuby 1.1.x recognizes the /\p{property name}/u syntax in UTF-8 encoded files. The /\p{property name}/u syntax worked only when the USE_UNICODE_PROPERTIES option was turned on. So, JRuby's regular expression engine does not support the Unicode property name syntax for UTF-8 encoding. And the Unicode codepoint range syntax did not work even though USE_UNICODE_PROPERTIES was turned on.
Is this a bug?

lopex said...

It's not a bug, more a missing functionality from 1.9 mode which JRuby doesn't officially support yet.
In 1.9 re.c the unicode range is escaped in unescape_nonascii/unescape_unicode_bmp functions.

lopex said...

Re property syntax: the \p{...} syntax is always recognized and works even without USE_UNICODE_PROPERTIES set. The thing that fails is the property name lookup, not the syntax itself.

Wolf said...

Regular expressions are really wonderful for parsing HTML or matching patterns. I use them a lot when I code. Actually, when I learn any new language, the first thing I check is whether it supports regex or not. I feel at ease when I find that it does.

http://icfun.blogspot.com/2008/04/ruby-regular-expression-handling.html

This is about Ruby regex. I posted it when I was first learning Ruby regex, so it should be helpful for new coders.