Here is what I did to enable the Unicode flag and get correct output.
1. Check out jcodings from http://svn.codehaus.org/jruby/jcodings/ (joni depends on it).
2. cd jcodings; mvn clean install
3. Check out joni-1_0 from http://svn.codehaus.org/jruby/joni/branches/joni-1_0/ (exactly this version is required).
4. cd joni-1_0
5. Edit src/org/joni/Config.java and set USE_UNICODE_PROPERTIES to true.
6. mvn clean package
7. cp target/joni.jar <somewhere>/jruby-1.1.4/build_lib/.
8. cd <somewhere>/jruby-1.1.4
9. ant clean jar
With that, I could build a customized version of JRuby, which should be compliant with Unicode regular expressions. When I ran this UTF-8 encoded Ruby script,
p 'abcアイウαβγ'.scan(/[a-z]/)
p "abcアイウαβγ".scan(/\p{Katakana}/u)
print "abcアイウαβγ".scan(/\p{Katakana}/u), "\n\n"
p "abcアイウαβγ".scan(/\p{^Greek}/u)
print "abcアイウαβγ".scan(/\p{^Greek}/u), "\n\n"
p "abcアイウαβγ".scan(/[\u0370-\u30FF]/u)
print "abcアイウαβγ".scan(/[\u0370-\u30FF]/u), "\n"
$KCODE="utf8"
p "abcアイウαβγ".scan(/\p{Greek}/)
it printed out:
["a", "b", "c"]
["\343\202\242", "\343\202\244", "\343\202\246"]
アイウ
["a", "b", "c", "\343\202\242", "\343\202\244", "\343\202\246"]
abcアイウ
["a", "b", "c"]
abc
["α", "β", "γ"]
Everything worked except the Unicode code point range from Greek to Katakana ([\u0370-\u30FF]), which matched only the ASCII letters. (Ruby 1.9 showed readable characters with both p and print, but JRuby's p printed escaped bytes.)
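For reference, here is a small sketch of what the code point range test was expected to select. It avoids regular expressions entirely and just compares code points obtained with unpack("U*"), so it does not depend on any engine feature. The helper name codepoints_in_range is made up for this illustration.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper for illustration: returns the code points of +text+
# that fall inside +range+, in their original order.
def codepoints_in_range(text, range)
  text.unpack("U*").select { |cp| range.include?(cp) }
end

text  = "abcアイウαβγ"
range = 0x0370..0x30FF   # same bounds as the [\u0370-\u30FF] character class

# Show every character's code point and whether it lies in the range.
text.unpack("U*").each do |cp|
  printf("U+%04X %s\n", cp, range.include?(cp) ? "in range" : "outside")
end

# Rebuild a string from the matching code points only.
puts codepoints_in_range(text, range).pack("U*")
```

The three ASCII letters lie below U+0370, while the Katakana and Greek characters lie inside the range, so an engine that handled the range would presumably have returned those six characters rather than "a", "b", "c".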
Of course, when I ran this script on stock JRuby 1.1.4, I got the error "unicode_regex.rb:2: invalid character property name {Katakana}: /\p{Katakana}/u (RegexpError)".
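Because a regexp literal with an unknown property name makes the whole script fail, it can help to probe for a property at run time instead. The following is only a sketch: property_supported? is a hypothetical helper, not part of JRuby or joni, and it is written against current Ruby. I have not verified how it behaves on JRuby 1.1.4, where an encoding option may also be needed.

```ruby
# Hypothetical helper (not a JRuby or joni API): returns true when the
# running regexp engine can resolve the given \p{...} property name.
# Building the pattern with Regexp.new defers the check to run time,
# so a failed lookup can be rescued instead of aborting the script.
def property_supported?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  false
end

%w[Katakana Greek].each do |name|
  status = property_supported?(name) ? "available" : "not available"
  puts "#{name}: #{status}"
end
```

A script could use such a check to fall back to explicit character ranges when the property tables are not compiled in.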
Following lopex's comment, I wrote this Ruby script in EUC-JP encoding and ran it on regular JRuby 1.1.4.
p "abcアイウαβγ".scan(/\p{Katakana}/e)
print "abcアイウαβγ".scan(/\p{Katakana}/e),"\n"
print "abcアイウαβγ".scan(/\p{Greek}/e),"\n"
Naturally, the last line raised the error "unicode_regexp_eucjp.rb:6: invalid character property name {Greek}: /\p{Greek}/e (RegexpError)" regardless of the regular expression's encoding option. However, the first two lines worked and output:
["\245\242", "\245\244", "\245\246"]
アイウ
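The escapes that p prints are the raw bytes of each character in the script's encoding, written in octal. The sketch below reproduces them for one character in both UTF-8 and EUC-JP. It assumes a Ruby with String#encode (1.9 or later), which is not what JRuby 1.1.4 provides, and octal_escape is a name invented for this example.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper for illustration: renders a string the way the
# output above shows it, with every non-ASCII byte as a three-digit
# octal escape and ASCII bytes left as they are.
def octal_escape(str)
  str.bytes.map { |b| b < 0x80 ? b.chr : format("\\%03o", b) }.join
end

katakana_a = "ア"

# Three bytes in UTF-8.
puts octal_escape(katakana_a)                    # prints \343\202\242

# Two bytes in EUC-JP.
puts octal_escape(katakana_a.encode("EUC-JP"))   # prints \245\242
```

This is why the same Katakana match shows up as three escapes per character in the UTF-8 run and two per character in the EUC-JP run.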
JRuby already has the ability to handle Unicode regular expressions in the Ruby way; the feature is just turned off. Since Unicode regular expressions are useful for speakers of non-ASCII languages, I hope this feature will be turned on in the near future.
5 comments:
Let me clarify the whole thing. JRuby's regexp engine supports UTF-8 just as 1.9/Oniguruma does. It is aware of UTF-8 multibyte characters when the unicode option is set. It also supports the full Oniguruma syntax (\p{...} and so on). The only feature turned off by default right now is the code range tables (namely USE_UNICODE_PROPERTIES), nothing more.
As far as I tested, no version of JRuby 1.1.x recognizes the /\p{property name}/u syntax in UTF-8 encoded files. The syntax worked only when the USE_UNICODE_PROPERTIES option was turned on. So JRuby's regular expressions do not support the Unicode property name syntax for UTF-8 encoding. Also, the Unicode code point range syntax did not work even with USE_UNICODE_PROPERTIES turned on.
Is this a bug?
It's not a bug; it's more a missing piece of 1.9-mode functionality, which JRuby doesn't officially support yet.
In 1.9's re.c, the Unicode range is escaped in the unescape_nonascii/unescape_unicode_bmp functions.
Regarding property syntax: the \p{...} syntax is always recognized and works even without USE_UNICODE_PROPERTIES set. What fails is the property name lookup, not the syntax itself.
Regular expressions are really wonderful for parsing HTML or matching patterns. I use them a lot when I code. In fact, when I learn any new language, the first thing I check is whether it supports regex. I feel at ease when I find that it does.
http://icfun.blogspot.com/2008/04/ruby-regular-expression-handling.html
Here is a post about Ruby regex. I wrote it when I was first learning Ruby regex, so it should be helpful for new coders.