[MacRuby-devel] String performance (yet another)

Tue Jan 18 17:49:51 PST 2011

Hi Vincent,

I used force_encoding only for testing, as was suggested somewhere.  Using String#encode introduces another problem, so I don't want to use it in my app: it seems that I would have to call String#encode on every String object that is used together with the text read from a file.  This behavior is the same as in CRuby 1.9.2.

  File.read("test.txt").force_encoding("UTF-16LE").split("\n")

This script raises an error (Encoding::CompatibilityError), even when I declare the character encoding at the top of the script or when the file itself is encoded as UTF-16LE.
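To make the problem concrete, here is a minimal sketch of what I mean.  It assumes a short in-memory string instead of test.txt, and the variable names are only for illustration.  The separator passed to split has to be in the same encoding as the receiver, which is why every literal ends up needing its own encode call:

```ruby
# Stand-in for the file contents; test.txt itself is not reproduced here.
utf8_text = "first line\nsecond line"

# encode transcodes the bytes, so the result is real UTF-16LE text.
utf16_text = utf8_text.encode(Encoding::UTF_16LE)

# A plain "\n" literal is not UTF-16LE. Mixing it with UTF-16LE text is
# the situation that raised Encoding::CompatibilityError in my test.
begin
  utf16_text.split("\n")
rescue Encoding::CompatibilityError => e
  puts "split with a plain separator failed: #{e.message}"
end

# Transcoding the separator as well makes the two strings compatible.
newline = "\n".encode(Encoding::UTF_16LE)
lines = utf16_text.split(newline)

# Transcode back to UTF-8 only for display.
lines_utf8 = lines.map { |line| line.encode(Encoding::UTF_8) }
puts lines_utf8.inspect
```

The same applies to search words, regular expressions and any other String that touches the UTF-16LE text.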

Anyway, I found an error in my original post.  The MacRuby 0.8 results were actually 10 times faster than I reported.

> *Ruby 1.8.7                                       0.0019  0.0018  0.0017
> Ruby 1.9.2                                        0.029   0.030   0.029
> **MacRuby 0.8                                     0.0028  0.0025  0.0028
> MacRuby 0.9 2011/01/16                            0.18    0.17    0.18
> MacRuby 0.9 2011/01/16 (with encode("UTF-16LE"))  0.0023  0.0029  0.0021

So because of the changes made on 2010/12/17, MacRuby's String now behaves more like Ruby 1.9.2 than 1.8.7, and because of the slow object allocator this operation is slow (slower even than 1.9.2)?  I'm not sure what changed internally, but this is too slow compared to 0.8.
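For anyone who wants to compare the two cases themselves, here is a rough sketch of a timing harness.  It is not the exact script behind the numbers above: the generated text is a stand-in for test.txt, slice_loop is a hypothetical helper name, and the loop body follows the snippet quoted further down.

```ruby
require 'benchmark'

# Hypothetical stand-in for test.txt: UTF-8 text containing non-ASCII
# characters, so character offsets differ from byte offsets.
text = "r\u00E9sum\u00E9 " * 700

# Repeats the slicing loop from the quoted script and returns the last slice.
def slice_loop(text)
  last = nil
  1000.times do |i|
    last = text[i, i + 30]   # String#[start, length]
  end
  last
end

utf8_time  = Benchmark.realtime { slice_loop(text) }

# Same characters, transcoded to a fixed-width-friendly encoding.
utf16_text = text.encode(Encoding::UTF_16LE)
utf16_time = Benchmark.realtime { slice_loop(utf16_text) }

puts format("UTF-8:    %.4f s", utf8_time)
puts format("UTF-16LE: %.4f s", utf16_time)
```

The absolute values will depend on the machine and the Ruby implementation, so only the ratio between the two lines is meaningful.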

How much faster could the object allocator be expected to become compared to the current version, if it is going to be improved at all?  (I assume that optimization comes after 1.0.)

My app is a text analysis tool, and KWIC (keyword in context) is its main feature.  With my simple test script, the KWIC processing time on MacRuby was 20+ times slower than on Ruby 1.8.7/1.9.2 (depending on the search word).
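To show why String#[] matters so much here, this is a simplified sketch of a KWIC lookup.  It is not the code from my app: kwic is a hypothetical helper, and the context width is an arbitrary choice.  Every hit needs one index search and two character-indexed slices:

```ruby
# Hypothetical helper: returns [left context, keyword, right context]
# for every occurrence of keyword, with up to width characters per side.
def kwic(text, keyword, width = 30)
  return [] if keyword.empty?   # avoid looping forever on an empty keyword

  hits = []
  pos = 0
  while (idx = text.index(keyword, pos))
    left_start = [idx - width, 0].max
    left  = text[left_start, idx - left_start]   # characters before the hit
    right = text[idx + keyword.length, width]    # characters after the hit
    hits << [left, keyword, right]
    pos = idx + keyword.length                   # continue after this hit
  end
  hits
end

kwic("the cat sat on the mat", "the", 4).each do |left, word, right|
  puts "#{left.rjust(4)}[#{word}]#{right}"
end
```

With a frequent search word the two slices run thousands of times, so any slowdown in String#[] is multiplied accordingly.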

Thanks,
Yasu

On 2011/01/17, at 13:19, Vincent Isambart wrote:

> Hi,
> 
>> Indeed, String#[] will now perform slower on UTF8 non-ascii strings, because
>> computing the character index cannot be done in constant time anymore.
>> I don't believe this can be improved using the optimization we implemented
>> for #gsub and #scan. Maybe 1.9.2 has a better optimization, I will let
>> Vincent comment :)
> 
>> text = File.read("test.txt")
>> 1000.times do |i|
>> a = text[i,i+30]
>> end
> 
> In fact I already use the cache to get the offset for the end index.
> I just had a look at 1.9.2 and what they do is pretty similar to what
> we do. I would not be surprised if the difference was mainly due to
> the object allocator being much slower in MacRuby.
> I would need to profile with Shark to be sure, but I would not expect much
> improvement on String#[] soon.
> 
> And by the way to try with UTF-16 you should not use force_encoding
> but encode, and not UTF-16BE but LE:
> text = text.encode(Encoding::UTF_16LE)
> because the fastest encoding is UTF-16LE and not BE (the native
> encoding on x86 is little endian), and on a UTF-8 string, forcing the
> encoding to ASCII or BINARY(ASCII-8BIT) would make sense (as all ASCII
> characters are the same in UTF-8 and ASCII) but forcing it to UTF-16
> would give you a meaningless string full of strange characters.
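The difference described above can be seen in a minimal sketch, assuming a four-character ASCII string as input (the variable names are only illustrative).  force_encoding keeps the bytes and changes the label, while encode rewrites the bytes:

```ruby
utf8 = "abcd"

# force_encoding only relabels: the same four bytes are now read as
# two UTF-16LE code units.
relabeled = utf8.dup.force_encoding(Encoding::UTF_16LE)

# encode transcodes: each character becomes its own UTF-16LE code unit.
transcoded = utf8.encode(Encoding::UTF_16LE)

puts relabeled.bytes.inspect     # [97, 98, 99, 100]
puts relabeled.length            # 2
puts transcoded.bytes.inspect    # [97, 0, 98, 0, 99, 0, 100, 0]
puts transcoded.length           # 4

# Read as UTF-16LE, the relabeled bytes are U+6261 and U+6463,
# which have nothing to do with the original text.
puts relabeled.encode(Encoding::UTF_8) == "\u6261\u6463"   # true
```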