[MacRuby-devel] Strings, Encodings and IO
Charles Oliver Nutter
charles.nutter at sun.com
Tue Apr 7 15:23:16 PDT 2009
Vincent Isambart wrote:
> Hi again,
> So plan B: We emulate Ruby 1.9 string behavior on top of NSString/NSData.
I'm really interested in this discussion too. A little background for JRuby:
We started out (or really, the original authors started out) with JRuby
using all Java strings and stringbuffers for Ruby's String. This made
interop easy and of course simplified implementation, but it ran into
the same problems you enumerated in your original post:
* Java's strings are all UTF-16. In order to represent binary data, we
ended up using a "raw" encoder/decoder and only using the bottom byte of
each character. Wasteful, since every string was 2x as large, and slow,
since IO had to up/downcast byte contents to char and back.
* So we made a move about two years ago to using all byte-based
strings in JRuby. This allowed us maximum compatibility, maximum
performance, and a future-proof path, but it damages interop. Currently
whenever you pass a string across the Ruby/Java boundary, we have to do
the decode/encode. That affects performance pretty severely. We also had
to implement our own regular expression engine, since no Java regex
engine works on byte arrays.
* We want to move to an intermediate version, where we sometimes have a
byte-backed string and sometimes a char/String-backed string.
IronRuby does this already. This is, however, predicated on the idea
that byte-based strings rarely become char-based strings and vice
versa. I don't have any evidence for or against that yet.
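The cost described in the first bullet can be illustrated with a minimal sketch. It is written in Ruby 1.9+ rather than Java, it is not JRuby's actual code, and the helper names `widen_raw` and `narrow_raw` are hypothetical. It assumes the "raw" encoder behaves like ISO-8859-1, where each byte maps to the code point of the same value:

```ruby
# Hypothetical helpers imitating a "raw" encoder/decoder on top of
# UTF-16 storage. Assumption: raw means byte N <-> code point N.

# Store binary data in 2-byte UTF-16 code units, using only the low byte.
def widen_raw(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_16LE)
end

# Recover the original bytes by dropping the unused high byte of each unit.
def narrow_raw(chars)
  chars.encode(Encoding::ISO_8859_1).force_encoding(Encoding::BINARY)
end

binary = [0x00, 0x41, 0x80, 0xFF].pack("C*")  # 4 bytes of binary data
wide   = widen_raw(binary)

puts binary.bytesize             # 4
puts wide.bytesize               # 8, twice the original size
puts narrow_raw(wide) == binary  # true, but only after a full conversion pass
```

Every read or write of binary data pays for one of these conversion passes, which is the up/downcast overhead mentioned above.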
So it's a nearly identical problem for MacRuby, as I understand it. I'm
interested in discussion around this topic, since we are still moving
forward with JRuby and would like to improve interop with Java
libraries. I will offer the following food for thought:
* Going with 100% objc strings at first is probably a good pragmatic
start. You'll have the perf/memory impact of encoding/decoding and
wasteful string contents, but you should be able to get it functioning
well. And since interop is a primary goal for MacRuby (where it's been
somewhat secondary in JRuby) this is probably a better place to start.
* We have considered having a modified 1.9 mode that normalizes all
strings into UTF-16 internally. That might work for you as well. I
presume there are byte-decoding APIs in objc that could produce your
standard strings. You'd be able to at least pretend you support 1.9
encodings, while actually transcoding to and from the user's selected
encoding. It wouldn't be fast, but it would work.
* Alternatively, you could only support a minimum set of encodings and
make it explicit that internally everything would be UTF-16 or MacRoman.
In MacRuby's case, I think most people would happily accept that, just
as a lot of JRuby users would probably accept that everything's UTF-16
since that's what they get from Java normally.
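The "normalize everything to UTF-16 internally" idea from the second bullet could look roughly like the following. This is a sketch in plain Ruby 1.9+ under stated assumptions, not a MacRuby or JRuby design: the class name `NormalizedString`, its methods, and the choice of UTF-16LE as the internal form are all hypothetical.

```ruby
# Hypothetical wrapper: text is always held as UTF-16 internally, and the
# encoding the user selected is only remembered for use at the boundaries.
class NormalizedString
  INTERNAL = Encoding::UTF_16LE  # assumed internal representation

  attr_reader :external_encoding

  # bytes: raw input; external_encoding: the Encoding the user selected.
  # Raises if the bytes are not valid in that encoding.
  def initialize(bytes, external_encoding)
    @external_encoding = external_encoding
    tagged = bytes.dup.force_encoding(external_encoding)
    @internal = tagged.encode(INTERNAL)  # decode once, on the way in
  end

  # Character count, computed on the internal form.
  def length
    @internal.length
  end

  def internal_bytesize
    @internal.bytesize
  end

  # Transcode on the way out; this is the slow part.
  def to_external(encoding = @external_encoding)
    @internal.encode(encoding).force_encoding(Encoding::BINARY)
  end
end

raw = "h\u00E9llo".dup.force_encoding(Encoding::BINARY)  # 6 UTF-8 bytes
s   = NormalizedString.new(raw, Encoding::UTF_8)

puts s.length                                      # 5
puts s.internal_bytesize                           # 10
puts s.to_external == raw                          # true
puts s.to_external(Encoding::ISO_8859_1).bytesize  # 5
```

The string still reports the encoding the user asked for, but every trip in or out costs a transcode, which matches the "not fast, but it would work" trade-off.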
Ultimately this is the exact reason I argued over a year ago that Ruby
1.9 should introduce a separate Bytes class used for IO. I was denied.
It's definitely a sticky issue, and Ruby has made it even stickier in
1.9 with arbitrary encoding support. None of the proposed solutions
across all implementations (including JRuby) have really seemed ideal to me.