[MacRuby-devel] Strings, Encodings and IO

Tue Apr 7 15:23:16 PDT 2009

Vincent Isambart wrote:
> Hi again,
> 
> So plan B: We emulate Ruby 1.9 strings behavior on top of NSString/NSData.

I'm really interested in this discussion too. A little background for JRuby:

We started out (or really, the original authors started out) with JRuby 
using all Java strings and stringbuffers for Ruby's String. This made 
interop easy and of course simplified implementation, but it ran into 
the same problems you enumerated in your original post:

* Java's strings are all UTF-16. In order to represent binary data, we 
ended up using a "raw" encoder/decoder and only using the bottom byte of 
each character. Wasteful, since every string was 2x as large, and slow, 
since IO had to up/downcast byte[] contents to char[] and back.
* So about two years ago we moved to using byte[]-based strings 
throughout JRuby. This gave us maximum compatibility, maximum 
performance, and a future-proof path, but it damages interop. Currently, 
whenever you pass a string across the Ruby/Java boundary, we have to 
decode or encode it. That affects performance pretty severely. We also 
had to implement our own regular expression engine, since no Java regex 
library works with byte[].
* We want to move to an intermediate version, where we sometimes have a 
byte[]-backed string and sometimes a char[]/String-backed string. 
IronRuby does this already. This is, however, predicated on the idea 
that byte[]-based strings rarely become char[]-based strings and vice 
versa. I don't have any evidence for or against that yet.
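
To make the "raw" approach from the first bullet concrete, here is a 
rough sketch in Ruby 1.9 terms rather than Java. It is not JRuby's 
actual code: raw_widen and raw_narrow are hypothetical names, and 
ISO-8859-1 stands in for the "raw" encoder, since it maps each byte to 
the code point of the same value.

```ruby
# Hypothetical helpers for illustration only; not taken from JRuby.

# Store each byte as one 16-bit UTF-16 code unit, using only the bottom byte.
def raw_widen(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_16LE)
end

# Recover the original bytes by dropping the (always zero) top byte.
def raw_narrow(chars)
  chars.encode(Encoding::ISO_8859_1).force_encoding(Encoding::BINARY)
end

data = (0..255).to_a.pack("C*")  # one of every possible byte value
wide = raw_widen(data)

puts data.bytesize               # 256
puts wide.bytesize               # 512, twice the storage for the same data
puts raw_narrow(wide) == data    # true, the round trip is lossless
```

The round trip is lossless, but every read and write pays for a full 
copy in each direction, which is the IO cost described above.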

So it's a nearly identical problem for MacRuby, as I understand it. I'm 
interested in discussion around this topic, since we are still moving 
forward with JRuby and would like to improve interop with Java 
libraries. I will offer the following food for thought:

* Going with 100% objc strings at first is probably a good pragmatic 
start. You'll have the perf/memory impact of encoding/decoding and 
wasteful string contents, but you should be able to get it functioning 
well. And since interop is a primary goal for MacRuby (where it's been 
somewhat secondary in JRuby) this is probably a better place to start.
* We have considered having a modified 1.9 mode that normalizes all 
strings into UTF-16 internally. That might work for you as well. I 
presume there are byte-decoding APIs in objc that could produce your 
standard strings. You'd be able to at least pretend you support 1.9 
encodings, while actually transcoding to and from the user's selected 
encoding. It wouldn't be fast, but it would work.
* Alternatively, you could only support a minimum set of encodings and 
make it explicit that internally everything would be UTF-16 or MacRoman. 
In MacRuby's case, I think most people would happily accept that, just 
as a lot of JRuby users would probably accept that everything's UTF-16 
since that's what they get from Java normally.
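
As a rough illustration of the "normalize to UTF-16 internally" idea 
in the second bullet, something along these lines is what I have in 
mind. This is only a sketch: NormalizedString and its methods are 
made-up names, not anything from MacRuby or JRuby, and a real 
implementation would sit on NSString rather than on a Ruby string.

```ruby
# Hypothetical wrapper for illustration only.
class NormalizedString
  attr_reader :declared_encoding

  # bytes: raw bytes as they arrived; declared_encoding: what the user says they are.
  def initialize(bytes, declared_encoding)
    @declared_encoding = Encoding.find(declared_encoding.to_s)
    # Decode once on the way in; everything is held as UTF-16 from here on.
    @utf16 = bytes.dup.force_encoding(@declared_encoding).encode(Encoding::UTF_16LE)
  end

  # Character count comes straight from the internal UTF-16 form.
  def length
    @utf16.length
  end

  # Transcode back to the user's encoding every time it is asked for.
  def to_external
    @utf16.encode(@declared_encoding)
  end

  def bytesize
    to_external.bytesize
  end
end

raw = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack("C*")  # UTF-8 bytes for "café"
s = NormalizedString.new(raw, "UTF-8")

puts s.length                # 4
puts s.bytesize              # 5
puts s.to_external.encoding  # UTF-8
```

Note that this sketch raises as soon as the bytes are not valid in the 
declared encoding, whereas a 1.9 string can hold invalid byte sequences. 
That is one of the places where the "pretend" part would show through.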

Ultimately this is the exact reason I argued over a year ago that Ruby 
1.9 should introduce a separate Bytes class used for IO. I was denied.

It's definitely a sticky issue, and Ruby has made it even stickier in 
1.9 with arbitrary encoding support. None of the proposed solutions 
across all implementations (including JRuby) have really seemed ideal to me.

- Charlie