[MacRuby-devel] Strings, Encodings and IO

Benjamin Stiglitz ben at tanjero.com
Tue Apr 7 22:23:29 PDT 2009


>> So plan B: We emulate Ruby 1.9 string behavior on top of  
>> NSString/NSData.
>
> I'm really interested in this discussion too. A little background  
> for JRuby:

Thanks for the background, Charlie. This sort of history is very  
instructive.

> * Java's strings are all UTF-16. In order to represent binary data,  
> we ended up using a "raw" encoder/decoder and only using the bottom  
> byte of each character. Wasteful, since every string was 2x as  
> large, and slow, since IO had to up/downcast byte[] contents to  
> char[] and back.

Most CFStrings use a UTF-16 internal store as well.

> * We want to move to an intermediate version, where we sometimes  
> have a byte[]-backed string and sometimes a char[]/String-backed  
> string. IronRuby does this already. This is, however, predicated on  
> the idea that byte[]-based strings rarely become char[]-based  
> strings and vice versa. I don't have any evidence for or against  
> that yet.
>
> So it's a nearly identical problem for MacRuby, as I understand it.  
> I'm interested in discussion around this topic, since we are still  
> moving forward with JRuby and would like to improve interop with  
> Java libraries. I will offer the following food for thought:
>
> * Going with 100% objc strings at first is probably a good pragmatic  
> start. You'll have the perf/memory impact of encoding/decoding and  
> wasteful string contents, but you should be able to get it  
> functioning well. And since interop is a primary goal for MacRuby  
> (where it's been somewhat secondary in JRuby) this is probably a  
> better place to start.

That’s where things stand today, and with Laurent’s ByteString work  
this all mostly works, as long as you don’t try to change a string’s  
encoding after the fact.

> * Alternatively, you could only support a minimum set of encodings  
> and make it explicit that internally everything would be UTF-16 or  
> MacRoman. In MacRuby's case, I think most people would happily  
> accept that, just as a lot of JRuby users would probably accept that  
> everything's UTF-16 since that's what they get from Java normally.

This seems like a bad fit given the varied encoding landscape of the  
Internet.

> Ultimately this is the exact reason I argued over a year ago that  
> Ruby 1.9 should introduce a separate Bytes class used for IO. I was  
> denied.

I was disappointed to see this turned down as well. The encoding  
situation in 1.9 feels worse than it was in 1.8, and that’s pretty  
impressive.

> It's definitely a sticky issue, and Ruby has made it even stickier  
> in 1.9 with arbitrary encoding support. None of the proposed  
> solutions across all implementations (including JRuby) have really  
> seemed ideal to me.

Laurent and I discussed this a bit tonight, and here’s what I think we  
can get away with:

* By default, store all strings as NSString (UTF-16 backed), with an  
ivar holding the encoding.
* When getting bytes, convert to a ByteString in the appropriate  
encoding.
* When doing force_encoding, convert to a ByteString in the old  
encoding, then try to convert to an NSString in the new encoding. If  
we succeed, great. If not, leave it as a tagged ByteString (and  
probably whine about it).
* All ASCII-8BIT strings are backed by ByteString.

There’s some simplification here; some of the ByteStrings are really  
just NSDatas, &c., but the flow is there. Sorry the list above is a  
mess; I’m up much later than I’m accustomed to.
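In plain Ruby 1.9 terms (stock String semantics, no MacRuby internals  
assumed), the flow plays out roughly like this:

```ruby
# Sketch of the scheme above using only stock Ruby 1.9 String methods;
# the NSString/ByteString mapping in the comments is the proposal, not
# anything Ruby itself exposes.
s = "caf\xC3\xA9".force_encoding("UTF-8")
s.valid_encoding?    # => true: the bytes convert cleanly, so this one
                     #    could stay NSString-backed

raw = s.dup.force_encoding("ASCII-8BIT")
raw.bytesize         # => 5: reinterpreted as raw bytes, i.e. backed by
                     #    a ByteString in the scheme above

bad = "\xFF\xFE".force_encoding("UTF-8")
bad.valid_encoding?  # => false: conversion to the new encoding fails,
                     #    so it stays a tagged ByteString (and we whine)
```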

-Ben
