So plan B: We emulate Ruby 1.9 string behavior on top of NSString/NSData.
I'm really interested in this discussion too. A little background for JRuby:
Thanks for the background, Charlie. This sort of history is very instructive.
* Java's strings are all UTF-16. In order to represent binary data, we ended up using a "raw" encoder/decoder and only using the bottom byte of each character. Wasteful, since every string was 2x as large, and slow, since IO had to up/downcast byte[] contents to char[] and back.
Most CFStrings use a UTF-16 internal store as well.
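If I'm reading the "raw" trick right, it's the ISO-8859-1 identity mapping: each byte 0x00..0xFF becomes the character with that code point, so the round trip is lossless, but every byte of data occupies a two-byte char. In Ruby 1.9 terms, roughly (values here are arbitrary):

    raw   = "\x00\x7f\x80\xff".force_encoding(Encoding::ASCII_8BIT)   # some binary data
    chars = raw.dup.force_encoding(Encoding::ISO_8859_1)              # "decode": one char per byte
    utf16 = chars.encode(Encoding::UTF_16LE)                          # what a char[]-backed store holds
    utf16.bytesize                                                     # => 8, twice the original 4 bytes
    back  = utf16.encode(Encoding::ISO_8859_1).force_encoding(Encoding::ASCII_8BIT)
    back == raw                                                        # => true, the bottom bytes survive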
* We want to move to an intermediate version, where we sometimes have a byte[]-backed string and sometimes a char[]/String-backed string. IronRuby does this already. This is, however, predicated on the idea that byte[]-based strings rarely become char[]-based strings and vice versa. I don't have any evidence for or against that yet.
So it's a nearly identical problem for MacRuby, as I understand it. I'm interested in discussion around this topic, since we are still moving forward with JRuby and would like to improve interop with Java libraries. I will offer the following food for thought:
* Going with 100% objc strings at first is probably a good pragmatic start. You'll have the perf/memory impact of encoding/decoding and wasteful string contents, but you should be able to get it functioning well. And since interop is a primary goal for MacRuby (whereas it's been somewhat secondary in JRuby), this is probably a better place to start.
That’s where things stand today, and with Laurent’s ByteString work it all mostly holds together, as long as you don’t try to change the encoding of an existing string.
* Alternatively, you could only support a minimum set of encodings and make it explicit that internally everything would be UTF-16 or MacRoman. In MacRuby's case, I think most people would happily accept that, just as a lot of JRuby users would probably accept that everything's UTF-16 since that's what they get from Java normally.
This seems like a bad situation in the face of the varied encoding landscape on the Internet.
Ultimately this is the exact reason I argued over a year ago that Ruby 1.9 should introduce a separate Bytes class used for IO. I was denied.
I was disappointed to see this turned down as well. The encoding situation in 1.9 feels worse than it was in 1.8, and that’s pretty impressive.
It's definitely a sticky issue, and Ruby has made it even stickier in 1.9 with arbitrary encoding support. None of the proposed solutions across all implementations (including JRuby) have really seemed ideal to me.
Laurent and I discussed this a bit tonight, and here’s what I think we can get away with:

* By default, store all strings as NSString (UTF-16 backed), with an ivar to store the encoding.
* When getting bytes, convert to a ByteString in the appropriate encoding.
* When doing force_encoding, convert to a ByteString in the old encoding, then try to convert that to an NSString in the new encoding. If we succeed, great. If not, leave it as a tagged ByteString (and probably whine about it).
* All ASCII-8BIT strings are backed by ByteString.

There’s some simplification here; some of the ByteStrings are really just NSDatas, &c., but the flow is there. I’m up much later than I’m accustomed to, so apologies if any of this is muddled.

-Ben
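P.S. To pin the flow down, here’s a very rough sketch in plain Ruby, with ordinary Ruby strings standing in for the real stores: a UTF-16LE string plays the NSString and an ASCII-8BIT string plays the ByteString. The class and method names are invented for illustration; none of this is actual MacRuby internals.

    # Hypothetical sketch only; not MacRuby code.
    class SketchString
      BINARY = Encoding::ASCII_8BIT

      attr_reader :encoding

      def initialize(str)
        @encoding = str.encoding
        if @encoding == BINARY
          @bytes, @utf16 = str.dup, nil                           # ASCII-8BIT: ByteString-backed
        else
          @utf16, @bytes = str.encode(Encoding::UTF_16LE), nil    # default: UTF-16 ("NSString") store
        end
      end

      # Getting bytes: convert the UTF-16 store to a ByteString in the tagged encoding.
      def to_bytes
        return @bytes.dup if @bytes
        @utf16.encode(@encoding).force_encoding(BINARY)
      end

      # force_encoding: take the bytes in the old encoding and try to reinterpret
      # them under the new one; expects an Encoding object.
      def force_encoding(new_encoding)
        raw = to_bytes.force_encoding(new_encoding)
        if new_encoding == BINARY
          @bytes, @utf16 = raw, nil                               # stays ByteString-backed
        elsif raw.valid_encoding?
          @utf16, @bytes = raw.encode(Encoding::UTF_16LE), nil    # round-trips: back to the UTF-16 store
        else
          warn "#{new_encoding}: bytes aren't valid, leaving as a tagged ByteString"
          @bytes, @utf16 = raw, nil
        end
        @encoding = new_encoding
        self
      end
    end

The obvious tax is that every bytes/IO-facing call pays for a UTF-16-to-encoding conversion, which is more or less the same cost Charlie describes for JRuby today.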