So plan B: We emulate Ruby 1.9 string behavior on top of NSString/NSData.
I'm really interested in this discussion too. A little background for JRuby:
Thanks for the background, Charlie. This sort of history is very instructive.
* Java's strings are all UTF-16. In order to represent binary data, we ended up using a "raw" encoder/decoder and only using the bottom byte of each character. Wasteful, since every string was 2x as large, and slow, since IO had to up/downcast byte[] contents to char[] and back.
Most CFStrings use a UTF-16 internal store as well.
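If I'm reading the "raw" trick right, it's the ISO-8859-1 identity mapping: each byte 0x00..0xFF becomes the character with that code point, so the round trip is lossless, but every byte of data occupies a two-byte char. In Ruby 1.9 terms, roughly (values here are arbitrary):

    raw   = "\x00\x7f\x80\xff".force_encoding(Encoding::ASCII_8BIT)   # some binary data
    chars = raw.dup.force_encoding(Encoding::ISO_8859_1)              # "decode": one char per byte
    utf16 = chars.encode(Encoding::UTF_16LE)                          # what a char[]-backed store holds
    utf16.bytesize                                                     # => 8, twice the original 4 bytes
    back  = utf16.encode(Encoding::ISO_8859_1).force_encoding(Encoding::ASCII_8BIT)
    back == raw                                                        # => true, the bottom bytes survive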
* We want to move to an intermediate version, where we sometimes have a byte[]-backed string and sometimes a char[]/String-backed string. IronRuby does this already. This is, however, predicated on the idea that byte[]-based strings rarely become char[]-based strings and vice versa. I don't have any evidence for or against that yet.
So it's a nearly identical problem for MacRuby, as I understand it. I'm interested in discussion around this topic, since we are still moving forward with JRuby and would like to improve interop with Java libraries. I will offer the following food for thought:
* Going with 100% objc strings at first is probably a good pragmatic start. You'll have the perf/memory impact of encoding/decoding and wasteful string contents, but you should be able to get it functioning well. And since interop is a primary goal for MacRuby (whereas it's been somewhat secondary in JRuby), this is probably a better place to start.
That’s where things stand today, and with Laurent’s ByteString work it all mostly holds together, as long as you don’t try to change the encoding of an existing string.
* Alternatively, you could only support a minimum set of encodings and make it explicit that internally everything would be UTF-16 or MacRoman. In MacRuby's case, I think most people would happily accept that, just as a lot of JRuby users would probably accept that everything's UTF-16 since that's what they get from Java normally.
This seems like a bad situation in the face of the varied encoding landscape on the Internet.
Ultimately this is the exact reason I argued over a year ago that Ruby 1.9 should introduce a separate Bytes class used for IO. I was denied.
I was disappointed to see this turned down as well. The encoding situation in 1.9 feels worse than it was in 1.8, and that’s pretty impressive.
It's definitely a sticky issue, and Ruby has made it even stickier in 1.9 with arbitrary encoding support. None of the proposed solutions across all implementations (including JRuby) have really seemed ideal to me.
Laurent and I discussed this a bit tonight, and here’s what I think we can get away with:

* By default, store all strings as NSString (UTF-16 backed), with an ivar to store the encoding.
* When getting bytes, convert to a ByteString in the appropriate encoding.
* When doing force_encoding, convert to a ByteString in the old encoding, then try to convert that to an NSString in the new encoding. If we succeed, great. If not, leave it as a tagged ByteString (and probably whine about it).
* All ASCII-8BIT strings are backed by ByteString.

There’s some simplification here; some of the ByteStrings are really just NSDatas, &c., but the flow is there. I’m up much later than I’m accustomed to, so apologies if any of this is muddled.

-Ben
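P.S. To pin the flow down, here’s a very rough sketch in plain Ruby, with ordinary Ruby strings standing in for the real stores: a UTF-16LE string plays the NSString and an ASCII-8BIT string plays the ByteString. The class and method names are invented for illustration; none of this is actual MacRuby internals.

    # Hypothetical sketch only; not MacRuby code.
    class SketchString
      BINARY = Encoding::ASCII_8BIT

      attr_reader :encoding

      def initialize(str)
        @encoding = str.encoding
        if @encoding == BINARY
          @bytes, @utf16 = str.dup, nil                           # ASCII-8BIT: ByteString-backed
        else
          @utf16, @bytes = str.encode(Encoding::UTF_16LE), nil    # default: UTF-16 ("NSString") store
        end
      end

      # Getting bytes: convert the UTF-16 store to a ByteString in the tagged encoding.
      def to_bytes
        return @bytes.dup if @bytes
        @utf16.encode(@encoding).force_encoding(BINARY)
      end

      # force_encoding: take the bytes in the old encoding and try to reinterpret
      # them under the new one; expects an Encoding object.
      def force_encoding(new_encoding)
        raw = to_bytes.force_encoding(new_encoding)
        if new_encoding == BINARY
          @bytes, @utf16 = raw, nil                               # stays ByteString-backed
        elsif raw.valid_encoding?
          @utf16, @bytes = raw.encode(Encoding::UTF_16LE), nil    # round-trips: back to the UTF-16 store
        else
          warn "#{new_encoding}: bytes aren't valid, leaving as a tagged ByteString"
          @bytes, @utf16 = raw, nil
        end
        @encoding = new_encoding
        self
      end
    end

The obvious tax is that every bytes/IO-facing call pays for a UTF-16-to-encoding conversion, which is more or less the same cost Charlie describes for JRuby today.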