[MacRuby-devel] Strings, Encodings and IO
ben at tanjero.com
Tue Apr 7 22:23:29 PDT 2009
>> So plan B: We emulate Ruby 1.9 string behavior on top of
> I'm really interested in this discussion too. A little background
> for JRuby:
Thanks for the background, Charlie. This sort of history is very
helpful.
> * Java's strings are all UTF-16. In order to represent binary data,
> we ended up using a "raw" encoder/decoder and only using the bottom
> byte of each character. Wasteful, since every string was 2x as
> large, and slow, since IO had to up/downcast byte contents to
> char and back.
Most CFStrings use a UTF-16 internal store as well.
> * We want to move to an intermediate version, where we sometimes
> have a byte-backed string and sometimes a char/String-backed
> string. IronRuby does this already. This is, however, predicated on
> the idea that byte-based strings rarely become char-based
> strings and vice versa. I don't have any evidence for or against
> that yet.
> So it's a nearly identical problem for MacRuby, as I understand it.
> I'm interested in discussion around this topic, since we are still
> moving forward with JRuby and would like to improve interop with
> Java libraries. I will offer the following food for thought:
> * Going with 100% objc strings at first is probably a good pragmatic
> start. You'll have the perf/memory impact of encoding/decoding and
> wasteful string contents, but you should be able to get it
> functioning well. And since interop is a primary goal for MacRuby
> (where it's been somewhat secondary in JRuby) this is probably a
> better place to start.
That’s where things stand today, and with Laurent’s ByteString work
this all mostly works, as long as you don’t try to change a string’s
encoding after the fact.
> * Alternatively, you could only support a minimum set of encodings
> and make it explicit that internally everything would be UTF-16 or
> MacRoman. In MacRuby's case, I think most people would happily
> accept that, just as a lot of JRuby users would probably accept that
> everything's UTF-16 since that's what they get from Java normally.
This seems like a bad trade-off given the varied encoding landscape
of the Internet.
> Ultimately this is the exact reason I argued over a year ago that
> Ruby 1.9 should introduce a separate Bytes class used for IO. I was
I was disappointed to see this turned down as well. The encoding
situation in 1.9 feels worse than it was in 1.8, and that’s pretty
unfortunate.
> It's definitely a sticky issue, and Ruby has made it even stickier
> in 1.9 with arbitrary encoding support. None of the proposed
> solutions across all implementations (including JRuby) have really
> seemed ideal to me.
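For concreteness, this is the 1.9 behavior that has to be matched. It
runs on stock Ruby 1.9+, nothing MacRuby-specific:

```ruby
# Ruby 1.9 semantics to reproduce: force_encoding only retags the
# bytes of a string, it never transcodes them.
s = "caf\xC3\xA9".force_encoding("UTF-8")
s.valid_encoding?   # => true  (these bytes happen to be valid UTF-8)
s.length            # => 4 characters
s.bytesize          # => 5 bytes

b = s.dup.force_encoding("ASCII-8BIT")
b.length            # => 5     (binary strings count per byte)

a = s.dup.force_encoding("US-ASCII")
a.valid_encoding?   # => false (0xC3 isn't ASCII; 1.9 just tags it anyway)
```

So any backing-store scheme has to tolerate strings whose bytes are
invalid in their tagged encoding.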
Laurent and I discussed this a bit tonight, and here’s what I think we
can get away with:
* By default, store all strings as NSString (UTF-16 backed), with an
  ivar to store the encoding.
* When getting bytes, convert to a ByteString in the appropriate
  encoding.
* When doing force_encoding, convert to a ByteString in the old
  encoding, then try to convert to an NSString in the new encoding. If
  we succeed, great. If not, leave it as a tagged ByteString (and
  probably whine about it).
* All ASCII-8BIT strings are backed by ByteString.
There’s some simplification here; some of the ByteStrings are really
just NSDatas, &c., but the flow is there. Sorry if the list above is a
mess; I’m up much later than I’m accustomed to.
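In case the prose is unclear, here’s a rough, runnable model of the
force_encoding step in plain Ruby. UTF16Backed and ByteBacked are
made-up stand-ins for NSString and ByteString; nothing below is real
MacRuby API, it only models the decision flow:

```ruby
# Toy model of the proposed force_encoding flow. UTF16Backed and
# ByteBacked are invented stand-ins for NSString and ByteString.
UTF16Backed = Struct.new(:string)            # always-valid text store
ByteBacked  = Struct.new(:bytes, :encoding)  # tagged raw bytes

def model_force_encoding(str, new_enc)
  # Step 1: dump to raw bytes in the old encoding.
  bytes = str.is_a?(UTF16Backed) ? str.string.bytes : str.bytes
  # Step 2: try to reinterpret those bytes in the new encoding.
  candidate = bytes.pack("C*").force_encoding(new_enc)
  if candidate.valid_encoding?
    UTF16Backed.new(candidate)               # success: back to a char store
  else
    warn "bytes are not valid #{new_enc}"    # the "whine about it" part
    ByteBacked.new(bytes, new_enc)           # failure: stay byte-backed
  end
end

model_force_encoding(UTF16Backed.new("caf\u00E9"), "UTF-8")    # stays char-backed
model_force_encoding(UTF16Backed.new("caf\u00E9"), "US-ASCII") # falls back to bytes
```

The point of the model is just that the char-backed representation is
the default and the byte-backed one is the escape hatch for encodings
the bytes can’t satisfy.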