[MacRuby-devel] Strings, Encodings and IO

Vincent Isambart vincent.isambart at gmail.com
Mon Apr 6 22:47:41 PDT 2009


Hi,

MacRuby is getting better and better at great speed, but there is one
point where MacRuby still has much to do: strings, encodings and IO.

If you have some interest in that, but do not know Ruby 1.9's strings
well, I recommend you check at least the last two posts of James Gray
II's series about encodings
(http://blog.grayproductions.net/articles/ruby_19s_string and
http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings).

A few of the new features were added quite late in the 1.9
development cycle, so if you played with 1.9 a few months before 1.9.1,
you should at least check the last post.

In MacRuby, we can't have the same strings as 1.9 because, for easy
interoperability with Cocoa, we have to use Cocoa's strings
(CFString/NSMutableString). And those strings work quite differently
from Ruby's strings, including:
- Cocoa strings are not made to store non-text data. For that you are
supposed to use NSData. But Ruby strings are used to store anything
(in 1.9, using the ASCII-8BIT encoding).
- Cocoa generally stores all strings as either UTF-16 or MacRoman (the
string is converted to one of these encodings if you give it data in
another encoding). Ruby 1.9 does not do any conversion unless you
ask for it explicitly.
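
To make the Ruby side of these two points concrete, here is a small sketch
in plain Ruby (not MacRuby-specific). The helper names jpeg_magic,
transcoded and relabeled are made up for the illustration; only pack,
encode and force_encoding are real String/Array methods.

```ruby
# Raw bytes (the first bytes of a JPEG file) held in an ASCII-8BIT string.
# Array#pack with "C*" returns a string tagged ASCII-8BIT.
def jpeg_magic
  [0xFF, 0xD8, 0xFF].pack("C*")
end

# Explicit conversion: the bytes are transcoded to ISO-8859-1.
def transcoded(text)
  text.encode("ISO-8859-1")
end

# Relabeling only: the bytes stay the same, only the encoding tag changes.
def relabeled(text)
  text.dup.force_encoding("ISO-8859-1")
end

text = "caf\u00E9"               # UTF-8: 4 characters stored in 5 bytes
puts jpeg_magic.encoding
puts transcoded(text).bytesize   # 4: the e-acute became a single byte
puts relabeled(text).bytesize    # 5: nothing was converted
```

Nothing here happens implicitly: without the call to encode, Ruby keeps
the bytes exactly as they were given.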

The reason Ruby does not convert strings is that in a few rare cases
you can lose some data between conversions. But most people would
never see it, and implementation-wise it's hard to handle strings in
lots of different encodings. And in fact most programming languages
and frameworks internally use strings in one of the Unicode encodings
(a lot use UTF-16 and a few UTF-8 - I have never seen one using UTF-32
internally). Using only one encoding also makes string comparison
easier (in Ruby 1.9 comparing strings of different encodings is not a
good idea).
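
A minimal sketch of the comparison problem, in plain Ruby: the same
character stored in two encodings does not compare equal, and mixing the
two strings raises an error. The same_text? helper is hypothetical and
simply transcodes both sides to UTF-8 first; it assumes both strings carry
a real text encoding (it would raise on ASCII-8BIT data with high bytes).

```ruby
# Hypothetical helper: compare two strings as text by first bringing
# both to a common encoding (UTF-8).
def same_text?(a, b)
  a.encode("UTF-8") == b.encode("UTF-8")
end

utf8   = "\u00E9"                     # e-acute, 2 bytes in UTF-8
latin1 = utf8.encode("ISO-8859-1")    # the same character, 1 byte

puts utf8 == latin1                   # false
puts same_text?(utf8, latin1)         # true

begin
  utf8 + latin1
rescue Encoding::CompatibilityError => e
  puts "cannot concatenate: #{e.message}"
end
```

With a single internal encoding, as in Cocoa, neither the helper nor the
rescue would be needed.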

We won't have an implementation of strings 100% compatible with 1.9,
but I think we should still have something very similar, in the sense
that if your code doesn't do something too weird with strings it
should work on both 1.9 and MacRuby.

Patrick and Laurent, in the work on IO in 0.5, have created a new
ByteString type that inherits from NSMutableString but internally uses
NSData. But all the IO methods always return a ByteString, which is, I
think, not a good idea.

My main idea is the following: at the places where 1.9 returns an
ASCII-8BIT string we should return a ByteString. And for strings with
a real encoding we should return an NSMutableString (the encoding being
the one chosen by Cocoa).
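
As a rough illustration of that rule, the sketch below runs on standard
Ruby and shows which reads come back as ASCII-8BIT. The function
proposed_string_class is purely hypothetical: it only names the class the
proposal would pick, it is not MacRuby code. Tempfile.create and
File.binwrite are used for convenience and need a Ruby newer than 1.9.1.

```ruby
require "tempfile"

# Hypothetical mapping from the proposal: binary data -> ByteString,
# anything with a real encoding -> NSMutableString.
def proposed_string_class(str)
  str.encoding == Encoding::ASCII_8BIT ? "ByteString" : "NSMutableString"
end

# Read the same small file in three different ways.
def read_three_ways
  Tempfile.create("encoding-demo") do |tmp|
    File.binwrite(tmp.path, "caf\u00E9\n")
    {
      "text read"   => File.open(tmp.path, "r:UTF-8") { |f| f.read },
      "binary read" => File.open(tmp.path, "rb") { |f| f.read },
      # Reading a fixed number of bytes always gives ASCII-8BIT.
      "read(3)"     => File.open(tmp.path, "r:UTF-8") { |f| f.read(3) },
    }
  end
end

read_three_ways.each do |label, str|
  puts "#{label}: #{str.encoding} -> #{proposed_string_class(str)}"
end
```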

A few functions of 1.9 may also be disabled (like force_encoding). Of
course it would be possible to add the full functionality of Ruby 1.9
strings on ByteString but it wouldn't be worth it.

Ruby 1.9 also has default source and default external encodings that
differ depending on the environment, but I think always setting both of
them to UTF-8 would be best. (We may even completely ignore the
encoding pragmas in the code so as not to complicate the parser.)
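
For reference, this is roughly what a script has to do in standard Ruby
to get that behaviour by hand, assuming "default code encoding" means the
source encoding. The function name force_utf8_defaults is made up for the
example; the magic comment, __ENCODING__ and Encoding.default_external=
are standard Ruby.

```ruby
# encoding: utf-8
# The pragma above sets the source encoding, but only when it is the
# first line of the file (or the second, after a shebang line).

# Hypothetical helper: make UTF-8 the default external encoding and
# report the source and external encodings now in effect.
def force_utf8_defaults
  Encoding.default_external = Encoding::UTF_8
  [__ENCODING__, Encoding.default_external]
end

source_encoding, external_encoding = force_utf8_defaults
puts "source:   #{source_encoding}"
puts "external: #{external_encoding}"
```

If both were fixed to UTF-8, neither the pragma nor the assignment would
be needed.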

If you have any comments or criticism, do not hesitate; that's the reason I
posted it on the ML ;)

Cheers,
Vincent

