Hi,

MacRuby is getting better and better at great speed, but there is one area where MacRuby still has much to do: strings, encodings, and IO. If you have some interest in that but do not know Ruby 1.9's strings well, I recommend you read at least the last two posts of James Gray II's series about encodings (http://blog.grayproductions.net/articles/ruby_19s_string and http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings). A few of the new features were added quite late in the 1.9 development cycle, so if you played with 1.9 a few months before 1.9.1, you should at least check the last post.

In MacRuby, we can't have the same strings as 1.9 because, for easy interoperability with Cocoa, we have to use Cocoa's strings (CFString/NSMutableString). And those strings work quite differently from Ruby's strings:

- Cocoa strings are not made to store non-text data; for that you are supposed to use NSData. But Ruby strings are used to store anything (using, in 1.9, the ASCII-8BIT encoding).

- Cocoa generally stores all strings internally in either UTF-16 or MacRoman (the string is converted to one of these encodings if you give it data in another encoding). Ruby 1.9 does no conversion unless you ask for it explicitly.

The reason Ruby does not convert strings is that in a few rare cases you can lose some data between conversions. But most people would never see it, and implementation-wise it is hard to handle strings in lots of different encodings. In fact most programming languages and frameworks internally use strings in one of the Unicode encodings (many use UTF-16 and a few UTF-8 - I have never seen one using UTF-32 internally). Using only one encoding also makes string comparison easier (in Ruby 1.9, comparing strings of different encodings is not a good idea).

We won't have an implementation of strings 100% compatible with 1.9, but I think we should still have something very similar, in the sense that if your code doesn't do anything too weird with strings it should work on both 1.9 and MacRuby.

Patrick and Laurent, in the work on IO for 0.5, have created a new ByteString type that inherits from NSMutableString but internally uses NSData. But all the IO methods always return a ByteString, which is, I think, not a good idea. My main idea is the following: in the places where 1.9 returns an ASCII-8BIT string we should return a ByteString, and for strings with a real encoding we should return an NSMutableString (the encoding being the one chosen by Cocoa). A few functions of 1.9 may also have to be disabled (like force_encoding). Of course it would be possible to add the full functionality of Ruby 1.9 strings to ByteString, but it wouldn't be worth it.

Ruby 1.9 also has default source and default external encodings that differ depending on the environment, but I think always setting both of them to UTF-8 would be best. (We may even completely ignore the encoding pragmas in the code, so as not to complicate the parser.)

If you have any comments or criticisms do not hesitate - that's the reason I posted this on the ML ;)

Cheers,
Vincent
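[For readers less familiar with the 1.9 behaviour Vincent refers to, here is a short illustration in plain Ruby 1.9; the file path is just an arbitrary example of binary data.]

  # IO in binary mode hands back ordinary Strings tagged ASCII-8BIT:
  data = File.open("/bin/ls", "rb") { |f| f.read(4) }
  data.encoding                     #=> #<Encoding:ASCII-8BIT>

  # Text keeps whatever encoding it was created with; Ruby never
  # transcodes behind your back.
  utf8  = "héllo"                   # literal in a UTF-8 source file
  latin = utf8.encode("ISO-8859-1") # conversion happens only when asked
  utf8 == latin                     #=> false: different bytes, different encodings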
On Apr 7, 2009, at 7:47 AM, Vincent Isambart wrote:

I have two small comments and a general statement about your essay:
A few functions of 1.9 may also have to be disabled (like force_encoding). Of course it would be possible to add the full functionality of Ruby 1.9 strings to ByteString, but it wouldn't be worth it.
The force_encoding method will be absolutely _vital_ to working with encodings in Ruby. Most library authors don't know anything about character encoding and _will_ do the wrong thing. And I'm not even talking about libraries written for 1.8, which are totally unaware of the String changes. For example, in a fictional HTTP library that totally doesn't exist today:

  response = HTTP.get('http://www.google.com')
  response.body.encoding #=> #<Encoding:US-ASCII>

Even though the headers clearly say: "Content-Type: text/html; charset=UTF-8". So we need force_encoding to fix these problems. Even the library author probably needs the force_encoding method, because somewhere deep down in the library there might be C / Obj-C code that returns a byte string to Ruby without specifying the encoding.
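[A sketch of the client-side fix this implies, reusing Manfred's fictional HTTP.get; the header accessor and its layout are invented for illustration.]

  response = HTTP.get('http://www.google.com')
  charset  = response.header['Content-Type'][/charset=([\w-]+)/, 1]  # "UTF-8" here
  body     = response.body.force_encoding(charset || 'ASCII-8BIT')
  body.encoding  #=> #<Encoding:UTF-8> -- the bytes themselves are untouched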
Ruby 1.9 also has default source and default external encodings that differ depending on the environment, but I think always setting both of them to UTF-8 would be best. (We may even completely ignore the encoding pragmas in the code, so as not to complicate the parser.)
Also a no-go. ERB uses this pragma to signal what the encoding of the template is; encoding will break when you ignore this.

Finally: I don't think it's a good idea to discuss this at great length without actual code, but in order to write a compatible implementation most (if not all) of the String awkwardness will have to be implemented.

Manfred
Finally: I don't think it's a good idea to discuss this at great length without actual code, but in order to write a compatible implementation most (if not all) of the String awkwardness will have to be implemented.
Thank you very much for the remarks. We will indeed need to go to actual code quickly but I think your remarks already made my mail worth writing ;). So back to the whiteboard :-)
Hi again,

So, plan B: we emulate Ruby 1.9 string behavior on top of NSString/NSData. Internally we would use an NSData when the encoding is not valid or is binary, and an NSString in the other cases (never both at the same time). We would have a Ruby encoding that may be completely different from the real encoding of the string (it's just a facade for Ruby).

force_encoding would transform the string back into NSData using the Ruby encoding if the data is stored as an NSString, or use the NSData directly. It would then try to make an NSString from that data using the encoding given to force_encoding. If that succeeds, we use the string; if not, we use the NSData. Of course force_encoding would be slow, as would accessing bytes when the data is stored as an NSString. However, accessing the nth character of a string would probably be faster than in 1.9.

I am however a bit afraid it would not be very easy to implement. The main problem is not implementing the string itself, but having something that plays well with real NSStrings, and with Objective-C code that expects normal NSStrings of course. And, well, I do not know CFString/NSString well.

As before, it's just an idea. Comments are welcome. Implementation will wait at least until a few people think it could be useful (and anyone wanting to do it would be welcome, as personally I'm not sure I will have the time necessary until May).
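[A rough sketch of the force_encoding flow described above, written in plain Ruby 1.9 rather than MacRuby: ordinary Strings stand in for the NSString (text) and NSData (bytes) backing stores, just to show the decision being made.]

  def plan_b_force_encoding(text, declared_enc, new_enc)
    # 1. Serialise the internal text using the encoding Ruby currently
    #    believes the string has (in MacRuby: UTF-16 store -> declared_enc).
    bytes = text.encode(declared_enc).force_encoding('ASCII-8BIT')
    # 2. Reinterpret those bytes in the requested encoding.
    candidate = bytes.dup.force_encoding(new_enc)
    # 3. Valid? keep it as text ("NSString"). Invalid? keep raw bytes ("NSData").
    candidate.valid_encoding? ? candidate : bytes
  end

  latin = "caf\xE9".force_encoding('ISO-8859-1')      # "café" as Latin-1
  plan_b_force_encoding(latin, 'ISO-8859-1', 'UTF-8') # 0xE9 alone is not valid
                                                      # UTF-8, so bytes come back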
Vincent Isambart wrote:
Hi again,
So, plan B: we emulate Ruby 1.9 string behavior on top of NSString/NSData.
I'm really interested in this discussion too. A little background for JRuby:

We started out (or really, the original authors started out) with JRuby using all Java strings and stringbuffers for Ruby's String. This made interop easy and of course simplified implementation, but it ran into the same problems you enumerated in your original post:

* Java's strings are all UTF-16. In order to represent binary data, we ended up using a "raw" encoder/decoder and only using the bottom byte of each character. Wasteful, since every string was 2x as large, and slow, since IO had to up/downcast byte[] contents to char[] and back.

* So we made a move about two years ago to using all byte[]-based strings in JRuby. This allowed us maximum compatibility, maximum performance, and a future-proof path, but it damages interop. Currently whenever you pass a string across the Ruby/Java boundary, we have to do the decode/encode. That affects performance pretty severely. We also had to implement our own regular expression engine, since no Java regex works with byte[].

* We want to move to an intermediate version, where we sometimes have a byte[]-backed string and sometimes a char[]/String-backed string. IronRuby does this already. This is, however, predicated on the idea that byte[]-based strings rarely become char[]-based strings and vice versa. I don't have any evidence for or against that yet.

So it's a nearly identical problem for MacRuby, as I understand it. I'm interested in discussion around this topic, since we are still moving forward with JRuby and would like to improve interop with Java libraries. I will offer the following food for thought:

* Going with 100% objc strings at first is probably a good pragmatic start. You'll have the perf/memory impact of encoding/decoding and wasteful string contents, but you should be able to get it functioning well. And since interop is a primary goal for MacRuby (where it's been somewhat secondary in JRuby) this is probably a better place to start.

* We have considered having a modified 1.9 mode that normalizes all strings into UTF-16 internally. That might work for you as well. I presume there are byte-decoding APIs in objc that could produce your standard strings. You'd be able to at least pretend you support 1.9 encoding, but be transcoding to the user's selected encoding. It wouldn't be fast, but it would work.

* Alternatively, you could only support a minimum set of encodings and make it explicit that internally everything would be UTF-16 or MacRoman. In MacRuby's case, I think most people would happily accept that, just as a lot of JRuby users would probably accept that everything's UTF-16 since that's what they get from Java normally.

Ultimately this is the exact reason I argued over a year ago that Ruby 1.9 should introduce a separate Bytes class used for IO. I was denied. It's definitely a sticky issue, and Ruby has made it even stickier in 1.9 with arbitrary encoding support. None of the proposed solutions across all implementations (including JRuby) have really seemed ideal to me.

- Charlie
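[A Ruby stand-in for the old JRuby "raw" trick Charlie describes (the real thing was Java): stuff one payload byte into each UTF-16 code unit, which doubles the size and forces a per-character walk to get the bytes back.]

  payload  = [0x00, 0x81, 0xFF, 0x42]                       # arbitrary binary data
  as_utf16 = payload.pack('n*').force_encoding('UTF-16BE')  # one byte per 16-bit unit
  as_utf16.length                                           #=> 4 characters
  as_utf16.bytesize                                          #=> 8 -- twice the payload
  back = as_utf16.each_char.map(&:ord).pack('C*')            # walk every char for the bytes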
So, plan B: we emulate Ruby 1.9 string behavior on top of NSString/NSData.
I'm really interested in this discussion too. A little background for JRuby:
Thanks for the background, Charlie. This sort of history is very instructive.
* Java's strings are all UTF-16. In order to represent binary data, we ended up using a "raw" encoder/decoder and only using the bottom byte of each character. Wasteful, since every string was 2x as large, and slow, since IO had to up/downcast byte[] contents to char[] and back.
Most CFStrings use a UTF-16 internal store as well.
* We want to move to an intermediate version, where we sometimes have a byte[]-backed string and sometimes a char[]/String-backed string. IronRuby does this already. This is, however, predicated on the idea that byte[]-based strings rarely become char[]-based strings and vice versa. I don't have any evidence for or against that yet.
So it's a nearly identical problem for MacRuby, as I understand it. I'm interested in discussion around this topic, since we are still moving forward with JRuby and would like to improve interop with Java libraries. I will offer the following food for thought:
* Going with 100% objc strings at first is probably a good pragmatic start. You'll have the perf/memory impact of encoding/decoding and wasteful string contents, but you should be able to get it functioning well. And since interop is a primary goal for MacRuby (where it's been somewhat secondary in JRuby) this is probably a better place to start.
That’s where things stand today, and with Laurent’s ByteString work this all mostly works as long as you don’t try to change encodings around on strings.
* Alternatively, you could only support a minimum set of encodings and make it explicit that internally everything would be UTF-16 or MacRoman. In MacRuby's case, I think most people would happily accept that, just as a lot of JRuby users would probably accept that everything's UTF-16 since that's what they get from Java normally.
This seems like a bad situation in the face of the varied encoding landscape on the Internet.
Ultimately this is the exact reason I argued over a year ago that Ruby 1.9 should introduce a separate Bytes class used for IO. I was denied.
I was disappointed to see this turned down as well. The encoding situation in 1.9 feels worse than it was in 1.8, and that’s pretty impressive.
It's definitely a sticky issue, and Ruby has made it even stickier in 1.9 with arbitrary encoding support. None of the proposed solutions across all implementations (including JRuby) have really seemed ideal to me.
Laurent and I discussed this a bit tonight, and here's what I think we can get away with:

- By default, store all strings as NSString (UTF-16 backed), with an ivar to store the encoding.

- When getting bytes, convert to a ByteString in the appropriate encoding.

- When doing force_encoding, convert to a ByteString in the old encoding, then try to convert to an NSString in the new encoding. If we succeed, great. If not, leave as a tagged ByteString (and probably whine about it).

- All ASCII-8BIT strings are backed by ByteString.

There's some simplification here; some of the ByteStrings are really just NSDatas, &c., but the flow is there. Sorry the list above is a mess, I'm up much later than I'm accustomed to.

-Ben
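[A plain-Ruby sketch of the default case in Ben's list; the class and method names are invented for illustration. A UTF-16LE String stands in for the NSString backing store, the declared encoding is just an ivar, and byte access transcodes on demand.]

  class UTF16BackedString
    attr_reader :encoding

    def initialize(text, declared_encoding)
      @store    = text.encode('UTF-16LE')  # stand-in for the NSString side
      @encoding = declared_encoding        # what Ruby-level code sees
    end

    # "When getting bytes, convert to a ByteString in the appropriate encoding."
    def to_byte_string
      @store.encode(@encoding).force_encoding('ASCII-8BIT')
    end
  end

  s = UTF16BackedString.new("héllo", 'UTF-8')
  s.to_byte_string.bytesize  #=> 6 -- the é is two bytes once serialised as UTF-8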
On Apr 8, 2009, at 7:23 AM, Benjamin Stiglitz wrote:
When doing force_encoding, convert to a ByteString in the old encoding, then try to convert to an NSString in the new encoding. If we succeed, great. If not, leave as a tagged ByteString (and probably whine about it).
That's actually wrong. All force_encoding does is change the encoding attribute of the string; it shouldn't change the internal encoding of the bytes. The encoding attribute is basically a switch to describe which set of string methods should be used on the bytes. For more information see: http://blog.grayproductions.net/articles/ruby_19s_string

An example: we're in the same hypothetical HTTP library as before, and this library author has decided to _always_ force encoding to Shift JIS because he hates humanity:

  response = HTTP.get('http://example.com')
  response.body.encoding #=> Encoding::Shift_JIS

If MacRuby internally forces the body encoding to Shift JIS, information might get lost. So when someone decides to make it right afterwards:

  encoding = response.header['Content-Type'].split(';').last.split('=').last
  encoding #=> 'utf-8'

they might get into trouble here:

  response.body.force_encoding(Encoding::UTF_8)

Cuz'

  Encoding.compatible?(Encoding::Shift_JIS, Encoding::UTF_8) #=> nil

I think the best course of action is to expand the String specs in RubySpec for 1.9; after that anyone can freely hack away at the most optimal solution without fear of incompatibility. Reading those specs is also likely to give an idea of the most elegant solution.

Manfred
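[The distinction Manfred is drawing, shown in plain Ruby 1.9: force_encoding only relabels the bytes, while encode actually transcodes them.]

  s = "résumé"                   # UTF-8 source, 8 bytes
  s.force_encoding('ISO-8859-1') # relabel only: still the same 8 bytes, now misread
  s.bytesize                     #=> 8
  s.encode('UTF-8').bytesize     #=> 12 -- encode really transcodes, growing the mojibake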
That's actually wrong. All force_encoding does is change the encoding attribute of the string, it shouldn't change the internal encoding of the bytes. The encoding attribute is basically a switch to describe which set of string methods should be used on the bytes.
That's what force_encoding does in Ruby 1.9, but it's not possible to do the same thing if we want to use NSStrings as much as possible.
  response = HTTP.get('http://example.com')
  response.body.encoding #=> Encoding::Shift_JIS
  (...)
  response.body.force_encoding(Encoding::UTF_8)
If MacRuby internally forces the body encoding to Shift JIS information might get lost.
No, it would not. If it was valid Shift_JIS, the conversion back from UTF-16 to Shift_JIS should recover the original data (well, as long as the encoding conversion tables round-trip correctly). And if the string was not valid Shift_JIS, we keep it as bytes, so nothing is lost.
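[A quick check of the round trip Vincent is relying on, in plain Ruby 1.9 (assuming, as he notes, that the transcoding tables round-trip):]

  sjis = "こんにちは".encode('Shift_JIS')              # assumes a UTF-8 source file
  sjis.encode('UTF-16BE').encode('Shift_JIS') == sjis #=> true -- nothing lost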
I think the best course of action is to expand the String specs in RubySpec for 1.9; after that anyone can freely hack away at the most optimal solution without fear of incompatibility. Reading those specs is also likely to give an idea of the most elegant solution.
I think everyone agrees that having Ruby 1.9 String specs will be a necessity. And we'll also need to decide which parts of it to follow and which parts we do not need to. For example, handling access to characters in a string with a partly invalid encoding exactly the same way as 1.9 seems hard to do:
  s # a string in UTF-8 with a broken first byte
  => "\x00\x81\x93んにちは\n"
  s.length
  => 8
  [s[0], s[1], s[2], s[3], s[4], s[5]]
  => ["\x00", "\x81", "\x93", "ん", "に", "ち"]
Handling everything as bytes when the encoding is invalid would be easy, but handling only the bad part as such seems hard if you do not want to have to write code for each encoding. And the UTF-16 support should also be made better in MacRuby than in Ruby 1.9.
Vincent Isambart wrote:
I think everyone agrees that having Ruby 1.9 String specs will be a necessity. And we'll also need to decide which parts of it to follow and which parts we do not need to. For example, handling access to characters in a string with a partly invalid encoding exactly the same way as 1.9 seems hard to do:
GOD yes... any sort of complete string specs would be most welcome. Marcin Mielzynski, our porting machine, believes that our 1.9 String stuff is basically done--and true to form the test_string tests in the Ruby 1.9 repository do seem to mostly function--but I have no confidence in the coverage and completeness of the existing tests (having spent almost no time actually looking at them, though).

We have also been kicking around how to transparently handle UTF-16 by just using Java strings... but having moved to our own Oniguruma port, it would mean some regexp behaviors would have to change back to Java regex behavior. We simply could not match Ruby regex exactly until we ported the same engines Ruby uses :(

- Charlie
the test_string tests in the Ruby 1.9 repository do seem to mostly function

You mean test/ruby/test_m17n.rb, test/ruby/test_m17n_comb.rb, test/ruby/test_io_m17n.rb and test/ruby/enc/test_*.rb? test/ruby/test_string.rb does not contain anything m17n-related.
We simply could not match Ruby regex exactly until we ported the same engines Ruby uses :(

No two regexp engines have the same behavior; there's nothing anyone can do about that... It looks like Oniguruma has support for UTF-16, so I was thinking about using that in MacRuby. But as Oniguruma sees everything as a list of bytes, I do not know if you could use the Oniguruma UTF-16 support without modifying your Oniguruma port.
Vincent Isambart wrote:
the test_string tests in the Ruby 1.9 repository do seem to mostly function

You mean test/ruby/test_m17n.rb, test/ruby/test_m17n_comb.rb, test/ruby/test_io_m17n.rb and test/ruby/enc/test_*.rb? test/ruby/test_string.rb does not contain anything m17n-related.
We simply could not match Ruby regex exactly until we ported the same engines Ruby uses :(

No two regexp engines have the same behavior; there's nothing anyone can do about that...
Well, the problem we ran into is that those behavioral differences hindered our ability to run stuff like Rails. We didn't really have a choice.
It looks like Oniguruma has support for UTF-16, so I was thinking about using that in MacRuby. But as Oniguruma sees everything as a list of bytes, I do not know if you could use the Oniguruma UTF-16 support without modifying your Oniguruma port.
Yes, I have talked with Marcin about us doing a separate fork of "JOni" that works with Java's UTF-16 characters directly. I think it could become the best Java regexp engine.

- Charlie
When doing force_encoding, convert to a ByteString in the old encoding, then try to convert to an NSString in the new encoding. If we succeed, great. If not, leave as a tagged ByteString (and probably whine about it).
That's actually wrong. All force_encoding does is change the encoding attribute of the string, it shouldn't change the internal encoding of the bytes. The encoding attribute is basically a switch to describe which set of string methods should be used on the bytes.
We have to go through this dance to get force_encoding to play nicely with NSString. Namely, NSString is always backed by an array of UTF-16 code units. So, to reinterpret, we have to convert the internal representation out to whatever the old external encoding was, then convert back in to UTF-16 from the new external encoding.
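[The same dance in one line of plain Ruby, with a UTF-16LE String standing in for the NSString store; the encodings are arbitrary examples.]

  internal      = "déjà vu".encode('UTF-16LE')                           # what "NSString" holds
  reinterpreted = internal.encode('ISO-8859-1').force_encoding('UTF-8')  # out via old, in as new
  reinterpreted.valid_encoding?  #=> false -- so this one would stay a tagged ByteString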
We're in the same hypothetical HTTP library as before, and this library author has decided to _always_ force encoding to Shift JIS because he hates humanity:
  response = HTTP.get('http://example.com')
  response.body.encoding #=> Encoding::Shift_JIS
If MacRuby internally forces the body encoding to Shift JIS information might get lost. So when someone decides to make it right afterwards:
  encoding = response.header['Content-Type'].split(';').last.split('=').last
  encoding #=> 'utf-8'
They might get into trouble here:
response.body.force_encoding(Encoding::UTF_8)
Cuz'
Encoding.compatible?(Encoding::Shift_JIS, Encoding::UTF_8) #=> nil
Vincent already answered this part; we're still doing reinterpretation of what is essentially the original bytestream. Are there any encodings that map multiple byte sequences to the same code point? (And I'm not talking about Unicode NFC/NFD/&c.; that still makes it through the UTF-16 link alright.)

-Ben
Participants (4):

- Benjamin Stiglitz
- Charles Oliver Nutter
- Manfred Stienstra
- Vincent Isambart