[MacRuby-devel] [MacRuby] #339: YAML error with UTF-16 string
B. Ohr
jazzbox at 7zz.de
Sun Nov 15 01:15:42 PST 2009
On 14.11.2009 at 21:17, Matthias Neeracher wrote:
>
> On Nov 14, 2009, at 15:44 , MacRuby wrote:
>
>> #339: YAML error with UTF-16 string
>> ---------------------------+------------------------------------------------
>> Reporter: dev@… | Owner: lsansonetti@…
>> Type: defect | Status: closed
>> Priority: critical | Milestone: MacRuby 0.5
>> Component: MacRuby | Resolution: fixed
>> Keywords: YAML encoding |
>> ---------------------------+------------------------------------------------
>>
>> Comment(by jazzbox@…):
>>
>> {{{
>> $ macruby -e 'require "yaml"; puts "Rübe".to_yaml'
>> --- "R\xFCbe"
>> $ ruby1.9 -e 'require "yaml"; puts "Rübe".to_yaml'
>> --- "R\xC3\xBCbe"
>> }}}
>>
>> seems to work now! MacRuby escapes to UTF-16 and Ruby 1.9 escapes to
>> UTF-8.
>
> Actually, it seems to me (though I'm willing to be corrected on
> this) that the ruby1.9 encoding is simply wrong: it translates the
> accented character into UTF-8, and then escapes the two UTF-8
> bytes separately. What this ends up encoding is "RÃ¼be", which
> is not what you want.
>
>> I didn't find anything in YAML docs that describes that behaviour,
>> both methods seem to be correct.
>
> They can't possibly be BOTH correct, as interpreting the output of
> one according to the theory of the other would give a different
> result. If you look at the section in the YAML spec,
> <http://www.yaml.org/spec/1.2/spec.html#id2776092>, you will see
>
> [57] "Escaped 8-bit Unicode character."
>
> This is NOT a UTF-8 character.
>
>> But ruby 1.8 fails to load the UTF-16 YAML. That is not astonishing
>> because IMHO there is no way to guess what is the correct escaping
>> mode.
>
> It's not astonishing because (a) 1.8 has very poor Unicode support
> anyway and (b) this would hardly be the only bug in syck.
>
OK, you are right!
Before I started generating YAML in MacRuby and importing it into Ruby
1.8, I had not done anything with Unicode, so I am not very experienced
yet.
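
To convince myself, here is a rough sketch of the difference in plain
Ruby (1.9 or later, nothing MacRuby-specific; the variable names are
only for illustration). It shows that reading the UTF-8 bytes as
"escaped 8-bit Unicode characters" turns the one character U+00FC into
the two characters U+00C3 and U+00BC, i.e. "RÃ¼be":

```ruby
# "Ruebe" with a real u-umlaut (U+00FC), written with an escape so the
# source file itself stays pure ASCII.
s = "R\u00FCbe"

# The four Unicode code points of the string.
codepoints = s.unpack("U*")     # 0x52, 0xFC, 0x62, 0x65

# The five bytes of its UTF-8 encoding; U+00FC becomes 0xC3 0xBC.
utf8_bytes = s.unpack("C*")     # 0x52, 0xC3, 0xBC, 0x62, 0x65

# Treat every UTF-8 byte as if it were a code point of its own, which
# is what a loader following the spec does with "R\xC3\xBCbe".
misread = utf8_bytes.pack("U*")

puts misread.length             # 5 characters instead of 4
```

So "\xFC" names the code point, while "\xC3\xBC" names two different
code points.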
>> I think escaping is not necessary here because the encoding of
>> input and
>> output is the same. This can easily be tested by
>>
>> {{{
>> $ macruby -e 'require "yaml"; puts YAML::load "--- Rübe"'
>> Rübe
>> }}}
>
> That's an interesting point. I think you're right that the YAML spec
> does not require escaping of printable characters >\u007F. However,
> non-printable characters DO have to be escaped, and for the
> printable ones, it could be argued that erring on the side of
> escaping helps readability if the OS does not have font coverage for
> some printable characters. In any case, the current implementation
> tries to be conservative in what it generates and liberal in what it
> accepts. I'm open to persuasion that we should avoid escaping
> characters, provided there is a low-cost test for printability of
> general Unicode characters (I have not yet checked whether one of
> the built-in CFCharacterSets can give that; the descriptions were
> inconclusive).
>
The YAML spec, Chapter 5.1 Character Sets says:
> "To ensure readability, YAML streams use only the printable subset
> of the Unicode character set"
> [1] c-printable ::= #x9 | #xA | #xD | [#x20-#x7E]          /* 8 bit */
>                   | #x85 | [#xA0-#xD7FF] | [#xE000-#xFFFD] /* 16 bit */
>                   | [#x10000-#x10FFFF]                     /* 32 bit */
Only characters that are not "c-printable" MUST be escaped, and this is
well defined. (For double-quoted strings you also have to escape the "
and the \ as special characters.)
> "...In addition, any allowed characters known to be non-printable
> SHOULD also be escaped.
> This isn’t mandatory since a full implementation would require
> extensive character property tables."
So it is a SHOULD and not a MUST because a full check is too expensive.
The YAML spec is a little bit confusing in how it uses "allowed
characters" and "non-printable characters".
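
The MUST part looks cheap to test. A minimal sketch in plain Ruby,
assuming only production [1] quoted above (the names
C_PRINTABLE_RANGES, c_printable? and must_escape are made up for this
mail, they are not MacRuby or CoreFoundation API). It does not attempt
the SHOULD part, which is the one that needs character property tables:

```ruby
# Code point ranges of production [1] c-printable, as quoted above.
C_PRINTABLE_RANGES = [
  0x09..0x09, 0x0A..0x0A, 0x0D..0x0D, 0x20..0x7E,   # 8 bit
  0x85..0x85, 0xA0..0xD7FF, 0xE000..0xFFFD,         # 16 bit
  0x10000..0x10FFFF                                 # 32 bit
]

# True if the given code point is inside the c-printable set.
def c_printable?(codepoint)
  C_PRINTABLE_RANGES.any? { |range| range.include?(codepoint) }
end

# Returns the code points of str that MUST be escaped in a
# double-quoted scalar: everything outside c-printable, plus the
# double quote (0x22) and the backslash (0x5C).
def must_escape(str)
  str.unpack("U*").select do |cp|
    !c_printable?(cp) || cp == 0x22 || cp == 0x5C
  end
end

p must_escape("R\u00FCbe")      # empty: the u-umlaut is c-printable
p must_escape("bell\u0007")     # contains only the BEL code point
```

With this, "Rübe" would not need any escaping at all, while control
characters such as BEL or DEL (0x7F) still would.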
Bernd