[MacRuby-devel] [MacRuby] #339: YAML error with UTF-16 string
MacRuby
ruby-noreply at macosforge.org
Wed Dec 16 17:56:27 PST 2009
#339: YAML error with UTF-16 string
---------------------------+------------------------------------------------
 Reporter:  dev@…          |       Owner:  lsansonetti@…
     Type:  defect         |      Status:  closed
 Priority:  critical       |   Milestone:  MacRuby 0.5
Component:  MacRuby        |  Resolution:  fixed
 Keywords:  YAML encoding  |
---------------------------+------------------------------------------------
Comment(by neeracher@…):
I responded to this in e-mail, but to preserve the answer for posterity,
I'm copying it here.
The ruby1.9 encoding is simply wrong: it translates the accented character
into UTF-8, and then escapes the two UTF-8 bytes separately. What this
ends up encoding is "RÃ¼be", which is '''not''' what you want.
> I didn't find anything in YAML docs that describes that behaviour, both
> methods seem to be correct.
They can't possibly be BOTH correct, as interpreting the output of one
according to the theory of the other would give a different result. If you
look at the section in the YAML spec:
<http://www.yaml.org/spec/1.2/spec.html#id2776092>, you will see
[ 57 ] "Escaped 8-bit Unicode character."
This is NOT a UTF-8 character.
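To make the difference concrete, here is a minimal Ruby sketch. It does not call any YAML library; decode_x_escapes is a hypothetical helper that only mimics the spec's reading of \xNN as an 8-bit Unicode code point, and the byte-wise escaping imitates what the ticket attributes to ruby1.9.

```ruby
word = "R\u00FCbe"  # "Rübe"

# Escape each UTF-8 byte >= 0x80 on its own (the behaviour criticised above).
bytewise = word.bytes.map { |b|
  b < 0x80 ? [b].pack("U") : format('\x%02X', b)
}.join

# Escape each character >= U+0080 by its code point instead.
codepointwise = word.each_char.map { |c|
  c.ord < 0x80 ? c : format('\x%02X', c.ord)
}.join

# Hypothetical helper, for illustration only: read \xNN as the Unicode
# code point U+00NN, which is how the spec describes the escape.
def decode_x_escapes(str)
  str.gsub(/\\x([0-9A-Fa-f]{2})/) { [$1.hex].pack("U") }
end

puts bytewise                          # R\xC3\xBCbe
puts decode_x_escapes(bytewise)        # RÃ¼be
puts codepointwise                     # R\xFCbe
puts decode_x_escapes(codepointwise)   # Rübe
```

Only the code-point form survives a round trip through a reader that follows the spec; the byte-wise form comes back as two Latin-1 characters.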
> But ruby 1.8 fails to load the UTF-16 YAML. That is not astonishing
> because IMHO there is no way to guess what is the correct escaping mode.
It's not astonishing because (a) 1.8 has very poor Unicode support anyway
and (b) this would hardly be the only bug in syck.
> I think escaping is not necessary here because the encoding of input and
> output is the same. This can easily be tested by
{{{
$ macruby -e 'require "yaml"; puts YAML::load "--- Rübe"'
Rübe
}}}
That's an interesting point. I think you're right that the YAML spec does
not require escaping of printable characters >\u007F. However, non-
printable characters DO have to be escaped, and for the printable ones, it
could be argued that erring on the side of escaping helps readability if
the OS does not have font coverage for some printable characters. In any
case, the current implementation tries to be conservative in what it
generates and liberal in what it accepts. I'm open to persuasion that we
should avoid escaping characters, provided there is a low-cost test for
printability of general Unicode characters (I have not yet checked whether
one of the built-in CFCharacterSets can give that; the descriptions were
inconclusive).
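As one possible starting point, here is a Ruby sketch of a range-based printability test. It assumes the code point ranges of the c-printable production in the YAML 1.2 spec, as I recall them; it is not based on any CFCharacterSet and says nothing about font coverage. The names YAML_PRINTABLE, yaml_printable? and escape_nonprintable are made up for illustration, and the sketch ignores the quote and backslash escaping that a real double-quoted scalar also needs.

```ruby
# Assumed ranges: c-printable from the YAML 1.2 spec.
YAML_PRINTABLE = [
  0x09..0x09, 0x0A..0x0A, 0x0D..0x0D, 0x20..0x7E, 0x85..0x85,
  0xA0..0xD7FF, 0xE000..0xFFFD, 0x10000..0x10FFFF
].freeze

# A handful of range comparisons per character, so the test is cheap.
def yaml_printable?(codepoint)
  YAML_PRINTABLE.any? { |range| range.cover?(codepoint) }
end

# Leave printable characters alone; escape the rest by code point,
# using the shortest of the \x, \u and \U forms that fits.
def escape_nonprintable(str)
  str.each_char.map { |ch|
    cp = ch.ord
    if yaml_printable?(cp)
      ch
    elsif cp <= 0xFF
      format('\x%02X', cp)
    elsif cp <= 0xFFFF
      format('\u%04X', cp)
    else
      format('\U%08X', cp)
    end
  }.join
end

puts escape_nonprintable("R\u00FCbe\u0007")  # Rübe\x07
```

With these ranges, "ü" (U+00FC) passes through unescaped while control characters such as BEL are still escaped.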
--
Ticket URL: <http://www.macruby.org/trac/ticket/339#comment:4>
MacRuby <http://macruby.org/>