String#sub/gsub and text encodings
Hi,

I just wrote a simple script for text processing and encountered a problem with String#sub/gsub.

Original text: UTF-8 encoded text containing only ASCII characters
Replacing text: UTF-8 encoded text with ASCII and non-ASCII characters (including Japanese characters)
The resulting text: all the non-ASCII characters were garbage.

When I split the original text at the strings to be replaced and inserted the replacing text at these places, the resulting string object was fine; all the characters were kept as they should be in UTF-8 encoding.

I checked the tickets, but couldn't find anything like this. Is this a known issue?

Best,
Yasu
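The split-and-insert workaround described in this message can be sketched in plain Ruby. This is only an illustration under assumptions: the helper name `replace_by_split` is hypothetical, あ stands in for the non-ASCII replacement text, and the original script was not posted at this point in the thread.

```ruby
# encoding: UTF-8

# Hypothetical helper: replace every match of `pattern` by splitting the
# text at the matches and joining the pieces with the replacement.
# Note: if `pattern` contains capture groups, String#split also returns the
# captured text, so this sketch assumes a pattern without groups.
def replace_by_split(text, pattern, replacement)
  # A limit of -1 keeps trailing empty pieces, so a match at the very end
  # of the text is replaced as well.
  text.split(pattern, -1).join(replacement)
end

original = "this is a test script."
result = replace_by_split(original, /test/, "あ")

puts result
puts result.encoding
puts result.valid_encoding?
```

Because the pieces and the replacement are all UTF-8 strings, the joined result stays UTF-8; no in-place modification of the original string is involved.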
Hi,

Can you post some sample code?

Thanks

On Sun, May 15, 2011 at 11:50, Yasu Imao <yimao.ml@gmail.com> wrote:
> I just wrote a simple script for text processing and encountered a problem with String#sub/gsub.
> [...]

_______________________________________________
MacRuby-devel mailing list
MacRuby-devel@lists.macosforge.org
http://lists.macosforge.org/mailman/listinfo.cgi/macruby-devel
If the script behaves differently in CRuby 1.9, then a ticket will be helpful too, as it is likely something we need to fix. I don't know offhand whether it's a well-known issue, but we can figure that out later. Filing dups is always a good idea, as it helps us prioritize work.

Thanks,
Laurent

On May 15, 2011, at 8:10 AM, Caio Chassot wrote:
> Hi,
>
> Can you post some sample code?
>
> Thanks
On 2011-05-16, at 00:37 , Laurent Sansonetti wrote:
> Filing dups is always a good idea, as it helps us prioritize work.

Oh Laurent, please let's not encourage this terrible dupes-as-voting-system practice; just add a +1 button to tickets.

Filing proper tickets is hard work. Time shouldn't be spent on filing known dupes. (Filing an accidental dupe is still preferable to not filing anything at all because you were too lazy to search for an existing ticket.)

Ah yes, this is my personal opinion. Laurent actually runs this thing; I just troll here.

/petpeeve
Hi Caio,

On May 16, 2011, at 2:21 PM, Caio Chassot wrote:
> Oh Laurent, please let's not encourage this terrible dupes-as-voting-system practice; just add a +1 button to tickets.
>
> Filing proper tickets is hard work. Time shouldn't be spent on filing known dupes. (Filing an accidental dupe is still preferable to not filing anything at all because you were too lazy to search for an existing ticket.)
Well, I think it's quicker to file a dup than to search through the entire database (using the awful Trac interface, but that's another topic) just to comment "hey, me too!". And let's also consider false positives (people thinking a ticket describes their bug when it doesn't). I think it's a better idea to let the team triage the bugs, since they know the big picture.

Laurent
Hi,

I finally found time to investigate this further. I thought it was sub/gsub in general, but it turns out to be sub!/gsub! in a special case: when the text is read from a file with NSMutableString#initWithContentsOfFile:encoding:error:.

I created a UTF-8 text file whose content is:

    this is a test script.

Then here's what I tested:

    #!/usr/local/bin/macruby
    # -*- encoding: UTF-8 -*-
    framework 'Cocoa'

    a = "this is a test script."
    b = NSMutableString.alloc.initWithContentsOfFile("test.txt", encoding:NSUTF8StringEncoding, error:nil)

    p a.encoding #=> #<Encoding:UTF-8>
    p b.encoding #=> #<Encoding:UTF-8>

    # Non-destructive sub works for both receivers.
    print a.sub(/test/, "あ") #=> this is a あ script.
    print b.sub(/test/, "あ") #=> this is a あ script.

    # Destructive sub! works on the Ruby string literal...
    a.sub!(/test/, "あ")
    print a #=> this is a あ script.

    # ...but garbles the replacement when the receiver was read from the file.
    b.sub!(/test/, "あ")
    print b #=> This is a ￣ﾁﾂ script.

Am I doing something wrong? If not, I'll file a ticket.

Best,
Yasu
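For comparison with CRuby, a minimal reference check can be written without any Cocoa classes. It is a sketch of what stock Ruby 1.9 or later is expected to print for an ordinary String receiver; it does not reproduce the NSMutableString case, and the variable names are illustrative only.

```ruby
# encoding: UTF-8

original    = "this is a test script."
replacement = "あ"

# Non-destructive form: returns a new string.
copy = original.sub(/test/, replacement)

# Destructive form: modifies the receiver in place.
in_place = original.dup
in_place.sub!(/test/, replacement)

puts copy
puts in_place
puts copy == in_place          # both forms should agree
puts in_place.encoding
puts in_place.valid_encoding?

# The replacement should appear as the three UTF-8 bytes of あ (E3 81 82).
puts in_place.bytes.map { |byte| "%02X" % byte }.join(" ")
```

If the destructive and non-destructive results differ, or the byte dump does not contain E3 81 82 at the replaced position, the string was altered somewhere other than in the source text.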
Hi,

> I finally found time to further investigate this.

Thanks for investigating it.

> Am I doing something wrong? If not, I'll file a ticket.

No, it's indeed a bug. Could you please file a ticket?

By the way, I've made a shorter version of your code:

    framework 'Cocoa'

    s1 = NSMutableString.stringWithString("this is a test script.")
    s1.sub!(/test/, "あ")
    puts s1 #=> this is a ￣ﾁﾂ script.

    s2 = NSMutableString.stringWithString("this is a test script.")
    s2[10...14] = "あ" # characters 10 through 13, i.e. "test"
    puts s2 #=> this is a ￣ﾁﾂ script.

(In fact, sub! uses []= internally.)

Cheers,
Vincent
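The relationship between sub! and []= mentioned here can be illustrated in plain Ruby. The helper below is hypothetical and is not MacRuby's actual implementation; it only shows how a first-match replacement can be expressed through String#[]=, which is why a bug in element assignment would also surface in sub!.

```ruby
# encoding: UTF-8

# Hypothetical helper (not MacRuby's actual implementation): replace the first
# match of `pattern` in `str` in place, using only String#[]=.
# Returns the modified string, or nil when nothing matched (like sub!).
def sub_via_element_assignment!(str, pattern, replacement)
  match = pattern.match(str)
  return nil unless match

  # MatchData#begin and #end return character offsets, which is what
  # String#[]= expects for a Range index.
  str[match.begin(0)...match.end(0)] = replacement
  str
end

s = "this is a test script."
sub_via_element_assignment!(s, /test/, "あ")
puts s

# Compare against the built-in sub! on an identical string.
expected = "this is a test script."
expected.sub!(/test/, "あ")
puts s == expected
```

Unlike the real sub!, this sketch does not interpret back-references such as \1 in the replacement; it inserts the replacement text literally.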
participants (4)
- Caio Chassot
- Laurent Sansonetti
- Vincent Isambart
- Yasu Imao