Scanning Unicode strings for non-ascii characters
Hello all,
This may be obvious, but in a Unicode world it's driving me nuts. Given an arbitrary string, which may contain Unicode characters, how do I replace all characters not in the range 0x20..0x7e with spaces?
Thanks for any guidance,
Bob Schaaf
AIU Holdings
On Mar 3, 2009, at 12:37 PM, Robert Schaaf wrote:
This may be obvious, but in a Unicode world it's driving me nuts. Given an arbitrary string, which may contain unicode characters, how do I replace all characters not in the range 0x20..0x7e with spaces?
This isn't really a MacRuby-related question, but here you go (:
string.unpack('U*').select { |c| (0x20..0x7e).include?(c) }.pack('U*')
There are probably 200 other solutions, but this seems to be the easiest one. Remember that this is not very fast, and you probably want to use Iconv or something similar if you're processing large pieces of text.
Manfred
At 12:45 +0100 3/3/09, Manfred Stienstra wrote:
On Mar 3, 2009, at 12:37 PM, Robert Schaaf wrote:
string.unpack('U*').select { |c| (0x20..0x7e).include?(c) }.pack('U*')
It looks to me like this is a solution for a different problem; that is, discarding characters outside of the specified range. Also, do we want to map newlines, etc? Anyway, irb sez:
a = "abc\x0adef"
=> "abc\ndef"
a.gsub(/[^\x20-\x7e]/, ' ')
=> "abc def"
a.gsub(/[^\x00-\x7e]/, ' ')
=> "abc\ndef"
-r
--
http://www.cfcl.com/rdm Rich Morin
http://www.cfcl.com/rdm/resume rdm@cfcl.com
http://www.cfcl.com/rdm/weblog +1 650-873-7841
Technical editing and writing, programming, and web development
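The irb session above only exercises a newline. As a rough sketch of how the same two patterns behave on text that actually contains non-ASCII characters, the following assumes Ruby 1.9 or later with UTF-8 strings; the sample text and variable names are made up for illustration.

```ruby
# Hypothetical sample: "résumé", a newline, then plain ASCII.
# Assumes UTF-8 strings, so each "é" is one character, not two bytes.
text = "r\u00e9sum\u00e9\nline two"

# Blank everything outside 0x20..0x7e, including the newline.
strict = text.gsub(/[^\x20-\x7e]/, ' ')

# Blank only characters above 0x7e, so control characters
# such as the newline survive.
lenient = text.gsub(/[^\x00-\x7e]/, ' ')

puts strict.inspect
puts lenient.inspect
```

Under those assumptions each accented character becomes exactly one space, so the character count of the string does not change.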
On Mar 3, 2009, at 4:18 PM, Rich Morin wrote:
It looks to me like this is a solution for a different problem; that is, discarding characters outside of the specified range. Also, do we want to map newlines, etc? Anyway, irb sez:
Oops, I misread that. Yeah, gsub is probably faster.
string.unpack('U*').map { |c| (0x20..0x7e).include?(c) ? c : 32 }.pack('U*')
Just throwing out characters doesn't seem like a likely use case anyway.
Manfred
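To make the difference between the two unpack/pack variants concrete, here is a small sketch that runs both on the same input. The method names and the sample string are invented for this illustration, and it assumes UTF-8 input.

```ruby
# Hypothetical sample: "naïve", a tab, then "café" (UTF-8 assumed).
sample = "na\u00efve\tcaf\u00e9"

# select keeps only code points in 0x20..0x7e, so out-of-range
# characters are discarded and the string gets shorter.
def drop_non_printable(str)
  str.unpack('U*').select { |c| (0x20..0x7e).include?(c) }.pack('U*')
end

# map substitutes 32 (the code point of a space) for each
# out-of-range code point, so the character count is preserved.
def blank_non_printable(str)
  str.unpack('U*').map { |c| (0x20..0x7e).include?(c) ? c : 32 }.pack('U*')
end

dropped = drop_non_printable(sample)
blanked = blank_non_printable(sample)

puts dropped.inspect
puts blanked.inspect
```

Only the map version answers the original question, since it replaces rather than removes; note that the tab is affected in both cases because it lies below 0x20.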
Well, my medication has finally worn off, and I came up with this:
a_string.tr('^ -~', ' ')
Any comments on efficiency? God bless ASCII for being contiguous. All this is to clean up imperfectly mapped EBCDIC (eeeww!)
Thanks for the suggestions.
Bob Schaaf
On Mar 3, 2009, at 10:34 AM, Manfred Stienstra wrote:
Oops, I misread that. Yeah, gsub is probably faster.
string.unpack('U*').map { |c| (0x20..0x7e).include?(c) ? c : 32 }.pack('U*')
Just throwing out characters doesn't seem like a likely use case anyway.
Manfred
At 21:30 -0500 3/3/09, Robert Schaaf wrote:
a_string.tr('^ -~', ' ') Any comments on efficiency?
That's pretty much equivalent to this code:
a.gsub(/[^\x20-\x7e]/, ' ')
It may or may not be faster, more to your taste, etc. Before using it, be sure that you don't want to preserve characters such as tabs and/or newlines...
-r
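For anyone who wants to check the equivalence and the efficiency question on their own machine, a rough sketch follows. It assumes Ruby 1.9 or later with UTF-8 strings; the sample line, the repeat count, and the variable names are arbitrary choices for illustration, and the timings will vary by interpreter and input.

```ruby
require 'benchmark'

# Hypothetical input: "café", a tab, "au lait", a newline.
line = "caf\u00e9\tau lait\n"

# Both forms blank the accented character, the tab, and the newline.
via_tr   = line.tr('^ -~', ' ')
via_gsub = line.gsub(/[^\x20-\x7e]/, ' ')

# Variant that preserves tabs and newlines by adding them to the
# set of characters tr leaves alone.
keep_ws = line.tr("^ -~\t\n", ' ')

# Arbitrary repeat count, just to get a measurable amount of work.
big = line * 10_000

Benchmark.bm(8) do |bm|
  bm.report('tr')     { big.tr('^ -~', ' ') }
  bm.report('gsub')   { big.gsub(/[^\x20-\x7e]/, ' ') }
  bm.report('unpack') do
    big.unpack('U*').map { |c| (0x20..0x7e).include?(c) ? c : 32 }.pack('U*')
  end
end
```

No particular ranking is claimed here; the point is only that the three approaches can be timed side by side with the standard Benchmark library.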