[MacRuby] #225: regexp engine broken when a string contains non ascii characters
#225: regexp engine broken when a string contains non ascii characters
-------------------------------------+--------------------------------------
 Reporter:  mattaimonetti@…          |       Owner:  lsansonetti@…
     Type:  defect                   |      Status:  new
 Priority:  critical                 |   Milestone:  MacRuby 0.4
Component:  MacRuby                  |    Keywords:  regexp, bug
-------------------------------------+--------------------------------------
 Here is some sample code to reproduce the problem:

 {{{
 html = %{<p><a href="http://www.flickr.com/people/jeanelietrujillo/">jeanelietrujillo</a> posted a photo:</p>
 <p><a href="http://www.flickr.com/photos/jeanelietrujillo/2211862262/" title="Galgani Décoration"><img src="http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg" width="240" height="240" alt="Galgani Décoration" /></a></p>}

 html.scan(/<img\s+src="(.+?)"/)[0][0]
 }}}

 ruby 1.9 returns:
 {{{
 => "http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg"
 }}}

 macruby returns:
 {{{
 => "ttp://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg\""
 }}}

 Now let's replace the é in the title attribute (the one that precedes the match) with an e:

 {{{
 html = %{<p><a href="http://www.flickr.com/people/jeanelietrujillo/">jeanelietrujillo</a> posted a photo:</p>
 <p><a href="http://www.flickr.com/photos/jeanelietrujillo/2211862262/" title="Galgani Decoration"><img src="http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg" width="240" height="240" alt="Galgani Décoration" /></a></p>}

 html.scan(/<img\s+src="(.+?)"/)[0][0]
 }}}

 MacRuby now returns:
 {{{
 => "http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg"
 }}}

 My guess is that the unicode characters throw off the offsets used to extract the matched string, so the substring starts and ends one character too late for each non-ASCII character that precedes the match.
 To test my hypothesis, here is another sample, this time with two "é" characters before the match:

 {{{
 html = %{<p><a href="http://www.flickr.com/people/jeanelietrujillo/">jeanelietrujillo</a>a posté une photo:</p>
 <p><a href="http://www.flickr.com/photos/jeanelietrujillo/2211862262/" title="Galgani Décoration"><img src="http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg" width="240" height="240" alt="Galgani Décoration" /></a></p>}

 html.scan(/<img\s+src="(.+?)"/)[0][0]
 }}}

 MacRuby returns:
 {{{
 => "tp://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg\" "
 }}}

 This time the result is shifted by two characters, one for each "é" before the match.

-- 
Ticket URL: <http://www.macruby.org/trac/ticket/225>
MacRuby <http://macruby.org/>
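
The shift reported above can be reproduced in plain Ruby 1.9+ under one assumption that the ticket does not confirm: that the engine reports UTF-8 byte offsets, which are then used as character indices. The helper `capture_using_byte_offsets`, the constant `IMG_SRC`, and the shortened `example.com` strings are hypothetical and only illustrate the guess; they are not MacRuby code.

```ruby
# encoding: utf-8

# Hypothetical helper (not MacRuby code): extracts capture group 1, but
# deliberately treats its UTF-8 byte offsets as character indices.
def capture_using_byte_offsets(str, regexp)
  m = regexp.match(str)
  return nil unless m
  # Convert the (correct) character offsets into byte offsets.
  byte_begin = str[0...m.begin(1)].bytesize
  byte_end   = str[0...m.end(1)].bytesize
  # The suspected mistake: index characters with byte offsets.
  str[byte_begin...byte_end]
end

IMG_SRC = /<img\s+src="(.+?)"/

# One "é" before the match, then two "é" before the match.
one = %{<a title="Décoration"><img src="http://example.com/a.jpg" alt="x" /></a>}
two = %{<p>posté</p><a title="Décoration"><img src="http://example.com/a.jpg" alt="x" /></a>}

correct        = one.scan(IMG_SRC)[0][0]
shifted_by_one = capture_using_byte_offsets(one, IMG_SRC)
shifted_by_two = capture_using_byte_offsets(two, IMG_SRC)

p correct         # "http://example.com/a.jpg"
p shifted_by_one  # "ttp://example.com/a.jpg\""
p shifted_by_two  # "tp://example.com/a.jpg\" "
```

Each "é" occupies two bytes in UTF-8, so every one that precedes the match moves both ends of the extracted substring one character to the right, which matches the pattern of the MacRuby results in the report.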
#225: regexp engine broken when a string contains non ascii characters

Comment(by mattaimonetti@…):

 Looks like a duplicate of http://www.macruby.org/trac/ticket/94

-- 
Ticket URL: <http://www.macruby.org/trac/ticket/225#comment:1>
MacRuby <http://macruby.org/>
#225: regexp engine broken when a string contains non ascii characters
-------------------------------------+--------------------------------------
 Reporter:  mattaimonetti@…          |        Owner:  lsansonetti@…
     Type:  defect                   |       Status:  closed
 Priority:  critical                 |    Milestone:  MacRuby 0.4
Component:  MacRuby                  |   Resolution:  duplicate
 Keywords:  regexp, bug              |
-------------------------------------+--------------------------------------
Changes (by mattaimonetti@…):

  * status:  new => closed
  * resolution:  => duplicate

-- 
Ticket URL: <http://www.macruby.org/trac/ticket/225#comment:2>
MacRuby <http://macruby.org/>