[MacRuby-devel] [MacRuby] #225: regexp engine broken when a string contains non ascii characters

MacRuby ruby-noreply at macosforge.org
Sun Mar 1 01:33:36 PST 2009


#225: regexp engine broken when a string contains non ascii characters
-------------------------------------+--------------------------------------
 Reporter:  mattaimonetti@…          |       Owner:  lsansonetti@…        
     Type:  defect                   |      Status:  new                  
 Priority:  critical                 |   Milestone:  MacRuby 0.4          
Component:  MacRuby                  |    Keywords:  regexp, bug          
-------------------------------------+--------------------------------------
 Here is some sample code to reproduce the problem:

 {{{
 html = %{<p><a
 href="http://www.flickr.com/people/jeanelietrujillo/">jeanelietrujillo</a>
 posted a photo:</p>
 <p><a href="http://www.flickr.com/photos/jeanelietrujillo/2211862262/"
 title="Galgani Décoration"><img
 src="http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg"
 width="240" height="240" alt="Galgani Décoration" /></a></p>}

 html.scan(/<img\s+src="(.+?)"/)[0][0]
 }}}

 Ruby 1.9 returns:
 {{{
   => "http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg"
 }}}

 MacRuby returns:

 {{{
   => "ttp://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg\""
 }}}

 Now let's remove the é that precedes the match and replace it with an e:


 {{{
 html = %{<p><a
 href="http://www.flickr.com/people/jeanelietrujillo/">jeanelietrujillo</a>
 posted a photo:</p>
 <p><a href="http://www.flickr.com/photos/jeanelietrujillo/2211862262/"
 title="Galgani Decoration"><img
 src="http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg"
 width="240" height="240" alt="Galgani Décoration" /></a></p>}

 html.scan(/<img\s+src="(.+?)"/)[0][0]
 }}}

 MacRuby now returns:


 {{{
   => "http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg"
 }}}

 My guess is that the non-ASCII characters throw off the offsets used to
 extract the matched string, so the substring starts one character too late
 (and runs one character past the end of the match).


 To test my hypothesis, here is another sample, this time with 2 "é"
 characters before the match:

 {{{
 html = %{<p><a
 href="http://www.flickr.com/people/jeanelietrujillo/">jeanelietrujillo</a>a
 posté une photo:</p>
 <p><a href="http://www.flickr.com/photos/jeanelietrujillo/2211862262/"
 title="Galgani Décoration"><img
 src="http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg"
 width="240" height="240" alt="Galgani Décoration" /></a></p>}

 html.scan(/<img\s+src="(.+?)"/)[0][0]
 }}}

 MacRuby returns:

 {{{
   => "tp://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg\" "
 }}}
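
 The hypothesis can be sketched in plain Ruby 1.9 or later. This is only an
 illustration under an assumption: that the engine reports byte offsets and
 those offsets are then used as character indices. The helper
 extract_with_byte_offsets is hypothetical and is not MacRuby code; the
 input strings are shortened versions of the samples above.

```ruby
# Hypothetical helper (not MacRuby source): mimics the suspected bug by
# using the BYTE offset of capture group 1 as if it were a CHARACTER index.
def extract_with_byte_offsets(str, re)
  m = str.match(re)
  return nil unless m
  # m.begin(1) is a character offset; count the bytes that precede it.
  byte_start = str[0...m.begin(1)].bytesize
  # Slice by characters, but with byte-based start and length.
  str[byte_start, m[1].bytesize]
end

re  = /<img\s+src="(.+?)"/
img = %{<img src="http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg" width="240"}

# No "é" before the match: byte and character offsets agree.
p extract_with_byte_offsets(%{title="Galgani Decoration">} + img, re)
# One "é" (2 bytes in UTF-8) before the match: shifted by one character.
p extract_with_byte_offsets(%{title="Galgani Décoration">} + img, re)
# Two "é" before the match: shifted by two characters.
p extract_with_byte_offsets(%{posté une photo: title="Galgani Décoration">} + img, re)
```

 If the assumption holds, each "é" before the match shifts the result one
 character to the right, which is the pattern seen in the MacRuby output
 above ("ttp://" with one "é", "tp://" with two).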

-- 
Ticket URL: <http://www.macruby.org/trac/ticket/225>
MacRuby <http://macruby.org/>


