Revision: 77344
http://trac.macports.org/changeset/77344
Author: l2g@macports.org
Date: 2011-03-27 15:38:48 -0700 (Sun, 27 Mar 2011)

Log Message:
-----------
p5-html-encoding: new version 0.61; assumed maintainership (open); added license info; edited description; added patch with my significant edits to the perldoc, hopefully improving grammar and clarity

Modified Paths:
--------------
trunk/dports/perl/p5-html-encoding/Portfile

Added Paths:
-----------
trunk/dports/perl/p5-html-encoding/files/
trunk/dports/perl/p5-html-encoding/files/patch-lib-HTML-Encoding.pm.diff

Modified: trunk/dports/perl/p5-html-encoding/Portfile
===================================================================
--- trunk/dports/perl/p5-html-encoding/Portfile 2011-03-27 22:15:56 UTC (rev 77343)
+++ trunk/dports/perl/p5-html-encoding/Portfile 2011-03-27 22:38:48 UTC (rev 77344)
@@ -4,20 +4,21 @@
 PortSystem 1.0
 PortGroup perl5 1.0
-perl5.setup HTML-Encoding 0.60
-revision 2
+perl5.setup HTML-Encoding 0.61
 platforms darwin
-maintainers nomaintainer
+maintainers l2g openmaintainer
+license Artistic GPL
 supported_archs noarch
-description Determines the encoding of HTML and XML/XHTML documents
+description Determine the encoding of HTML/XML/XHTML documents
 long_description ${description}
-checksums md5 b6a0ded3d1a085bc7b3cdb5ae07e89d2 \
- sha1 7bff5ea3a512dd8a1667f8d2ee16c3505d512da2 \
- rmd160 e4f2621e737598bf758c691bf605b44a65c42af3
+checksums sha1 539c09038c812ae8b2215ab3824b69e50e20b33c \
+ rmd160 568d0d6b46778644802b9e4f5ac4642a4ad1c419
+patchfiles patch-lib-HTML-Encoding.pm.diff
+
 depends_lib-append port:p5-encode \
 port:p5-html-parser \
 port:p5-libwww-perl

Added: trunk/dports/perl/p5-html-encoding/files/patch-lib-HTML-Encoding.pm.diff
===================================================================
--- trunk/dports/perl/p5-html-encoding/files/patch-lib-HTML-Encoding.pm.diff (rev 0)
+++ trunk/dports/perl/p5-html-encoding/files/patch-lib-HTML-Encoding.pm.diff 2011-03-27 22:38:48 UTC (rev
77344)
@@ -0,0 +1,463 @@
+--- lib/HTML/Encoding.pm.orig 2011-03-27 14:53:03.000000000 -0700
++++ lib/HTML/Encoding.pm 2011-03-27 15:28:44.000000000 -0700
+@@ -523,20 +523,20 @@
+ 
+ =head1 WARNING
+ 
+-The interface and implementation are guranteed to change before this
++The interface and implementation are guaranteed to change before this
+ module reaches version 1.00! Please send feedback to the author of
+ this module.
+ 
+ =head1 DESCRIPTION
+ 
+ HTML::Encoding helps to determine the encoding of HTML and XML/XHTML
+-documents...
++documents.
+ 
+ =head1 DEFAULT ENCODINGS
+ 
+-Most routines need to know some suspected character encodings which
++Most routines need to know some suspected character encodings; these
+ can be provided through the C<encodings> option. This option always
+-defaults to the $HTML::Encoding::DEFAULT_ENCODINGS array reference
++defaults to the $HTML::Encoding::DEFAULT_ENCODINGS array reference,
+ which means the following encodings are considered by default:
+ 
+ * ISO-8859-1
+@@ -546,7 +546,7 @@
+ * UTF-32BE
+ * UTF-8
+ 
+-If you change the values or pass custom values to the routines note
++If you change the values or pass custom values to the routines, note
+ that L<Encode> must support them in order for this module to work
+ correctly.
+ 
+@@ -554,7 +554,7 @@
+ 
+ C<encoding_from_xml_document>, C<encoding_from_html_document>, and
+ C<encoding_from_http_message> return in list context the encoding
+-source and the encoding name, possible encoding sources are
++source and the encoding name. Possible encoding sources are:
+ 
+ * protocol (Content-Type: text/html;charset=encoding)
+ * bom (leading U+FEFF)
+@@ -565,21 +565,21 @@
+ 
+ =head1 ROUTINES
+ 
+-Routines exported by this module at user option. By default, nothing
+-is exported.
++Routines may be exported by this module at the user's option. By
++default, nothing is exported.
+ 
+ =over 2
+ 
+ =item encoding_from_content_type($content_type)
+ 
+ Takes a byte string and uses L<HTTP::Headers::Util> to extract the
+-charset parameter from the C<Content-Type> header value and returns
++charset parameter from the C<Content-Type> header value. Returns
+ its value or C<undef> (or an empty list in list context) if there
+ is no such value. Only the first component will be examined
+ (HTTP/1.1 only allows for one component), any backslash escapes in
+ strings will be unescaped, all leading and trailing quote marks
+ and white-space characters will be removed, all white-space will be
+-collapsed to a single space, empty charset values will be ignored
++collapsed to a single space, empty charset values will be ignored,
+ and no case folding is performed.
+ 
+ Examples:
+@@ -596,28 +596,28 @@
+ | "text/html;charset=\" UTF-8 \"" | 'UTF-8' |
+ +-----------------------------------------+-----------+
+ 
+-If you pass a string with the UTF-8 flag turned on the string will
++If you pass a string with the UTF-8 flag turned on, the string will
+ be converted to bytes before it is passed to L<HTTP::Headers::Util>.
+-The return value will thus never have the UTF-8 flag turned on (this
+-might change in future versions).
++The return value will thus never have the UTF-8 flag turned on. (This
++might change in future versions.)
+ 
+ =item encoding_from_byte_order_mark($octets [, %options])
+ 
+-Takes a sequence of octets and attempts to read a byte order mark
+-at the beginning of the octet sequence. It will go through the list
+-of $options{encodings} or the list of default encodings if no
+-encodings are specified and match the beginning of the string against
+-any byte order mark octet sequence found.
+-
+-The result can be ambiguous, for example qq(\xFF\xFE\x00\x00) could
+-be both, a complete BOM in UTF-32LE or a UTF-16LE BOM followed by a
+-U+0000 character.
It is also possible that C<$octets> starts with
++Takes a sequence of octets and attempts to read a byte order mark at the
++beginning of the octet sequence. It will go through the list of
++$options{encodings} (or the list of default encodings if no encodings
++are specified) and match the beginning of the string against any byte
++order mark octet sequence found.
++
++The result can be ambiguous. For example, qq(\xFF\xFE\x00\x00) could
++be either a complete BOM in UTF-32LE or a UTF-16LE BOM followed by a
++U+0000 character. It is also possible for C<$octets> to start with
+ something that looks like a byte order mark but actually is not.
+ 
+-encoding_from_byte_order_mark sorts the list of possible encodings
+-by the length of their BOM octet sequence and returns in scalar
+-context only the encoding with the longest match, and all encodings
+-ordered by length of their BOM octet sequence in list context.
++encoding_from_byte_order_mark sorts the list of possible encodings by
++the length of their BOM octet sequence. In scalar context, it returns
++only the encoding with the longest match. In list context, it returns
++all encodings ordered by length of their BOM octet sequence.
+ 
+ Examples:
+ 
+@@ -634,9 +634,9 @@
+ | "\x2B\x2F\x76\x38\x2D" | UTF-7 | qw(UTF-7) |
+ +-------------------------+------------+-----------------------+
+ 
+-Note however that for UTF-7 it is in theory possible that the U+FEFF
+-combines with other characters in which case such detection would fail,
+-for example consider:
++Note, however, that for UTF-7 it is theoretically possible for U+FEFF to
++combine with other characters, in which case such detection would fail.
++For example, consider:
+ 
+ +--------------------------------------+-----------+-----------+
+ | Input | Encodings | Result |
+@@ -649,15 +649,17 @@
+ relevant for most applications as there should never be need to use
+ UTF-7 in the encoding list for existing documents.
+ 
+-If no BOM can be found it returns C<undef> in scalar context and an
+-empty list in list context. This routine should not be used with
+-strings with the UTF-8 flag turned on.
++If no BOM can be found, it returns C<undef> in scalar context or an
++empty list in list context.
++
++This routine should not be used with strings with the UTF-8 flag turned
++on.
+ 
+ =item encoding_from_xml_declaration($declaration)
+ 
+ Attempts to extract the value of the encoding pseudo-attribute in an XML
+ declaration or text declaration in the character string $declaration. If
+-there does not appear to be such a value it returns nothing. This would
++there does not appear to be such a value, it returns nothing. This would
+ typically be used with the return values of xml_declaration_from_octets.
+ Normalizes whitespaces like encoding_from_content_type.
+ 
+@@ -688,12 +690,14 @@
+ =item xml_declaration_from_octets($octets [, %options])
+ 
+ Attempts to find a ">" character in the byte string $octets using the
+-encodings in $encodings and upon success attempts to find a preceding
+-"<" character. Returns all the strings found this way in the order of
+-number of successful matches in list context and the best match in
+-scalar context. Should probably be combined with the only user of this
+-routine, encoding_from_xml_declaration... You can modify the list of
+-suspected encodings using $options{encodings};
++encodings in $encodings, and upon success attempts to find a preceding
++"<" character. In list context, returns all the strings found this way
++in the order of number of successful matches; or in scalar context,
++returns the best match. You can modify the list of suspected encodings
++using $options{encodings};
++
++(Should probably be combined with the only user of this routine,
++encoding_from_xml_declaration...)
+ 
+ =item encoding_from_first_chars($octets [, %options])
+ 
+@@ -707,9 +711,9 @@
+ document is a HTML document) to get at least a base encoding which can be
+ used to decode enough of the document to find <meta> elements using
+ encoding_from_meta_element. $options{whitespace} defaults to qw/CR LF SP TB/.
+-Returns nothing if unsuccessful. Returns the matching encodings in order
+-of the number of octets matched in list context and the best match in
+-scalar context.
++Returns nothing if unsuccessful. In list context, returns the matching
++encodings in order of the number of octets matched. In scalar context,
++returns the best match.
+ 
+ Examples:
+ 
+@@ -742,14 +746,14 @@
+ <meta http-equiv=Content-Type content='...'>
+ 
+ are found, uses encoding_from_content_type to extract the charset
+-parameter. It returns all such encodings it could find in document
+-order in list context or the first encoding in scalar context (it
+-will currently look for others regardless of calling context) or
+-nothing if that fails for some reason.
++parameter. It returns (in list context) all such encodings it could find
++in document order, or (in scalar context) the first encoding, or nothing
++if that fails for some reason. (Currently it will look for any and all
++encodings even when called in scalar context.)
+ 
+-Note that there are many edge cases where this does not yield in
++Note that there are many edge cases where this does not yield
+ "proper" results depending on the capabilities of the HTML::Parser
+-version and the options you pass for it, for example,
++version and the options you pass for it. For example:
+ 
+ <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" [
+ <!ENTITY content_type "text/html;charset=utf-8">
+@@ -759,19 +763,19 @@
+ <p>...</p>
+ 
+ This would likely not detect the C<utf-8> value if HTML::Parser
+-does not resolve the entity. This should however only be a concern
++does not resolve the entity.
This should, however, only be a concern
+ for documents specifically crafted to break the encoding detection.
+ 
+ =item encoding_from_xml_document($octets, [, %options])
+ 
+-Uses encoding_from_byte_order_mark to detect the encoding using a
+-byte order mark in the byte string and returns the return value of
+-that routine if it succeeds. Uses xml_declaration_from_octets and
+-encoding_from_xml_declaration and returns the encoding for which
+-the latter routine found most matches in scalar context, and all
+-encodings ordered by number of occurences in list context. It
+-does not return a value of neither byte order mark not inbound
+-declarations declare a character encoding.
++Uses encoding_from_byte_order_mark to detect the encoding using a byte
++order mark in the byte string. Returns the return value of that routine
++if it succeeds. Uses xml_declaration_from_octets and
++encoding_from_xml_declaration, and (in scalar context) returns the
++encoding for which the latter routine found most matches, or (in list
++context) all encodings ordered by number of occurences. It does not
++return a value of neither byte order mark not inbound declarations
++declare a character encoding.
+ 
+ Examples:
+ 
+@@ -787,12 +791,12 @@
+ +----------------------------+----------+-----------+----------+
+ 
+ Lacking a return value from this routine and higher-level protocol
+-information (such as protocol encoding defaults) processors would
++information (such as protocol encoding defaults), processors would
+ be required to assume that the document is UTF-8 encoded.
+ 
+-Note however that the return value depends on the set of suspected
++Note, however, that the return value depends on the set of suspected
+ encodings you pass to it.
For example, by default, EBCDIC encodings
+-would not be considered and thus for
++would not be considered, and thus for
+ 
+ <?xml version='1.0' encoding='cp37'?>
+ 
+@@ -803,7 +807,7 @@
+ 
+ Uses encoding_from_xml_document and encoding_from_meta_element to
+ determine the encoding of HTML documents. If $options{xhtml} is
+-set to a false value uses encoding_from_byte_order_mark and
++set to a false value, uses encoding_from_byte_order_mark and
+ encoding_from_meta_element to determine the encoding. The xhtml
+ option is on by default. The $options{encodings} can be used to
+ modify the suspected encodings and $options{parser_options} can
+@@ -811,13 +815,13 @@
+ encoding_from_meta_element (see the relevant documentation).
+ 
+ Returns nothing if no declaration could be found, the winning
+-declaration in scalar context and a list of encoding source
+-and encoding name in list context, see ENCODING SOURCES.
++declaration in scalar context, or a list of encoding source
++and encoding name in list context. See L</"ENCODING SOURCES">.
+ 
+ ...
+ 
+ Other problems arise from differences between HTML and XHTML syntax
+-and encoding detection rules, for example, the input could be
++and encoding detection rules. For example, the input could be:
+ 
+ Content-Type: text/html
+ 
+@@ -829,14 +833,14 @@
+ <title></title>
+ <p>...</p>
+ 
+-This is a perfectly legal HTML 4.01 document and implementations
+-might be expected to consider the document ISO-8859-2 encoded as
+-XML rules for encoding detection do not apply to HTML documents.
+-This module attempts to avoid making decisions which rules apply
+-for a specific document and would thus by default return 'utf-8'
+-for this input.
++This is a perfectly legal HTML 4.01 document and implementations might
++be expected to consider the document to have ISO-8859-2 encoding, as XML
++rules for encoding detection do not apply to HTML documents.
This
++module attempts to avoid making decisions on which rules apply for a
++specific document, and would thus by default return 'utf-8' for this
++input.
+ 
+-On the other hand, if the input omits the encoding declaration,
++On the other hand, if the input omits the encoding declaration, thus:
+ 
+ Content-Type: text/html
+ 
+@@ -848,8 +852,10 @@
+ <title></title>
+ <p>...</p>
+ 
+-It would return 'iso-8859-2'. Similar problems would arise from
+-other differences between HTML and XHTML, for example consider
++it would return 'iso-8859-2'.
++
++Similar problems would arise from other differences between HTML and
++XHTML. For example, consider:
+ 
+ Content-Type: text/html
+ 
+@@ -864,69 +870,70 @@
+ 
+ If this is processed using HTML rules, the first > will end the
+ processing instruction and the XHTML document type declaration
+-would be the relevant declaration for the document, if it is
++would be the relevant declaration for the document. If it is
+ processed using XHTML rules, the ?> will end the processing
+ instruction and the HTML document type declaration would be the
+ relevant declaration.
+ 
+-IOW, an application would need to assume a certain character
+-encoding (family) to process enough of the document to determine
+-whether it is XHTML or HTML and the result of this detection would
+-depend on which processing rules are assumed in order to process it.
+-It is thus in essence not possible to write a "perfect" detection
+-algorithm, which is why this routine attempts to avoid making any
+-decisions on this matter.
++In other words, an application would need to assume a certain character
++encoding (family) to process enough of the document to determine whether
++it is XHTML or HTML, and the result of this detection would depend on
++which processing rules are assumed in order to process it. It is thus
++in essence not possible to write a "perfect" detection algorithm, which
++is why this routine attempts to avoid making any decisions on this
++matter.
+ 
+ =item encoding_from_http_message($message [, %options])
+ 
+-Determines the encoding of HTML / XML / XHTML documents enclosed
+-in HTTP message. $message is an object compatible to L<HTTP::Message>,
+-e.g. a L<HTTP::Response> object. %options is a hash with the following
+-possible entries:
++Determines the encoding of HTML/XML/XHTML documents enclosed in an HTTP
++message. $message is an object compatible withL<HTTP::Message>, e.g. a
++L<HTTP::Response> object. %options is a hash with the following possible
++entries:
+ 
+ =over 2
+ 
+ =item encodings
+ 
+-array references of suspected character encodings, defaults to
++Array references of suspected character encodings; defaults to
+ C<$HTML::Encoding::DEFAULT_ENCODINGS>.
+ 
+ =item is_html
+ 
+ Regular expression matched against the content_type of the message
+-to determine whether to use HTML rules for the entity body, defaults
++to determine whether to use HTML rules for the entity body; defaults
+ to C<qr{^text/html$}i>.
+ 
+ =item is_xml
+ 
+ Regular expression matched against the content_type of the message
+-to determine whether to use XML rules for the entity body, defaults
++to determine whether to use XML rules for the entity body; defaults
+ to C<qr{^.+/(?:.+\+)?xml$}i>.
+ 
+ =item is_text_xml
+ 
+ Regular expression matched against the content_type of the message
+-to determine whether to use text/html rules for the message, defaults
++to determine whether to use text/html rules for the message; defaults
+ to C<qr{^text/(?:.+\+)?xml$}i>. This will only be checked if is_xml
+-matches aswell.
++matches as well.
+ 
+ =item html_default
+ 
+-Default encoding for documents determined (by is_html) as HTML,
++Default encoding for documents determined (by is_html) as HTML;
+ defaults to C<ISO-8859-1>.
+ 
+ =item xml_default
+ 
+-Default encoding for documents determined (by is_xml) as XML,
++Default encoding for documents determined (by is_xml) as XML;
+ defaults to C<UTF-8>.
+ 
+ =item text_xml_default
+ 
+-Default encoding for documents determined (by is_text_xml) as text/xml,
+-defaults to C<undef> in which case the default is ignored. This should
+-be set to C<US-ASCII> if desired as this module is by default
+-inconsistent with RFC 3023 which requires that for text/xml documents
+-without a charset parameter in the HTTP header C<US-ASCII> is assumed.
++Default encoding for documents determined (by is_text_xml) as text/xml;
++defaults to C<undef>, in which case the default is ignored. This should
++be set to C<US-ASCII> if desired, as this module is by default
++inconsistent with RFC 3023; that RFC requires that for text/xml
++documents without a charset parameter in the HTTP header, C<US-ASCII> is
++assumed.
+ 
+ This requirement is inconsistent with RFC 2616 (HTTP/1.1) which requires
+ to assume C<ISO-8859-1>, has been widely ignored and is thus disabled by
+@@ -935,18 +942,18 @@
+ =item xhtml
+ 
+ Whether the routine should look for an encoding declaration in the
+-XML declaration of the document (if any), defaults to C<1>.
++XML declaration of the document (if any); defaults to C<1>.
+ 
+ =item default
+ 
+ Whether the relevant default value should be returned when no other
+-information can be determined, defaults to C<1>.
++information can be determined; defaults to C<1>.
+ 
+ =back
+ 
+-This is furhter possibly inconsistent with XML MIME types that differ
+-in other ways from application/xml, for example if the MIME Type does
+-not allow for a charset parameter in which case applications might be
++This is possibly further inconsistent with XML MIME types that differ
++in other ways from application/xml (for example, if the MIME type does
++not allow for a charset parameter), in which case applications might be
+ expected to ignore the charset parameter if erroneously provided.
+ 
+ =back
+@@ -954,17 +961,17 @@
+ =head1 EBCDIC SUPPORT
+ 
+ By default, this module does not support EBCDIC encodings.
To enable
+-support for EBCDIC encodings you can either change the
++support for EBCDIC encodings, you can either change the
+ $HTML::Encodings::DEFAULT_ENCODINGS array reference or pass the
+-encodings to the routines you use using the encodings option, for
+-example
++encodings to the routines you use using the encodings option; for
++example:
+ 
+ my @try = qw/UTF-8 UTF-16LE cp500 posix-bc .../;
+ my $enc = encoding_from_xml_document($doc, encodings => \@try);
+ 
+ Note that there are some subtle differences between various EBCDIC
+-encodings, for example C<!> is mapped to 0x5A in C<posix-bc> and
+-to 0x4F in C<cp500>; these differences might affect processing in
++encodings. For example, C<!> is mapped to 0x5A in C<posix-bc> and
++to 0x4F in C<cp500>. These differences might affect processing in
+ yet undetermined ways.
+ 
+ =head1 TODO
+@@ -994,4 +1001,8 @@
+ Copyright (c) 2004-2008 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
+ This module is licensed under the same terms as Perl itself.
+ 
++ This document has been edited for grammar, spelling, and clarity by
++ Larry Gilbert <l2g@macports.org> for the MacPorts Project. (Some
++ especially opaque passages have been left alone.)
++
+ =cut