[77344] trunk/dports/perl/p5-html-encoding

27 Mar 2011

Revision: 77344
          http://trac.macports.org/changeset/77344
Author:   l2g@macports.org
Date:     2011-03-27 15:38:48 -0700 (Sun, 27 Mar 2011)
Log Message:
-----------
p5-html-encoding: new version 0.61; assumed maintainership (open); added
license info; edited description; added patch with my significant edits
to the perldoc, hopefully improving grammar and clarity

Modified Paths:
--------------
    trunk/dports/perl/p5-html-encoding/Portfile

Added Paths:
-----------
    trunk/dports/perl/p5-html-encoding/files/
    trunk/dports/perl/p5-html-encoding/files/patch-lib-HTML-Encoding.pm.diff

Modified: trunk/dports/perl/p5-html-encoding/Portfile
===================================================================

--- trunk/dports/perl/p5-html-encoding/Portfile	2011-03-27 22:15:56 UTC (rev 77343)
+++ trunk/dports/perl/p5-html-encoding/Portfile	2011-03-27 22:38:48 UTC (rev 77344)
@@ -4,20 +4,21 @@
 PortSystem          1.0
 PortGroup           perl5 1.0
 
-perl5.setup         HTML-Encoding 0.60
-revision            2
+perl5.setup         HTML-Encoding 0.61
 platforms           darwin
-maintainers         nomaintainer
+maintainers         l2g openmaintainer
+license             Artistic GPL
 supported_archs     noarch
 
-description         Determines the encoding of HTML and XML/XHTML documents
+description         Determine the encoding of HTML/XML/XHTML documents
 
 long_description    ${description}
 
-checksums           md5 b6a0ded3d1a085bc7b3cdb5ae07e89d2 \
-                    sha1 7bff5ea3a512dd8a1667f8d2ee16c3505d512da2 \
-                    rmd160 e4f2621e737598bf758c691bf605b44a65c42af3
+checksums           sha1    539c09038c812ae8b2215ab3824b69e50e20b33c \
+                    rmd160  568d0d6b46778644802b9e4f5ac4642a4ad1c419
 
+patchfiles          patch-lib-HTML-Encoding.pm.diff
+
 depends_lib-append  port:p5-encode \
                     port:p5-html-parser \
                     port:p5-libwww-perl

Added: trunk/dports/perl/p5-html-encoding/files/patch-lib-HTML-Encoding.pm.diff
===================================================================
--- trunk/dports/perl/p5-html-encoding/files/patch-lib-HTML-Encoding.pm.diff	                        (rev 0)
+++ trunk/dports/perl/p5-html-encoding/files/patch-lib-HTML-Encoding.pm.diff	2011-03-27 22:38:48 UTC (rev 77344)
@@ -0,0 +1,463 @@
+--- lib/HTML/Encoding.pm.orig	2011-03-27 14:53:03.000000000 -0700
++++ lib/HTML/Encoding.pm	2011-03-27 15:28:44.000000000 -0700
+@@ -523,20 +523,20 @@
+ 
+ =head1 WARNING
+ 
+-The interface and implementation are guranteed to change before this
++The interface and implementation are guaranteed to change before this
+ module reaches version 1.00! Please send feedback to the author of
+ this module.
+ 
+ =head1 DESCRIPTION
+ 
+ HTML::Encoding helps to determine the encoding of HTML and XML/XHTML
+-documents...
++documents.
+ 
+ =head1 DEFAULT ENCODINGS
+ 
+-Most routines need to know some suspected character encodings which
++Most routines need to know some suspected character encodings; these
+ can be provided through the C<encodings> option. This option always
+-defaults to the $HTML::Encoding::DEFAULT_ENCODINGS array reference
++defaults to the $HTML::Encoding::DEFAULT_ENCODINGS array reference,
+ which means the following encodings are considered by default:
+ 
+   * ISO-8859-1
+@@ -546,7 +546,7 @@
+   * UTF-32BE
+   * UTF-8
+ 
+-If you change the values or pass custom values to the routines note
++If you change the values or pass custom values to the routines, note
+ that L<Encode> must support them in order for this module to work
+ correctly.
+ 
+@@ -554,7 +554,7 @@
+ 
+ C<encoding_from_xml_document>, C<encoding_from_html_document>, and
+ C<encoding_from_http_message> return in list context the encoding
+-source and the encoding name, possible encoding sources are
++source and the encoding name. Possible encoding sources are:
+ 
+   * protocol         (Content-Type: text/html;charset=encoding)
+   * bom              (leading U+FEFF)
+@@ -565,21 +565,21 @@
+ 
+ =head1 ROUTINES
+ 
+-Routines exported by this module at user option. By default, nothing
+-is exported.
++Routines may be exported by this module at the user's option. By
++default, nothing is exported.
+ 
+ =over 2
+ 
+ =item encoding_from_content_type($content_type)
+ 
+ Takes a byte string and uses L<HTTP::Headers::Util> to extract the
+-charset parameter from the C<Content-Type> header value and returns
++charset parameter from the C<Content-Type> header value. Returns
+ its value or C<undef> (or an empty list in list context) if there
+ is no such value. Only the first component will be examined
+ (HTTP/1.1 only allows for one component), any backslash escapes in
+ strings will be unescaped, all leading and trailing quote marks
+ and white-space characters will be removed, all white-space will be
+-collapsed to a single space, empty charset values will be ignored
++collapsed to a single space, empty charset values will be ignored,
+ and no case folding is performed.
+ 
+ Examples:
+@@ -596,28 +596,28 @@
+   | "text/html;charset=\" UTF-8 \""         | 'UTF-8'   |
+   +-----------------------------------------+-----------+
+ 
+-If you pass a string with the UTF-8 flag turned on the string will
++If you pass a string with the UTF-8 flag turned on, the string will
+ be converted to bytes before it is passed to L<HTTP::Headers::Util>.
+-The return value will thus never have the UTF-8 flag turned on (this
+-might change in future versions).
++The return value will thus never have the UTF-8 flag turned on. (This
++might change in future versions.)
+ 
+ =item encoding_from_byte_order_mark($octets [, %options])
+ 
+-Takes a sequence of octets and attempts to read a byte order mark
+-at the beginning of the octet sequence. It will go through the list
+-of $options{encodings} or the list of default encodings if no
+-encodings are specified and match the beginning of the string against
+-any byte order mark octet sequence found.
+-
+-The result can be ambiguous, for example qq(\xFF\xFE\x00\x00) could
+-be both, a complete BOM in UTF-32LE or a UTF-16LE BOM followed by a
+-U+0000 character. It is also possible that C<$octets> starts with
++Takes a sequence of octets and attempts to read a byte order mark at the
++beginning of the octet sequence. It will go through the list of
++$options{encodings} (or the list of default encodings if no encodings
++are specified) and match the beginning of the string against any byte
++order mark octet sequence found.
++
++The result can be ambiguous. For example, qq(\xFF\xFE\x00\x00) could
++be either a complete BOM in UTF-32LE or a UTF-16LE BOM followed by a
++U+0000 character. It is also possible for C<$octets> to start with
+ something that looks like a byte order mark but actually is not.
+ 
+-encoding_from_byte_order_mark sorts the list of possible encodings
+-by the length of their BOM octet sequence and returns in scalar
+-context only the encoding with the longest match, and all encodings
+-ordered by length of their BOM octet sequence in list context.
++encoding_from_byte_order_mark sorts the list of possible encodings by
++the length of their BOM octet sequence. In scalar context, it returns
++only the encoding with the longest match. In list context, it returns
++all encodings ordered by length of their BOM octet sequence.
+ 
+ Examples:
+ 
+@@ -634,9 +634,9 @@
+   | "\x2B\x2F\x76\x38\x2D"  | UTF-7      | qw(UTF-7)             |
+   +-------------------------+------------+-----------------------+
+ 
+-Note however that for UTF-7 it is in theory possible that the U+FEFF
+-combines with other characters in which case such detection would fail,
+-for example consider:
++Note, however, that for UTF-7 it is theoretically possible for U+FEFF to
++combine with other characters, in which case such detection would fail.
++For example, consider:
+ 
+   +--------------------------------------+-----------+-----------+
+   | Input                                | Encodings | Result    |
+@@ -649,15 +649,17 @@
+ relevant for most applications as there should never be need to use
+ UTF-7 in the encoding list for existing documents.
+ 
+-If no BOM can be found it returns C<undef> in scalar context and an
+-empty list in list context. This routine should not be used with
+-strings with the UTF-8 flag turned on. 
++If no BOM can be found, it returns C<undef> in scalar context or an
++empty list in list context.
++
++This routine should not be used with strings with the UTF-8 flag turned
++on. 
+ 
+ =item encoding_from_xml_declaration($declaration)
+ 
+ Attempts to extract the value of the encoding pseudo-attribute in an XML
+ declaration or text declaration in the character string $declaration. If
+-there does not appear to be such a value it returns nothing. This would
++there does not appear to be such a value, it returns nothing. This would
+ typically be used with the return values of xml_declaration_from_octets.
+ Normalizes whitespaces like encoding_from_content_type.
+ 
+@@ -688,12 +690,14 @@
+ =item xml_declaration_from_octets($octets [, %options])
+ 
+ Attempts to find a ">" character in the byte string $octets using the
+-encodings in $encodings and upon success attempts to find a preceding
+-"<" character. Returns all the strings found this way in the order of
+-number of successful matches in list context and the best match in
+-scalar context. Should probably be combined with the only user of this
+-routine, encoding_from_xml_declaration... You can modify the list of
+-suspected encodings using $options{encodings};
++encodings in $encodings, and upon success attempts to find a preceding
++"<" character. In list context, returns all the strings found this way
++in the order of number of successful matches; or in scalar context,
++returns the best match. You can modify the list of suspected encodings
++using $options{encodings};
++
++(Should probably be combined with the only user of this routine,
++encoding_from_xml_declaration...)
+ 
+ =item encoding_from_first_chars($octets [, %options])
+ 
+@@ -707,9 +711,9 @@
+ document is a HTML document) to get at least a base encoding which can be
+ used to decode enough of the document to find <meta> elements using
+ encoding_from_meta_element. $options{whitespace} defaults to qw/CR LF SP TB/.
+-Returns nothing if unsuccessful. Returns the matching encodings in order
+-of the number of octets matched in list context and the best match in
+-scalar context.
++Returns nothing if unsuccessful. In list context, returns the matching
++encodings in order of the number of octets matched. In scalar context,
++returns the best match.
+ 
+ Examples:
+ 
+@@ -742,14 +746,14 @@
+   <meta http-equiv=Content-Type content='...'>
+   
+ are found, uses encoding_from_content_type to extract the charset
+-parameter. It returns all such encodings it could find in document
+-order in list context or the first encoding in scalar context (it
+-will currently look for others regardless of calling context) or
+-nothing if that fails for some reason.
++parameter. It returns (in list context) all such encodings it could find
++in document order, or (in scalar context) the first encoding, or nothing
++if that fails for some reason. (Currently it will look for any and all
++encodings even when called in scalar context.)
+ 
+-Note that there are many edge cases where this does not yield in
++Note that there are many edge cases where this does not yield
+ "proper" results depending on the capabilities of the HTML::Parser
+-version and the options you pass for it, for example,
++version and the options you pass for it. For example:
+ 
+   <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" [
+     <!ENTITY content_type "text/html;charset=utf-8">
+@@ -759,19 +763,19 @@
+   <p>...</p>
+ 
+ This would likely not detect the C<utf-8> value if HTML::Parser
+-does not resolve the entity. This should however only be a concern
++does not resolve the entity. This should, however, only be a concern
+ for documents specifically crafted to break the encoding detection.
+ 
+ =item encoding_from_xml_document($octets, [, %options])
+ 
+-Uses encoding_from_byte_order_mark to detect the encoding using a
+-byte order mark in the byte string and returns the return value of
+-that routine if it succeeds. Uses xml_declaration_from_octets and
+-encoding_from_xml_declaration and returns the encoding for which
+-the latter routine found most matches in scalar context, and all
+-encodings ordered by number of occurences in list context. It
+-does not return a value of neither byte order mark not inbound
+-declarations declare a character encoding.
++Uses encoding_from_byte_order_mark to detect the encoding using a byte
++order mark in the byte string. Returns the return value of that routine
++if it succeeds. Uses xml_declaration_from_octets and
++encoding_from_xml_declaration, and (in scalar context) returns the
++encoding for which the latter routine found most matches, or (in list
++context) all encodings ordered by number of occurrences. It does not
++return a value if neither byte order mark nor inbound declarations
++declare a character encoding.
+ 
+ Examples:
+ 
+@@ -787,12 +791,12 @@
+   +----------------------------+----------+-----------+----------+
+ 
+ Lacking a return value from this routine and higher-level protocol
+-information (such as protocol encoding defaults) processors would
++information (such as protocol encoding defaults), processors would
+ be required to assume that the document is UTF-8 encoded.
+ 
+-Note however that the return value depends on the set of suspected
++Note, however, that the return value depends on the set of suspected
+ encodings you pass to it. For example, by default, EBCDIC encodings
+-would not be considered and thus for
++would not be considered, and thus for
+ 
+   <?xml version='1.0' encoding='cp37'?>
+   
+@@ -803,7 +807,7 @@
+ 
+ Uses encoding_from_xml_document and encoding_from_meta_element to
+ determine the encoding of HTML documents. If $options{xhtml} is
+-set to a false value uses encoding_from_byte_order_mark and 
++set to a false value, uses encoding_from_byte_order_mark and 
+ encoding_from_meta_element to determine the encoding. The xhtml
+ option is on by default. The $options{encodings} can be used to
+ modify the suspected encodings and $options{parser_options} can
+@@ -811,13 +815,13 @@
+ encoding_from_meta_element (see the relevant documentation).
+ 
+ Returns nothing if no declaration could be found, the winning
+-declaration in scalar context and a list of encoding source
+-and encoding name in list context, see ENCODING SOURCES.
++declaration in scalar context, or a list of encoding source
++and encoding name in list context. See L</"ENCODING SOURCES">.
+ 
+ ...
+ 
+ Other problems arise from differences between HTML and XHTML syntax
+-and encoding detection rules, for example, the input could be
++and encoding detection rules. For example, the input could be:
+ 
+   Content-Type: text/html
+ 
+@@ -829,14 +833,14 @@
+   <title></title>
+   <p>...</p>
+ 
+-This is a perfectly legal HTML 4.01 document and implementations
+-might be expected to consider the document ISO-8859-2 encoded as
+-XML rules for encoding detection do not apply to HTML documents.
+-This module attempts to avoid making decisions which rules apply
+-for a specific document and would thus by default return 'utf-8'
+-for this input.
++This is a perfectly legal HTML 4.01 document and implementations might
++be expected to consider the document to have ISO-8859-2 encoding, as XML
++rules for encoding detection do not apply to HTML documents.  This
++module attempts to avoid making decisions on which rules apply for a
++specific document, and would thus by default return 'utf-8' for this
++input.
+ 
+-On the other hand, if the input omits the encoding declaration,
++On the other hand, if the input omits the encoding declaration, thus:
+ 
+   Content-Type: text/html
+ 
+@@ -848,8 +852,10 @@
+   <title></title>
+   <p>...</p>
+ 
+-It would return 'iso-8859-2'. Similar problems would arise from
+-other differences between HTML and XHTML, for example consider
++it would return 'iso-8859-2'.
++
++Similar problems would arise from other differences between HTML and
++XHTML. For example, consider:
+ 
+   Content-Type: text/html
+ 
+@@ -864,69 +870,70 @@
+   
+ If this is processed using HTML rules, the first > will end the
+ processing instruction and the XHTML document type declaration
+-would be the relevant declaration for the document, if it is
++would be the relevant declaration for the document. If it is
+ processed using XHTML rules, the ?> will end the processing
+ instruction and the HTML document type declaration would be the
+ relevant declaration.
+ 
+-IOW, an application would need to assume a certain character
+-encoding (family) to process enough of the document to determine
+-whether it is XHTML or HTML and the result of this detection would
+-depend on which processing rules are assumed in order to process it.
+-It is thus in essence not possible to write a "perfect" detection
+-algorithm, which is why this routine attempts to avoid making any
+-decisions on this matter.
++In other words, an application would need to assume a certain character
++encoding (family) to process enough of the document to determine whether
++it is XHTML or HTML, and the result of this detection would depend on
++which processing rules are assumed in order to process it.  It is thus
++in essence not possible to write a "perfect" detection algorithm, which
++is why this routine attempts to avoid making any decisions on this
++matter.
+ 
+ =item encoding_from_http_message($message [, %options])
+ 
+-Determines the encoding of HTML / XML / XHTML documents enclosed
+-in HTTP message. $message is an object compatible to L<HTTP::Message>,
+-e.g. a L<HTTP::Response> object. %options is a hash with the following
+-possible entries:
++Determines the encoding of HTML/XML/XHTML documents enclosed in an HTTP
++message. $message is an object compatible with L<HTTP::Message>, e.g. a
++L<HTTP::Response> object. %options is a hash with the following possible
++entries:
+ 
+ =over 2
+ 
+ =item encodings
+ 
+-array references of suspected character encodings, defaults to
++Array references of suspected character encodings; defaults to
+ C<$HTML::Encoding::DEFAULT_ENCODINGS>.
+ 
+ =item is_html
+ 
+ Regular expression matched against the content_type of the message
+-to determine whether to use HTML rules for the entity body, defaults
++to determine whether to use HTML rules for the entity body; defaults
+ to C<qr{^text/html$}i>.
+ 
+ =item is_xml
+ 
+ Regular expression matched against the content_type of the message
+-to determine whether to use XML rules for the entity body, defaults
++to determine whether to use XML rules for the entity body; defaults
+ to C<qr{^.+/(?:.+\+)?xml$}i>.
+ 
+ =item is_text_xml
+ 
+ Regular expression matched against the content_type of the message
+-to determine whether to use text/html rules for the message, defaults
++to determine whether to use text/html rules for the message; defaults
+ to C<qr{^text/(?:.+\+)?xml$}i>. This will only be checked if is_xml
+-matches aswell.
++matches as well.
+ 
+ =item html_default
+ 
+-Default encoding for documents determined (by is_html) as HTML,
++Default encoding for documents determined (by is_html) as HTML;
+ defaults to C<ISO-8859-1>.
+ 
+ =item xml_default
+ 
+-Default encoding for documents determined (by is_xml) as XML,
++Default encoding for documents determined (by is_xml) as XML;
+ defaults to C<UTF-8>.
+ 
+ =item text_xml_default
+ 
+-Default encoding for documents determined (by is_text_xml) as text/xml,
+-defaults to C<undef> in which case the default is ignored. This should
+-be set to C<US-ASCII> if desired as this module is by default
+-inconsistent with RFC 3023 which requires that for text/xml documents
+-without a charset parameter in the HTTP header C<US-ASCII> is assumed.
++Default encoding for documents determined (by is_text_xml) as text/xml;
++defaults to C<undef>, in which case the default is ignored. This should
++be set to C<US-ASCII> if desired, as this module is by default
++inconsistent with RFC 3023; that RFC requires that for text/xml
++documents without a charset parameter in the HTTP header, C<US-ASCII> is
++assumed.
+ 
+ This requirement is inconsistent with RFC 2616 (HTTP/1.1) which requires
+ to assume C<ISO-8859-1>, has been widely ignored and is thus disabled by
+@@ -935,18 +942,18 @@
+ =item xhtml
+ 
+ Whether the routine should look for an encoding declaration in the
+-XML declaration of the document (if any), defaults to C<1>.
++XML declaration of the document (if any); defaults to C<1>.
+ 
+ =item default
+ 
+ Whether the relevant default value should be returned when no other
+-information can be determined, defaults to C<1>.
++information can be determined; defaults to C<1>.
+ 
+ =back
+ 
+-This is furhter possibly inconsistent with XML MIME types that differ
+-in other ways from application/xml, for example if the MIME Type does
+-not allow for a charset parameter in which case applications might be
++This is possibly further inconsistent with XML MIME types that differ
++in other ways from application/xml (for example, if the MIME type does
++not allow for a charset parameter), in which case applications might be
+ expected to ignore the charset parameter if erroneously provided.
+ 
+ =back
+@@ -954,17 +961,17 @@
+ =head1 EBCDIC SUPPORT
+ 
+ By default, this module does not support EBCDIC encodings. To enable
+-support for EBCDIC encodings you can either change the
++support for EBCDIC encodings, you can either change the
+ $HTML::Encodings::DEFAULT_ENCODINGS array reference or pass the
+-encodings to the routines you use using the encodings option, for
+-example
++encodings to the routines you use using the encodings option; for
++example:
+ 
+   my @try = qw/UTF-8 UTF-16LE cp500 posix-bc .../;
+   my $enc = encoding_from_xml_document($doc, encodings => \@try);
+ 
+ Note that there are some subtle differences between various EBCDIC
+-encodings, for example C<!> is mapped to 0x5A in C<posix-bc> and
+-to 0x4F in C<cp500>; these differences might affect processing in
++encodings. For example, C<!> is mapped to 0x5A in C<posix-bc> and
++to 0x4F in C<cp500>. These differences might affect processing in
+ yet undetermined ways.
+ 
+ =head1 TODO
+@@ -994,4 +1001,8 @@
+   Copyright (c) 2004-2008 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
+   This module is licensed under the same terms as Perl itself.
+ 
++  This document has been edited for grammar, spelling, and clarity by
++  Larry Gilbert <l2g@macports.org> for the MacPorts Project. (Some
++  especially opaque passages have been left alone.)
++
+ =cut
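
Note on the documented behaviour: the scalar-versus-list return convention that the patched perldoc describes for encoding_from_byte_order_mark can be sketched in plain Perl. This is a minimal illustration only, not the module's implementation. The %BOM table and the bom_candidates helper are hypothetical names introduced here, and the longest-first ordering is an assumption drawn from the "longest match" wording in the perldoc.

```perl
use strict;
use warnings;

# Illustrative BOM table (assumption: not copied from HTML::Encoding).
my %BOM = (
    'UTF-8'    => "\xEF\xBB\xBF",
    'UTF-16BE' => "\xFE\xFF",
    'UTF-16LE' => "\xFF\xFE",
    'UTF-32BE' => "\x00\x00\xFE\xFF",
    'UTF-32LE' => "\xFF\xFE\x00\x00",
);

# Hypothetical helper: find every suspected encoding whose BOM is a
# prefix of $octets, ordered longest BOM first (ties broken by name).
# List context returns all candidates; scalar context returns only the
# longest match, or undef when nothing matches.
sub bom_candidates {
    my ($octets, @encodings) = @_;
    @encodings = sort keys %BOM unless @encodings;

    my @hits =
        sort { length($BOM{$b}) <=> length($BOM{$a}) || $a cmp $b }
        grep { exists $BOM{$_} && index($octets, $BOM{$_}) == 0 }
        @encodings;

    return wantarray ? @hits : $hits[0];
}

my @all  = bom_candidates("\xFF\xFE\x00\x00");   # list context
my $best = bom_candidates("\xFF\xFE\x00\x00");   # scalar context
print "best: $best; all: @all\n";   # best: UTF-32LE; all: UTF-32LE UTF-16LE
```

The ambiguous input \xFF\xFE\x00\x00 from the perldoc matches both the UTF-32LE and UTF-16LE byte order marks, which is why a caller who cares about the ambiguity would call the routine in list context.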

    
