Perl: Handle malformed UTF-8 strings with Encode::encode
After finding the error message “Malformed UTF-8 character (fatal)” in my log files, I wanted to handle it properly, without letting the process die and without throwing away the whole string.
After some research on Google I came up with the following solution:
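use Encode ();

sub encode_utf_8 {
    my ($string) = @_;    # list assignment; "my $string = @_" would store the argument count

    my $utf8_encoded = '';
    eval {
        # encode a copy, because encode() with a CHECK value may modify its source string
        my $copy = $string;
        $utf8_encoded = Encode::encode('UTF-8', $copy, Encode::FB_CROAK);
    };
    if ($@) {
        # sanitize malformed UTF-8: encode character by character and skip those that fail
        $utf8_encoded = '';
        foreach my $char (split //, $string) {
            my $utf_8_char = eval { Encode::encode('UTF-8', $char, Encode::FB_CROAK) };
            next unless defined $utf_8_char;    # a plain "or next" would also drop the character "0"
            $utf8_encoded .= $utf_8_char;
        }
    }
    return $utf8_encoded;
}

See also: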
http://perldoc.perl.org/Encode.html#Handling-Malformed-Data
http://www.perlmonks.org/?node_id=839519
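For comparison, here is a minimal sketch of two other CHECK behaviours described in the Encode documentation linked above. It is not part of my solution; the lone surrogate used as test input is an assumption chosen because strict UTF-8 should reject it. Without a CHECK argument, Encode substitutes a replacement character instead of dying, and adding Encode::LEAVE_SRC to FB_CROAK asks Encode to leave the source string untouched.

```perl
use strict;
use warnings;
use Encode ();

# Assumed test data: a lone surrogate cannot be represented in strict UTF-8.
my $input = 'a' . chr(0xD800) . 'b';

# 1) No CHECK argument: the offending character is replaced, nothing dies.
my $substituted = Encode::encode('UTF-8', $input);

# 2) FB_CROAK dies on the first bad character; LEAVE_SRC keeps $input unmodified.
my $encoded = eval {
    Encode::encode('UTF-8', $input, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
my $croaked = defined $encoded ? 0 : 1;

# Show the substituted result as hex bytes and whether strict encoding failed.
printf "substituted bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $substituted;
printf "FB_CROAK died: %s\n", $croaked ? 'yes' : 'no';
```

The substitution variant keeps the process alive but leaves a replacement character in the output, whereas the character-by-character approach above drops the bad character entirely.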
Set a custom HTTP User-Agent in Perl with WWW::Mechanize
This is how you can dynamically set a custom HTTP User-Agent for your Perl requests, for example to imitate a device or browser for testing purposes or to get a device-specific version of a website.
WWW::Mechanize accepts a custom user agent in its constructor; after that, its own API only offers a choice of six predefined basic user agents via $mech->agent_alias().
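As a side note, here is a minimal sketch of the built-in routes, run without any network request. The user-agent strings are made up for illustration. Since WWW::Mechanize is a subclass of LWP::UserAgent, the inherited agent() accessor should be usable as a setter as well, which is a different route from the add_header() approach shown in this post.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical user-agent strings, for illustration only.
my $mech = WWW::Mechanize->new( agent => 'MyTestClient/1.0' );
my $from_constructor = $mech->agent;

# One of the predefined aliases shipped with WWW::Mechanize.
$mech->agent_alias('Linux Mozilla');
my $from_alias = $mech->agent;

# agent() is inherited from LWP::UserAgent and also works as a setter.
$mech->agent('AnotherTestClient/2.0');
my $from_setter = $mech->agent;

print "$_\n" for $from_constructor, $from_alias, $from_setter;
```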
The following code demonstrates how to dynamically change the user-agent on a Mechanize object.
use WWW::Mechanize;

my $initial_user_agent = 'Mozilla/5.0 (Linux; U; Android 2.2; de-de; HTC Desire HD 1.18.161.2 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1';
my @user_agents = (
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
    'Mozilla/5.0 (iPad; U; CPU iPhone OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7D11',
    'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
);

# Set an initial custom header with the constructor
my $mech = WWW::Mechanize->new( agent => $initial_user_agent );

# get a page and print current URI (WWW::Mechanize follows redirections)
$mech->get( 'http://www.facebook.com' );
printf( "User-Agent %s\n redirects to: %s\n\n", $initial_user_agent, $mech->uri() );

foreach my $http_user_agent (@user_agents) {
    # dynamically set custom HTTP User-agents
    $mech->add_header( 'User-agent' => $http_user_agent );

    $mech->get( 'http://www.facebook.com' );
    printf( "User-Agent %s\n redirects to: %s\n\n", $http_user_agent, $mech->uri() );
}

# $ perl ./mechanize-user-agent.pl
# User-Agent Mozilla/5.0 (Linux; U; Android 2.2; de-de; HTC Desire HD 1.18.161.2 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
# redirects to: http://m.facebook.com/?w2m&refsrc=http%3A%2F%2Fwww.facebook.com%2F&_rdr
#
# User-Agent Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13
# redirects to: http://www.facebook.com
#
# User-Agent Mozilla/5.0 (iPad; U; CPU iPhone OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7D11
# redirects to: http://m.facebook.com/?w2m&refsrc=http%3A%2F%2Fwww.facebook.com%2F&_rdr
#
# User-Agent Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5
# redirects to: http://m.facebook.com/?w2m&refsrc=http%3A%2F%2Fwww.facebook.com%2F&_rdr

Strip all HTML tags with Perl like PHP’s strip_tags() does
The Perl regular expression (regexp/regex) equivalent to PHP’s strip_tags() is:
while ($string =~ s/<\S[^<>]*(?:>|$)//gs) {};
Please note that it also treats an opening “<” followed by a non-whitespace character as the start of a tag and strips everything from there to the end of the string, even if it is never closed by a “>”. This is the same behavior as PHP’s strip_tags().
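To make this behavior concrete, here is a small sketch that wraps the regex in a helper and runs it on a few inputs. The helper name strip_tags_regex and the sample strings are made up for illustration. The last sample shows why the substitution runs in a while loop: removing an inner tag can leave a new tag behind, which is only caught on the next pass.

```perl
use strict;
use warnings;

# Hypothetical wrapper name; the loop inside is the regex from above, unchanged.
sub strip_tags_regex {
    my ($string) = @_;
    while ($string =~ s/<\S[^<>]*(?:>|$)//gs) {}
    return $string;
}

my @samples = (
    'Hello <b>world</b>!',    # ordinary tags
    'a < b',                  # "<" followed by whitespace is left alone
    'text <unclosed rest',    # "<" that is never closed
    '<scr<b>ipt>x',           # a tag that only appears after the first pass
);

printf "'%s' => '%s'\n", $_, strip_tags_regex($_) for @samples;
```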
Update: This regexp only passes my tests against PHP 4.x; PHP 5.x is considerably smarter when it comes to edge cases. Building a Perl equivalent will be a challenge, as all the different approaches on CPAN fail the test as well.
Update 2010-07-07: I’m currently porting strip_tags() from the C source code of PHP 5.3.2 to a CPAN Module. Stay tuned.
Update 2011-05-25: Today I finally uploaded my Perl port to CPAN: http://search.cpan.org/~hinnerk/HTML-StripTags-1.00/
New home of this module is http://www.hinnerk-altenburg.de/perl-strip_tags/
Moved from epublica GmbH to XING AG
As of February 1st, I moved, together with epublica’s entire XING.com core development team, to XING AG itself and now develop the platform in-house as a XING employee.
PerlIDS article published in the German Perl magazine $foo
My four-page article about my Perl CPAN module CGI::IDS, a website intrusion detection system, has been published in the current issue 1/2009 of the German Perl magazine $foo.
In it I give an overview of how PerlIDS works and how it can be used for early detection of cross-site scripting, SQL injection, and similar attacks on web applications.
OpenSource Perl Website Intrusion Detection System PerlIDS (CGI::IDS) released
Today, we at epublica officially released my work of the last few months: a Perl port of PHPIDS, a tool for detecting cross-site scripting (XSS), cross-site request forgery (CSRF), SQL injection (SQLI), local file inclusion (LFI), and similar attacks in website requests.
The tool is released on CPAN as the Perl module CGI::IDS (“PerlIDS”) under the open-source GNU Lesser General Public License (LGPL).
Use of Google Analytics now lawful with a privacy notice
Update: For the current situation, please read the T3N article “Webanalyse datenschutzkonform betreiben: Google Analytics anonymisieren” on this topic! My article below (from October 2008) is no longer up to date!
According to an article on heise.de, a statement by the Bundesverband Digitale Wirtschaft (BVDW) has finally cleared up the uncertainty, raised some time ago by data protection advocates, about using the popular and free website analysis and statistics tool Google Analytics on German websites. Among the points considered critical was the transfer of usage data, and above all of the website visitors’ IP addresses, to Google Inc. in the USA, that is, outside the scope of the stricter German and European data protection rules.
Google has now responded to this problem and amended section “8. DATENSCHUTZ” of its Google Analytics terms of service:
Relaunch of Derix Glasstudios website finally online
The relaunch of the corporate website of Derix Glasstudios, Taunusstein/Germany and Derix Art Glass Consultants, Portland/USA is now finally online!
I designed and developed it back in 2005, and I am happy to see it online now! The website is built with PHP/MySQL and a custom-made admin interface.
Derix Glasstudios was founded in 1866 and today makes art glass for prominent projects all over the world.
[Update] The website is now available in Russian and Spanish, too.
[Update] I am now also doing search engine optimization and Google AdWords campaigns for them.
[Update] The redesign is now online with a new color theme, a lightbox project viewer, and an AJAX project preview built with Prototype JS and Scriptaculous.
Two TYPO3 OpenSource extensions published
I am now the author of two TYPO3 extensions published in TER (TYPO3 Extension Repository). These extensions are frontend plugins that add functionality to the mm_forum extension.
exinit_latesttopics displays the latest forum topics in a box; exinit_pollwidget displays an AJAX box for forum polls so that voting is possible on any page.
My New Jobs since May 2008
Since May, I have been employed by epublica GmbH, Hamburg, doing Perl development mainly for the XING Web platform. Have a look at their brand new office in the heart of the city, upstairs from XING.
I am also working as a freelancer for the TYPO3 agency EXINIT GmbH & Co. KG, Hamburg, doing TYPO3 extension development in PHP.
