Processing Text with Perl Modules - Page 11
September 24, 2001
In the previous article, we learned how to use Perl's built-in
routines to perform many common text manipulation functions. In
the final article of this series on text processing, we will take
a tour through a cornucopia of useful text processing modules
that will kick the tar out of some of those arduous text
processing tasks.
The Power of CPAN
The Comprehensive Perl Archive
Network is a group of servers around the world that provide
access to the Perl source code and hundreds of Perl modules that
have been contributed by volunteers. CPAN is one of the things I
imagine other language authors wish they had for their respective
languages (like Java) but don't. Fortunately for us, pre-built
modules that bundle up the code and logic for performing many
common tasks are freely available for the taking. See the last
page of this article for a list of resources.
Installing Modules
Part of what makes CPAN powerful is the fact that Perl supports
it directly with the CPAN.pm module, which has been distributed
with the Perl source code for several years now. The module is
capable of searching for, downloading, and installing modules
directly from CPAN. It will even handle module dependencies where
the module you're trying to install requires other modules from
CPAN before it can be installed.
On most operating systems, you can install a CPAN module by
typing:
perl -MCPAN -e 'install HTML::Parser'
where HTML::Parser is the name of the module you
wish to install. This will automatically find, download, compile,
and install the module onto your system.
If you are using
ActiveState Perl and the module you are installing is
available in ActiveState's repository, you can type:
ppm install GD
PPM is a command-line utility that is only available if you are
using ActiveState Perl. Note that not all Perl modules from CPAN
are available to PPM. If you're running ActiveState Perl on a
Win32 platform, you will also need to have Visual C++ and nmake
installed on your system to build and install modules from CPAN
that are not available to PPM.
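Before installing anything, it can be handy to check whether a module is already on your system. The following is a minimal sketch of one way to do that, not part of CPAN.pm or PPM; the helper name module_version() is made up for this illustration, and HTML::Parser is only used as the default because it is the module from the example above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the installed version of a module,
# or undef if the module cannot be loaded at all.
sub module_version {
    my ($module) = @_;

    # require() with a string wants a path, so turn
    # Some::Module into Some/Module.pm first.
    (my $file = "$module.pm") =~ s{::}{/}g;

    # eval traps the exception require() throws for a missing module.
    return unless eval { require $file; 1 };

    # Some modules do not declare a $VERSION at all.
    my $version = $module->VERSION;
    return defined $version ? $version : 'unknown';
}

# Check the modules named on the command line, or HTML::Parser by default.
my @modules = @ARGV ? @ARGV : ('HTML::Parser');

for my $module (@modules) {
    my $version = module_version($module);
    if (defined $version) {
        print "$module is installed (version $version)\n";
    }
    else {
        print "$module is not installed\n";
    }
}
```

Saved under a name of your choosing, the script takes module names as arguments; anything it reports as missing is a candidate for one of the install commands above.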
Making Text HTML Safe
I'm sure most of you have had at least one occasion where you
needed to effectively cut and paste a text file into an HTML
file. If that text file contained any reserved characters like
& or <, you probably had to
convert them to HTML-safe entities such as &lt;
for < by hand. Or maybe you haven't fixed the
text and you now have an invalid HTML document out there on your
Web site.
Well, if you find yourself doing this hand tuning on a regular
basis or if you're routinely posting text into HTML files without
checking to see if it's HTML safe, stop, because CPAN has a
module called HTML::Entities which does all of the
work for you.
The module contains a function appropriately named
encode_entities() that automatically encodes all
HTML reserved characters. So for example, if you have a string of
text that's contained in a variable named $text that needs
to be HTML encoded, you would first add the statement:
use HTML::Entities;
to the top of your script and then type:
encode_entities($text);
somewhere in the main body of your source code. Called this way,
the function modifies $text in place. So if
$text contained the string "Fred & Barney's
Bowling Academy", it would be converted into "Fred &amp;
Barney's Bowling Academy". (Recent releases of the module also
encode the apostrophe by default, as &#39;.)
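As a rough sketch of two other ways to call the module: encode_entities() also returns the encoded string when you use its return value, leaving the original untouched, and it accepts an optional second argument listing the characters to encode. The sample string below is made up for illustration, and decode_entities() is the module's companion function for going the other way.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;

# Made-up sample text containing three reserved characters.
my $text = q{Fred & Barney's <Bowling> Academy};

# Using the return value leaves $text itself unchanged. The second
# argument restricts encoding to the characters listed in it.
my $encoded = encode_entities($text, '<>&');
print "$encoded\n";    # Fred &amp; Barney's &lt;Bowling&gt; Academy

# decode_entities() turns the entities back into plain characters.
my $decoded = decode_entities($encoded);
print $decoded eq $text ? "round trip ok\n" : "round trip differs\n";
```

Restricting the set of characters is useful when you want to leave everything except the markup-significant characters exactly as typed.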
We could also build a simple script that converts an entire file
such that we can execute the following on the command-line:
html_encode.pl < sample.txt > newtext.txt
Or in plain English, we direct a text file called
sample.txt to the script as input and write the
resulting encoded text to newtext.txt. The source of
the script would look like the following:
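#!/usr/bin/perl -w
use strict;
use HTML::Entities;

# Read each line from standard input (or from files named on the
# command line), encode it in place, and print the result.
while (<>) {
    encode_entities($_);
    print;
}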