Perl & LWP
by Sean M. Burke
ISBN 0-596-00178-9
First Edition, published June 2002.
Table of Contents
Copyright Page
Foreword
Preface
Chapter 1: Introduction to Web Automation
Chapter 2: Web Basics
Chapter 3: The LWP Class Model
Chapter 4: URLs
Chapter 5: Forms
Chapter 6: Simple HTML Processing with Regular Expressions
Chapter 7: HTML Processing with Tokens
Chapter 8: Tokenizing Walkthrough
Chapter 9: HTML Processing with Trees
Chapter 10: Modifying HTML with Trees
Chapter 11: Cookies, Authentication, and Advanced Requests
Chapter 12: Spiders
Appendix A: LWP Modules
Appendix B: HTTP Status Codes
Appendix C: Common MIME Types
Appendix D: Language Tags
Appendix E: Common Content Encodings
Appendix F: ASCII Table
Appendix G: User's View of Object-Oriented Modules
Index
Colophon
Copyright © 2002 O'Reilly & Associates. All rights reserved.
Copyright © 2002 O'Reilly & Associates, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly & Associates books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://safari.oreilly.com). For more information, contact our corporate/institutional sales
department: 800-998-9938 or corporate@oreilly.com.
Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly &
Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark
claim, the designations have been printed in caps or initial caps. The association between the image of blesbok and
the topic of Perl and LWP is a trademark of O'Reilly & Associates, Inc.
While every precaution has been taken in the preparation of this book, the publisher and the author assume no
responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
Foreword
I started playing around with the Web a long time ago, or at least it feels that way. The first versions of Mosaic had just
shown up, Gopher and WAIS were still hot technology, and I discovered an HTTP server program called Plexus. What
set it apart was that it was implemented in Perl, which made it easy to extend. CGI was not invented yet, so all we had were
servlets (although we didn't call them that then). Over time, I moved from hacking on the server side to the client side
but stayed with Perl as the programming language of choice. As a result, I got involved in LWP, the Perl web client
library.
A lot has happened to the Web since then. These days there is almost no end to the information at our fingertips: news,
stock quotes, weather, government info, shopping, discussion groups, product info, reviews, games, and other
entertainment. And the good news is that LWP can help automate them all.
This book tells you how you can write your own useful web client applications with LWP and its related HTML
modules. Sean's done a great job of showing how this powerful library can be used to make tools that automate various
tasks on the Web. If you are like me, you probably have many examples of web forms that you find yourself filling out
over and over again. Why not write a simple LWP-based tool that does it all for you? Or a tool that does research for you
by collecting data from many web pages without you having to spend a single mouse click? After reading this book, you
should be well prepared for tasks such as these.
This book's focus is to teach you how to write scripts against services that are set up to serve traditional web browsers.
This means services exposed through HTML. Even in a world where people eventually have discovered that the Web
can provide real program-to-program interfaces (the current "web services" craze), it is likely that HTML scraping will
continue to be a valuable way to extract information from the Web. I strongly believe that Perl and LWP are among the
best tools to get that job done. Reading Perl and LWP is a good way to get you started.
It has been fun writing and maintaining the LWP codebase, and Sean's written a fine book about using it. Enjoy!
—Gisle Aas
Primary author and maintainer of LWP
Preface
Perl soared to popularity as a language for creating and managing web content. Perl is equally adept at consuming
information on the Web. Most web sites are created for people, but quite often you want to automate tasks that involve
accessing a web site in a repetitive way. Such tasks could be as simple as saying "here's a list of URLs; I want to be
emailed if any of them stop working," or they could involve more complex processing of any number of pages. This
book is about using LWP (the Library for World Wide Web in Perl) and Perl to fetch and process web pages.
For example, if you want to compare the prices of all O'Reilly books on Amazon.com and bn.com, you could look at
each page yourself and keep track of the prices. Or you could write an LWP program to fetch the product pages, extract
the prices, and generate a report. O'Reilly has a lot of books in print, and after reading this one, you'll be able to write
and run the program much more quickly than you could visit every catalog page.
Consider also a situation in which a particular page has links to several dozen files (images, music, and so on) that you
want to download. You could download each individually, by monotonously selecting each link in your browser and
choosing Save As, or you could dash off a short LWP program that scans for URLs in that page and downloads each,
unattended.
Besides extracting data from web pages, you can also automate submitting data through web forms. Whether this is a
matter of uploading 50 image files through your company's intranet interface, or searching the local library's online card
catalog every week for any new books with "Navajo" in the title, it's worth the time and peace of mind to automate
repetitive processes by writing LWP programs to submit data into forms and scan the resulting data.
0.1. Audience for This Book
This book is aimed at someone who already knows Perl and HTML, but I don't assume you're an expert at either. I give
quick refreshers on some of the quirkier aspects of HTML (e.g., forms), but in general, I assume you know what each of
the HTML tags means. If you know basic regular expressions and are familiar with references and maybe even objects,
you have all the Perl skills you need to use this book.
If you're new to Perl, consider reading Learning Perl (O'Reilly) and maybe also The Perl Cookbook (O'Reilly). If your
HTML is shaky, try the HTML Pocket Reference or HTML: The Definitive Guide (O'Reilly). If you don't feel
comfortable using objects in Perl, reading Appendix G, "User's View of Object-Oriented Modules," in this book should
be enough to bring you up to speed.
G.8. The Gory Details
For the sake of clarity, I had to oversimplify some of the facts about objects. Here are a few of the gorier
details:
● Every example I gave of a constructor was a class method. But object methods can be constructors, too, if the
class was written to work that way: $new = $old->copy, $node_y = $node_x->new_subnode, or the
like.
● I've given the impression that there are two kinds of methods: object methods and class methods. In fact, the same
method can be both, because what matters is not the kind of method it is but the kind of calls it's written to accept:
calls that pass an object, or calls that pass a class name.
● The term "object value" isn't something you'll find used much anywhere else. It's just my shorthand for what
would properly be called an "object reference" or "reference to a blessed item." In fact, people usually say
"object" when they properly mean a reference to that object.
● I mentioned creating objects with constructors, but I didn't mention destroying them with destructors. A destructor
is a kind of method that you call to tidy up the object once you're done with it and want it to go away neatly
(close connections, delete temporary files, free up memory, etc.). But because of the way Perl handles memory,
most modules won't require the user to know about destructors.
● I said that class method syntax has to have the class name, as in $session = Net::FTP->new($host).
Actually, you can instead use any expression that returns a class name: $ftp_class = 'Net::FTP';
$session = $ftp_class->new($host). Moreover, instead of the method name for object- or class-method
calls, you can use a scalar holding the method name: $foo->$method($host). But, in practice, these
syntaxes are rarely useful.
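Several of the points in the list above can be sketched in a few lines of core Perl. The Counter class below, along with its new, copy, increment, and count methods, is hypothetical and invented purely for illustration; it is not part of LWP or any other module discussed in this book. Treat it as a minimal sketch of one way such a class could be written, not as the way any particular module does it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical class, made up for this illustration only.
package Counter;

# new() is written to accept either a class name (Counter->new)
# or an existing object ($old->new), so the same method serves as
# both a class method and an object method.
sub new {
    my ($invocant, %args) = @_;
    my $class = ref($invocant) || $invocant;  # object: use its class; string: use as-is
    my $self  = { count => $args{count} || 0 };
    return bless $self, $class;
}

# copy() is a constructor that is meant to be called on an object.
sub copy {
    my ($self) = @_;
    return bless { %$self }, ref $self;       # shallow copy, same class
}

sub increment { my ($self) = @_; return ++$self->{count}; }
sub count     { my ($self) = @_; return $self->{count}; }

package main;

my $first = Counter->new(count => 5);  # class-method call
my $fresh = $first->new;               # same method, called on an object
my $clone = $first->copy;              # object method acting as a constructor
$clone->increment;

my $class_name = 'Counter';            # any expression that returns a class name
my $dynamic    = $class_name->new(count => 10);

my $method = 'increment';              # scalar holding the method name
$dynamic->$method();

printf "%d %d %d %d\n",
    $first->count, $fresh->count, $clone->count, $dynamic->count;
# prints: 5 0 6 11
```

The ref($invocant) || $invocant idiom is what lets new handle both kinds of calls: ref returns the class name when given an object and an empty string when given a plain class name.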
And finally, to learn about objects from the perspective of writing your own classes, see the perltoot documentation, or
Damian Conway's exhaustive and clear book Object Oriented Perl (Manning Publications, 1999).
Colophon
Our look is the result of reader comments, our own experimentation, and feedback from distribution channels.
Distinctive covers complement our distinctive approach to technical topics, breathing personality and life into potentially
dry subjects.
The animals on the cover of Perl and LWP are blesbok. Blesbok are African antelopes related to the hartebeest. These
grazing animals, native to Africa's grasslands, are extinct in the wild but preserved in farms and parks.
Blesbok have slender, horselike bodies that are shorter than four feet at the shoulder. They are deep red, with white
patches on their faces and rumps. A white blaze extends from between a blesbok's horns to the end of its nose, broken
only by a brown band above the eyes. The blesbok's horns sweep back, up, and inward. Both male and female blesbok
have horns, though the males' are thicker.
Blesbok are diurnal, most active in the morning and evening. They sleep in the shade during the hottest part of the day,
as they are very susceptible to the heat. They travel from place to place in long single-file lines, leaving distinct paths.
Their life span is about 13 years.
Linley Dolby was the production editor and copyeditor for Perl and LWP, and Sarah Sherman was the proofreader.
Rachel Wheeler and Claire Cloutier provided quality control. Johnna VanHoose Dinse wrote the index. Emily Quill
provided production support.
Emma Colby designed the cover of this book, based on a series design by Edie Freedman. The cover image is a 19th-
century engraving from the Dover Pictorial Archive. Emma Colby produced the cover layout with QuarkXPress 4.1
using Adobe's ITC Garamond font.
Melanie Wang designed the interior layout, based on a series design by David Futato. This book was converted to
FrameMaker 5.5.6 with a format conversion tool created by Erik Ray, Jason McIntosh, Neil Walls, and Mike Sierra that
uses Perl and XML technologies. The text font is Linotype Birka; the heading font is Adobe Myriad Condensed; and the
code font is LucasFont's TheSans Mono Condensed. The illustrations that appear in the book were produced by Robert
Romano and Jessamyn Read using Macromedia FreeHand 9 and Adobe Photoshop 6. This colophon was written by
Linley Dolby.
Index
Index: Symbols & Numbers
There are no index entries for this letter.
Index: A
Aas, Gisle: 0. Foreword
ABEBooks.com POST request examples: 5.6. POST Example: ABEBooks.com
absolute URLs
    converting from relative: 4.4. Converting Relative URLs to Absolute
    converting to relative: 4.3. Converting Absolute URLs to Relative
absolute_base URL path: 4.3. Converting Absolute URLs to Relative
ActivePerl for Windows: 1.3. Installing LWP
agent( ) attribute, User-Agent header: 3.4.2. Request Parameters
AltaVista document fetch example: 2.5. Example: AltaVista
analysis, forms: 5.3. Automating Form Analysis
applets, tokenizing and: 8.6.2. Images and Applets
as_HTML( ) method: 10. Modifying HTML with Trees
attributes
    altering: 4.1. Parsing URLs
    HTML::Element methods: 10.1. Changing Attributes
    modifying, code for: 10.1. Changing Attributes
    nodes: 9.3.2. Attributes of a Node
authentication: 1.5.4. Authentication
    11.3. Authentication
    Authorization header: 11.3. Authentication
    cookies and: 11.3.1. Comparing Cookies with Basic Authentication
    credentials( ) method: 11.3.2. Authenticating via LWP
    security and: 11.3.3. Security
    Unicode mailing archive example: 11.4. An HTTP Authentication Example: The Unicode Mailing Archive
    user agents: 3.4.5. Authentication
[...]
... Individual Tokens
LWP
    distributions: 1.3.2.1. Download distributions
    Google search: 1.2. History of LWP
    history of: 1.2. History of LWP
    installation: 1.3. Installing LWP
        CPAN shell: 1.3.1. Installing LWP from the CPAN Shell
        manual: 1.3.2. Installing LWP Manually
    sample code: 1.5. LWP in Action
LWP class model, basic classes: 3.1. The Basic Classes
LWP:: module namespace: 1.2. History of LWP
LWP::ConnCache class: ...
[...]
installation, LWP: 1.3. Installing LWP
    CPAN shell: 1.3.1. Installing LWP from the CPAN Shell
    manual: 1.3.2. Installing LWP Manually
interfaces, object-oriented: 1.5.1. The Object-Oriented Interface
[...]
... Copyright
CPAN (Comprehensive Perl Archive Network): 1.3. Installing LWP
CPAN shell, LWP installation: 1.3.1. Installing LWP from the CPAN Shell
credentials( ) method: 3.4.5. Authentication
current_age( ) method: 3.5.4. Expiration Times
[...]
... Lines
[...]
Index: M
MacPerl: 1.3. Installing LWP
mailing archive ...
[...]
Index: K
There are no index entries for this letter.
[...]
Index: L
li elements: 9.1. Introduction to Trees
libwww-perl project: 1.2. History of LWP
license plate example: 5.5. POST Example: License Plates
link-checking ...
[...]
... Bundling into a Program
[...]
Index: G
get( ) function: 1.5. LWP in Action
    2.3.1. Basic ...
[...]
... class: 3.4.1. Connection Parameters
LWP::RobotUA: 12.2. A User Agent for Robots
LWP::Simple module: 2.3. LWP::Simple
    document fetch: 2.3.1. Basic Document Fetch
    get( ) function: 2.3.1. Basic Document Fetch
    getprint( ) function: 2.3.3. Fetch and Print
    getstore( ) function: 2.3.2. Fetch and Store
    head( ) function: 2.3.4. Previewing with HEAD
    previewing and: 2.3.4. Previewing with HEAD
LWP::UserAgent class: 3.1. The ...
[...]
... issues: 1.4.2. Copyright
LWP: 1.3.2.1. Download distributions
document fetching: 2.4. Fetching Documents Without LWP::Simple
    AltaVista example: 2.5. Example: AltaVista
do_GET( ) function: 2.4. Fetching Documents Without LWP::Simple
    3.3. Inside the do_GET and do_POST Functions
do_POST( ) function: 3.3. Inside the do_GET and do_POST Functions
dump( ) method: 9.2. HTML::TreeBuilder
[...]
... expressions: 6.2.4. Minimal and Greedy Matches
MOMspider: 1.2. History of LWP
mutter( ) function: 12.3.2. Overall Design in the Spider
[...]
Date posted: 17/03/2014, 17:20