Data Munging with Perl
DAVID CROSS
MANNING
Greenwich (74° w. long.)
For electronic information and ordering of this and other Manning books,
go to www.manning.com. The publisher offers discounts on this book
when ordered in quantity. For more information, please contact:
Special Sales Department
Manning Publications Co.
32 Lafayette Place
Greenwich, CT 06830
Fax: (203) 661-9018
email: orders@manning.com
©2001 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted,
in any form or by means electronic, mechanical, photocopying, or otherwise, without
prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products
are claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial
caps or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books they publish printed on acid-free paper, and we exert our best efforts to that end.
Library of Congress Cataloging-in-Publication Data
Cross, David, 1962-
Data munging with Perl / David Cross.
p. cm.
Includes bibliographical references and index.
ISBN 1-930110-00-6 (alk. paper)
1. Perl (Computer program language) 2. Data structures (Computer science)
3. Data transmission systems. I. Title.
QA76.73.P22 C39 20001998
005.7'2—dc21 00-050009
CIP
Manning Publications Co.
32 Lafayette Place
Greenwich, CT 06830
Copyeditor: Elizabeth Martin
Typesetter: Dottie Marsico
Cover designer: Leslie Haimes
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – VHG – 04 03 02 01
contents
foreword
preface
about the cover illustration
PART I FOUNDATIONS
1 Data, data munging, and Perl
1.1 What is data munging?
    Data munging processes ■ Data recognition ■ Data parsing ■ Data filtering ■ Data transformation
1.2 Why is data munging important?
    Accessing corporate data repositories ■ Transferring data between multiple systems ■ Real-world data munging examples
1.3 Where does data come from? Where does it go?
    Data files ■ Databases ■ Data pipes ■ Other sources/sinks
1.4 What forms does data take?
    Unstructured data ■ Record-oriented data ■ Hierarchical data ■ Binary data
1.5 What is Perl?
    Getting Perl
1.6 Why is Perl good for data munging?
1.7 Further information
1.8 Summary
2 General munging practices
2.1 Decouple input, munging, and output processes
2.2 Design data structures carefully
    Example: the CD file revisited
2.3 Encapsulate business rules
    Reasons to encapsulate business rules ■ Ways to encapsulate business rules ■ Simple module ■ Object class
2.4 Use UNIX “filter” model
    Overview of the filter model ■ Advantages of the filter model
2.5 Write audit trails
    What to write to an audit trail ■ Sample audit trail ■ Using the UNIX system logs
2.6 Further information
2.7 Summary
3 Useful Perl idioms
3.1 Sorting
    Simple sorts ■ Complex sorts ■ The Orcish Manoeuvre ■ Schwartzian transform ■ The Guttman-Rosler transform ■ Choosing a sort technique
3.2 Database Interface (DBI)
    Sample DBI program
3.3 Data::Dumper
3.4 Benchmarking
3.5 Command line scripts
3.6 Further information
3.7 Summary
4 Pattern matching
4.1 String handling functions
    Substrings ■ Finding strings within strings (index and rindex) ■ Case transformations
4.2 Regular expressions
    What are regular expressions? ■ Regular expression syntax ■ Using regular expressions ■ Example: translating from English to American ■ More examples: /etc/passwd ■ Taking it to extremes
4.3 Further information
4.4 Summary
PART II DATA MUNGING
5 Unstructured data
5.1 ASCII text files
    Reading the file ■ Text transformations ■ Text statistics
5.2 Data conversions
    Converting the character set ■ Converting line endings ■ Converting number formats
5.3 Further information
5.4 Summary
6 Record-oriented data
6.1 Simple record-oriented data
    Reading simple record-oriented data ■ Processing simple record-oriented data ■ Writing simple record-oriented data ■ Caching data
6.2 Comma-separated files
    Anatomy of CSV data ■ Text::CSV_XS
6.3 Complex records
    Example: a different CD file ■ Special values for $/
6.4 Special problems with date fields
    Built-in Perl date functions ■ Date::Calc ■ Date::Manip ■ Choosing between date modules
6.5 Extended example: web access logs
6.6 Further information
6.7 Summary
7 Fixed-width and binary data
7.1 Fixed-width data
    Reading fixed-width data ■ Writing fixed-width data
7.2 Binary data
    Reading PNG files ■ Reading and writing MP3 files
7.3 Further information
7.4 Summary
PART III SIMPLE DATA PARSING
8 Complex data formats
8.1 Complex data files
    Example: metadata in the CD file ■ Example: reading the expanded CD file
8.2 How not to parse HTML
    Removing tags from HTML ■ Limitations of regular expressions
8.3 Parsers
    An introduction to parsers ■ Parsers in Perl
8.4 Further information
8.5 Summary
9 HTML
9.1 Extracting HTML data from the World Wide Web
9.2 Parsing HTML
    Example: simple HTML parsing
9.3 Prebuilt HTML parsers
    HTML::LinkExtor ■ HTML::TokeParser ■ HTML::TreeBuilder and HTML::Element
9.4 Extended example: getting weather forecasts
9.5 Further information
9.6 Summary
10 XML
10.1 XML overview
    What’s wrong with HTML? ■ What is XML?
10.2 Parsing XML with XML::Parser
    Example: parsing weather.xml ■ Using XML::Parser ■ Other XML::Parser styles ■ XML::Parser handlers
10.3 XML::DOM
    Example: parsing XML using XML::DOM
10.4 Specialized parsers - XML::RSS
    What is RSS? ■ A sample RSS file ■ Example: creating an RSS file with XML::RSS ■ Example: parsing an RSS file with XML::RSS
10.5 Producing different document formats
    Sample XML input file ■ XML document transformation script ■ Using the XML document transformation script
10.6 Further information
10.7 Summary
11 Building your own parsers
11.1 Introduction to Parse::RecDescent
    Example: parsing simple English sentences
11.2 Returning parsed data
    Example: parsing a Windows INI file ■ Understanding the INI file grammar ■ Parser actions and the @item array ■ Example: displaying the contents of @item ■ Returning a data structure
11.3 Another example: the CD data file
    Understanding the CD grammar ■ Testing the CD file grammar ■ Adding parser actions
11.4 Other features of Parse::RecDescent
11.5 Further information
11.6 Summary
PART IV THE BIG PICTURE
12 Looking back - and ahead
12.1 The usefulness of things
    The usefulness of data munging ■ The usefulness of Perl ■ The usefulness of the Perl community
12.2 Things to know
    Know your data ■ Know your tools ■ Know where to go for more information
appendix A Modules reference
appendix B Essential Perl
index
... their journey continues

1 Data, data munging, and Perl

What this chapter covers:
■ The process of munging data
■ Sources and sinks of data
■ Forms data takes
■ Perl and why it is perfect for data munging

1.1 What is data munging?

munge (muhnj) vt. 1. [derogatory] To imperfectly transform information. 2. A comprehensive rewrite of a routine, a data structure, or the...

... to make use of the data, it will need to be transformed in various ways as it moves from one system to the next. This is where data munging comes in. It lives in the interstices between computer systems, ensuring that data produced by one system can be used by another.

1.2.3 Real-world data munging examples

Let’s look at a couple of simple examples where data munging...

... file and puts the useful data into variables that are accessible from within our program. As with data recognition, it is far easier to parse data if you know what you are going to do with it, as this will affect the kinds of data structures that you use. In practice, many data munging programs are written so that the data recognition and data parsing phases are combined.

1.1.4 Data filtering

It is quite...

... introduction to Perl (see appendix B).

About this book

The book begins by addressing introductory and general topics, before gradually exploring more complex types of data munging. PART I sets the scene for the rest of the book. Chapter 1 introduces data munging and Perl. I discuss why Perl is particularly well suited to data munging and survey the types of data that you might meet, along with the mechanisms...
... none.

Perl supports arbitrarily complex data structures: When munging data, you will usually want to build up internal data structures to store the data in interim forms before writing it to the output file. Some programming languages impose limits on the complexity of internal data structures. Since the introduction of Perl 5, Perl has had no such constraints.

Perl encourages code reuse: You will often be munging...

... full multi-user product such as Oracle or Sybase Adaptive Server Enterprise.

Imposing structure on data

Databases have advantages over data files in that they impose structure on your data. A database designer will have defined a database schema, which defines the shape and type of all of your data objects. It will define, for example, exactly which data items are stored for each customer in the database,...

Unstructured data

While there is a great deal of unstructured data in the world, it is unlikely that you will come across very much of it, because the job of data munging is to convert data from one structure to another. It is very difficult for a computer program to impose structure on data that isn’t already structured in some way. Of course, one common data munging task is to take data with no apparent...

... of the data. A data item is contained by its parent and contains its own children. At this point, the record-at-a-time processing methods that we will have been using on simpler data types no longer work and we will be forced to find more powerful tools. We will look at hierarchical data (specifically HTML and XML) in chapters 9 and 10.

1.4.4 Binary data

Finally, there is binary data. This is data that...
... actual data records in any detail. An important part of recognizing data is realizing what context the data is found in. For example, data items that are in header and footer records will have to be processed completely differently from data items which are in the body of the data. It is therefore very important to understand what our input data looks like and what we need to do with it.

... This is what most programmers do most of the time. Perl is particularly good at these kinds of tasks. It helps programmers write data conversion programs quickly. In fact, the same characteristics that make Perl ideal for CGI programming also make it ideal for data munging. (CGI programs are really data munging programs in flashy disguise.) In keeping with the Perl community slogan, “There’s more than one ...
Date posted: 25/03/2014, 10:25