Hadoop: The Definitive Guide, 4th Edition (O’Reilly, 2015)


Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters. Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.
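The fundamental component the blurb mentions first is MapReduce, the programming model the book builds up from first principles. As a taste of that model, here is a minimal sketch (not from the book) that simulates the map, shuffle, and reduce phases in plain Python, with no Hadoop involved; the year/temperature records are made-up toy data in the spirit of the book’s max-temperature example:

```python
# Minimal simulation of the MapReduce model: map -> shuffle -> reduce.
# Plain Python only; the records are hypothetical (year, temperature) lines.
from collections import defaultdict

records = [
    "1949,111",
    "1949,78",
    "1950,0",
    "1950,22",
    "1950,-11",
]

def map_fn(line):
    # Emit (year, temperature) key/value pairs, as a Mapper would.
    year, temp = line.split(",")
    yield year, int(temp)

def reduce_fn(year, temps):
    # Collapse each year's temperatures to the maximum, as a Reducer would.
    return year, max(temps)

# Shuffle: group every mapped value under its key before reducing.
groups = defaultdict(list)
for line in records:
    for key, value in map_fn(line):
        groups[key].append(value)

results = dict(reduce_fn(k, v) for k, v in sorted(groups.items()))
print(results)  # {'1949': 111, '1950': 22}
```

The point of the model is that `map_fn` and `reduce_fn` see only one record or one key group at a time, which is what lets Hadoop run them in parallel across a cluster; the framework supplies the shuffle step shown explicitly here.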

Hadoop: The Definitive Guide
4th Edition, Revised & Updated
STORAGE AND ANALYSIS AT INTERNET SCALE

Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.

■ Learn fundamental components such as MapReduce, HDFS, and YARN
■ Explore MapReduce in depth, including steps for developing applications with it
■ Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
■ Learn two data formats: Avro for data serialization and Parquet for nested data
■ Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
■ Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
■ Learn the HBase distributed database and the ZooKeeper distributed configuration service

“Now you have the opportunity to learn about Hadoop from a master—not only of the technology, but also of common sense and plain talk.”
—Doug Cutting, Cloudera

Tom White, an engineer at Cloudera and member of the Apache Software Foundation, has been an Apache Hadoop committer since 2007. He has written numerous articles for oreilly.com, java.net, and IBM’s developerWorks, and speaks regularly about Hadoop at industry conferences.

PROGRAMMING LANGUAGES / HADOOP
US $49.99  CAN $57.99
ISBN: 978-1-491-90163-2
Twitter: @oreillymedia  facebook.com/oreilly

FOURTH EDITION

Hadoop: The Definitive Guide

Tom White

Hadoop: The Definitive Guide, Fourth Edition
by Tom White

Copyright © 2015 Tom White. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Mike Loukides and Meghan Blanchette
Production Editor: Matthew Hacker
Copyeditor: Jasmine Kwityn
Proofreader: Rachel Head
Indexer: Lucie Haskins
Cover Designer: Ellie Volckhausen
Interior Designer: David Futato
Illustrator: Rebecca Demarest

June 2009: First Edition
October 2010: Second Edition
May 2012: Third Edition
April 2015: Fourth Edition

Revision History for the Fourth Edition:
2015-03-19: First release
2015-04-17: Second release

See http://oreilly.com/catalog/errata.csp?isbn=9781491901632 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Hadoop: The Definitive Guide, the cover image of an African elephant, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

ISBN: 978-1-491-90163-2
[LSI]

For Eliane, Emilia, and Lottie

Table of Contents

Foreword
Preface

Part I. Hadoop Fundamentals

1. Meet Hadoop
   Data! · Data Storage and Analysis · Querying All Your Data · Beyond Batch · Comparison with Other Systems (Relational Database Management Systems, Grid Computing, Volunteer Computing) · A Brief History of Apache Hadoop · What’s in This Book?

2. MapReduce
   A Weather Dataset (Data Format) · Analyzing the Data with Unix Tools · Analyzing the Data with Hadoop (Map and Reduce, Java MapReduce) · Scaling Out (Data Flow, Combiner Functions, Running a Distributed MapReduce Job) · Hadoop Streaming (Ruby, Python)

3. The Hadoop Distributed Filesystem
   The Design of HDFS · HDFS Concepts (Blocks, Namenodes and Datanodes, Block Caching, HDFS Federation, HDFS High Availability) · The Command-Line Interface (Basic Filesystem Operations) · Hadoop Filesystems (Interfaces) · The Java Interface (Reading Data from a Hadoop URL, Reading Data Using the FileSystem API, Writing Data, Directories, Querying the Filesystem, Deleting Data) · Data Flow (Anatomy of a File Read, Anatomy of a File Write, Coherency Model) · Parallel Copying with distcp (Keeping an HDFS Cluster Balanced)

4. YARN
   Anatomy of a YARN Application Run (Resource Requests, Application Lifespan, Building YARN Applications) · YARN Compared to MapReduce · Scheduling in YARN (Scheduler Options, Capacity Scheduler Configuration, Fair Scheduler Configuration, Delay Scheduling, Dominant Resource Fairness) · Further Reading

5. Hadoop I/O
   Data Integrity (Data Integrity in HDFS, LocalFileSystem, ChecksumFileSystem) · Compression (Codecs, Compression and Input Splits, Using Compression in MapReduce)
Serialization The Writable Interface Writable Classes Implementing a Custom Writable Serialization Frameworks File-Based Data Structures SequenceFile MapFile Other File Formats and Column-Oriented Formats Part II 97 98 99 99 100 101 105 107 109 110 113 121 126 127 127 135 136 MapReduce Developing a MapReduce Application 141 The Configuration API Combining Resources Variable Expansion Setting Up the Development Environment Managing Configuration GenericOptionsParser, Tool, and ToolRunner Writing a Unit Test with MRUnit Mapper Reducer Running Locally on Test Data Running a Job in a Local Job Runner Testing the Driver Running on a Cluster Packaging a Job Launching a Job The MapReduce Web UI Retrieving the Results Debugging a Job Hadoop Logs 141 143 143 144 146 148 152 153 156 156 157 158 160 160 162 165 167 168 172 Table of Contents | vii Remote Debugging Tuning a Job Profiling Tasks MapReduce Workflows Decomposing a Problem into MapReduce Jobs JobControl Apache Oozie 174 175 175 177 177 178 179 How MapReduce Works 185 Anatomy of a MapReduce Job Run Job Submission Job Initialization Task Assignment Task Execution Progress and Status Updates Job Completion Failures Task Failure Application Master Failure Node Manager Failure Resource Manager Failure Shuffle and Sort The Map Side The Reduce Side Configuration Tuning Task Execution The Task Execution Environment Speculative Execution Output Committers 185 186 187 188 189 190 192 193 193 194 195 196 197 197 198 201 203 203 204 206 MapReduce Types and Formats 209 MapReduce Types The Default MapReduce Job Input Formats Input Splits and Records Text Input Binary Input Multiple Inputs Database Input (and Output) Output Formats Text Output Binary Output viii | Table of Contents 209 214 220 220 232 236 237 238 238 239 239 Spark and, 566 task assignments, 188 task execution, 189 task failures, 194 testing with MRUnit, 153–156 tuning checklist, 175 tuning properties, 202 MapDriver class, 153 MapFile class, 135 MapFileOutputFormat 
class, 240 MapFn class, 524, 527 Mapper interface about, 153–156 finding information on input splits, 227 task execution, 203 type parameters, 210 Mapred class, 546 mapred-env.sh file, 292 mapred-site.xml file, 292 mapred.child.java.opts property, 174, 201, 302 mapred.combiner.class property, 213 mapred.input.format.class property, 213 mapred.job.tracker property, 158 mapred.map.runner.class property, 213 mapred.mapoutput.key.class property, 213 mapred.mapoutput.value.class property, 213 mapred.mapper.class property, 213 mapred.output.format.class property, 213 mapred.output.key.class property, 213 mapred.output.key.comparator.class property, 213 mapred.output.value.class property, 213 mapred.output.value.groupfn.class property, 213 mapred.partitioner.class property, 213 mapred.reducer.class property, 213 MapReduce about, 6, 19, 177 anatomy of job runs, 185–192 Avro support, 359–365 batch processing, benchmarking with TeraSort, 315 cluster setup and installation, 288 compression and, 106–109 counters, 247–255 Crunch and, 520 daemon properties, 301–303 decomposing problems into jobs, 177–178 default job, 214–219 714 | Index developing applications about, 141 Configuration API, 141–144 running locally on test data, 156–160 running on clusters, 160–175 setting up development environment, 144–152 tuning jobs, 175–176 workflows, 177–184 writing unit tests, 152–156 failure considerations, 193–196 Hadoop Streaming, 37–41 HBase and, 587–589 Hive and, 477 input formats, 220–238 joining data, 268–273 library classes supported, 279 old and new API comparison, 697–698 old API signatures, 211 output formats, 238–245 Parquet support, 377–379 progress reporting in, 191 querying data, 6, 503 resource requests, 82 shuffle and sort, 197–203 side data distribution, 273–279 sorting data, 255–268, 363–365 Spark and, 558 Sqoop support, 405, 408, 419 starting and stopping daemons, 292 system comparison, 8–12 task execution, 189, 203–208 types supported, 209–214 weather dataset example, 
19–37 YARN comparison, 83–85 Mapreduce class, 546 MapReduce mode (Pig), 425, 467 MAPREDUCE statement (Pig Latin), 435 mapreduce.am.max-attempts property, 194 mapreduce.client.progressmonitor.pollinterval property, 191 mapreduce.client.submit.file.replication proper‐ ty, 187 mapreduce.cluster.acls.enabled property, 313 mapreduce.cluster.local.dir property, 174, 198 mapreduce.framework.name property, 158, 159, 425, 687 mapreduce.input.fileinputformat.input.dir.re‐ cursive property, 224 mapreduce.input.fileinputformat.inputdir prop‐ erty, 224 mapreduce.input.fileinputformat.split.maxsize property, 225 mapreduce.input.fileinputformat.split.minsize property, 225 mapreduce.input.keyvaluelinerecordread‐ er.key.value.separator property, 233 mapreduce.input.lineinputformat.linespermap property, 234 mapreduce.input.linerecordreader.line.max‐ length property, 233 mapreduce.input.pathFilter.class property, 224 mapreduce.job.acl-modify-job property, 313 mapreduce.job.acl-view-job property, 313 mapreduce.job.combine.class property, 212 mapreduce.job.end-notification.url property, 192 mapreduce.job.hdfs-servers property, 313 mapreduce.job.id property, 203 mapreduce.job.inputformat.class property, 212 mapreduce.job.map.class property, 212 mapreduce.job.maxtaskfailures.per.tracker property, 195 mapreduce.job.output.group.comparator.class property, 212 mapreduce.job.output.key.class property, 212 mapreduce.job.output.key.comparator.class property, 212, 258 mapreduce.job.output.value.class property, 212 mapreduce.job.outputformat.class property, 212 mapreduce.job.partitioner.class property, 212 mapreduce.job.queuename property, 90 mapreduce.job.reduce.class property, 212 mapreduce.job.reduce.slowstart.completedmaps property, 308 mapreduce.job.reduces property, 187 mapreduce.job.ubertask.enable property, 188 mapreduce.job.ubertask.maxbytes property, 187 mapreduce.job.ubertask.maxmaps property, 187 mapreduce.job.ubertask.maxreduces property, 187 mapreduce.job.user.classpath.first 
property, 162 mapreduce.jobhistory.address property, 305 mapreduce.jobhistory.bind-host property, 305 mapreduce.jobhistory.webapp.address property, 306 mapreduce.map.combine.minspills property, 198, 202 mapreduce.map.cpu.vcores property, 188, 303 mapreduce.map.failures.maxpercent property, 194 mapreduce.map.input.file property, 228 mapreduce.map.input.length property, 228 mapreduce.map.input.start property, 228 mapreduce.map.java.opts property, 302 mapreduce.map.log.level property, 173 mapreduce.map.maxattempts property, 194 mapreduce.map.memory.mb property, 188, 302 mapreduce.map.output.compress property, 109, 198, 202 mapreduce.map.output.compress.codec proper‐ ty, 109, 198, 202 mapreduce.map.output.key.class property, 212 mapreduce.map.output.value.class property, 212 mapreduce.map.sort.spill.percent property, 197, 202 mapreduce.map.speculative property, 205 mapreduce.mapper.multithreadedmap‐ per.threads property, 222 mapreduce.output.fileoutputformat.compress property, 107, 372 mapreduce.output.fileoutputformat.com‐ press.codec property, 107 mapreduce.output.fileoutputformat.com‐ press.type property, 108 mapreduce.output.textoutputformat.separator property, 239 mapreduce.reduce.cpu.vcores property, 188, 303 mapreduce.reduce.failures.maxpercent proper‐ ty, 194 mapreduce.reduce.input.buffer.percent proper‐ ty, 201, 203 mapreduce.reduce.java.opts property, 302 mapreduce.reduce.log.level property, 173 mapreduce.reduce.maxattempts property, 194 mapreduce.reduce.memory.mb property, 188, 302 mapreduce.reduce.merge.inmem.threshold property, 199, 201, 203 mapreduce.reduce.shuffle.input.buffer.percent property, 199, 202 mapreduce.reduce.shuffle.maxfetchfailures property, 202 mapreduce.reduce.shuffle.merge.percent prop‐ erty, 199, 202 mapreduce.reduce.shuffle.parallelcopies proper‐ ty, 199, 202 Index | 715 mapreduce.reduce.speculative property, 205 mapreduce.shuffle.max.threads property, 198, 202 mapreduce.shuffle.port property, 306 mapreduce.shuffle.ssl.enabled property, 
314 mapreduce.task.attempt.id property, 203 mapreduce.task.files.preserve.failedtasks prop‐ erty, 174 mapreduce.task.files.preserve.filepattern prop‐ erty, 174 mapreduce.task.id property, 203 mapreduce.task.io.sort.factor property, 198, 199, 202 mapreduce.task.io.sort.mb property, 197, 201 mapreduce.task.ismap property, 204 mapreduce.task.output.dir property, 207 mapreduce.task.partition property, 204 mapreduce.task.profile property, 176 mapreduce.task.profile.maps property, 176 mapreduce.task.profile.reduces property, 176 mapreduce.task.timeout property, 193 mapreduce.task.userlog.limit.kb property, 173 MapWritable class, 120 MAP_INPUT_RECORDS counter, 249 MAP_OUTPUT_BYTES counter, 249 MAP_OUTPUT_MATERIALIZED_BYTES counter, 249 MAP_OUTPUT_RECORDS counter, 249 mashups, Massie, Matt, 653 master nodes (HBase), 578 master−worker pattern (namenodes), 46 materialization process (Crunch), 535–537 Maven POM (Project Object Model), 144–145, 160, 351 MAX function (Pig Latin), 444, 446 MB_MILLIS_MAPS counter, 251 MB_MILLIS_REDUCES counter, 251 memory management buffering writes, 197 container virtual memory constraints, 303 daemons, 295 memory heap size, 294 namenodes, 286, 294 Spark and, 549 task assignments, 188 MemPipeline class, 524 MERGED_MAP_OUTPUTS counter, 250 Message Passing Interface (MPI), 10 716 | Index metadata backups of, 332 block sizes and, 46 filesystems and, 318–320 Hive metastore, 472, 478, 480–482 namenode memory requirements, 44 Parquet considerations, 370 querying, 63–65 upgrade considerations, 338–341 metastore (Hive), 472, 478, 480–482 METASTORE_PORT environment variable, 478 metrics counters and, 331 HBase and, 601 JMX and, 331 Microsoft Research MyLifeBits project, MILLIS_MAPS counter, 251 MILLIS_REDUCES counter, 251 MIN function (Pig Latin), 446 miniclusters, testing in, 159 minimal replication condition, 323 MIP (Message Passing Interface), 10 mkdir command, 436 mntr command (ZooKeeper), 606 monitoring clusters about, 330 logging support, 330 
metrics and JMX, 331 MorphlineSolrSink class, 390 MRBench benchmark, 316 MRUnit library about, 145, 152 testing map functions, 153–156 testing reduce functions, 156 multiple files input formats, 237 MultipleOutputs class, 242–244 output formats, 240–244 partitioning data, 240–242 MultipleInputs class, 237, 270 MultipleOutputFormat class, 240 MultipleOutputs class, 242–244 multiplexing selectors, 390 multiquery execution, 434 multitable insert, 501 MultithreadedMapper class, 222, 279 MultithreadedMapRunner class, 279 mv command, 436 MyLifeBits project, MySQL creating database schemas, 404 Hive and, 481 HiveQL and, 473 installing and configuring, 404 populating database, 404 mysqlimport utility, 420 N namenodes about, 12, 46 block caching, 48 checkpointing process, 320 cluster setup and installation, 290 cluster sizing, 286 commissioning nodes, 334–335 data integrity and, 98 DataStreamer class and, 72 decommissioning nodes, 335–337 DFSInputStream class and, 70 directory structure, 317–318 failover controllers and, 50 filesystem metadata and, 44 HDFS federation, 48 memory considerations, 286, 294 replica placement, 73 safe mode, 323–324 secondary, 47, 291, 321 single points of failure, 48 starting, 291, 320 namespaceID identifier, 318 National Climatic Data Center (NCDC) data format, 19 encapsulating parsing logic, 154 multiple inputs, 237 preparing weather datafiles, 693–695 NativeAzureFileSystem class, 54 NCDC (National Climatic Data Center) data format, 19 encapsulating parsing logic, 154 multiple inputs, 237 preparing weather datafiles, 693–695 NDFS (Nutch Distributed Filesystem), 13 nested encoding, 370 net.topology.node.switch.mapping.impl proper‐ ty, 288 net.topology.script.file.name property, 288 network topology, 70, 74, 286–288 NFS gateway, 56 NLineInputFormat class, 208, 234 NNBench benchmark, 316 node managers about, 80 blacklisting, 195 commissioning nodes, 334–335 decommissioning nodes, 335–337 failure considerations, 195 heartbeat requests, 94 job 
initialization process, 187 resource manager failure, 196 starting, 291 streaming tasks, 189 task execution, 189 task failure, 193 tasktrackers and, 84 normalization (data), NullOutputFormat class, 239 NullWritable class, 118, 119, 239 NUM_FAILED_MAPS counter, 251 NUM_FAILED_REDUCES counter, 251 NUM_FAILED_UBERTASKS counter, 251 NUM_KILLED_MAPS counter, 251 NUM_KILLED_REDUCES counter, 251 NUM_UBER_SUBMAPS counter, 251 NUM_UBER_SUBREDUCES counter, 251 Nutch Distributed Filesystem (NDFS), 13 Nutch search engine, 12–13 O Object class (Java), 123 object properties, printing, 149–151 ObjectWritable class, 119 ODBC drivers, 479 oozie command-line tool, 183 oozie.wf.application.path property, 183 OOZIE_URL environment variable, 183 operations (ZooKeeper) exceptions supported, 630–634, 635 language bindings, 617 multiupdate, 616 watch triggers, 618 znode supported, 616 operators (HiveQL), 488 operators (Pig) combining and splitting data, 466 filtering data, 457–459 grouping and joining data, 459–464 Index | 717 loading and storing data, 456 sorting data, 465 Optimized Record Columnar File (ORCFile), 367, 498 ORCFile (Optimized Record Columnar File), 367, 498 OrcStorage function (Pig Latin), 447 ORDER BY clause (Hive), 503 ORDER statement (Pig Latin), 435, 465 org.apache.avro.mapreduce package, 359 org.apache.crunch.io package, 531 org.apache.crunch.lib package, 545 org.apache.flume.serialization package, 388 org.apache.hadoop.classification package, 337 org.apache.hadoop.conf package, 141 org.apache.hadoop.hbase package, 585 org.apache.hadoop.hbase.mapreduce package, 587 org.apache.hadoop.hbase.util package, 586 org.apache.hadoop.io package, 25, 113 org.apache.hadoop.io.serializer package, 126 org.apache.hadoop.mapreduce package, 220 org.apache.hadoop.mapreduce.jobcontrol pack‐ age, 179 org.apache.hadoop.mapreduce.join package, 270 org.apache.hadoop.streaming.mapreduce pack‐ age, 235 org.apache.pig.builtin package, 450 org.apache.spark.rdd package, 558 OTHER_LOCAL_MAPS 
counter, 251 outer joins, 506 output formats binary output, 239 database output, 238 lazy output, 245 multiple outputs, 240–244 text output, 239 OutputCollector interface, 207 OutputCommitter class, 188, 189, 206–208 OutputFormat interface, 206, 238–245 OVERWRITE keyword (Hive), 475 OVERWRITE write mode, 532 O’Malley, Owen, 14 P packaging jobs about, 160 client classpath, 161 718 | Index packaging dependencies, 161 task classpath, 161 task classpath precedence, 162 packaging Oozie workflow applications, 182 PageRank algorithm, 543 Pair class, 525, 527 PairRDDFunctions class, 553 PARALLEL keyword (Pig Latin), 458, 467 parallel processing, 76–78 ParallelDo fusion, 543 parameter substitution (Pig), 467–469 Parquet about, 137, 367 Avro and, 375–377 binary storage format and, 498 configuring, 372 data model, 368–370 file format, 370–372 Hive support, 406 MapReduce support, 377–379 nested encoding, 370 Protocol Buffers and, 375–377 Sqoop support, 406 Thrift and, 375–377 tool support, 367 writing and reading files, 373–377 parquet.block.size property, 372, 379 parquet.compression property, 372 parquet.dictionary.page.size property, 372 parquet.enable.dictionary property, 372 parquet.example.data package, 373 parquet.example.data.simple package, 373 parquet.page.size property, 372 ParquetLoader function (Pig Latin), 447 ParquetReader class, 374 ParquetStorer function (Pig Latin), 447 ParquetWriter class, 374 partial sort, 257–258 PARTITION clause (Hive), 500 PARTITIONED BY clause (Hive), 492 partitioned data about, HDFS sinks and, 387 Hive tables and, 491–493 weather dataset example, 240–242 Partitioner interface, 211, 272 Path class, 58, 61 PATH environment variable, 339 PathFilter interface, 65–68 Paxos algorithm, 621 PCollection interface about, 521 asCollection() method, 537 checkpointing pipelines, 545 materialize() method, 535–537 parallelDo() method, 521, 524–525, 541 pipeline execution, 538 reading files, 531 types supported, 528–530 union() method, 523 writing 
files, 532 permissions ACL, 620 HDFS considerations, 52 storing, 46 persistence, RDD, 560–562 persistent data structures, 317 persistent znodes, 614 PGroupedTable interface about, 522, 526 combineValues() method, 526–528, 534 mapValues() method, 534 PHYSICAL_MEMORY_BYTES counter, 249, 303 Pig about, 423 additional information, 469 anonymous relations and, 467 comparison with databases, 430–431 Crunch and, 519 data processing operators, 456–466 execution types, 424–426 installing and running, 424–427 parallelism and, 467 parameter substitution and, 467–469 practical techniques, 466–469 sorting data, 259 user-defined functions, 448–456 weather dataset example, 427–430 Pig Latin about, 423, 432 built-in types, 439–441 commands supported, 436 editor support, 427 expressions, 438–439 functions, 440, 445–447 macros, 447–448 schemas, 441–445 statements, 433–437 structure, 432 pig.auto.local.enabled property, 426 pig.auto.local.input.maxbytes, 426 PigRunner class, 426 PigServer class, 426 PigStorage function (Pig Latin), 446 PIG_CONF_DIR environment variable, 425 pipeline execution (Crunch) about, 538 checkpointing pipelines, 545 inspecting plans, 540–543 iterative algorithms, 543–544 running pipelines, 538–539 stopping pipelines, 539 Pipeline interface done() method, 539 enableDebug() method, 539 read() method, 531 readTextFile() method, 521 run() method, 538–539 runAsync() method, 539 PipelineExecution interface, 539 PipelineResult class, 523, 538 PObject interface, 537, 543 PositionedReadable interface, 60 preemption, 93 PrimitiveEvalFunc class, 452 printing object properties, 149–151 profiling tasks, 175–176 progress, tracking for tasks, 190 Progressable interface, 61 properties daemon, 296–303 map-side tuning, 202 printing for objects, 149–151 reduce-side tuning, 202 znodes, 614–615 Protocol Buffers, 375–377 ProtoParquetWriter class, 375 psdsh shell tool, 293 pseudodistributed mode (Hadoop), 688–690 PTable interface about, 522 asMap() method, 537 creating instance, 
525 finding set of unique values for keys, 535 groupByKey() method, 526 materializeToMap() method, 536 Index | 719 reading text files, 531 PTables class, 546 PTableType interface, 522 PType interface, 524, 528–530, 535 Public Data Sets, pwd command, 436 PySpark API, 555 pyspark command, 555 Python language Avro and, 354 incrementing counters, 255 querying data, 504 Spark example, 555 weather dataset example, 40 Q QJM (quorum journal manager), 49 querying data about, aggregating data, 503 batch processing, FileStatus class, 63–65 FileSystem class, 63–68 HBase online query application, 589–597 joining data, 505–508 MapReduce scripts, 503 sorting data, 503 subqueries, 508 views, 509 queue elasticity, 88 queues Capacity Scheduler, 88–90 Fair Scheduler, 90–94 quit command, 437 quorum journal manager (QJM), 49 R r (read) permission, 52 rack local tasks, 188 rack topology, 287–288 Rackspace MailTrust, RACK_LOCAL_MAPS counter, 251 RAID (redundant array of independent disks), 285 Rajaraman, Anand, RANK statement (Pig Latin), 435 RawComparator interface, 112, 123, 258 RawLocalFileSystem class, 53, 99 720 | Index RDBMSs (Relational Database Management Systems) about, 8–9 HBase comparison, 597–600 Hive metadata and, 489 Pig comparison, 430 RDD class filter() method, 551 map() method, 551 RDDs (Resilient Distributed Datasets) about, 550, 556 creating, 556 Java and, 555 operations on, 557–560 persistence and, 560–562 serialization, 562 read (r) permission, 52 READ permission (ACL), 620 reading data Crunch support, 531 FileSystem class and, 58–61, 69 from Hadoop URL, 57 HDFS data flow, 69–70 Parquet and, 373–377 SequenceFile class, 129–132 short-circuiting local reads, 308 ReadSupport class, 373 READ_OPS counter, 250 RecordReader class, 221, 229 records, processing files as, 228–232 REDUCE clause (Hive), 503 reduce functions (MapReduce) about, 23 data flow tasks, 31–36 general form, 209 Hadoop Streaming, 37 Java example, 25 joining data, 270–273 progress and status updates, 190 
shuffle and sort, 198–201 Spark and, 567 task assignments, 188 task execution, 189 task failures, 194 testing with MRUnit, 156 tuning checklist, 175 tuning properties, 202 ReduceDriver class, 156 Reducer interface, 203, 210 REDUCE_INPUT_GROUPS counter, 249 REDUCE_INPUT_RECORDS counter, 249 REDUCE_OUTPUT_RECORDS counter, 249 REDUCE_SHUFFLE_BYTES counter, 249 redundant array of independent disks (RAID), 285 reference genomes, 659 ReflectionUtils class, 102, 130 RegexMapper class, 279 RegexSerDe class, 499 regionservers (HBase), 578 REGISTER statement (Pig Latin), 436 regular expressions, 498 Relational Database Management Systems (see RDBMSs) remote debugging, 174 remote procedure calls (RPCs), 109 replicated mode (ZooKeeper), 620, 639 Reporter interface, 191 reserved storage space, 307 Resilient Distributed Datasets (see RDDs) resource manager page, 165 resource managers about, 80 application master failure, 195 cluster sizing, 286 commissioning nodes, 334–335 decommissioning nodes, 335–337 failure considerations, 196 heartbeat requests, 94 job initialization process, 187 job submission process, 187 jobtrackers and, 84 node manager failure, 195 progress and status updates, 191 starting, 291 task assignments, 188 task execution, 189 thread dumps, 331 resource requests, 81 REST, HBase and, 589 Result class, 587 ResultScanner interface, 586 ResultSet interface, 409 rg.apache.hadoop.hbase.client package, 585 rm command, 436 rmf command, 436 ROW FORMAT clause (Hive), 474, 498, 510 RowCounter class, 587 RPC server properties, 305 RpcClient class (Java), 398 RPCs (remote procedure calls), 109 Ruby language, 37–40 run command, 434, 437 Runnable interface (Java), 83 ruok command (ZooKeeper), 605 S S3AFileSystem class, 53 safe mode, 322–324 Sammer, Eric, 284 Sample class, 546 SAMPLE statement (Pig Latin), 435 Scala application example, 552–554 scaling out (data) about, 30 combiner functions, 34–36 data flow, 30–34 running distributed jobs, 37 Scan class, 586 scheduling in 
YARN about, 85 Capacity Scheduler, 88–90 delay scheduling, 94 Dominant Resource Fairness, 95 Fair Scheduler, 90–94 FIFO Scheduler, 86 jobs, 308 scheduling tasks in Spark, 569 schema-on-read, 9, 482 schema-on-write, 482 schemas Avro, 346–349, 375 HBase online query application, 590 MySQL, 404 Parquet, 373 Pig Latin, 441–445, 456 ScriptBasedMapping class, 288 scripts MapReduce, 503 Pig, 426 Python, 504 ZooKeeper, 638 search platforms, secondary namenodes about, 47 checkpointing process, 320 directory structure, 321 Index | 721 starting, 291 secondary sort, 262–268 SecondarySort class, 546 security about, 309 additional enhancements, 313–314 delegation tokens, 312 Kerberos and, 309–312 security.datanode.protocol.acl property, 314 seek time, Seekable interface, 59 SELECT statement (Hive) grouping rows, 475 index support, 483 partitioned data and, 500 subqueries and, 508 views and, 509 SELECT TRANSFORM statement (Hive), 510 selectors, replicating and multiplexing, 390 semi joins, 507 semi-structured data, semicolons, 432 SequenceFile class about, 127 compressing streams, 102 converting tar files, 127 displaying with command-line interface, 132 exports and, 421 format overview, 133–134 NullWritable class and, 119 ObjectWritable class and, 119 reading, 129–132 sorting and merging, 132 Sqoop support, 406 writing, 127–129 SequenceFileAsBinaryInputFormat class, 236 SequenceFileAsBinaryOutputFormat class, 240 SequenceFileAsTextInputFormat class, 236 SequenceFileInputFormat class, 236 SequenceFileOutputFormat class, 108, 231, 239 sequential znodes, 615 SerDe (Serializer-Deserializer), 496–499 SERDE keyword (Hive), 498 Serializable interface (Java), 533 serialization about, 109–110 Avro support, 349–352 DefaultStringifier class, 274 of functions, 533 722 | Index IDL support, 127 implementing custom Writable, 121–125 pluggable frameworks, 126–127 RDD, 562 Sqoop support, 407 tuning checklist, 175 Writable class hierarchy, 113–121 Writable interface, 110–112 Serialization 
interface, 126 Serializer interface, 126 serializer property, 388 Serializer-Deserializer (SerDe), 496–499 service requests, 310 Set class, 547 SET command (Hive), 476 set command (Pig), 437 setACL operation (ZooKeeper), 616 setData operation (ZooKeeper), 616 SetFile class, 135 SETI@home project, 11 sh command, 437 Shard class, 547 shared-nothing architecture, 10 ShareThis sharing network, 680–684 short-circuit local reads, 308 ShortWritable class, 113 SHOW FUNCTIONS statement (Hive), 489 SHOW LOCKS statement (Hive), 483 SHOW PARTITIONS statement (Hive), 493 SHOW TABLES statement (Hive), 509 shuffle process about, 197 configuration tuning, 201–203 map side, 197–198 reduce side, 198–201 SHUFFLED_MAPS counter, 250 side data distribution about, 273 distributed cache, 274–279 job configuration, 273 Sierra, Stuart, 127 single point of failure (SPOF), 48 single sign-ons, 310 sink groups (Flume), 395–397 sinkgroups property, 395 SIZE function (Pig Latin), 444, 446 slaves file, 290, 292, 335 Snappy compression, 100–101, 104 SnappyCodec class, 101 SORT BY clause (Hive), 503 Sort class, 520, 547 SortedMapWritable class, 120 sorting data about, 255 Avro and, 358, 363–365 controlling sort order, 258 Hive tables, 503 MapReduce and, 255–268, 363–365 partial sort, 257–258 Pig operators and, 465 preparation overview, 256 secondary sort, 262–268 shuffle process and, 197–203 total sort, 259–262 Source interface, 531 SourceTarget interface, 533 Spark about, 549 additional information, 574 anatomy of job runs, 565–570 cluster managers and, 570 example of, 550–555 executors and, 570 Hive and, 477 installing, 550 MapReduce and, 558 RDDs and, 556–563 resource requests, 82 shared variables, 564–565 sorting data, 259 YARN and, 571–574 spark-shell command, 550 spark-submit command, 553, 573 spark.kryo.registrator property, 563 SparkConf class, 553 SparkContext class, 550, 571–574 SpecificDatumReader class, 352 speculative execution of tasks, 204–206 SPILLED_RECORDS counter, 249 SPLIT 
statement (Pig Latin), 435, 466 splits (input data) (see input splits) SPLIT_RAW_BYTES counter, 249 SPOF (single point of failure), 48 Sqoop about, 401 additional information, 422 Avro support, 406 connectors and, 403 escape sequences supported, 418 export process, 417–422 file formats, 406 generated code, 407 getting, 401–403 import process, 408–412 importing large objects, 415–417 MapReduce support, 405, 408, 419 Parquet support, 406 sample import, 403–407 SequenceFile class and, 406 serialization and, 407 tool support, 402, 407 working with imported data, 412–415 srst command (ZooKeeper), 605 srvr command (ZooKeeper), 605 SSH, configuring, 289, 296, 689 stack traces, 331 Stack, Michael, 575–602 standalone mode (Hadoop), 687 stat command (ZooKeeper), 605 statements (Pig Latin) about, 433–437 control flow, 438 expressions and, 438–439 states (ZooKeeper), 625–627, 631 status updates for tasks, 190 storage handlers, 499 store functions (Pig Latin), 446 STORE statement (Pig Latin), 434, 435, 465 STORED AS clause (Hive), 498 STORED BY clause (Hive), 499 STREAM statement (Pig Latin), 435, 458 stream.map.input.field.separator property, 219 stream.map.input.ignoreKey property, 218 stream.map.output.field.separator property, 219 stream.non.zero.exit.is.failure property, 193 stream.num.map.output.key.fields property, 219 stream.num.reduce.output.key.fields property, 219 stream.recordreader.class property, 235 stream.reduce.input.field.separator property, 219 stream.reduce.output.field.separator property, 219 Streaming programs about, default job, 218–219 secondary sort, 266–268 task execution, 189 user-defined counters, 255 StreamXmlRecordReader class, 235 StrictHostKeyChecking SSH setting, 296 String class (Java), 115–118, 349 StringTokenizer class (Java), 279 StringUtils class, 111, 453 structured data, subqueries, 508 SUM function (Pig Latin), 446 SWebHdfsFileSystem class, 53 SwiftNativeFileSystem class, 54 SWIM repository, 316 sync operation (ZooKeeper), 
616 syncLimit property, 639 syslog file (Java), 172 system administration commissioning nodes, 334–335 decommissioning nodes, 335–337 HDFS support, 317–329 monitoring, 330–332 routine procedures, 332–334 upgrading clusters, 337–341 System class (Java), 151 system logfiles, 172, 295 T TableInputFormat class, 238, 587 TableMapper class, 588 TableMapReduceUtil class, 588 TableOutputFormat class, 238, 587 tables (HBase) about, 576–578 creating, 583 inserting data into, 583 locking, 578 regions, 578 removing, 584 wide tables, 591 tables (Hive) about, 489 altering, 502 buckets and, 491, 493–495 dropping, 502 external tables, 490–491 importing data, 500–501 managed tables, 490–491 partitions and, 491–493 storage formats, 496–499 views, 509 TABLESAMPLE clause (Hive), 495 TableSource interface, 531 Target interface, 532 task attempt IDs, 164, 203 task attempts page (MapReduce), 169 task counters, 248–250 task IDs, 164, 203 task logs (MapReduce), 172 TaskAttemptContext interface, 191 tasks executing, 189, 203–208, 570 failure considerations, 193 profiling, 175–176 progress and status updates, 190 scheduling in Spark, 569 Spark support, 552 speculative execution, 204–206 streaming, 189 task assignments, 188 tasks page (MapReduce), 169 tasktrackers, 83 TEMPORARY keyword (Hive), 513 teragen program, 315 TeraSort program, 315 TestDFSIO benchmark, 316 testing HBase installation, 582–584 Hive considerations, 473 job drivers, 158–160 MapReduce test runs, 27–30 in miniclusters, 159 running jobs locally on test data, 156–160 running jobs on clusters, 160–175 writing unit tests with MRUnit, 152–156 Text class, 115–118, 121–124, 210 text formats controlling maximum line length, 233 KeyValueTextInputFormat class, 233 NLineInputFormat class, 234 NullOutputFormat class, 239 TextInputFormat class, 232 TextOutputFormat class, 239 XML documents and, 235 TextInputFormat class about, 232 MapReduce types and, 157, 211 Sqoop imports and, 412 TextLoader function (Pig Latin), 446 
TextOutputFormat class, 123, 239, 523 TGT (Ticket-Granting Ticket), 310 thread dumps, 331 Thrift HBase and, 589 Hive and, 479 Parquet and, 375–377 ThriftParquetWriter class, 375 tick time (ZooKeeper), 624 Ticket-Granting Ticket (TGT), 310 timeline servers, 84 TOBAG function (Pig Latin), 440, 446 tojson command, 355 TokenCounterMapper class, 279 TOKENIZE function (Pig Latin), 446 ToLowerFn function, 536 TOMAP function (Pig Latin), 440, 446 Tool interface, 148–152 ToolRunner class, 148–152 TOP function (Pig Latin), 446 TotalOrderPartitioner class, 260 TOTAL_LAUNCHED_MAPS counter, 251 TOTAL_LAUNCHED_REDUCES counter, 251 TOTAL_LAUNCHED_UBERTASKS counter, 251 TOTUPLE function (Pig Latin), 440, 446 TPCx-HS benchmark, 316 transfer rate, TRANSFORM clause (Hive), 503 transformations, RDD, 557–560 Trash class, 308 trash facility, 307 TRUNCATE TABLE statement (Hive), 502 tuning jobs, 175–176 TwoDArrayWritable class, 120 U uber tasks, 187 UDAF class, 514 UDAFEvaluator interface, 514 UDAFs (user-defined aggregate functions), 510, 513–517 UDF class, 512 UDFs (user-defined functions) Hive and, 510–517 Pig and, 424, 447, 448–456 UDTFs (user-defined table-generating functions), 510 Unicode characters, 116–117 UNION statement (Pig Latin), 435, 466 unit tests with MRUnit, 145, 152–156 Unix user accounts, 288 unmanaged application masters, 81 unstructured data, UPDATE statement (Hive), 483 upgrading clusters, 337–341 URL class (Java), 57 user accounts, Unix, 288 user identity, 147 user-defined aggregate functions (UDAFs), 510, 513–517 user-defined functions (see UDFs) user-defined table-generating functions (UDTFs), 510 USING JAR clause (Hive), 512 V VCORES_MILLIS_MAPS counter, 251 VCORES_MILLIS_REDUCES counter, 251 VERSION file, 318 versions (Hive), 472 ViewFileSystem class, 48, 53 views (virtual tables), 509 VIntWritable class, 113 VIRTUAL_MEMORY_BYTES counter, 250, 303 VLongWritable class, 113 volunteer computing, 11 W w (write) permission, 52 Walters, Chad, 576 WAR (Web 
application archive) files, 160 watches (ZooKeeper), 615, 618 Watson, James D., 655 wchc command (ZooKeeper), 606 wchp command (ZooKeeper), 606 wchs command (ZooKeeper), 606 Web application archive (WAR) files, 160 WebHDFS protocol, 54 WebHdfsFileSystem class, 53 webtables (HBase), 575 Wensel, Chris K., 669 Whitacre, Micah, 643 whoami command, 147 WITH SERDEPROPERTIES clause (Hive), 499 work units, 11, 30 workflow engines, 179 workflows (MapReduce) about, 177 Apache Oozie system, 179–184 decomposing problems into jobs, 177–178 JobControl class, 178 Writable interface about, 110–112 class hierarchy, 113–121 Crunch and, 528 implementing custom, 121–125 WritableComparable interface, 112, 258 WritableComparator class, 112 WritableSerialization class, 126 WritableUtils class, 125 write (w) permission, 52 WRITE permission (ACL), 620 WriteSupport class, 373 WRITE_OPS counter, 250 writing data Crunch support, 532 using FileSystem API, 61–63 HDFS data flow, 72–73 Parquet and, 373–377 SequenceFile class, 127–129 X x (execute) permission, 52 XML documents, 235 Y Yahoo!, 13 YARN (Yet Another Resource Negotiator) about, 7, 79, 96 anatomy of application run, 80–83 application lifespan, 82 application master failure, 194 building applications, 82 cluster setup and installation, 288 cluster sizing, 286 daemon properties, 300–303 distributed shell, 83 log aggregation, 172 MapReduce comparison, 83–85 scaling out data, 30 scheduling in, 85–95, 308 Spark and, 571–574 starting and stopping daemons, 291 YARN client mode (Spark), 571 YARN cluster mode (Spark), 573–574 yarn-env.sh file, 292 yarn-site.xml file, 292, 296 yarn.app.mapreduce.am.job.recovery.enable property, 195 yarn.app.mapreduce.am.job.speculator.class property, 205 yarn.app.mapreduce.am.job.task.estimator.class property, 205 yarn.log-aggregation-enable property, 172 yarn.nodemanager.address property, 306 yarn.nodemanager.aux-services property, 300, 687 yarn.nodemanager.bind-host property, 305 
yarn.nodemanager.container-executor.class property, 193, 304, 313 yarn.nodemanager.delete.debug-delay-sec property, 174 yarn.nodemanager.hostname property, 305 yarn.nodemanager.linux-container-executor property, 304 yarn.nodemanager.local-dirs property, 300 yarn.nodemanager.localizer.address property, 306 yarn.nodemanager.log.retain-seconds property, 173 yarn.nodemanager.resource.cpu-vcores property, 301, 303 yarn.nodemanager.resource.memory-mb property, 150, 301 yarn.nodemanager.vmem-pmem-ratio property, 301, 303 yarn.nodemanager.webapp.address property, 306 yarn.resourcemanager.address property about, 300, 305 Hive and, 476 Pig and, 425 yarn.resourcemanager.admin.address property, 305 yarn.resourcemanager.am.max-attempts property, 194, 196 yarn.resourcemanager.bind-host property, 305 yarn.resourcemanager.hostname property, 300, 305, 687 yarn.resourcemanager.max-completed-applications property, 165 yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms property, 195 yarn.resourcemanager.nodes.exclude-path property, 307, 336 yarn.resourcemanager.nodes.include-path property, 307, 335 yarn.resourcemanager.resource-tracker.address property, 305 yarn.resourcemanager.scheduler.address property, 305 yarn.resourcemanager.scheduler.class property, 91 yarn.resourcemanager.webapp.address property, 306 yarn.scheduler.capacity.node-locality-delay property, 95 yarn.scheduler.fair.allocation.file property, 91 yarn.scheduler.fair.allow-undeclared-pools property, 93 yarn.scheduler.fair.locality.threshold.node property, 95 yarn.scheduler.fair.locality.threshold.rack property, 95 yarn.scheduler.fair.preemption property, 94 yarn.scheduler.fair.user-as-default-queue property, 93 yarn.scheduler.maximum-allocation-mb property, 303 yarn.scheduler.minimum-allocation-mb property, 303 yarn.web-proxy.address property, 306 YARN_LOG_DIR environment variable, 172 YARN_RESOURCEMANAGER_HEAPSIZE environment variable, 294 Z zettabytes, znodes about, 606 ACLs and, 619 
creating, 607–609 deleting, 612 ephemeral, 614 joining groups, 609 listing, 610–612 operations supported, 616 persistent, 614 properties supported, 614–615 sequential, 615 ZOOCFGDIR environment variable, 605 ZooKeeper about, 603 additional information, 640 building applications configuration service, 627–630, 634–636 distributed data structures and protocols, 636 resilient, 630–634 consistency and, 621–623 data model, 614 example of, 606–613 failover controllers and, 50 HBase and, 579 high availability and, 49 implementing, 620 installing and running, 604–606 operations in, 616–620 production considerations, 637–640 sessions and, 623–625 states and, 625–627, 631 zxid, 622 Zab protocol, 621

About the Author

Tom White is one of the foremost experts on Hadoop. He has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation. Tom is a software engineer at Cloudera, where he has worked since its foundation on the core distributions from Apache and Cloudera. Previously he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has spoken at many conferences, including ApacheCon, OSCON, and Strata. Tom has a BA in mathematics from the University of Cambridge and an MA in philosophy of science from the University of Leeds, UK. He currently lives in Wales with his family.

Colophon

The animal on the cover of Hadoop: The Definitive Guide is an African elephant. These members of the genus Loxodonta are the largest land animals on Earth (slightly larger than their cousin, the Asian elephant) and can be identified by their ears, which have been said to look somewhat like the continent of Africa. Males stand 12 feet tall at the shoulder and weigh 12,000 pounds, but they can get as big as 15,000 pounds, whereas females stand 10 feet tall and weigh 8,000–11,000 pounds. Even young elephants are very large: at birth, they already weigh approximately 200 pounds and stand about feet tall.

African elephants live throughout sub-Saharan Africa. Most of the continent’s elephants live on savannas and in dry woodlands. In some regions, they can be found in desert areas; in others, they are found in mountains.

The species plays an important role in the forest and savanna ecosystems in which they live. Many plant species are dependent on passing through an elephant’s digestive tract before they can germinate; it is estimated that at least a third of tree species in West African forests rely on elephants in this way. Elephants grazing on vegetation also affect the structure of habitats and influence bush fire patterns. For example, under natural conditions, elephants make gaps through the rainforest, enabling the sunlight to enter, which allows the growth of various plant species. This, in turn, facilitates more abundance and more diversity of smaller animals. As a result of the influence elephants have over many plants and animals, they are often referred to as a keystone species because they are vital to the long-term survival of the ecosystems in which they live.

Many of the animals on O’Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.

The cover image is from the Dover Pictorial Archive. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.



Table of Contents

  • Cover

  • Copyright

  • Table of Contents

  • Foreword

  • Preface

    • Administrative Notes

    • What’s New in the Fourth Edition?

    • What’s New in the Third Edition?

    • What’s New in the Second Edition?

    • Conventions Used in This Book

    • Using Code Examples

    • Safari® Books Online

    • How to Contact Us

    • Acknowledgments

  • Part I. Hadoop Fundamentals

    • Chapter 1. Meet Hadoop

      • Data!

      • Data Storage and Analysis

      • Querying All Your Data

      • Beyond Batch

      • Comparison with Other Systems

        • Relational Database Management Systems

        • Grid Computing

        • Volunteer Computing
