1028 hadoop in practice


Date posted: 06/03/2019, 14:02

Hadoop in Practice
Alex Holmes
Manning, Shelter Island

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact:

Special Sales Department
Manning Publications Co.
20 Baldwin Road, PO Box 261
Shelter Island, NY 11964
Email: orders@manning.com

©2012 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning's policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Development editor: Cynthia Kane
Copyeditors: Bob Herbtsman, Tara Walsh
Proofreader: Katie Tennant
Typesetter: Gordan Salinovic
Illustrator: Martin Murtonen
Cover designer: Marija Tudor

ISBN 9781617290237
Printed in the United States of America

To Michal, Marie, Oliver, Ollie, Mish, and Anch

brief contents

PART 1  BACKGROUND AND FUNDAMENTALS
1 ■ Hadoop in a heartbeat
PART 2  DATA LOGISTICS
2 ■ Moving data in and out of Hadoop
3 ■ Data serialization: working with text and beyond
PART 3  BIG DATA PATTERNS
4 ■ Applying MapReduce patterns to big data
5 ■ Streamlining HDFS for big data
6 ■ Diagnosing and tuning performance problems
PART 4  DATA SCIENCE
7 ■ Utilizing data structures and algorithms
8 ■ Integrating R and Hadoop for statistics and more
9 ■ Predictive analytics with Mahout
PART 5  TAMING THE ELEPHANT
10 ■ Hacking with Hive
11 ■ Programming pipelines with Pig
12 ■ Crunch and other technologies
13 ■ Testing and debugging

contents

preface
acknowledgments
about this book

PART 1  BACKGROUND AND FUNDAMENTALS
1 Hadoop in a heartbeat
  1.1 What is Hadoop?
  1.2 Running Hadoop
  1.3 Chapter summary

PART 2  DATA LOGISTICS
2 Moving data in and out of Hadoop
  2.1 Key elements of ingress and egress
  2.2 Moving data into Hadoop
    TECHNIQUE 1 Pushing system log messages into HDFS with Flume
    TECHNIQUE 2 An automated mechanism to copy files into HDFS
    TECHNIQUE 3 Scheduling regular ingress activities with Oozie
    TECHNIQUE 4 Database ingress with MapReduce
    TECHNIQUE 5 Using Sqoop to import data from MySQL

In conclusion, although Hadoop FUSE sounds like an interesting idea, it's not ready for use in production environments.

B.5 NameNode embedded HTTP

The advantage of using HTTP to access HDFS is that it relieves the burden of having to have the HDFS client code installed on any host that requires access. Further, HTTP is ubiquitous, and many tools and most programming languages have support for HTTP, which makes HDFS that much more accessible.

The NameNode has an embedded Jetty HTTP/HTTPS web server, which the SecondaryNameNode uses to read images and merge them back. It also supports the HFTP filesystem, which utilities such as distCp use to enable cross-cluster copies when Hadoop versions
differ. It supports only a handful of operations, all of them reads (HDFS writes aren't supported). The web server can be seen in figure B.3.

Figure B.3 The NameNode embedded HTTP server

Let's look at a handful of basic filesystem operations that can be made with the embedded HTTP server. The following shows an example of using curl, a client HTTP utility, to perform a directory listing:

    # For this exercise, create a directory in HDFS.
    $ hadoop fs -mkdir /nn-embedded
    # Create a small file in your HDFS directory.
    $ echo "the cat and a mat" | hadoop fs -put - /nn-embedded/test.txt
    # Issue a curl request for the directory listing.
    $ curl \
      "http://localhost:50070/listPaths/nn-embedded?ugi=aholmes,groups"

The response contains a listing element with two child elements: the first represents the directory, and the second represents the file.

Files can be downloaded from the embedded HTTP NameNode server too. In the following example you're downloading the file you uploaded in the last step (/nn-embedded/test.txt):

    $ wget \
      "http://localhost:50070/data/nn-embedded/test.txt?ugi=aholmes,groups" \
      -O test.txt
    $ cat test.txt
    the cat and a mat

That's pretty much all that the embedded HTTP server currently supports in terms of file-level operations. What's interesting about the implementation of this servlet is that it redirects the actual file download to one of the DataNodes that contains the first block of the file. That DataNode then streams the entire file back to the client.

A few other operations can be performed via the HTTP interface, such as fsck, which reports any issues with the filesystem, and contentSummary, which returns statistical information about a directory, such as quota limits, size, and more:

    $ curl "http://localhost:50070/fsck?ugi=aholmes,groups"
    Status: HEALTHY
     Total size: 1087846514 B
     Total dirs: 214
     Total files: 315
     Total blocks (validated): 1007 (avg block size 1080284 B)
     Minimally replicated blocks: 1007 (100.0 %)
     Over-replicated blocks: 0 (0.0 %)
     Under-replicated blocks: 360 (35.749752 %)
     Mis-replicated blocks: 0 (0.0 %)
     Default replication factor:
     Average block replication: 1.0
     Corrupt blocks:
     Missing replicas: 3233 (321.05264 %)
     Number of data-nodes:
     Number of racks:

    $ curl "http://localhost:50070/contentSummary/?ugi=aholmes,groups"

These three operations (listPaths, data, and fsck or contentSummary) can be combined to read information from HDFS, but that's about it.

B.6 HDFS proxy

The HDFS proxy is a component in the Hadoop contrib that provides a web app proxy frontend to HDFS. Its advantages over the embedded HTTP server are an access-control layer and support for multiple Hadoop versions. Its architecture can be seen in figure B.4.

Figure B.4 HDFS proxy architecture

Because the HDFS proxy leverages the embedded HTTP Jetty server in the NameNode, it has the same limitations that you saw in that section, primarily that it can only support file reads. Details about how to install and use the HDFS proxy are available at http://goo.gl/A9dYc.

B.7 Hoop

Hoop is a REST, JSON-based HTTP/HTTPS server that provides access to HDFS, as seen in figure B.5. Its advantage over the current Hadoop HTTP interface is that it supports writes as well as reads. It's a project created by Cloudera as a full replacement for the existing Hadoop HTTP service, and it's planned for contribution into Hadoop. Hoop will be included in the 2.x Hadoop release (see https://issues.apache.org/jira/browse/HDFS-2178).
Figure B.5 Hoop architecture

Installation is documented at https://github.com/cloudera/hoop. After you have the Hoop server up and running, it's simple to perform filesystem commands via curl. All the REST operations are documented at http://cloudera.github.com/hoop/docs/latest/HttpRestApi.html.

Let's go through a sequence of basic filesystem manipulations, where you create a directory, write a file, list the directory contents, and then delete the directory:

    # Create a directory, /hoop-test.
    $ curl -X POST \
      "http://localhost:14000/hoop-test?op=mkdirs&user.name=aholmes"
    {"mkdirs":true}

    # Write a local file, /tmp/example.txt, into /hoop-test/example.txt.
    $ url="http://localhost:14000/hoop-test/example.txt"
    $ url=$url"?op=create&user.name=aholmes"
    $ curl -X POST "$url" \
      --data-binary @/tmp/example.txt \
      --header "content-type: application/octet-stream"

    # Perform a directory listing of /hoop-test, which shows the file you just created.
    $ curl -i \
      "http://localhost:14000/hoop-test?op=list&user.name=aholmes"
    [{
      "path":"http:\/\/cdh:14000\/hoop-test\/example.txt",
      "isDir":false,
      "len":23,
      "owner":"aholmes",
      "group":"supergroup",
      "permission":"-rw-r--r--",
      "accessTime":1320014062728,
      "modificationTime":1320014062728,
      "blockSize":67108864,
      "replication":3
    }]

    # Perform a recursive delete of /hoop-test.
    $ url="http://localhost:14000/hoop-test"
    $ url=$url"?op=delete&recursive=true&user.name=aholmes"
    $ curl -X DELETE "$url"
    {"delete":true}

Hoop, as with any HDFS proxy, adds hops between a client and HDFS and circumvents the data locality features available when using the Java HDFS client. But it's a huge improvement over the HDFS proxy, primarily because its use of the Java HDFS client lets it support writes.

B.8 WebHDFS

WebHDFS, which is included in Hadoop versions 1.x and 2.x, is a whole new API in Hadoop providing REST/HTTP read/write access to
HDFS. Figure B.6 shows that it coexists alongside the existing HDFS HTTP services.

Figure B.6 WebHDFS architecture

You'll use WebHDFS to create a directory, write a file to that directory, and finally remove the directory and its contents. WebHDFS may be turned off by default; to enable it you may have to set dfs.webhdfs.enabled to true in hdfs-site.xml and restart HDFS.

Your first step is to create a directory, /whdfs, in HDFS. Table B.1 shows the optional parameters that can be supplied.

Table B.1 WebHDFS optional arguments for directory creation

Option       Description
permission   The octal code of the directory permission. For example, the default in HDFS is the three-digit octal 755, equivalent to -rwxr-xr-x.

BASH SHELL, URLS, AND AMPERSANDS
Take care when working with URLs in bash. The ampersand (&) in bash is a control character that's used to launch a process in the background. Because URLs frequently contain ampersands, it's best to always enclose them in double quotes.

You'll create your directory without specifying the optional permissions:

    $ curl -i -X PUT "http://localhost:50070/webhdfs/v1/whdfs?op=MKDIRS"
    HTTP/1.1 200 OK
    Content-Type: application/json
    Transfer-Encoding: chunked
    Server: Jetty(6.1.26)

    {"boolean":true}

Next you'll create a file called test.txt under your newly created /whdfs directory. Table B.2 lists the options available when creating the file.

Table B.2 WebHDFS optional arguments for file creation

Option       Description
overwrite    What the action should be if a file already exists with the same name. Valid values are true or false.
blocksize    The HDFS block size for the file, in bytes.
replication  The replication count for the file blocks.
permission   The octal code of the file permission. For example, the default for files is the three-digit octal 644, equivalent to -rw-r--r--.
buffersize   The internal buffer size when streaming writes to other DataNodes.

Again, you'll run the command without any optional arguments. Creation of a file is a two-step process. You first need to communicate your intent to create a file to the NameNode. The NameNode replies with an HTTP redirect to a DataNode URL, which you must use to actually write the file content:

    # Notify the NameNode of your intent to create a file.
    $ curl -i -X PUT \
      "http://localhost:50070/webhdfs/v1/whdfs/test.txt?op=CREATE"
    # The response is an HTTP temporary redirect whose Location field contains
    # the DataNode URL to be used for the actual write of the file.
    HTTP/1.1 307 TEMPORARY_REDIRECT
    Location: http://localhost.localdomain:50075/webhdfs/v1/whdfs/test.txt?op=CREATE&user.name=webuser&overwrite=false
    Content-Type: application/json
    Content-Length:
    Server: Jetty(6.1.26)

    # Create a small file on your local filesystem.
    $ echo "the cat sat on the mat" > /tmp/test.txt

    # Construct the DataNode URL (it's too long to fit on a single line in the book).
    $ url="http://localhost.localdomain:50075/webhdfs/v1/whdfs/test.txt"
    $ url=$url"?op=CREATE&user.name=webuser&overwrite=false"

    # Write the file content to the DataNode.
    $ curl -i -X PUT -T /tmp/test.txt "$url"
    HTTP/1.1 100 Continue

    HTTP/1.1 201 Created
    Location: webhdfs://
    Content-Type: application/json
    Content-Length:
    Server: Jetty(6.1.26)

    # Use the HDFS cat command to view the contents of the file.
    $ hadoop fs -cat /whdfs/test.txt
    the cat sat on the mat

APPEND works in the same way, first getting the DataNode URL from the NameNode, and then sending the appended data to the DataNode. The options for APPEND are the same as for the creation operation; refer to table B.2 for more details:

    $ curl -i -X POST \
      "http://localhost:50070/webhdfs/v1/whdfs/test.txt?op=APPEND"
    HTTP/1.1 307 TEMPORARY_REDIRECT
    Location: http://localhost.localdomain:50075/webhdfs/v1/whdfs/test.txt?op=APPEND&user.name=webuser
    Content-Type: application/json
    Content-Length:
    Server: Jetty(6.1.26)

    $ url="http://localhost.localdomain:50075/webhdfs/v1/whdfs/test.txt"
    $ url=$url"?op=APPEND&user.name=webuser"
    $ curl -i -X POST -T /tmp/test.txt "$url"
    HTTP/1.1 100 Continue

    HTTP/1.1 200 OK

    $ hadoop fs -cat /whdfs/test.txt
    the cat sat on the mat
    the cat sat on the mat

Your next operation is to perform a directory listing of your directory:

    $ curl -i "http://localhost:50070/webhdfs/v1/whdfs?op=LISTSTATUS"
    {
      "HdfsFileStatuses": {
        "HdfsFileStatus": [
          {
            "accessTime":1322410385692,
            "blockSize":67108864,
            "group":"supergroup",
            "isDir":false,
            "isSymlink":false,
            "len":23,
            "localName":"test.txt",
            "modificationTime":1322410385700,
            "owner":"webuser",
            "permission":"644",
            "replication":1
          }
        ]
      }
    }

A file status operation returns some statistics about a file or directory:

    $ curl -i \
      "http://localhost:50070/webhdfs/v1/whdfs/test.txt?op=GETFILESTATUS"
    {
      "HdfsFileStatus": {
        "accessTime":1322410385692,
        "blockSize":67108864,
        "group":"supergroup",
        "isDir":false,
        "isSymlink":false,
        "len":23,
        "localName":"",
        "modificationTime":1322410385700,
        "owner":"webuser",
        "permission":"644",
        "replication":1
      }
    }

Finally, you'll recursively remove the whdfs directory:

    $ curl -i -X DELETE \
      "http://localhost:50070/webhdfs/v1/whdfs?op=DELETE&recursive=true"
    HTTP/1.1 200 OK
    Content-Type: application/json
    Transfer-Encoding: chunked
    Server: Jetty(6.1.26)

    {"boolean":true}

WebHDFS is a big step forward for HDFS in allowing rich client-side access to HDFS via HTTP.

B.9 Distributed copy

Hadoop has a command-line tool for copying data between Hadoop clusters called distCp. It performs the copy in a MapReduce job, where the mappers copy from one filesystem to another.
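To make the division of labor among mappers concrete, here is a minimal sketch in plain Java, with no Hadoop dependency, of how a list of files to copy might be shared out so that each mapper copies a similar number of bytes. The class name CopyListPartitioner, the assign method, and the greedy least-loaded heuristic are all assumptions made for illustration; this is not distCp's actual code, which lives inside Hadoop and may divide the work differently.

```java
import java.util.Arrays;

/** Illustrative only: shares a copy list among mappers by file size. */
public class CopyListPartitioner {

    /**
     * Returns, for each file, the index of the mapper that would copy it.
     * Each file goes to the mapper with the fewest bytes assigned so far
     * (ties go to the lowest mapper index).
     */
    public static int[] assign(long[] fileSizes, int mappers) {
        if (mappers <= 0) {
            throw new IllegalArgumentException("mappers must be positive");
        }
        int[] assignment = new int[fileSizes.length];
        long[] load = new long[mappers]; // bytes assigned to each mapper
        for (int f = 0; f < fileSizes.length; f++) {
            int target = 0;
            for (int m = 1; m < mappers; m++) {
                if (load[m] < load[target]) {
                    target = m;
                }
            }
            assignment[f] = target;
            load[target] += fileSizes[f];
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Three files of 100, 50, and 50 bytes shared between two mappers.
        System.out.println(Arrays.toString(assign(new long[] {100, 50, 50}, 2)));
        // prints [0, 1, 1]
    }
}
```

In this toy run the first mapper copies the single large file and the second copies the two smaller ones, so both move 100 bytes.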
The following example shows a copy within the same cluster. To copy between clusters running the same Hadoop version, change the URLs to point to the source and destination NameNode URLs:

    $ hadoop fs -mkdir /distcp-source
    $ echo "the cat sat on the mat" | hadoop \
      fs -put - /distcp-source/test.txt
    $ hadoop distcp hdfs://localhost:8020/distcp-source \
      hdfs://localhost:8020/distcp-dest
    $ hadoop fs -cat /distcp-dest/test.txt
    the cat sat on the mat

One of the useful characteristics of distCp is that it can copy between multiple versions of Hadoop. To support this, it uses the NameNode and DataNode HTTP interfaces to read data from the source cluster. Because the Hadoop HTTP interfaces don't support writes, when you're running distCp between clusters of differing versions, you must run it on the destination cluster. Notice in the following example that the source argument uses hftp as the scheme and the NameNode's HTTP port:

    $ hadoop distcp hftp://source-nn:50070/distcp-source \
      hdfs://localhost:8020/distcp-dest

Because Hadoop versions 1.x and 2.x offer the WebHDFS HTTP interfaces, which support writes, there will no longer be any restriction on which cluster distCp must run on. distCp does support FTP as a source, but unfortunately not HTTP.

B.10 WebDAV

Web-based Distributed Authoring and Versioning (WebDAV) is a series of HTTP methods that offer file collaboration facilities, as defined in RFC 4918 (HTTP Extensions for Web Distributed Authoring and Versioning (WebDAV)). A JIRA ticket (HDFS-225) was created in 2006 to add this capability to HDFS, but it hasn't yet been committed to any HDFS release. A GitHub project at https://github.com/huyphan/HDFS-over-Webdav claims to have WebDAV running against Hadoop 0.20.1.

B.11 MapReduce

MapReduce is a great mechanism to get data into HDFS. Unfortunately, other than distCp, there's no other built-in mechanism to ingest data from external sources. Let's look at how to write a MapReduce job to pull data from an HTTP endpoint:
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URL;
    import java.net.URLConnection;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public final class HttpDownloadMap
        implements Mapper<LongWritable, Text, Text, Text> {

      public static final String CONN_TIMEOUT =
          "httpdownload.connect.timeout.millis";
      public static final String READ_TIMEOUT =
          "httpdownload.read.timeout.millis";

      // The field declarations aren't shown in this excerpt; the two
      // timeout defaults below are assumed values.
      private JobConf conf;
      private String jobOutputDir;
      private String taskId;
      private int connTimeoutMillis = 5000;
      private int readTimeoutMillis = 5000;
      private int file = 0;

      @Override
      public void configure(JobConf job) {
        conf = job;
        // Get the job output directory in HDFS.
        jobOutputDir = job.get("mapred.output.dir");
        // Get the job's task ID, which is unique across all the tasks.
        taskId = conf.get("mapred.task.id");
        // Get the connection timeout, or use a default if not supplied.
        if (conf.get(CONN_TIMEOUT) != null) {
          connTimeoutMillis = Integer.valueOf(conf.get(CONN_TIMEOUT));
        }
        // Get the read timeout, or use a default if not supplied.
        if (conf.get(READ_TIMEOUT) != null) {
          readTimeoutMillis = Integer.valueOf(conf.get(READ_TIMEOUT));
        }
      }

      @Override
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output,
                      Reporter reporter) throws IOException {
        // Create the path to the file that will hold the URL contents. The
        // file goes in the job output directory, and its name combines the
        // unique task ID with a counter (because the map task can be called
        // with multiple URLs).
        Path httpDest =
            new Path(jobOutputDir, taskId + "_http_" + (file++));

        InputStream is = null;
        OutputStream os = null;
        try {
          // Create a connection object and set its timeouts.
          URLConnection connection =
              new URL(value.toString()).openConnection();
          connection.setConnectTimeout(connTimeoutMillis);
          connection.setReadTimeout(readTimeoutMillis);
          is = connection.getInputStream();

          // Copy the contents of the HTTP body into the HDFS file.
          os = FileSystem.get(conf).create(httpDest);
          IOUtils.copyBytes(is, os, conf, true);
        } finally {
          IOUtils.closeStream(is);
          IOUtils.closeStream(os);
        }

        // Emit the location of the URL file in HDFS, as well as the URL that
        // was downloaded, so that they can be correlated.
        output.collect(new Text(httpDest.toString()), value);
      }

      @Override
      public void close() throws IOException {
        // Nothing to clean up; required by the Mapper interface.
      }
    }

You can run the MapReduce job and examine the contents of HDFS after it completes:

    # Create a file in HDFS containing a list of URLs you want to download.
    $ echo "http://www.apache.org/dist/avro/KEYS
    http://www.apache.org/dist/maven/KEYS" | \
      hadoop fs -put - /http-lines.txt

    # Verify the contents of the file.
    $ hadoop fs -cat /http-lines.txt
    http://www.apache.org/dist/avro/KEYS
    http://www.apache.org/dist/maven/KEYS

    # Run your MapReduce job, specifying the input file and the output directory.
    $ bin/run.sh com.manning.hip.ch2.HttpDownloadMapReduce \
      /http-lines.txt /http-download

    # List the contents of the job directory after the job completes.
    $ hadoop fs -ls /http-download
    /http-download/_SUCCESS
    /http-download/_logs
    /http-download/part-m-00000
    /http-download/task_201110301822_0008_m_000000_http_0
    /http-download/task_201110301822_0008_m_000000_http_1

    # View the metadata file of one of the mappers. The first field in the file is
    # the HDFS location of the URL, and the second field is the URL you downloaded.
    $ hadoop fs -cat /http-download/part-m-00000
    /http-download/task_201110301822_0008_m_000000_http_0   http://www
    /http-download/task_201110301822_0008_m_000000_http_1   http://www

    # View the contents of one of the filenames contained in part-m-00000.
    $ hadoop fs -cat /http-download/
    pub 1024D/A7239D59 2005-10-12
    Key fingerprint = 4B96 409A 098D BD51 1DF2 BC18 DBAF 69BE A723 9D59
    uid Doug Cutting (Lucene guy)

A few notes about your implementation:

■ It's speculative-execution safe, as opposed to distCp, because you always write output based on the task attempt.
■ If you want multiple mappers to be run, then simply create separate input files, and each one will be processed by a separate mapper.
■ The connection and read timeouts can be controlled via the httpdownload.connect.timeout.millis and httpdownload.read.timeout.millis configuration settings, respectively.

We've gone through a number of Hadoop built-in mechanisms to read and write data into HDFS. If you were to use them, you'd have to write some scripts or code to
manage the process of the ingress and egress because all the topics we covered are low-level.

appendix C
HDFS dissected

If you're using Hadoop you should have a solid understanding of HDFS so that you can make smart decisions about how to manage your data. In this appendix we'll walk through how HDFS reads and writes files to help you better understand how HDFS works behind the scenes.

C.1 What is HDFS?

HDFS is a distributed filesystem modeled on the Google File System (GFS), details of which were published in a 2003 paper (see http://research.google.com/archive/gfs.html). Google's paper highlighted a number of key architectural and design properties, the most interesting of which included optimizations to reduce network input/output (I/O), how data replication should occur, and overall system availability and scalability. Not many details about GFS are known beyond those published in the paper, but HDFS is a near clone of GFS, as described by the Google paper. (They differ from an implementation language perspective; GFS is written in C and HDFS is written mostly in Java, although critical parts are written in C.)

HDFS is an optimized filesystem for streaming reads and writes. It was designed to avoid the overhead of network I/O and disk seeks by introducing the notion of data locality (the ability to read/write data that's closest to the client), and by using large block sizes. Files in HDFS are stored across one or more blocks, and each block is typically 64 MB or larger. Blocks are replicated across multiple hosts in your cluster (as shown in figure C.1) to help with availability and fault tolerance.

Figure C.1 An example of a single file occupying three blocks and the block distribution and replication across multiple HDFS storage hosts

HDFS is also a checksummed filesystem. Each block has a checksum associated with it, and if a discrepancy between the checksum and
the block contents is detected (as in the case of bit rot), this information is sent to the HDFS master. The HDFS master coordinates the creation of a new replica of the bad block as well as the deletion of the corrupted block.

BEYOND 0.20.X
The 2.x release will eventually include High Availability (HA) support for the NameNode. Also included in 2.x are a Backup Node and a Checkpoint Node, which serve as replacements for the SecondaryNameNode (although the SNN still exists). The Checkpoint Node performs the same functions as the SecondaryNameNode: it downloads the current image file and subsequent edits from the NameNode, merges them together to create a new image file, and then uploads the new image to the NameNode. The Backup Node is a superset of the Checkpoint Node, providing the same checkpointing mechanism as well as acting as a NameNode in its own right. This means that if your primary NameNode goes down, you can immediately start using your Backup NameNode.

C.2 How HDFS writes files

A look at how HDFS writes files will bootstrap your HDFS knowledge and help you make smart decisions about your data, your cluster, and how you work with your data. The first step is to use the command-line interface (CLI) to copy a file from local disk to HDFS:

    $ hadoop fs -put /etc/hadoop/conf/hdfs-site.xml /tmp/hdfs-site.xml

Now let's look at how to achieve the same effect using Java:

    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class StreamToHdfs {
      public static void main(String[] args) throws Exception {
        // Initialize a new configuration object. By default it loads
        // core-default.xml and core-site.xml from the classpath.
        Configuration config = new Configuration();

        // Get a handle to the filesystem, using the default configuration.
        // This is most likely to be HDFS.
        FileSystem hdfs = FileSystem.get(config);

        // Create a stream for the file in HDFS. This involves a round-trip
        // communication with the NameNode to determine the set of DataNodes
        // that will be used to write to the first block in HDFS.
        OutputStream os = hdfs.create(new Path(args[0]));

        // Copy the contents of standard input to the file in HDFS. As each
        // block is filled, the NameNode is contacted to determine the next
        // set of DataNodes for the next block.
        IOUtils.copyBytes(System.in, os, config, true);
        IOUtils.closeStream(os);
      }
    }

The previous code example is for illustrative purposes only; in real life you'd probably replace this code with a call to Hadoop's utility class, org.apache.hadoop.fs.FileUtil, which contains a copy method for copying files, in addition to a number of other common filesystem operations.

DON'T FORGET TO SET YOUR CLASSPATH
If you run the previous example and don't include the Hadoop configuration directory in your classpath, Hadoop uses default settings for all of its configuration. By default fs.default.name is set to file:///, which means the local filesystem will be used for storage, not HDFS.

Now that you know how to copy a file using the CLI and Java, let's look at what HDFS is doing behind the scenes. Figure C.2 shows the components and how they interact when you write a file in HDFS.

DETERMINING THE FILESYSTEM
I mentioned earlier that the filesystem is abstracted, so Hadoop's first action is to figure out the underlying filesystem that should be used to perform the write. This is determined by examining the configured value for fs.default.name, which is a URI, and extracting the scheme. In the case of an HDFS filesystem, the value for fs.default.name would look something like hdfs://namenode:9000, so the scheme is hdfs. When the scheme is in hand, an instance of the concrete filesystem is created by reading the configuration value for fs.[scheme].impl, where [scheme] is replaced by the scheme, which in this example is hdfs. If you look at core-default.xml, you'll see that fs.hdfs.impl is set to org.apache.hadoop.hdfs.DistributedFileSystem, so that's the concrete filesystem you'll use.

Figure C.2 HDFS write data flow
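The scheme lookup described under "Determining the filesystem" can be sketched in a few lines of plain Java. This is a simplified illustration rather than Hadoop's own code: the class FileSystemResolver and its methods are hypothetical, and an ordinary Map stands in for Hadoop's Configuration object. Only the property names (fs.default.name, fs.[scheme].impl), the file:/// default, and the example values come from the text above.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: mimics how a filesystem class name is chosen from configuration. */
public class FileSystemResolver {

    /** Builds the fs.[scheme].impl key from a filesystem URI such as hdfs://namenode:9000. */
    public static String implKey(String defaultFsUri) {
        String scheme = URI.create(defaultFsUri).getScheme();
        return "fs." + scheme + ".impl";
    }

    /**
     * Looks up the implementation class name for the configured default
     * filesystem. Falls back to file:/// when fs.default.name is not set,
     * and returns null when no implementation is configured for the scheme.
     */
    public static String resolve(Map<String, String> conf) {
        String uri = conf.getOrDefault("fs.default.name", "file:///");
        return conf.get(implKey(uri));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("fs.default.name", "hdfs://namenode:9000");
        conf.put("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");

        System.out.println(implKey(conf.get("fs.default.name"))); // prints fs.hdfs.impl
        System.out.println(resolve(conf));
        // prints org.apache.hadoop.hdfs.DistributedFileSystem
    }
}
```

The same lookup explains the classpath warning above: with no configuration loaded, the URI falls back to file:///, the scheme becomes file, and the local filesystem is selected instead of HDFS.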