Machine Learning with Spark: Develop Intelligent Machine Learning Systems with Spark 2.x, 2nd Edition

Machine Learning with Spark
Second Edition
Develop intelligent machine learning systems with Spark 2.x
Rajdeep Dua
Manpreet Singh Ghotra
Nick Pentreath
BIRMINGHAM - MUMBAI

Machine Learning with Spark, Second Edition
Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2015
Second edition: April 2017
Production reference: 1270417

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK
ISBN 978-1-78588-993-6
www.packtpub.com

Credits

Authors: Rajdeep Dua, Manpreet Singh Ghotra, Nick Pentreath
Copy Editors: Safis Editing, Sonia Mathur
Reviewer: Brian O'Neill
Project Coordinator: Vaidehi Sawant
Commissioning Editor: Akram Hussain
Proofreader: Safis Editing
Acquisition Editor: Tushar Gupta
Indexer: Francy Puthiry
Content Development Editor: Deepika Naik
Production Coordinator: Rohit Kumar Singh
Technical Editors: Nirant Carvalho, Kunal Mali

About the Authors

Rajdeep Dua has over 16 years of experience in the Cloud and Big Data space. He worked in the advocacy team for Google's big data tools, BigQuery. He worked on the Greenplum big data platform at VMware in the
developer evangelist team. He also worked closely with a team on porting Spark to run on VMware's public and private cloud as a feature set. He has taught Spark and Big Data at some of the most prestigious tech schools in India: IIIT Hyderabad, ISB, IIIT Delhi, and College of Engineering Pune.

Currently, he leads the developer relations team at Salesforce India. He also works with the data pipeline team at Salesforce, which uses Hadoop and Spark to expose big data processing tools for developers.

He has published Big Data and Spark tutorials at http://www.clouddatalab.com. He has also presented BigQuery and Google App Engine at the W3C conference in Hyderabad (http://wwwconference.org/proceedings/www2011/schedule/www2011_Program.pdf). He led the developer relations teams at Google, VMware, and Microsoft, and he has spoken at hundreds of other conferences on the cloud. Some of the other references to his work can be seen at http://yourstory.com/2012/06/vmware-hires-rajdeep-dua-to-lead-the-developer-relations-in-india/ and http://dl.acm.org/citation.cfm?id=2624641.

His contributions to the open source community are related to Docker, Kubernetes, Android, OpenStack, and Cloud Foundry. You can connect with him on LinkedIn at https://www.linkedin.com/in/rajdeepd.

Manpreet Singh Ghotra has more than 12 years of experience in software development for both enterprise and big data software. He is currently working on developing a machine learning platform using Apache Spark at Salesforce. He has worked on a sentiment analyzer using the Apache stack and machine learning.

He was part of the machine learning group at one of the largest online retailers in the world, working on transit time calculations using Apache Mahout and the R Recommendation system using Apache Mahout.

With a master's and postgraduate degree in machine learning, he has contributed to and worked for the machine learning community. His GitHub profile is https://github.com/badlogicmanpreet and you can find him on LinkedIn
at https://in.linkedin.com/in/msghotra.

Nick Pentreath has a background in financial markets, machine learning, and software development. He has worked at Goldman Sachs Group, Inc., as a research scientist at the online ad targeting start-up, Cognitive Match Limited, London, and led the data science and analytics team at Mxit, Africa's largest social network.

He is a cofounder of Graphflow, a big data and machine learning company focused on user-centric recommendations and customer intelligence. He is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add value to the bottom line.

Nick is a member of the Apache Spark Project Management Committee.

About the Reviewer

Brian O'Neill is the principal architect at Monetate, Inc. Monetate's personalization platform leverages Spark and machine learning algorithms to process millions of events per second, leveraging real-time context and analytics to create personalized brand experiences at scale. Brian is a perennial Datastax Cassandra MVP and has also won InfoWorld's Technology Leadership award. Previously, he was CTO for Health Market Science (HMS), now a LexisNexis company. He is a graduate of Brown University and holds patents in artificial intelligence and data management.

Prior to this publication, Brian authored a book on distributed computing, Storm Blueprints: Patterns for Distributed Real-time Computation, and contributed to Learning Cassandra for Administrators.

All the thanks in the world to my wife, Lisa, and my sons, Collin and Owen, for their understanding, patience, and support. They know all my shortcomings and love me anyway. Together always and forever, I love you more than you know and more than I will ever be able to express.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files
available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://goo.gl/5LgUpI.

If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Table of Contents

Preface

Chapter 1: Getting Up and Running with Spark
Installing and setting up Spark locally
Spark clusters
The Spark programming model
SparkContext and SparkConf
SparkSession
The Spark shell
Resilient Distributed Datasets
Creating RDDs
Spark operations
Caching RDDs
Broadcast variables and accumulators
SchemaRDD
Spark data frame
The first step to a Spark program in Scala
The first step to a Spark program in Java
The first step to a Spark program in Python
The first step to a Spark program in R
SparkR DataFrames
Getting Spark running on Amazon EC2
Launching an EC2 Spark cluster
Configuring and running Spark on Amazon Elastic Map Reduce
UI in Spark
Supported machine learning algorithms by Spark
Benefits of using Spark ML as compared to existing libraries
Spark Cluster on Google Compute Engine - DataProc
Hadoop and Spark Versions
Creating a Cluster
Submitting a Job
Summary

Chapter 2: Math for Machine Learning
Linear algebra
Setting up the Scala environment in Intellij
...
Cross-validation
Summary

Chapter 7: Building...
Iterations
MaxBins
Summary

Chapter 8: Building a Clustering Model with Spark
Types of clustering models
k-means clustering
...
Summary

Chapter 6: Building a Classification Model with Spark

Contents

Cover

Credits

About the Authors

About the Reviewer

www.PacktPub.com

Customer Feedback

Table of Contents

Preface

Chapter 1: Getting Up and Running with Spark

Installing and setting up Spark locally

Spark clusters

The Spark programming model

SparkContext and SparkConf

SparkSession

The Spark shell

Resilient Distributed Datasets

Creating RDDs

Spark operations

Caching RDDs

Broadcast variables and accumulators

SchemaRDD

Spark data frame

The first step to a Spark program in Scala
