Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update

by Stephen F. Elston

Copyright © 2015 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles ( http://safaribooksonline.com ). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt
Production Editor: Nicholas Adams
Proofreader: Nicholas Adams
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

September 2015: First Edition

Revision History for the First Edition
2015-09-01: First Release
2015-11-21: Second Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-93634-4
[LSI]

Chapter 1. Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update

Introduction

This report covers the basics of manipulating data, constructing models, and evaluating models in the Microsoft Azure Machine Learning platform (Azure ML). The Azure ML platform has greatly simplified the development and deployment of machine learning models, with easy-to-use and powerful cloud-based data transformation and machine learning tools.

In this report, we’ll explore extending Azure ML with the R language. (A companion report explores extending Azure ML using the Python language.) All of the concepts we will cover are illustrated with a data science example, using a bicycle rental demand dataset. We’ll perform the required data manipulation, or data munging. Then, we will construct and evaluate regression models for the dataset.

You can follow along by downloading the code and data provided in the next section. Later in the report, we’ll discuss publishing your trained models as web services in the Azure cloud.

Before we get started, let’s review a few of the benefits Azure ML provides for machine learning solutions:

- Solutions can be quickly and easily deployed as web services.
- Models run in a highly scalable and secure cloud environment.
- Azure ML is integrated with the powerful Microsoft Cortana Analytics Suite, which includes massive storage and processing capabilities. It can read data from and write data to Cortana storage at significant volume. Azure ML can even be employed as the analytics engine for other components of the Cortana Analytics Suite.
- Machine learning algorithms and data transformations are extendable using the R language, for solution-specific functionality.
- Rapidly operationalized analytics are written in the R and Python languages.
- Code and data are maintained in a secure cloud environment.

Downloads

For our example, we will be using the Bike Rental UCI dataset. This data is preloaded in the Azure ML Studio environment, or you can download it as a .csv file from the UCI website. The reference for this data is Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge,” Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.

The R code for our example can be found on GitHub.

Working Between Azure ML and RStudio

Azure ML is a production environment. It is ideally suited to publishing machine learning models. In contrast, Azure ML is not a particularly good development environment. In general, you will find it easier to perform preliminary editing, testing, and debugging in RStudio. In this way, you take advantage of the powerful development resources and perform your final testing in Azure ML. Downloads for R and RStudio are available for Windows, Mac, and Linux.

This report assumes the reader is familiar with the basics of R. If you are not familiar with using R in Azure ML, check out the Quick Start Guide to R in AzureML.

The R source code for the data science example in this report can be run in either Azure ML or RStudio. Read the comments in the source files to see the changes required to work between these two environments.

As in the Training code, the test data set appears in the Score module with the name dataset. This data structure is immutable. The first thing we do is copy it into a new data frame. The vector of predicted values is placed into a data frame named scores, which will be output. Note that the functions as.POSIXct and char.toPOSIXct must be in the full code listing, since the Create R Model module cannot import from a zip file. The code for these functions is simply copied from the utilities.R file into the code for the Create R Model module.

The box plots of the residuals are shown in Figures 1-36 and 1-37.

Figure 1-36. Box plot of residuals by hour of the day for the R model

Figure 1-37. Box plot of residuals by workTime for the R model

We can see from these residual plots that this particular model does not perform well when compared with the native Azure ML Decision Forest model. The largest residuals are at peak hours, which is quite undesirable.

Cross Validation

Let’s test the performance of our better model in depth. We’ll use the Azure ML Cross Validation module. In summary, cross validation resamples the data set multiple times, and recomputes and re-scores the model each time. This procedure provides multiple estimates of model performance. These estimates are averaged to produce a more reliable performance estimate.

The updated experiment is shown in Figure 1-38. An additional Project Columns module has been added. This module removes the additional scoring and standard deviation columns, allowing the Execute R Script module, with the evaluation code, to operate correctly.

Figure 1-38. Experiment with Cross Validation module added

After running the experiment, the output of the Evaluation Results by Fold is shown in Figure 1-39.

Figure 1-39. The results by fold of the model cross validation

The box plots of the residuals by hour and by workTime are shown in Figures 1-40 and 1-41.

Figure 1-40. Residuals from the model cross validation by hour of the day

Figure 1-41. Residuals from the model cross validation by workTime

These results look quite good. In fact, they seem unexpectedly good. It is possible the cross validation is producing overly optimistic results. Perhaps the effect of the outliers is being averaged out. Nonetheless, these cross validation results are good.

Some Possible Next Steps

It is always possible to do more when refining a predictive model. The question must always be: is it worth the effort for the possible improvement?

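To make the cross validation procedure from the previous section concrete, here is a minimal sketch of the general idea: split the rows into folds, refit the model with each fold held out, score the held-out rows, and average the per-fold errors. It is written in Python (which the report notes Azure ML also supports) using only the standard library. It is not the implementation of the Azure ML Cross Validation module. The k-fold partitioning, the simple straight-line model, the function names, and the synthetic data are all assumptions made for illustration.

```python
import random
import statistics


def k_fold_indices(n_rows, n_folds, seed=0):
    """Shuffle row indices and deal them into n_folds nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validate(xs, ys, n_folds=5, seed=0):
    """Return one root mean squared error (RMSE) per held-out fold."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), n_folds, seed):
        held_out = set(test_idx)
        # Refit the model on every row that is not in the held-out fold.
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        intercept, slope = fit_line(train_x, train_y)
        # Score only the held-out rows, which the model has not seen.
        squared = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in test_idx]
        fold_errors.append(statistics.fmean(squared) ** 0.5)
    return fold_errors


# Synthetic data for illustration only; this is NOT the bike rental dataset.
rng = random.Random(42)
xs = [float(i) for i in range(100)]
ys = [3.0 + 2.0 * x + rng.gauss(0.0, 5.0) for x in xs]

fold_rmse = cross_validate(xs, ys, n_folds=5)
mean_rmse = statistics.fmean(fold_rmse)
print("RMSE by fold:", [round(e, 2) for e in fold_rmse])
print("Mean RMSE:", round(mean_rmse, 2))
```

The spread of the per-fold errors is as informative as their mean: a single fold with a much larger error hints that a few outlying rows dominate the result, which the average alone can hide.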
The median performance of the Decision Forest Regression model is fairly good. However, there are some significant outliers in the residuals. Thus, some additional effort is probably justified before either model is put into production.

There is a lot to think about when trying to improve the results. We could consider several possible next steps, including the following:

- Understand the source of the residual outliers. We have not investigated whether there are systematic sources of these outliers. Are there certain ranges of predictor variable values that give these erroneous results? Do the outliers correspond to exogenous events, such as parades and festivals, failures of other public transit, holidays that are not indicated as non-working days, etc.? Such an investigation will require additional data. Or, are these outliers a sign of overfitting?
- Perform additional feature engineering. We have tried a few obvious new features with some success, but there is no reason to think this process has run its course. Perhaps another time axis transformation, which orders the hour-to-hour variation in demand, would perform better. Some moving averages might reduce the effects of the outliers.
- Prune features to prevent overfitting. Overfitting is a major source of poor model performance. As noted earlier, we have pruned some features. Perhaps a different pruning of the features would give a better result.
- Change the quantile of the outlier filter. We arbitrarily chose the 0.20 quantile, but another value might easily give better performance. It is also possible that some other type of filter might help.
- Try some other models. Azure ML has a number of other nonlinear regression modules. Further, we have tried only one of many possible R models and packages.

Publishing a Model as a Web Service

Now that we have a reasonably good model, we can publish it as a web service. Publishing an Azure ML experiment as a web service is remarkably easy. As illustrated in Figure 1-42, simply push the Set Up Web Service button on the right-hand side of the tool bar at the bottom of the studio window. Then select Predictive Web Service.

Figure 1-42. The Set Up Web Service button in Azure ML studio

Once this button has been pushed, a scoring experiment is automatically created, as illustrated in Figure 1-43. Unnecessary modules are eliminated, and the web services input and output modules are added automatically.

Figure 1-43. The scoring experiment with web services input and output modules

Clicking the Web Services icon on the left side of the studio canvas displays a page listing the published web services. Click on the line for the bicycle demand forecasting web service and the display shown in Figure 1-44 appears.

Figure 1-44. Web service page for bike demand forecasting

On this page you can see a number of properties and tools:

- An API key, used by external applications to access this predictive model. To ensure security, manage the distribution of this key carefully!
- A link to a page that describes the request-response REST API. This document includes sample code in C#, Python, and R.
- A link to a page that describes the batch API. This document includes sample code in C#, Python, and R.
- A test button for manually testing the web service.
- An Excel download.

Let’s download and open the Excel workbook. Once the content is enabled, the workbook will appear, as in Figure 1-45. Note that the workbook will not work unless you enable the content!

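For readers who would rather call the request-response API from code than from Excel, the sketch below shows how such a call might be assembled in Python with the standard library. It only builds the request and does not send it. The URL and key are hypothetical placeholders. The JSON field names (`Inputs`, `input1`, `ColumnNames`, `Values`, `GlobalParameters`), the bearer-token header, and the column names are assumptions for illustration; the API help page linked from your own web service page is the authoritative source for the exact URL and payload layout.

```python
import json
import urllib.request

# Hypothetical placeholders; copy the real values from your service's API help page.
SERVICE_URL = "https://example.invalid/your-service/execute"
API_KEY = "YOUR_API_KEY"


def build_scoring_request(url, api_key, column_names, rows):
    """Build (but do not send) a JSON POST request for a request-response call."""
    # Assumed payload layout; confirm it against the generated sample code.
    payload = {
        "Inputs": {
            "input1": {"ColumnNames": column_names, "Values": rows}
        },
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The API key authenticates the caller, so keep it out of source control.
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


request = build_scoring_request(
    SERVICE_URL,
    API_KEY,
    column_names=["hr", "temp", "workingday"],  # illustrative column names
    rows=[["8", "0.5", "1"]],                   # one row of illustrative values
)

# To call a real service, pass the request to urllib.request.urlopen(request)
# and decode the JSON response body.
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the payload before any key or data leaves your machine.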
Figure 1-45. Excel workbook connected to web service API

Users can enter values on the left side of the spreadsheet, labeled Parameters. The results computed in the Azure cloud appear on the right as Predicted Values, shown in Figure 1-46.

Figure 1-46. The results produced by the web service

We can see that the newly computed predicted values are reasonably close to the actual values.

Publishing machine learning models as web services makes the results available to a wide audience. With very few steps, we have created such a web service and tested it from an Excel workbook.

Summary

I hope this report has motivated you to try your own data science projects in Azure ML. To summarize our discussion:

- Azure ML is an easy-to-use and powerful environment for the creation and cloud deployment of powerful machine learning solutions.
- Analytics written in R can be rapidly operationalized as web services using Azure ML.
- R code is readily integrated into the Azure ML workflow.
- Careful development, selection, and filtering of features is the key to creating successful data science solutions.
- Understanding business goals and requirements is essential to the creation of a valuable analytic solution.
- A clear understanding of residuals is essential to the evaluation and improvement of machine learning model performance.

About the Author

Stephen F. Elston, Managing Director of Quantia Analytics, LLC, is a big data geek, data scientist, instructor, and O’Reilly author. He has over two decades of experience in predictive analytics and machine learning with R, S/SPLUS, and Python. Elston has developed, sold, and supported multiple analytics solutions. He holds a PhD degree in Geophysics from Princeton University. Formerly, he led R&D for the S-PLUS companies and is a cofounder of FinAnalytica, Inc. He creates solutions for financial market and credit risk, trading analytics, wireless telecom, and fraud prevention. Elston is an Azure Advisor to Microsoft ...
