IT Training: Data Science for Business: What You Need to Know About Data Mining (Provost, Fawcett, 2013-08-19)



Document information

www.it-ebooks.info

Praise

“A must-read resource for anyone who is serious about embracing the opportunity of big data.”
— Craig Vaughan, Global Vice President at SAP

“This timely book says out loud what has finally become apparent: in the modern world, Data is Business, and you can no longer think business without thinking data. Read this book and you will understand the Science behind thinking data.”
— Ron Bekkerman, Chief Data Officer at Carmel Ventures

“A great book for business managers who lead or interact with data scientists, who wish to better understand the principles and algorithms available without the technical details of single-disciplinary books.”
— Ronny Kohavi, Partner Architect at Microsoft Online Services Division

“Provost and Fawcett have distilled their mastery of both the art and science of real-world data analysis into an unrivalled introduction to the field.”
— Geoff Webb, Editor-in-Chief of Data Mining and Knowledge Discovery Journal

“I would love it if everyone I had to work with had read this book.”
— Claudia Perlich, Chief Scientist of M6D (Media6Degrees) and Advertising Research Foundation Innovation Award Grand Winner (2013)

“A foundational piece in the fast developing world of Data Science. A must read for anyone interested in the Big Data revolution.”
— Justin Gapper, Business Unit Analytics Manager at Teledyne Scientific and Imaging

“The authors, both renowned experts in data science before it had a name, have taken a complex topic and made it accessible to all levels, but mostly helpful to the budding data scientist. As far as I know, this is the first book of its kind—with a focus on data science concepts as applied to practical business problems. It is liberally sprinkled with compelling real-world examples outlining familiar, accessible problems in the business world: customer churn, targeted marketing, even whiskey analytics!

The book is unique in that it does not give a cookbook of algorithms, rather it helps the reader understand the underlying concepts behind data science, and most importantly how to approach and be successful at problem solving. Whether you are looking for a good comprehensive overview of data science or are a budding data scientist in need of the basics, this is a must-read.”
— Chris Volinsky, Director of Statistics Research at AT&T Labs and Winning Team Member for the $1 Million Netflix Challenge

“This book goes beyond data analytics 101. It’s the essential guide for those of us (all of us?) whose businesses are built on the ubiquity of data opportunities and the new mandate for data-driven decision-making.”
— Tom Phillips, CEO of Media6Degrees and Former Head of Google Search and Analytics

“Intelligent use of data has become a force powering business to new levels of competitiveness. To thrive in this data-driven ecosystem, engineers, analysts, and managers alike must understand the options, design choices, and tradeoffs before them. With motivating examples, clear exposition, and a breadth of details covering not only the “hows” but the “whys”, Data Science for Business is the perfect primer for those wishing to become involved in the development and application of data-driven systems.”
— Josh Attenberg, Data Science Lead at Etsy

“Data is the foundation of new waves of productivity growth, innovation, and richer customer insight. Only recently viewed broadly as a source of competitive advantage, dealing well with data is rapidly becoming table stakes to stay in the game. The authors’ deep applied experience makes this a must read—a window into your competitor’s strategy.”
— Alan Murray, Serial Entrepreneur; Partner at Coriolis Ventures

“One of the best data mining books, which helped me think through various ideas on liquidity analysis in the FX business. The examples are excellent and help you take a deep dive into the subject! This one is going to be on my shelf for lifetime!”
— Nidhi Kathuria, Vice President of FX at Royal Bank of Scotland

Data Science for Business
by Foster Provost and Tom Fawcett

Copyright © 2013 Foster Provost and Tom Fawcett. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Mike Loukides and Meghan Blanchette
Production Editor: Christopher Hearse
Proofreader: Kiel Van Horn
Indexer: WordCo Indexing Services, Inc.
Cover Designer: Mark Paglietti
Interior Designer: David Futato
Illustrator: Rebecca Demarest

July 2013: First Edition

Revision History for the First Edition:
2013-07-25: First release

See http://oreilly.com/catalog/errata.csp?isbn=9781449361327 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps. Data Science for Business is a trademark of Foster Provost and Tom Fawcett.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-36132-7

[LSI]

Table of Contents

  • Preface

  • 1. Introduction: Data-Analytic Thinking

    • The Ubiquity of Data Opportunities
    • Example: Hurricane Frances
    • Example: Predicting Customer Churn
    • Data Science, Engineering, and Data-Driven Decision Making
    • Data Processing and “Big Data”
    • From Big Data 1.0 to Big Data 2.0
    • Data and Data Science Capability as a Strategic Asset
    • Data-Analytic Thinking
    • This Book
    • Data Mining and Data Science, Revisited
    • Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist
    • Summary

  • 2. Business Problems and Data Science Solutions

    Fundamental concepts: A set of canonical data mining tasks; The data mining process; Supervised versus unsupervised data mining.

    • From Business Problems to Data Mining Tasks
    • Supervised Versus Unsupervised Methods
    • Data Mining and Its Results
    • The Data Mining Process
      • Business Understanding
      • Data Understanding
      • Data Preparation
      • Modeling
      • Evaluation
      • Deployment
    • Implications for Managing the Data Science Team
    • Other Analytics Techniques and Technologies
      • Statistics
      • Database Querying
      • Data Warehousing
      • Regression Analysis
      • Machine Learning and Data Mining
      • Answering Business Questions with These Techniques
    • Summary

  • 3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation

    Fundamental concepts: Identifying informative attributes; Segmenting data by progressive attribute selection.
    Exemplary techniques: Finding correlations; Attribute/variable selection; Tree induction.

    • Models, Induction, and Prediction
    • Supervised Segmentation
      • Selecting Informative Attributes
      • Example: Attribute Selection with Information Gain
      • Supervised Segmentation with Tree-Structured Models
    • Visualizing Segmentations
    • Trees as Sets of Rules
    • Probability Estimation
    • Example: Addressing the Churn Problem with Tree Induction
    • Summary

  • 4. Fitting a Model to Data

    Fundamental concepts: Finding “optimal” model parameters based on data; Choosing the goal for data mining; Objective functions; Loss functions.
    Exemplary techniques: Linear regression; Logistic regression; Support-vector machines.

    • Classification via Mathematical Functions
      • Linear Discriminant Functions
      • Optimizing an Objective Function
      • An Example of Mining a Linear Discriminant from Data
      • Linear Discriminant Functions for Scoring and Ranking Instances
      • Support Vector Machines, Briefly
    • Regression via Mathematical Functions
    • Class Probability Estimation and Logistic “Regression”
      • * Logistic Regression: Some Technical Details
    • Example: Logistic Regression versus Tree Induction
    • Nonlinear Functions, Support Vector Machines, and Neural Networks

...

... of data science. It is important to understand data science even if you never intend to apply it yourself. Data-analytic thinking enables you to evaluate proposals for data mining projects. For...

Date posted: 05/11/2019, 15:00


Table of contents

  • Copyright

  • Table of Contents

  • Preface

    • Our Conceptual Approach to Data Science

    • To the Instructor

    • Other Skills and Concepts

    • Sections and Notation

    • Using Examples

    • Safari® Books Online

    • How to Contact Us

    • Acknowledgments

  • Chapter 1. Introduction: Data-Analytic Thinking

    • The Ubiquity of Data Opportunities

    • Example: Hurricane Frances

    • Example: Predicting Customer Churn

    • Data Science, Engineering, and Data-Driven Decision Making

    • Data Processing and “Big Data”

    • From Big Data 1.0 to Big Data 2.0

    • Data and Data Science Capability as a Strategic Asset

    • Data-Analytic Thinking

    • This Book

    • Data Mining and Data Science, Revisited

    • Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist

    • Summary

  • Chapter 2. Business Problems and Data Science Solutions

    • From Business Problems to Data Mining Tasks

    • Supervised Versus Unsupervised Methods

    • Data Mining and Its Results

    • The Data Mining Process

      • Business Understanding

      • Data Understanding

      • Data Preparation

      • Modeling

      • Evaluation

      • Deployment

    • Implications for Managing the Data Science Team

    • Other Analytics Techniques and Technologies

      • Statistics

      • Database Querying

      • Data Warehousing

      • Regression Analysis

      • Machine Learning and Data Mining

      • Answering Business Questions with These Techniques

    • Summary

  • Chapter 3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation

    • Models, Induction, and Prediction

    • Supervised Segmentation

      • Selecting Informative Attributes

      • Example: Attribute Selection with Information Gain

      • Supervised Segmentation with Tree-Structured Models

    • Visualizing Segmentations

    • Trees as Sets of Rules

    • Probability Estimation

    • Example: Addressing the Churn Problem with Tree Induction

    • Summary

  • Chapter 4. Fitting a Model to Data

    • Classification via Mathematical Functions

      • Linear Discriminant Functions

      • Optimizing an Objective Function

      • An Example of Mining a Linear Discriminant from Data

      • Linear Discriminant Functions for Scoring and Ranking Instances

      • Support Vector Machines, Briefly

    • Regression via Mathematical Functions

    • Class Probability Estimation and Logistic “Regression”

      • * Logistic Regression: Some Technical Details

    • Example: Logistic Regression versus Tree Induction

    • Nonlinear Functions, Support Vector Machines, and Neural Networks

    • Summary

  • Chapter 5. Overfitting and Its Avoidance

    • Generalization

    • Overfitting

    • Overfitting Examined

      • Holdout Data and Fitting Graphs

      • Overfitting in Tree Induction

      • Overfitting in Mathematical Functions

    • Example: Overfitting Linear Functions

    • * Example: Why Is Overfitting Bad?

    • From Holdout Evaluation to Cross-Validation

    • The Churn Dataset Revisited

    • Learning Curves

    • Overfitting Avoidance and Complexity Control

      • Avoiding Overfitting with Tree Induction

      • A General Method for Avoiding Overfitting

      • * Avoiding Overfitting for Parameter Optimization

    • Summary

  • Chapter 6. Similarity, Neighbors, and Clusters

    • Similarity and Distance

    • Nearest-Neighbor Reasoning

      • Example: Whiskey Analytics

      • Nearest Neighbors for Predictive Modeling

      • How Many Neighbors and How Much Influence?

      • Geometric Interpretation, Overfitting, and Complexity Control

      • Issues with Nearest-Neighbor Methods

    • Some Important Technical Details Relating to Similarities and Neighbors

      • Heterogeneous Attributes

      • * Other Distance Functions

      • * Combining Functions: Calculating Scores from Neighbors

    • Clustering

      • Example: Whiskey Analytics Revisited

      • Hierarchical Clustering

      • Nearest Neighbors Revisited: Clustering Around Centroids

      • Example: Clustering Business News Stories

      • Understanding the Results of Clustering

      • * Using Supervised Learning to Generate Cluster Descriptions

    • Stepping Back: Solving a Business Problem Versus Data Exploration

    • Summary

  • Chapter 7. Decision Analytic Thinking I: What Is a Good Model?

    • Evaluating Classifiers

      • Plain Accuracy and Its Problems

      • The Confusion Matrix

      • Problems with Unbalanced Classes

      • Problems with Unequal Costs and Benefits

    • Generalizing Beyond Classification

    • A Key Analytical Framework: Expected Value

      • Using Expected Value to Frame Classifier Use

      • Using Expected Value to Frame Classifier Evaluation

    • Evaluation, Baseline Performance, and Implications for Investments in Data

    • Summary

  • Chapter 8. Visualizing Model Performance

    • Ranking Instead of Classifying

    • Profit Curves

    • ROC Graphs and Curves

    • The Area Under the ROC Curve (AUC)

    • Cumulative Response and Lift Curves

    • Example: Performance Analytics for Churn Modeling

    • Summary

  • Chapter 9. Evidence and Probabilities

    • Example: Targeting Online Consumers With Advertisements

    • Combining Evidence Probabilistically

      • Joint Probability and Independence

      • Bayes’ Rule

    • Applying Bayes’ Rule to Data Science

      • Conditional Independence and Naive Bayes

      • Advantages and Disadvantages of Naive Bayes

    • A Model of Evidence “Lift”

    • Example: Evidence Lifts from Facebook “Likes”

      • Evidence in Action: Targeting Consumers with Ads

    • Summary

  • Chapter 10. Representing and Mining Text

    • Why Text Is Important

    • Why Text Is Difficult

    • Representation

      • Bag of Words

      • Term Frequency

      • Measuring Sparseness: Inverse Document Frequency

      • Combining Them: TFIDF

    • Example: Jazz Musicians

    • * The Relationship of IDF to Entropy

    • Beyond Bag of Words

      • N-gram Sequences

      • Named Entity Extraction

      • Topic Models

    • Example: Mining News Stories to Predict Stock Price Movement

      • The Task

      • The Data

      • Data Preprocessing

      • Results

    • Summary

  • Chapter 11. Decision Analytic Thinking II: Toward Analytical Engineering

    • Targeting the Best Prospects for a Charity Mailing

      • The Expected Value Framework: Decomposing the Business Problem and Recomposing the Solution Pieces

      • A Brief Digression on Selection Bias

    • Our Churn Example Revisited with Even More Sophistication

      • The Expected Value Framework: Structuring a More Complicated Business Problem

      • Assessing the Influence of the Incentive

      • From an Expected Value Decomposition to a Data Science Solution

      • Summary

  • Chapter 12. Other Data Science Tasks and Techniques

    • Co-occurrences and Associations: Finding Items That Go Together

      • Measuring Surprise: Lift and Leverage

      • Example: Beer and Lottery Tickets

      • Associations Among Facebook Likes

    • Profiling: Finding Typical Behavior

    • Link Prediction and Social Recommendation

    • Data Reduction, Latent Information, and Movie Recommendation

    • Bias, Variance, and Ensemble Methods

    • Data-Driven Causal Explanation and a Viral Marketing Example

    • Summary

  • Chapter 13. Data Science and Business Strategy

    • Thinking Data-Analytically, Redux

    • Achieving Competitive Advantage with Data Science

    • Sustaining Competitive Advantage with Data Science

      • Formidable Historical Advantage

      • Unique Intellectual Property

      • Unique Intangible Collateral Assets

      • Superior Data Scientists

      • Superior Data Science Management

    • Attracting and Nurturing Data Scientists and Their Teams

    • Examine Data Science Case Studies

    • Be Ready to Accept Creative Ideas from Any Source

    • Be Ready to Evaluate Proposals for Data Science Projects

      • Example Data Mining Proposal

      • Flaws in the Big Red Proposal

    • A Firm’s Data Science Maturity

  • Chapter 14. Conclusion

    • The Fundamental Concepts of Data Science

      • Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data

      • Changing the Way We Think about Solutions to Business Problems

    • What Data Can’t Do: Humans in the Loop, Revisited

    • Privacy, Ethics, and Mining Data About Individuals

    • Is There More to Data Science?

    • Final Example: From Crowd-Sourcing to Cloud-Sourcing

    • Final Words

  • Appendix A. Proposal Review Guide

    • Business and Data Understanding

    • Data Preparation

    • Modeling

    • Evaluation and Deployment

  • Appendix B. Another Sample Proposal

    • Scenario and Proposal

      • Flaws in the GGC Proposal

  • Glossary

  • Bibliography

  • Index

  • About the Authors
