Data Analysis with Open Source Tools docx

Document information

www.it-ebooks.info

O’Reilly-5980006 janert5980006˙fm October 28, 2010 22:1

Data Analysis with Open Source Tools

Philipp K. Janert

Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Data Analysis with Open Source Tools
by Philipp K. Janert

Copyright © 2011 Philipp K. Janert. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc. 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Editor: Mike Loukides
Production Editor: Sumita Mukherji
Copyeditor: Matt Darnell
Production Services: MPS Limited, a Macmillan Company, and Newgen North America, Inc.
Indexer: Fred Brown
Cover Designer: Karen Montgomery
Interior Designer: Edie Freedman and Ron Bilodeau
Illustrator: Philipp K. Janert

Printing History: November 2010: First Edition.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Analysis with Open Source Tools, the image of a common kite, and related trade dress are trademarks of O’Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-0-596-80235-6 [M] [2011-05-27]

Furious activity is no substitute for understanding.
H. H. Williams

CONTENTS

  • PREFACE
  • 1 INTRODUCTION
    • Data Analysis
    • What’s in This Book
    • What’s with the Workshops?
    • What’s with the Math?
    • What You’ll Need
    • What’s Missing
  • PART I Graphics: Looking at Data
  • 2 A SINGLE VARIABLE: SHAPE AND DISTRIBUTION
    • Dot and Jitter Plots
    • Histograms and Kernel Density Estimates
    • The Cumulative Distribution Function
    • Rank-Order Plots and Lift Charts
    • Only When Appropriate: Summary Statistics and Box Plots
    • Workshop: NumPy
    • Further Reading
  • 3 TWO VARIABLES: ESTABLISHING RELATIONSHIPS
    • Scatter Plots
    • Conquering Noise: Smoothing
    • Logarithmic Plots
    • Banking
    • Linear Regression and All That
    • Showing What’s Important
    • Graphical Analysis and Presentation Graphics
    • Workshop: matplotlib
    • Further Reading
  • 4 TIME AS A VARIABLE: TIME-SERIES ANALYSIS
    • Examples
    • The Task
    • Smoothing
    • Don’t Overlook the Obvious!
    • The Correlation Function
    • [...]

... analytically, you will need to develop some familiarity with a few mathematical concepts. There is simply no way around it. (You can work with data without any math skills: look at what any data modeler or database administrator does. But if you want to do any sort of analysis, then a little math becomes a necessity.)
I have tried to make the text accessible to readers with a minimum of previous knowledge. Some college...

CONTENTS (continued)

  • ... Intelligence
  • Corporate Metrics and Dashboards
  • Data Quality Issues
  • Workshop: Berkeley DB and SQLite
  • Further Reading
  • ...
  • Notation and Basic Math
  • Where to Go from Here
  • Further Reading
  • C WORKING WITH DATA
    • Sources for Data
    • Cleaning and Conditioning
    • Sampling
    • Data File Formats
    • The Care and Feeding of Your Data Zoo
    • Skills
    • Terminology
    • Further Reading
  • ...

... not, make up a large part of actual data analysis and also introduces some data-related terminology.

What’s with the Workshops?

Every full chapter (after this one) includes a section titled “Workshop” that contains some programming examples related to the chapter’s material. I use these Workshops for two purposes. On the one hand, I’d like to introduce a number of open source tools and libraries that may be...

CONTENTS (continued)

  • ... VARIABLES: GRAPHICAL MULTIVARIATE ANALYSIS
    • False-Color Plots
    • A Lot at a Glance: Multiplots
    • Composition Problems
    • Novel Plot Types
    • Interactive Explorations
    • Workshop: Tools for Multivariate Graphics
    • Further Reading
  • INTERMEZZO: A DATA ANALYSIS SESSION
    • A Data Analysis Session
    • Workshop: gnuplot
    • Further Reading
  • PART II Analytics: Modeling Data
  • ...

... carefully selected sample may lead to better results than a large, messy data set. Big Data makes it easy to forget the basics.

It is a little early to say anything definitive about Big Data, but the current trend strikes me as being something quite different: it is not just classical data analysis on a larger scale. The approach of classical data analysis and statistics is inductive. Given a part, make statements...

... contrast, Big Data (at least as it is currently being used) seems primarily concerned with individual data points. Given that this specific user liked this specific movie, what other specific movie might he like?
This is a very different question than asking which movies are most liked by what people in general! Big Data will not replace general, inductive data analysis. It is not yet clear just where Big Data will...

Big data

Arguably the most painful omission concerns everything having to do with Big Data. Big Data is a pretty new concept: I tend to think of it as relating to data sets that not merely don’t fit into main memory, but that no longer fit comfortably on a single disk, requiring compute clusters and the respective software and algorithms (in practice, map/reduce running on Hadoop). The rise of Big Data...

... (early 2009), Big Data was certainly on the horizon but was not necessarily considered mainstream yet. As this book goes to print (late 2010), it seems that for many people in the tech field, data has become nearly synonymous with “Big Data.” That kind of development usually indicates a fad. The reality is that, in practice, many data sets are “small,” and in particular many relevant data sets are small...

... into your product’s documentation does require permission. We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Analysis with Open Source Tools, by Philipp K. Janert. Copyright 2011 Philipp K. Janert, 978-0-596-80235-6.” If you feel your use of code examples falls outside fair use or the permission given above, feel free to...

... probably means formal training) in the field that you are working in. A book such as this one on general data analysis cannot replace this.

Formal statistical analysis

A different form of data analysis exists in some particularly well-established fields. In these situations, the environment from which the data arises is fully understood (or at least believed to be understood), and the methods and models to...

Date posted: 31/03/2014, 12:20

Table of contents

  • Contents

  • Preface

    • Before We Begin

    • Conventions Used in This Book

    • Using Code Examples

    • Safari® Books Online

    • How to Contact Us

    • Acknowledgments

    • Chapter 1. Introduction

      • Data Analysis

      • What's in This Book

      • What's with the Workshops?

      • What's with the Math?

      • What You'll Need

      • What's Missing

      • Part I. Graphics: Looking at Data

        • Chapter 2. A Single Variable: Shape and Distribution

          • Dot and Jitter Plots

          • Histograms and Kernel Density Estimates

          • The Cumulative Distribution Function

          • Rank-Order Plots and Lift Charts

          • Only When Appropriate: Summary Statistics and Box Plots

          • Workshop: NumPy

          • Further Reading
