Classification Methods for Internet Applications, 1st ed., Martin Holeňa, Petr Pulc, Martin Kopp, 2020


Document information

Date posted: 08/05/2020, 06:54

Studies in Big Data 69

Martin Holeňa
Petr Pulc
Martin Kopp

Classification Methods for Internet Applications

Studies in Big Data
Volume 69

Series Editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Big Data” (SBD) publishes new developments and advances in the various areas of Big Data, quickly and with a high quality. The intent is to cover the theory, research, development, and applications of Big Data, as embedded in the fields of engineering, computer science, physics, economics and life sciences. The books of the series refer to the analysis and understanding of large, complex, and/or distributed data sets generated from recent digital sources coming from sensors or other physical instruments as well as simulations, crowd sourcing, social networks or other internet transactions, such as emails or video click streams and others. The series contains monographs, lecture notes and edited volumes in Big Data spanning the areas of computational intelligence including neural networks, evolutionary computation, soft computing, fuzzy systems, as well as artificial intelligence, data mining, modern statistics and operations research, as well as self-organizing systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output.

Indexing: The books of this series are submitted to ISI Web of Science, DBLP, Ulrichs, MathSciNet, Current Mathematical Publications, Mathematical Reviews, Zentralblatt Math: MetaPress and Springerlink.

More information about this series at http://www.springer.com/series/11970

Martin Holeňa, Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic
Petr Pulc, Czech Technical University, Prague, Czech Republic
Martin Kopp, Czech Technical University, Prague, Czech Republic

ISSN 2197-6503
ISSN 2197-6511 (electronic)
Studies in Big Data
ISBN 978-3-030-36961-3
ISBN 978-3-030-36962-0 (eBook)
https://doi.org/10.1007/978-3-030-36962-0

© Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This book originated from an elective course called Internet and Classification Methods for master and doctoral students, which has been taught since the academic year 2013/14 at the Charles University and the Czech Technical University in Prague. The course is intended for students of the study branches Computer Science (Charles University) and Information Technology (Czech Technical University). Its main purpose is to make the students aware that a key functionality of several very important Internet applications is actually the functionality of a classifier. That functionality is explained in sufficient detail to remove any magic from it and to allow competent assessment and competent tuning of such applications with respect to that functionality. We expect, and the first years of teaching the course confirm, that this topic is particularly interesting for those who would like to develop or improve such applications.

The Internet applications we consider are: spam filtering, recommender systems, example-based web search, sentiment analysis, malware detection, and network intrusion detection.

We consider them very broadly, including topics that are only loosely related to them, as long as the classification functionality is relevant enough. For instance, we also consider example-based search within pictures and other non-textual modalities of data because it is used in many recommender systems.

The above six kinds of applications are introduced in Chap. 1 of the book. However, they are not described thoroughly, as the students attending the course have other, specialized courses to this end. Similarly, the readers of the book are expected to be familiar with such applications already from elsewhere. We provide them with references to relevant specialized monographs or textbooks. On the other hand, we discuss in depth the classification methods involved, which are the focus of the remaining five chapters of the book, though they too are illustrated on examples concerning the considered kinds of Internet applications. From the point of view of computer scientists, classification is treated at a graduate rather than an undergraduate level. And although the book does not use the mathematical style of definitions, theorems and proofs, all discussed concepts are introduced with full formal rigour, allowing interested readers to understand their explanations also in purely statistical or mathematical books or papers.

Chapter 2 discusses general concepts pertaining to classification, in particular classifier performance measures, linear separability, classifier learning (both supervised and semi-supervised) and feature selection. The chapter also addresses the difference between classification and two other statistical approaches encountered in the considered Internet applications, namely, clustering and regression. The introduced concepts are illustrated on examples from spam filtering, recommender systems and malware detection.

Chapter 3 gives a survey of traditional classification methods that were developed with neither high predictive accuracy nor comprehensibility as a specific objective. The chapter covers, in particular, k nearest neighbours classification, Bayesian classifiers, the logit method, linear and quadratic discriminant analysis, and two kinds of classifiers belonging to artificial neural networks. The methods are illustrated on examples of all considered kinds of Internet applications.

In Chap. 4, support vector machines (SVM) are introduced, a kind of classifier developed specifically to achieve high predictive accuracy. First, the basic variant for binary classification into linearly separable classes is presented, followed by extensions to non-linear classification, multiple classes and noise-tolerant classification. SVM are illustrated on examples from spam filtering, recommender systems and malware detection. In connection with SVM, the method of active learning is explained and illustrated on an example of SVM active learning in recommender systems.

The topic of Chap. 5 is classifier comprehensibility. Comprehensibility is related to the possibility of explaining classification results with logical rules. Basic properties of such rules in Boolean logic and the main fuzzy logics are recalled. This chapter also addresses the possibility of obtaining sets of classification rules by means of genetic algorithms, the generalization of classification rules to observational rules, and finally the most common kind of classifiers producing classification rules, namely, classification trees. Both classification rules in general and obtaining rules from classification trees are illustrated on examples from spam filtering, recommender systems and malware detection.

Chapter 6 of the book deals with connecting classifiers into a team. It explains the concepts of aggregation function and confidence, the difference between general teams and ensembles, and the main methods of team construction. Finally, random forests are introduced, which are the most frequently encountered kind of classifier teams. The concepts and methods addressed in this chapter are illustrated on examples from spam filtering, recommender systems, search in multimedia data and malware detection.

In spite of its focus on the six kinds of Internet applications in which classification represents a key functionality, the book attempts to present the plethora of available classification methods and their variants in general: not only those that have already been used in the considered kinds of applications, but also those that have the potential to be used in them in the future. We hope that in this way the influence of the fast development in the area of Internet applications, which can sometimes cause a state-of-the-art approach to be surpassed by completely different approaches within a short time, will be at least to some extent eliminated.

Prague, Czech Republic
Martin Holeňa
Petr Pulc
Martin Kopp

Acknowledgement

Writing this book was supported by the Czech Science Foundation grant no. 18-18080S. The authors are very grateful to Jiří Tumpach for his substantial help with proofreading.

Contents

1 Important Internet Applications of Classification
  1.1 Spam Filtering
    1.1.1 The Road of Spam
    1.1.2 Collaborative Approach
    1.1.3 Spam Filters
    1.1.4 Image Spam
    1.1.5 Related Email Threats
    1.1.6 Spam Filtering as a Classification Task
  1.2 Recommender Systems
    1.2.1 Purpose of Recommender Systems
    1.2.2 Construction of a Recommender System
    1.2.3 Content Based and Collaborative Recommenders
    1.2.4 Conversational Recommender Systems
    1.2.5 Social Networks and Recommendations
    1.2.6 Recommender Systems and Mobile Devices
    1.2.7 Security and Legal Requirements
    1.2.8 Recommendation as a Classification Task
  1.3 Sentiment Analysis
    1.3.1 Opinion Mining: Subjectivity, Affect and Sentiment Analysis
    1.3.2 Sentiment Analysis Systems
    1.3.3 Open Challenges
    1.3.4 Affect Analysis in Multimedia
    1.3.5 Sentiment Analysis as a Classification Task
  1.4 Example-Based Web Search
    1.4.1 Example-Based Search and Object Annotations
    1.4.2 Example-Based Search in Text Documents
    1.4.3 Example-Based Search in General Objects
    1.4.4 Scribble and Sketch Input
    1.4.5 Multimedia and Descriptors
    1.4.6 Example Based Search as a Classification Task
  1.5 Malware Detection
    1.5.1 Static Malware Analysis
    1.5.2 Dynamic Malware Analysis
    1.5.3 Hybrid Malware Analysis
    1.5.4 Taxonomy of Malware
    1.5.5 Malware Detection as a Classification Task
  1.6 Network Intrusion Detection
    1.6.1 A Brief History of IDS
    1.6.2 Common Kinds of Attacks
    1.6.3 Host-Based Intrusion Detection
    1.6.4 Network-Based Intrusion Detection
    1.6.5 Other Types
    1.6.6 IDS as a Classification Task
  References

2 Basic Concepts Concerning Classification
  2.1 Classifiers and Classes
    2.1.1 How Many Classes for Spam Filtering?
  2.2 Measures of Classifier Performance
    2.2.1 Performance Measures in Binary Classification
  2.3 Linear Separability
    2.3.1 Kernel-Based Mapping of Data into a High-Dimensional Space
  2.4 Classifier Learning
    2.4.1 Classifier Overtraining
    2.4.2 Semi-supervised Learning
    2.4.3 Spam Filter Learning
  2.5 Feature Selection for Classifiers
  2.6 Classification is Neither Clustering Nor Regression
    2.6.1 Clustering
    2.6.2 Regression
    2.6.3 Clustering and Regression in Recommender Systems
    2.6.4 Clustering in Malware Detection
  References

3 Some Frequently Used Classification Methods
  3.1 Typology of Classification Methods
  3.2 Classification Based on k-Nearest Neighbours
    3.2.1 Distances Between Feature Vectors
    3.2.2 Using k-nn Based Classifiers in Recommender Systems
    3.2.3 Using k-nn Based Classifiers in Malware and Network Intrusion Detection