Methodology of Relational Data Mining for Stock Market Prediction


VIETNAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY

CHU THAI HOA

METHODOLOGY OF RELATIONAL DATA MINING FOR STOCK MARKET PREDICTION

Major: Information Technology
Code: 1.01.10

MASTER'S THESIS

Instructor: Prof. Dr. HO TU BAO

Hanoi, June 2007

ABSTRACT

This thesis presents the methodology of relational data mining for stock market prediction. It first clarifies each problem related to the keywords (methodology, relational, data mining, stock market, and prediction), and then turns to the methodology of relational data mining, with emphasis on Machine Methods for Discovering Regularities (MMDR) for stock market prediction.

Stock market prediction has been widely studied as a time-series prediction problem. Deriving relationships that allow one to predict future values of a time series is challenging. One approach to prediction is to spot patterns in the past, when we already know what followed them, and to test them on more recent data. If a pattern is followed by the same outcome frequently enough, we can gain confidence that it is a genuine relationship.

The purpose of relational data mining (RDM) is to overcome the limitations of attribute-based learning methods (commonly used in finance) in representing background knowledge and complex relations. RDM approaches look for patterns that involve multiple tables (relations) from a relational database. This approach will play a key role in future advances in data mining methodology and practice.

MMDR is one of the few hybrid probabilistic relational data mining methods developed and applied to stock market data. The method has an advantage in handling numerical data. It expresses patterns in first-order logic (FOL) and assigns probabilities to rules generated by composing patterns. This is made clear through an application of MMDR in a computational experiment on price index data of the Standard and Poor's 500.

The thesis consists of chapters concentrating on
relational data mining methodology for stock market prediction.

ACKNOWLEDGEMENTS

This thesis would not have been completed without the help and support of many people. I would like to take this opportunity to express my gratitude to the many people who helped me during the period of development leading to the thesis.

In particular, I would like to thank my instructor, Prof. Dr. HO Tu Bao, for his courage in accepting me as a Master's student, and for his enthusiasm, his knowledge and his encouragement throughout the work. I would never have been able to finish this thesis without his encouragement as well as his strict requirements for the quality of the research.

I also enjoyed and appreciated the fruitful exchange of ideas with Dr. NGUYEN Trong Dung, to whom I am also grateful for comments on the thesis.

In the early days of my research, Dr. HA Quang Thuy, Dr. PHAM Tran Nhu and Dr. DO Van Thanh stimulated my interest in data mining for financial forecasting. I am thankful for that and for the many discussions I had with them.

I am indebted to CFO LE The Anh and CFO NGUYEN Minh Quang for their patience with my questions on financial and stock market forecasting. I am also grateful to Dr. PHAM Ngoc Khoi, Dr. NGUYEN Phu Chien, MSc. DAO Van Thanh and Mrs. LE Thi Hoang My for words of encouragement during the months of work on the thesis and for their style-improving suggestions.

My thanks also go to everyone who has provided support or advice to me on data mining, the stock market, forecasting and so on in one way or another.

My family has created good conditions for me to complete the thesis. I dedicate the thesis to my father, my mother and my younger brother, whose love and support are always with me.

Hanoi, June 2007,
CHU Thai Hoa

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES AND FIGURES
LIST OF ABBREVIATIONS
INTRODUCTION
  Problem definition
  Motivations of the Thesis
  Objectives of the Thesis
  Method of the Thesis study
  Structure of the Thesis
CHAPTER I: OVERVIEW OF STOCK MARKET PREDICTION IN DM
  I.1 Introduction to stock market prediction
    I.1.1 Basic concepts of forecast
    I.1.2 Prediction tasks in stock market
    I.1.3 Stock market time series properties
    I.1.4 Stock market prediction with the efficient market theory
    I.1.5 Questions in stock market prediction
    I.1.6 Challenges and Possibilities on Developing a Stock Market Prediction System
  I.2 Data mining methodology for stock market prediction
    I.2.1 Prediction in data mining
    I.2.2 Parameters
    I.2.3 Approaches to stock market prediction
    I.2.4 Data mining methods in stock market
CHAPTER II: RELATIONAL DATA MINING FOR STOCK MARKET PREDICTION
  II.1 Introduction
  II.2 Basic problems
    II.2.1 First-order logic and rules
    II.2.2 Representative measurement theory
    II.2.3 Breadth-first search
    II.2.4 Occam's razor principle
  II.3 Theory of RDM
    II.3.1 Data types in RDM
    II.3.2 Relational representation of examples
    II.3.3 Background knowledge and problems of search for regularities
  II.4 An algorithm for RDM: MMDR
    II.4.1 Motivations of choice for MMDR
    II.4.2 Some concepts
    II.4.3 Algorithm MMDR
CHAPTER III: AN APPLICATION OF MMDR TO STOCK PRICE PREDICTION
  III.1 MMDR model for prediction
  III.2 Experiment preparation
    III.2.1 Data description and representation
    III.2.2 Demo program
  III.3 Application of MMDR model
    III.3.1 Step 1: Generating logical rules
    III.3.2 Step 2: Learning logical rules
    III.3.3 Step 3: Creating intervals
  III.4 Results and evaluations
    III.4.1 Stability of discovered rules on test data
    III.4.2 Evaluations of forecast performance
CONCLUSIONS
  Contributions of the thesis
  Limitations of the thesis
  Future work
  Summary
APPENDICES
  Source code
REFERENCES
  In English
  In Vietnamese
  Website

LIST OF TABLES AND FIGURES

Comparison of AVL-based methods and first-order logic methods
UpDown predicate
Predicates Up and Down
Examples of terms
Attribute-based data example
Partial background knowledge for stock market
Figure III.1: Flow diagram for MMDR model: steps and techniques
Training set and Test set
Examples of rule consistent with hypotheses H1-H4
Table A.1: Stability checking table
Table A.2: Performance metrics for a set of 125 regularities
Figure A.1: Performance of 125 found regularities on test data
Table A.3: Performance metrics for a set of 292 regularities
Figure A.2: Performance of 125 found regularities on test data
Table A.5: Performance for regularity with conditional probability of 0.49
Figure A.3: Performance of an individual regularity with conditional probability of 0.49 on test data
Table A.6: Performance for regularity with conditional probability of 0.84
Figure A.4: Performance of an individual regularity with conditional probability of 0.84 on test data
Table A.7: Forecast result for the day December 1st, 2006 (the regularity with conditional probability of 0.84)
Table A.8: Forecast result for the day December 1st, 2006 (the set
of 292 regularities with conditional probability not less than 0.65)

LIST OF ABBREVIATIONS

AI : Artificial Intelligence
AVL(s) : Attribute-value language(s)
DM : Data mining
FOL : First-order Logic
ILP : Inductive Logic Programming
ML : Machine Learning
MMDR : Machine Methods for Discovering Regularities
MRDM : Multi-Relational Data mining
RDM : Relational Data mining
RMT : Representative measurement theory

INTRODUCTION

Problem definition

There are four major technological reasons stimulating data mining development, applications and public interest: the emergence of very large databases; advances in computer technology; fast access to vast amounts of data; and the ability to apply computationally intensive statistical methodology to these data.

Data mining is the process of discovering hidden patterns in data. Due to the large size of databases, the importance of the information stored, and the value of the information obtained, finding hidden patterns in data has become increasingly significant. The stock market is an area in which large volumes of data are created and stored on a daily basis.

Financial forecasting has been widely studied as a case of the time-series prediction problem. Time series such as the stock market are often seen as non-stationary, which presents challenges in predicting future values. The efficient market theory states that it is practically impossible to predict financial markets long-term. However, there is good evidence that short-term trends exist and programs can be written to find them. The data miners' challenge is to find the trends quickly while they are valid, as well as to recognize the time when the trends are no longer effective.

Data mining methods provide the framework for stock market predictions to discover hidden trends and patterns. Well-known and commonly used data mining methods in stock market are
attribute-based learning methods, but they have some serious drawbacks: a limited ability to represent background knowledge and a lack of complex relations. The purpose of RDM is to overcome these limitations. RDM is a learning method that is better suited for stock market mining, with a better ability to explain discovered rules than other symbolic approaches. However, current relational methods are relatively inefficient and have rather limited facilities for handling numerical data. RDM as a hybrid learning method combines the strengths of FOL and probabilistic inference to meet these challenges. MMDR, one of the few hybrid probabilistic relational data mining methods and one that handles numerical data efficiently, has been developed and applied to stock market data.

It is believed that now is the time for RDM methods, and that MMDR in particular has advantages in discovering regularities in stock market time series.

Motivations of the Thesis

In the past few years, Vietnam's stock market was still at an early stage of development and thus did not attract attention from investors and researchers. For interested learners in particular, mastering professional methods of stock market analysis and forecasting requires time and a wide background knowledge to study all the fields covered. Moreover, according to the efficient market theory, it is practically impossible to infer a fixed long-term global forecasting model from historical stock market information. Therefore, few Vietnamese have been interested in, or have carried out research on, stock market prediction.

The last two years have witnessed the surprising development of the Vietnamese stock market, with a host of notable events. In particular, since Vietnam became a World Trade Organization (WTO) member, the Vietnamese economy has had many opportunities to develop, leading to the development of many companies and markets, including the financial and stock markets. It is said
that Vietnam's stock market will grow rapidly in the next years, and it will rank second in the region, just after China, in terms of growth rate.

With the rapid development of Vietnam's financial market, professional activities such as the analysis and prediction of the financial market should be paid more attention. In particular, these activities play a significant role in the task of macroeconomic forecasting at the National Center for Socio-economic Information and Forecast (under the Ministry of Planning and Investment), which helps make sound policies related to socio-economic management and regulation at the macro level.

Data mining provides methods and techniques that can help approach stock market prediction quite effectively. In fact, there have already been some studies and successful applications of data mining techniques to stock market forecasting. However, acquiring the knowledge and application techniques of each approach is quite challenging and time-consuming. I read some papers and paid special attention to research on relational data mining in finance by two researchers, Prof. Dr. Boris Kovalerchuk and Dr. Evgenii Vityaev. They reported that, "Mining stock market data presents special challenges. For one, the rewards for finding successful patterns are potentially enormous, but so are the difficulties and sources of confusions. The efficient market theory states that it is practically impossible to predict financial markets long-term. However, there is good evidence that short-term trends exist and programs can be written to find them. The data miners' challenge is to find the trends quickly while they are valid, to deal effectively with time series and calendar effects, as well as to recognize the time when the trends are no longer effective".

The learning method RDM is able to learn more expressive rules, make better use of underlying domain knowledge, and better explain discovered rules than
other symbolic approaches. It is thus better suited for stock market mining. This approach will play a key role in future advances in data mining methodology and practice. The earlier algorithms for RDM suffer from relative computational inefficiency and have rather limited tools for processing numerical data. This problem especially needs to be considered in stock market analysis, where data are commonly numerical time series. Therefore, RDM as a hybrid learning method that combines the strengths of FOL and probabilistic inference has been developed to meet these challenges. MMDR, one of the few hybrid probabilistic relational data mining methods and one that handles numerical data efficiently, has been developed and applied to stock market forecasting.

The common question "Can stock market prediction be profitable?" is often put to any research on methods of stock market prediction. In fact, few people do research on RDM for stock market forecasting, because it requires interested learners to have a wide background knowledge to understand all the fields covered. Much less has been reported publicly on the success of data mining in real trading by financial institutions. If real success is reported, then competitors can apply the same methods and the leverage will disappear, because in essence all fundamental data mining methods are not proprietary. I used to focus my study on an attempt to end up with a Master's degree and as a millionaire (kidding), but this is too high a risk to take.

Basing my intention on practical suggestions and requirements, as well as my personal interest, I decided to do research on stock market forecasting. Through some school lessons and extra self-learning efforts, I gained access to some data mining techniques to seek a solution to the task.

The above motivates the aim of the thesis: to carry out research and experiments on the methodology of RDM for stock market prediction.

As far as I know,
the demo version of the Discovery software for financial applications was introduced by these authors only in March 2007.

- Evaluation of the methodology and experiment results

The experiment conducted for the thesis bolsters my confidence in affirming the evaluations of the advantages of the methodology presented. RDM, especially MMDR, allows one to get human-readable forecasting rules, so a stock market specialist can evaluate the performance of the forecast as well as the forecasting rules. MMDR seems to suffer little from noise. If MMDR captured a "critical mass" of noise, this noise would be part of a statistically significant rule, since the method selects only statistically significant rules. Applications to the stock market provide a unique environment where the efficiency of the methods can be tested instantly. This test process can be repeated daily for several months, collecting quality estimates.

Computational experiments presented in the thesis have shown the advantages in practice for real stock market data. RDM methods, and the MMDR method in particular, have unrestricted capabilities for the combined use of indicators, which are needed for real trading systems. Moreover, relational methods provide nearly unlimited capabilities to formulate and test hypotheses, because of the power of the FOL languages. Relational regularities refuse to make any stock market decision for objects where there is insufficient information for an accurate forecast. This approach seems more rational than other approaches, which always deliver forecasts using their own "universal" formula (rule) for all objects.

It can be said that MMDR is well suited to stock market applications due to its ability to handle numerical data with high levels of noise.

Limitations of the thesis

- Omission of studies and comments on other stock prediction methods

I did the research with a view to making it practical, or at least experimental, but the research demanded much work, among other things for:
• Studying areas: data mining, mathematics, and
financial and stock market forecasting (it is said that this study may take years)
• Designing systems that exploit and implement the ideas of the study
• Collecting and preprocessing data for experiments
• Programming - hundreds of hours taken
• Evaluating the programs, adjusting parameters, and evaluating again - the loop possibly taking hundreds of hours
• Writing a synthesis report with extended background and papers for the successful attempts - a time-consuming task

Each task requires much time, not including the time consumed if the approaches do not work (in this case, the design, the implementation and the initial evaluation efforts are not publishable). That is why the thesis lacks experiments to support some of its ideas. Therefore, no evaluation of RDM or MMDR of my own is put forward. Nevertheless, the thesis still refers to some researchers' examples, comparisons and evaluations to express my interest in and support for RDM and MMDR applied to stock market prediction.

- Lack of trials and evaluations of other software programs

So far, quite a few software programs have been introduced as being able to support and carry out the task of stock market prediction quite well. However, running experiments with this software and using it requires investigating how predictions, together with related knowledge, merge into a trading model, then preparing and preprocessing data for the experiments, etc. I have already downloaded and learnt to use the software Discovery introduced in some papers by Dr. B. Kovalerchuk and Dr. E. Vityaev. This software seems useful for understanding the discovery of regularities. Unfortunately, by the time I embarked on writing this thesis, I still had not achieved results considerable enough to present in the thesis or to evaluate the software. That is why the thesis lacks comments on some related software.

Future work

RDM methods have unrestricted capabilities for combined use of indicators, which are
needed for real trading systems. Moreover, relational methods provide nearly unlimited capabilities to formulate and test hypotheses, because of the power of the FOL languages. The class of hypotheses H1-H4 has already shown advantages, but this class represents only the very first step in predicate and hypothesis invention in finance. An intensive growth of a new area of research and applications of relational methods is expected in coming years.

- Solutions to those above limitations

After the thesis, I will gradually tackle the above shortcomings to get more practical and efficient results. An extensive growth of hybrid methods that combine different models and provide better performance than can be achieved by individual ones is now expected. I wish to investigate some common methods used in stock prediction, such as neural networks and decision trees, and their hybrids, with comparisons among the methods. I also intend to learn to use some other software, especially the software Discovery introduced in the book "Data Mining in Finance: Advances in Relational and Hybrid Methods" by Kovalerchuk & Vityaev. As such, I can offer some evaluations of them, which is highly significant for the study of the thesis.

- More tests with other prediction tasks in stock market

There are still so many tasks in stock market prediction that I have not had enough time to consider and experiment with. In fact, the thesis focuses on a contemporary and realizable approach to stock market prediction, which is an important basis for my application development. Moreover, professional analysis and forecasting of the stock market play an increasingly notable role in the task of macroeconomic forecasting at my Center. I do not want to miss the opportunity to contribute to this trend of development. Therefore, I am preparing for more tests with other prediction tasks in the stock market.

- Experiment on Vietnam stock market data

Obviously, I intend to
carry out some experiments on data from Vietnam's stock market. As expressed in the motivations of the thesis, I chose this research topic due to practical suggestions and requirements, as well as personal interest, given the rapid development of Vietnam's stock market. I am clearly aware of possible challenges I could face, such as problems of data (insufficiency, fraudulence, instability and noise of listed data in the stock market) and problems of information infrastructure. It is necessary and possible to seek solutions for Vietnam's stock market and to develop software supporting decision-making. I cherish the hope of applying RDM and MMDR to Vietnam's reality in the near future.

Summary

To be successful, a data mining project should be driven by the application needs, and results should be tested quickly. It is shown that applications to the stock market provide a unique environment where the efficiency of the methods can be tested instantly, not only by using traditional training and testing data but also by making a real stock forecast and testing it the same day. Stock market data are often represented as time series of a variety of attributes such as stock prices and indexes.

The thesis presents the methodology of RDM for stock market prediction, with the highlight on the MMDR method. To prepare well for approaching stock prediction using RDM, an overview of stock market prediction in data mining is provided in Chapter I through two parts: Introduction to stock market prediction; and Data mining methodology for stock market prediction. The latter helps explain the advantages and the reasons for the choice of RDM, which is better suited for stock market mining. It overcomes the limitations of attribute-based learning methods (commonly used in finance) in representing background knowledge and complex relations. Some basic problems, the theory of RDM and an algorithm for RDM are discussed in Chapter II. Relational methods such as MMDR
are equipped with a probabilistic mechanism that is necessary for time series with a high level of noise. MMDR is one of the few hybrid probabilistic relational data mining methods developed and applied to stock market data. An MMDR application to stock market price prediction is presented in Chapter III. The method is made clear through three steps: rule generating, rule learning and interval creating. Some statistical results and evaluations for the experiment conducted to demonstrate the application are also presented.

For many years, FOL methods were applied in other areas outside of the stock market. RDM based on FOL and probabilistic estimates, and the MMDR method in particular, has several important advantages known from a theoretical viewpoint. Computational experiments with simulated trading of SP500C presented in this thesis show that RDM methods are able to discover regularities in stock market time series. In the time frames of the current study I also obtained positive results using SP500C for target forecast.

In conclusion, after the thesis, I have gained considerable knowledge of both data mining and the stock market. I understand and pay more attention to daily events relating to the financial and stock markets. It is this knowledge and concern that help me direct and contribute to the task of stock market prediction. I expect that in coming years, RDM for finance and the stock market will be shaped as a distinct field that blends knowledge from finance and data mining. I also expect that RDM applications to stock market prediction will bring benefits to my work and to Vietnam's stock market.

APPENDICES

Source code

a. Function Specializerule

This function is used to generate specialized rules from a rule ($rowR).

    $ek=$rowR['ek'];
    $predobj=$rowR['predobj'];
    $ek_write=$ek."1";
    $numeR=strlen($ek);
    for ($i=0;$i= $check){$bool=0;}
    if($bool==1){
        $predobj_write=$predobj.$i.$j.$k."0";
        mysql_query("insert into $tablename_initialruleset set
            eO='$eO', ek='$ek_write', predobj='$predobj_write'");
        $predobj_write=$predobj.$i.$j.$k."1";
        mysql_query("insert into $tablename_initialruleset set eO='$eO', ek='$ek_write', predobj='$predobj_write'");
    } } } } }

Function Is_Regularity

This function is used to compute the conditional probability of a rule ($rowR).

    function Is_Regularity($rowR,$predicate,$Target,$Day){
        $eO=$rowR['eO'];
        $ek=$rowR['ek'];
        $predobj=$rowR['predobj'];
        $numpred=strlen($ek);
        $objdaynum=count($Day);
        $total=($objdaynum)*($objdaynum-1)/2;
        $sql=mysql_query("select * from subrule where numitems='$numpred'");
        while ($row=mysql_fetch_array($sql)){
            $eksub = str_split($row['subrule']);
            $leftnum=0;
            $regnum=0; //regularity
            for ($m=1 ;$m
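
The listing above breaks off inside Is_Regularity, which is described as computing the conditional probability of a rule. As a rough, self-contained illustration of that quantity only, the sketch below estimates P(THEN-part | IF-part) by counting cases. It is not the thesis's implementation: the function name conditionalProbability, the closing values, the three-day case layout and the example rule are all hypothetical. The counter names $leftnum and $regnum are borrowed from the listing, but the role given to them here is an assumption. It needs no database and assumes PHP 7.4 or later for the arrow functions.

```php
<?php
declare(strict_types=1);

/**
 * Estimate P(conclusion | premise) by counting over a list of cases.
 * Returns null when the premise never holds, because the ratio is undefined.
 * Hypothetical helper, not taken from the thesis source code.
 */
function conditionalProbability(array $cases, callable $premise, callable $conclusion): ?float
{
    $leftnum = 0; // cases where the IF-part holds (assumed meaning)
    $regnum = 0;  // cases where both the IF-part and the THEN-part hold (assumed meaning)
    foreach ($cases as $case) {
        if ($premise($case)) {
            $leftnum++;
            if ($conclusion($case)) {
                $regnum++;
            }
        }
    }
    return $leftnum === 0 ? null : (float) $regnum / $leftnum;
}

// Made-up closing values, purely for illustration (not S&P 500 data).
$close = [100.0, 101.0, 102.5, 102.0, 103.0, 104.0, 103.5, 104.5];

// Each case is a window of three consecutive days (a, b, c).
$cases = [];
for ($t = 2; $t < count($close); $t++) {
    $cases[] = [$close[$t - 2], $close[$t - 1], $close[$t]];
}

// Hypothetical rule: IF the value rose from a to b THEN it rises from b to c.
$premise = fn(array $c): bool => $c[1] > $c[0];
$conclusion = fn(array $c): bool => $c[2] > $c[1];

$p = conditionalProbability($cases, $premise, $conclusion);
echo $p === null ? "premise never holds\n" : sprintf("%.2f\n", $p);
```

Returning null when the IF-part never holds keeps such a rule from being scored at all, which is in the spirit of the remark in the conclusions that relational regularities refuse to decide when information is insufficient. A threshold such as the 0.65 mentioned in the list of tables could then be applied to the returned value.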
