Analysis and semi-automated detection of design-level similarity patterns in software

ANALYSIS AND SEMI-AUTOMATED DETECTION OF DESIGN-LEVEL SIMILARITIES IN SOFTWARE

HAMID ABDUL BASIT
(B.S. Engg., GIK Institute of Engineering Sciences & Technology, Pakistan)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2006

Acknowledgements

First of all, I am thankful to Allah, the most Magnificent, for His countless blessings that He continues to shower upon me.

I am deeply indebted to my PhD supervisor, Dr. Stanislaw Jarzabek, for all the help that he rendered during the course of this thesis: the guidance, the insight, the continuous encouragement, and the trust.

This work would not have been possible without expert guidance from Prof. Bill Smyth and timely help from Simon Puglisi, for which I am truly grateful.

Many thanks are due to the thesis advisory committee members, Dr. Jin Song Dong and Dr. Irene Woon, for their useful feedback during the course of this project.

I also owe special thanks to my colleague Damith Chatura Rajapakse for his wonderful company, help and feedback on my work.

I am also very grateful to the HYP and UROP students whom I supervised, Melvin Low Jen Ku, Goh Kwan Kee, Chan Jun Liang, and Zhang Yali, for their hard work and invaluable contribution to this project.

Finally, I am thankful to all my family members and especially my wife, Sidra, for being with me through thick and thin.

Table of Contents

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
SUMMARY
LIST OF TABLES
LIST OF FIGURES

CHAPTER 1. INTRODUCTION
  1.1 OPEN CHALLENGES
  1.2 THE GOALS, SCOPE AND CONTRIBUTIONS OF THIS THESIS
  1.3 OUTLINE OF THE THESIS

CHAPTER 2. CLONING – OVERVIEW AND RELATED WORK
  2.1 TYPES OF SIMPLE CLONES
  2.2 REASONS FOR CLONES
  2.3 NEGATIVE IMPACT OF CLONES
  2.4 CLONE DETECTION
    2.4.1 Program Representation
    2.4.2 Generality
    2.4.3 Granularity of Detected Clones
  2.5 CLONE MANAGEMENT
    2.5.1 Preventive Clone Management
    2.5.2 Corrective Clone Management
    2.5.3 Compensatory Clone Management
  2.6 SIMPLE CLONE TAXONOMIES
  2.7 HIGHER LEVEL CLONES AND DESIGN RECOVERY
  2.8 CONCLUSIONS

CHAPTER 3. STRUCTURAL CLONES – HIGHER LEVEL SIMILARITIES IN PROGRAMS
  3.1 INTRODUCTION AND MOTIVATION
  3.2 FROM SIMPLE CLONES TO STRUCTURAL CLONES
    3.2.1 Clones
    3.2.2 Program Structures
    3.2.3 Structure Hierarchies
    3.2.4 Structural Clones
  3.3 EXAMPLES OF STRUCTURAL CLONES
    3.3.1 Acknowledgement
    3.3.2 A File-Level Structural Clone
    3.3.3 A Module-Level Structural Clone
    3.3.4 Multiple Structural Clones in the Same File
    3.3.5 Crosscutting Structural Clones
    3.3.6 Heterogeneous Entity Structural Clones
    3.3.7 Structural Clones Based on Inheritance Hierarchy
    3.3.8 Structural Clone Spanning Multiple Layers
  3.4 TOWARDS CLASSIFICATION OF STRUCTURAL CLONES
  3.5 CONCLUSIONS

CHAPTER 4. EFFICIENT TOKEN-BASED DETECTION OF SIMPLE CLONES
  4.1 ACKNOWLEDGEMENTS
  4.2 INTRODUCTION
  4.3 FLEXIBLE TOKENIZATION
    4.3.1 Tokenization Example
  4.4 EFFICIENT CLONE DETECTION
    4.4.1 Basic Repeat Finding Algorithm
  4.5 CURBING FALSE POSITIVES
  4.6 CONCLUSION

CHAPTER 5. DETECTING STRUCTURAL CLONES WITH DATA MINING
  5.1 SCOPE OF THE TECHNIQUE
  5.2 RE-ORGANIZING THE DATA
  5.3 FINDING RECURRING PATTERNS OF SIMPLE CLONE CLASSES
  5.4 CLUSTERING HIGHLY CLONED FILES
  5.5 RAISING THE ABSTRACTION – ANALYZING DIRECTORIES
  5.6 METHOD LEVEL ANALYSIS
  5.7 CONCLUSION

CHAPTER 6. TOOL IMPLEMENTATION
  6.1 TOOL IMPLEMENTATION
  6.2 OUTPUT FORMAT
  6.3 PERFORMANCE OF SIMPLE CLONE DETECTION
  6.4 PERFORMANCE OF STRUCTURAL CLONE DETECTION
  6.5 CONCLUSION

CHAPTER 7. STRUCTURAL CLONE ANALYSIS TECHNIQUES
  7.1 NEED FOR CLONE ANALYSIS
  7.2 CLONE ANALYSIS TECHNIQUES
    7.2.1 Ease of Clone Analysis
    7.2.2 Overview of Cloning Intensity
    7.2.3 Clones Manipulation Features
    7.2.4 Refocusing the Detection
    7.2.5 Tool Implementation
  7.3 CONCLUSIONS

CHAPTER 8. APPLICATIONS
  8.1 PROGRAM UNDERSTANDING
  8.2 IMPROVING MAINTAINABILITY OF CODE
    8.2.1 Refactoring
    8.2.2 Creating Generic Representation
    8.2.3 Change Impact Analysis
  8.3 REENGINEERING FOR REUSE
  8.4 CONCLUSION

CHAPTER 9. EXPERIMENTATION
  9.1 CORRECTNESS VALIDATION
  9.2 USEFULNESS VALIDATION
  9.3 QUALITATIVE ANALYSIS
    9.3.1 Eclipse Graphical Editing Framework
    9.3.2 Eclipse Visual Editor
    9.3.3 OpenJGraph 0.9.2
    9.3.4 J2ME Wireless Toolkit 2.2
    9.3.5 Java Pet Store 1.3.2
  9.4 COVERAGE ANALYSIS
  9.5 CONCLUSIONS

CHAPTER 10. CONCLUSIONS AND FUTURE WORK

BIBLIOGRAPHY

APPENDIX A. SURVEY OF CLONE DETECTION TECHNIQUES
  A.1 DUPLOC
  A.2 FINGERPRINTING TECHNIQUE
  A.3 WEB CLONE DETECTOR
  A.4 CCFINDER
  A.5 DUP
  A.6 DOTPLOT
  A.7 AST BASED TECHNIQUE
  A.8 METRICS BASED TECHNIQUE BY MAYRAND ET AL.
  A.9 METRICS BASED TECHNIQUE BY KONTOGIANNIS ET AL.
  A.10 DYNAMIC PROGRAMMING TECHNIQUE BY KONTOGIANNIS ET AL.
  A.11 DYNAMIC PROGRAMMING TECHNIQUE BY BALAZINSKA ET AL.
  A.12 PDG BASED TECHNIQUE BY KOMONDOOR ET AL.
  A.13 PDG BASED TECHNIQUE BY KRINKE
  A.14 NEURAL NETWORK BASED TECHNIQUE

APPENDIX B. CASE STUDIES IN TYPE PARAMETERIZATION MECHANISMS
  B.1 STUDYING JAVA GENERICS WITH BUFFER LIBRARY
    Acknowledgement
    Study Overview
    Buffer Library
    Can We Have a Generic Buffer Library?
  B.2 STUDY OF CLONES IN THE STL
    Acknowledgements
    Introduction and Motivation
    Structure of the STL
    Study Methodology
    Analysis of Clones in the STL
    Effects of Clones in the STL
    XVCL solution
    Discussion of Results
  B.3 CONCLUSIONS
Summary

Code clones are similar program structures of any type and granularity recurring in variant forms in a program. Cloning in software systems is known to create problems during software maintenance. Several techniques have been proposed to detect the same or similar code fragments in software, henceforth called simple clones, with some gains in helping to reduce update anomalies and the software size. Further gains, however, can be obtained by elevating the level of clone analysis. We observed that recurring patterns of simple clones may indicate the presence of interesting higher-level similar program structures that often map to design or application domain concepts. We call these high-level similarities structural clones. Detection of these structural clones leads to a better understanding of the design of the system, which helps in day-to-day software maintenance, long-term evolution and re-engineering. Unification of structural clones with generic program structures offers interesting opportunities for program simplification and reuse.

In this thesis, we first present an efficient token-based technique for simple clone detection, based on recent advances in string pattern matching algorithms and data structures. Next, we define a class of useful structural clones and propose a technique to detect them systematically. We consider structural clones formed by groups of highly similar methods, classes or source files, and their recurring patterns in various parts of the system. Here, the novelty of our approach lies in formulating the concept of structural clone, in applying data mining techniques to detect such clones, and in applying visualization and analysis techniques to further improve the effectiveness of structural clone detection with the involvement of human experts. We implemented the proposed method for structural clone detection in a tool called Clone Miner.
Finally, we validated the usefulness of the proposed method via experimentation, showing that Clone Miner finds many useful structural clones and scales up to big programs.

This thesis advances the state of the art in clone detection and design recovery research as follows:

First, our technique for simple clone detection is more efficient than other tools described in the literature, due to our choice of suffix arrays as the data structure and a novel algorithm for finding maximal repeats. Clone Miner is also more flexible than other tools in customizing the clone detection process.

Second, we introduce the concept of structural clone, which extends research on cloning from similar code fragments to similar program structures of any kind and granularity, potentially more meaningful than just similar code fragments. Clone Miner provides practical means to detect structural clones in a semi-automated process that involves data mining techniques at the initial stage, followed by user-assisted visualization, abstraction and filtering techniques.

Third, with the concept of structural clone, we revisit research on reverse engineering and design recovery, which has received much attention in the last decades. Despite much work, not many practical and scalable techniques have been transferred from labs to programming practice. It appears that structural clones often represent important concepts from the application domain or design. Clone Miner offers a pragmatic and scalable method to recover these concepts, providing developers with information that is vital in program understanding, evolution and re-engineering.

Finally, structural clones offer opportunities for unconventional reuse that reaches beyond the reuse rates achievable with architecture-centric, component-based approaches. Unification of structural clones with generic structures also reduces the cognitive complexity of programs.
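As a rough illustration of why a suffix array suits token-based clone detection, the sketch below sorts all suffixes of a token sequence and reports lexicographically adjacent suffixes that share a long common prefix. It is a minimal sketch under stated assumptions, not the algorithm of this thesis: the names Repeat, buildSuffixArray, commonPrefix and findRepeats are hypothetical, the construction is deliberately naive (quadratic or worse), and the reported repeats are neither checked for maximality nor filtered for overlap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// A repeated run of tokens (hypothetical type): the run starts at positions
// `first` and `second` (first < second) and is `length` tokens long.
struct Repeat {
    std::size_t first;
    std::size_t second;
    std::size_t length;
};

// Naive suffix array: sort the suffix start positions lexicographically.
// Illustration only; efficient constructions exist but are not shown here.
std::vector<std::size_t> buildSuffixArray(const std::vector<int>& tokens) {
    using Diff = std::vector<int>::difference_type;
    std::vector<std::size_t> sa(tokens.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&tokens](std::size_t a, std::size_t b) {
        return std::lexicographical_compare(
            tokens.begin() + static_cast<Diff>(a), tokens.end(),
            tokens.begin() + static_cast<Diff>(b), tokens.end());
    });
    return sa;
}

// Number of leading tokens shared by the suffixes starting at a and b.
std::size_t commonPrefix(const std::vector<int>& tokens,
                         std::size_t a, std::size_t b) {
    std::size_t n = 0;
    while (a + n < tokens.size() && b + n < tokens.size() &&
           tokens[a + n] == tokens[b + n]) {
        ++n;
    }
    return n;
}

// Report every pair of adjacent suffixes (in suffix-array order) that share
// at least minLength tokens. No maximality or overlap filtering is done.
std::vector<Repeat> findRepeats(const std::vector<int>& tokens,
                                std::size_t minLength) {
    assert(minLength > 0);
    std::vector<Repeat> repeats;
    const std::vector<std::size_t> sa = buildSuffixArray(tokens);
    for (std::size_t i = 1; i < sa.size(); ++i) {
        const std::size_t len = commonPrefix(tokens, sa[i - 1], sa[i]);
        if (len >= minLength) {
            repeats.push_back({std::min(sa[i - 1], sa[i]),
                               std::max(sa[i - 1], sa[i]), len});
        }
    }
    return repeats;
}
```

For the token sequence 1 2 3 9 1 2 3 and a minimum length of 3, findRepeats reports a single repeat of length 3 at positions 0 and 4. Similar fragments end up next to each other in the sorted order, which is what makes the data structure attractive; a practical detector would still need an efficient construction and a maximality check.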
List of Tables

TABLE 1: A SAMPLE REPRESENTATION OF TOKEN CLASSES WITH TOKEN SYMBOLS
TABLE 2: LANGUAGE TOKENS
TABLE 3: CASE STUDY SYSTEMS
TABLE 4: PERFORMANCE OF SIMPLE CLONE DETECTION
TABLE 5: PERFORMANCE OF STRUCTURAL CLONE DETECTION
TABLE 6: XVCL COMMANDS
TABLE 7: CLONE CLUSTER ANALYSIS OF J2SE 1.5
TABLE 8: CLONING ACROSS AND WITHIN MODULES OF CAP-WP
TABLE 9: CLONE DETECTION RESULTS ON INDIVIDUAL MODULES OF CAP-WP [GOH06]
TABLE 10: CASE STUDY SYSTEMS
TABLE 11: SIMPLE CLONE CLASSES (SCC)
TABLE 12: SIMPLE CLONE STRUCTURES (SCS)
TABLE 13: FILE CLONE CLASSES (FCC)
TABLE 14: FILE CLONE STRUCTURES (FCS)
TABLE 15: METHOD CLONE CLASSES (MCC)
TABLE 16: METHOD CLONE STRUCTURES
TABLE 17: CLONING STATISTICS IN CASE STUDY SYSTEMS
TABLE 18: SUMMARY OF CLONING IN THE STL
TABLE 19: FEATURE COMBINATIONS OF ASSOCIATIVE CONTAINERS
Appendix B. Case Studies in Type Parameterization Mechanisms

    public int get(int i) {
        return Bits.swap(unsafe.getInt(ix(checkIndex(i))));
    }

Figure 65: Method get(int) of DirectIntBufferS

    public float get(int i) {
        return Bits.swap(unsafe.getFloat(ix(checkIndex(i))));
    }

Figure 66: Method get(int) of DirectFloatBufferS

To unify these two methods into a generic method, we need to unify the getInt() and getFloat() methods as well. Sometimes this is not possible: these two methods can be out of scope or they can be generic-unfriendly.

B.2 Study of Clones in the STL

Acknowledgements

This work was conducted jointly with Damith C. Rajapakse.

Introduction and Motivation

In class libraries, clones often stem from the well-known “feature combinatorics” problem [BSST93][Big94][JL03]. A proper parameterization mechanism can combat this emergence of clones, increasing software reuse and easing software maintenance. At the language level, generics (in Ada, Eiffel, Java and C# [KS01]) and templates (in C++) are the main parameterization techniques.

In our previous case study, we experimented with the proposed generics in Java. We tried to unify classes in the Java Buffer Library that differed in the type of a buffer element. We observed that type variation also triggered many other non-type parametric differences among similar classes, hindering the application of generics. As a result, despite striking similarities across library classes, only a small part of the library could be transformed into generic classes.

Careful examination revealed that most of the issues that hindered a complete generic solution for the library were specific to Java generics. However, some other issues were of a more fundamental nature. We thought further work was needed to draw the fine line between the two.
The Standard Template Library (STL; home page of SGI STL: http://www.sgi.com/tech/stl/) provides a perfect example to strengthen the observations made in the Buffer Library case study. Firstly, the parameterization mechanism of C++ templates is more powerful than that of Java generics. Due to the light integration of templates with the C++ language core, template parameters are less restrictive than parameters of Java generics. Unlike Java generics, C++ templates also allow constants and primitive types to be passed as parameters. Secondly, the STL not only uses the most advanced template features and design solutions (e.g., iterators), but it is also widely accepted in the research and industrial communities as a prime example of the generic programming methodology.

The STL needs genericity for simple and pragmatic reasons: there are plenty of algorithms that need to work with many different data structures. Without generic containers and algorithms, the size and complexity of the STL would be enormous. Such a simple-minded solution would unwisely ignore the similarity among data structures, and also among algorithms applied to different data structures, which offers endless reuse opportunities. Redundant code arising from unexploited similarities would contribute much to the STL’s size and complexity, hindering its evolution. The objective of the STL was to avoid these complications without compromising efficiency.

Still, we found much cloning in the STL. Our study confirmed that these clones varied in certain ways that could not be easily unified by template parameters. To demonstrate that such unification was feasible and beneficial, we built a clone-free representation with a meta-level parameterization supported by XVCL. With meta-level unification of clones, we can avoid template-unfriendly clones, while still retaining the simple design and the efficiency of the source code that are the hallmark of the STL.
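The remark above that C++ templates accept constants and primitive types as parameters can be made concrete with a small sketch. FixedBuffer is a hypothetical class invented for this illustration; it is not taken from the STL or from the Java Buffer Library. Its element type and its capacity are both template parameters, so one definition serves int and float buffers of different sizes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-capacity buffer. T may be a primitive type such as int
// or float; N is a compile-time constant passed as a template parameter.
template <typename T, std::size_t N>
class FixedBuffer {
public:
    // Append a value; returns false when the buffer is already full.
    bool put(T value) {
        if (size_ == N) {
            return false;
        }
        data_[size_++] = value;
        return true;
    }

    // Read the element at index i, which must refer to a stored element.
    T get(std::size_t i) const {
        assert(i < size_);
        return data_[i];
    }

    std::size_t size() const { return size_; }

    // The capacity is known at compile time.
    static constexpr std::size_t capacity() { return N; }

private:
    std::array<T, N> data_{};
    std::size_t size_ = 0;
};
```

A FixedBuffer<int, 2> and a FixedBuffer<float, 4> are generated from the same source text. Java generics cannot be instantiated with a primitive type or with a constant in this way, which is one of the differences between the two mechanisms that the study relies on.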
In this chapter, we discuss trade-offs between template-based and meta-level parameterization mechanisms.

Structure of the STL

The Standard Template Library (STL) is a general-purpose library of algorithms and data structures. It consists of containers, algorithms, iterators, function objects and adaptors. Algorithms and data structures commonly used in computer science are provided in the STL. All the components of the library are heavily parameterized to make them as generic as possible. A major part of the STL is also incorporated in the C++ Standard Library. A full description of the STL is beyond the scope of this chapter and can be found on the SGI STL website. We provide enough description here to facilitate the understanding of the experiment that is described next.

Generic containers form the root of the STL. These are either sequence containers or associative containers. In sequence containers, all members are arranged in some order. In associative containers, the elements are accessed by some key and are not necessarily arranged in any order. All the STL containers are parameterized by type so that a single implementation of the container template can be used for all types of contained elements.

The second major component in the STL is the algorithms that work on the generic containers. Algorithms in the STL are decoupled from the containers, and are implemented as global functions rather than member functions. Further generalization of algorithms is achieved by implementing them to work on a range of elements rather than on the container that holds those elements.

Iterators are used in the STL to achieve the decoupling of algorithms from containers. Iterators are a generalization of pointers in C++. This ensures that all algorithms that take an iterator as a parameter also work with normal pointers. Iterators provide an abstraction of the containers free of their storage details.
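The decoupling described above can be sketched with one small range algorithm. countEqual and usageSketch are hypothetical names chosen for this illustration (the standard library offers std::count for the same purpose). The function template uses only the !=, ++ and * operations of its iterator type, so it works unchanged on a raw array, a std::vector and a tree-based std::multiset.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical range algorithm: counts the elements in [first, last) that
// compare equal to value. It knows nothing about the underlying container.
template <typename Iterator, typename T>
std::size_t countEqual(Iterator first, Iterator last, const T& value) {
    std::size_t n = 0;
    for (; first != last; ++first) {
        if (*first == value) {
            ++n;
        }
    }
    return n;
}

// Usage sketch: the same template serves pointers and container iterators.
std::size_t usageSketch() {
    const int raw[] = {1, 2, 2, 3};
    const std::vector<int> vec{2, 4, 2};
    const std::multiset<int> tree{2, 5, 2, 2};

    const std::size_t inArray = countEqual(raw, raw + 4, 2);               // plain pointers
    const std::size_t inVector = countEqual(vec.begin(), vec.end(), 2);    // contiguous storage
    const std::size_t inTree = countEqual(tree.begin(), tree.end(), 2);    // tree traversal

    assert(inArray == 2 && inVector == 2 && inTree == 3);
    return inArray + inVector + inTree;
}
```

In the pointer case, ++first advances an address; in the std::multiset case, the same expression moves to the next node of the underlying tree, and the algorithm text is identical in both cases.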
For example, the operator ++ of an iterator for a linear container will simply increment a pointer, while the same operator will perform a tree walk on a tree container.

Study Methodology

We analyzed the STL code from the SGI website. Our analysis went through two stages: (1) automatic detection of similar code fragments, such as class methods or parts of them (so-called simple clones), and (2) manual domain analysis with the objective of discovering design-level similarities. A group of similar associative container templates is an example of such a design-level similarity found in the STL.

For clone detection we used CCFinder [KKI02]. CCFinder can find simple clones, that is, code fragments that differ in parametric ways. Since container classes form the backbone of the STL, they were the first to be analyzed for clones. CCFinder revealed a lot of clones when the minimum clone size was set at 30 tokens. When it was set at 50 tokens, the smaller clones were filtered out.

Examination of the clones revealed that cloning in container classes was not an ad hoc phenomenon. We found extensive cloning in the associative containers and in the container adaptors stack and queue. We did not find significant cloning in the algorithms (in file ‘stl_algo.h’). Some clones were observed in the set functions, e.g., set union, set intersection, set difference and set symmetric difference, but they were restricted to the checking of pre-conditions rather than the actual implementation of the algorithm. Iterators were also relatively clone-free, but the supporting files ‘type_traits.h’ and ‘valarray’ exhibited excessive cloning.

Having identified clones, we studied the nature of variations among them, and tried to understand the reasons why cloning occurred. Heavily cloned areas led us to identify groups of templates that exposed enough similarity to become candidates for generic design solutions.
We also analyzed the impact of both simple and structural clones on understanding and evolution of the STL.

Finally, we built a clone-free representation for the STL templates under study. For this, we applied the meta-level parameterization technique of XVCL.

Analysis of Clones in the STL

In this section, we give examples of clones we found, and possible causes for their presence in the STL. Then, we comment on the problems such clones may cause.

Cloning in Containers

CCFinder detected a substantial amount of cloning in the container classes, as shown in Table 18.

Table 18: Summary of cloning in the STL

File group               No. of files   Clone pairs (>= 30 tokens)   Clone pairs (>= 50 tokens)
Associative Containers   [...]          616                          94
All Containers           21             1051                         171
All Analyzed Files       481            1273                         204

Table 18 shows that, of the total number of clone pairs detected by CCFinder, the majority are present in the container classes.

Given next are some interesting simple clones that were detected in the containers. Differences in operator symbols were a common variation. Figure 67 shows a generic form of such clones. @op marks the two variation points of this set of clones.

    template <class _Key, class _Compare, class _Alloc>
    inline bool operator@op(const set<_Key,_Compare,_Alloc>& __x,
                            const set<_Key,_Compare,_Alloc>& __y) {
      return __x._M_t @op __y._M_t;
    }

Figure 67: A clone that varies by operators

Figure 68 shows two clone examples where @op is ‘==’ and [...]

[...] In Proceedings of the 27th International Conference on Software Engineering (ICSE), pages 451-459, May 2005.

[BRJ05b] Basit, H. A., Rajapakse, D. C., and Jarzabek, S. An empirical study on limits of clone unification using generics. In Proceedings of the 17th International Conference on Software Engineering and Knowledge Engineering (SEKE), pages 109-114, July 2005.

1.3 Outline of [...]
Basit, H. A., and Jarzabek, S. Detecting higher-level similarity patterns in programs. In Proceedings of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE), pages 156-165. Lisbon, Portugal, September 2005. ACM Press.

[BRJ05a] Basit, H. A., Rajapakse, D. C., and Jarzabek, S. Beyond templates: a study of clones in the STL and some general [...]

[...] existing code and do the required changes instead of putting in more time and thought in designing a generalized solution.

Cloning comes naturally: When adding functionality similar to an existing logic in the system, the natural instinct of a programmer is to copy, paste and modify the existing code to meet the new requirement.

Coding style: Following a coding style leads to the appearance of clones in the [...]

[...] legacy software systems. By cloning code sections, files and designs, programmers end up maintaining software that is overly complex, error-prone and difficult to change. Cloning complicates software by increasing program size. It also increases the risk of update anomalies, as the location of cloned structures may not be known. However, there are positive aspects of clones too. Sometimes cloning is done intentionally [...]

[...] ever increasing role of computing in all aspects of life, more and more software is being created. A large part of software costs goes to the maintenance of a system rather than its initial development [Som96]. A significant amount of legacy code developed many years ago is still operational and plays a critical role in businesses and industries. Even when new software is developed, the pressure to meet the schedule is intense [...]
[...] copy and paste.

Difficulty in understanding software: Cloning makes the system larger in size and complexity, making it harder to understand.

Difficulty in bug fixing: If a bug fix is required in a fragment of code that is cloned at several places, an analysis of all the other copies is necessary to avoid update anomalies. A study of multiple releases of a large software system showed that programmers often [...]

[...] aliasing, resulting in hidden bugs that show up later.

Increase in code size: There is a considerable increase in the size of the source code because of cloning. This, in turn, increases compile time and the size of the executable.

2.4 Clone Detection

Several tools and techniques have been proposed and applied in practical situations to detect the clones in real software systems. The characteristic features of clone [...]

[...] Refactoring based on design techniques (design patterns, inheritance with dynamic binding) is a clone unification option that is closely tied to the design of the program. To eliminate the redundant code in a Java software system, Balazinska et al. [BMD+99a][BMD+00] applied refactoring based on the ‘strategy’ and ‘template’ design patterns, by factoring out the commonalities of methods and parameterizing [...]

[...] exists in almost all kinds of software systems because of the presence of certain inherent similarities. Similar design solutions are repeatedly applied to solve similar problems. Programmers often find themselves solving similar design problems by copying existing code or writing similar code all over again. Architecture-centric and pattern-driven development further encourages standardization of program [...]
[...] improve design modularity, as found in the Java Buffer Library case study by Jarzabek et al. [JL03]. Cloning may also be used to enhance performance by avoiding function calls and inlining the functions. Cloning can also help in better program understanding, by keeping the software architectures simple and avoiding complicated abstractions [KG06]. Clone detection and analysis is currently an active area of research, [...]
