A Comparison of Neural Network Architectures for Handwritten Digit Recognition

Eman M. El-Sheikh¹, Bradley A. Swain¹, and Mohamed A. Khabou²
¹ Department of Computer Science, University of West Florida, Pensacola, FL, USA
² Department of Electrical & Computer Engineering, University of West Florida, Pensacola, FL, USA

Abstract - In this paper, we describe the development and use of an artificial neural network architecture for recognizing handwritten digit data. The feed-forward neural network, which was implemented in Java, was designed to be modular and parameterized so that the network's parameters and settings could be easily modified. The network was used to recognize a large set of data representing handwritten digits, and various configurations of the network were run on the data set to determine the best settings for this type of data. A correct classification rate of 93.77% was achieved on a testing set of 3500 images not used in the training phase. We present a summary and analysis of the results.

Keywords: Machine learning, neural network, handwritten digit recognition, classification, Java application.

1 Introduction

The focus of this work is the development, use, and analysis of an artificial neural network architecture for the recognition of handwritten digit data. Neural networks have a long track record of successful use for pattern recognition problems, including character recognition. The application of recognizing handwritten digits was selected for this study because of its numerous potential applications, including mail sorting, automated check reading, and data entry for hand-held devices.

In this paper, we describe the development of a software neural network architecture. The system, which was implemented in the Java programming language, was designed to be modular and parameterized to allow easy modification of the network's size, structure, and parameters. The overall motivation for our study is the analysis of various machine learning techniques for real-world applications, with the purpose of determining the appropriateness and usefulness of each technique. Specifically, in the study described in this paper, we focused on the use of neural network learning techniques for handwritten digit recognition. Our objective was two-fold: to test the neural network implementation on a large-scale, real-world data set using a variety of network parameters and sizes, and to determine the best network structure and settings for the handwritten digit data set. The results provide evidence for the use of neural network learning techniques for this type of application and data set, and insight into the appropriate design, use, and parameterization of the network to achieve acceptable results.

In the next section, we provide a brief summary of the literature relevant to neural network techniques and applications. Section 3 describes the design, implementation, and use of the neural network. Section 4 describes the methods used for our tests, including a description of the data set, the experiments we conducted, and the results of our study. The last section presents the conclusions derived and lessons learned from analyzing the results, as well as plans for future work.

2 Literature Review

The use of neural networks for pattern recognition problems dates back to the late 1950s, when the early perceptron models were introduced by Rosenblatt. In the 1960s and 1970s, progress in neural networks slowed due to limited computing capabilities and unfounded perceptions about the limitations of neural networks. The growth of computing capabilities in the 1980s and the emphatic support of Hopfield, Rumelhart, and other researchers accelerated the development of new neural network models and theories. Today numerous neural network models and algorithms are available, including back-propagation learning, competitive learning, Kohonen learning, Hopfield networks, Boltzmann machines, shared-weight neural networks, and self-organizing neural networks [4].

Neural networks have been used in a variety of applications, ranging from stock price prediction to automatic target detection and recognition. They are especially useful as pattern classifiers and recognizers because of their robustness, fault tolerance, and universal function approximation capability. In theory, a multi-layer feed-forward neural network with enough nodes and hidden layers can approximate any bounded function and its derivative to arbitrary accuracy, given a “large enough” training set [5]. Results published in [1] state that a network with N inputs, one hidden layer of H units, and a total of W weights will require on the order of W/ε training patterns to yield an error less than ε on the test set.
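For reference, the generalization result of Baum and Haussler [1] is usually quoted with an additional logarithmic factor beyond the leading W/ε behavior kept in the text above. A hedged restatement in LaTeX (not verbatim from [1]), where m is the number of training patterns, W the number of weights, N the number of computational units, and ε the target test error:

\begin{equation*}
  % Sample-complexity bound of Baum and Haussler [1], as commonly quoted;
  % the paper above cites only the leading W/\varepsilon behavior.
  m = O\!\left(\frac{W}{\varepsilon} \, \log\frac{N}{\varepsilon}\right)
\end{equation*}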
3 Neural Network Architecture

The neural network application used for this paper was written using the Sun Java 2 Runtime Environment (v. 1.5.0), mainly on Mac OS X 10.5.1, but it has also been used successfully on several Linux distributions. In principle, the program should work on any platform for which a Java virtual machine exists, as it uses no platform-specific features or calls.

The design of the application makes as few assumptions as possible. All attribute, input, and output values are stored as real numbers, allowing for both continuous and discrete attributes. Missing attribute values, however, are not handled by the implementation. The application uses a simple feed-forward network trained with back-propagation, using the sigmoid activation function, and is designed to allow the use of a different activation function.

The system was designed to be modular and easily reconfigured. Configurable attributes of the network include the number of input neurons, the number of hidden layers, the number of neurons in each hidden layer, the number of output neurons, the learning rate (α), the threshold at which the value of the sigmoid function causes a neuron to fire, and the number of epochs for which to train the network, all of which are passed to the application as command-line parameters. Layers are fully connected: every node in a layer is connected to every node in the following layer, and the weights for these connections are initialized to random values in the [-1, 1] range.
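The paper does not include source code. The following Java fragment is a minimal sketch of the kind of network described above (fully connected layers, sigmoid activation, weights initialized uniformly in [-1, 1]); the class and method names are our own illustration, not the authors' implementation, and bias terms are omitted for brevity.

import java.util.Random;

// Illustrative sketch only: a fully connected feed-forward network with
// sigmoid activation and weights drawn uniformly from [-1, 1].
public class FeedForwardNetwork {
    private final double[][][] weights; // [layer][to][from]
    private final Random rng = new Random();

    // layerSizes = {inputs, hidden1, ..., outputs}, e.g. {60, 25, 20, 10}
    public FeedForwardNetwork(int... layerSizes) {
        weights = new double[layerSizes.length - 1][][];
        for (int l = 0; l < weights.length; l++) {
            weights[l] = new double[layerSizes[l + 1]][layerSizes[l]];
            for (double[] row : weights[l])
                for (int j = 0; j < row.length; j++)
                    row[j] = 2.0 * rng.nextDouble() - 1.0; // uniform in [-1, 1]
        }
    }

    private static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Forward pass: propagate an input vector through all layers.
    public double[] feedForward(double[] input) {
        double[] activation = input;
        for (double[][] layer : weights) {
            double[] next = new double[layer.length];
            for (int to = 0; to < layer.length; to++) {
                double sum = 0.0;
                for (int from = 0; from < activation.length; from++)
                    sum += layer[to][from] * activation[from];
                next[to] = sigmoid(sum);
            }
            activation = next;
        }
        return activation;
    }
}

Under this reading, a 25/20 configuration on the 60-feature inputs with 10 output classes would be built as new FeedForwardNetwork(60, 25, 20, 10), with the learning rate, firing threshold, and epoch count supplied in the same way as the command-line parameters described above.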
4 Methods and Results

In this section, we describe the data set used for the study, the configurations of the neural network architecture used in our experiments, and the results of our test runs.

4.1 Data

The problem of handwritten character recognition has been studied intensively for over thirty years [7]. This research has been driven by the great number of potential applications, including the recognition of handwritten addresses for mail sorting. Unlike machine-printed character recognition, the difficulty of handwritten character recognition comes from the fact that people write differently, as can be seen in Figure 1. Numerous methods have been proposed to capture the distinctive features of handwritten characters [2, 3, 6, 8]. These approaches can be classified into two categories: global analysis and structural analysis. The first category includes techniques such as template matching, moment invariants, and mathematical transforms (Fourier, Walsh, Hadamard, etc.). In the second category, the main goal is to capture the essential shape features of the characters, mainly from their skeletons or contours. Such features include loops, endpoints, junctions, arcs, concavities, convexities, and strokes. The main challenges in creating a feature-based classifier are: (1) what type of features to use, (2) how many features to use, (3) how to select the “best” features, and (4) how to define criteria for selecting the “best” features.

The use of feature vectors as input to handwritten digit classifiers has many advantages over the use of two-dimensional digit images. The most important advantage is the reduction of the dimension of the input space. The smallest “reasonable” dimensions of a digit image are 16×16 pixels, which corresponds to an input vector of dimension 256. Such an input dimension would produce a huge neural network if multiple hidden layers were used. The use of features in this case reduces the dimension of the input space significantly.

Figure 1. Representative samples from the data set.

In our experiments we used a data set consisting of 6000 unconstrained binary images of handwritten digits (600 images from each digit class) taken from actual USPS mail pieces. The data was collected at the Environmental Research Institute of Michigan (ERIM) and has been used by many researchers to test their handwriting recognition systems [3]. All images were moment-normalized to a size of 24×18 as described in [2]. A subset of this data set (2000 images) was used for training our system and a different subset (3500 images) was used for testing. The results reported in this paper were achieved on the testing set.

Three sets of features were used to train and test the neural networks. Each set contains 60 features, which have values in the [0, 1] range and represent the degree of correlation between a feature's template and the input image. A value of 0 indicates that a feature template did not match the digit image; a value of 1 indicates a complete match; values between these two extremes indicate a partial match. These 60 “best” features were selected using three measures: an information measure, an orthogonality measure, and a combination of the two.

The information measure is based on Shannon's entropy. It is a statistical measurement of the capability of a feature to separate digit classes. If the information measure of a feature is high, the feature is able to separate at least some of the classes well, since the conditional probabilities of the classes given that the feature is present or not present differ significantly from each other.
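The exact definition of the information measure is deferred to [3]. As one plausible Shannon-entropy-based reading of the description above, the Java sketch below computes the mutual information between a binarized feature response (present vs. not present) and the digit class; the 0.5 presence threshold and all identifiers are our illustrative assumptions, not the measure actually used in [3].

// Illustrative sketch: mutual information I(C;F) between a binarized
// feature response F and the digit class C. Assumes labels in 0..9.
public class InformationMeasure {

    private static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p)
            if (pi > 0.0) h -= pi * (Math.log(pi) / Math.log(2.0));
        return h;
    }

    // responses[i] = feature response in [0, 1] for sample i
    // labels[i]    = digit class (0-9) for sample i
    public static double informationMeasure(double[] responses, int[] labels) {
        int numClasses = 10;
        double[][] joint = new double[2][numClasses]; // [absent/present][class]
        for (int i = 0; i < responses.length; i++) {
            int present = responses[i] >= 0.5 ? 1 : 0;  // assumed threshold
            joint[present][labels[i]] += 1.0 / responses.length;
        }
        // Marginal distributions over class and over feature presence.
        double[] pClass = new double[numClasses];
        double[] pFeat = new double[2];
        for (int f = 0; f < 2; f++)
            for (int c = 0; c < numClasses; c++) {
                pClass[c] += joint[f][c];
                pFeat[f] += joint[f][c];
            }
        // I(C;F) = H(C) + H(F) - H(C,F); summing row entropies gives H(C,F).
        double hJoint = entropy(joint[0]) + entropy(joint[1]);
        return entropy(pClass) + entropy(pFeat) - hJoint;
    }
}

A high value of this quantity means the conditional class distributions given feature presence and absence differ strongly, matching the intuition stated above.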
The orthogonality measure is based on the mathematical property that the best basis for representing an n-dimensional space consists of n orthogonal vectors (i.e., vectors whose dot products equal 0). The orthogonality measure yields features that do not respond similarly to an input image: the response of one feature is high when the other is low on some classes, and vice versa on the other classes. Features selected by the orthogonality measure therefore provide discrimination capability, since they behave differently on each class. The third measure we used is a combination of the information and orthogonality measures. More details about the measures and the selection process can be found in [3].

4.2 Experiments

We ran the networks on the handwritten digit data, training on the first 2000 digits in each file and testing on the final 3500 digits. To compare architectures, we ran tests using two hidden layers with 25 and 20 nodes, two hidden layers with 25 and 15 nodes, one hidden layer with 25 nodes, and one hidden layer with 15 nodes. The other network parameters were fixed for all runs of all architectures. After trial runs with different values, a learning rate of 0.05 was used, as it was small enough to avoid saturating the network but large enough to provide ample learning. The number of training epochs was set at 500, as experimentation showed this to give the best testing results, compared to 200 epochs (under-trained) and 750 or 1000 epochs (over-trained). Finally, for all tests the firing threshold of the activation function of the hidden neurons was set at 0.9, which experimentally gave the best results.
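For concreteness, the following Java sketch illustrates the training loop just described: 500 epochs of on-line back-propagation with a learning rate of 0.05, shown here only for a single sigmoid unit (the delta rule). In the full multi-layer network, the error term is propagated backward through each hidden layer. All identifiers are hypothetical; the authors' code is not published in the paper.

// Illustrative sketch of the training setup: alpha = 0.05, 500 epochs,
// on-line (per-example) updates for one sigmoid unit.
public class TrainingSketch {

    private static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // weights: incoming weights of one sigmoid unit;
    // inputs[i]/targets[i]: one training example and its target output.
    public static void train(double[] weights, double[][] inputs, double[] targets) {
        final double alpha = 0.05;  // learning rate used in the paper
        final int epochs = 500;     // epoch count used in the paper
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int i = 0; i < inputs.length; i++) {
                double net = 0.0;
                for (int j = 0; j < weights.length; j++)
                    net += weights[j] * inputs[i][j];
                double out = sigmoid(net);
                // Delta rule: gradient of the squared error through the sigmoid.
                double delta = (targets[i] - out) * out * (1.0 - out);
                for (int j = 0; j < weights.length; j++)
                    weights[j] += alpha * delta * inputs[i][j];
            }
        }
    }
}

In this reading, the 0.9 firing threshold described above would be applied to the hidden neurons' sigmoid outputs to decide whether they fire.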
Table 1. Summary of performance results for each architecture configuration and each feature data set.

Hidden Layer(s)   Info Features   Time (mm:ss)   Orthogonal Features   Time (mm:ss)   Combination Features   Time (mm:ss)
25/20             90.63%          04:18          93.77%                04:20          93.74%                 04:15
25/15             90.83%          03:56          93.34%                03:59          93.34%                 03:59
25                90.54%          03:18          93.57%                03:19          93.43%                 03:20
15                89.91%          02:08          92.86%                02:08          92.83%                 02:08

4.3 Results

The results of the experiments are summarized in Table 1. The table shows the percentage of handwritten digit samples correctly classified for different configurations of the architecture. For each row, the first column shows the architecture of the network used: the first two rows used two hidden layers, while the last two used only one hidden layer. The remaining columns consist of accuracy and time pairs, giving the percentage of the test data correctly classified and the amount of time the network took to train and test. Each architecture and data set combination was run 5 times, and the best result is presented for each configuration tested.

Based on previous work with these data sets, the two-hidden-layer architecture with 25 nodes in the first layer and 15 nodes in the second was expected to perform best. For the information features data set this held true, but for the other data sets it actually performed worse than both the 25/20 two-hidden-layer architecture and the single hidden layer with 25 nodes. For those data sets, the 25/20 architecture outperformed the other network architectures, with only a minor increase in time over the 25/15 architecture. With slightly reduced performance, the single hidden layer with 25 nodes performs nearly as well as the 25/20 architecture while reducing the running time by as much as a minute.

5 Conclusions

Overall, the implemented system was able to successfully classify the test data using each of the three feature sets. The single-hidden-layer network with 15 nodes performed better on the orthogonal and combination features data sets than on the information features data set, achieving prediction accuracies of 92.86% and 92.83%, respectively. The information features set did not perform as well because some of its features were almost identical and hence provided no additional information to the network. Adding more nodes to the single hidden layer improved the accuracy slightly: with 25 nodes in the hidden layer, the accuracy improved to 90.54% for the information features, 93.57% for the orthogonal features, and 93.43% for the combination features data set. This slight improvement in accuracy comes at the price of increased time requirements, since the running time increased by over a minute.

It is interesting to note that expanding the architecture to two hidden layers with 25 nodes in the first layer and 15 nodes in the second increases the running time but does not improve the classification accuracy for the orthogonal and combination features data sets, and has only a slight benefit for the information features data set, improving the accuracy from 90.54% to 90.83%. Increasing the number of nodes in the second hidden layer from 15 to 20 provides the best classification accuracy for the orthogonal and combination features data sets, namely 93.77% and 93.74% respectively, but does not improve the prediction rate for the information features data.

The orthogonal and combination features data sets gave better results than the information features data set for all configurations of the network. This was somewhat expected, since the selection process can, in theory, yield very similar information features that provide no extra class discrimination to the neural network. This possibility is absent for the orthogonality and combined features, since the feature selection process using these two measures inherently discourages similar features. The orthogonal and combination features data sets gave fairly comparable results in terms of both accuracy and running time. The single hidden layer with 25 nodes performs nearly as well as the two-hidden-layer architecture with 25 and 20 nodes on all three data sets, while reducing the running time by as much as a minute.

We would like to continue testing the developed system with other large data sets for handwritten digit and character recognition. We are also interested in exploring the combination of different machine learning techniques for handwritten digit recognition and similar character recognition problems. More specifically, we are interested in developing a system that integrates a decision tree learning technique with a neural network architecture, and testing it with the three data sets used in this study to compare its performance with that of a decision tree or neural network used separately. We anticipate that such a fusion of machine learning techniques will yield better results than either technique individually, and will allow the recognition of larger, more complex data sets.
6 References

[1] E. Baum and D. Haussler, “What Size Net Gives Valid Generalization?” Neural Computation, vol. 1, no. 1, pp. 151-160, 1989.
[2] P. D. Gader, B. Forester, M. Ganzberger, A. Gillies, B. Mitchell, M. Whalen, and T. Yocum, “Recognition of Handwritten Digits Using Template and Model Matching,” Pattern Recognition, vol. 24, pp. 421-432, 1991.
[3] P. D. Gader and M. A. Khabou, “Automatic Feature Generation for Handwritten Digit Recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 12, pp. 1256-1262, 1996.
[4] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan, 1994.
[5] K. Hornik, M. Stinchcombe, and H. White, “Universal Approximation of an Unknown Mapping and Its Derivatives Using Multilayer Feedforward Networks,” Neural Networks, vol. 3, pp. 551-560, 1990.
[6] C. Y. Suen, “Distinctive Features in Automatic Recognition of Handprinted Characters,” Signal Processing, vol. 4, pp. 193-207, 1982.
[7] C. Y. Suen, “Character Recognition by Computer and Applications,” in Handbook of Pattern Recognition and Image Processing, Academic Press, pp. 569-586, 1986.
[8] C. Y. Suen, C. Nadal, R. Legault, T. A. Mai, and L. Lam, “Computer Recognition of Unconstrained Handwritten Numerals,” Proc. of the IEEE, vol. 80, no. 7, pp. 1162-1180, 1992.
