Comparing convolutional neural networks in Vietnamese scene text recognition


Le Ngoc Thuy*

Abstract: Scene text recognition is a challenging task for the research community, especially for scripts with diacritical marks such as Vietnamese. In this paper, two different convolutional network architectures for recognising Vietnamese text in natural scenes are presented. Experiments are conducted to compare the performance of the two networks in reading Vietnamese restaurant signs. Experimental results show that the deeper network outperforms the other in both recognition accuracy and computational time.

Keywords: Scene text recognition, Optical character recognition, Convolutional neural networks.

1. INTRODUCTION

Reading text in natural scene images refers to the problem of converting image regions into strings. Scene text recognition is a crucial issue in many useful applications, including automatic sign translation, text detection systems for the blind, intelligent driving assistance, and content-based image/video retrieval. Hence, scene text recognition has received increasing interest from the research and industry communities in recent years.

Although scene text recognition seems similar to optical character recognition (OCR), reading text in natural scene images is much more challenging. One of the leading commercial OCR engines, ABBYY FineReader, claims the capability of transforming scanned documents, including graphics and images, into text with an accuracy of 99.8%. However, its character recognition accuracy is as low as 21% in scene text applications [1]. The difficulty of scene text recognition results from the three following facts. Firstly, the appearance of characters often varies drastically in font, colour and size, even within the same image. Secondly, the text in captured images is affected by various factors, such as blur, distortion, non-uniform illumination, occlusion and complex backgrounds. Lastly, there are other objects in the captured image which make the problem more challenging.

Numerous studies have dealt with scene text detection and recognition during the last two decades, but most of the existing methods and benchmarks have focused on text in English. There have been only a few efforts addressing scene text detection and recognition for language scripts with diacritics [2]. The results of the ICDAR 2013 Robust Reading Competition showed that the participating methods usually failed to detect the dots of the letters "i" and "j" [3]. Hence, it is likely that most current scene text detection and recognition methods would fail to recognise the tiny atoms of language scripts with diacritics, such as Vietnamese, Thai and Arabic (Figure 1), if they were applied to these languages directly. For instance, commercial OCR software works well with scanned English documents but still makes significant errors when transforming scanned Vietnamese documents into text, mainly on letters with diacritics. Moreover, some Vietnamese words may consist of one letter with two diacritics above or below it. This distinctive characteristic makes Vietnamese script recognition more challenging than that of most other scripts.

As numerous researchers have devoted themselves to detecting and recognising scene text, many papers have provided comprehensive surveys on these problems [4-11]. The most comprehensive survey [4] reviews more than 200 papers, which are classified into two groups.
The first group comprises stepwise methodologies, which address the problem of reading scene text in four separate steps: localization, verification, segmentation and recognition. The advantages of stepwise methodologies are their computational efficiency and their capability of processing oriented text. However, their disadvantages are the complexity of integrating different techniques from all four steps and the difficulty of optimizing the parameters of all steps at the same time. The other group comprises integrated methodologies, which identify specific words in images with character and language models. While integrated methodologies have a clear advantage in optimizing parameters for the whole solution, they are often computationally expensive and limited to a small lexicon of words.

Figure 1. The same sentence in different languages: English, Arabic, Slovakian, Vietnamese, Urdu, Japanese and Thai.

Another valuable survey [5] gives an overview of recent advances in scene text detection and recognition for static images by referring to around 100 papers. Y. Zhu et al. [5] classify the related work on scene text detection into three types of methods: texture-based methods, component-based methods and hybrid methods. The paper not only analyses the strengths and weaknesses of the compared methods but also gives a useful discussion of state-of-the-art algorithms and of future trends in scene text detection and recognition.

The above papers emphasize the strong performance of deep learning methods in scene text detection and recognition. They also suggest that further improvements in detection and recognition accuracy can be achieved if a deep learning framework is employed and combined with language knowledge. Among the studies using deep learning and big data, Google PhotoOCR [12] is a remarkably successful work which won the ICDAR Robust Reading Competition in 2013. It takes advantage of substantial progress in deep learning and large-scale language modeling. Its deep neural network (DNN) character classifier is trained on two million examples, while its language model is built by utilizing a corpus of more than a trillion tokens. Many other methods using DNNs have achieved top scores in the ICDAR Robust Reading Competitions.

To the best of our knowledge, there has not been any study of word-level scene text recognition for Vietnamese. Hence, this paper explores this area by comparing the performance of two neural networks in recognising Vietnamese words on restaurant signs. The concept of convolutional neural networks (CNNs) is introduced in the next section. Then, two network architectures with different complexity levels are presented. Section 3 discusses the experimental results obtained when using the presented networks for Vietnamese text recognition.

2. SCENE TEXT RECOGNITION USING CNNs

2.1. Background theory

Convolutional neural networks are specific feed-forward multilayer neural networks which combine the three following architectural ideas: (i) local receptive fields, used to detect elementary visual features in images, such as oriented edges, end points or corners; (ii) shared weights, to extract the same set of elementary features from the whole input image and to reduce the computational cost; (iii) sub-sampling operations, to reduce the computational cost and the sensitivity to affine transformations such as shifts and rotations [3].
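These three ideas can be made concrete with a short sketch. The following minimal PyTorch example is an illustration added for clarity, not code from the paper; the filter count, kernel size and pooling settings are assumptions chosen only to demonstrate the mechanism.

```python
import torch
import torch.nn as nn

# Ideas (i) and (ii): 32 filters, each with a local 5x5x3 receptive field,
# whose weights are shared across every position of the input image.
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=5, padding=2)

# Idea (iii): sub-sampling with a 3x3 receptive field and stride 2 halves
# the spatial resolution, reducing cost and sensitivity to small shifts.
pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 3, 32, 32)        # one dummy 32x32 colour image
feature_maps = torch.relu(conv(x))   # shape (1, 32, 32, 32)
subsampled = pool(feature_maps)      # shape (1, 32, 16, 16)
print(feature_maps.shape, subsampled.shape)
```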
A convolutional neural network consists of many layers, including the input layer, the output layer and hidden layers. The hidden layers of convolutional networks include convolutional layers and pooling layers. Each unit in a convolutional layer is locally connected to a set of units located in a small neighborhood of the previous layer. The outputs of convolutional layers are called feature maps because they help to extract the visual features in images. The output features of one layer may be used to build higher-order features in the next layers.

Unfortunately, no algorithm is able to automatically determine the optimal architecture of a CNN for a given classification task. The architecture of the network, such as the number of layers, the number of units in each layer and the network parameters, must be determined through experiments. This section presents the two convolutional network architectures which are used for the experiments on Vietnamese scene text recognition in Section 3.

2.2. Network architecture 1

The first network consists of three convolutional layers, as shown in Figure 2. The input of the network is a coloured image of size 32x32x3. The first convolutional layer has 32 feature maps corresponding to 32 convolutional filters; the size of each filter in this layer is 5x5x3. The second and third convolutional layers have 32 and 64 feature maps, respectively. The outputs of the convolutional layers are sub-sampled using the max pooling function and normalised by the rectified linear unit (ReLU). The receptive field of the pooling layers is a 3x3 matrix with a stride of 2. The last two layers are fully connected to combine the features learned by the previous convolutional and pooling layers; the number of units in the last layer equals the number of classes to be recognised. This architecture has 12,399,306 connections in total but only 145,578 parameters, thanks to the weight-sharing characteristic.

Figure 2. The first convolutional network architecture: input image 32x32x3 → conv layer 32x32x32 → pooling layer 16x16x32 → conv layer 16x16x32 → pooling layer 8x8x32 → conv layer 8x8x64 → pooling layer 4x4x64 → fully connected layers 1x1x64 and 1x1x10.
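A reconstruction of this architecture in PyTorch is sketched below. The 5x5 kernel size in the second and third convolutional layers, the 'same' convolution padding and the pooling padding of 1 are assumptions inferred from the feature-map sizes in Figure 2, not details stated in the paper; under these assumptions the sketch reproduces the reported parameter count of 145,578 exactly.

```python
import torch
import torch.nn as nn

class Network1(nn.Module):
    """Sketch of network 1: three conv layers plus two fully connected layers."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2),        # -> 32x32x32
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # -> 16x16x32
            nn.Conv2d(32, 32, kernel_size=5, padding=2),       # -> 16x16x32
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # -> 8x8x32
            nn.Conv2d(32, 64, kernel_size=5, padding=2),       # -> 8x8x64
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # -> 4x4x64
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(4 * 4 * 64, 64),   # -> 1x1x64
            nn.ReLU(),
            nn.Linear(64, num_classes),  # -> 1x1x10
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

net = Network1()
print(sum(p.numel() for p in net.parameters()))  # 145578, as reported
logits = net(torch.randn(1, 3, 32, 32))          # shape (1, 10)
```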
2.3. Network architecture 2

The second network architecture is simpler than the first one: it consists of only one convolutional layer and one pooling layer (Figure 3). To obtain more information from the input data, a larger input image size of 64x64x3 is used. The convolutional layer is created by utilizing 400 kernel filters of size 8x8x3. The output of the convolutional layer is sub-sampled using the average pooling function and normalised by the sigmoid function. The receptive field of the pooling layer is a 3x3 matrix with a stride of 3, so that the sub-sampled areas are non-overlapping. This architecture has 250,822,800 connections in total but only 77,200 parameters.

Figure 3. The second convolutional network architecture: input image 64x64x3 → conv layer 57x57x400 → pooling layer 19x19x400.
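The second network can be sketched the same way. The lack of padding in the convolution (64 - 8 + 1 = 57) and the pooling stride of 3 (57 / 3 = 19) follow from the sizes in Figure 3; the rest of the settings are assumptions. The sketch reproduces the reported parameter count of 77,200, and a direct calculation reproduces the 250,822,800 connections.

```python
import torch
import torch.nn as nn

# Sketch of network 2: one conv layer, one average-pooling layer.
network2 = nn.Sequential(
    nn.Conv2d(3, 400, kernel_size=8),       # 64x64x3 -> 57x57x400
    nn.Sigmoid(),                           # sigmoid normalisation
    nn.AvgPool2d(kernel_size=3, stride=3),  # -> 19x19x400, non-overlapping
)

print(network2(torch.randn(1, 3, 64, 64)).shape)      # (1, 400, 19, 19)

# Parameters: 400 filters x (8*8*3 weights + 1 bias) = 77,200.
print(sum(p.numel() for p in network2.parameters()))  # 77200

# Connections: each of the 57*57*400 conv outputs has 8*8*3 + 1 inputs.
print(57 * 57 * 400 * (8 * 8 * 3 + 1))                # 250822800
```

The 19x19x400 = 144,400 pooled outputs are the features that the softmax classifier of Section 3 consumes.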
3. EXPERIMENTS AND RESULTS

3.1. Training dataset

Since no labeled dataset of Vietnamese scene text could be found on the internet, a dataset of Vietnamese restaurant signs was built by collecting images from the internet and by capturing shop signs on the street (Figure 4). The collected dataset consisted of 1,301 images: 464 containing the word "bún" (rice noodle), 409 containing "phở" and 428 containing "cơm" (rice). The dataset was split into two subsets: two thirds of the images were used for training the networks and the rest were used for validation.

Figure 4. Images of the dataset.

Convolutional neural networks often require a large amount of data so that they can learn the features of objects by themselves. Hence, images of other objects were added to the training dataset. The final training set consists of about 3,000 resized images of 10 object classes.

3.2. Experimental results

Our experiments utilised the softmax classifier, a well-known multiclass classification method, for recognising text. The output of each of the above neural networks was used as the input of the softmax classifier. It should be noted that the input of the neural networks in our experiments was produced directly from the original captured images; hence, the networks did not need a pre-processing step to crop words from the original images, as some other methods do.

The accuracy in recognising each word (noodle, phở, rice) and the average accuracy for Vietnamese words are shown in Table 1. Although the input images of network 2 have four times as many pixels as those of network 1, the accuracy of network 1 in recognising words was higher than that of network 2, thanks to the deeper architecture of network 1.

Table 1. The recognition accuracy.

Class               Network 1   Network 2
Noodle (bún)        81.3%       70.9%
Phở                 89.7%       70.8%
Rice (cơm)          89.1%       67.4%
All classes         84.98%      79.6%
Vietnamese words    86.7%       69.7%

Figures 5 and 6 show some randomly selected images which were recognised correctly and incorrectly. The recognition results are promising because the networks can correctly recognise blurred words in images with non-uniform illumination and complex backgrounds.

Figure 5. Correctly recognised words.

Another remark when comparing these two networks concerns the computational complexity. Although the number of parameters in network 1 is about double that in network 2, the number of connections in network 2 is twenty times greater than that in network 1. Hence, the second network needs much more time to compute the forward propagation. This fact makes the first network faster in the recognition task.

Figure 6. Incorrectly recognised words.

4. CONCLUSIONS

Two convolutional neural networks for Vietnamese scene text recognition have been compared. The results point out that the deeper network shows better performance in both recognition accuracy and computational time. The current results are obtained by using raw image pixels as the input of the CNNs. To achieve higher accuracy, further investigation should focus on using specific image features as the input of the CNNs. The performance of the above CNNs on Vietnamese scene text recognition should also improve with a larger labeled dataset.

REFERENCES

[1] Wang K., Babenko B., Belongie S., "End-to-End Scene Text Recognition", IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 2011.
[2] Le N. T., "Các giải thuật phát hiện chữ viết ngôn ngữ có dấu", Journal of Military Science and Technology, Vol. 46 (2016), pp. 163-169.
[3] Karatzas D., Shafait F., Uchida S., Iwamura M., Bigorda L., Mestre S., Mas J., Mota D., Almazán J., Heras L., "ICDAR 2013 Robust Reading Competition", Proceedings of the ICDAR (2013).
[4] Ye Q. and Doermann D., "Text detection and recognition in imagery: A survey", IEEE Trans. Pattern Anal. Mach. Intell., Vol. 37 (2014), pp. 1480-1500.
[5] Zhu Y., Yao C. and Bai X., "Scene text detection and recognition: Recent advances and future trends", Frontiers of Computer Science, Vol. 10 (2015), pp. 19-36.
[6] Chen C., Wang D.-H., Wang H., "Scene Character and Text Recognition: The State-of-the-Art", in Image and Graphics, Lecture Notes in Computer Science, Vol. 9219 (2015), pp. 310-320.
[7] Karanje U. B. and Dagade R., "Survey on Text Detection, Segmentation and Recognition from a Natural Scene Images", International Journal of Computer Applications, Vol. 108, No. 13 (2014).
[8] Patil P. and Nipanikar S. I., "A Survey on Scene Text Detection and Text Recognition", International Journal of Advanced Research in Computer and Communication Engineering, Vol. 5 (2016), pp. 887-889.
[9] Shi C.-Z., Gao S., Liu M.-T., Qi C.-Z., "Stroke Detector and Structure Based Models for Character Recognition: A Comparative Study", IEEE Transactions on Image Processing, Vol. 24, Issue 12 (2015), pp. 4952-4964.
[10] Kaur T. and Neeru N., "Text Detection and Extraction from Natural Scene: A Survey", International Journal of Advance Research in Computer Science and Management Studies, Vol. 3 (2015), pp. 331-336.
[11] Sharma N., Pal U. and Blumenstein M., "Recent advances in video based document processing: A review", Proc. DAS (2012), pp. 63-68.
[12] Bissacco A., Cummins M., Netzer Y., Neven H., "PhotoOCR: Reading Text in Uncontrolled Conditions", IEEE International Conference on Computer Vision, 2013, pp. 785-792.

Received 13th Jul., 2017
Revised 27th Aug., 2017
Published 1st Nov., 2017

Author affiliation: Posts and Telecommunications Institute of Technology.
* Email: thuyln@ptit.edu.vn
