Dissertation abstract (English): 3-D Object Detection and Recognition Assisting Visually Impaired People

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

LE VAN HUNG

3D OBJECT DETECTIONS AND RECOGNITIONS: ASSISTING VISUALLY IMPAIRED PEOPLE IN DAILY ACTIVITIES

Major: Computer Science
Code: 9480101

ABSTRACT OF DOCTORAL DISSERTATION
COMPUTER SCIENCE

Hanoi, 2018

The dissertation is completed at: Hanoi University of Science and Technology
Supervisors:
Dr Vu Hai
Assoc Prof Nguyen Thi Thuy
Reviewer 1: Assoc Prof Luong Chi Mai
Reviewer 2: Assoc Prof Le Thanh Ha
Reviewer 3: Assoc Prof Nguyen Quang Hoan
The dissertation will be defended before the approval committee at Hanoi University of Science and Technology:
Time ..., date ... month ... year ...
The dissertation can be found at:
Ta Quang Buu Library
Vietnam National Library

INTRODUCTION

Motivation

Visually Impaired People (VIPs) face many difficulties in their daily living. Nowadays, many aided systems for the VIPs have been deployed, such as navigation services, obstacle detection (iNavBelt and GuideCane products in Andreas et al., IROS 2014; Rimon et al., 2016), and object recognition in supermarkets (EyeRing at MIT's Media Lab). The most common situation is that the VIPs need to locate home facilities. However, even a simple activity such as querying common objects (e.g., a bottle, a coffee cup, jars, and so on) in a conventional environment (e.g., a kitchen or a cafeteria room) may be a challenging task. In terms of deploying an aided system for the VIPs, not only the object's position must be provided, but also more information about the queried object (e.g., its size, or how to grab objects such as bowls and coffee cups placed on a flat surface like a kitchen table). Let us consider a real scenario, as shown in Fig 1: to look for a tea or coffee cup, a VIP goes into the kitchen, touches the surrounding objects and picks up the right one. With an aided system, that person just makes a query: "Where is a coffee cup?", "What is the size of the cup?", "Is the cup lying or standing on the table?" The aided system should provide
the information for the VIPs so that they can grasp the objects and avoid accidents such as being burned. Even when 3-D object detection and recognition are performed on 2-D image data with additional information from depth images, as presented in (Bo et al. NIPS 2010, Bo et al. CVPR 2011, Bo et al. IROS 2011), only the object's label is provided. At the same time, the information that the system captures from the environment consists of image frames. Therefore, the data of the objects on the table covers only the visible part of each object, such as the front of a cup, box or fruit, whereas the VIPs need the position, size and direction of the object for safe grasping. For this reason, we use a "3-D object estimation method" to estimate the information of the objects. Knowing that the queried object is a coffee cup, which usually has a cylindrical shape and lies on a flat surface (the table plane), the aided system can resolve the query by fitting a primitive shape to the point cloud collected from the object. The objects in a kitchen or tea room, such as cups, bowls, jars, fruit, funnels, etc., are usually placed on tables. Therefore, these objects can be simplified by primitive shapes. The problem of detecting and recognizing complex objects in the scene is not considered in the dissertation. Prior knowledge observed from the current scene, such as the fact that a cup normally stands on the table, contextual constraints such as walls in the scene being perpendicular to the table plane, and the limited size/height of the queried object, would be valuable cues to improve the system performance.

Figure: Illustration of a real scenario: a VIP comes to the kitchen and gives a query: "Where is a coffee cup?" on the table. Left panel: a Kinect mounted on the person's chest. Right panel: the developed system is built on a laptop PC.

Generally, we realize that the queried objects could be identified through simplified geometric shapes: planar segments (boxes), cylinders (coffee mugs, soda cans), spheres (balls) and cones, without utilizing conventional 3-D features. Following these ideas, a pipeline for the work "3-D Object Detection and Recognition for Assisting Visually Impaired People" is proposed. It consists of several tasks: (1) separating the queried objects from the table plane detection result by transforming the original coordinate system; (2) detecting candidates for the interested objects using appearance features; and (3) estimating a model of the queried object from a 3-D point cloud. The last task plays an important role. Instead of matching the queried objects to 3-D models as conventional learning-based approaches do, this research work focuses on constructing a simplified geometrical model of the queried objects from an unstructured set of point clouds collected by an RGB and range sensor.

Objective

In this dissertation, we aim to propose a robust 3-D object detection and recognition system. As a feasible solution for deploying a real application, the proposed framework should be simple, robust and friendly to the VIPs. However, several critical issues might affect the performance of the proposed system, in particular: (1) objects are queried in a complex scene where clutter and occlusion may appear; (2) noise in the collected data; and (3) high computational cost due to the huge number of points in the point cloud. Although in the literature, a number
of relevant works on 3-D object detection and recognition have been attempted for a long time, in this study we will not attempt to solve these issues separately. Instead, we aim to build a unified solution. To this end, the concrete objectives are:
- To propose a complete 3-D query-based object detection system that supports the VIPs with high accuracy. The figure below illustrates the processes of 3-D query-based object detection in an indoor environment.
- To deploy a real application that locates objects and describes their information to support the VIPs in grasping objects. The application is evaluated in practical scenarios such as finding objects in a sharing room or a kitchen.

Figure: Illustration of the process of 3-D query-based object detection in the indoor environment. The full object model is the estimated green cylinder from the point cloud of the coffee cup (red points).

An available extension of this research is to give the VIPs a feeling or a way of interaction in a simple form. In fact, the VIPs want to make optimal use of all their senses (i.e., audition, touch, and kinesthetic feedback). Through this study, useful information extracted from cameras (i.e., position, size, and safe directions for object grasping) becomes available. As a result, the proposed method offers an effective way to turn the large amount of collected data into a valuable and feasible resource.

Context, constraints and challenges

The scenario figure shows the context in which a VIP comes to a cafeteria and uses an aided system to locate an object on the table. The input of the system is a query, and the output is the object's position in 3-D coordinates together with the object's information (size, height). The proposed system operates with a MS Kinect sensor version. The Kinect sensor is mounted on the chest of the VIP and the laptop is carried in a backpack, as shown in Fig 1-bottom. For deploying a real application, we impose the following constraints on the scenario:
• The MS Kinect sensor:
– A MS Kinect sensor is mounted on
VIP's chest and he/she moves slowly around the table. This is to collect the data of the environment.
– A MS Kinect sensor captures RGB and depth images at a normal frame rate (from 10 to 30 fps) with an image resolution of 640×480 pixels for both image types. With each frame obtained from the Kinect, an acceleration vector is also obtained. Because the MS Kinect collects images at 10 to 30 fps, it fits well with the slow movements of the VIPs (∼ m/s). Although image data collected via a wearable sensor can be affected by the subject's movement (e.g., image blur and vibrations in practical situations), there are no specific requirements for collecting the image data. For instance, the VIPs are not required to stand still before the image data is collected.
– Every queried object needs to be placed in the visible area of the MS Kinect sensor, which is at a distance of 0.8 to 4 meters and within an angle of 30° around the center axis of the MS Kinect sensor. Therefore, the distance constraint from the VIPs to the table is also about 0.8 to 4 m.
• Interested (or queried) objects are assumed to have simple geometrical structures. For instance, coffee mugs, bowls, jars, bottles, etc. have a cylindrical shape, balls have a spherical shape, and boxes have a cubic shape. They are idealized and labeled. The interaction module between a VIP and the system has not been developed in the dissertation.
• Once a VIP wants to query an object on the table, he/she should stand in front of the table. This ensures that the current scene is in the visible area of the MS Kinect sensor; the VIP can then move around the table. The proposed system computes and returns the object's information such as position, size and orientation. Sending such information to the senses (e.g., as audible information, on a Braille screen, or by vibration) is out of the scope of this dissertation.
• Some heuristic parameters are pre-determined. For instance, a VIP's height and other parameters of contextual constraints (e.g.,
size of the table plane in a scene, object height limitations) are pre-selected.

The above-mentioned scenarios and constraints must cope with the following challenges:
• Occlusion and clutter of the interested objects: In practice, when a VIP comes to a cafeteria to find an object on the table, the queried objects could be occluded by others. At a certain viewpoint, the MS Kinect sensor captures only a part of an object. Therefore, data of the queried objects is missing. Another situation is that the data contains much noise, because the depth image of a MS Kinect version is often affected by illumination conditions. These issues are challenges for fitting, detecting and classifying objects from a point cloud.
• Various appearances of the same object type: The system supports the VIPs in querying common objects. In fact, a "blue" tea/coffee cup and a "yellow" bottle have the same type of primitive shape (a cylindrical model). These objects have the same geometric structure but different colors. We exploit learning-based techniques to utilize appearance features (on RGB images) for recognizing the queried objects.
• Computational time: A point cloud of a scene generated from an image of 640 × 480 pixels consists of hundreds of thousands of points. Therefore, computations in the 3-D environment often incur higher computational costs than a task in the 2-D environment.

Contributions

Throughout the dissertation, the main objectives are addressed by a unified solution. We achieve the following contributions:
• Contribution 1: We propose a new robust estimator called GCSAC (Geometrical Constraint SAmple Consensus) for estimating primitive shapes from the point cloud of the objects. Unlike conventional RANSAC (RANdom SAmple Consensus) algorithms, GCSAC selects uncontaminated (so-called qualified or good) samples from a set of data points using geometrical constraints. Moreover, GCSAC is extended by utilizing the contextual
constraints to validate the results of the model estimation.
• Contribution 2: We present a comparative study of three different approaches for recognizing 3-D objects in a complex scene. The best one is a combination of a deep-learning-based technique and the proposed robust estimator (GCSAC). This method takes advantage of recent advances in object detection using a neural network on RGB images and utilizes the proposed GCSAC to estimate the full 3-D models of the queried objects.
• Contribution 3: We successfully deployed a system using the proposed methods for detecting 3-D primitive shape objects in a lab-based environment. The system combines the table plane detection technique and the proposed method of 3-D object detection and estimation. It achieves fast computation for both tasks of locating and describing the objects. As a result, it fully supports the VIPs in grasping the queried objects.

General framework and dissertation outline

In this dissertation, we propose a unified framework for detecting the queried 3-D objects on the table to support the VIPs in an indoor environment. The proposed framework consists of three main phases, as illustrated in the framework figure below. The first phase is considered a pre-processing step. It consists of point cloud representation from the RGB and depth images and table plane detection, in order to separate the interested objects from the current scene. The second phase aims to label the object candidates on the RGB images. The third phase estimates a full model from the point cloud specified by the first and second phases.

Figure: A general framework of detecting the 3-D queried objects on the table of the VIPs. Diagram labels: Microsoft Kinect; RGB-D image; acceleration vector; pre-processing step; point cloud representation; table plane detection; objects detection on RGB image; candidates; 3-D objects location on the table plane; fitting 3-D objects; 3-D objects model estimation; 3-D objects information.

In the last phase, the 3-D objects are estimated by utilizing a
new robust estimator, GCSAC, to obtain the full geometrical models. Utilizing this framework, we deploy a real application. The application is evaluated in different scenarios, including datasets collected in lab environments and public datasets. The research works in the dissertation are organized into six chapters as follows:
• Introduction: This chapter describes the main motivations and objectives of the study. We also present the research context, constraints and challenges that we meet and address in the dissertation. Additionally, the general framework and main contributions of the dissertation are presented.
• Chapter 1: A Literature Review: This chapter mainly surveys existing aided systems for the VIPs. In particular, the related techniques for developing an aided system are discussed. We also present the relevant works on estimation algorithms and a series of techniques for 3-D object detection and recognition.
• Chapter 2: In this chapter, we describe a point cloud representation built from data collected by a MS Kinect sensor. A real-time table plane detection technique for separating the interested objects from a certain scene is described. The proposed table plane detection technique is adapted to the contextual constraints. The experimental results confirm the effectiveness of the proposed method on both self-collected and public datasets.
• Chapter 3: This chapter describes a new robust estimator for estimating primitive shapes from point cloud data. The proposed robust estimator, named GCSAC (Geometrical Constraint SAmple Consensus), utilizes geometrical constraints to choose good samples for estimating models. Furthermore, we utilize contextual information to validate the estimation results. In the experiments, the proposed GCSAC is compared with various RANSAC-based variations on both synthesized and real datasets.
• Chapter 4: This chapter describes the complete framework for locating and providing the
full information of the queried objects. In this chapter, we exploit the advantages of recent deep learning techniques for object detection. Moreover, to estimate the full 3-D model of the queried object, we utilize GCSAC on the point cloud data of the labeled object. Consequently, we can directly extract the object's information (e.g., size, surface normal, grasping direction). This scheme outperforms existing approaches such as solely using 3-D object fitting or 3-D feature learning.
• Chapter 5: We conclude the works and discuss the limitations of the proposed method. Research directions for future works are also described.

CHAPTER 1
LITERATURE REVIEW

In this chapter, we survey related works on aided systems for the VIPs and on object detection methods in indoor environments. Firstly, relevant aiding applications for the VIPs are presented in Sec 1.1. Then, the state-of-the-art works on 3-D object detection and recognition are introduced and analyzed in Sec 1.2. Finally, robust estimators and their applications in robotics and computer vision are presented in Sec 1.3.

1.1 Aided systems supporting visually impaired people
1.1.1 Aided systems for navigation service
1.1.2 Aided systems for obstacle detection
1.1.3 Aided systems for locating the interested objects in scenes
1.1.4 Aided systems for detecting objects in daily activities
1.1.5 Discussions
1.2 3-D object detection, recognition from a point cloud data
1.2.1 Appearance-based methods
1.2.2 Geometry-based methods
1.2.3 Discussions
1.3 Fitting primitive shapes: A brief survey
1.3.1 Linear fitting algorithms
1.3.2 Robust estimation algorithms
1.3.3 RANdom SAmple Consensus (RANSAC) and its variations
1.3.4 Discussions

CHAPTER 2
POINT CLOUD REPRESENTATION AND THE PROPOSED METHOD FOR TABLE PLANE DETECTION

A common situation in the activities of daily living of visually impaired people (VIPs) is to query an object (a coffee cup, a water bottle, and so on) on a flat surface. We assume that such a flat surface could be a
table plane in a sharing room or in a kitchen. To build the complete aided system supporting the VIPs, the queried objects obviously should be separated from the table plane in the current scene. In a general framework that includes other steps, such as detection and estimation of the full model of the queried objects, table plane detection can be considered a pre-processing step. This chapter is therefore organized as follows: firstly, we introduce a representation of the point clouds, which combine the data collected by the Kinect sensor, in Section 2.1. We then present the proposed method for table plane detection in Section 2.2.

2.3 Separating the interested objects on the table plane
2.3.1 Coordinate system transformation
2.3.2 Separating table plane and interested objects
2.3.3 Discussions

CHAPTER 3
PRIMITIVE SHAPES ESTIMATION BY A NEW ROBUST ESTIMATOR USING GEOMETRICAL CONSTRAINTS

3.1 Fitting primitive shapes: By GCSAC
3.1.1 Introduction

The geometrical model of an interested object can be estimated using from two to seven geometrical parameters, as in (Schnabel et al. 2007). Random Sample Consensus (RANSAC) and its variants attempt to extract the best possible shape parameters subject to either heavy noise in the data or processing time constraints. In particular, at each hypothesis in the framework of a RANSAC-based algorithm, a search process is implemented that aims at finding good samples based on the constraints of an estimated model. To search for good samples, we define two criteria: (1) the selected samples must be consistent with the estimated model via a rough inlier ratio evaluation; (2) the samples must satisfy explicit geometrical constraints of the interested objects (e.g., cylindrical constraints).

3.1.2 Related work
3.1.3 The proposed new robust estimator
3.1.3.1 Overview of the proposed robust estimator (GCSAC)

To estimate the parameters of a 3-D primitive shape, an original RANSAC paradigm, as shown in the top panel of
Figure 3.2, randomly selects a Minimum Sample Subset (MSS) from a point cloud, and then the model parameters are estimated and validated. The algorithm is often computationally infeasible, and it is unnecessary to try every possible sample. Our proposed method (GCSAC, in the bottom panel of Figure 3.2) is based on the original version of RANSAC; however, it differs in three major aspects: (1) At each iteration, the minimal sample set is constructed while the random sampling procedure is performed, so that probing the consensus data is easily achievable. In other words, a low pre-defined inlier threshold can be deployed as a weak condition of consistency. Then, after only a few random sampling iterations, the candidates of good samples can be obtained.

Figure 3.2: Top panel: overview of a RANSAC-based algorithm (RANSAC/MLESAC paradigm: randomly sampling a minimal subset, estimating the geometrical parameters M, evaluating the model and updating the best model, adaptively updating the number of iterations K by Eq 3.2, and checking termination). Bottom panel: a diagram of the GCSAC implementation, which adds a search for good samples using geometrical constraints after the random sampling and evaluates the model via the inlier ratio or the negative log-likelihood.

(2) The minimal sample sets consist of qualified samples which ensure geometrical
constraints of the interested object. (3) The termination condition of the adaptive RANSAC algorithm of (Hartley et al. 2003) is adopted, so that the algorithm terminates as soon as a minimal sample set is found for which the number of iterations required by the current estimate is less than the number already performed. To determine the termination criterion for the estimation algorithm, a well-known calculation of the number of sample selections K is given in Eq 3.2:

$$K = \frac{\log(1 - p)}{\log(1 - w^{s})} \qquad (3.2)$$

where p is the probability of finding a model describing the data, s is the minimal number of samples needed to estimate a model, and w is the percentage of inliers in the point cloud.

Figure 3.3: Geometrical parameters of a cylindrical object. (a)-(c) Explanation of the geometrical analysis to estimate a cylindrical object. (d)-(e) Illustration of the geometrical constraints applied in GCSAC. (f) Result of the estimated cylinder from a point cloud. Blue points are outliers, red points are inliers.

3.1.3.2 Geometrical analyses and constraints for qualifying good samples

In the following sections, the principles of the 3-D primitive shapes are explained. Based on the geometrical analysis, related constraints are given to select good samples. The normal vector of any point is computed following the approach in (Holz et al. 2011). At each point pi, the k-nearest neighbors kn of pi are determined within a radius r. Computing the normal vector of pi is therefore reduced to the analysis of the eigenvectors and eigenvalues of the covariance matrix C, as presented in Sec 2.2.3.2.

a. Geometrical analysis for cylindrical objects

The geometrical relationships of the above parameters are shown in Fig 3.3 (a). A cylinder can be estimated from two points (p1, p2) (two blue-squared points) and their corresponding normal vectors (n1, n2) (marked by green and yellow lines). Let γc be the main axis of the cylinder
(red line) which is estimated by:

$$\gamma_c = n_1 \times n_2 \qquad (3.3)$$

To specify a centroid point I, we project the two parametric lines L1 = p1 + t n1 and L2 = p2 + t n2 onto a plane specified by PlaneY (see Figure 3.3 (b)). The normal vector of this plane is estimated by the cross product of the γc and n1 vectors (γc × n1). The centroid point I is the intersection of L1 and L2 (see Figure 3.3 (c)). The radius Ra is set to the distance between I and p1 in PlaneY. A result of the estimated cylinder from a point cloud is illustrated in Figure 3.3 (f). The height of the estimated cylinder is normalized.

Figure 3.4: (a) Setting geometrical parameters for estimating a cylindrical object from a point cloud as described above. (b) The estimated cylinder (green one) from an inlier p1 and an outlier p2. As shown, it is an incorrect estimation. (c) Normal vectors n1 and n2* on the plane π are specified.

We first build a plane π that is perpendicular to the plane PlaneY and contains n1. Therefore, its normal vector is nπ = (nPlaneY × n1), where nPlaneY is the normal vector of PlaneY, as shown in Figure 3.4 (a). In other words, n1 is nearly perpendicular to n2*, where n2* is the projection of n2 onto the plane π. This observation leads to the criterion below:

$$c_p = \arg\min_{p_2 \in \{U_n \setminus p_1\}} \{\, n_1 \cdot n_2^{*} \,\} \qquad (3.4)$$

3.1.4 Experimental results of robust estimator
3.1.4.1 Datasets for evaluation of the robust estimator

The first ones are synthesized datasets, which consist of cylinders, spheres and cones. In addition, we evaluate the proposed method on real datasets. For the cylindrical objects, the dataset, named 'second cylinder', is collected from a public dataset [1] which contains 300 objects belonging to 51 categories. For the spherical objects, the dataset consists of two balls collected from four real scenes. Finally, the point cloud data of the cone objects, named 'second cone', is collected from the dataset given in [4].

3.1.4.2 Evaluation measurements of robust estimator

To evaluate the performance of the
proposed method, we use the following measurements:
- Let Ew denote the relative error of the estimated inlier ratio, where wgt is the defined inlier ratio of the ground truth and w is the inlier ratio of the estimated model. The smaller Ew is, the better the algorithm is.
- The total distance error Sd is calculated by summing the distances from every point pj to the estimated model Me.
- The processing time tp is measured in milliseconds (ms). The smaller tp is, the faster the algorithm is.
- The relative error of the estimated center (only for synthesized datasets) Ed is the Euclidean distance between the estimated center Ee and the true one Et.

Table 3.2: The average evaluation results on the synthesized datasets. The experiments on the synthesized datasets were repeated 50 times for statistically representative results.

Dataset           Measure     RANSAC    PROSAC    MLESAC    MSAC      LOSAC     NAPSAC     GCSAC
'first cylinder'  Ew (%)      23.59     28.62     43.13     10.92     9.95      61.27      8.49
                  Sd          1528.71   1562.42   1568.81   1527.93   1536.47   3168.17    1495.33
                  tp (ms)     89.54     52.71     70.94     90.84     536.84    52.03      41.35
                  Ed (cm)     0.05      0.06      0.17      0.04      0.05      0.93       0.03
                  EA (deg.)   3.12      4.02      5.87      2.81      2.84      7.02       2.24
                  Er (%)      1.54      2.33      7.54      1.02      2.40      112.06     0.69
'first sphere'    Ew (%)      23.01     31.53     85.65     33.43     23.63     57.76      19.44
                  Sd          3801.95   3803.62   3774.77   3804.27   3558.06   3904.22    3452.88
                  tp (ms)     10.68     23.45     1728.21   9.46      31.57     2.96       6.48
                  Ed (cm)     0.05      0.07      1.71      0.08      0.21      0.97       0.05
                  Er (%)      2.92      4.12      203.60    5.15      17.52     63.60      2.61
'first cone'      Ew (%)      24.89     37.86     68.32     40.74     30.11     86.15      24.40
                  Sd          2361.79   2523.68   2383.01   2388.64   2298.03   13730.53   2223.14
                  tp (ms)     495.26    242.26    52525     227.57    1258.07   206.17     188.4
                  EA (deg.)   6.48      15.64     11.67     15.64     6.79      14.54      4.77
                  Er (%)      20.47     17.65     429.44    17.31     20.22     54.44      17.21

Table 3.3: Experimental results on the 'second cylinder' dataset. The experiments were repeated 20 times, then the errors are averaged.

Dataset                         Method    w (%)    Sd        tp (ms)   Er (%)
'second cylinder' (coffee mug)  MLESAC    9.94     3269.77   110.28    9.93
                                GCSAC     13.83    2807.40   33.44     7.00
'second cylinder' (food can)    MLESAC    19.05    1231.16   479.74    19.58
                                GCSAC     21.41    1015.38   119.46    13.48
'second cylinder' (food cup)    MLESAC    15.04    1211.91   101.61    21.89
                                GCSAC     18.8     1035.19   14.43     17.87
'second cylinder' (soda can)    MLESAC    13.54    1238.96   620.62    29.63
                                GCSAC     20.6     1004.27   16.25     27.7

3.1.4.3 Evaluation results of new robust estimator

The performances of each method on the synthesized datasets are reported in Tab 3.2. For the real datasets, the experimental results for the cylindrical objects are reported in Tab 3.3. Table 3.4 reports the fitting results for the spherical and cone datasets.

Table 3.4: The average evaluation results on the 'second sphere' and 'second cone' datasets. The experiments on the real datasets were repeated 20 times for statistically representative results.

Dataset          Measure     RANSAC   PROSAC   MLESAC   MSAC     LOSAC    NAPSAC    GCSAC
'second sphere'  w (%)       99.77    99.78    99.98    99.83    99.80    98.20     100.00
                 Sd          29.60    28.77    26.62    29.38    29.37    35.55     11.31
                 tp (ms)     3.44     7.82     3.43     4.17     2.97     4.11      2.93
                 Er (%)      30.56    31.05    26.55    30.36    30.38    33.72     14.08
'second cone'    w (%)       79.52    80.21    71.89    75.45    71.89    38.79     82.27
                 Sd          126.56   96.37    156.40   147.00   143.00   1043.34   116.09
                 tp (ms)     10.94    96.37    7.42     13.05    9.65     25.39     7.14
                 EA (deg.)   38.11    29.42    40.35    35.62    25.39    52.64     23.74
                 Er (%)      77.52    71.66    77.09    74.84    75.10    76.06     68.84

3.1.5 Discussions

In this work, we have proposed GCSAC, a new RANSAC-based robust estimator for fitting primitive shapes to point clouds. The key idea of the proposed GCSAC is to combine consistency with the estimated model, checked via a rough inlier ratio evaluation, with the geometrical constraints of the interested shapes. This strategy aims to select good samples for the model estimation. The proposed method was examined with primitive shapes such as cylinders, spheres and cones. The experimental datasets consisted of synthesized and real datasets. The results of the GCSAC algorithm were compared with those of various RANSAC-based algorithms, and they confirm that GCSAC works well even on point clouds with a low inlier ratio. In the
future, we will continue to validate GCSAC on other geometrical structures and evaluate the proposed method in real scenarios for detecting multiple objects.

3.2 Fitting objects using the context and geometrical constraints
3.2.1 Finding objects using the context and geometrical constraints

Let us consider a real scenario in the common daily activities of visually impaired people. They come to a cafeteria and then give a query "Where is a coffee cup?", as shown in Fig.

3.2.2 The proposed method of finding objects using the context and geometrical constraints

The method is set in the context of developing object-finding aided systems for the VIPs (as shown in Fig 1).

3.2.2.1 Model verification using contextual constraints
3.2.3 Experimental results of finding objects using the context and geometrical constraints
3.2.3.1 Descriptions of the datasets for evaluation

The first dataset is constructed from a public one used in [3].

3.2.3.2 Evaluation measurements
3.2.3.3 Results of finding objects using the context and geometrical constraints

Table 3.5 compares the performances of the proposed method GCSAC and MLESAC.

Table 3.5: Average results of the evaluation measurements using GCSAC and MLESAC on three datasets, without the context's constraint. The fitting procedures were repeated 50 times for statistical evaluations.

Dataset          Method    Ea (deg.)   Er (%)   tp (ms)
First dataset    MLESAC    46.47       92.85    18.10
                 GCSAC     36.17       81.01    13.51
Second dataset   MLESAC    47.56       50.78    25.89
                 GCSAC     40.68       38.29    18.38
Third dataset    MLESAC    45.32       48.48    22.75
                 GCSAC     43.06       46.9     17.14

3.2.4 Discussions

CHAPTER 4
DETECTING AND ESTIMATING THE FULL MODEL OF 3-D OBJECTS AND DEPLOYING THE APPLICATION

4.1 3-D object detection
4.1.1 Introduction

The interested objects are placed on the table plane and have simple geometric structures (e.g., coffee mugs, jars, bottles and soda cans are cylindrical; soccer balls are spherical). Our method exploits YOLO [2], a state-of-the-art method for object detection in RGB images, because it has the highest performance for object detection. After that, the detected objects are projected into the point cloud data (3-D data) to generate the full object models for grasping and describing objects.

Table 4.1: The average result of detecting spherical objects on two stages.

                           First stage                Second stage               Average processing
Dataset         Method     Recall (%)  Precision (%)  Recall (%)  Precision (%)  time (s)/scene
First dataset   PSM        62.23       48.36          60.56       46.68          1.05
                CVFGS      56.24       50.38          48.27       42.34          1.2
                DLGS       88.24       78.52          76.52       72.29          0.5

4.1.2 Related Work
4.1.3 Three different approaches for 3-D objects detection in a complex scene
4.1.3.1 Geometry-based method for Primitive Shape detection Method (PSM)

This method uses the Primitive Shape detection Method (PSM) of (Schnabel et al) on the point cloud of the objects.

4.1.3.2 Combination of Clustering objects, Viewpoint Features Histogram, GCSAC for estimating 3-D full object models (CVFGS)
4.1.3.3 Combination of Deep Learning based, GCSAC for estimating 3-D full object models (DLGS)

This network divides the input image into a grid of size c × c and uses the features from the entire image to predict the bounding boxes for each cell of this grid.

4.1.4 Experimental results
4.1.4.1 Data collection
4.1.4.2 Object detection evaluation
4.1.4.3 Evaluation
parameters

4.1.4.4 Results

The average result of detecting spherical objects at the first stage of evaluation is presented in Tab. 4.1.

4.1.5 Discussions

4.2 Deploying an aided system for visually impaired people

From the evaluations above, we can see that the DLGS method has the best results for detecting 3-D primitive objects based on the queries of the VIPs. Therefore, the complete system is developed according to the framework shown in Fig. 4.20.

Figure 4.20: The framework for deploying the complete system to detect 3-D primitive objects according to the queries of the VIPs.

To detect the objects queried by a VIP on the table in the 3-D environment, the following steps are performed:

• Generating the RGB point cloud from the RGB image and the depth image (presented in Sec. 2.1), using the calibration matrix and down-sampling.
• Using the acceleration vector and constraints to detect the table plane (presented in Sec. 2.2).
• Separating the table plane and the objects (presented in Sec. 2.3).
• Detecting objects on the RGB image (YOLO).
• Locating the 3-D objects on the table plane.
• Fitting models by GCSAC (presented in Sec. 3.1) for grasping and describing objects.

4.2.1 Environment and material setup

To build an aiding system that detects and locates the queried 3-D primitive objects on the table for the VIPs, we use two types of devices. The first device is an MS Kinect version. The second device is a laptop.

4.2.2 Pre-built script

We experimented with three blind people at three types of table, according to the following scenarios:

• A VIP moves around the table and wants to find the spherical or cylindrical objects on the table, where there are a coffee cup, a jar and balls. There is a large enough distance between them.
• A VIP moves around the table and wants to find the spherical or cylindrical objects on the table, where there are a coffee cup, a jar and balls. These objects are occluded.

Table 4.6: The average results of 3-D queried object detection.

                   First stage                 Second stage                Processing time
Measurement        Recall (%)  Precision (%)   Recall (%)  Precision (%)   (frame/s)
Average results    100         99.27           97.80       90.45           0.86

4.2.3 Experimental results

The experimental setup of the system is described in Sec. 4.2.1 and Sec. 4.2.2. It includes scenes with different types of table; each scene has about 400 frames, and the frame rate of the MS Kinect is about 10 frames per second.

4.2.3.1 Evaluation of finding 3-D objects

To evaluate the detection of the 3-D objects queried by the VIPs, we prepared the ground truth data in two phases. The first phase evaluates the table plane detection: the ground truth was prepared as in Sec. 2.2.4.2 and the 'EM1' measurement is used. For the object detection, we also prepared ground truth data and computed T1 for evaluating 3-D cylindrical object detection and T2 for evaluating 3-D spherical object detection. They are presented in Sec. 4.1.4.2. To detect objects in the RGB images, we use the YOLO network to train the object classifier. The number of classes and iterations are set as in Sec. 4.1.4.3. All source code of the program is published at link 1. We performed the training on 20% of the data and the testing on 80% of the data. All of the data is published at link 2. An object detection is counted as true when the table plane is correctly detected and the detection satisfies the T1 rate for 3-D cylindrical object detection and the T2 rate for 3-D spherical object detection. The average results of 3-D queried object detection using the DLGS method are shown in Tab. 4.6. The demo videos of the real system are published at link 3:
http://mica.edu.vn/perso/Le-Van-Hung/videodemo/index-demo.html (the same page serves as links 1, 2 and 3)

4.2.4 Evaluation of usability

CHAPTER 5 CONCLUSION AND FUTURE WORKS

5.1 Conclusion

In this dissertation, we have proposed a new robust estimator called GCSAC (Geometrical Constraint SAmple Consensus) for estimating primitive shapes (e.g., cylinder, sphere, cone) from point cloud data that may be contaminated. This algorithm is a RANSAC variation with improvements to the sampling step. Unlike RANSAC and MLESAC, where the samples are drawn randomly, GCSAC intentionally selects good samples based on the proposed geometrical constraints. GCSAC was evaluated and compared to RANSAC variations for the estimation of primitive shapes on synthesized and real datasets. The experimental results confirmed that GCSAC is better than the RANSAC variations in both the quality of the estimated models and the computational time. We also proposed using contextual constraints, which are derived from the specific context of the environment, to significantly improve the estimation results.

In this dissertation, we also described a complete aided system for detecting 3-D primitive objects based on a VIP's query. This system was demonstrated and evaluated in real environments. The application was developed using the following proposed techniques:

• Real-time table plane detection that achieves both high accuracy and low computational time. It is a combination of down-sampling, a region growing algorithm and the contextual constraints. A real dataset of table planes collected in various real scenes is made publicly available.
• A combination of Deep Learning (YOLO network) for object detection on the RGB image and the proposed robust estimator (GCSAC) for generating the full object models, which provide the object's descriptions for the VIPs. The evaluations confirmed that YOLO achieved an acceptable accuracy and the fastest computational time, while GCSAC could estimate full models from contaminated or occluded data. These results ensure the feasibility of the developed application.

During the experiments, we also found limitations of the proposed methods, which are listed below:

• In the table plane detection step, some context constraints are assumed. For instance, the table plane is flat, the table stands on the floor, and its height is lower than the MS Kinect's position.
• In order to detect objects, depth information is combined with the color information and projected into 3-D space. However, the resolution of the depth images captured by the MS Kinect sensor is not good enough. In particular, at a far distance (more than 4 m) or too near (less than 0.8 m), the depth data is unavailable. Therefore, the performance of the proposed method could be reduced when a user stands too far from or too near to the objects.
• Each primitive shape uses a different type of constraint, so the number of object types that can be found is limited. In this study, only three types of object (cylindrical, spherical and conical objects) were investigated.
• The context constraints are only applied to some specific objects whose main axis direction is specified.
• The proposed system requires training time when many objects appear in the scene. In particular, we have not solved the problem of detecting and recognizing 3-D objects with complex geometry. The constraints for objects that are composed of many primitive shapes have not been studied.
• At the moment, the descriptions (e.g., size, position information) of the queried objects are estimated on each separate frame. Temporal information has not been exploited in the parameter estimation procedures. Therefore, the estimated parameters could contain noise or incorrect results. To resolve these issues, relevant techniques of time series analysis could be adopted. For instance, a Kalman filter can be applied on consecutive frames to filter out noise and correct the estimation results. In addition, observations of the estimated parameters in consecutive frames could be used to calculate statistical measurements, which would be more stable and reliable.

Both the above limitations and the existing challenges of the study suggest research directions for the future:

• Short term:
– For an improvement of GCSAC to estimate primitive shapes: we need to propose geometrical constraints for estimating many other geometrical structures. The combination of the proposed algorithm and the constraints for complex shapes can follow the work of Schnabel et al. (2007) or compose a graph of the primitive shapes as proposed by Nieuwenhuisen et al. (2012).
– The developed system needs to be evaluated with many VIPs of different ages and heights to get their feedback. The proposed system is currently deployed in the lab environment only. It should be tested in more environments, such as a classroom or any living environment of the VIPs. In the near future, the proposed system can be evaluated at Nguyen Dinh Chieu School in Hanoi.
• Long term:
– Building a module that guides the movement of the visually impaired to the table, and combining it with the Guide module: the system should be combined with a visual odometry module for fully navigating the directions and movements of VIPs to the table. In addition, a guiding module via a speaker or the vibration of a smart device for picking up the queried objects is required to complete the system.
– The system currently uses an MS Kinect sensor to collect data from the environment and processes it on a laptop. It is bulky and affects the mobility of the VIPs. We need to deploy the system on smartphones or other smart devices, which would be more compact and convenient for visually impaired people.
– We will examine recent adaptive learning in deep learning to increase the performance of object recognition and detection tasks.
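The temporal smoothing suggested among the limitations above (applying a Kalman filter to per-frame estimates) can be sketched as follows. This is only a minimal illustration, not the dissertation's implementation: it assumes a single scalar parameter (for example, an estimated cylinder radius in metres) that stays roughly constant across frames, and the names ScalarKalman and smooth, as well as the noise variances q and r, are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    """1-D Kalman filter with a constant-state model (illustrative assumption)."""
    x: float          # current estimate of the parameter
    p: float          # variance of the current estimate
    q: float = 1e-5   # process noise variance (assumed value)
    r: float = 1e-3   # measurement noise variance (assumed value)

    def update(self, z: float) -> float:
        # Predict: the state is assumed constant, so only the uncertainty grows.
        self.p += self.q
        # Correct: blend the prediction and the new measurement via the Kalman gain.
        k = self.p / (self.p + self.r)
        self.x += k * (z - self.x)
        self.p *= (1.0 - k)
        return self.x


def smooth(measurements, q=1e-5, r=1e-3):
    """Filter a sequence of per-frame estimates; the first value initialises the state."""
    it = iter(measurements)
    first = next(it)
    kf = ScalarKalman(x=first, p=r, q=q, r=r)
    out = [first]
    for z in it:
        out.append(kf.update(z))
    return out


if __name__ == "__main__":
    # Hypothetical per-frame radius estimates (metres) scattered around 0.04.
    raw = [0.045, 0.035, 0.044, 0.036, 0.041, 0.039]
    print(smooth(raw))
```

A vector-valued state (e.g., position plus size) would need the matrix form of the filter; the scalar version is used here only to keep the idea visible.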
Bibliography

[1] Lai K., Bo L., Ren X., and Fox D. (2011). A large-scale hierarchical multi-view RGB-D object dataset. In IEEE International Conference on Robotics and Automation (ICRA), pp. 1817–1824.
[2] Redmon J., Divvala S., Girshick R., and Farhadi A. (2016). You Only Look Once: Unified, Real-Time Object Detection. In Computer Vision and Pattern Recognition.
[3] Richtsfeld A., Morwald T., Prankl J., Zillich M., and Vincze M. (2012). Segmentation of unknown objects in indoor environments. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4791–4796.
[4] Scharstein D. and Szeliski R. (2003). High-Accuracy Stereo Depth Maps Using Structured Light. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1(June): pp. 195–202.

PUBLICATIONS OF DISSERTATION

[1] Van-Hung Le, Hai Vu, Thuy Thi Nguyen, Thi Lan Le, and Thanh Hai Tran (2015). Table plane detection using geometrical constraints on depth image. The 8th Vietnamese Conference on Fundamental and Applied IT Research, FAIR, Hanoi, Vietnam, ISBN: 978-604-913-397-8, pp. 647-657.
[2] Van-Hung Le, Hai Vu, Thuy Thi Nguyen, Thi-Lan Le, Thi-Thanh-Hai Tran, Michiel Vlaminck, Wilfried Philips and Peter Veelaert (2015). 3D Object Finding Using Geometrical Constraints on Depth Images. The 7th International Conference on Knowledge and Systems Engineering, HCM city, Vietnam, ISBN: 978-1-4673-8013-3, pp. 389-395.
[3] Van-Hung Le, Thi-Lan Le, Hai Vu, Thuy Thi Nguyen, Thanh-Hai Tran, Tran-Chung Dao and Hong-Quan Nguyen (2016). Geometry-based 3-D Object Fitting and Localization in Grasping Aid for Visually Impaired People. The 6th International Conference on Communications and Electronics (IEEE-ICCE), HaLong, Vietnam, ISBN: 978-1-5090-1802-4, pp. 597-603.
[4] Van-Hung Le, Michiel Vlaminck, Hai Vu, Thuy Thi Nguyen, Thi-Lan Le, Thanh-Hai Tran, Quang-Hiep Luong, Peter Veelaert and Wilfried Philips (2016). Real-time table plane detection using accelerometer and organized point cloud data from Kinect sensor. Journal of Computer Science and Cybernetics, Vol. 32, N. 3, ISSN: 1813-9663, pp. 243-258.
[5] Van-Hung Le, Hai Vu, Thuy Thi Nguyen, Thi-Lan Le, Thanh-Hai Tran (2017). Fitting Spherical Objects in 3-D Point Cloud Using the Geometrical constraints. Journal of Science and Technology, Section in Information Technology and Communications, Number 11, 12/2017, ISSN: 1859-0209, pp. 5-17.
[6] Van-Hung Le, Hai Vu, Thuy Thi Nguyen, Thi-Lan Le, Thanh-Hai Tran (2018). Acquiring qualified samples for RANSAC using geometrical constraints. Pattern Recognition Letters, Vol. 102, ISSN: 0167-8655, pp. 58-66 (ISI).
[7] Van-Hung Le, Hai Vu, Thuy Thi Nguyen (2018). A Comparative Study on Detection and Estimation of a 3-D Object Model in a Complex Scene. 10th International Conference on Knowledge and Systems Engineering (KSE 2018), pp. 203-208.
[8] Van-Hung Le, Hai Vu, Thuy Thi Nguyen, Thi-Lan Le, Thanh-Hai Tran (2018). GCSAC: geometrical constraint sample consensus for primitive shapes estimation in 3D point cloud. International Journal of Computational Vision and Robotics, Accepted (SCOPUS).
[9] Van-Hung Le, Hai Vu, Thuy Thi Nguyen (2018). A Frame-work assisting the Visually Impaired People: Common Object Detection and Pose Estimation in Surrounding Environment. 5th NAFOSTED Conference on Information and Computer Science (NICS 2018), pp. 118-223.
[10] Hai Vu, Van-Hung Le, Thuy Thi Nguyen, Thi-Lan Le, Thanh-Hai Tran (2019). Fitting Cylindrical Objects in 3-D Point Cloud Using the Context and Geometrical constraints. Journal of Information Science and Engineering, ISSN: 1016-2364, Vol. 35, N. 1 (ISI).
