Incremental structural modeling on sparse visual SLAM
 Rafael A. Roberto^{1}Email author,
 Hideaki Uchiyama^{2},
 João Paulo S. M. Lima^{1, 3},
 Hajime Nagahara^{2},
 Rinichiro Taniguchi^{2} and
 Veronica Teichrieb^{1}
DOI: 10.1186/s4107401700183
© The Author(s) 2017
Received: 21 February 2017
Accepted: 7 March 2017
Published: 15 March 2017
Abstract
This paper presents an incremental structural modeling approach that improves the precision and the stability of existing batchbased ones for sparse and noisy point clouds from visual simultaneous localization and mapping (SLAM). The main idea is to use the generating process of point clouds on SLAM effectively. First, a batchbased method is applied to point clouds that are incrementally generated from SLAM. Then, the temporal history of reconstructed geometric primitives is statistically analyzed to suppress incorrect reconstruction. The evaluation shows that both precision and stability are improved compared to an existing batchbased method, and the proposed method is suitable for realtime structural modeling.
Keywords
Incremental structural modeling Geometric modeling Sparse visual SLAM1 Introduction
The generation of 3D shape has become common thanks to the popularization of lowcost depth sensors. Such sensors acquire the shape described by a dense point cloud that is converted into a continuous surface defined by meshes. Then, the surface is converted into geometric primitives by estimating the types of the shapes and their parameters. This structural modeling is useful in various applications such as 3D scene understanding and reverse engineering. Also, it saves plenty of memory because a dense point cloud is basically redundant for scene description, and geometric primitives can be described by fewer parameters.
Most of the existing methods on structural modeling estimate geometric primitives from a dense point cloud acquired with LiDAR scanners at large outdoor scales [1] and with Kinectlike sensors at indoor enviroments [2]. Their research issue is how to estimate model parameters of particular primitives [3]. For example, there exist methods to determinate the parameters of 3D planes [4] and cylinders [5, 6]. Different types of primitives can be simultaneously detected using iterative model fitting [7, 8]. Hough transformbased approaches are also proposed to be robust to a noisy point cloud [9]. However, they are highly timeconsuming due to the high density of the cloud. Also, the input data is usually a dense point cloud, and the performance with a sparse point cloud is not investigated.
In the existing methods, random sample consensus (RANSAC)based methods can apply to sparse point clouds because they estimate primitives by initially picking a minimal group of points for each shape and detecting the one that approximates the maximum number of points [8]. Besides, they can also work with data containing a large number of outliers [3]. Therefore, the performance of a RANSACbased method [8] for sparse point clouds is first investigated, and its drawbacks are analyzed in Section 2. Note that there exist methods that deal with sparse point clouds, but they can detect planes only such that textured planes are reconstructed [10].
Another aspect of existing methods is that they basically work in a batch and an offline manner. It means that an input data is analyzed all together only once. In recent years, the performance of online and incremental 3D reconstruction such as visual simultaneous localization and mapping (SLAM) has drastically been improved regarding both the accuracy of the reconstruction and the computational cost for realtime applications [11, 12]. Compared with batch acquisition of a point cloud, an incremental approach can provide not only a point cloud but also the generating process of the cloud, which, as far as the authors know, has never been incorporated into structural modeling in the literature.
This paper presents an incremental structural modeling that estimates geometric primitives on sparse keypointbased visual SLAM. The key idea is to use not only a sparse point cloud but also the generating process of the cloud on SLAM. First, an existing batchbased structural modeling is applied to a sparse point cloud that are incrementally updated on SLAM. Then, the types of shapes and their parameters in the cloud are statistically determined from the temporal history of estimated geometric primitives. Since the estimation of geometric primitives on sparse point clouds is sensitive to noisy reconstruction, both precision and stability can be improved by using the generating process. Also, the proposed method is suitable for realtime structural modeling, as discussed in the evaluation.
2 Evaluation of efficient RANSAC
Since Efficient RANSAC has five heuristic parameters, different combinations of these parameters were tested to find the best ones for sparse point clouds. It was searched for the parameters that maximized the number of assigned points, which were the ones from the input point cloud that were modified to be fitted to a shape. These parameters also minimized the distance between these points and the ones on the input point cloud. Figure 1 c is the result of the best set of parameters, which assigned 90.38% of the points with an average distance of 3.77 centimeters for each assigned point. From this result, it was found that geometric primitives were able to be detected even from sparse point clouds if best parameters were selected. However, the instability of Efficient RANSAC was also unveiled.
Figure 1 d shows the shapes detected using the same data as the previous test with different initial values for RANSAC. In the result, 89.75% of the points were assigned with an average distance of 2.01 cm. Compared to the previous test, six of them were differently detected, and five of them were wrong. This unstable result often occurred with noisy and sparse point clouds from visual SLAM. To stably estimate geometric primitives from such clouds, a method to incorporate the generating process of the clouds into structural modeling is proposed in the next section.
3 Incremental structural modeling
Existing structural modeling worked with all the input point cloud available before primitive estimation. In visual SLAM, point clouds are incrementally built, and their generating process is also available. From this process, the temporal history of the primitives detected at each update of the point clouds can be computed. In order to improve both reliability and stability of structural modeling, the history of estimated primitives is statistically analyzed. In other words, the proposed method evaluates the history of estimated primitives to eliminate unreliable results, replace incorrect primitives with correct ones and restore shapes that were not detected at a particular moment.
3.1 Fusion of shapes
One characteristic of sparse point clouds from keypointbased visual SLAM is that keypoints are clustered at highly textured areas. For example, in Fig. 1 a, two different patterns over the table were observed and formed two distinct clusters of points as illustrated in Fig. 1 b. In this case, efficient RANSAC detected them as two separated planes even though they belong to the same plane on the table and should be detected as one. In other cases, one object can be detected as several different objects due to nonuniform distribution of points in the cloud.

Plane: the planes are parallel given an angle threshold α _{ t }, and the distance between them is smaller than the distance threshold d _{ t };

Sphere: the distance between their centers is lower than d _{ t }, and the difference between their radii is less than the radius threshold r _{ t };

Cylinder: the angle between the axis direction of both cylinder is smaller than α _{ t }, the distance between them is less than d _{ t }, and the difference between their radii is smaller than r _{ t }.
While d _{ t } and r _{ t } are controlled for each case such that they are 2% of the largest size of the point cloud bounding box, α _{ t } is always set to 5°.
3.2 Temporal update of shapes
Shape intersection: the intersection ratio between the bounding boxes of every shape on the current keyframe and those on the previous one is computed. Shapes that appear for the first time have no intersection and they are registered for future evaluation.
Shape distance: for primitives on the current keyframe that intersect with other ones on the previous keyframe, the distance between their centers of mass is computed. Shapes with high intersection ratio and small distance in two consecutive moments have more chance to be correspondents, assuming that the scene is stationary.
where n s _{ c } are the indices of the shapes of the given class with intersection on previous keyframes, ψ _{ i } is the number of points of that shape, p _{ i } and d _{ i } are the intersection ratio and distance between centers of mass, respectively, and ψ _{ s } is the number of points in the primitive on the current keyframe.
Shape update: each shape keeps the history of primitive classes since its first appearance. The current shape inherits this record from the one detected as its correspondent, and the shape type is updated to determine the class of primitive detected on the current keyframe. Then, this historical data is evaluated to verify if the current type is compatible with the whole history. The shape type can change if it is different from the dominant class, which is the one that appeared on more than 50% of the time.
Shape recovery: it is verified if there are shapes that were not detected on current keyframe but were on the previous one. These shapes are brought back to the current set of primitives with the same parameters from last appearance. However, the historic of the shape will register the number of keyframes in which it was recovered instead of detected. If this shape is not detected anymore, it will disappear when it is recovered more than detected. Finally, only the stable estimated shapes will be displayed by the Render.
4 Evaluation
The comparison of precision, recall, and F_{1}score between the proposed method (ISM) and efficient RANSAC (RSM)
Precision  Recall  F_{1}Score  

Case1  RSM  0.822  0.819  0.820 
ISM  0.993  0.832  0.906  
Case2  RSM  0.756  0.756  0.756 
ISM  1.000  0.963  0.981  
Case3  RSM  0.795  0.795  0.795 
ISM  1.000  0.900  0.947  
Case4  RSM  0.500  0.489  0.494 
ISM  0.647  0.440  0.524  
Case5  RSM  0.761  0.750  0.755 
ISM  0.942  0.645  0.766 
In two cases, RSM achieved a better recall because it outputs twice more shapes on average. However, some of them were incorrect and the precision became lower. In ISM, such shapes were filtered out because they were not reliable and the precision became higher.
Due to these improvements, ISM achieved more precise and more stable detection. Figure 3 (rows 2 and 4) shows some cases in which the same object was detected as distinct shapes at different moments by RSM. This fact is rarely noticed when using ISM, which usually shows the correct primitive regardless if the one detected does not display it if the shape is not reliable. Moreover, the results become more stable as the time passes because there is more information to decide the correct primitive in ISM.
As for the computational cost, the process to estimate the shape from a point cloud depends on its size. The time was collected on every test and on average takes 29.04 ± 16.21 ms per group of thousand points to retrieve the primitives on a computer with an Intel Core i56300U (2.40 GHz) and 8 GB of RAM. The largest and the smallest point clouds are from the last keyframe of Case1 (16,698 points) and first keyframe of Case2 (940 points). The entire process of combining different shapes and computing the temporal information takes an additional 5.77 ± 1.00 ms per group of thousand points.
5 Conclusions
It was proposed in this work a method that incrementally estimated geometric primitives from the sparse maps generated by visual SLAM. This approach used the history information from every reconstruction to select the correct type of shape and their parameters so that the result was stable even when dealing with noisy point clouds. The evaluation showed that the use of temporal information contributed to more precise and more stable modeling and presents better results regarding precision F_{1}score and, in most cases, recall.
One improvement for future versions of this method is to also use the color information available on the images on both the fusion of shapes as well as on the incremental modeling. Another future work is to develop an application that uses the incremental approach in combination with a realtime visual SLAM system.
6 Endnote
^{1} Source code available at https://goo.gl/XINs6N.
Declarations
Acknowledgements
The authors would like to thank Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) (processes 140898/20140 and 456800/20140) for partially funding this research.
Authors’ contributions
RAR contributed by conceiving, designing, and programming the technique, acquiring, analyzing, and interpreting the data, and drafting the manuscript. HU contributed by conceiving and designing the technique, analyzing and interpreting the data, and drafting and revising the manuscript. JPSML and HN contributed by conceiving and designing the technique, analyzing and interpreting data, and revising the manuscript. RT and VT contributed by conceiving and designing the technique, revising the manuscript and coordinating the research. All authors read and approved the final manuscript.
Authors’ information
Rafael A. Roberto is a Ph.D. candidate in Computer Science at the Federal University of Pernambuco and a researcher at Voxar Labs. His research interests include 3D tracking, computer vision, and augmented reality for mobile devices. Hideaki Uchiyama is an assistant professor at Kyushu University. His research interests include computer vision and augmented reality. João Paulo S. M. Lima is an assistant professor at Statistics and Informatics Department of at the Federal Rural University of Pernambuco and senior scientist at the Voxar Labs. His research interests include 3D tracking, augmented reality, computer vision, and computer graphics. Hajime Nagahara is an associate professor at Kyushu University. His research interests include computer vision and computational photography. Rinichiro Taniguchi is a full time professor at Kyushu University. His research interests include image processing, pattern recognition, and computer vision. Veronica Teichrieb is an associate professor at the Federal University of Pernambuco and head of the Voxar Labs research group. Her research interests include augmented reality, visualization, tracking and interaction.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
 Musialski P, Wonka P, Aliaga DG, Wimmer M, Gool L, Purgathofer W (2013) A survey of urban reconstruction. Comput Graph Forum 32(6): 146–177. doi:10.1111/cgf.12077.
 Kim YM, Dolson J, Sokolsky M, Koltun V, Thrun S (2012) Interactive acquisition of residential floor plans In: 2012 IEEE International Conference on Robotics and Automation, 3055–3062.. IEEE Publications, St. Paul,doi:10.1109/ICRA.2012.6224595.
 Roth G, Levine MD (1993) Extracting geometric primitives. CVGIP: Image Underst 58(1): 1–22. doi:10.1006/ciun.1993.1028.
 Nguyen T, Reitmayr G, Schmalstieg D (2015) Structural modeling from depth images. IEEE Trans Vis Comput Graph 21(11): 1230–1240. doi:10.1109/TVCG.2015.2459831.
 Liu YJ, Zhang JB, Hou JC, Ren JC, Tang WQ (2013) Cylinder detection in largescale point cloud of pipeline plant. IEEE Trans Vis Comput Graph 19(10): 1700–1707. doi:10.1109/TVCG.2013.74.
 Qiu R, Zhou QY, Neumann U (2014) Piperun extraction and reconstruction from point clouds. In: Pajdla T, Schiele B, Tuytelaars T (eds)Fleet D, 17–30.. Springer International Publishing, Zurich. doi:10.1007/9783319105789_2.
 Li Y, Wu X, Chrysathou Y, Sharf A, CohenOr D, Mitra NJ (2011) GlobFit: Consistently fitting primitives by discovering global relations. ACM Trans Graph 30(4): 52–15212. doi:10.1145/2010324.1964947.
 Schnabel R, Wahl R, Klein R (2007) Efficient RANSAC for pointcloud shape detection. Comput Graph Forum 26(2): 214–226. doi:10.1111/j.14678659.2007.01016.x.
 Drost B, Ilic S (2015) Local hough transform for 3d primitive detection In: 2015 International Conference on 3D Vision, 398–406.. IEEE Publications, Lyon,doi:10.1109/3DV.2015.52.
 Sinha SN, Steedly D, Szeliski R, Agrawala M, Pollefeys M (2008) Interactive 3d architectural modeling from unordered photo collections. ACM Trans Graph 27(5): 159–115910. doi:10.1145/1409060.1409112.
 MurArtal R, Montiel JMM, Tardós JD (2015) ORBSLAM: a versatile and accurate monocular slam system. IEEE Trans Robot 31(5): 1147–1163. doi:10.1109/TRO.2015.2463671.
 Uchiyama H, Taketomi T, Ikeda S, dM Lima JPS (2015) [POSTER] Abecedary tracking and mapping: a toolkit for tracking competitions In: 2015 IEEE International Symposium on Mixed and Augmented Reality, 198–199.. IEEE Publications, Fukuoka,doi:10.1109/ISMAR.2015.63.