Doxel has won the trust of its customers by developing a data pipeline that quickly turns visual data captured from construction sites into the mission-critical insights needed to deliver projects on time and on budget.
In the past, much of this data capture has been done using a technology called LiDAR (Light Detection and Ranging). A modern mobile LiDAR device can capture a highly detailed and accurate 3D point cloud of large commercial and industrial spaces in a single walk-through, along with localized panoramic imagery. Using this information, both humans and algorithms have access to a wide variety of contextualized information about the site, such as installed quantities, which can be used to generate progress data, or deviation analyses, which can be used to assess quality.
With centimeter-level accuracy and tens of thousands of square feet captured per hour, LiDAR devices stand out for the sheer amount of data they provide in a site walk-through, and will likely continue to be an important part of our sensor portfolio. However, cost is their major downside, and this makes them a difficult investment on smaller construction sites with smaller budgets. We decided that by leveraging a more ubiquitous data capture method, we could achieve our mission of bringing AI Powered Project Controls to every job site.
A natural hardware candidate to complement LiDAR was the trusty 360 camera often used for photo documentation. However, providing accurate schedule or budget predictions requires very accurate Ground Truth data to feed into our Construction Encyclopedia, and to validate our ability to meet the strict tolerances customers have come to rely on. Many variables, from sensor quality to localization error to perception algorithm accuracy, might prevent this cheap hardware from actually answering key questions on construction sites.
This article explains how, last year, we proved that such a sensor can serve as a Ground Truth data capture device, and how it became a dominant part of our supported hardware portfolio. We focus specifically on VSLAM, which was one of the highest-risk R&D pieces in the stack.
What is VSLAM?
Vision-based simultaneous localization and mapping (VSLAM) is the name of the research field that aims to give computers and robots spatial understanding through vision.
Imagine you visit a new city for the first time. You go out for a walk and leave your phone behind. Using only your eyes, your inner ear, and your brain, you discover the nearby convenience store, park, and metro station, and are back in time for bed. How did you do it?
Your eyes track small objects in the world with help from initial estimates coming from proprioception and the movement of the fluid in your ears, a process we can refer to as dead reckoning. But this process alone isn't enough to get you home. As you go around block after block and figure out how to return to where you've been before, you form a list of landmarks in your head. If your dead reckoning ever fails you, you keep moving until you eventually happen upon a place you've already been, and everything snaps into place. You can now close your eyes and imagine the relative locations of all of the points of interest in your new neighborhood. When it's time to wake up in the morning, you can head straight in the direction of the metro station for your appointment.
At a high level, VSLAM does almost the exact same thing. It relies on a motion model combined with visual features to perform local tracking. A separate subsystem applies pattern recognition techniques to build and maintain a database of visual information about salient places in your environment, and continuously matches new observations against it to close loops. Finally, a nonlinear optimization process aggregates all of the constraints provided by feature detection, local tracking, and loop closure into a coherent pose and view graph.
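As a toy illustration of that final optimization step, here is a minimal 1D pose-graph sketch (not our production system): four noisy odometry steps accumulate drift, and a single loop-closure constraint snaps everything back into place, much like happening upon a street you recognize.

```python
import numpy as np

# Toy pose-graph optimization in 1D: five poses connected by noisy
# odometry ("dead reckoning") constraints, plus one loop-closure
# constraint. With equal weights this is a linear least-squares
# instance of the nonlinear optimization described above.

# Relative-motion measurements between consecutive poses (odometry).
# The true steps are all 1.0, so the raw chain drifts to 4.2.
odometry = [1.0, 1.1, 0.9, 1.2]
# Loop closure: a revisit measures the displacement from pose 0 to
# pose 4 as exactly 4.0, contradicting the accumulated odometry.
loop = (0, 4, 4.0)

n = 5
A, b = [], []
# Anchor pose 0 at the origin to fix the gauge freedom.
row = np.zeros(n); row[0] = 1.0
A.append(row); b.append(0.0)
# One row per odometry constraint: x_{i+1} - x_i = measurement.
for i, d in enumerate(odometry):
    row = np.zeros(n); row[i] = -1.0; row[i + 1] = 1.0
    A.append(row); b.append(d)
# Loop-closure row: x_j - x_i = measurement.
i, j, d = loop
row = np.zeros(n); row[i] = -1.0; row[j] = 1.0
A.append(row); b.append(d)

poses, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
print(np.round(poses, 3))  # drift is spread evenly around the cycle
```

The solver distributes the 0.2 of cycle inconsistency evenly across the five edges, pulling the final pose from its drifted 4.2 toward the loop-closure measurement of 4.0.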
By performing this process, we can produce localized panoramas for every frame in a 360 video. Imagine a 3D view of your construction site. Thanks to VSLAM, our system's understanding of the site looks like this (figure 1.0):
In the right pane, there is a map of the site, with blue markers indicating the trajectory and the red marker indicating the camera's position and direction. In the left pane, you can see the model overlaid on the 360 video. As the figure shows, VSLAM gives context to every frame of a video, giving a person or algorithm access to the views provided by a walk-through, in all directions.
1. A small technical note: we do not impose the real-time constraint on our system that is typical in VSLAM, so the designation VSLAM isn't 100% accurate, but the technique we use has a lot in common with it, so it's the best way to get our point across with a single word.
VSLAM first appeared in the journals decades ago. Why is it exciting now? As robotics researchers will tell you, until recent years there weren't many reliable, production-ready systems relying on pure VSLAM (using no other sensors), and those that did exist were won through extensive engineering investment, and often had their usefulness compromised by failures in relatively common corner cases. VSLAM is a challenging problem, so before we even started to invest resources in using it in our pipeline, we had to be sure that it would pan out to be a wise investment. We justified the investment by considering the market, our position in it, and the amount of technical risk implied by recent successful technologies and research.
First and foremost, we considered our product and technology progress. When Doxel set out to build an AI Powered Project Controls platform, we knew we needed the best data possible to avoid the “garbage in, garbage out” problem of many insight engines. Doxel needed to build an end-to-end product that digitized the physical world - breaking the most important “data silo” preventing construction site progress information from flowing into schedules and budgets. While daunting, we proved it was possible, and had a production end-to-end pipeline running for our customers. This meant we could “swap in” a 360 camera and compare it to the best possible input data, so we would truly know whether we had a “garbage in” problem without risking our accuracy.
Secondly, we looked at recent deployments of VSLAM-like technologies throughout the industry as a predictor of the level of robustness that we might achieve if we pursued VSLAM. The research field and the industry have recently made huge strides. With the release of ARKit in 2017, Apple proved that VSLAM technology can be made robust. This triggered an industry renaissance, and applications of VSLAM and localization technologies have grown more numerous, reducing the risk of bringing a new application of VSLAM to market.
Finally, we considered the hardware market, and looked for ways to maximize our chances of success by tailoring our application of VSLAM to our business case. VSLAM by definition tries to solve all problems at once. ARKit is an example of the classic case: you have few priors about your environment, you want to build a map in real time and use it with zero delay, and you want to use the sensors and compute resources available on a typical smartphone, with few modifications. In contrast, we have the luxury of being able to relax the real-time constraint, throw more computational power at the problem, and opt for 360 cameras. One big weakness of VSLAM is that the amount of information present in a single image can be quite limited, as you can see in the image below. This makes it necessary to be careful during data capture and make sure that a diverse and full set of perspectives is captured to perform VSLAM effectively.
But when the camera can see in every direction at once, the algorithm gets more information and the capture technician is relieved of the burden of being careful while capturing data. Below, you can see the amount of information captured by a 360 camera, as opposed to the perspective image above.
Bringing VSLAM To The Construction Site
Doxel is uniquely positioned to rapidly bring AI Powered Project Controls leveraging 360-degree cameras and VSLAM to market. Our customer portfolio gives us access to hundreds of realistic construction environments, which are concurrently scanned using 360 cameras and LiDAR devices. LiDAR devices give us highly accurate data that we can regard as ground truth, and this data allowed us to rapidly prove out and build a VSLAM data pipeline.
Approaching LiDAR Accuracy
Before investing engineering resources in reliability and scale for VSLAM, we set out to prove that it could provide the accuracy needed to disambiguate two installations on a site - for example, two pipes side by side. Over several months we captured dozens of environments alongside LiDAR, performed VSLAM-based reconstructions of the sites, and compared the quality of registration of the panorama frames to that provided by our ground truth sensor. The results were exhilarating. Across dozens of capture sessions, VSLAM produced far more accurate trajectories than we expected. A visualization from one of these experiments is given below. The blue markers are from a VSLAM-derived trajectory, and the green markers are samples of ground truth from a LiDAR unit.
This level of accuracy was not only sufficient to provide a general idea of where in the site you were - we overshot our accuracy goal by 4x. With some small hints easily provided by a human in the loop, we were able to produce the overlays shown in figure 1.0, and can consistently do so despite challenging capture conditions. We could confidently make the decision to invest in the harder engineering work of making a reliable, scalable pipeline.
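For readers curious how a trajectory can be graded against ground truth, here is a minimal sketch of the kind of metric involved: the RMSE of per-pose position error, often called absolute trajectory error (ATE). The coordinates below are made up for illustration; they are not our experimental data.

```python
import numpy as np

# Sketch of grading a VSLAM trajectory against LiDAR ground truth:
# sample both trajectories at matching timestamps (assumed already
# done and expressed in the same coordinate frame), then compute the
# RMSE of the per-pose position error.

vslam_xy = np.array([[0.0, 0.0], [1.0, 0.1], [2.1, 0.0], [3.0, -0.1]])
lidar_xy = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0,  0.0]])

errors = np.linalg.norm(vslam_xy - lidar_xy, axis=1)  # per-pose error (m)
ate_rmse = np.sqrt(np.mean(errors ** 2))
print(f"ATE RMSE: {ate_rmse:.3f} m")
```

A real evaluation also has to interpolate poses between timestamps and align the two trajectories before differencing; this sketch assumes both have already been handled.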
Scaling The Pipeline
Any system we push to production needs to work in a wide variety of circumstances, so we set out to build a pipeline that would produce localized panoramas reliably. This pipeline - now a reality - mixes localization (a technique that combines geometric computer vision and pattern recognition) with manual steps, minimizing the labor Data Capture Technicians spend post-processing scans. We describe the ingredients of this pipeline in the following sections, and tie it all together in the final one.
A big part of making a scalable VSLAM pipeline is localization; it puts the L in VSLAM. Your brain can easily tell that these two video frames are from the same place, at a different time:
It turns out, so can computers these days - to approximately 0.2 m of accuracy. So if you have a 3D reconstruction of the first video, it can be used to register frames from the second into the same coordinate system. In the graphic below, the path from the first video frame seen above is shown in blue, and the path from the second in gray. Localization was used to register the gray trajectory to the blue one, which had already been registered to the BIM model (purple).
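The geometric core of such a registration can be sketched as aligning one set of matched points to another with a rigid transform, using the classic Kabsch/Procrustes SVD solution. The points below are illustrative, and this is a simplification: a full localization pipeline must also find the correspondences between frames in the first place.

```python
import numpy as np

# Minimal sketch: estimate the rigid transform (R, t) that maps one
# trajectory onto another, given matched point pairs. This is the
# closed-form Kabsch/Procrustes solution via SVD.

def rigid_align(src, dst):
    """Find R, t minimizing sum ||R @ src_i + t - dst_i||^2."""
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t

# A square path, and the same path rotated 90 degrees and shifted.
src = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
theta = np.pi / 2
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
dst = src @ R_true.T + np.array([5.0, 2.0])

R, t = rigid_align(src, dst)
aligned = src @ R.T + t
print(np.max(np.linalg.norm(aligned - dst, axis=1)))  # residual ~0
```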
Localization as a technology is relatively mature. In our experiments, localization was able to associate images taken in the same environment of a construction site across multiple months of construction activity. There is a two-month gap between the two videos from which the above image pairs and trajectory maps were taken.
Ensuring Data Quality With Humans In The Loop
Part of localization's maturity comes from the fact that it's a relatively easy problem; because the reference and source data share the same modality, we can draw reliable and accurate inferences. However, until we have at least one panorama video registered to the BIM model space, we face a much more challenging problem that crosses modalities. When we haven't captured a space before, we have an image we want to register (below, left), but the reference we want to associate it with is a triangle mesh (below, right). Associating data between these two modalities is a very challenging research problem.
Rather than attempting to solve an open research problem, we paired the technology with humans to ensure accuracy. We developed specialized tools to minimize the complexity of the task and keep our operational overhead low. Using the onboarding tool we built, a person familiar with the site and video can provide the system with what it needs for registration in 15 minutes of annotation. With that single piece of input, we can register continuous data capture sessions over months. By picking our battles, we were able to rapidly prove the viability of the other pieces of the pipeline.
As part of our R&D testing of this pipeline, we put hundreds of videos through it to evaluate their accuracy against ground truth. One by one, we addressed the risks that could jeopardize the quality of our data, and therefore our product.
We used our registration tooling over and over again, making sure that it wouldn't break when we needed it and that we could register datasets to the level of accuracy we needed. We collected 360 video with a rigidly attached LiDAR to grade the accuracy of our reconstruction algorithms. We developed 2D labeling tooling equivalent to our 3D tooling, so that we could do most of the same things with this pipeline that we can with our 3D pipeline. We obsessively improved this tooling, reducing the time each task takes, so that we could confidently call the pipeline scalable. And we subjected the pipeline to many labeling experiments with production data, ensuring that the results were up to par with our existing accuracy and efficiency numbers.
By the time we were done, we felt confident that the processed data from 360 cameras met Doxel’s strict accuracy bar.
Putting It All Together
A technician on the ground uploads data to our processing cloud, where SLAM is performed. Concurrently, data association between the video and the BIM model takes place.
Data association is performed in one of two ways. If the environment has never been seen before, the capture technician is paged to give the system hints in a couple of places about where panoramas were captured. If a video of the same environment has been processed before, the pipeline is completely hands-off - it uses visual localization to perform the same task. The result of data association is used to register the video frames to the model.
Following registration, a video of registered panoramas is delivered to downstream ML algorithms and labelers. Our customers use our Doxel Schedule and Doxel Cost products today, and these products require data capture pipelines that can gather the data needed to mark components installed on the site. Using the localized panorama videos provided by VSLAM, our labelers and ML algorithms have a novel data stream for performing these tasks. This concept is shown with real data below: we've registered a panorama video to our BIM model and shown how newly-installed objects could be marked as installed using ML or labeling.
In addition to the VSLAM applications around quantity tracking currently in use, we are delivering improvements to the metric accuracy, labeling accuracy, and user-friendliness of this system. Stay tuned as we continue to improve this product so it applies to a broader and broader set of use cases within Doxel.