Phase disambiguation using spatio-temporally modulated illumination in depth sensing

Phase ambiguity is a major problem in depth measurement with either time-of-flight or phase shifting. Resolving the ambiguity using a low-frequency pattern sacrifices depth resolution, and using multiple frequencies requires a number of observations. In this paper, we propose a phase disambiguation method that combines temporal and spatial modulation so that high depth resolution is preserved while the number of observations is kept small. A key observation is that the phase ambiguities of the temporal and spatial domains appear differently with respect to the depth. Using this difference, the phase can be disambiguated over a wider range of interest. We develop a prototype and show the effectiveness of our method through real-world experiments.


Introduction
Depth measurement is widely used in applications such as augmented reality, factory automation, robotics, and autonomous driving. In the computer vision field, there are two well-known techniques for measuring scene depth using active illumination. One is the time-of-flight camera, which uses temporally modulated illumination to measure the travel time of light; the other is phase shifting, which uses spatially modulated illumination to find the correspondence between the projector and the camera for triangulation.
A common problem is how to resolve the periodic ambiguity of the phase, because either measurement gives a phase that is defined only between 0 and 2π. A typical solution is to use multiple frequencies to resolve the phase ambiguity. However, the phase ambiguity still exists at the frequency of the greatest common divisor, which requires several measurements to obtain a wider range of interest. Another possible approach is to use a low frequency, which sacrifices depth resolution. The aim of this study is to resolve the phase ambiguity in fewer observations while guaranteeing both a wider range of interest and a better depth resolution.
A key observation of this paper is that the phase ambiguities of time-of-flight (ToF) and phase shifting appear differently in the depth domain. Since the temporal phase is proportional to the depth, the depth candidates from the temporal phase appear at equal intervals along the depth. On the other hand, the spatial phase is defined in the disparity domain; hence, its depth candidates appear at gradually increasing intervals. Based on this difference, the phase ambiguity can be resolved by combining temporal and spatial modulation. Because a candidate depth that satisfies both measured phases seldom appears more than once, the number of frequencies can be reduced to one for each domain. In this paper, we discuss ordinary ToF and phase shifting in the same framework. We show that precise depth can be measured over a wide range by combining temporal and spatial modulation. We also derive the resolution and the range of interest theoretically, analyze the recoverability, and build a prototype to show the effectiveness of our method via real-world experiments. This paper extends its preliminary version [1] with the following differences: we (1) reveal the depth resolution and the range of interest of the proposed method, (2) develop an efficient implementation, and (3) confirm by simulation that unrecoverable depths due to ambiguity seldom exist.
The rest of the paper is organized as follows. Related work is discussed in Section 2, a brief review of the ordinary time-of-flight and phase shifting algorithms is provided in Section 3, a spatio-temporal modulation technique is proposed in Section 4, the resolution and range of interest of our method are analyzed in Section 5, experiments with a prototype system are shown in Section 6, and we conclude with some discussions in Section 7.

Related work
Active depth measurement has been widely studied in the computer vision field. Earlier work used a projector-camera system to encode the projector's pixel index into multiple projection images based on the Gray code [2]. The phase shifting approach [3] recovers subpixel correspondences by detecting the phase of a sinusoid. Gupta and Nayar [4] unwrapped the phase from slightly different frequencies so that the method became robust to indirect light transport with a small budget of projected patterns. Mirdehghan et al. [5] proposed an optimal code for the structured light technique. The time-of-flight method is another way to measure depth. It emits amplitude-modulated light and detects the delayed signal, whose delay corresponds to the scene depth [6]. Because the range of interest and the depth resolution are in a tradeoff, a better resolution is obtained by limiting the range of interest [7]. We combine these techniques to realize both better resolution and a wider range of interest.
Another problem regarding ToF is multi-path interference due to indirect light transport. Recovering the correct depth of multi-path scenes has been broadly studied using parametric models [8,9], K-sparsity [10,11], frequency analysis [12], and data-driven approaches [13][14][15]. Because the scene depth can be recovered from the first-returning photon, the depth can also be obtained after recovering light-in-flight imaging [16][17][18][19][20][21]. Multi-path interference can be mitigated by combining a ToF camera and a projector. Naik et al. [22] combined a ToF camera and a projector-camera system to mitigate multi-path using direct-global separation [23]. Similar ideas have been implemented with ToF projectors that can modulate both spatially and temporally [24,25]. In both cases, direct-global separation is utilized to mitigate multi-path interference. We use a similar system, but for phase disambiguation rather than only for mitigating multi-path.
To obtain finer resolution, Gupta et al. [26] propose optimal codes for ToF modulation. Gutierrez-Barragan et al. [27] propose an optimization approach for designing practical coding functions under hardware constraints. Kadambi et al. [28] use a polarization cue to recover smooth surfaces. Our method lies at a more fundamental layer; hence, these techniques can be incorporated into our method to boost the resolution. An interferometer can also obtain micrometer resolution for a small object: interferometry gives micrometer resolution [29] in a carefully controlled environment. Li et al. [30] recover micrometer-resolution ToF using a superheterodyne technique. Maeda et al. [31] leverage the heterodyne technique for polarization imaging to obtain accurate depth.
Phase unwrapping is a subproblem of depth measurement. The phase has to be unwrapped in either phase shifting or ToF; otherwise, the estimated depth has a 2π ambiguity. The number of observations can be reduced by sacrificing spatial resolution: the projector's coordinates can be obtained from a single image using a color code [32], a wave grid pattern [33], or a light-field ToF [34]. Our method falls into this class but neither sacrifices the spatial resolution nor requires many patterns. Our method leverages the asymmetric relation between spatial and temporal wrapping to resolve the phase ambiguity.

Depth measurement techniques using modulated illumination
Before explaining our method, we briefly review the ToF and phase shifting methods. We explain each of them as a phase measurement using temporally or spatially modulated light, respectively.

Temporal modulation (time-of-flight)
The ToF camera emits temporally modulated light as shown in Fig. 1a. It measures the amplitude decay and phase delay of the modulated light; the phase delay corresponds to the time it takes for the light to make a round trip. The ToF camera measures the correlation between the signals emitted and those received. For each frequency, the phase delay is calculated from the correlations with N_T reference signals, which are temporally shifted. For the k-th signal, the correlation i_k(x) at the camera pixel x is represented as

    i_k(x) = g(t + 2πk/N_T) ∗ s(t) = A(x) cos(φ_T(x) − 2πk/N_T) + O(x),

where g(t + 2πk/N_T) is the reference signal with the shifted phase 2πk/N_T, s is the returned signal, the ∗ operator represents the correlation, A is the amplitude decay, φ_T is the phase delay, and O is the ambient light. In the case of N_T = 4, the phase φ_T and the amplitude A are recovered as

    φ_T(x) = arctan((i_1(x) − i_3(x)) / (i_0(x) − i_2(x))),    A(x) = (1/2) √((i_1(x) − i_3(x))² + (i_0(x) − i_2(x))²).

The depth d is obtained as

    d = c φ_T / (2 ω_T),

where ω_T is the modulation angular frequency and c is the speed of light.
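The four-step recovery above can be sketched in a few lines of code. The following is an illustrative NumPy sketch, not the prototype's implementation; the modulation frequency and the simulated target are example values.

```python
import numpy as np

# Illustrative 4-bucket ToF decoding (N_T = 4), following
# i_k = A*cos(phi_T - 2*pi*k/4) + O. All values are examples.
C = 299_792_458.0            # speed of light [m/s]
OMEGA_T = 2 * np.pi * 60e6   # modulation angular frequency [rad/s]

def decode_tof(i0, i1, i2, i3):
    """Recover wrapped phase delay, amplitude, and offset from 4 correlations."""
    phi_t = np.arctan2(i1 - i3, i0 - i2) % (2 * np.pi)
    amp = 0.5 * np.hypot(i1 - i3, i0 - i2)
    offset = 0.25 * (i0 + i1 + i2 + i3)
    return phi_t, amp, offset

def depth_from_phase(phi_t):
    """d = c*phi_T / (2*omega_T): half the round-trip distance."""
    return C * phi_t / (2 * OMEGA_T)

# Simulate a target at 1.0 m (within the unambiguous range) and decode it.
d_true = 1.0
phi = (2 * OMEGA_T * d_true / C) % (2 * np.pi)
i = [0.8 * np.cos(phi - 2 * np.pi * k / 4) + 0.3 for k in range(4)]
phi_est, amp_est, off_est = decode_tof(*i)
print(depth_from_phase(phi_est))  # ~1.0
```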

Spatial modulation (phase shifting)
The phase shifting spatially modulates the projection pattern. Finding the correspondences between the projector and camera pixels is the main part of the spatial phase shifting. The idea is to project the sinusoidal pattern as shown in Fig. 1b and measure the phase of the sinusoid for each pixel, which corresponds to the projector's pixel coordinates.
The observed intensity of the camera I_l(x) for the l-th shift is represented as

    I_l(x) = A(x) cos(φ_S(x) − 2πl/N_S) + O(x),

where φ_S is the spatial phase of the projection pattern due to disparity. There are three unknown parameters: the offset O, the amplitude A(x), and the phase φ_S(x); therefore, they can be recovered from N_S ≥ 3 observations while changing the phase of the pattern. In the case of N_S = 4, the spatial phase φ_S and the amplitude A can be recovered in the same way as the ToF:

    φ_S(x) = arctan((I_1(x) − I_3(x)) / (I_0(x) − I_2(x))),    A(x) = (1/2) √((I_1(x) − I_3(x))² + (I_0(x) − I_2(x))²).

From the estimated disparity, the scene depth can be recovered using triangulation. For example, when parallel stereo is assumed, the depth is inversely proportional to the disparity:

    d = f b / δ,    δ = x − p,

where δ is the disparity, ω_S is the spatial angular frequency of the projection pattern, f is the focal length, and b is the baseline of the pro-cam system. Here, x represents the horizontal pixel position, and the projector coordinate p is obtained from the unwrapped phase as p = φ_S/ω_S.
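The spatial decoding and triangulation admit the same kind of sketch. The focal length (in projector-pixel units), baseline, and pattern period below are hypothetical example values, and the spatial phase is assumed to be already unwrapped at the triangulation step.

```python
import numpy as np

# Illustrative 4-step phase shifting, I_l = A*cos(phi_S - 2*pi*l/4) + O,
# followed by parallel-stereo triangulation d = f*b/delta with delta = x - p.
F_PX = 3500.0                 # focal length [projector pixels] (hypothetical)
BASELINE = 70.0               # baseline [mm] (hypothetical)
OMEGA_S = 2 * np.pi / 60.0    # spatial angular frequency [rad/pixel]

def decode_phase_shift(I0, I1, I2, I3):
    """Recover the wrapped spatial phase and amplitude from four shifts."""
    phi_s = np.arctan2(I1 - I3, I0 - I2) % (2 * np.pi)
    amp = 0.5 * np.hypot(I1 - I3, I0 - I2)
    return phi_s, amp

def triangulate(phi_s_unwrapped, x):
    """Projector coordinate p = phi_S/omega_S, disparity delta = x - p."""
    delta = x - phi_s_unwrapped / OMEGA_S
    return F_PX * BASELINE / delta

# A target seen at camera column x = 400 whose projector coordinate is p = 250.
x, p = 400.0, 250.0
phi_true = OMEGA_S * p                       # unwrapped spatial phase
I = [0.5 * np.cos(phi_true - 2 * np.pi * l / 4) + 0.7 for l in range(4)]
phi_wrapped, _ = decode_phase_shift(*I)      # equals phi_true mod 2*pi
print(triangulate(phi_true, x))              # ~1633.3 mm (= 3500*70/150)
```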

Phase ambiguity and depth resolution
A common problem in both the temporal and spatial methods is the 2π ambiguity: the phase wraps when the depth exceeds the maximum depth of interest. A naive approach is to use a low frequency to avoid the phase ambiguity; however, there is a tradeoff between the range of interest and the depth resolution. While the phase ambiguity does not appear at a lower frequency, the depth resolution becomes low, as shown in Fig. 2a. With a higher frequency, the depth resolution improves while the phase ambiguity becomes significant, and the depth cannot be uniquely recovered over a wide range of interest, as shown in Fig. 2b.
The phase ambiguity is usually relaxed by using multiple frequencies in either the temporal or the spatial domain. However, multiple captures are required, which sacrifices real-time capability, as shown in Fig. 2c. We propose a hybrid disambiguation approach that takes advantage of the different natures of temporal and spatial modulation.

Proposed method
We propose a hybrid method of temporal and spatial modulation as shown in Fig. 1c. The phase ambiguity can be resolved by using both temporal and spatial phases instead of using multiple frequencies in either domain.

Fig. 2 Intersections of the blue lines and the horizontal red bands (depicted as red circles) are the candidate depths, and the corresponding depth resolution is illustrated as vertical red bands. a, b While the resolution in phase is the same, the corresponding depth resolution varies depending on the frequency. With a higher frequency, better depth resolution is obtained; however, there is depth ambiguity. c Using multiple frequencies, the range of interest can be extended to the frequency of the greatest common divisor, and the depth resolution is determined by the highest frequency. d The bottom table summarizes the trade-off

Spatio-temporal phase disambiguation
Our key idea is that the depth candidates arising from the ambiguities of the temporal and spatial phases are distributed differently.
In the case of the temporal phase, the intervals of the depth candidates are constant along the depth because the depth is proportional to the phase, as shown in Eq. (5). On the other hand, the spatial phase is defined in the disparity domain. Because the depth is inversely proportional to the disparity (as shown in Eq. (9)), the intervals of the depth candidates increase along with the depth. Figure 3 shows the phase observations along with the scene depth. Multiple depth candidates correspond to a single phase. The depth candidates appear at equal intervals for the temporal phase, while the intervals for the spatial phase increase. This difference is the key feature of our method for resolving the phase ambiguity.
Depths that satisfy both the temporal and spatial phases seldom appear. The unwrapped phase is not restricted by the greatest common divisor, and the set of temporal and spatial phases is unique over a wider range of interest. The candidate depths can be respectively obtained from the following equations:

    d_T = c (φ_T + 2π n_T) / (2 ω_T),    d_S = f b / (x − (φ_S + 2π n_S)/ω_S),

where n_T and n_S are non-negative integers. An integer pair (n_T, n_S) that satisfies d_T = d_S seldom exists more than once. Therefore, the phase ambiguity problem can be resolved using phases of the two domains.

Fig. 3 Phase observations with the depth. While depth candidates of the temporal phase appear at the same intervals, those of the spatial pattern appear at increasing intervals. This difference is the cue to disambiguate the depth candidates. The unique depth candidate that satisfies both the temporal phase and the spatial phase can be obtained
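The candidate search implied by these equations can be sketched as follows. For brevity, the measured spatial phase is assumed to be already converted to a wrapped disparity, and all constants are illustrative example values.

```python
import numpy as np

# Illustrative candidate enumeration for the spatio-temporal disambiguation.
# Temporal candidates are equally spaced; spatial candidates (d = f*b/delta)
# spread out with depth, so at most one pair coincides. Example constants.
C = 299_792_458e3              # speed of light [mm/s]
OMEGA_T = 2 * np.pi * 60e6     # temporal angular frequency [rad/s]
PERIOD_S = 60.0                # disparity wrap period 2*pi/omega_S [pixels]
F_PX, BASELINE = 3500.0, 70.0  # hypothetical focal length [px], baseline [mm]

def disambiguate(phi_t, delta_w, d_roi=5000.0, tol=1.0):
    """Return the depth [mm] consistent with both measurements, or None."""
    step_t = np.pi * C / OMEGA_T                  # temporal candidate spacing
    d0 = C * phi_t / (2 * OMEGA_T)
    d_t = d0 + step_t * np.arange(max(int((d_roi - d0) / step_t) + 1, 1))
    deltas = delta_w + PERIOD_S * np.arange(200)  # spatial candidates
    d_s = F_PX * BASELINE / deltas[deltas > 1e-6]
    diff = np.abs(d_t[:, None] - d_s[None, :])    # all pairwise differences
    i, j = np.unravel_index(np.argmin(diff), diff.shape)
    return 0.5 * (d_t[i] + d_s[j]) if diff[i, j] < tol else None

# Simulate a target at 1234 mm: both measurements wrap, yet the pair is unique.
d_true = 1234.0
phi_t = (2 * OMEGA_T * d_true / C) % (2 * np.pi)
delta_w = (F_PX * BASELINE / d_true) % PERIOD_S
print(disambiguate(phi_t, delta_w))  # ~1234
```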

Phase recovery and depth estimation
Defining I_0 as the irradiance, the emitted signal from the projector with the k-th temporal shift and the l-th spatial shift, I(p, t, k, l), can be expressed as

    I(p, t, k, l) = I_0 (cos(ω_T t + 2πk/N_T) + 1)(cos(ω_S p + 2πl/N_S) + 1),

where t is time and p is the projector's pixel. The returned signal r(x, t, k, l) at the camera pixel x is represented as

    r(x, t, k, l) = κ(x) I_0 (cos(ω_T t − φ_T(x) + 2πk/N_T) + 1)(cos(φ_S(x) + 2πl/N_S) + 1) + o(x),

where κ is the reflectance of the target object, o(x) is the ambient light, φ_T(x) is the phase delay corresponding to the round-trip time, and φ_S(x) is the phase corresponding to the disparity (x − p). The intensity is the correlation with the reference signal g_{ω_T}(t) [35]:

    i_{k,l}(x) = ∫_0^T r(x, t, k, l) g_{ω_T}(t) dt,

where T is the exposure time. The temporal phase φ_T and the spatial phase φ_S are obtained from 8 observations with N_T = 4 and N_S = 4, applying the four-step estimators of Section 3 to the temporal shifts k (at a fixed spatial shift) and to the spatial shifts l (at a fixed temporal shift), respectively:

    φ_T(x) = arctan((i_{1,l} − i_{3,l}) / (i_{0,l} − i_{2,l})),    φ_S(x) = arctan((i_{k,1} − i_{k,3}) / (i_{k,0} − i_{k,2})).

Now we have two phases: the temporal phase φ_T and the spatial phase φ_S. Depth estimation from the two phases is similar to the unwrapping problem in both multi-frequency phase shifting and ToF, and it can be solved by searching a lookup table [4]. The observed phases should respectively equal the phases computed from the same depth; we write the phases computed from a candidate depth d as φ̂_T(d) and φ̂_S(d, x). A lookup table is built for each horizontal pixel position x of the camera because the spatial phase depends on the pixel position. The table T_x at the horizontal position x consists of the vectors

    T_x = {(φ̂_T(d), φ̂_S(d, x)) | d ∈ D},

where D is the set of candidate depths. For each pixel, the depth can be estimated by searching the lookup table as

    d̂(x) = argmin_{d∈D} ‖(φ_T(x), φ_S(x)) − (φ̂_T(d), φ̂_S(d, x))‖.

Efficient implementation In practice, building the lookup table for each horizontal pixel position is not necessary. Although the spatial phase and the corresponding depth depend on the camera pixel position, the disparity does not. The depths of all camera pixels can be obtained with only one lookup table, built from pairs of the temporal phase and the disparity, after converting the measured phase to the disparity. The disparity is obtained from the measured spatial phase φ_S and the pixel position x as

    δ(x) = x − φ_S(x)/ω_S,    d̃(x) = f b / δ(x),

where δ represents the (wrapped) disparity and d̃ is the wrapped depth. The table T consists of the vectors

    T = {(φ̂_T(d), δ̂(d)) | d ∈ D},

where δ̂ is the disparity computed from the candidate depth. For each pixel, the depth can be estimated by searching the lookup table as

    d̂(x) = argmin_{d∈D} ‖(φ_T(x), δ(x)) − (φ̂_T(d), δ̂(d))‖.
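As a concrete sketch of this single shared table, the following illustrative code builds the table over a depth grid and matches a measurement by a circular distance in both domains; the constants and grid spacing are example values, not the prototype's calibration.

```python
import numpy as np

# Illustrative single shared lookup table: for a grid of candidate depths,
# store the wrapped temporal phase and the wrapped disparity, then match a
# pixel's measurement by circular distances in both domains. Example values.
C = 299_792_458e3               # speed of light [mm/s]
OMEGA_T = 2 * np.pi * 60e6      # temporal angular frequency [rad/s]
PERIOD_S = 60.0                 # disparity wrap period [pixels]
F_PX, BASELINE = 3500.0, 70.0   # hypothetical calibration

depths = np.arange(100.0, 5000.0, 0.5)              # candidate depths [mm]
tbl_phi_t = (2 * OMEGA_T * depths / C) % (2 * np.pi)
tbl_delta = (F_PX * BASELINE / depths) % PERIOD_S   # wrapped disparity

def estimate_depth(phi_t, delta_w):
    """Nearest table entry under normalized circular distances."""
    dt = np.angle(np.exp(1j * (tbl_phi_t - phi_t)))              # (-pi, pi]
    dd = (tbl_delta - delta_w + PERIOD_S / 2) % PERIOD_S - PERIOD_S / 2
    cost = (dt / np.pi) ** 2 + (dd / (PERIOD_S / 2)) ** 2
    return depths[np.argmin(cost)]

# Simulate a target at 1234 mm and look it up.
d_true = 1234.0
phi_t = (2 * OMEGA_T * d_true / C) % (2 * np.pi)
delta_w = (F_PX * BASELINE / d_true) % PERIOD_S
print(estimate_depth(phi_t, delta_w))  # ~1234.0
```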

Analysis of the proposed method
Depth resolution The resolution of the proposed method is better than that of ToF in the near range and better than that of phase shifting in the far range.

Kushida et al. IPSJ Transactions on Computer Vision and Applications (2020) 12:1

The resolution of ordinary ToF and that of phase shifting are respectively represented as [6,25]

    Δd_T = c √B / (2 ω_T A),    Δd_S = d² √B / (f b ω_S A),

where A and B are the numbers of photo-electrons that the sensor can accumulate for the amplitude and the DC component, respectively. We suppose that A and B are parameters of the hardware and are independent of the scene. In reality, however, the returned light is influenced by light falloff; including this effect for a more accurate analysis is left for future work. Figure 4 shows the depth resolution of ToF and phase shifting along the depth according to Eqs. (25) and (26). The resolution of ToF is constant at any depth, while the resolution of phase shifting is proportional to the square of the depth. The proposed method achieves a resolution close to the better of the two, as shown in Fig. 4. The depth d_cross is defined as the depth where the resolution of ToF equals that of phase shifting. In the range nearer than d_cross, the resolution of our method is better than ToF and close to phase shifting. In the range farther than d_cross, the resolution of our method is better than phase shifting and close to ToF. The depth d_cross is given as

    d_cross = √(c f b ω_S / (2 ω_T)).

When we want to improve on the resolution of pure ToF, the maximum range of the system should be set shorter than d_cross.

Fig. 4 Depth resolution along the depth. According to Eqs. (25) and (26), the resolution of ToF is constant (blue) and the resolution of phase shifting is proportional to the square of the depth (orange). The depth d_cross is where the two resolution curves cross. The proposed method achieves a resolution close to phase shifting in the near range before d_cross and close to ToF in the far range after d_cross (green)
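The crossing depth can be checked numerically with the resolution models above. The hardware parameters below are illustrative, and setting A = B = 1 simply cancels the common photon-noise factor.

```python
import numpy as np

# Crossing point of the two resolution curves: the ToF resolution is constant
# in depth, the phase-shifting resolution grows with d**2, so they meet once.
C = 299_792_458e3              # speed of light [mm/s]
OMEGA_T = 2 * np.pi * 60e6     # temporal angular frequency [rad/s]
OMEGA_S = 2 * np.pi / 60.0     # spatial angular frequency [rad/pixel]
F_PX, BASELINE = 3500.0, 70.0  # hypothetical calibration

def res_tof(A, B):
    """ToF resolution: constant in depth."""
    return C * np.sqrt(B) / (2 * OMEGA_T * A)

def res_ps(d, A, B):
    """Phase-shifting resolution: quadratic in depth."""
    return d ** 2 * np.sqrt(B) / (F_PX * BASELINE * OMEGA_S * A)

# Equating the two and solving for d gives the closed form below.
d_cross = np.sqrt(C * F_PX * BASELINE * OMEGA_S / (2 * OMEGA_T))
print(d_cross)  # [mm]; res_tof == res_ps here for any common A, B
```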

Range of interest
The range of interest (ROI) of the proposed method is determined by the relative relation between the temporal and the spatial frequencies.
Nearest range When the spatial frequency is too high compared with the temporal frequency, the phase ambiguity problem cannot be resolved because multiple candidate depths exist within the resolution of the ToF, as shown in Fig. 5a. The spatial frequency on the scene varies with the depth because the projection is perspective: the shorter the distance, the higher the spatial frequency. This property gives the nearest bound of the ROI of the proposed method. The nearest bound d_min is where the spacing of the spatial depth candidates equals the resolution of the ToF at the given temporal and spatial frequencies:

    d_S|_{n_S} − d_S|_{n_S−1} = Δd_T,

where d_S|_{n_S} is the unwrapped depth and d_S|_{n_S−1} is the neighboring depth candidate from Eq. (11). Since the spacing of the spatial candidates near depth d is approximately 2π d²/(f b ω_S), substituting Eq. (17) and rearranging gives

    d_min = √(f b ω_S Δd_T / (2π)).

Farthest range When the spatial frequency is too low compared with the temporal frequency, the phase ambiguity problem cannot be resolved because multiple candidate depths exist within the resolution of the spatial phase shifting, as shown in Fig. 5b. Because the resolution of the spatial phase shifting degrades with the square of the depth, the farthest bound d_max is determined. The farthest bound d_max is where the spacing of the temporal depth candidates, πc/ω_T, equals the resolution of the phase shifting:

    d_T|_{n_T} − d_T|_{n_T−1} = Δd_S,

where d_T|_{n_T} is the unwrapped depth and d_T|_{n_T−1} is the neighboring depth candidate from Eq. (11). Substituting Eqs. (16) and (26) and rearranging gives

    d_max = √(π c f b ω_S A / (ω_T √B)).

Unrecoverable point There are a few unrecoverable depths in the proposed method. Figure 6 shows the pairs of temporal and spatial phases corresponding to the depth. The vertical axis is the temporal phase, and the horizontal axis is the spatial phase. The color of the curves represents the depth. The intersections of the curves are unrecoverable depths because different depths share the same phase pair. This is a limitation of the method; however, these points generally appear sparsely in the image and hence can be estimated by looking at neighboring pixels.

Fig. 5 Upper and lower bound of the ROI. Orange lines represent the candidate depths of spatial modulation; blue lines represent the candidate depths of temporal modulation. The width of the line shows the resolution. a If the depth is nearer than d_min, several candidate depths from spatial modulation (orange lines) exist within the resolution of temporal modulation (blue band). b On the other hand, if the depth is farther than d_max, several candidate depths from temporal modulation (blue lines) exist within the spatial resolution (orange band)
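The two ROI bounds can be computed numerically from the candidate spacings and the resolution models, as in the following illustrative sketch. AMP and DC play the roles of A and B; all values are examples, not measured calibration.

```python
import numpy as np

# Illustrative ROI bounds. Near bound: the spacing of spatial candidates
# (~2*pi*d**2/(f*b*omega_S)) shrinks to the ToF resolution. Far bound: the
# temporal spacing (pi*c/omega_T) exceeds the phase-shifting resolution.
C = 299_792_458e3              # speed of light [mm/s]
OMEGA_T = 2 * np.pi * 60e6     # [rad/s]
OMEGA_S = 2 * np.pi / 60.0     # [rad/pixel]
F_PX, BASELINE = 3500.0, 70.0  # hypothetical calibration
AMP, DC = 1000.0, 4000.0       # example photo-electron counts (A and B)

res_tof = C * np.sqrt(DC) / (2 * OMEGA_T * AMP)        # constant [mm]

def res_ps(d):
    """Phase-shifting resolution at depth d [mm]."""
    return d ** 2 * np.sqrt(DC) / (F_PX * BASELINE * OMEGA_S * AMP)

def spatial_spacing(d):
    """Gap between neighboring spatial depth candidates near depth d."""
    return 2 * np.pi * d ** 2 / (F_PX * BASELINE * OMEGA_S)

temporal_spacing = np.pi * C / OMEGA_T                 # constant gap [mm]

d_min = np.sqrt(res_tof * F_PX * BASELINE * OMEGA_S / (2 * np.pi))
d_max = np.sqrt(temporal_spacing * F_PX * BASELINE * OMEGA_S * AMP / np.sqrt(DC))
print(d_min, d_max)  # ROI bounds [mm] for these example parameters
```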
We confirmed via simulation that unrecoverable points seldom exist. We evaluated the percentage of unrecoverable pixels in an image using an indoor dataset [36]. Temporal and spatial phases were respectively rendered, and the depth image was estimated by our method from these phase images. The temporal frequency was set to 50 MHz, and the spatial frequency to 1/0.6 mm⁻¹. One hundred scenes were selected randomly from the dataset.
The results are shown in Fig. 7. The depths of some pixels cannot be recovered due to multiple candidates. The average ratio of unrecoverable pixels in each image is less than 5%. These points exist sparsely in the image; hence, it is possible to select the correct candidate by looking at the surrounding pixels.
Fig. 6 The transition of temporal and spatial phases with respect to the depth. The vertical axis represents the temporal phase and the horizontal axis represents the spatial phase. The color represents the depth. At the intersections of the curves, different depths share the same phase pair; these depths cannot be recovered uniquely

Fig. 7 Some results of the simulation. Black pixels cannot be recovered due to depth ambiguity. Unrecoverable pixels seldom exist in the image

Brightness of the pattern One may think that the temporal phase cannot be obtained if the spatial pattern is completely black. Because a spatial sinusoidal pattern is projected, every pixel has a chance to receive photons unless the pattern intensity is extremely low. A possible solution is to add a constant value to the spatial pattern so that no pixel is always black. In this case, the observation Eq. (14) is rewritten as

    r(x, t, k, l) = κ(x) I_0 (cos(ω_T t − φ_T(x) + 2πk/N_T) + 1)(A_S cos(φ_S(x) + 2πl/N_S) + O_S) + o(x),

where A_S and O_S (0 < O_S − A_S and O_S + A_S ≤ 1) are the amplitude and offset of the spatial modulation, respectively. Analogous to Eq. (14), both phases can be obtained by the same equations as Eq. (15) in the N_T, N_S = 4 case, so it is not necessary to increase the number of observations.
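A quick numerical check confirms that the DC offset does not disturb the four-step phase recovery, since the arctangent estimator cancels constant offsets. A_S, O_S, and the phase below are arbitrary example values.

```python
import numpy as np

# Adding a DC offset O_S to the spatial sinusoid leaves the 4-step recovery
# unchanged: the arctangent estimator cancels constant offsets. Example values
# satisfying 0 < O_S - A_S and O_S + A_S <= 1.
A_S, O_S = 0.3, 0.6
phi_true = 2.2                      # arbitrary spatial phase [rad]

obs = [A_S * np.cos(phi_true - 2 * np.pi * l / 4) + O_S for l in range(4)]
phi_est = np.arctan2(obs[1] - obs[3], obs[0] - obs[2]) % (2 * np.pi)
print(phi_est)  # ~2.2 (the offset drops out)
```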

Experiment
We demonstrated the effectiveness of our proposed method with real-world experiments.
Hardware prototype We developed a hardware prototype that can illuminate a scene with a spatio-temporally modulated pattern. Our prototype was built on a ToF camera (Texas Instruments OPT8241-CDK-EVM). The light source was replaced with a laser diode and a DMD system that can project the spatial pattern. The light source was an 830-nm laser diode (Hamamatsu Photonics L9277-42), and its emission was synchronized with the ToF sensor. The light emitted by the diode was collimated and expanded through lenses, and then reflected by a DMD device (Texas Instruments DLP6500) with 1920 × 1080 pixels. Finally, the spatio-temporal pattern was projected onto the scene through a projection lens, as shown in Fig. 8. First, the measurement system was calibrated in the standard way for pro-cam systems using a reference board [37]. The phase of the ToF at each pixel was then calibrated to share the same coordinates as the pro-cam system. For the phase calibration, a white planar board was captured while its position was moved. For each measurement of the board, a pair of the raw phase and the ground-truth depth was obtained, because the depth of the board was measured by ordinary phase shifting.
The parameters to recover the depth from the phase were calibrated by line fitting.
Result First, we measured a white planar board placed at approximately 350 mm from the camera and slightly slanted, as shown in Fig. 9a. The temporal frequency was 60 MHz, and the period of the spatial pattern was 60 pixels on the projection image. The baseline between the camera and the projector was approximately 70 mm, and the focal length of the projection lens was 35 mm. For comparison, depths were obtained by ordinary ToF with a single low frequency, phase shifting with a single high frequency, and our method. Figure 9b shows the estimated depth images. Both the ToF and our method recover the global depth. The depth image of phase shifting cannot be visualized because it has multiple depth candidates. The cross-section along the red line is shown in Fig. 9c. While the depth measured by the ordinary ToF is noisy and there are many depth candidates due to the phase ambiguity in phase shifting, our method recovers a smooth surface while resolving the phase ambiguity. The region near the edge is not correctly disambiguated because the resolution of the temporal measurement exceeds the interval of the phase shifting candidates; the ToF resolution near the edge is lower than expected because the illumination is very low there. Decreasing the spatial frequency might mitigate this.

Fig. 9 Results with a white planar board. Ordinary ToF, phase shifting (single high frequency), and our method are compared. a The object was placed at a slight slant. b The estimated depth images. Because the depth cannot be identified in phase shifting, its depth image cannot be visualized. c The cross-section along the red line is shown. While the ordinary ToF is noisy and phase shifting has many candidates, our method recovers a smooth and unique depth
Finally, we measured a plaster bust placed approximately 400 mm from the camera, as shown in Fig. 10a. The estimated depth images are shown in Fig. 10b, and the cross-section of the depth is shown in Fig. 10c. Our method recovers a unique and smooth depth.

Conclusion
We developed a depth sensing method that uses spatio-temporally modulated illumination. We showed that the phase ambiguities of the temporal and spatial modulations behave differently, so it is possible to resolve the ambiguities effectively while reducing the number of observations and preserving the depth resolution. Our proposed method inherits not only the strengths of the time-of-flight camera and of active stereo with a projector-camera system, but also their weaknesses. While the proposed method can achieve better resolution and a wider range of interest, it may suffer from occlusion, which sacrifices the ToF camera's potential. In practice, however, current ToF cameras are not co-axial setups either, and they do not suffer much from occlusion. If the spatio-temporal projector is configured in a micro-baseline setup similar to a ToF camera, the system does not suffer much from occlusion.
In this paper, the depth of the ToF measurement is defined as the distance between the camera and the target; on the other hand, the depth of the projector-camera system of phase shifting is defined as the distance between the center of the baseline and the target. In practice, this difference should be corrected in the implementation, although it does not affect our key idea. Indeed, this model mismatch is absorbed by the calibration step that builds the lookup table.
Our hardware prototype has some limitations. Because the DMD produces the sinusoidal pattern by flipping the mirrors on and off, it can introduce artifacts into the ToF measurement. We ignored this effect, but it should be addressed by controlling the DMD appropriately or by using a solid-state spatial light modulator. The quality of the spatio-temporally modulated illumination of our prototype is not very high. The temporal phase contains a systematic distortion, and the spatial resolution of the projector is currently limited to 64 pixels on the DMD, corresponding to 4 pixels on the camera, because the pattern is blurred. This might be due to the collimation and alignment accuracy of the optics or to diffraction on the DMD. The light source cannot emit a spatial pattern at or below the camera pixel's size, which diminishes the phase shifting. In future implementations, we will develop a better light source unit to improve the temporal phase measurements and to generate higher spatial resolutions.