I worked briefly with a company that manufactured surveying equipment back in the mid-2000s, and they had a similar setup (single sensor, though). They could get sub-millimeter precision as well, but they had to 'chirp' the light-power modulation from very long wavelengths (e.g. 1 km) down to very short ones (sub-millimeter) in order to remove any aliasing that would occur when the distance was a multiple of the measurement baseline.
If these are 90 degrees out of phase, I don't know how they eliminate that possibility without doing something similar (e.g. imagine the modulation is at 30 MHz and your measurement interval is 10 m: how would you differentiate between 20.034 m and 30.034 m?).
The Microsoft Kinect makes measurements at three modulation frequencies, both to get the higher distance precision of higher modulation frequencies and to extend the maximum distance, so that the phase wrapping (aliasing) can be solved.
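As a rough illustration of how multi-frequency phase unwrapping works (my own sketch, not Microsoft's actual algorithm, and using two frequencies rather than the Kinect's three): each modulation frequency constrains the distance to a set of candidates spaced by its own unambiguous range, and the distance the frequencies agree on is taken as the answer.

```python
import numpy as np

C = 3e8  # speed of light, m/s

def candidate_distances(phase_rad, freq_hz, max_range_m):
    """All distances consistent with one wrapped phase measurement."""
    unambiguous = C / (2 * freq_hz)                 # one-way unambiguous range
    base = (phase_rad / (2 * np.pi)) * unambiguous
    k = np.arange(int(max_range_m / unambiguous) + 1)
    return base + k * unambiguous

def unwrap_two_freq(phase1, f1, phase2, f2, max_range_m=25.0, tol=0.05):
    """Pick the distance both frequencies agree on (within tol metres)."""
    d1 = candidate_distances(phase1, f1, max_range_m)
    d2 = candidate_distances(phase2, f2, max_range_m)
    diffs = np.abs(d1[:, None] - d2[None, :])       # brute-force candidate match
    i, j = np.unravel_index(np.argmin(diffs), diffs.shape)
    if diffs[i, j] > tol:
        return None                                 # too noisy / inconsistent
    return 0.5 * (d1[i] + d2[j])

def wrapped_phase(distance_m, freq_hz):
    """Simulate the wrapped phase a sensor would report for a target."""
    unambiguous = C / (2 * freq_hz)
    return 2 * np.pi * (distance_m % unambiguous) / unambiguous

# A target at 20.034 m is ambiguous at 30 MHz alone (5 m interval), but adding
# a second frequency (here 36 MHz, ~4.17 m interval) resolves it.
true_d = 20.034
print(unwrap_two_freq(wrapped_phase(true_d, 30e6), 30e6,
                      wrapped_phase(true_d, 36e6), 36e6))   # ~20.034
```

The combined unambiguous range is the least common multiple of the individual intervals, which is why the frequencies are chosen so that it comfortably exceeds the intended operating range.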
I don't, unfortunately. I was there doing an information security assessment, and the guy I was working with was one of the product engineers. When I wrapped up my work we spent another hour or two just talking about their products and how they worked.
The article doesn't mention the main drawback of ToF depth sensing: Multipath Errors. They originate from light bouncing around in the scene before coming back to the detector, causing the resulting depth maps to have dents and distortions in the neighborhood of angled surfaces.
They are a big problem in built environments, which have lots of 90-degree angles that act as retroreflectors for the signal. To my knowledge none of the ToF sensor manufacturers (MS, Sony, PMD, Samsung, etc.) has solved this problem. If anyone has, please let me know, as the topic is of professional interest to me.
There has been some work on resolving multipath errors with indirect ToF. Chronoptics (I'm a cofounder) recently licensed our technology to Melexis for automotive applications.
There are practical mitigations for the problem, especially if you want to filter away points from these surfaces (and are OK with dropout regions in the depth image). Some of these sensors produce useful per-pixel confidence values which do a reasonable job identifying regions with multipath errors, and various types of spatial/temporal filters work so-so in handling small distortions. The K4A sensors are perhaps a bit overeager in their spatial filter, leading to slightly over-smoothed edges, though.
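For illustration, a minimal sketch of that kind of confidence-gated filtering (my own example in numpy, not any particular vendor's SDK; the depth and confidence arrays and the threshold are assumptions):

```python
import numpy as np
from scipy.ndimage import median_filter

def filter_depth(depth_m, confidence, conf_threshold=0.5, kernel=3):
    """Drop low-confidence pixels, then knock down small distortions.

    depth_m:    HxW depth in metres, 0 meaning invalid
    confidence: HxW per-pixel confidence (scale is sensor-specific)
    """
    out = depth_m.copy()
    out[confidence < conf_threshold] = 0.0   # accept dropout regions instead of bad data
    smoothed = median_filter(out, size=kernel)
    out[out > 0] = smoothed[out > 0]         # keep invalidated pixels invalid
    return out
```

A real pipeline would treat invalid pixels more carefully than the median filter above does (zeros bleed into their neighbours), and would usually add a temporal filter on top, but the structure is the same: gate on confidence first, smooth second.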
You can always try combining ToF sensors with other types, like stereo, and hope that the failure modes of the different types are mostly distinct.
The EpiScan3D and EpiToF cameras are probably the closest to "solving" reflective subjects, but they are basically one-off benchtop prototypes and nowhere near products.
The article is also a bit optimistic with regard to outdoor use under direct sunlight exposure. The sensors I tested in the past just didn't work at all.
There are two issues with sunlight and iToF. First, the sunlight photons saturate the pixel, and you get no useful measurement. Second, the dominant noise source of iToF is photon shot noise [1], so sunlight photons contribute heavily to noise. The mitigations are to increase your laser power, use better optical notch filters, decrease the sensor integration time, and do more image filtering. I'm a cofounder at Chronoptics and we've developed a sunlight-capable ToF camera: https://youtu.be/7vMI37S0w3Q
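To give a rough sense of why ambient photons hurt even before the pixel saturates, here's a back-of-the-envelope shot-noise model (my own simplification, counting photoelectrons over the integration time):

```python
import math

def itof_snr(signal_e, ambient_e, dark_e=0.0):
    """Approximate SNR in the shot-noise limit.

    Shot noise scales with the sqrt of the *total* collected charge,
    so ambient electrons add noise while contributing no signal.
    """
    total = signal_e + ambient_e + dark_e
    return signal_e / math.sqrt(total)

print(itof_snr(10_000, 1_000))     # indoors:  SNR ~ 95
print(itof_snr(10_000, 500_000))   # sunlight: SNR ~ 14, same laser signal
```

Which is exactly why the levers are the ones listed above: more laser power raises the signal count, while a tighter notch filter and a shorter integration time cut the ambient count.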
You can absolutely make ToF sensors work outdoors in direct sunlight, but you need to operate at different frequencies that require more expensive emitters and detectors to avoid being saturated by light from the sun.
It only matters if you need physically accurate data, i.e. if your brain can't process the multipath error and correct for it. A bat sees in multipath error and has no problem with it. I assume a machine vision system can likewise learn to perform with multipath error and indirectly account for it, i.e. it sees an apple with multipath error and still knows it's an apple.
There are options to fix multipath and recover the underlying ground truth:
1) You can do a reverse raytrace and iteratively correct for the error. This is somewhat expensive, but there are tricks and shortcuts to accelerate it.
2) A hardware fix to measure the multipath component separately and subtract or correct it. There are several ways to do this, and there are some patents on it that I've worked on. The same methods can also remove background signal from ambient light.
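As a toy illustration of option 2 (my own sketch, nothing to do with the patented methods; the amplitudes, distances, and modulation frequency are made up): an iToF pixel effectively measures a complex phasor that is the sum of the direct return and the multipath return, so once the multipath phasor is known it can be subtracted to recover the true phase and distance.

```python
import numpy as np

C = 3e8
F_MOD = 30e6                                    # assumed modulation frequency

def phasor(distance_m, amplitude):
    """Complex phasor of a return from the given one-way distance."""
    phase = 4 * np.pi * F_MOD * distance_m / C  # round-trip phase shift
    return amplitude * np.exp(1j * phase)

def phase_to_distance(p):
    return (np.angle(p) % (2 * np.pi)) * C / (4 * np.pi * F_MOD)

direct = phasor(2.00, 1.0)      # true surface at 2.00 m
bounce = phasor(2.75, 0.4)      # light that took a longer, bounced path
measured = direct + bounce      # what the pixel actually integrates

print(phase_to_distance(measured))            # ~2.20 m: biased by multipath
print(phase_to_distance(measured - bounce))   # 2.00 m once the bounce is removed
```

The hard part, of course, is measuring or estimating the bounce phasor in the first place; that is what the hardware approaches above are about.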
If you know that the surfaces in your data should be flat and the corners sharp, it may be possible to filter out the fly pixels pretty well in a post-processing step. Of course, if you can't make these assumptions, then the problem is ill-posed at post-processing time.
Presumably my cursory experience with this from half a decade ago is not news to you if you have professional interest in the topic, but maybe you can elaborate how this is not a feasible solution in your case?
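For reference, a minimal sketch of the kind of post-processing I mean (a common 'flying pixel' heuristic, not tied to any particular sensor SDK; the threshold is an assumption):

```python
import numpy as np

def remove_flying_pixels(depth_m, jump_threshold_m=0.10):
    """Invalidate pixels whose depth differs from all four neighbours by
    more than jump_threshold_m; such pixels usually sit on depth edges
    where the ToF pixel mixed foreground and background returns."""
    pad = np.pad(depth_m, 1, mode="edge")
    neighbours = np.stack([
        pad[:-2, 1:-1], pad[2:, 1:-1],   # up, down
        pad[1:-1, :-2], pad[1:-1, 2:],   # left, right
    ])
    min_diff = np.min(np.abs(neighbours - depth_m), axis=0)
    out = depth_m.copy()
    out[min_diff > jump_threshold_m] = 0.0   # mark as invalid
    return out
```

This only works because of the flat-surfaces/sharp-corners assumption above; where that assumption fails, the filter throws away legitimate geometry.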
I strongly feel that for computer vision to take the next step, it needs to work on a predictive basis, using directed sampling and disentangled semantic priors, etc.
That would mean that before percepts are updated and made available to applications (with knowledge like depth), a lot of information, not just the current frame, has already been integrated. Information such as previous frames, common shapes or surfaces, objects, the current full scene model, etc. That would make the system significantly more efficient and robust, and also enable things like true video understanding, including the 3D structure of scenes.
> a lot of information, not just the current frame, has already been integrated. Information such as previous frames, common shapes or surfaces, objects, the current full scene model, etc.
It's a different approach to the same problem. For starters, you don't assume that you can take one sample (or pair of samples) and just use that.
Because these depth readings are being fed to applications as if they were good representations. What I am saying is, for one thing, before you try to really use that type of information, you integrate more data and do a lot more work. And that's the perception level.
> Because these depth readings are being fed to applications as if they were good representations
No they aren't. Depth is always extremely noisy and needs to be managed and filtered in lots of different ways.
> And that's the perception level.
Depth isn't always looked at directly. This is about camera techniques. What you are talking about seems like some pet project ideas that are loosely connected to depth cameras.
True, they are actually kind of pet-project ideas at this point. But they are drawn from popular neuroscience and are similar to some existing approaches and early attempts. I believe that this type of predictive and generative system will be the paradigm for perception in the near future.
And of course it's not just the raw data being used at the application level in current real-world applications. But what is fed in doesn't take into account most of the type of information I am talking about.
How do these ToF systems deal with multiple sensors pointing at the same scene? I've seen it work with two ToF sensors, but haven't been able to find a good explanation for how it works.
There are three different methods.
1) TDMA (time-division multiple access): some sync system so that only one camera is emitting light at a given time. The Microsoft Azure Kinect uses this with the 3.5mm sync cable, which enables up to 3 cameras to illuminate at different times.
2) Frequency domain: iToF cameras use different modulation frequencies, so you can set different modulation frequencies and the signals won't interfere. The other cameras' photons will still contribute to photon shot noise. Alternatively, randomly change the modulation frequency by a small amount during the integration time, which some sensors support.
3) Randomly change the timing during the integration time; this is more common with pulsed ToF cameras. Analog Devices had an example of this at their booth at CES in 2020.
If the measurement time is short enough (<1 µs, capturing at 60 fps), the probability of two cameras overlapping is low even with a large number of cameras. Even then, some temporal filtering and intelligent time offsetting to separate the signals can usually fix the problem.
If the other emitter/sensor pairs are uncorrelated with our own, and if we integrate over enough transmit/receive cycles, then the other pairs will contribute approximately equally to both of our own two sensor phase detectors. The method described here uses the difference between the sensor phase detectors, so any competing interference should cancel out. It will (just) raise the background noise floor a bit.
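A quick numerical sanity check of that argument (my own toy model, assuming sinusoidal modulation and a simple correlation-style demodulator; the frequencies and amplitudes are made up): correlate the received signal against two 90-degree-offset references over many cycles, and the interferer at an uncorrelated frequency averages out while our own return keeps its phase.

```python
import numpy as np

F_MOD = 30e6                    # our modulation frequency
F_OTHER = 31.7e6                # another camera at an uncorrelated frequency
T_INT = 1e-3                    # integration time: tens of thousands of cycles
FS = 1e9                        # simulation sample rate

t = np.arange(0, T_INT, 1 / FS)
phase_delay = 1.2               # round-trip phase of our own return

own = np.cos(2 * np.pi * F_MOD * t - phase_delay)      # our reflected signal
other = 0.8 * np.cos(2 * np.pi * F_OTHER * t + 0.3)    # interfering camera
received = own + other

# Two phase detectors (0 and 90 degree references), integrated over T_INT
i_bucket = np.mean(received * np.cos(2 * np.pi * F_MOD * t))
q_bucket = np.mean(received * np.sin(2 * np.pi * F_MOD * t))

print(np.arctan2(q_bucket, i_bucket))   # ~1.2 rad: interference has averaged out
```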
LIDARs are also ToF sensing systems, but they use a rotating sensor or some other motion to capture many points across a field of view. Imagine taking the sensor in the Microsoft article and having several of them on something that scans back and forth across a field of view.
Some LIDARs do use phase shift as described. Others use a pulse of light and measure its time of flight directly, while still others use frequency modulation. Phase shift is just one way to measure distance using light.
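For concreteness, the two simplest cases side by side (my own sketch; FMCW is more involved and omitted):

```python
import math

C = 299_792_458.0  # speed of light, m/s

def distance_from_pulse(round_trip_s):
    """Direct/pulsed ToF: time a short pulse and halve the round trip."""
    return C * round_trip_s / 2

def distance_from_phase(phase_rad, f_mod_hz):
    """Phase-shift ToF: distance within one unambiguous interval."""
    return C * phase_rad / (4 * math.pi * f_mod_hz)

print(distance_from_pulse(66.7e-9))         # ~10 m
print(distance_from_phase(math.pi, 30e6))   # ~2.5 m (half of the 5 m interval)
```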
I'm guessing that, like Sony, it's hard to support such chips because of the underlying complexities. Melexis has open datasheets, and you can purchase individual sensors off Mouser.