Understanding Indirect Time-of-Flight Depth Sensing (microsoft.com)
87 points by giuliomagnifico on April 20, 2021 | 39 comments


I worked briefly with a company that manufactured surveying equipment back in the mid-2000s, and they had a similar setup (single sensor, though). They could get sub-millimeter precision as well, but they had to 'chirp' the light power modulation from very long (e.g. 1 km) to short (sub-millimeter) wavelengths in order to remove any aliasing that would occur when the distance was a multiple of the measurement baseline.

If these are 90 degrees out of phase, I don't know how they eliminate that possibility without doing something similar (e.g. imagine the modulation is at 30 MHz and your measurement interval is 10 m: how would you differentiate between 20.034 m and 30.034 m?).
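A quick numerical sketch of that ambiguity, using the numbers from the example (with c rounded to 3e8 m/s so the wrap count comes out even): at a single modulation frequency, any two distances separated by a multiple of the unambiguous range produce the same wrapped phase.

    import numpy as np

    C = 3e8       # speed of light in m/s, rounded for the example
    F_MOD = 30e6  # modulation frequency from the example above

    def wrapped_phase(distance_m):
        # Round-trip delay is 2*d/c, so the measured phase is 4*pi*f*d/c,
        # wrapped into [0, 2*pi). Anything beyond c/(2*f) aliases.
        return (4 * np.pi * F_MOD * distance_m / C) % (2 * np.pi)

    print(C / (2 * F_MOD))  # unambiguous range: 5.0 m at 30 MHz
    for d in (20.034, 30.034):
        phi = wrapped_phase(d)
        apparent = phi * C / (4 * np.pi * F_MOD)
        print(d, phi, apparent)  # both report an apparent distance of 0.034 m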


The Microsoft Kinect takes measurements at three modulation frequencies, which gives it both the higher distance precision of high modulation frequencies and an extended maximum distance, because the phase wrapping (aliasing) can be solved.

Based on some Microsoft patent filings and other research papers, I wrote a short article on possible phase unwrapping methods for iToF sensors: https://medium.com/chronoptics-time-of-flight/phase-wrapping...
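For anyone who just wants the flavour: a minimal sketch of one standard unwrapping approach, a brute-force search over integer wrap counts at two modulation frequencies. The frequencies here (80 and 100 MHz) are illustrative, not the Kinect's actual values, and a real implementation has to tolerate noise rather than demanding exact agreement.

    import numpy as np

    C = 3e8
    FREQS = np.array([80e6, 100e6])  # illustrative, not the real Kinect frequencies

    def measure(d):
        # Wrapped phase at each modulation frequency for a true distance d.
        return (4 * np.pi * FREQS * d / C) % (2 * np.pi)

    def unwrap(phases, max_range=7.0):
        # Try every combination of integer wrap counts and keep the one
        # where the per-frequency distance estimates agree best.
        max_wraps = np.ceil(2 * FREQS * max_range / C).astype(int)
        best_d, best_err = None, np.inf
        for n0 in range(max_wraps[0] + 1):
            for n1 in range(max_wraps[1] + 1):
                wraps = np.array([n0, n1])
                d = (phases + 2 * np.pi * wraps) * C / (4 * np.pi * FREQS)
                err = abs(d[0] - d[1])
                if err < best_err:
                    best_d, best_err = d.mean(), err
        return best_d

    # 6.2 m is far beyond the 1.5 m / 1.875 m single-frequency ranges,
    # but the pair of wrapped phases pins it down uniquely.
    print(unwrap(measure(6.2)))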


Fantastic article! They definitely need your animations; they make it way more intuitive.


Thanks, got any references to how the surveying equipment de-aliasing method worked?


I don’t, unfortunately. I was there doing an information security assessment, and the guy I was working with was one of the product engineers. When I wrapped up my work, we spent another hour or two just talking about their products and how they worked.


The Kinect does do the "chirp", but this detail is omitted from TFA.


The article doesn't mention the main drawback of ToF depth sensing: multipath errors. They originate from light bouncing around in the scene before coming back to the detector, causing the resulting depth maps to have dents and distortions in the neighborhood of angled surfaces.

They are a big problem in built environments, which have lots of 90-degree angles that act as retroreflectors for the signal. To my knowledge, none of the ToF sensor manufacturers (MS, Sony, PMD, Samsung, etc.) has solved this problem. If anyone has, please let me know, as the topic is of professional interest to me.
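A toy illustration of why this happens: the pixel accumulates the direct return and any delayed bounces as a single complex phasor, so the reported phase (and hence depth) is pulled away from the true surface. The path lengths and amplitudes below are made up.

    import numpy as np

    C, F = 3e8, 30e6  # illustrative modulation frequency

    def phasor(distance_m, amplitude):
        # Complex correlation phasor contributed by one return path.
        return amplitude * np.exp(1j * 4 * np.pi * F * distance_m / C)

    direct = phasor(2.00, 1.00)   # true surface at 2.00 m
    bounce = phasor(2.60, 0.35)   # weaker light that detoured via a nearby wall

    measured = direct + bounce    # the pixel only ever sees the sum
    phase = np.angle(measured) % (2 * np.pi)
    print(phase * C / (4 * np.pi * F))  # ~2.15 m reported for a 2.00 m surface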


There has been some work on resolving multipath errors with indirect ToF. Chronoptics (I'm a cofounder) recently licensed our technology to Melexis for automotive applications.

Here's a blog post I wrote about resolving multipath: https://medium.com/chronoptics-time-of-flight/multipath-inte...

And here's the announcement from Melexis: https://www.melexis.com/en/news/2021/4mar2021-melexis-announ...


Could this be used with SAR as well?


Maybe; it would be very interesting to investigate.


There are practical mitigations for the problem, especially if you want to filter away points from these surfaces (and are OK with dropout regions in the depth image). Some of these sensors produce useful per-pixel confidence values which do a reasonable job identifying regions with multipath errors, and various types of spatial/temporal filters work so-so in handling small distortions. The K4A sensors are perhaps a bit overeager in their spatial filter, leading to slightly over-smoothed edges, though.
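For reference, the kind of post-filter being described is not much more than this (the thresholds and the confidence units are sensor-specific and made up here; a real pipeline would be more careful at image borders and with already-invalid pixels):

    import numpy as np

    def filter_depth(depth, confidence, conf_thresh=50.0, edge_thresh_m=0.10):
        # depth:      HxW depth map in metres (0 = invalid)
        # confidence: HxW per-pixel confidence / IR amplitude from the sensor
        out = depth.copy()
        out[confidence < conf_thresh] = 0.0  # drop low-confidence pixels

        # Drop "flying" pixels that disagree with all four neighbours,
        # which is typical of multipath-distorted depth edges.
        diffs = [np.abs(depth - np.roll(depth, s, axis=a))
                 for a in (0, 1) for s in (1, -1)]
        isolated = np.minimum.reduce(diffs) > edge_thresh_m
        out[isolated] = 0.0
        return out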

You can always try combining ToF sensors with other types, like stereo, and hope that the failure modes of the different types are mostly distinct.

The EpiScan3D and EpiToF cameras are probably the closest to "solving" reflective subjects, but they are basically one-off benchtop prototypes and nowhere near products.


The article is also a bit optimistic with regard to outdoor use under direct sunlight exposure. The sensors I tested in the past just didn't work at all.


There are two issues with sunlight and iToF. First, sunlight photons saturate the pixel, and you get no useful measurement. Second, the dominant noise source in iToF is photon shot noise [1], so sunlight photons contribute heavily to the noise. The mitigations are to increase your laser power, use better optical notch filters, decrease the sensor integration time, and do more image filtering. I'm a cofounder at Chronoptics, and we've developed a sunlight-capable ToF camera: https://youtu.be/7vMI37S0w3Q

[1] https://en.wikipedia.org/wiki/Shot_noise
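A rough back-of-envelope version of that trade-off (the square-root scaling is the standard shot-noise result; the modulation contrast and electron counts below are invented, and real sensors add read noise, ADC noise, etc.):

    import numpy as np

    C = 3e8

    def depth_noise_m(signal_e, ambient_e, f_mod=100e6, contrast=0.7):
        # Shot-noise-limited depth standard deviation for a 4-phase iToF pixel:
        # phase noise ~ sqrt(total electrons) / (contrast * signal electrons).
        sigma_phase = np.sqrt(signal_e + ambient_e) / (contrast * signal_e)
        return C / (4 * np.pi * f_mod) * sigma_phase

    print(depth_noise_m(signal_e=2_000, ambient_e=0))        # ~8 mm indoors
    print(depth_noise_m(signal_e=2_000, ambient_e=200_000))  # ~77 mm in full sun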


You can absolutely make ToF sensors work outdoors in direct sunlight, but you need to operate at different frequencies that require more expensive emitters and detectors to avoid being saturated by light from the sun.


Which sensor was that, stefan_? Maybe I worked on it.


It only matters if you need physically accurate data, i.e. if your brain can't process the multipath error and correct for it. A bat sees in multipath error and has no problem with it. I assume a machine vision system can learn to perform with multipath error and indirectly account for it, i.e. it sees an apple with multipath error and still knows it's an apple.

There are options to fix multipath and recover the underlying ground truth:

1) You can do a reverse raytrace and iteratively correct for the error. This is somewhat expensive, but there are tricks and shortcuts to accelerate it.

2) A hardware fix to measure the multipath component separately and subtract or correct it. There are several ways to do this, and there are some patents on it that I've worked on. The same methods can also remove background signal from ambient light.
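A heavily simplified sketch of what option 2 amounts to once you have a separate estimate of the multipath component: the iToF measurement is a complex phasor, so the correction is just a subtraction before taking the phase. How the hardware obtains that estimate is the patented part and is hand-waved here; all numbers are made up.

    import numpy as np

    C, F = 3e8, 30e6

    def phasor(d, a):
        return a * np.exp(1j * 4 * np.pi * F * d / C)

    def to_depth(p):
        return (np.angle(p) % (2 * np.pi)) * C / (4 * np.pi * F)

    direct    = phasor(2.00, 1.00)  # wanted return from the true surface
    multipath = phasor(2.60, 0.35)  # unwanted bounce
    raw       = direct + multipath  # what an ordinary pixel measures

    # Pretend the hardware trick gave us an estimate of the bounce:
    estimated_multipath = phasor(2.60, 0.35)

    print(to_depth(raw))                        # biased, ~2.15 m
    print(to_depth(raw - estimated_multipath))  # corrected, 2.00 m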


If you know that the surfaces in your data should be flat and the corners sharp, it may be possible to filter out the flying pixels pretty well in a post-processing step. Of course, if you can't make these assumptions, then the problem is ill-posed at post-processing time.

Presumably my cursory experience with this from half a decade ago is not news to you if you have professional interest in the topic, but maybe you can elaborate how this is not a feasible solution in your case?


Would it help to use multiple sensors at different positions and use the common points to filter out the 2nd+ bounces?

I think this would work for, say, a mirror, but what about something like brushed metal?

Are those other bounces scattered enough that multiple perspectives still produce an error?


I strongly feel that for computer vision to take the next step, it needs to work on a predictive basis, using directed sampling and disentangled semantic priors, etc.

That would mean that before percepts are updated and available for applications (with knowledge like depth), a lot of information, not just the current frame, has already been integrated. Information such as previous frames, common shapes or surfaces, objects, the current full scene model, etc. That would make the system significantly more efficient and robust, and also enable things like true video understanding, including the 3D structure of scenes.

Definitely easier said than done of course.


> a lot of information, not just the current frame, has already been integrated. Information such as previous frames, common shapes or surfaces, objects, the current full scene model, etc.

What's the current state of the art on this?


Good question. I have seen some interesting papers and systems. MIT's Kimera Semantics is interesting.


What does this have to do with a microsoft page on indirect time of flight?


It's a different approach to the same problem. For starters, you don't assume that you can take one sample (or pair of samples) and just use that.

Because these depth readings are being fed to applications as if they were good representations. What I am saying is, for one thing, before you try to really use that type of information, you integrate more data and do a lot more work. And that's the perception level.


> Because these depth readings are being fed to applications as if they were good representations

No they aren't. Depth is always extremely noisy and needs to be managed and filtered in lots of different ways.

> And that's the perception level.

Depth isn't always looked at directly. This is about camera techniques. What you are talking about seems like some pet project ideas that are loosely connected to depth cameras.


True, they are actually kind of pet project ideas now. But they are drawn from popular neuroscience and similar to some existing approaches or beginnings. I believe that this type of predictive and generative system will be the paradigm for perception in the near future.

And of course it's not just the raw data being used at the application level in current real-world applications. But what is fed in doesn't take into account most of the type of information I am talking about.


How do these ToF systems deal with multiple sensors pointing at the same scene? I've seen it work with two ToF sensors, but haven't been able to find a good explanation for how it works.


There are three different methods. 1) TDMA (time-division multiplexing): some sync system so that only one camera is emitting light at a given time. The Microsoft Azure Kinect uses this with its 3.5mm sync cable, which enables up to three cameras to illuminate at different times.

2) Frequency domain: iToF cameras can be set to different modulation frequencies, and the signals won't interfere (the other camera's photons will still contribute to photon shot noise). Alternatively, randomly change the modulation frequency by a small amount during the integration time, which some sensors support.

3) Randomly change the timing during the integration time. This is more common with pulsed ToF cameras; Analog Devices had an example of this in their booth at CES in 2020.
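A small simulation of option 2, showing why a second camera at a different modulation frequency contributes almost nothing to the correlation (it still adds shot noise, as noted above). The frequencies and integration time are illustrative:

    import numpy as np

    f_own, f_other = 80e6, 100e6   # illustrative modulation frequencies
    t_int = 100e-6                 # 100 us integration time
    t = np.linspace(0.0, t_int, 2_000_001)

    reference   = np.cos(2 * np.pi * f_own * t)           # pixel demodulation signal
    own_light   = np.cos(2 * np.pi * f_own * t - 1.0)     # our own return, delayed
    other_light = np.cos(2 * np.pi * f_other * t + 0.3)   # the other camera's light

    correlate = lambda sig: np.mean(reference * sig)
    print(correlate(own_light))    # ~0.5*cos(1.0): a real, phase-dependent signal
    print(correlate(other_light))  # ~0: beats at 20 MHz and averages away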


Time division multiplex the ToF signal.

Azure Kinect uses a 3.5mm jack to sync this between sensors.


If the measurement time is short enough (<1 us while capturing at 60 fps), the probability of interference is low even with a large number of cameras. Even then, some temporal filtering and intelligent time offsetting to separate signals can usually fix the problem.


If the other emitter/sensor pairs are uncorrelated with our own, and if we integrate over enough transmit/receive cycles, then the other pairs will contribute approximately equally to both of our own two sensor phase detectors. The method described here uses the difference between the sensor phase detectors, so any competing interference should cancel out. It will (just) raise the background noise floor a bit.


You can solve this using polarized light. Of course, the sensors have to be calibrated relative to each other.


Do LIDARs work under the same phase shift based ToF sensing?


LIDARs are ToF sensing systems, but they use a rotating sensor or some other motion to capture many points across a field of view. Imagine taking the sensor in the Microsoft article and having several of them on something that scans back and forth across a field of view.

Some LIDARs do use phase shift as described. Others send a pulse of light and measure the round-trip time directly, while others use frequency modulation. Phase shift is just one way to measure distance using light.
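The two measurement principles mentioned reduce to very little code; the numbers below are just examples:

    import math

    C = 299_792_458.0  # m/s

    def pulsed_distance_m(round_trip_s):
        # Direct ToF: time a short pulse there and back.
        return C * round_trip_s / 2

    def phase_distance_m(phase_rad, f_mod):
        # Indirect ToF: infer the delay from the phase shift of a modulated wave
        # (only unambiguous up to c / (2 * f_mod)).
        return C * phase_rad / (4 * math.pi * f_mod)

    print(pulsed_distance_m(66.7e-9))     # ~10 m from a ~67 ns round trip
    print(phase_distance_m(2.513, 30e6))  # ~2 m from a 2.513 rad shift at 30 MHz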


Are there any higher-than-VGA-resolution ToF sensors? The Azure Kinect is not cutting it for us.

We'd like to do real time, full 360 depth capture of actors, but all of the sensors we've found are poorly suited for this task.

We might have to rig up our own optical system and use FPGAs to run them...


Microsoft licensed their sensor to Analog Devices, and you can buy the megapixel iToF sensor and build an iToF camera: https://www.analog.com/en/products/ADSD3100.html?icid=ToF-ad...

Or build a camera with multiple sensors capturing different FoVs.

Samsung has published papers on a 1.2-megapixel ToF sensor, but currently only has a VGA sensor available. https://ieeexplore.ieee.org/abstract/document/9365854 https://www.samsung.com/semiconductor/minisite/isocell/visio...


How are Analog Devices selling these chips? Like Sony, with heavy NDAs? Or will these be regular catalogue parts in the future?


I'm guessing it's like Sony; it's hard to support such chips because of the underlying complexities. Melexis has open datasheets, and you can purchase individual sensors off Mouser.


Interesting! I was completely unaware of these. Thanks for the info.

I'm glad to see we're steadily climbing in terms of sensor resolution. We're not anywhere close to 4K 120Hz sampling, but the field is progressing.

There's nothing stopping us from doing offline processing for our workflow, I suppose.


At those resolutions and frame rates multiview stereo is probably your best technology for the moment.



