Improving Sensor Fusion for Neuromorphic Vision


Event cameras—neuromorphic sensors that output asynchronous spikes analogous to those that go from the retina to the brain—may be approaching a point where they stop being considered quite so “exotic.” EE Times has covered them before, especially in articles by those on the front line of the technology, such as Tobi Delbruck and Jean-Luc Jaffard. However, in September, the IEEE International Conference on Multisensor Fusion and Integration (MFI 2022) included its first-ever workshop on Event Sensor Fusion. The work presented provides a helpful snapshot of what is now possible using the technology, the success of the field so far, the groups crowding in to exploit perceived opportunities, and the challenges that remain.

To appreciate these sensors, it’s important to remember how conceptually different a camera that produces events is compared to one that produces frames.

If you look at Video 1, you’ll see a visual explanation of the difference from Davide Scaramuzza at the Institute of Neuroinformatics (INI) in Zurich. Not only are events a compressed way of transmitting a dynamic scene—no information is transmitted unless part of the scene is changing—but they also have other advantages. Because they measure intensity change, not intensity, what’s going on in dark parts of the image is not affected by brightness elsewhere. This gives a much broader dynamic range than a conventional camera, allowing event-based imagers to capture detail in deep shadow and bright sunlight at once.

Event cameras capture data only where the intensity is changing, not where it is static. (Source: Davide Scaramuzza, Institute of Neuroinformatics, Zurich)
Because events are generated asynchronously (they happen when they happen; there is no clock), they also offer low latency and little motion blur. Except in extreme cases, there is no frame rate to limit the time resolution of the camera, which can be on the order of microseconds. On the other hand, of course, because information is available only for parts of the image that are moving, the event camera sees absolutely nothing in a static scene.
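The behavior just described is commonly modeled as each pixel firing an event whenever its log-intensity has changed by more than a contrast threshold since the pixel's last event. Here is a minimal sketch of that model; the threshold value `C` and the `EventPixel` class are illustrative assumptions, not taken from any camera's SDK.

```python
import math

C = 0.2  # contrast threshold in log-intensity units (assumed value)

class EventPixel:
    """Toy model of a single event-camera pixel."""

    def __init__(self, initial_intensity):
        self.ref = math.log(initial_intensity)  # log-intensity at last event

    def update(self, intensity, timestamp):
        """Return a list of (timestamp, polarity) events for a new sample."""
        events = []
        log_i = math.log(intensity)
        # Emit events until the residual change drops below the threshold.
        while abs(log_i - self.ref) >= C:
            polarity = 1 if log_i > self.ref else -1
            self.ref += polarity * C
            events.append((timestamp, polarity))
        return events

pixel = EventPixel(100.0)
assert pixel.update(100.0, 0.0) == []  # static scene: no events at all
brightening = pixel.update(200.0, 1.0)  # log(2) of change, crossed in steps of C
```

Note how a static scene produces nothing, while a doubling of intensity produces the same burst of positive events regardless of whether the pixel started dark or bright, which is the source of the wide dynamic range described above.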

Two cameras are better than one

This is not necessarily a huge disadvantage. The Astrosite project uses events to map the sky as objects move across the field of view of a telescope. This takes advantage of the relatively small amount of movement (stars and planets) versus the large amount of black unchanging sky to minimize bandwidth and keep power low. However, in many situations, capturing a conventional image is useful as well.

In 2014, Delbruck and his colleagues in the Sensor Group at INI Zurich showed that they had an extremely elegant way around this. They invented a new kind of camera (known as a Davis camera) that had the neuromorphic advantage of very low power but captured both frames and events at the same time with the same circuitry. In Video 2, you can see a lovely example of what that combination buys you: safety and peace of mind. The details of the environment (the things that are not changing and so invisible to events) are filled in by the frames. The fast-changing areas of interest in a scene (like a person running out into the road) are made sharply visible by those extremely fast event signals. This produces the best of both worlds.

The output of event-based cameras can be integrated to provide intensity images (like conventional frames), but researchers recognized that using the complementarity of the two types of image provides the best of both worlds. (Source: Cedric Scheerlinck, while at the Australian National University and the Australian Center for Robotic Vision)
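The integration mentioned in the caption can be sketched very roughly: treat each event as a signed step of one contrast threshold in log-intensity and accumulate. Scheerlinck's published approach is a continuous-time complementary filter that also blends in frames; this toy version, with an assumed threshold `C` and synthetic events, omits all of that filtering.

```python
import numpy as np

C = 0.2                       # assumed contrast threshold (log-intensity units)
log_image = np.zeros((4, 4))  # log-intensity change relative to the start

# (row, col, polarity) events; synthetic data for illustration only
events = [(1, 2, +1), (1, 2, +1), (3, 0, -1)]
for r, c, p in events:
    log_image[r, c] += p * C  # each event is one signed threshold step

intensity = np.exp(log_image)  # relative intensity image, 1.0 where nothing changed
```

Even this crude accumulation shows why the two signals are complementary: the events recover only *changes*, so the frames are still needed to supply the absolute intensities of the static background.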
The paradigm has proven so successful that one Davis camera (made by INI–Zurich spinoff company iniVation) was launched on the M2 satellite and another is currently on the International Space Station.

The problem of sensor fusion

The Davis camera and others like it—several companies now sell variations on these—have enabled a flurry of interesting work and applications, but they also highlighted a problem. Events are great: They’re cheap and low-power and biologically inspired. The problem is that the signal-processing techniques you use with events are very different from the ones you use with conventional video frames.

In a recent comprehensive review, and in the MFI Event Sensor Fusion workshop, this problem is dissected in great detail. Without going into too much of that here, there are ways that you can integrate, warp, or otherwise make the square pegs of event streams fit into the round holes of conventional image-processing techniques.
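One of the simplest of those adaptations can be sketched as follows: bin a short time slice of the event stream into a fixed-size 2D histogram (an “event frame”) that conventional image-processing or CNN pipelines can consume. The array layout, field order, and values here are illustrative assumptions.

```python
import numpy as np

H, W = 4, 6  # sensor resolution for this sketch

# Event stream as rows of (t, x, y, polarity); synthetic data for illustration
events = np.array([
    (0.001, 2, 1, +1),
    (0.002, 2, 1, +1),
    (0.004, 5, 3, -1),
])

t0, t1 = 0.0, 0.005  # time window to accumulate into one "frame"
frame = np.zeros((H, W))
in_window = (events[:, 0] >= t0) & (events[:, 0] < t1)
for t, x, y, p in events[in_window]:
    frame[int(y), int(x)] += p  # signed accumulation; use abs(p) for raw counts
```

The cost of this trick is exactly the one the workshop dissected: the binning throws away the microsecond timing that made the events valuable in the first place.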

Here’s one concrete example—solutions exist, but it illustrates the difficulty. How do you do stereo vision with event cameras when, at any instant, there may be so few active pixels that the correspondence problem cannot be solved, because there are not enough matching features in the left and right images?

Many have come up with mathematically elegant solutions, but during the workshop, Delbruck repeatedly pointed out a problem that is not always immediately obvious in watching a presentation or reading a paper.

To paraphrase: there’s no point in having a highly energy-efficient sensor if the meaning can only be extracted by a power-hungry processor. This gets us back to the problem of benchmarking. Essentially, Delbruck says that researchers show lots of nice results at influential events such as the Computer Vision and Pattern Recognition conference (CVPR), but it hasn’t (yet) become standard to present any kind of power budget.

In the next column, I’ll look at some new efforts in event-based cameras that could provide important new opportunities for genuinely low-powered approaches—both in terms of new paths for research and potentially important commercial projects.
