EV-LayerSegNet: Self-supervised Motion Segmentation using Event Cameras

Advanced Research & Innovation Center (ARIC), Micro Air Vehicle Lab (MAVLab)
Khalifa University, Delft University of Technology

Abstract

Event cameras are novel bio-inspired sensors that capture motion dynamics with much higher temporal resolution than traditional cameras, since their pixels react asynchronously to brightness changes. They are therefore better suited for tasks involving motion, such as motion segmentation. However, training event-based networks remains challenging, as obtaining ground truth is expensive, error-prone, and limited in frequency. In this article, we introduce EV-LayerSegNet, a self-supervised CNN for event-based motion segmentation. Inspired by a layered representation of scene dynamics, we show that it is possible to learn affine optical flow and segmentation masks separately and use them to deblur the input events. The deblurring quality is then measured and used as the self-supervised learning loss. We train and test the network on a simulated dataset containing only affine motion, achieving IoU and detection rate of up to 71% and 87%, respectively.
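
The abstract scores the estimated flow and masks by how well they deblur the input events. A common instantiation of this idea is contrast maximization (Gallego et al.): warp each event to a reference time with the estimated flow, accumulate the warped events into an image, and reward sharper images. The sketch below shows this generic variant, not necessarily the authors' exact loss; the function name and signature are hypothetical.

```python
import torch

def contrast_loss(xs, ys, ts, flow, height, width, t_ref=0.0):
    """xs, ys, ts: (N,) event coordinates/timestamps; flow: (N, 2) per-event flow."""
    # Warp each event back to the reference time along its flow vector
    dt = ts - t_ref
    wx = (xs - flow[:, 0] * dt).round().long().clamp(0, width - 1)
    wy = (ys - flow[:, 1] * dt).round().long().clamp(0, height - 1)
    # Accumulate warped events into an image of warped events (IWE)
    iwe = torch.zeros(height * width)
    iwe.index_add_(0, wy * width + wx, torch.ones_like(ts))
    # A well-deblurred IWE is sharp, i.e. has high variance; negate to minimize
    return -iwe.var()
```

Note that the rounding above is non-differentiable; a trainable version would instead splat each warped event bilinearly over its four neighbouring pixels.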

Network Implementation

Our network, EV-LayerSegNet, follows an encoder-decoder architecture and is inspired by EV-FlowNet and LayerSegNet. Events are downsampled by 4 encoder layers and passed through 2 residual blocks. We then stack the outputs of the residual blocks and pass them to the segmentation module and the optical flow module, as sketched below.
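
The following minimal PyTorch sketch shows this layout. The layer counts follow the text; the channel sizes, kernel sizes, the number of input channels of the event representation, and the interpretation of "stack" as channel-wise concatenation are all assumptions, and the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(x + self.conv2(self.act(self.conv1(x))))

class EVLayerSegNet(nn.Module):
    def __init__(self, in_ch=4, base=32):
        super().__init__()
        enc, prev = [], in_ch
        for c in (base, base * 2, base * 4, base * 8):  # 4 downsampling encoder layers
            enc.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=2, padding=1),
                nn.LeakyReLU(0.1),
            ))
            prev = c
        self.encoders = nn.ModuleList(enc)
        self.res1 = ResidualBlock(prev)
        self.res2 = ResidualBlock(prev)

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)  # kept for the segmentation decoder's skip connections
        r1 = self.res1(x)
        r2 = self.res2(r1)
        # "Stack" interpreted as channel-wise concatenation (an assumption)
        feat = torch.cat([r1, r2], dim=1)
        return feat, skips  # feat feeds both the flow and segmentation modules
```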

Network Architecture

The optical flow module contains 6 convolutional layers, each followed by a leaky ReLU activation. We then flatten the output of the last convolutional layer and pass it to a feed-forward network of 4 layers (512, 256, 64 and 12 output units), each followed by a tanh activation except for the last layer. The 12 outputs of the feed-forward network are split into two sets of 6 affine motion parameters, from which we compute the two flow maps W1 and W2, as sketched below.
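
The text does not give the exact affine parameterisation, so the following sketch shows one common choice: a dense flow map u(p) = A p + t built from 6 parameters (the four entries of A plus a translation) on a normalised pixel grid. The function name affine_flow and the coordinate convention are assumptions.

```python
import torch

def affine_flow(theta, height, width):
    """theta: (B, 6) affine parameters -> (B, 2, H, W) dense flow map."""
    B = theta.shape[0]
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, height),
        torch.linspace(-1.0, 1.0, width),
        indexing="ij",
    )
    grid = torch.stack([xs, ys], dim=0).unsqueeze(0).expand(B, -1, -1, -1)  # (B, 2, H, W)
    A = theta[:, :4].view(B, 2, 2)     # linear part of the affine motion
    t = theta[:, 4:].view(B, 2, 1, 1)  # translation part
    return torch.einsum("bij,bjhw->bihw", A, grid) + t

# The 12 outputs of the feed-forward head split into the two parameter sets:
# theta1, theta2 = params[:, :6], params[:, 6:]
# W1, W2 = affine_flow(theta1, H, W), affine_flow(theta2, H, W)
```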

In the segmentation module, the output of the residual blocks is bilinearly upsampled by 4 decoding layers. Each decoding layer is connected to its respective encoding layer by a skip connection and is followed by a leaky DoReLU activation with γ = 100. At the last layer, softmax is applied instead, ensuring that the channel values are bounded in [0, 1] and sum to 1. A sketch of this decoder follows below.
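
The sketch below assumes the encoder features and skip channels from the earlier architecture sketch. Since we do not reproduce the exact form of the leaky DoReLU here, nn.LeakyReLU stands in for it; the channel sizes and the choice of two masks (matching the two affine flow maps) are assumptions.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1)
        self.act = nn.LeakyReLU(0.1)  # stand-in for the paper's leaky DoReLU (gamma = 100)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x, skip):
        x = torch.cat([x, skip], dim=1)  # skip connection from the matching encoder layer
        return self.up(self.act(self.conv(x)))

class SegmentationModule(nn.Module):
    def __init__(self, feat_ch=512, skip_chs=(32, 64, 128, 256), n_masks=2):
        super().__init__()
        layers, prev = [], feat_ch
        for skip_ch in reversed(skip_chs):  # decode from coarsest to finest resolution
            layers.append(DecoderLayer(prev, skip_ch, skip_ch))
            prev = skip_ch
        self.layers = nn.ModuleList(layers)
        self.head = nn.Conv2d(prev, n_masks, 1)

    def forward(self, feat, skips):
        x = feat
        for layer, skip in zip(self.layers, reversed(skips)):
            x = layer(x, skip)
        # Softmax over mask channels: per-pixel values in [0, 1] that sum to 1
        return torch.softmax(self.head(x), dim=1)
```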