So here we are: the last project before the final capstone in the Udacity Self-Driving Car Nanodegree. I'm excited that I'm still on board and more or less on time with my submissions, but on the other hand it's sad, because the course is coming to an end and I've grown used to its motivating rhythm. Without further ado, let's jump into an interesting and challenging project about semantic segmentation.
So there were two paths for completing this stage of the nanodegree: semantic segmentation or functional safety. I felt that the first option was closer to my heart, as it was connected with the area of deep learning, which gave me yet another opportunity to spend some hands-on hours with TensorFlow.
The lessons, supplied by Udacity together with specialists from Nvidia, were also meaningful and gave a good overview of existing solutions in the field. In the advanced deep learning lessons we went through canonical models like VGG and ResNet, and concentrated on a solution better suited to this problem - fully convolutional networks. We also learned how to speed up inference (inference optimization) using techniques like fusion, quantization and reduced precision.
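To give a feel for one of these techniques, here is a minimal sketch of linear 8-bit quantization in plain NumPy (my own illustration, not tied to TensorFlow or to the course code): weights are mapped onto 0..255, which cuts memory 4x versus float32 at the cost of a small, bounded rounding error.

```python
import numpy as np

def quantize_uint8(w):
    """Affine-quantize float weights onto the uint8 range 0..255."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Map the 8-bit codes back to approximate float weights."""
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)

q, scale, lo = quantize_uint8(w)
w_hat = dequantize(q, scale, lo)

print(q.dtype)                                  # uint8: 4x smaller than float32
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-6)  # error bounded by half a step
```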
So let's clarify what semantic segmentation really is. Put simply, it's the task of understanding an image scene at the pixel level and clustering together the pixels that belong to the same object. The output of this process is usually represented as separate, colored regions, just like in the image below.
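Those colored regions are just per-pixel class ids mapped through a color palette. A tiny NumPy sketch (my own example; the mask values and palette colors are made up for illustration):

```python
import numpy as np

# A tiny 4x4 label mask: 0 = background, 1 = road (one class id per pixel)
mask = np.array([[0, 0, 0, 0],
                 [0, 1, 1, 0],
                 [1, 1, 1, 1],
                 [1, 1, 1, 1]])

# Hypothetical palette: one RGB color per class
palette = np.array([[0, 0, 0],        # background -> black
                    [128, 64, 128]])  # road -> purple

# Fancy indexing replaces each class id with its color
overlay = palette[mask]
print(overlay.shape)  # (4, 4, 3): an RGB image, one color per pixel
```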
You can look at it as a more advanced form of object recognition compared to the bounding boxes used in the Vehicle Detection Project in Term 1. You probably already see the advantages, right? Semantic segmentation gives the machine the ability to understand the scene more precisely: to distinguish objects, measure them, and then maybe predict their behaviour and adjust its own behaviour and planning... okay, I know, we're still in 2D here, but it still gives a lot of higher-level information.
In the project we only had to cluster the pixels belonging to the drivable part of the road.
As always we had a starting project template, which consisted of the bare bones of the project, so we could concentrate on the actual deep learning solution instead of thinking about how to start and why.
The dataset used in the project was the Kitti Road dataset - I had to register in order to get my own copy. There were 1550 images in total for training and testing. That's not a big number, but in combination with a fairly deep convolutional network it put a lot of load on my machine. A pretrained VGG model (537 MB of weights) was also supplied with the dataset - it was later used as the base for my FCN.
After a few iterations of code development, I soon realised that my machine was too weak for this project - a CPU alone was not enough to finish it on time and still have fun trying different solutions. After some hard times with my AWS machine earlier (which I had decided to drop after the last deep learning project), I finally decided to set up my own machine in my office at Polbyte. Udacity, thank you very much for forcing me to do that - it was a great decision! So yes, I had to decide on the hardware - Nick Condo, thanks for your article - and yes, I had to go through the complicated setup, but this time I was lucky, because everything worked after the first installation - this time Vivek Yadav saved my day(s) with his article. I had not planned that move and had to do it all on the fly in just a few days - luckily there are always some gamers around, and they change their gear often, so you can usually get something good for half the price. From then on my setup was complete.
So, as suggested in the course and as it later turned out after reading some papers (check the links at the end of this article), a fully convolutional network architecture was potentially one of the best solutions for the problem.
So why is this architecture so good at this task, and what is the trick here? Firstly, it's based on the Convolutional Neural Network, which is still the state of the art for image pattern classification - thank you, Yann LeCun. Secondly, there is some confusion in the community about the differences between a CNN and an FCN. An FCN relies on three special techniques:

- 1x1 convolutions in place of fully connected layers, which preserve spatial information and let the network accept input of any size,
- upsampling through transposed convolutions (deconvolutions), which bring the coarse output back up to the input resolution,
- skip connections, which combine fine-grained detail from earlier layers with the coarse, semantic information from deeper layers.
I strongly encourage you to read the whole paper about FCN - Fully Convolutional Networks for Semantic Segmentation.
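To make the first two building blocks concrete, here is a minimal NumPy sketch (my own illustration, not code from the project or the paper) of a 1x1 convolution and of stride-2 upsampling via a transposed convolution:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: a per-pixel matmul over channels.
    x: (H, W, C_in), w: (C_in, C_out) -> (H, W, C_out)."""
    return x @ w

def transposed_conv_2x2_s2(x, k):
    """Transposed convolution, 2x2 kernel, stride 2, single channel.
    Each input pixel scatters k * value into a 2x2 block of the output,
    doubling spatial resolution. x: (H, W), k: (2, 2) -> (2H, 2W)."""
    h, w = x.shape
    out = np.zeros((2 * h, 2 * w))
    for i in range(h):
        for j in range(w):
            out[2*i:2*i+2, 2*j:2*j+2] += x[i, j] * k
    return out

x = np.ones((4, 4, 8))        # a tiny 4x4 feature map with 8 channels
w = np.ones((8, 2))           # project 8 channels down to 2 classes
print(conv1x1(x, w).shape)    # (4, 4, 2) - spatial layout preserved

up = transposed_conv_2x2_s2(np.ones((4, 4)), np.ones((2, 2)))
print(up.shape)               # (8, 8) - resolution doubled
```

The 1x1 convolution keeps "where" while changing "what" (the channel depth), and the transposed convolution is what lets the decoder grow the coarse prediction back to the original image size.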
After a short introduction to the VGG16 neural network architecture and some example FCN implementations in TensorFlow, I started my own work on implementing an FCN. I loaded the VGG16 model with TensorFlow and added skip connections, placing deconvolution (strided transposed convolution) layers on top of the VGG model. I used AdamOptimizer for the optimization step with a learning rate of 0.0001. There were two classes - a pixel either belongs to the drivable road or it does not. Training was done in batches of size 1, because of the FCN's large memory usage.
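The loss being minimized here is an ordinary softmax cross-entropy, just computed at every pixel instead of once per image. A NumPy sketch of that idea (illustrative only; in the project TensorFlow computes this for you):

```python
import numpy as np

def pixel_softmax_cross_entropy(logits, labels):
    """Mean cross-entropy over all pixels.
    logits: (H, W, num_classes) raw network outputs,
    labels: (H, W) integer class ids (0 = not road, 1 = road)."""
    # numerically stable log-softmax over the class axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # pick the log-probability of the correct class at every pixel
    h, w = labels.shape
    picked = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -picked.mean()

logits = np.zeros((2, 2, 2))          # tiny 2x2 image, 2 classes, uniform logits
labels = np.array([[0, 1], [1, 1]])
loss = pixel_softmax_cross_entropy(logits, labels)
print(round(loss, 4))                  # ln(2) ≈ 0.6931 for a 50/50 guess
```

With uniform logits every pixel is a coin flip, so the loss is ln(2); a training loss of 0.02 means the network is, on average, very confident in the correct class at each pixel.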
After many iterations I achieved satisfying results - after 15 epochs the training loss was around 0.02. Here are some result images from the trained network:
Semantic segmentation is not a trivial topic, but using today's tools and established architectures we can create solutions for real-world applications in only 200 lines of code - amazing!