(Deep) Learning to Fly

Krzysztof Kudrynski & Blazej Kubiak

Recorded at GOTO 2018

Okay, so again: I'm Krzysztof Kudrynski, and this is a drone. Right now it's in the box, and the world consists of two groups of people. People from the first group will immediately take the drone out and start to enjoy flying, while people from the second group will immediately download the API. Now, there is absolutely no shame in being part of the first group; I have real friends that would actually do this, and they are nice people. Blazej and I are proud to represent the elite: after over two years of dealing with this drone, we have absolutely no idea how to manually fly that thing. In this project, the drone will learn to fly, and it will learn that skill based on what? A mono camera.

So the starting point for our project will be a calibration chessboard. By making a lot of pictures of such a well-structured image, you can quickly apply some mathematical algorithms to correct all the distortions and calibrate the parameters of the camera, to make everything look nice and smooth.
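As a rough sketch of what such calibration estimates, here is the simple two-coefficient radial distortion model; the coefficient values below are made up for illustration, not taken from any real camera:

```python
import numpy as np

def apply_radial_distortion(points, k1, k2):
    """Distort ideal (normalized) image points with a simple radial model.

    Calibration does the inverse: it estimates k1, k2 (and the camera
    matrix) from many chessboard views, so images can be undistorted.
    """
    pts = np.asarray(points, dtype=float)
    r2 = np.sum(pts**2, axis=1, keepdims=True)  # squared distance from center
    factor = 1.0 + k1 * r2 + k2 * r2**2
    return pts * factor

# A point at the optical center is never displaced; off-center points are.
center = apply_radial_distortion([[0.0, 0.0]], k1=-0.2, k2=0.05)
corner = apply_radial_distortion([[1.0, 1.0]], k1=-0.2, k2=0.05)
```

OpenCV's `calibrateCamera` estimates exactly this kind of coefficients, plus the camera matrix, from the chessboard views.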
In fact, a calibration chessboard is the starting point of almost every project in computer vision. Then you move on with simple projects like detecting a chessboard in the image, then maybe tracking a chessboard in the video; these are the "hello world" of computer vision. But if you are here to see the hello-world application, you might be a little bit disappointed, although we cannot disagree that tracking a chessboard has so many wonderful real-life applications, ranging from detection of playing chess on a chessboard up to detection of people walking on the street with a chessboard. And the list does not end here.
However, our project is inspired by something completely different, and these are extracts of a few videos of drone racing. I think we can all feel that speed; it's so amazing. And when you realize that these drones were actually controlled manually, I'm sure there is one word screaming in your head, and it is "irresponsibility". Let us analyze this one case again: what is the world seen by the drone camera during a race? You already know what to do: you can turn a little bit left and go fast through the tunnel, and there is still a lot of time. But if you are starting to worry now, you are doing the right thing, because although there are still a lot of frames left to turn, I feel you already know how it's going to end. In the world you could see, there was so much time to react, so much time to react before the crash, so many frames which were just begging.
Together with Blazej, we decided to put an end to that. As a user, you should only select where you want to fly, and it is the machine that should calculate all the rest. Today we will think about how we could develop this: we will walk through all of the steps needed to attain that, and we will show you how it worked on our example. But in our example, instead of ruins like those, instead of forests and parking lots, we will show you our tests from our modern TomTom office, built and opened this year. If you look at it from the right perspective, the similarity cannot be a coincidence; I think we were destined to make this test in the office.

OK, so let's start solving our task, and for initial inspiration let's analyze this case again. We all have to agree that we humans are just amazing. Everyone here, immediately and without any effort, will be able to spot the right place that is worth flying to. Everyone here, immediately and without any effort, will be able to plan the next actions. And everyone here, immediately and without any effort, will be able to spot the emerging collisions and act accordingly. This is already amazing, but what is even more amazing is the fact that with exactly the same brain, at exactly the same time, we humans are able to plan our next meal and perform deep considerations about life. Well, such general intelligence is still beyond the reach of science, so today we will not talk about how to make the machine cook or understand life; we will try to analyze how we humans perform these three particular tasks. And because the machine can be perfectly focused on them only, we will soon find that it works faster and can outperform a human.
OK, so let's start with some basics. You can start with whatever equipment you want, as long as you can control it from your own software. Our system was based on the Parrot AR.Drone 2.0: you can communicate with it over Wi-Fi, and there's a lot of software on the web which you can download. We used the YADrone framework, which is open source under the Apache license, so we could just download it. And out of thousands of lines of YADrone code, we used essentially one line: the move command, where you can specify the tilt angle in both directions, x and y, the change of speed in the z direction, and the rotation speed. You just apply this, and the drone will move as you'd like it to. These parameters will be derived from the understanding of the image that is decoded for us from our camera. It was most convenient for us to use C++ and FFmpeg to decode the stream, and then to analyze the image using Python libraries like NumPy and OpenCV. Finally, the control center, the orchestrator, we initially planned to write in Java, because YADrone was also in Java. Such a system is a nice place to start, but obviously, as we move on during this talk, our ideas will evolve, and you will see that this architecture, this design, will change gradually as we speak.
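To make the four parameters concrete, here is a hypothetical sketch of how such a move command could be wrapped on our side; the function name and the dictionary layout are ours for illustration, and the real YADrone API differs:

```python
def clamp(value, lo=-1.0, hi=1.0):
    """Keep a control value inside the allowed range."""
    return max(lo, min(hi, value))

def move_command(tilt_x, tilt_y, vertical_speed, yaw_speed):
    """Build one control message: tilt in x and y, climb rate, rotation rate.

    All four values are normalized to [-1, 1], as is common for the
    AR.Drone command set; the actual YADrone call signature may differ.
    """
    return {
        "tilt_x": clamp(tilt_x),
        "tilt_y": clamp(tilt_y),
        "vz": clamp(vertical_speed),
        "yaw": clamp(yaw_speed),
    }

cmd = move_command(0.1, -0.3, 0.0, 2.5)  # out-of-range yaw request gets clamped
```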
But it is a nice backbone, so let's start with the logic. Our first level of autonomy is tracking, so let's think how to implement tracking on the machine: we would like to track this place while we move. Because we are fans of the human brain, let's think how we humans, how our brain, works in real-life situations when we want to do that. Let's take a really nice example: we go to a club, there are a lot of people, and we find something catchy, something that captures our eyes, for example, yeah. But before we focus on the heart, let's imagine a situation, and you must imagine it really hard, because this does not happen very often: if this were the case, we could use our predefined chessboard-tracking algorithm to accomplish our task. Unfortunately, sometimes it happens that we have to track something else, and it is at this moment that this region gains some description: it may be the distribution of colors inside the region, we may be following the shape of the heart itself, or we may just try to follow some characteristic points.
These approaches can easily be implemented on the machine, and the good news is that you don't need to do it on your own, because you have frameworks, for example the OpenCV library, which is a C++ library with wrappers in Python and Java, so you can use it nicely. Out of quite a lot of algorithms, we tried three of them. In Mean Shift, you just specify a region, and then you try to shift and scale this region so that the distribution of colors stays the same; it's done automatically for you. CamShift is very similar, but the region can also rotate. And finally, in optical flow, you analyze the movement of individual pixels in the image, using some nice mathematical tricks around Taylor expansion and so on.
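To illustrate the Taylor-expansion trick, here is a toy one-dimensional Lucas-Kanade step; this is a deliberately simplified sketch, since real optical flow works on 2D windows and image pyramids:

```python
import numpy as np

def estimate_shift(frame1, frame2):
    """One Lucas-Kanade step in 1D: solve I_x * d = -I_t in least squares.

    Uses the Taylor expansion I(x + d) ~ I(x) + I_x * d, which is exact
    for a linear signal and a good local approximation otherwise.
    """
    ix = np.gradient(frame1)          # spatial derivative
    it = frame2 - frame1              # temporal derivative
    return -np.sum(ix * it) / np.sum(ix * ix)

x = np.arange(50, dtype=float)
frame1 = 0.5 * x                      # a linear intensity ramp
frame2 = 0.5 * (x - 2.0)              # the same ramp shifted right by 2 pixels
d = estimate_shift(frame1, frame2)    # recovers the shift of 2 pixels
```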
We made some experiments, and we found out that in our case optical flow gave us the best results. So let's see how it works in action. You can see what the drone sees, with some controls. The arrows on the left and right show what movement needs to be taken so that we avoid an obstacle; similarly, the bottom arrow shows if we need to fly higher or lower to avoid the obstacle. In the beginning, obviously, in the first level of autonomy, we will not yet be avoiding obstacles. What will be most interesting for us is the red circle in the middle, our circle of focus, of attention. It is by default in the middle, but when we start flying, we will click on the table on the left, and the drone will start to track this place. Finally, the arrow on the top shows what rotation needs to be taken so that the drone is actually centered on the clicked region. So let's start flying. You can see we clicked on the table, and the strong desire of the drone to turn so that it can focus on this table. Then, in our graphical user interface, we enable autonomy number one, and you will see the drone now floating freely, tracking the object that is being tracked. So this is our autonomy level number one. If we want to increase to autonomy level two, which is just taking the action that will fly towards this place, it's quite obvious: we just fly forward. But in real-life scenarios in the office, it's not always a good idea to implement that so straightforwardly, so we decided to quickly move on to level-three autonomy, which also avoids obstacles. And this is where the fun begins, because in order to avoid risks, you need to understand and identify, in 3D space, the obstacles which you can see in a two-dimensional image, and that seems difficult.
So the question we can ask ourselves again is: how would we humans do it? How do we humans understand the space around us? Probably the first answer that comes into our head is: through the pair of our eyes. So let's say this is a simplified diagram of our viewing system: the blue rectangle is the image plane of the eye, and the dashed line is its principal axis. Now, a single pixel in this plane tells you very limited information about what we see in the world; in fact, it only tells you the ray on which the object lies in the world. Unless you have another eye, which is quite common, and in the other eye you have a pixel corresponding to the same object; then, using some simple maths, you can actually find the place in the world where this object lies. So is stereo vision the answer? Maybe we could just add a stereo vision system to our drone instead of a mono camera, but it would be more expensive and probably much heavier. So instead, before we all go shopping, let's think and ask ourselves another question: is it really necessary? Are we humans hopeless when we have just one eye? Obviously not.
In fact, when we move, our eyes feed the brain with images, a lot of images, and it can be proven mathematically that if you have at least seven corresponding pixels in two images, in some non-critical configuration, you are able to reconstruct both the movement and the 3D structure that you are looking at, up to a scale factor, obviously. So, for example, this is a video of my chair. Using these mathematical tricks, in each image there will be some key points detected by the algorithm, and they will be tracked, so we have the correspondence between consecutive frames, and finally it will calculate the 3D image on the fly, which you can see being done now. The approach is known under many names, mostly structure from motion, multiple view reconstruction, or visual SLAM. There are some functions for this in OpenCV with which you can do basic reconstruction, and there are many projects on the web and many great books if you would like to understand it and maybe write your own, better code. If you do it right, it will work amazingly, especially if you move laterally with respect to the objects that you are trying to reconstruct, because every frame will give a lot of different perspectives on the object to be reconstructed.
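To make the "corresponding points give you the motion" claim concrete, here is a sketch of the classic linear (eight-point) estimate of the essential matrix on synthetic, noise-free data; real pipelines add robust estimators such as RANSAC on top of this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scene: random 3D points in front of two cameras.
P = rng.uniform(-1, 1, size=(12, 3)) + np.array([0, 0, 5.0])
theta = 0.1
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])   # small yaw rotation
t = np.array([0.5, 0.0, 0.1])                        # small translation

x1 = P / P[:, 2:3]                     # normalized image coords, camera 1
Q = P @ R.T + t
x2 = Q / Q[:, 2:3]                     # normalized image coords, camera 2

# Linear estimate: each correspondence gives one row of the system A f = 0,
# where f is the flattened essential matrix E.
A = np.stack([
    np.array([u2*u1, u2*v1, u2, v2*u1, v2*v1, v2, u1, v1, 1.0])
    for (u1, v1, _), (u2, v2, _) in zip(x1, x2)
])
_, _, Vt = np.linalg.svd(A)
E = Vt[-1].reshape(3, 3)               # null-space vector = essential matrix

# Every correspondence satisfies the epipolar constraint x2^T E x1 = 0.
residuals = [abs(b @ E @ a) for a, b in zip(x1, x2)]
```

From E, the relative rotation and translation (up to scale) can then be factored out, which is exactly the up-to-scale reconstruction mentioned above.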
However, if you, for example, only fly towards the object, you might have a little bit of a problem, because every frame you add will be very similar: there will be no different perspective on the objects, especially those in the center of the image, and projectively this gives very little additional information. In fact, because there is some noise in the sensor, and there is always some residual miscalibration, this noise might be bigger than the projective information that you gain, and then you might think that your 3D world is flat, which is not true. If we just rotate, that would be even worse, because in this case we have only one point at which all rays meet, and there is no additional projective information at all, which is the worst case. And in fact, when we are racing with our drone, most of the time we fly forward and rotate, which is the worst-case scenario for this method.
However beautiful that sounds, a similar but slightly different approach would be to use optical flow itself. Let's focus on a door handle and a glass while we move: you can see that the glass, which is closer, moves faster in the image than the door, which is far away. We could use this information to say something about which parts of the space are further away; we could in fact track all the objects and create our 3D image. But again, similarly to the previous method, it works best if we move sideways relative to what we see, not towards it, which is not good for us. Of course, we could think of a situation where we could use that anyway: our drone would deliberately add some sideways motion as we fly. For us Polish people, and for our friends in Russia, adding a little bit of a swerving path would be perfectly excused, but we would not win the race that way.
So again, are we doomed to fail? Absolutely not, and again we will draw inspiration from the power of the human brain, because we humans do not need two eyes to understand space. We don't even need to move around: we can stand still in one place, with one eye open, and still have no doubt about what's around us. And how do we know that? From experience. From billions of images fed into our brain during our lives, we learn to understand objects, their perspective, their relative sizes. With our lifetime experience of looking at the world, we humans have absolutely no doubts; we need no tricks to see the space. And, as amazing as it may sound, we can pass this experience to the machine with the new advancements of artificial intelligence. So we will try to train our drone using a deep convolutional neural network; in a few minutes, Blazej will tell you more about that.

But if anyone here is absolutely unfamiliar with the concept of machine learning, the idea is this: instead of designing a set of precise algorithms, formulas, and rules, as we do in a conventional handcrafted algorithm, we design a mathematical multi-layer model capable of taking all the input in its raw format and giving some output label. To make it the correct label, we flood the model with thousands of examples, so that it tunes its reactions to each pixel, and the combinations of these reactions, until finally you get the output you want. You can read a lot about AI, but today we prepared something better for you: you will be part of an amazing experiment, which will give you a better feeling of how AI works and, more importantly, shed some light on a few concerns about its proper usage. In this experiment, you will be the AI, and I will train you to perform a simple categorization task. Like every network, in the beginning you have absolutely no idea what you will be learning; I will just give you examples with labels, without any explanation, and after some examples the training will shape your brain so that finally, when I give you an example which you see for the first time and have no idea what it is, you will know what to answer. Are you ready?
Image number one: label. Label. Image number three. The training time is over; now it's time for the ultimate test. Please don't hurry, you have a lot of time, and remember that the allowed answers are yes and no. So, the test is... come on... wait a second, please. Now raise your hands: who votes for yes? OK, now please raise your hands: who votes for no? The results are five thousand and thirty-five for yes and one for no. Ladies and gentlemen, we just successfully detected a sofa. Thank you very much for your contribution; without your votes it would be impossible to make this experiment. We won, we are very happy, we are millionaires, we won this competition.

However, there is something disturbing: there is this one guy who was absolutely blinded by some other, less important features also available in this image, and if we had more animal lovers in this room, we would miserably fail to detect an ugly sofa. You see, this is a real concern in AI: preparing a good, precisely shaped, unambiguous training set is an art, and you will usually still not know exactly what the algorithm will learn, so you usually need a huge training set to make it work. Sometimes, as a designer, you can make a choice and afford, at a much lower cost, an algorithm for which you can intuitively find some nice rules and formulas.

In our case, a handcrafted algorithm would be a pain, because what we want is to attach a third dimension to an image that has only two: a depth map. Here it is coded by color. If we were able to make this image out of every frame that we get, then we would accomplish our task: we would just detect the black regions and act accordingly to avoid them. In our case, we will be using a deep convolutional neural network to achieve that. But in order to train a network, we need training data; in other words, if we want the machine to produce this depth map for every incoming frame, we first need to show it a hundred thousand examples of this job done right. So where do we take this data from?

One way would be to take a ruler and, for every pixel in the image, measure the distance between our camera and the object in that pixel, then repeat it for every pixel, and then repeat it for every image we have. According to our rough calculation, the time needed for that would be one thousand seven hundred and twenty-nine days, so we decided to use a different approach.

If we come back to our structure-from-motion algorithm, you may remember that it works well while moving sideways and a bit worse in other cases, but during training we don't care: we can prepare a nice scenario of our drone moving sideways relative to the objects, and then some nice scenario that moves, for example, from zero to four meters towards and away from a pillar; that's why we call it the pillar learning. Then you can actually use structure from motion to produce depth maps for you, which will be the training set for your network. Now, this will only be up to scale, as I told you, and the scale will be floating. So what more can you do? You can build a histogram of every pixel over time, locate the pixels corresponding to the pillar, and then, because you know the distance goes from zero to four meters, you can normalize it and apply this normalization back to the image, so that every frame gives you a proper metric distance in your depth map. That's how you can create the training set to train your machine.
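The normalization step can be sketched like this, assuming a hypothetical mask marking the pillar pixels and a known metric camera-to-pillar distance for the frame:

```python
import numpy as np

def to_metric_depth(depth, pillar_mask, known_distance_m):
    """Rescale an up-to-scale depth map to meters.

    Structure from motion recovers depth only up to a floating scale, but
    if we know the true camera-to-pillar distance in this frame, the ratio
    between it and the pillar's relative depth fixes the scale for the
    whole map.
    """
    scale = known_distance_m / np.median(depth[pillar_mask])
    return depth * scale

depth = np.array([[2.0, 4.0],
                  [2.0, 8.0]])            # arbitrary, up-to-scale units
mask = np.array([[True, False],
                 [True, False]])          # pixels belonging to the pillar
metric = to_metric_depth(depth, mask, known_distance_m=3.0)
```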
In fact, we have to tell you, we are in love with our idea. However, you have to know that there are much faster ways to achieve the same solution, and now Blazej will tell you more about those ideas and about deep convolutional neural networks.
Hello, hello everyone. So, it's great to write things from scratch; however, sometimes it's much wiser to use an existing solution. That's why we used a deep neural network already trained for depth estimation, which we found on the Internet. What was really nice: not only the network structure but also the trained model was available to download. It was available in two formats; we used the TensorFlow model, as it is more popular and easy to use. What was also very nice, the authors of the paper prepared a script that we could read to learn how to load the model and how to feed it with our custom data. So, after downloading all the stuff, we did the first test, and we were a bit scared, because the model size was 200 megabytes, the loading time was 5 minutes, and depth estimation took around 40 seconds. That was not acceptable for our real-time solution. Fortunately, we found out that we have a very powerful machine in our TomTom office, with two GTX Titan X graphics cards, and with that equipment we could achieve a hundred milliseconds, which was really acceptable for our system. So we had everything necessary for further development, but we are guys that want to really know how things work, what is under the hood, and I will try to explain it now; here comes the knowledge review.
So, the solution can be described in one sentence: a deep, fully convolutional neural network with upscaling layers, fine-tuned from ResNet-50. Because that sounds a bit terrible, I will try to explain it in parts. Let's start with the neural network. Probably everyone has seen such a picture: a fully connected neural network, where every neuron is connected to every neuron in the previous layer. The connections have weights, and the weights are adjusted during the training process called backpropagation.
So why can't we just use that network, feeding it with the image captured by the drone camera? You see that the picture is downscaled to a hundred by a hundred, but even with such a small input we would have a billion weights in that network, which means that in every training step we would have to update a billion weights. It's simply impossible to converge to a proper solution with such a network. That's why we use convolutional neural networks, and to understand those, we need to know a few concepts.

The first is convolution. This is a simple operation: on the left we have the image, and on the right we have the kernel, sometimes called the filter. For each position in the frame, we calculate the sum of products between the values in the kernel and the pixels in the frame. This example kernel has a pattern of positive and negative values around its center, which means that it will try to expose edges in the image, and after the convolution you see that the edges are better visible in the image.
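A minimal sketch of the operation, using a Laplacian-style edge kernel, since the exact kernel values from the slide are not recoverable from the recording:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2D convolution as used in CNNs (i.e. cross-correlation):
    at every position, the sum of products between kernel and image patch."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# An edge-exposing kernel: positive center, negative neighbors.
edge_kernel = np.array([[0, -1, 0],
                        [-1, 4, -1],
                        [0, -1, 0]], dtype=float)

flat = np.ones((5, 5))                                  # no edges anywhere
step = np.hstack([np.zeros((5, 3)), np.ones((5, 3))])   # one vertical edge
flat_response = convolve2d(flat, edge_kernel)           # all zeros
step_response = convolve2d(step, edge_kernel)           # nonzero at the edge
```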
The second concept used in convolutional networks is max pooling. It is used for size reduction, for downsampling, and it is a very simple operation: it takes the maximum-value pixel from each block and puts it in the output. Why does it work? Because we assume that the most valuable information is stored in the maximum-value pixels of the feature maps; it has been verified in experiments that it works well, and such a simple operation helps us reduce the size of the intermediate feature maps.
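A minimal 2x2 max-pooling sketch in NumPy:

```python
import numpy as np

def max_pool_2x2(x):
    """Downsample by 2 in each direction, keeping the max of every 2x2 block
    (assuming even height and width for simplicity)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 0, 5, 6],
                 [1, 2, 7, 8]], dtype=float)
pooled = max_pool_2x2(fmap)   # the 4x4 feature map shrinks to 2x2
```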
The last concept is unpooling. It is also quite simple, because it is used for upsampling: we just take each pixel from the input and put it in the upper-left corner of a 2x2 block in the output. So the question is why all these zeros are somehow necessary in the output; actually, they are not important, but we always use unpooling layers interleaved with convolutions, so we get interpolation between all these pixels and a nice smooth output for further processing in the network.
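And the corresponding 2x2 unpooling sketch: each input pixel lands in the upper-left corner of its output block, and the zeros get smoothed away by the convolutions that follow:

```python
import numpy as np

def unpool_2x2(x):
    """Upsample by 2: each input pixel goes to the top-left corner of a
    2x2 output block; the other three positions are filled with zeros."""
    h, w = x.shape
    out = np.zeros((2 * h, 2 * w))
    out[::2, ::2] = x
    return out

small = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
big = unpool_2x2(small)   # 2x2 grows to 4x4, padded with zeros
```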
Usually in convolutional networks we use the rectified linear unit as the activation function. It is a very simple function that is almost linear, but all negative values are set to zero, and we can think about it as the simplest possible nonlinear function that can be used to add some non-linearity to the network.
So how does a convolutional layer look? We have the input image or input feature maps, and we have a set of filters, the same number of filters as the number of channels. We convolve them and sum the results together, we apply the activation function, and we have a feature map. If we want multiple feature maps, to extract many different features, we have to have multiple sets of filters. So what is the difference between a convolutional layer and the fully connected layers I was mentioning before? The main difference is the number of weights. We created a feature map using just 27 weights, stored in these filters: we have three filters, 3 by 3, so it's 27 weights. If you want multiple feature maps, say a hundred, you have about 2,700 weights, but you are producing a hundred feature maps, so many, many features. On the other hand, a fully connected layer would have 300 million weights while producing just one output image. So, what is important: the convolutional layer is very good at creating features, and the fully connected layer is good at aggregating information. Also important: a fully connected layer is not local, so it considers every pixel of the input, while a convolutional layer works locally, because the convolutional filter works locally. The filter can be quite big, so the context might be quite large, but still it is local.
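The weight counts behind this comparison are quick arithmetic; the 100x100x3 input size is taken from the downscaled picture mentioned earlier, and the fully connected layer is assumed to map it to a 100x100 output, which is where the hundreds of millions come from:

```python
# Parameter counts behind the comparison above (bias terms ignored).
in_channels = 3
kernel_size = 3

# One convolutional feature map: one 3x3 filter per input channel.
conv_weights_one_map = in_channels * kernel_size * kernel_size   # 27
conv_weights_100_maps = 100 * conv_weights_one_map               # 2,700

# A fully connected layer mapping a 100x100x3 input to a 100x100 output
# needs one weight for every input-output pair.
fc_weights = (100 * 100 * 3) * (100 * 100)                       # 300 million
```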
So this is the typical picture of a convolutional neural network. The first part is feature extraction, which consists of a few convolutional layers and max-pooling layers; the second part is classification, which consists of a few fully connected layers. However, these days it is quite popular to remove the classification part, the fully connected part, altogether, and exchange it for more convolutions and max poolings, to have just one pixel at the end. This is a nice trick to reduce the number of weights even further, so we have a lighter model. But for our case of depth estimation this is not enough, because for an input image we get just a one-pixel output, and we want a full-resolution depth map for our input image.
That's why we need upscaling layers: if we add a few upscaling layers interleaved with convolutions, we get a full-resolution depth map at the end. It is quite difficult to train such a big network from scratch, so there is a nice trick to use, which is called fine-tuning. It works like this: we take a model of a network that was trained for a completely different task, like classifying cats and dogs, we cut off some layers at the end, we add other layers, and we retrain the model for our new case. It can be retrained by adjusting all the weights a little bit, or we can even freeze the layers at the beginning of the network, because in such big networks these first layers produce very universal features, like edges et cetera. And the authors of the paper did exactly these things:
they took a ResNet-50 network trained for another task, removed the classification part, added upscaling layers, and retrained the network for depth estimation. The results are quite amazing. As you see in these few examples, there is the RGB image, the ground-truth picture in the center, and the depth prediction on the right. It is maybe a bit smooth, so the edges are not so sharp, but for our case of collision avoidance it's really enough, or even more than enough. Now Krzysztof will tell you more about our system.

OK, so let's come back for a while to our initial system design, which is already a little bit complicated, with the components, for convenience, written in different languages. Now, when we add our deep neural network prediction, it becomes even worse, because not only do we face porting everything to a Linux machine, with CUDA support obviously, but we also cannot just call Python scripts independently every time a new frame comes, because we would need to load the 200-megabyte model for every frame, and that would kill us. It's obviously solvable; there are people capable of managing all the possible pieces and making them work as a perfect unity. For those who cannot, the world brings microservices. Of course, these are not a perfect example of microservices, but breaking our design into client-server components made our life much easier. Now the headquarters of our application is our Python application: it requests images from the video decoding service, and it analyzes them with the help of the depth server, which, when sent a frame, will respond with the depth map. The Python app also has the graphical user interface, where the user can decide which level of autonomy to use by just pressing keys, and then, depending on that, either manual or automatic commands for the drone are passed to the client, which communicates directly with the drone.
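The client-server idea can be sketched with a toy service: the "depth server" below just echoes the frame size, but the request-reply shape is the same as in our real split; the sockets and payloads here are purely illustrative:

```python
import socket
import threading

def depth_server(sock):
    """Toy stand-in for the depth service: receives a frame (here, raw
    bytes) and replies with a 'depth map' (here, just the byte count)."""
    conn, _ = sock.accept()
    with conn:
        frame = conn.recv(1 << 16)
        conn.sendall(str(len(frame)).encode())

# The control center connects as a client, just like our Python app
# talked to the video-decoding and depth-prediction services.
server = socket.socket()
server.bind(("127.0.0.1", 0))        # bind to any free port
server.listen(1)
threading.Thread(target=depth_server, args=(server,), daemon=True).start()

client = socket.socket()
client.connect(server.getsockname())
client.sendall(b"\x00" * 16)         # pretend this is one (tiny) video frame
reply = client.recv(64).decode()
client.close()
server.close()
```

The payoff of this split is that the heavy model is loaded once, when the service starts, and every frame afterwards is just one cheap request.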
OK, so finally, having all the puzzles solved, it's time for the ultimate demonstration, and we're very happy to be able to show it to you today, because on the day of recording this presentation, this flight, we had a fire. If you don't know what a fire is: a fire is a set of failures that every self-respecting developer should experience on the day of delivery, and we had that. You will not see this in this demonstration, because we are using the most secure delivery platform, statistically proven to be PowerPoint. Now seriously: the pieces work perfectly stably, but we had some problems with the communication; that's why the video we are going to show you is probably the best out of ten, but I think you will enjoy it. So let's start.
OK, on the right side you can see the drone flying, and on the bottom left you can see our artificial brain showing us the depth map; when blue color emerges, the arrow shows the action to be taken to actually avoid the obstacle. On the top left you can see the red circle which we clicked in the beginning, so that the drone flies autonomously into the hallway, up until it reaches the windows. As you can see, everything is controlled automatically: the deep network, plus some algorithms, decides how to fly the drone until it actually finishes its flight at the window. To prove how repeatable our process is, we will show you the same video again, but now you can focus on the details. So, according to the task defined in the beginning, we have a full system where a user only shows where they want to fly, and the machine reaches that place, applying its experience, everything using a mono camera only; and that is the conclusion of our today's talk.

That's all we had to show you today, but as our final words: if we offended anyone while we tried to be funny, please accept the following rectifications. First of all, dogs are more lovable than either of us. Second, not all chess players are freaks; when you actually take your good old traditional chessboard, chess can be a nice game. And finally, irresponsibility is fun. Two rules to remember: first, watch out for the lamps in your office, and second, if you happen to record a video of your drone, for example, crashing into the workspace of your boss, never show it in public. Thank you very much.