Debugging Under Fire: Keep your Head when Systems have Lost their Mind

Bryan Cantrill

Recorded at GOTO 2017

Good morning, I'm Bryan Cantrill. I apparently am still being haunted by Usenet comments I made over 20 years ago, so everything lasts forever; keep that in mind. I'm going to be presenting this morning on a subject near and dear to my heart, even though it often induces cardiac arrest, or feels that way: debugging under the most stressful possible conditions. Debugging under fire. Debugging when systems have failed, debugging during an outage.
And the genesis of this is a very, very, very bad day that we had several years ago. This is an IRC log; I've redacted it slightly and I've obviously changed the names. This is an internal IRC log, right before disaster strikes, when one of the operators says: I'm going to reboot a bunch of the richmond-b's. richmond-b is a type of compute node we have at Joyent; I'm just getting a list right now. And I should tell you that Joyent is a public cloud company, so yes, we compete with AWS and Google and Azure and so on. We have our own public cloud, and this is an operator of that public cloud about to reboot some of these compute nodes to bring them onto a new platform image. Another operator says: all right, sounds good, let me know when it's good to go and I'll do it toward the end of my shift, in a couple of hours. And then you see, about seven minutes later, another operator saying: um, us-east-1 is being rebooted? Question mark. To which someone replies: that does not parse. What does that even mean? us-east-1 is a data center. We reboot machines, not data centers. That's not a thing.
And then operator three helpfully cuts and pastes in the message that he just saw on the head node, which is kind of the administrative node in a data center, saying that the head node is being shut down, something that we basically never do: "Log off now or risk your files being damaged." Of course, that is the classic, ancient, stupid message. Actually, Unix, why do you still have this message? I don't even know. It just seems like such a vague threat the system is making: if you do not log off, I will punish you by randomly corrupting your files. Your files are actually not going to be damaged, and you actually don't even need to log off, by the way, because the system is going to take care of things by itself. You can just, if you want, stay logged in and edit away, because we're going down anyway.
To which one of the engineers, and you can see it on the timestamps here, says: WTF. One of the things that we have, perhaps not helpfully, done: this is our pre-Slack Slack, this is our Jabber channel, and we've got a bot. Over the years the bot has, I believe, gone past the Vingean singularity. We don't actually pay the bot, but it wouldn't surprise me if the bot demands a salary at some point. We've added all sorts of weird logic to the bot, and one of the things someone added along the way is that anytime someone says WTF, the bot takes a random sentence from chat over its entire history in which someone has used the word, and offers it up. This is one example. The bot has several behaviors where, because of course the bot obviously has no mirror neurons, the bot has no empathy, the bot does not know whether this is a good time to say it, or a bad time, or an extremely bad time. It will just go ahead; it has no way of gauging. In this case we are just stunned, not actually enraged. Another thing the bot likes to do is correct anyone saying "Linux": "you mean GNU/Linux." And Linux often comes up in a bit of a froth, so someone will already be frothy, like "goddamn Linux," and the bot is like "you mean GNU/Linux," and it's like: shut up! Someone delete the bot. I'm deleting the bot. I'm deleting the bot right now, I've had it with the bot. So anyway, the bot has no sense of timing. The engineer says WTF, and the bot offers up some old non sequitur from chat about pushing up large files. Like, ha ha ha, bot, shut up. Another operator is like: what?
And the first engineer, and I just love the fact that he typed out what everybody is thinking during any outage: please don't be me, please don't be me, please don't be me. Speaking personally, as I'm traveling through life, nobody is more tolerant of a systems failure than I am. If I go to check into my flight and it's "oh my god, I'm sorry, we're having computer problems," I'm like: that's fine, take your time. Please don't be me, please don't be me, please don't be me, please God don't be me. And this is not the engineer saying please don't be me; it's the first operator. And I think there are a couple of key things to point out here. This is pretty amazing candor, because he has accidentally nuked the data center, and this is basically the shortest possible way he could express that. There's no flowery language here, no deliberate ambiguity. It's just: it is me. I am become death, the destroyer of data centers.
Engineer one, who has the kind of residual paranoia that exists in many if not most engineers, asks the reasonable question: deliberate or not? As in, is this an act of sabotage? And he says: accident. I was rebooting a richmond-b and forgot to put -n; we'll talk about that in a little bit. I fucked up. I'm sorry, truly, truly sorry. And we're all like: yes, sorry, yes, yes. At this point you're not even upset, you're terrified. I'm terrified.
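The slip described here, forgetting a node-selection flag on a fleet-wide tool, is worth dwelling on. Here is a minimal Python sketch of that failure mode; the tool name, flag, and node names are all hypothetical, modeled only on what the log describes: omit the flag that picks the targets, and the default target is everything.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical fleet tool: -n selects specific nodes, and omitting
    # it silently means "every node in the data center".
    p = argparse.ArgumentParser(prog="reboot-nodes")
    p.add_argument("-n", "--nodes", help="comma-separated node list")
    p.add_argument("command")
    return p

def target_nodes(args, all_nodes):
    # The dangerous default: no -n means the whole fleet, head node included.
    if args.nodes is None:
        return list(all_nodes)
    return args.nodes.split(",")

fleet = ["richmond-b-01", "richmond-b-02", "headnode"]
parser = build_parser()

# Intended invocation: reboot two compute nodes.
ok = parser.parse_args(["-n", "richmond-b-01,richmond-b-02", "reboot"])
print(target_nodes(ok, fleet))   # ['richmond-b-01', 'richmond-b-02']

# The slip: same command, -n forgotten. Now it targets everything.
oops = parser.parse_args(["reboot"])
print(target_nodes(oops, fleet))  # the entire fleet, including 'headnode'
```

A safer design makes "all nodes" an explicit opt-in (say, a `--all` flag) and refuses to run a destructive command with no target at all, so the forgetful invocation fails loudly instead of nuking the data center.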
Operator five, I guess, just wants to hear it directly: so you just rebooted the head node, or you rebooted all of us-east-1? And he confirms it: yes, here's what I did. And at this point the support personnel realize, wait a minute, something very bad is going on. So support comes into the chat: so all of us-east-1 is going to reboot? And yes, you just rebooted all of us-east-1. Um, yikes.
I have never had a more visceral feeling that I was going to throw up on my keyboard, because I knew that this was very bad, very, very, very bad, and that we were going into totally unknown territory, for a lot of reasons. I'll get to the reasons this was unknown territory in a second, but this was absolutely terrifying. This is not just "we're going to reboot the machine." And I'll show you, The Register ran it with the mushroom cloud on it. So first, before we talk about the details of why this was so terrifying: almost instantly, and this happens at, what, 20:14, so at 20:19 we're only five minutes into this, one of the operators says, hey, we're going to need a post-mortem on this. Of course we're obviously going to do a post-mortem. But I think this was kind of interesting: I've almost done exactly what he did a number of times.
And then I think this is interesting in terms of the emotional aspects of dealing with one of these outages: you know what, look, let's just work on recovery; just help us out. And I have to say, through the whole outage, the operator that did this was just as helpful as he or she could possibly be. They were as helpful as they could be. I think that's kind of interesting: look, just help us out. And it's not saying, hey, it doesn't matter, or it'll be fine, because it won't be fine and it does matter, it's bad. But don't worry about that right now; just help us recover. I do love this one: "I'm not seeing any alerts, why am I not seeing any alerts?" It's like, yeah, those alerts all depend on us-east-1 being up. That's why you're not going to see anything out of the system: the system has now gone totally, 100% dark. And then of course the tweets start coming. What, eight minutes into this, and the tweets are coming in. So why were we as lucky as we were? Or, how could this have been so bad, why was I so terrified, why was I going to throw up on my keyboard?
Because this is a cloud, and the software we have built for it, called Triton, open source software, is a complicated distributed system. In particular, the way we are able to expand the cloud and build it out quickly is that when we rack, stack, and cable a bunch of compute nodes, we don't actually install the operating system onto any of those nodes. They just PXE boot off of a head node, they get a platform image that they run RAM-resident, and that's what they run. Which is great; it's been a huge win in many, many dimensions. And then there's this dimension. Because in order to know even which platform image to boot a compute node with, the head node talks to a distributed system to figure out: okay, what compute node are you, what platform image are you running, and so on. It talks to a database and does a bunch of things; there are a bunch of components required to figure out what you should boot. And of course, the thing we are about to find out is how many of those components are sitting on machines that we can't boot, because they themselves aren't up.
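The trap described above, boot-time services that themselves live on nodes that cannot boot, can be sketched as a fixpoint computation. Everything here (host names, the service list) is hypothetical and only models the shape of the problem: power-cycle everything at once, and see which hosts can come back unaided.

```python
def surviving_hosts(boot_needs, runs_on):
    """Everything is power-cycled at once. Return the hosts that can come
    back up unaided; anything outside the result is stuck in a boot cycle."""
    hosts = set(runs_on.values()) | set(boot_needs)
    up = set()
    progressed = True
    while progressed:
        progressed = False
        for h in hosts - up:
            needs = boot_needs.get(h, [])
            # h boots only if every service it needs runs on a host
            # that is already back up.
            if all(runs_on[s] in up for s in needs):
                up.add(h)
                progressed = True
    return up

# Hypothetical boot chain: the compute node PXE boots via services that
# the head node consults -- but some of them run on a compute node.
boot_needs = {"headnode": [], "compute-node": ["dhcp", "boot-api", "config-db"]}
runs_on = {"dhcp": "headnode", "boot-api": "compute-node", "config-db": "compute-node"}
print(surviving_hosts(boot_needs, runs_on))  # {'headnode'}: the CN is stranded
```

The fixpoint converges with the compute node still down: it depends on services that only it can host, which is exactly the circular dependency that makes "reboot the whole data center" an untested, terrifying path.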
Now, we had considered this, for sure. We had considered it, we had tested it, but at nowhere near this scale. The second this happened, I had the honest-to-god fight-or-flight reaction where I couldn't think for about 10 seconds, because my brain just kind of went white. And as I realized that this could be a very serious issue, I thought back to a presentation I had seen by Mark Imbriaco, who's here, on, I believe, a Heroku outage. A very, very, very bad day at Heroku, an outage that lasted way more than eight hours, as I remember it was 67 hours. And I absolutely thought: there is like a 40% chance that we are in minute one of a 67-hour outage. I immediately planned for a 67-hour outage, mentally, in a lot of ways. One, I took a subset of engineers and asked them to assume that we can't boot any of these nodes, and go figure out how we boot a compute node with an image that we are going to handcraft and hand-roll; let's just assume the normal path is not going to work. I also began to immediately plan, and fortunately this happened in broad daylight, during working hours in California, but one of the very interesting points Mark had made about that outage was planning for sleep management, because sleep deprivation is a major judgment impairment.
And one of the neat things about judgment impairment, and this seems to be true however your judgment is impaired, whether from sleep deprivation, from drinking alcohol, from anything: the very first judgment to go is the meta-judgment that your judgment is impaired. I guess that's just the way judgment impairment works, a little twinkle in the eye of judgment impairment: I just like to roll this way, I like to mess with the world like this. You are convinced that you're fine, and no, no, your judgment is actually impaired. And in particular, and we've seen them, we've seen them recently, there are many incidents that happen at the end of a shift, when people are already tired, and then they stay up for another 12 hours. But you should be in bed, you should be asleep, and you should have handed off to someone else. So my mind was already going to all sorts of crazy places.
And fortunately, the whole thing came up, because the things we had designed actually worked. Everything actually worked the way it was supposed to, which, like, never happens, so no one was more shocked than I was. And in particular, I should have grabbed the text because it's pretty funny: among other things, joyent.com is offline, right? Everything in us-east-1 is gone, so joyent.com is gone. So we are in the process of failing the website over to another data center, just so we can tell people: yes, we know, we're on fire. We're in the process of failing that over, and then: that has successfully failed over, okay, thank God. I go to the website, and the engineers are like, oh, that's funny, because we haven't actually failed it over yet. Wait a minute: if you haven't failed it over and I can go to the website... we're going to live! We're going to live! Oh thank God, we're going to live. And so essentially the entire DC, due to some mechanisms that we had added to the system a long time ago, precautionary mechanisms for the event of a cascading failure, effectively came back up. Thank God. We lived.
And it wasn't that bad. I mean, people were down, it was bad in that everyone was down, and it was a news story and so on, but this could have been so much worse. So, so, so, so much worse. And it was, I think, a wake-up call, because we always know that things can be bad, but for me it was certainly the most stress I have ever had during an outage, where you had that honest-to-god feeling that I might not get out of this; we might never actually recover. I might go full Ma.gnolia on the cloud. Remember Ma.gnolia, those folks who had an outage from which they did not recover? I may be getting the exact technology wrong, but I believe it was a database failure from which they couldn't recover. So, wow, how did we get here?
How did a bunch of software engineers end up with such a high-stress activity? The nature of software has changed: it is increasingly being delivered as part of a service. That is the vector by which we deliver software, and it's increasingly automated. The way we deal with all this software, the way we configure it, the way we deploy it, the way we manage it, it's all automated, and that's great, right? But that power of automation is part of what cut against us in our outage. If that outage had been 10 years prior, it would have been like: okay, we're bouncing all the machines, but it's not a big deal, because these machines all have their own operating system to boot from. It's the fact that we had a much more sophisticated, automated system that made it so, so perilous. And our automation is not complete. We are not actually post-singularity, despite what the bots will tell you in chat. Humans are still in the loop, even if they're just developing the software, the humans are still in the loop. And I'm here to tell you: semi-automated systems, and they are all semi-automated systems, are fraught with peril. History tells us this over and over and over again.
It is like the curse of the intermediate skier, if you will. If you ski as an intermediate, you are most likely to kill yourself by running into a tree, because you now have the confidence to go faster, but not the ability. And semi-automated systems give us the confidence to do incredible things, but with humans still stuck in the loop, in ways we understand and in ways we don't. And human fallibility in a system that is more manual is actually safer; human fallibility in a semi-automated system is actually much more dangerous.
And look, it wouldn't be me if we were not going to talk about historical accidents, right? So clearly there are going to be some historical accidents in here. If you recognized this, you're probably Canadian, because it tells you all you need to know. Are you Canadian? Yes? You're nodding, like: yeah, yeah, I know you guys. It's okay, it's okay, welcome, Canadians. I love Canada, I spent a lot of time in Canada, and it tells you all you need to know about Canada that this is a source of national pride. You are looking at a source of national pride. How could that possibly be the case? This is what's known as the Gimli Glider. This is an aircraft, a 767-200, that had an impossible failure: it ran out of fuel at 30,000 feet.
And the reason it ran out of fuel at 30,000 feet is human fallibility. In particular, in Montreal they were cutting over from imperial units to metric units, and someone miscalculated the conversion between pounds and kilograms, so the aircraft had much less fuel on board than they thought. And the fuel gauge, by the way, and I found this interesting about the 767-200: the amount of fuel it thinks it has is the amount you tell it that it has when you take off. That's the size of the gas tank as far as it's concerned. So at 30,000 feet: oh, you have plenty of gas. Oh, you just lost one engine? Oh god, okay. Oh, you lost the other engine? Oh my god. This was never, ever conceived of for the 767-200. This is 1983, and this is one of the first glass cockpits; there are no cables in this aircraft. You know, an old pilot will tell you that the 707 was the greatest aircraft ever built, because it was the last one that had cables. This is all hydraulics, all glass cockpit, and it goes dark. They've lost their auxiliary power unit, they've lost all power. The only power they have on this aircraft is something called a ram air turbine, and talk about your power generator of truly last resort: it is a propeller that drops out of the plane and uses the airspeed to provide power to the aircraft.
You're definitely not supposed to fly on this thing. And by the way, one of the interesting aspects of this accident: as they slow down to glide this thing into a landing, and I say "glide", it's a 767, it does not glide. This is not a glider, it is a brick; aerodynamically this thing is a brick, and it was falling out of the sky. As they were losing airspeed, they were actually beginning to lose power in the aircraft as they were landing it. This thing lands on an old runway that had been turned into a drag strip. So you can imagine being at a flea market in Winnipeg as a silent 767 falls towards you. And if you're Canadian, you are now singing O Canada in your head, because it was an amazing act of airmanship that they landed the aircraft. They put a bunch of other pilots in simulators in this scenario, and they all crashed; it was a very unusual pilot who was able to land this. This was a scenario that they had never trained for, that they had never seen. The engineers were basically saying, and this is their equivalent of rebooting the data center: didn't really think that one was possible. Actually, turns out it is.
And this, again, is the peril of semi-automation: you've got the glass cockpit, but you've still got the human fallibility. Closer to home, the S3 outage probably affected many people in this room, the S3 outage from just a couple of months ago, from February. If you read the postmortem: they effectively bounced the wrong boxes. It was a human in the loop that specified the wrong machines. And the key is, we're always going to have a human in the loop. Don't think you're going to get the human out of the loop; we will always have a human in the loop, because there's a human writing the software at one end of it. So we're always going to have the peril of semi-automated systems. How do we manage that? How do we make these systems as reliable, as robust, as debuggable as we can?
And how do microservices fit in? I guess it's apropos here, given that the microservices track is being blown up next year, so I'll send microservices off by reflecting on the complexity they add to a system. Because the paradox of microservices is that we are attracted to them for their simplicity. As a developer, and many of us are developers, it's like: oh wow, I get to reason about a much smaller problem. Which is true, and great. But they yield a more complicated system. Which is not to say that microservices are wrong or misguided; they certainly can be when taken to an extreme, but it's more that we need to be cognizant, we need to understand, that we're building a distributed system. And don't add a microservice to a system because you don't want to work with another organization, which, by the way, I am convinced is like 40 percent or more of the microservices boom: inter-organizational strife. "Screw those guys, we'll build our own thing; it's a microservice." With microservices, no one has to be in charge, and I think some organizations have taken this as license to do whatever the hell they want, whenever they want. But then they are yielding a much more complicated system.
Open source has also exacerbated this. Now, obviously I love open source, it's critical, but it allows for many more kinds of choices, and it has increased complexity yet again. And the abstractions themselves are becoming more robust. They are robust, but as the abstractions become more robust, failures become rarer. That's great, but the failures that remain become worse. Again, this is that curse of the intermediate skier: as the abstractions become more robust, we are tempted to believe that the systems are infallible. And yet they are not infallible; in fact, they are most dangerous when we are thinking of them as infallible. And because we've got a distributed system here, when the whole system fails, you are likely going to have failures in disjoint subcomponents, disjoint services, disjoint microservices: one failure in one microservice inducing a failure in another, inducing a failure in another, a true cascading failure.
This was a tweet from a couple of years ago that got retweeted a bunch: "We replaced our monolith with microservices so that every outage could be more like a murder mystery." I think it's funny; I probably liked it or retweeted it myself. But look, anyone who's been in an outage knows that an outage does not feel like a murder mystery. It doesn't feel like Miss Marple sleuthing around a quaint English town to see whodunit. That is not what an outage feels like; that's what we would like it to be, right? This is production. Production is war, and war is hell. An outage in production does not feel like a murder mystery, especially as a big and complicated system begins to fail. It's often chaos: alerts firing everywhere, alerts that are misguided, some of them confusing, everyone trying to figure out what's going on, total chaos, every engineer saying to themselves please don't be me, please don't be me, please don't be me. That, to me, is what an outage is more like. And the modern software failure mode looks more like this one. Do you recognize this? This is the blackout of 1965, the great Northeast blackout.
Power systems are complicated. There was effectively a breaker, in Quebec or upstate New York, I don't remember, that was set at a dated level; it didn't account for the fact that normal loads could now go that high. It's November, there's a lot of lighting, a lot of heating, people are drawing a lot of power, and this particular intertie hits its breaker and turns itself off. One of the challenges of power distribution systems is that the power has got to go somewhere else. By the way, microservices have the same problem: your microservice decides to turn off, and that load goes somewhere else. And if you are going dark because of load, you have now gone dark at the worst possible time, and those other interties are much more likely to themselves go dark and generate yet more load for the rest of the system. So there's this fallacy: oh well, I've got a hundred of these, it doesn't matter if one of them dies. Well, it depends on why that one died.
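The load-redistribution dynamic described above can be shown in a toy simulation, with entirely made-up numbers: a tripped node's load has to go somewhere, and whether the survivors absorb it or trip in turn depends on how much headroom they had.

```python
def cascade(loads, capacity):
    """Toy model of the blackout pattern: when a node trips, its load is
    spread evenly over the survivors, which may push them over in turn.
    Returns (surviving loads, trip order)."""
    alive = dict(loads)
    tripped = []
    while True:
        over = [n for n, load in alive.items() if load > capacity]
        if not over:
            return alive, tripped
        for n in over:
            shed = alive.pop(n)
            tripped.append(n)
            if alive:
                # The power has got to go somewhere: survivors absorb it.
                per_node = shed / len(alive)
                for m in alive:
                    alive[m] += per_node
        if not alive:
            return alive, tripped  # total systemic failure

# Plenty of headroom: the overloaded node trips, the rest absorb it.
alive1, lost1 = cascade({"a": 60, "b": 60, "c": 60, "d": 60, "e": 110}, capacity=100)
print(lost1)   # only 'e' trips

# Little headroom: the same single trip cascades into a full blackout.
alive2, lost2 = cascade({"a": 90, "b": 90, "c": 90, "d": 90, "e": 110}, capacity=100)
print(lost2)   # every node goes dark
```

The point of the sketch is the one Bryan makes: the same single failure is harmless or catastrophic depending on the headroom of everything around it, which is why you have to understand why that one node died.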
Maybe it doesn't matter, but you should understand why it died, because if it died because it was overloaded, that might be the true canary indicating a much more serious problem, and now you've generated more load. You can get a runaway failure where you actually have total systemic failure. In that regard, I feel that this is actually the most apt metaphor for modern software systems and deployments. Do you know what this is? This is Three Mile Island. This is, I've heard, one of the very few photos taken inside the control room during the incident at Three Mile Island.
And here is the architecture. I'm going to give you some quick education on Three Mile Island, because you're going to see a lot of parallels. If we look over here to the right, you see that demineralizer and that condensate pump. This all started because of some routine maintenance, where they were running autovacuum on a Postgres shard... no, no, excuse me, I'm sorry. They were cleaning out the demineralizer. They use resins to demineralize the water, and these resins can kind of glob together and get stuck, and they couldn't clear the thing out. So they decide to shoot some air in there to get the resins unstuck. That air ends up getting into one of the control lines for the condensate pump, and the condensate pump goes offline. Which is actually okay, I mean, not okay, but it's fine, because there are auxiliary pumps that are supposed to start pumping water at that moment to cool the reactor. All three of the auxiliary pumps had their valves closed, because all three were offline for maintenance.
It's like they had not checked the database backups. Yes: they had not checked the database backups, something that we've seen as recently as the GitLab outage. This is a very common failure mode: when you have auxiliary systems, when you have backup systems, those systems are often not checked, and when they fail, they often fail silently. Or they fail to a state of: yeah, we know the auxiliary pump is offline, but that's fine, it's the auxiliary pump. Again, it's like that one microservice that crashes: it's fine, but you have begun to compromise your reliability. With all three offline, and one of the major findings was that the reactor should have been taken offline if all three were offline, the reactor begins to heat up. Again, not the end of the world, but, you know, this is getting a little bit exciting. And that number seven towards the top of the containment building is what's called the pilot-operated relief valve, which opens up to let the heat out of the reactor. And the reactor scrams: control rods go in, and the reactor stops, ideally, but it still has a lot of heat. Okay, so far so good.
Then the pilot-operated relief valve, and this is where we get to what I think is the analog for the way we think about our services: the pilot-operated relief valve is supposed to close when the temperature gets low enough. The temperature begins to get relatively low, but the pilot-operated relief valve does not close. Of note, the light on the control panel indicating that the pilot-operated relief valve is open goes dark. This was a UI disaster. That light goes dark not when the relief valve closes, which is what the operators believed, but when power has been applied to the solenoid to close the relief valve. "I told the relief valve to close," not "the relief valve is closed." The operators were being lied to. And they had other ways of figuring out what was going on.
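The indicator-light failure here, a UI that reports the command rather than the state, is easy to model. A minimal sketch with hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class ReliefValve:
    # Two distinct facts that the TMI panel conflated:
    commanded_closed: bool = False  # "we told the solenoid to close"
    actually_closed: bool = False   # what a position sensor would report

    def command_close(self, stuck: bool = False):
        self.commanded_closed = True
        if not stuck:                # a stuck valve ignores the command
            self.actually_closed = True

# The TMI-style indicator: dark as soon as the close command is sent.
def panel_light_dark(v: ReliefValve) -> bool:
    return v.commanded_closed        # lies when the valve sticks

# What the operators needed: an indicator driven by sensed position.
def honest_light_dark(v: ReliefValve) -> bool:
    return v.actually_closed

porv = ReliefValve()
porv.command_close(stuck=True)       # the valve sticks open
print(panel_light_dark(porv))        # True: panel says "closed"
print(honest_light_dark(porv))       # False: it is still venting coolant
```

The same trap shows up constantly in distributed systems: a deploy tool that reports "service restarted" because the restart RPC was sent, not because the service came back healthy. Report sensed state, not commanded state.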
But they didn't, for reasons that are actually very understandable, and in the final findings the operators were held blameless, because this was such a difficult condition to figure out. If you knew that the relief valve was actually stuck open, then everything made sense: you were losing all this steam, all this coolant. And notably, you see there was that drain tank down there, with its rupture disc. All of this coolant goes into that drain tank, the drain tank overflows, the rupture disc breaks, and there's water all over the bottom of the containment building. There are alarms going off about that, and all of those alerts are being ignored, because there are so many other alerts going off. And this is the whole idea of: oh, we monitor everything, we alert on everything. Monitor everything, alert on everything, sure, but don't think that your alerts and your monitoring are a substitute for understanding the way the system works. Sometimes we have this church of alerting and monitoring, and we believe that simply by monitoring something, simply by alerting on it, we have averted disaster. That's exactly what the nuclear industry thought in the late seventies. The nuclear industry, by the way, was destroyed by Three Mile Island and never recovered, even though the incident was actually not that serious. We've had many incidents in other domains that are much more serious; this incident was not that serious, but it did destroy nuclear power, for a lot of reasons.
I love this finding from the special report, because boy does this feel current for us: "It is a difficult thing to look at a winking light on a board, or hear a repeating alarm bell, and immediately draw any sort of rational picture of something happening," and there were several of them. The more alarms and alerts you have, the more you are creating sensory overload for the operator, and the more likely it is that the operator will be looking at an alert or an alarm that is symptom more than cause. This is not to say that you shouldn't monitor; obviously you should. Just don't fool yourself, don't delude yourself into thinking that you have avoided an outage as a result. So we suffer from many of these same problems. We suffer from the arrogance of nuclear power in the seventies. We are blithely deploying these distributed systems with this erroneous idea that they somehow can't fail, that they're infallible because they're running on more than one computer. It turns out there are things that can affect more than one computer. Trust me, I know. It is me, as they say.
So what does it mean? How do we actually debug these things when they fail? Well, one of the things we have to wrap our brains around is that we will need to debug them in production. As a software developer, you will need to. And I know, if you're a software developer, you're thinking: can this problem please just go away, can I please just go back to my own little world? The problem can't go away. We need to actually debug the software in production. And that prompts a deeper question: how do we debug at all? Debugging is the process by which we understand the system, and I think this is a very important difference between the way we should view debugging and the way it's been viewed traditionally. Debugging is not merely the act of making bugs go away; it is the act of understanding, of gaining new knowledge about the way the system behaves. That's what debugging is. And the behavior that we're understanding is a particular pathological behavior that we want to avoid.
idea of like understanding behavior in a
system yeah we've done that before
that's called science science we have
the natural world we want to understand
it we want a reason about it that
science now the natural world is real
pain in the ass we've got it like never
complain about being a software engineer
please, because we have it so incredibly easy. Things are so cheap and malleable for us versus in the real world. The real world is such a pain in the ass: you have to do big, expensive experiments, and you can't just, like, go into an experiment having no idea of what you're trying to do. You've got to be, almost by definition, very hypothesis-centric: you have to have a hunch that you're checking out, and that hunch, like, probably needs to be right, and then you need to design your experiment to invalidate that hunch. It's hard, it's expensive. We have got an incredible luxury: software is entirely synthetic. We live in this unbelievably luxurious system, a synthetic mathematical machine, and the conclusions that we draw are often not merely likely true but provably true, which is a tremendous luxury. And we can perform experiments so easily; it's so cheap to perform experiments. Even more importantly, when we have questions of our systems, we can simply ask. We wrote all of this software! We did actually write all of this; I know sometimes it doesn't feel like that, but we actually wrote all of this, like, we can reason about this. This is not a system that has been evolving for four billion years; this is a system in which every line of code you're looking at is less than, what, 60 years old, and most of it is a lot less, a lot younger than that. We can understand it, and we can understand software simply by
observing it. And I think this is one of the key things to realize about debugging, the art of debugging, because I think when you're around someone who understands how to debug very well, you might not understand what they're doing, and, by the way, they're kind of doing that intentionally. There's a little bit of legerdemain in the act of someone who can debug very well, in that, you know, they kind of disappear for a while and then, voila, I present the fix! It's like, how did you figure that out? Oh, it just appeared... oh, what's that, you have a coin behind your ear? Okay, there we go; Jimmy, would you like a balloon? So we kind of like to present this idea that we're, like, magicians. We are not magicians; we are the Wizard of Oz, sweating behind a curtain, frenetically turning a crank, trying to figure out the problem. Those that can debug well know this; those that debug poorly don't know this, because they debug poorly, and it stems from poor self-awareness. Debugging is the act of asking questions and answering them, not guessing what the answer is. You are playing 20 questions, and when I say 20 questions, I mean not 20 questions with my children, who are unbelievably annoying to play 20 questions with. I'll be like, okay, 20 questions, all right: "Is it a badger?" You know, like, how about: is it a mammal, or is it an animal, or is it a vegetable? "Is it a badger!" I'm like, no... don't you understand that...
"Is it a badger?" And you're like... they're just not getting it. And, I will confess, I have changed my answer in 20 questions just to, like, teach them the larger lesson of asking better questions. You want to form questions, not hypotheses, because too often what people do is: "Oh, I know what it is. I know this, it's over there. I fixed it, it's been pushed." Like, whoa, whoa, whoa: how did we go from you guessing what it was to having a fix pushed? "Oh, we didn't see it again." We didn't see it again? That does not actually prove your hypothesis correct. We should be asking questions, and as those questions give answers, those answers are facts, and those facts constrain hypotheses, and we repeat this process: answers to questions yielding more specific questions, more specific answers, more specific questions, more specific answers. And then that hypothetical leap is often not a leap at all; it's a step across a puddle. I've debugged a lot of nasty bugs in my life, and that has always been the process, and there have been many, many, many, many bugs where, even as I was nanometers away from the final cause, the root cause, I still couldn't believe it, and it was only when you actually got that final answer: oh my god, this is a microprocessor bug, or what have you.
So that is the way we debug: by having this cycle of questions and answers. And this takes many forms. It means asking: how do we ask questions of software, and in particular, for those of us that are software developers, how do we design our software to answer questions about itself? If the art of debugging is asking questions, the art of writing debuggable software is designing your software for someone to be able to ask questions of it. It means designing for postmortem debuggability: when this thing dies and I generate a core dump, have I left the breadcrumbs around my address space to make it easy, or at least easier, to figure out what happened? It's designing for in situ instrumentation, the ability to instrument something while it's running; it's thinking about, like, how would someone want to instrument this, and leveraging things like DTrace and so on. It's designing for post hoc debugging, debugging totally after the fact:
logging. Logging is very, very helpful, right? And obviously there's a peril to logging as well,
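One way to make logs actually debuggable after the fact, as a minimal sketch in plain Node.js (no external libraries; `logEvent` and its fields are illustrative, not anything from the talk): emit one structured JSON entry per line, carrying the context you would want during a postmortem.

```javascript
// Minimal structured-logging sketch: one JSON object per line, carrying
// the breadcrumbs (timestamp, pid, context) you'd want in a postmortem.
// `logEvent` and its field names are illustrative, not a real Joyent API.
function logEvent(level, msg, ctx) {
  const entry = {
    time: new Date().toISOString(), // when it happened
    pid: process.pid,               // which process said it
    level,                          // severity: 'info', 'warn', 'error'
    msg,                            // human-readable message
    ...ctx,                         // request ids, node names, operators...
  };
  console.log(JSON.stringify(entry)); // line-oriented: grep/jq friendly
  return entry;
}

// e.g.: logEvent('warn', 'reboot requested', { node: 'RM08', operator: 'op1' });
```

Because every line is a self-describing JSON object, the logs can be searched and correlated long after the incident, which is exactly the "after the fact" debugging being described here.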
but logging, and knowing what to log and when to log: that's the craft of writing debuggable software, and it's essential if you want to avoid debugging in production, because you really don't want to debug during an outage. Part of doing that is creating a culture of debugging. Again, when we view debugging as understanding the system, improving the system, not merely as the process by which problems are made to go away, we can enshrine debugging, we can think of it as a first-class operation, and we can encourage ourselves to actually root-cause things. If an engineer smells smoke, they should root-cause that: is that a fire in the kitchen, is that the coffee maker, or is that a fire raging in a coal seam, which I've discovered way too many times in my career? How could this
ever work? Have you ever said that to yourself? Oh my god, how did this... how did anything work? Turns out it worked by accident. Like, oh my god, how did this ever work? The reason you got there is because there was an anomaly, a phenomenon out there, that you were able to completely root-cause. But we must have an organizational culture that encourages that. It must not be perceived as, like, "oh, that's your 20% project"; it's like, I want to punch you in the face. That is not, like, a free-time activity; that is a core part of the software we are delivering, and we've got to take the extra time to build for debuggability. This is part of a very radical, controversial idea I'm calling "do it right the first time." "No, no, no... do it right the first time?" "Oh, never!" Now we don't even do it right the 15th time. It's like, try doing it right the first time.
Rewrites are very, very expensive, and the cost of the rewrite is never borne by those who took on the technical debt that induced it; they're off at other jobs, to put it bluntly. And if you're wondering why, when I look at a resume and I see 18 months, 18 months, 18 months, 18 months, 18 months, I want some pretty good answers. There can be good answers, there can be lots of good answers: failed startup, crooked founder, crooked founder, failed startup... okay, well, all right, I've been there, that's fine. But what also happens is that people make a bunch of bad decisions and move on to the next job, and those that bear the cost of those decisions are left silently screaming in their wake. So we have to have a culture that encourages us to do things correctly the first time. So how do we do
during an outage? Well, an outage is when developers and operators discover who they actually are. When we say DevOps, and, you know, hug ops, I'm like, yeah, that's great, I'm all about hug ops, but at the end of the day, I'm sorry: you are a developer, or you're an operator. We can get very close together, we can have very tight empathy for one another, and we should, but this is the moment when you'll know what you are. The moment you've got a service, and the service is up but impaired, and you know that restarting the service will restore service to your customers, but it will lose the information such that you cannot debug it: if you're an operator, you'll be like, look, gotta restart it, and if you're a developer, you'll be like, look, gotta debug it. Really, you need to not get trapped into a false dichotomy; you need to both restart it and
debug it, and find ways: how do we resume service without losing information? What degree of service can we resume with minimal loss of information? We do not want to overemphasize recovery with respect to understanding the way the system works. We don't want to so believe in recovery that we actually no longer understand the system, and that we believe that broken software can simply be made up for by restarting everything all the time. That's called Windows, and humanity did that experiment, and it didn't work, okay? So we actually need to understand the way these systems work. If we enshrine recovery in lieu of actually understanding broken software, we normalize broken software. Software should not be broken; uniquely among the things we make, software should not be broken. It's a mathematical machine; as I said before, it does not wear out. Get it right, please: we uniquely can get it right. And if you ingrain this idea that software is always broken, there are all sorts of bad things that come from it, like this idea that
software should tolerate bad input. No: software should fail, fail explicitly. If you want to troll any Joyent engineer into not being productive for the rest of the day, say, hey, did you know you can do "npm isntall" instead of "install" and it will actually work? It's, like, this is one where the bot just cheerfully goes ahead and does it. And should software recover from fatal failures? No, it shouldn't. If you have an uncaught exception, it is your duty to humanity to die: generate all of your state, present your embalmed carcass to Quincy, M.E. What, I cannot make a Quincy, M.E. reference? I'm sorry... you have no idea who Quincy, M.E. is. It's not your fault, those of you who are millennials; not your fault, it was not a very good show. But die, so we can actually go debug it. We don't want to catch SIGSEGV in a signal handler; that is, that is for a sophomore computer science project;
that's not the way we should actually
develop software so the idea of software
should not assert the correctness of its
state because it keeps dying as recently
as Linux right Linux getting rid of bug
on because it's causing too many crashes
like these assertion failures are
compromising our reliability it's like
okay do you want me to connect the dots
for you so we're getting rid of them
just like oh god oh god
Oh Lennox you arranged me you mean
Camille in Excel but I'm all right all
of these anti patterns impede debug
ability and I promise I'm hopping up
here in terms of after an outage we need
truly complete understanding in word and
yes, a postmortem. I mean, of course you want a write-up of a postmortem, but it's so much more important that you completely understand what happened than it is that you write it up; it's that writing it up forces you to completely understand it. The write-up is to force complete understanding. We need to completely understand, and it's very hard, especially with cascading failure, to actually unwind everything that went wrong. You're going to ask yourself: why did these two things fail? They were up, but they were giving errors; let's figure out why. And you often will, again, discover yet more coal fires, yet more things that work by accident. And you've got to encourage software engineers to understand our own failures, to encourage them to design for debuggability; that gets us to true software
robustness. We need to differentiate operational failure, where something went wrong in the system at large, from programmatic failure. On programmatic failure, you need to actually die; operational failure needs to actually be handled.
So look, it's always going to be stressful; you're always going to want to throw up on your keyboard when a service is down. I'm not here to make this easy; it's not easy. And when a service is down, we've got some tough choices to make. I think: plan for a long outage, and plan to be able to debug it. Be careful when you step; missteps are really, really costly. Often, a failure mode that I see in outages is that they are exacerbated by an early misstep. Look at the GitLab outage: the wrong database was deleted in the GitLab outage. Why was the wrong database deleted? Because they were having another problem; they were already in another problem that was, relatively speaking, a more minor problem, and the famous "operator one" deleted the wrong database; that happened while they were doing a backup. Yes, you want to get the system back up, but
it's worthwhile to say: okay, everybody away from the keyboard. Can we have a conf call, please? Let's have a conf call, and let's just get some ideas out there: what are some questions that we can go ask? If at Three Mile Island they had done that, everybody away from the console for a second, let's take a second and really think, they would have actually figured out that the pilot-operated relief valve was open. Indeed, it took the shift change to figure it out, because fresh minds came in, like: wait a minute, this has been going on...
Parallelize by having different teams seek different avenues of investigation, and view these outages as an opportunity: they allow us to understand our software and, more importantly, to develop a culture around software that values debuggability. Now, if somehow you have not reached your LD50 of me, and I take no offense if I'm just inducing your own fight-or-flight reaction, but if you haven't, and you love debugging, there are definitely lots of places that do this, but you should come work for Joyent. If you love debugging, Joyent is hiring; actually, Joyent is hiring even if you don't love debugging, but if you love debuggability, that is something that we value very, very much, and I hope that you find that in an organization, wherever you are. Enough of me... well, you could do a little more me, but not too much more: we've got a session up there with Bridget. And with that, thank you very much.