Good Evening! <pause for response>
How are we doing tonight? <pause for response>
I only have 15 minutes here, so we're gonna go fast.
This presentation and the presenter notes will be available at nyefan.org if you want to see it again
The code for this presentation is available at github.com/Nyefan/Presentations
Developers! <short pause>
I'm gonna make you think about operations tonight.
Your devops engineers are gonna love me.
<hand to ear> Sorry, it's Site Reliability now?
<smiling, slightly slower, and with satisfaction> Platform Engineering
How many process and operations engineers of various flavors do we have tonight? Raise your hands!
You guys already know this, you can go to sleep for the next 13 minutes and 15 seconds
<eat the mic, lower voice conspiratorially> This conference is quarterly, so that's 53 minutes if you come to all 4
Developers, we're gonna talk tonight about a number of action items you can take to make your infrastructure teams' lives a hell of a lot easier and make your software more robust and reliable in the process
Without further ado, here is...
10 Things I Hate About You.
Well, 6, actually - we don't have time for 10
How many of you work in Kubernetes shops?
How many of you have heard of the kubernetes pod lifecycle?
<if any hands stay up> good on you - you're the good eggs
<else> yeah, that's what I thought - but you're about to learn enough to be dangerous
One of the most important functions of kubernetes
(aside from providing a standard interface over common deployment structures like networking, resource allocation, workload isolation, process environment, installed libraries...)
Ok fine, *an* important function of kubernetes
is scaling workloads to meet demand and recovering failed workloads
In order to do this without corrupting data or sending customers to 502 pages unnecessarily
Services need to respond correctly to the process signals kubernetes uses to manage workloads
This is very simple in concept, but it's often forgotten or poorly communicated, so it gets left on the table
Just respond to SIGTERM by shedding any stateful load to other instances of the service and shutting down
As an added bonus, this will make your rolling deployments much faster, as the kubelet doesn't have to wait 30s to kill each pod
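Here's a minimal sketch of what that can look like in Python - shed_load is a hypothetical stand-in for however your service hands work off to its peers, and if your framework has its own lifecycle hooks, wire the same logic into those instead:

```python
import signal
import threading
import time

shutdown = threading.Event()

def handle_sigterm(signum, frame):
    # kubernetes sends SIGTERM first, waits terminationGracePeriodSeconds
    # (30s by default), and only then SIGKILLs whatever is still running
    shutdown.set()

def shed_load():
    # hypothetical stand-in: hand sessions and in-flight work to healthy peers
    print("draining: handing stateful load to other instances")

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutdown.is_set():
    time.sleep(0.1)  # stand-in for a unit of real work

shed_load()  # exit cleanly so the kubelet never has to SIGKILL us
```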
If you're not using kubernetes, don't worry, I have some work for you, too
All too often, I see service crashes or deployment rotations lead to data loss or inconsistency
because network calls between services or to external providers are written with naive "at most once" behavior.
Just throwing bits out into the ether and hoping they get where they're going
(well when you put it that way, it sounds like fun)
But it's not very consistent, and consistency is - to my thinking - the most important part of software engineering
So every network call should be stuffed into a retry loop of some kind
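Something like this - just a sketch, where send is whatever client call you're wrapping; and remember that at-least-once delivery means the receiving side has to tolerate duplicates, so keep your handlers idempotent:

```python
import time

def send_at_least_once(send, payload, attempts=5):
    # keep trying until we get an explicit success,
    # instead of firing once into the ether and hoping
    for attempt in range(attempts):
        try:
            return send(payload)
        except Exception as err:
            last_error = err
            time.sleep(0.001)  # placeholder - see the backoff scheme in a minute
    raise last_error  # out of retries - which is where the next point comes in
```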
And this is very important...
Failed requests should be sent to a dead-letter queue!
This is important for detecting incidents: if you don't know requests are failing, you can't respond to the issue
It is important for diagnosing incidents: if you don't know *which* requests are failing, you can't fix the issue
And it's important for mitigating incidents: you can replay the failed events or reset the event queue to just before failures began
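Building on the retry sketch from a moment ago - dead_letter here is a hypothetical stand-in for a produce call to SQS, Kafka, or whatever queue your shop runs:

```python
import time

def send_or_dead_letter(send, dead_letter, payload, attempts=5):
    for attempt in range(attempts):
        try:
            return send(payload)
        except Exception as err:
            last_error = err
            time.sleep(0.001)  # placeholder backoff
    # retries exhausted: park the failed request somewhere durable,
    # with enough context to detect, diagnose, and replay it later
    dead_letter({
        "payload": payload,
        "error": repr(last_error),
        "failed_at": time.time(),
    })
```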
Just don't give me too much of a good thing
When you're retrying, make sure to put in some exponential backoff on your requests
(The math nerd in me is also compelled to point out that most supposed "exponential backoff" schemes are in fact, geometric rather than exponential, but I digress)
This is my default backoff scheme, and I want to point out a few features that mitigate the impact of mass disconnect events
External DDoS attacks are child's play to mitigate, but self-inflicted thundering herds are far more difficult
First, we wait 1 to 3 milliseconds before the first retry - there are times you'll want to wait more or less, but I find that to be a good default
Second, we have a maximum retry delay, usually not more than a second in a backend service, but frontend can be 60 seconds or more
Third, I want you all to focus on this JITTER_FACTOR_MS
If everyone is disconnected at once, either because of an internal network blip, a session cache invalidation, or anything else
They will retry in waves without a jitter factor, and it will take longer to recover
By skewing the retry time on a per-client basis, you'll get a more consistent throughput of retries and will be less likely to overwhelm your downstream services
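Here's the shape of it in Python - the constants are the defaults I just described, though the exact numbers are assumptions for you to tune; drop the generator into the retry loops from earlier in place of that fixed sleep:

```python
import random

INITIAL_DELAY_MS = 2    # first retry lands after ~1-3ms
MAX_DELAY_MS = 1_000    # cap around a second for a backend service
MULTIPLIER = 2          # (the "geometric, not exponential" part)
JITTER_FACTOR_MS = 100  # per-client skew that breaks up retry waves

def backoff_delays(max_attempts):
    # yields sleep times in seconds: geometric growth, capped,
    # with uniform random jitter so disconnected clients don't
    # all come back in synchronized waves
    delay = INITIAL_DELAY_MS
    for _ in range(max_attempts):
        yield (delay + random.uniform(0, JITTER_FACTOR_MS)) / 1000.0
        delay = min(delay * MULTIPLIER, MAX_DELAY_MS)
```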
And don't worry about what exactly these values should be - your operations engineers will know how to set these levers so long as you provide them.
Now on the subject of levers
One of the most common ways I see deployments fail in higher environments or cause incidents in production
is a new config variable added to a service with a hardcoded default value
which was valid for local development and the dev environment, but then failed in staging or production
If you add a config variable, please just raise an Exception if the value can't be loaded from the environment
Then it will fail loudly in dev if the release team doesn't set it, and the bug will never hit customers
Well, as long as you log the failure, at least
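In practice that's a few lines - PAYMENT_API_URL is just an example name here, not anything real:

```python
import logging
import os

log = logging.getLogger(__name__)

def require_env(name: str) -> str:
    value = os.environ.get(name)
    if value is None:
        # log it AND raise it: the failure shows up in the aggregator,
        # and the service crashes at startup instead of limping along
        # with a default that's only valid in dev
        log.error("required config variable %r is not set", name)
        raise RuntimeError(f"required config variable {name!r} is not set")
    return value

PAYMENT_API_URL = require_env("PAYMENT_API_URL")  # example name
```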
I can't tell you how many times I've seen some variation of "Exception: An Error Occurred" in ostensibly production software.
Give me a stack trace at least, if not a full (anonymized) stack frame
but also, learn how your log aggregation tools work
You can save your incident management team a lot of time responding to issues just by making your logs conform to
the expected structure of whatever log aggregator you're using
The specifics are very organization-dependent, so I just have a few examples of best practices up here
but the less time we spend writing yet another service-specific yaml parser, the more time we can spend speeding up your builds
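As one sketch of the general shape - one JSON object per line, which most aggregators can ingest without custom parsing; the exact field names here are assumptions, so match whatever yours expects:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    # one JSON object per line - a shape most aggregators can
    # ingest without anyone writing a service-specific parser
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("checkout").info("order accepted")  # example logger/message
```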
<sit down, bring down the energy of the room> Finally, let me tell you a story.
I spent some time working for a company that had a compelling mission, a solid engineering team, and an actual product in customers' hands
This company was well set up to push through to profitability and acquisition within their funding runway
All we had to do was push features required by new customers and keep everything scaling up without too much distress
<half laugh> I say "all" like it's that simple - we also had to get stupendously lucky, but that's a given
Unfortunately, this company had a serious liability
The Director of Engineering was a twat.
He would spend an hour every morning on the engineering-department-wide stand-up, pontificating about his supposed
engineering prowess, berating every engineer as they walked through their previous day's work,
and stamping his feet like a child.
Well, I say *every* engineer, but that's not strictly true - there were some of us who he never yelled at
He never yelled at me, for instance.
He was, after all, under the impression that I was a neurotypical, perfectly abled, straight, cisgender, white man
One out of six... isn't the worst possible score... on a true or false test
I was locked in because of a signing bonus I had taken, so I didn't say anything at first
After some time, I took it to HR and made it clear that if he ever yelled at *me*, I would quit on the spot regardless of the signing bonus
But I never said anything during the standups - I should have, but I didn't
As the person who made hiring and firing decisions, he had too much power over me
Eventually, a CTO position was created to promote him out of the way, and my manager got his job
She was brilliant as both an engineer and a project manager, but the damage had already been done.
The employees were too disaffected, and most everyone left as soon as their year was up
Most of their staff have been laid off since then, they haven't managed to get another round of seed funding,
and their website barely even exists anymore
That's not really the happy ending we're looking for, is it?
So let's make it happy - this is a demonstration of how tolerating bigotry in the workplace can wreck a company
but it's also a story of how a diverse team can build something great in the first place
and how a diversity of experience that isn't limited to tooling, frameworks, and design patterns allows
us to create better software. My last name
has a period and a space in it, and you better be damn sure no product I've touched assumes names have any
universal rules or that people's genders are immutable or binary
By seeking out a diversity of experience, we ensure a corresponding diversity of tactics,
allowing us to deliver better software today and organize a better world tomorrow.
That! is a much happier ending
Thank you for your time.<exit stage right>