Techlunch #22: Horror stories (27/06/2018)

27/06/2018

Warning: these notes are published raw, without any rewriting.


Talk #1 Breaking thousands of customers with 1 bug (Algolia)

Context: Black Friday. ~70 commits to deploy.

Staging was OK. The deploy to prod broke everything.

The bug: message corruption

The corrupted message was persisted to disk, so the service kept crashing even after rolling back the version. The bug was not seen on staging because the 3 nodes there were updated simultaneously, whereas prod uses a rolling deploy.

36 machines had to be fixed by removing the corrupted message.

3h to fix prod.

Lessons:

deploy small diffs
use the same deployment process on staging as on prod
more validation in the code
ask for help
communicate (support)
give regular updates
postmortem
- blameless
- impact
- timeline (scenario producing the problem)
- action items
- what went well/wrong (monitoring, revert procedure)
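
The "more validation in the code" lesson can be illustrated with a minimal sketch. The notes do not say what the message format was or how it got corrupted, so everything here is an assumption: the JSON payload, the CRC32 prefix, and the function names `encode_message`, `decode_message` and `replay` are all hypothetical. The idea is only that a persisted message carries a checksum, and that replay quarantines a bad message instead of crashing on it.

```python
import json
import zlib


def encode_message(payload: dict) -> bytes:
    """Serialize a payload and prefix it with a 4-byte CRC32 checksum."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    checksum = zlib.crc32(body)
    return checksum.to_bytes(4, "big") + body


def decode_message(raw: bytes) -> dict:
    """Return the payload, or raise ValueError if the message is corrupted."""
    if len(raw) < 4:
        raise ValueError("message too short")
    expected = int.from_bytes(raw[:4], "big")
    body = raw[4:]
    if zlib.crc32(body) != expected:
        raise ValueError("checksum mismatch")
    return json.loads(body.decode("utf-8"))


def replay(stored_messages: list) -> tuple:
    """Replay persisted messages; collect indexes of corrupted ones
    instead of letting a single bad message crash the process."""
    valid, quarantined = [], []
    for index, raw in enumerate(stored_messages):
        try:
            valid.append(decode_message(raw))
        except ValueError:
            quarantined.append(index)
    return valid, quarantined
```

With a check like this, a rollback would still find the bad message on disk, but the node could skip it and report it rather than crash on every restart.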

Talk #2 Deleting a customer’s account instead of a feature flag (Aircall)

Stack: 4 EC2, 1 RDS, monolithic code.

Problem: confusion between “delete company feature” and “delete company”

Solution: Restart EC2, restore RDS

They were in SF and the client was in Europe, so they had an 8h head start to fix the problem.

Problem: recovering the external resources, which had been deleted outright. They had to open tickets with the different providers.

Lessons

DB Backup!
improve backoffice design
soft delete external resources
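
The last two lessons can be sketched together, assuming a backoffice action that requires typing the company name before deleting, and external resources that are only marked as deleted and purged after a grace period. Nothing here comes from Aircall's code: `ExternalResource`, `delete_company`, `restore`, `purgeable` and the 30-day grace period are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ExternalResource:
    """Hypothetical record of a resource held at a third-party provider."""
    resource_id: str
    deleted_at: Optional[datetime] = None


def delete_company(company_name: str, typed_confirmation: str,
                   resources: List[ExternalResource], now: datetime) -> None:
    """Soft delete a company's resources, only if the operator typed its name."""
    if typed_confirmation != company_name:
        raise PermissionError("confirmation does not match the company name")
    for resource in resources:
        resource.deleted_at = now  # mark only; nothing is removed at the provider


def restore(resources: List[ExternalResource]) -> None:
    """Undo a soft delete while the grace period has not expired."""
    for resource in resources:
        resource.deleted_at = None


def purgeable(resources: List[ExternalResource], now: datetime,
              grace: timedelta = timedelta(days=30)) -> List[ExternalResource]:
    """Resources whose soft delete is older than the grace period (assumed 30 days)."""
    return [r for r in resources
            if r.deleted_at is not None and now - r.deleted_at >= grace]
```

The actual call to the provider would only happen for what `purgeable` returns, which leaves a window to undo a mistake without opening tickets.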

Talk #3 My first steps as a Lead Developer (Madmoizelle)

The microbreaks nightmare

Podcast feed generation was slow, but working. Released it in production.

Problem: 503 for 1 second, every hour.

Solution: use New Relic to monitor

Lessons:

track prod
don’t “do it later”. It’s never done.
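
As an illustration of what "track prod" can surface, here is a minimal sketch that checks whether 503 errors cluster at the same minute of every hour. It does not use New Relic or its API. It assumes the error timestamps have already been extracted from access logs, and the function names and the 0.8 threshold are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import List


def errors_by_minute_of_hour(timestamps: List[datetime]) -> Counter:
    """Count errors by their minute within the hour (0-59)."""
    return Counter(ts.minute for ts in timestamps)


def looks_hourly(timestamps: List[datetime], threshold: float = 0.8) -> bool:
    """True if at least `threshold` of the errors share the same minute of the hour."""
    if not timestamps:
        return False
    counts = errors_by_minute_of_hour(timestamps)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(timestamps) >= threshold
```

A strong cluster at one minute points to something scheduled, such as a periodic job, rather than random load.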

The redesign

Problem: mobile redesign. Bought a template, had trouble adapting it. Unrealistic deadlines.

Lessons:

be realistic
have a real panel for UX testing
put your users in the center of your actions