Techlunch #22: Horror stories (27/06/2018)
Talk #1 Breaking thousands of customers with 1 bug (Algolia)
Context: Black Friday. ~70 commits to deploy.
Staging was OK. The deploy to prod broke everything.
The bug: message corruption
But the corrupted message was persisted on disk, so rolling back to the previous version still crashed. Not seen on staging because its 3 nodes were updated simultaneously (prod uses a rolling deploy).
36 machines to fix by removing the corrupted message.
3h to fix prod.
- deploy small diffs
- use the same deployment process on staging as on prod
- more validation in the code
- ask for help
- communicate (support)
- give regular updates
- timeline (scenario producing the problem)
- action items
- what went well/wrong (monitoring, revert procedure)
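To illustrate "more validation in the code": a minimal sketch, not Algolia's actual fix, of checksumming a message before it is persisted and quarantining corrupted records on replay instead of crashing. The names encode, decode and replay and the JSON record layout are assumptions made for this example.

```python
import hashlib
import json
from typing import Optional


def encode(payload: dict) -> bytes:
    """Serialize a message together with a SHA-256 checksum of its body."""
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return json.dumps({"sha256": digest, "body": body}).encode("utf-8")


def decode(raw: bytes) -> Optional[dict]:
    """Return the payload, or None if the record is corrupted."""
    try:
        record = json.loads(raw.decode("utf-8"))
        body = record["body"]
        expected = record["sha256"]
    except (ValueError, KeyError, TypeError):
        # Not valid UTF-8, not valid JSON, or not the expected layout.
        return None
    if not isinstance(body, str):
        return None
    if hashlib.sha256(body.encode("utf-8")).hexdigest() != expected:
        return None
    try:
        payload = json.loads(body)
    except ValueError:
        return None
    return payload if isinstance(payload, dict) else None


def replay(records):
    """Replay persisted records; collect the indexes of corrupted ones."""
    good, bad = [], []
    for index, raw in enumerate(records):
        payload = decode(raw)
        if payload is None:
            bad.append(index)  # quarantine instead of crashing the node
        else:
            good.append(payload)
    return good, bad
```

With a check like this, a node restarted after a rollback can skip the bad record and report it, rather than crash again on the same data.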
Talk #2 Deleting a customer’s account instead of a feature flag (Aircall)
Stack: 4 EC2, 1 RDS, monolithic code.
Problem: confusion between “delete company feature” and “delete company”
Solution: Restart EC2, restore RDS
They were in SF and the client in Europe, so they had an 8h head start to fix the problem.
Problem: recovering the external resources, which had been deleted outright. They had to open tickets with the different providers.
- DB Backup!
- improve backoffice design
- soft-delete external resources
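The last two points can be sketched together: a minimal example, assuming a hypothetical Company model (not Aircall's code), where removing a feature and deleting an account are separate operations, account deletion requires retyping the company name, and deletion only sets a timestamp so it can be undone.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Company:
    name: str
    features: set = field(default_factory=set)
    deleted_at: Optional[datetime] = None  # None means the account is live


def remove_feature(company: Company, feature: str) -> None:
    """Disable one feature flag; never touches the account itself."""
    company.features.discard(feature)


def delete_company(company: Company, typed_name: str) -> None:
    """Soft-delete an account, only if the operator retyped its exact name."""
    if typed_name != company.name:
        raise ValueError("confirmation does not match the company name")
    company.deleted_at = datetime.now(timezone.utc)


def restore_company(company: Company) -> None:
    """Undo a soft delete; possible because nothing was physically removed."""
    company.deleted_at = None
```

The same idea applies to external resources: mark them as deleted and release them at the provider only after a grace period.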
Talk #3 My first steps as a Lead Developer (Madmoizelle)
The micro-outages nightmare
Podcast feed generation was slow, but working. Released it in production.
Problem: 503 for 1 second, every hour.
Solution: use New Relic to monitor
- track prod
- don’t “do it later”. It’s never done.
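As an illustration of "track prod" (the talk itself relied on New Relic, not on this): a minimal sketch that counts 503 responses per minute-of-hour from access-log entries, which makes an hourly pattern stand out. The function name and the (timestamp, status) entry shape are assumptions for this example.

```python
from collections import Counter
from datetime import datetime


def errors_by_minute_of_hour(entries, status=503):
    """Count responses with `status`, grouped by minute-of-hour (0-59).

    `entries` is an iterable of (datetime, int) pairs, e.g. parsed from
    an access log. This input format is hypothetical.
    """
    counts = Counter()
    for timestamp, code in entries:
        if code == status:
            counts[timestamp.minute] += 1
    return counts
```

If nearly all errors fall in the same minute, a job scheduled every hour (here, the feed generation) is a likely suspect.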
Problem: mobile redesign. Bought a template, had trouble adapting it. Unrealistic deadlines.
- be realistic
- have a real panel for UX testing
- put your users in the center of your actions