Techlunch #22: Horror stories (27/06/2018)


Warning: these notes are published raw, without any rewriting.

Talk #1 Breaking thousands of customers with 1 bug (Algolia)

Context: Black Friday. ~70 commits to deploy.

Staging was ok. Deploy on prod broke everything.

The bug: message corruption

But the corrupted message was persisted on disk, so the nodes kept crashing even after rolling back the version. Not seen on staging because the 3 nodes there were updated simultaneously (on prod: rolling deploy).

36 machines to fix, to remove the corrupted message.

3h to fix prod.


  • deploy small diffs
  • use the same deployment process on staging as on prod
  • more validation in the code
  • ask for help
  • communicate (support)
  • give regular updates
  • postmortem

    • blameless
    • impact
    • timeline (scenario producing the problem)
    • action items
    • what went well/wrong (monitoring, revert procedure)
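
The "more validation in the code" point can be illustrated with a minimal sketch. It assumes messages are stored with a checksum, and that a corrupted record is quarantined on replay instead of crashing the node. The talk did not describe Algolia's actual message format or fix; the names encode_message, decode_message and replay are hypothetical.

```python
import hashlib
import json


def encode_message(payload: dict) -> bytes:
    """Serialize a payload and prepend a SHA-256 checksum of its bytes."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(body).hexdigest().encode("ascii")
    return digest + b"\n" + body


def decode_message(raw: bytes) -> dict:
    """Return the payload, or raise ValueError if the record is corrupted."""
    digest, sep, body = raw.partition(b"\n")
    if not sep or hashlib.sha256(body).hexdigest().encode("ascii") != digest:
        raise ValueError("corrupted message: checksum mismatch")
    return json.loads(body)


def replay(records: list[bytes]) -> tuple[list[dict], list[int]]:
    """Decode persisted records, quarantining corrupted ones instead of crashing."""
    good, quarantined = [], []
    for index, raw in enumerate(records):
        try:
            good.append(decode_message(raw))
        except ValueError:
            # Assumption: skipping and reporting is acceptable for this data.
            quarantined.append(index)
    return good, quarantined
```

With this kind of check, a bad record persisted on disk would be reported by index rather than crash every version of the code that reads it, which is what made the rollback ineffective in the story above.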

Talk #2 Deleting a customer’s account instead of a feature flag (Aircall)

Stack: 4 EC2 instances, 1 RDS database, monolithic code.

Problem: confusion between “delete company feature” and “delete company”

Solution: Restart EC2, restore RDS

They were in SF and the client in Europe, so they had an 8h head start to fix the problems.

Problem recovering the external resources, which had been deleted outright. Had to open tickets with the different providers.


  • DB Backup!
  • improve backoffice design
  • soft-delete external resources
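
Soft deletion of external resources could look like the sketch below. It assumes a resource is only marked as deleted, and is released at the provider once a grace period has passed. The ExternalResource model, the 30-day GRACE_PERIOD and the release callback are all hypothetical; the talk did not describe Aircall's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional

GRACE_PERIOD = timedelta(days=30)  # assumed retention window, not from the talk


@dataclass
class ExternalResource:
    """A resource held at a third-party provider (hypothetical model)."""
    resource_id: str
    deleted_at: Optional[datetime] = None

    def soft_delete(self, now: datetime) -> None:
        # Only mark the resource; nothing is released at the provider yet.
        self.deleted_at = now

    def restore(self) -> None:
        self.deleted_at = None

    def is_purgeable(self, now: datetime) -> bool:
        return self.deleted_at is not None and now - self.deleted_at >= GRACE_PERIOD


def purge_expired(
    resources: list[ExternalResource],
    now: datetime,
    release: Callable[[str], None],
) -> list[ExternalResource]:
    """Release at the provider only the resources past the grace period."""
    kept = []
    for resource in resources:
        if resource.is_purgeable(now):
            release(resource.resource_id)
        else:
            kept.append(resource)
    return kept
```

A mistaken "delete company" would then only set deleted_at, and restore() would undo it without opening tickets with each provider.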

Talk #3 My first steps as a Lead Developer (Madmoizelle)

The microbreaks nightmare

Podcast feed generation was slow, but working. Released it in production.

Problem: 503 for 1 second. Every hour.

Solution: use New Relic to monitor


  • track prod
  • don’t “do it later”. It’s never done.

The redesign

Problem: Mobile redesign. Bought a template, had trouble adapting it. Unrealistic deadlines.


  • be realistic
  • have a real panel for UX testing
  • put your users in the center of your actions