this post was submitted on 14 Dec 2023
125 points (100.0% liked)

Asklemmy

1454 readers
113 users here now

A loosely moderated place to ask open-ended questions

Search asklemmy 🔍

If your post meets the following criteria, it's welcome here!

  1. Open-ended question
  2. Not offensive: at this point, we do not have the bandwidth to moderate overtly political discussions. Assume best intent and be excellent to each other.
  3. Not regarding using or support for Lemmy: context, see the list of support communities and tools for finding communities below
  4. Not ad nauseam inducing: please make sure it is a question that would be new to most members
  5. An actual topic of discussion

Looking for support?

Looking for a community?

~Icon~ ~by~ ~@Double_A@discuss.tchncs.de~

founded 5 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
[–] dan@upvote.au 11 points 11 months ago* (last edited 11 months ago) (1 children)

I broke the home page of a big tech (FAANG) company.

I added a call to an API created by another team. I did an initial test with 2% of production traffic + 50% of employee traffic, and it worked fine. After a day or two, I rolled out to 100% of users, and it broke the home page. It was broken for around 3 minutes until the deployment oncall found the killswitch I put in the code and turned it off. They noticed the issue quicker than I did.

What I didn't realise was that only some of the methods of this class had Memcache caching. The method I was calling did not. It turns out it was running a database query on a DB with a single shard and only 4 replicas, that wasn't designed for production traffic. As soon as my code rolled out to 100% of users. the DBs immediately fell over from tens of thousands of simultaneous connections.

Always use feature flags for risky work! It would have been broken for a lot longer if I didn't add one and they had to re-deploy the site. The site was continuously pushed all day, but building and deploying could take 45+ mins.

[–] jjjalljs@ttrpg.network 6 points 11 months ago

Always use feature flags for risky work! It would have been broken for a lot longer if I didn’t add one and they had to re-deploy the site. The site was continuously pushed all day, but building and deploying could take 45+ mins

This reminds me of the old saying: everyone has a test environment. Some people are lucky enough to have a separate production environment, too.