Friday, April 22, 2011

Captain Destructo Breaks Everything

Alternate title ideas: "It's not all s/Tryserver/Try"  or "What I should have done, and didn't"

I bet you get the point by now. Today I caused a fairly lengthy, unnecessary downtime on Try. Now that I'm writing this, things are under control again; there are a few small niggly bits left, but nothing that will keep me up at night.

It all started with a bug about graphserver posts from tests not getting through. The tests were looking for MozillaTry (the tinderbox name for Try), but the graphserver only knew about Tryserver (the branch name for Try). Nothing was using Try (except the repo for Try), which is what everything ought to have been using in the first place.

Now that I've been adding a lot of project branches in a short amount of time, certain things have become more streamlined, so I felt that the best option was to go through and rename Tryserver/MozillaTry to Try everywhere so that, from the repo onward, everything used the same name. This has been working extremely well for our project branches and helps make setup a snap.

Here's where it gets all broken. My approach to this bug was a quick, superficial swipe at the problem, and it ended up causing some preventable burning. I shall now list for you (and future me) what I did and what I should have done:

Did:
  • hg rename on configs for desktop
  • branch configs for s/tryserver/try
  • updated graphserver branch name to Try
  • a quick downtime window from 10am - 12pm in order to prevent builds from getting split into two different upload dirs
Should have done:
  • hg rename on configs for mobile
  • grep of buildbotcustom for "tryserver" as we have special casing for it in several files
  • log uploader and post_upload scripts to make sure everything about the try build was going to the right place
  • updated the dir permissions on ftp for the new upload location and ensure that the archive is on nfs mount
  • edited cronjobs on staging to catch the new try builds
  • updated graphserver machines table for each try platform's builder name
  • more notice for downtime, with a 4 hour window that would have allowed a test push to make sure everything was wired up correctly  
  • updated the treeclosure hook to include the new tinderbox page
Some of the things I should have done didn't have an impact on the burning/try closure but it's fair to say that if I had done a staging round of all my plans first I would have caught more of the obvious things that I missed. I would have then planned the downtime better and been prepared to ensure the disturbance would have been minimal since this was, after all, a really low priority bug.

Aki told me that he had a manager who said "you don't learn 'til you break something". Well, I broke everything try-related today, and here's hoping that I have learned something, because the stress of this whole day is not something I want to experience often. It's that feeling you get when you realize you've started something that you can't back out of and there's no way to go but forward, even though everything in front of you now appears hopeless and messy.


So here are some lessons to take away:

  1. Staging is not to be underestimated even for just renaming things that are already working
  2. Taking the time to search with grep/mxr and find the terms you are replacing before starting the upgrade in production will help find wiring you might have overlooked in your preparations
  3. Prepare more thoroughly and have a clear idea of the env. you started in and what it will take to have that env. back when you're done. Leaving dangly bits is not ideal.
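
For lesson 2, plain grep or mxr is all you really need, but here's a rough sketch of the same idea as a small script, in case it's handy to keep around for the next rename. The function name find_term and the choice to skip .hg/.git directories are my own assumptions for illustration, not anything that exists in buildbotcustom or our tooling.

```python
import os
import re


def find_term(root, term):
    """Return (path, line_number, line) for each case-insensitive match of term under root."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: version-control metadata is noise for this search, so skip it.
        dirnames[:] = [d for d in dirnames if d not in (".hg", ".git")]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps the scan going over files that are not valid UTF-8.
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if pattern.search(line):
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                # Unreadable file: skip it rather than abort the whole search.
                continue
    return hits
```

Something like find_term("buildbotcustom", "tryserver"), run against a local checkout before touching production, would list every file and line that still has the old name wired in. The match is case-insensitive on purpose, so "Tryserver" and "tryserver" both show up.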


Happy Friday.
(and many thanks to Aki)