Tuesday, August 12, 2008

Welcome to Build, Ben says

Today was a big day for the Firefox 3.0 unittest set up. Since QA and Build have become separated, I have been working towards lining up all out unittest masters on the Build network. What used to be 10+ master addresses will be narrowed to 2 - you're either on staging-master or production master.

Easy.

No. It's actually not that easy. What I estimated would be 2 hours of downtime has turned into almost 8 hours (and counting) for many reasons, including the following:

* All the slave VMs had to have a new user created, one that is consistent with all our other Build machines. It makes sense to do this all at once, but it takes some time to get all the permissions and paths and ssh keys and other little details to line up properly

* In switching networks and users, the linux boxes were unreachable by VNC for some time until it was discovered (thanks to bhearsum & joduinn) that the xstartup in ~/.vnc was configured differently than the other linux boxes. I think it took almost an hour to get the fix on this figured out


All in all there were many little trips and glitches that made this process go for so long, and the fact that it can take over an hour to see if a build & test run is successful sucks. Thank you very much to all the Build Team who helped during this process.

At the time of writing this, I am only waiting on the pgo box to come back up on the new network with a 30GB disk partition added, and looking into a few compiler warnings on Mac and Windows. The PGO box didn't have an fcal disk partition for building on and I wonder if the issues in this bug are related to that. It would be a pretty great bonus if this switch turned up the fix for that machine.

The good news is that we are in the process of streamlining and making things more efficient for the future. All the build machines are getting closer every day to being interchangeable. The time it takes to get a new linux VM running is miniscule - and hopefully the same will be true of the other two platforms soon.

Things still to do:
* post about the new machine names of these VMs
* make sure that Nagios is clear about what it should be reporting on
* update the cron job that does the rsync of the buildmaster logs to the TB share
* file patches for 1.9 unittest's mozconfigs, master.cfg, mozbuild.py and killAndClobber.py

Back to watching the buildbot waterfall for green.

No comments: