Thursday, August 14, 2008

Continuing saga of the 1.9 Unittest Move

When we left off, there was a check error happening across all Linux slaves and a reftest failure on the Win32 ones.

Update #1: A bug (450637) has been filed on that win32 failure, and I've also brought the physical boxes back from sleep so they're up on the new 1.9 master alongside their VM counterparts. We should know in the next hour or so whether the reftest failure is consistent on all of them.

Update #2: The check error on Linux was due to the placement of a simple .sqlite file (bug-365166.sqlite, to be specific). This file was in /tmp rather than in the slave build dir, so it escaped the chown; being owned by buildbot instead of cltbld was the cause of the access denied errors. Huge thanks to Cesar and Sdwilsh for looking at that test with me and for catching this anomaly. I've filed a bug (450665) to remove the offending placement so that this doesn't happen again in the future. Files shouldn't be getting created outside of the build dir; it creates a whole mess of problems.
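
For the curious, the sweep that would have caught the stray file is only a few lines of Python. This is a rough sketch, not what we actually ran; the /tmp path and the buildbot/cltbld usernames are the ones from this incident, everything else is illustrative:

import os
import pwd

OLD_USER = "buildbot"  # files should now belong to cltbld instead
old_uid = pwd.getpwnam(OLD_USER).pw_uid

# Flag anything under /tmp that the chown missed, i.e. still owned by the old user.
for name in os.listdir("/tmp"):
    path = os.path.join("/tmp", name)
    try:
        if os.stat(path).st_uid == old_uid:
            print("still owned by %s: %s" % (OLD_USER, path))
    except OSError:
        pass  # the file may have vanished between listdir() and stat()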

Speaking of mess:


Ew. That's all I can say. I've been watching this waterfall obsessively (more than usual) as it has displayed a bruised variety of colours, mostly *Not* green.

In other news, something I noticed while upgrading the windows slaves:


Really? I didn't know that people _chose_ IE. I thought it just came with the OS. I wish they would choose their words more carefully.

Back to the unittest trenches.

Wednesday, August 13, 2008

Update on the Unittest 1.9 move

In order to streamline the buildslave pool, the names of the following unittest 1.9 slaves were changed when we switched networks yesterday.

All of these machines now run Buildbot 0.7.7 and the latest Twisted & Python.

The Linux machines had both their names and users changed - they are the same VMs as before:

qm-centos5-01 --> fx-linux-1.9-slave07
qm-centos5-02 --> fx-linux-1.9-slave08
qm-centos5-04 --> fx-linux-1.9-slave09

The Mac machines are the same ones as before, with only a user change:

qm-xserve01 --> bm-xserve20
qm-xserve06 --> bm-xserve21

The two non-PGO Windows machines are now VMs; the PGO box is the same VM that it was before, with a user change and a new 30GB fcal drive added for building on:

qm-win2k3-01 --> fx-win32-1.9-slave07
qm-win2k3-02 --> fx-win32-1.9-slave08
qm-win2k3-pgo01 --> fx-win32-1.9-slave09


At the moment all three Linux boxes are experiencing errors in Check:

gmake[2]: Leaving directory `/builds/slave_new/trunk_centos5_8/mozilla/objdir/storage/build'
gmake[2]: Entering directory `/builds/slave_new/trunk_centos5_8/mozilla/objdir/storage/test'
../../_tests/xpcshell-simple/test_storage/unit/test_bug-365166.js: FAIL
../../_tests/xpcshell-simple/test_storage/unit/test_bug-365166.js.log:
>>>>>>>
*** Storage Tests: Trying to close!
*** Storage Tests: Trying to remove file!
*** test pending
[Exception... "Component returned failure code: 0x80520015 (NS_ERROR_FILE_ACCESS_DENIED) [mozIStorageService.openDatabase]" nsresult: "0x80520015 (NS_ERROR_FILE_ACCESS_DENIED)" location: "JS frame :: ../../_tests/xpcshell-simple/test_storage/unit/test_bug-365166.js :: test :: line 22" data: no]
*** FAIL ***

<<<<<<<
../../_tests/xpcshell-simple/test_storage/unit/test_bug-393952.js: PASS
../../_tests/xpcshell-simple/test_storage/unit/test_bug-444233.js: PASS



And all three Win32 boxes are hitting the same single test failure in Reftest:

REFTEST UNEXPECTED FAIL: file:///E:/slave/trunk_2k3_8/mozilla/layout/reftests/bugs/212563-1.html

Please contact me if you have any ideas about what could be causing these.

-- Lukas

Tuesday, August 12, 2008

Welcome to Build, Ben says

Today was a big day for the Firefox 3.0 unittest setup. Since QA and Build have been separated, I have been working towards lining up all our unittest masters on the Build network. What used to be 10+ master addresses will be narrowed down to 2 - you're either on the staging master or the production master.

Easy.

No. It's actually not that easy. What I estimated would be 2 hours of downtime has turned into almost 8 hours (and counting) for many reasons, including the following:

* All the slave VMs had to have a new user created, one that is consistent with all our other Build machines. It makes sense to do this all at once, but it takes some time to get all the permissions, paths, ssh keys, and other little details to line up properly (see the sketch after this list)

* In switching networks and users, the Linux boxes were unreachable by VNC for some time until it was discovered (thanks to bhearsum & joduinn) that the xstartup in ~/.vnc was configured differently than on the other Linux boxes. I think it took almost an hour to figure out the fix for this
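
A quick sanity check like the one below saves a lot of squinting at permission bits after a user switch. This is only a sketch: cltbld is the new build user, but the exact checks are illustrative rather than what we actually ran.

import os
import stat

HOME = os.path.expanduser("~cltbld")  # assumed new build user's home
SSH_DIR = os.path.join(HOME, ".ssh")

def mode_of(path):
    return stat.S_IMODE(os.stat(path).st_mode)

# sshd is picky: ~/.ssh should be 700 and private keys 600.
if os.path.isdir(SSH_DIR):
    print(".ssh is mode %o (want 700)" % mode_of(SSH_DIR))
    for name in sorted(os.listdir(SSH_DIR)):
        print("%-20s mode %o" % (name, mode_of(os.path.join(SSH_DIR, name))))
else:
    print("no ~/.ssh for cltbld yet")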


All in all, there were many little trips and glitches that made this process go on for so long, and the fact that it can take over an hour to see whether a build & test run is successful sucks. Thank you very much to all of the Build Team who helped during this process.

At the time of writing this, I am only waiting on the PGO box to come back up on the new network with a 30GB disk partition added, and I'm looking into a few compiler warnings on Mac and Windows. The PGO box didn't have an fcal disk partition for building on, and I wonder if the issues in this bug are related to that. It would be a pretty great bonus if this switch turned up the fix for that machine.

The good news is that we are in the process of streamlining and making things more efficient for the future. All the build machines are getting closer every day to being interchangeable. The time it takes to get a new Linux VM running is minuscule - and hopefully the same will soon be true of the other two platforms.

Things still to do:
* post about the new machine names of these VMs
* make sure that Nagios is clear about what it should be reporting on
* update the cron job that does the rsync of the buildmaster logs to the TB share
* file patches for 1.9 unittest's mozconfigs, master.cfg, mozbuild.py and killAndClobber.py

Back to watching the buildbot waterfall for green.

Monday, August 11, 2008

Scheduled Downtime Tues Aug 12 - 8:00 am PDT for Unittest network switch

Tomorrow there will be a ~2hr downtime starting at 8:00 am PDT as the 1.9 unittest master is moved over to the build network.

At the same time there will be a short interruption on the Mozilla2 production master.

If any issues arise, please comment in bug 450119.


Thursday, August 7, 2008

Looking for suggestions on dealing with lots of data

So I'm still plugging away at figuring out how to interpret the massive amounts of error log output that our unittest builds create.

As the test suites are being run, there is a steady stream of stdio being generated and logged. From this stdio, I gather up all the lines of output that contain "TEST-UNEXPECTED-FAIL" (thanks to Ted for unifying the output!).
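
The gathering step is nothing fancier than a grep. Roughly something like this (a sketch, not the actual script; the log filename is made up):

MARKER = "TEST-UNEXPECTED-FAIL"

def unexpected_failures(log_path):
    # Return every TEST-UNEXPECTED-FAIL line from one build log.
    failures = []
    for line in open(log_path):
        if MARKER in line:
            failures.append(line.rstrip("\n"))
    return failures

# e.g.:
# for line in unexpected_failures("unittest-build.log"):
#     print(line)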

Now I have files that look something like this:

linux-2 | 67 | 07/25/2008 | 06:40 | *** 61506 ERROR TEST-UNEXPECTED-FAIL | /tests/toolkit/content/tests/widgets/test_tree.xul | Error thrown during test: uncaught exception: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIDOMWindowUtils.sendMouseScrollEvent]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: http://localhost:8888/tests/SimpleTest/EventUtils.js :: synthesizeMouseScroll :: line 273"  data: no] - got 0, expected 1
linux-2 | 67 | 07/25/2008 | 06:40 | *** 62352 ERROR TEST-UNEXPECTED-FAIL | /tests/toolkit/content/tests/widgets/test_tree_hier.xul | Error thrown during test: uncaught exception: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIDOMWindowUtils.sendMouseScrollEvent]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: http://localhost:8888/tests/SimpleTest/EventUtils.js :: synthesizeMouseScroll :: line 273" data: no] - got 0, expected 1
linux-2 | 67 | 07/25/2008 | 06:40 | *** 63084 ERROR TEST-UNEXPECTED-FAIL | /tests/toolkit/content/tests/widgets/test_tree_hier_cell.xul | Error thrown during test: uncaught exception: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIDOMWindowUtils.sendMouseScrollEvent]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: http://localhost:8888/tests/SimpleTest/EventUtils.js :: synthesizeMouseScroll :: line 273" data: no] - got 0, expected 1

Where the info is "|" delimited and goes like this:

PLATFORM | BUILD_NO | DATE | TIME | TEST-RESULT | TEST-NAME | TEST-OUTPUT
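
Splitting those lines back apart is trivial. A small sketch (the field names are mine, following the layout above; splitting on only the first six pipes keeps any "|" inside the test output intact):

FIELDS = ["platform", "build_no", "date", "time", "result", "test", "output"]

def parse_failure_line(line):
    # Turn one "|"-delimited failure line into a dict of named fields.
    parts = [part.strip() for part in line.split("|", len(FIELDS) - 1)]
    return dict(zip(FIELDS, parts))

# e.g. parse_failure_line(first_line_above)["test"] gives
# "/tests/toolkit/content/tests/widgets/test_tree.xul"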


Approximately 7000 lines of error output for less than a month of constant testing.

I want to be able to know the following (at least):

* How many times has a particular test failed?
* On which platforms?
* How many times this week vs. last week?

That would be a start anyway.
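
Even before a real database, the first two questions fall out of a single in-memory pass. A sketch, reusing the hypothetical parse_failure_line() helper from above (the filename is made up):

from collections import defaultdict

def summarize(rows):
    counts = defaultdict(int)     # test name -> total failure count
    platforms = defaultdict(set)  # test name -> platforms it has failed on
    for row in rows:
        counts[row["test"]] += 1
        platforms[row["test"]].add(row["platform"])
    return counts, platforms

# e.g.:
# rows = [parse_failure_line(l) for l in open("failures.txt")]
# counts, platforms = summarize(rows)
# for test in sorted(counts, key=counts.get, reverse=True):
#     print("%4d  %s  (%s)" % (counts[test], test, ", ".join(sorted(platforms[test]))))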

I would love to be able to create a graph or something visual that shows peaks of test failures. Unfortunately I don't really know much about that area.
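
One low-effort option might be matplotlib, which can turn the per-day counts into a bar chart in a handful of lines. A sketch only (assuming matplotlib is installed, and reusing the parsed rows from the sketches above):

from collections import Counter
import matplotlib.pyplot as plt

def plot_failures_per_day(rows):
    # "date" is the MM/DD/YYYY string straight from the log lines
    per_day = Counter(row["date"] for row in rows)
    days = sorted(per_day)
    plt.bar(range(len(days)), [per_day[d] for d in days])
    plt.xticks(range(len(days)), days, rotation=90)
    plt.ylabel("TEST-UNEXPECTED-FAIL lines")
    plt.show()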

So I am asking for help/suggestions. If you had about 490,000 lines of errors (representing 3 platforms) in the above format - what would you do?

I can pretty easily extend the Python script that greps for error output so that it creates SQL insert statements instead of a text file, and I would welcome tips on creating/automating a database to hold all the error info. I've been thinking of setting something up with RoR to let people create their own views of the data, depending on what they're looking for.
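
For what it's worth, even sqlite (which ships with Python) would probably cover the questions above. A sketch of the kind of schema I have in mind; the table and column names are invented for illustration:

import sqlite3

conn = sqlite3.connect("unittest_failures.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS failures (
        platform TEXT,
        build_no TEXT,
        date     TEXT,
        time     TEXT,
        result   TEXT,
        test     TEXT,
        output   TEXT
    )
""")

def insert_failure(row):
    # row is a dict in the shape produced by parse_failure_line() above
    conn.execute(
        "INSERT INTO failures (platform, build_no, date, time, result, test, output)"
        " VALUES (?, ?, ?, ?, ?, ?, ?)",
        (row["platform"], row["build_no"], row["date"], row["time"],
         row["result"], row["test"], row["output"]))

# The questions above then become one-liners, e.g.:
#   SELECT test, platform, COUNT(*) FROM failures GROUP BY test, platform;
conn.commit()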

Looking forward to your advice.