Tuesday, August 12, 2008

Welcome to Build, Ben says

Today was a big day for the Firefox 3.0 unittest setup. Since QA and Build have become separated, I have been working towards lining up all our unittest masters on the Build network. What used to be 10+ master addresses will be narrowed to 2 - you're either on staging-master or production master.

Easy.

No. It's actually not that easy. What I estimated would be 2 hours of downtime has turned into almost 8 hours (and counting) for many reasons, including the following:

* All the slave VMs had to have a new user created, one that is consistent with all our other Build machines. It makes sense to do this all at once, but it takes some time to get all the permissions and paths and ssh keys and other little details to line up properly

* In switching networks and users, the linux boxes were unreachable by VNC for some time until it was discovered (thanks to bhearsum & joduinn) that the xstartup in ~/.vnc was configured differently than on the other linux boxes. I think it took almost an hour to figure out the fix for this


All in all, there were many little trips and glitches that made this process go on for so long, and the fact that it can take over an hour to see whether a build & test run is successful sucks. Thank you very much to everyone on the Build Team who helped during this process.

At the time of writing this, I am only waiting on the pgo box to come back up on the new network with a 30GB disk partition added, and looking into a few compiler warnings on Mac and Windows. The PGO box didn't have an fcal disk partition for building on and I wonder if the issues in this bug are related to that. It would be a pretty great bonus if this switch turned up the fix for that machine.

The good news is that we are in the process of streamlining and making things more efficient for the future. All the build machines are getting closer every day to being interchangeable. The time it takes to get a new linux VM running is minuscule - and hopefully the same will be true of the other two platforms soon.

Things still to do:
* post about the new machine names of these VMs
* make sure that Nagios is clear about what it should be reporting on
* update the cron job that does the rsync of the buildmaster logs to the TB share
* file patches for 1.9 unittest's mozconfigs, master.cfg, mozbuild.py and killAndClobber.py

Back to watching the buildbot waterfall for green.

Monday, August 11, 2008

Scheduled Downtime Tues Aug 12 - 8:00 am PDT for Unittest network switch

Tomorrow there will be a ~2hr downtime starting at 8:00 am PDT as the 1.9 unittest master is moved over to the build network.

At the same time there will be a short interruption on the Mozilla2 production master.

If any issues arise, please comment in bug 450119.


Thursday, August 7, 2008

Looking for suggestions on dealing with lots of data

So I'm still plugging away at figuring out how to interpret the massive amounts of error log output that our unittest builds create.

As the test suites are being run, there is a steady stream of stdio being generated and logged. From this stdio, I gather up all the lines of output that contain "TEST-UNEXPECTED-FAIL" (thanks to Ted for unifying the output!).

Now I have files that look something like this:

linux-2 | 67 | 07/25/2008 | 06:40 | *** 61506 ERROR TEST-UNEXPECTED-FAIL | /tests/toolkit/content/tests/widgets/test_tree.xul | Error thrown during test: uncaught exception: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIDOMWindowUtils.sendMouseScrollEvent]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: http://localhost:8888/tests/SimpleTest/EventUtils.js :: synthesizeMouseScroll :: line 273"  data: no] - got 0, expected 1
linux-2 | 67 | 07/25/2008 | 06:40 | *** 62352 ERROR TEST-UNEXPECTED-FAIL | /tests/toolkit/content/tests/widgets/test_tree_hier.xul | Error thrown during test: uncaught exception: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIDOMWindowUtils.sendMouseScrollEvent]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: http://localhost:8888/tests/SimpleTest/EventUtils.js :: synthesizeMouseScroll :: line 273" data: no] - got 0, expected 1
linux-2 | 67 | 07/25/2008 | 06:40 | *** 63084 ERROR TEST-UNEXPECTED-FAIL | /tests/toolkit/content/tests/widgets/test_tree_hier_cell.xul | Error thrown during test: uncaught exception: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIDOMWindowUtils.sendMouseScrollEvent]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: http://localhost:8888/tests/SimpleTest/EventUtils.js :: synthesizeMouseScroll :: line 273" data: no] - got 0, expected 1

The fields are "|" delimited, in this order:

PLATFORM | BUILD_NO | DATE | TIME | TEST-RESULT | TEST-NAME | TEST-OUTPUT


Approximately 7000 lines of error output for less than a month of constant testing.
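Getting those lines into a usable shape is mostly a matter of splitting on the delimiter. A minimal Python sketch (the function name and field labels are my own, taken from the format description above); note the last field can contain anything, so only the first six separators should be split on:

```python
FIELDS = ["platform", "build_no", "date", "time", "result", "test_name", "output"]

def parse_error_line(line):
    """Split one '|'-delimited error line into a dict of named fields.

    The format is:
    PLATFORM | BUILD_NO | DATE | TIME | TEST-RESULT | TEST-NAME | TEST-OUTPUT
    TEST-OUTPUT may itself contain anything (exception text, URLs), so we
    split on at most six '|' separators and keep the rest intact.
    """
    parts = [p.strip() for p in line.split("|", 6)]
    if len(parts) != 7:
        return None  # malformed line - skip it
    return dict(zip(FIELDS, parts))
```

Once every line is a dict, counting and filtering become one-liners instead of regex wrangling.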

I want to be able to know the following (at least):

* How many times has a particular test failed?
* On which platforms?
* How many times this week vs. last week?

That would be a start anyway.
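All three questions can be answered with the standard library before reaching for a database. A rough sketch (function name and return shape are invented for illustration; it assumes the "|" format above with MM/DD/YYYY dates):

```python
from collections import Counter, defaultdict
from datetime import datetime

def summarize(lines, today=None):
    """Aggregate '|'-delimited failure lines.

    Returns: per-test failure counts, the set of platforms each test
    failed on, and a this-week/last-week split based on the DATE field.
    """
    today = today or datetime.now()
    fail_counts = Counter()
    platforms = defaultdict(set)
    weekly = defaultdict(Counter)
    for line in lines:
        parts = [p.strip() for p in line.split("|", 6)]
        if len(parts) != 7:
            continue
        platform, _build, date, _time, result, test, _output = parts
        if "TEST-UNEXPECTED-FAIL" not in result:
            continue
        fail_counts[test] += 1
        platforms[test].add(platform)
        age = (today - datetime.strptime(date, "%m/%d/%Y")).days
        if age < 7:
            weekly["this_week"][test] += 1
        elif age < 14:
            weekly["last_week"][test] += 1
    return fail_counts, platforms, weekly
```

`fail_counts.most_common(10)` then gives the top offenders straight away.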

I would love to be able to create a graph or something visual that shows peaks of test failures. Unfortunately I don't really know much about that area.

So I am asking for help/suggestions. If you had about 490,000 lines of errors (representing 3 platforms) in the above format - what would you do?

I can pretty easily modify the python script that greps for error output so that it creates SQL insert statements instead of a text file, and I would welcome tips on creating/automating a database to hold all the error info. I've been thinking of setting something up with RoR to let people create their own views of the data, depending on what they are looking for.
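For experimenting before committing to a real server, sqlite3 ships with Python and needs no setup. A sketch of what the insert side might look like (the table name and schema are my own guesses at how the "|" fields would map to columns):

```python
import sqlite3

def load_errors(db_path, parsed_rows):
    """Create a failures table and bulk-insert parsed log rows.

    parsed_rows is an iterable of 7-tuples in the same order as the
    '|'-delimited log fields. Use ':memory:' as db_path to experiment.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS failures (
               platform TEXT, build_no INTEGER, date TEXT, time TEXT,
               result TEXT, test_name TEXT, output TEXT)"""
    )
    conn.executemany(
        "INSERT INTO failures VALUES (?, ?, ?, ?, ?, ?, ?)", parsed_rows
    )
    conn.commit()
    return conn
```

After that, "how many times has this test failed, and where" is just `SELECT test_name, COUNT(*) FROM failures GROUP BY test_name`.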

Looking forward to your advice.

Wednesday, July 23, 2008

Grovelling isn't so bad

Been working on a couple of little utility scripts that I think are ready for public viewing. I'm interested in any tips on writing better code, or other ways to do what I'm doing that are more efficient.

The first one is cleanup.py, which we need in order to quickly get rid of old log files so that, when we grovel through for errors, only the files of interest are being scanned.

Once you've got the old log files cleared out, you can use grovel.py to scan through for TEST-UNEXPECTED-FAIL. This script looks through each directory passed in from the command line, and prints all the failure lines to a .errors file for that directory - so the darwin log errors end up in a darwin_timestamp.errors file. The script also keeps a counter of TEST-PASS, TEST-KNOWN-FAIL, and TEST-PASS(EXPECTED RANDOM) and then prints the total tests run as well as these counters on the last line of the .errors file.
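grovel.py itself isn't shown here, but the behaviour described above could be sketched roughly like this (function and file names are invented for illustration; the real script's details will differ):

```python
import os
import time

# Check the most specific marker first, since "TEST-PASS" is a
# substring of "TEST-PASS(EXPECTED RANDOM)".
COUNTED = ("TEST-PASS(EXPECTED RANDOM)", "TEST-KNOWN-FAIL", "TEST-PASS")

def grovel_dir(log_dir, out_prefix):
    """Scan log files in log_dir for TEST-UNEXPECTED-FAIL lines.

    Writes the failure lines to <out_prefix>_<timestamp>.errors and
    appends the pass/known-fail counters as the last line, mirroring
    the grovel.py behaviour described above.
    """
    counters = {key: 0 for key in COUNTED}
    failures = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, errors="replace") as log:
            for line in log:
                if "TEST-UNEXPECTED-FAIL" in line:
                    failures.append(line.rstrip("\n"))
                for key in COUNTED:
                    if key in line:
                        counters[key] += 1
                        break
    out_path = "%s_%d.errors" % (out_prefix, int(time.time()))
    with open(out_path, "w") as out:
        out.write("\n".join(failures) + "\n")
        out.write("TOTALS: %s\n" % counters)
    return out_path, counters
```

Running it once per platform directory gives the darwin_timestamp.errors style of output described above.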

Next steps:

  • Add gathering up all the .errors files into a tarball

  • Set up a weekly cron job that will run these scripts and email the tarball

  • Create a database and insert results

  • Web interface for aforementioned db that will allow for searching



Even though these are pretty simple utility scripts, I'm excited because they will make my life a little easier and also because it's the first python I've written from scratch...oh, and it's not a school assignment :)

Monday, July 21, 2008

Discussing Data

Some general thoughts on the discussion of data, inspired by Mitchell's blog post.

When I first started using the internet with some regularity, about 13 years ago, I was suspicious about entering any personal information whatsoever. This was before identity theft was a common occurrence, before I had any money to worry about losing - I don't think I even had a credit card yet. Some of the fears were based on run-of-the-mill rebellion against "The Man", but some of it was just a reaction to something new.

For many years, whenever prompted for personal information, I would look for a way around having to enter it. If I couldn't get under it or over it, I would make stuff up...or leave. Creating false accounts gets tiring, because then you have to remember all your lies. Firefox wasn't around yet to help me keep track of all my phony accounts. I sure do appreciate the password manager and extensions like BugMeNot.

Skipping forward to the present, I still look for a way out of having to enter any identifying data wherever possible. Something that continually annoys me is being required to choose between male and female on a form when I am making a purchase. This should NOT be required to buy a sticker, test beta software or sign up for a social networking site. I'd like to see the end of generalized marketing based on gender, and new ways found of triangulating what cat owners are doing differently from dog owners.

Back to the data...

Even though I hate the thought of anyone assuming they know me because of a few hastily checked radio buttons, I also want the freedom to go about my business on the internet as easily as I do in real life - with my driver's license and a bank card. I have proof of who I am and I have money - what more do you want? I should now get to do whatever it is I'm looking to do with as few clicks as possible.

So if the future web browser allows me to safely keep all the important stuff handy, to know that I am who I say I am, and let me skip the 3 page sign up process, this is a Good Thing.

How can we get to that kind of level without talking about data and all the good/bad/lukewarm associations we have with it?

I tell people all the time that they should be using Firefox because it is the safest. People care about safety, and this is what they need to hear. If Firefox starts to work with data, I trust that we will do so in the best interest of the people who came to us for safety. I'm excited to talk about data and what we can do with it.

My hope is that data collection will become less of a top-down "Tell me this information or you can't access {fabulous service name here}" and instead will become the equivalent of the clerk at Best Buy asking you for your postal code and being able to say "No, I don't want to give that information to you, but I will still buy Rock Band from you".




Thursday, July 17, 2008

Set the VNC Password for Mac's Remote Desktop in Terminal

I was stuck trying to access one of our xserve machines that just got moved from the QA network to the Build network. I could connect via ssh, and Justin could ping it, but attempting to connect with VNC wasn't working. It wouldn't accept the usual passwords. Justin seemed to think that it was possible to change the VNC password through the command line, so I googled it and read a post from 2 years ago.

Something I've learned from reading "how-to" blogs is that you should always read the comments first. That's where the most up to date information will be, if there is any. The person who wrote the post used a strange template structure that made his idea hard to read and understand. Anyone who didn't read the comments wouldn't know that kickstart now takes plain text passwords.

The long and short? If you want to change the VNC password, do this:

sudo /System/Library/CoreServices/RemoteManagement/ARDAgent.app/Contents/Resources/kickstart -configure -clientopts -setvnclegacy -vnclegacy yes -setvncpw -vncpw [newpassword]


Apparently you can enable VNC access and set the VNC password via the kickstart command. It isn't terribly well documented, but since it now accepts plain text passwords, I think that's a step in the right direction.

Wednesday, July 2, 2008

Chasing rainbows is easier

I was so thrilled to discover Splunk that I installed it on one of the buildbot masters - qm-rhel02 - without realizing that Splunk quickly starts to eat up disk space and hog memory. Yesterday afternoon some Talos boxes started to go down because of this, and once I stopped the Splunk server everything started to right itself.

Lessons learned:
Do not play with the buildbot master.
Do not look directly at the buildbot master.
Do not taunt the buildbot master.


So today's tasks include getting access to the samba share that was set up, creating a cron job that will rsync the buildbot master logs to said share and then finding a safe place to set up Splunk again.

We really need to have a way to look at data from the buildbot master over a long period of time - otherwise filing bugs on these intermittent failures is just a shot in the dark. Take yesterday for example. qm-win2k3-pgo01 is being "unreliable" and had the same errors in reftest for two consecutive builds. I file a bug, and the response is "grab me a copy of the offending objdir so we can poke at it". Wouldn't you know that the very next build does not have the same error output - this time it has mochitest issues that are seemingly unrelated. This morning I check again and it's had a compile failure, an exception (the most hideous purple) and then a completely green run.

Intermittent failures == needle in a haystack