PS1 IPP Czar Logs for the week 2010.12.06 - 2010.12.12


Monday : 2010.12.06 : Roy

Not much data last night due to telescope azimuth amplifier errors. 123 science exposures in total, 99 through burntool.

bills notes: I forgot to mention that ipp007 has been running relphot on /data/ipp007.0/gpc1/catdirs/ThreePi.V1.20101118 since last night. The working directory is ~ipp/dvo/relphot

11:00 bills: Mark Huber reported this morning that no data has been distributed since 11/30. rcDestination.dbhost was still set to ippdb02, so filesets were getting registered in the wrong database. At 08:08 fixed this by faulting the affected filesets and letting revert / rerun recreate them. Later Armin reported that some from 11/30 were still missing: Bill's date cut on the fix was in HST while the time stamps are in UTC. The correct fix was to fault and then revert all filesets registered between '2010-11-30 13:33:58' and '2010-12-06'.
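
For reference, HST is 10 hours behind UTC, so a cut written as a bare HST date starts 10 hours later than the same date in UTC. A quick illustrative check with GNU date (not part of the fix itself):

    date -u -d '2010-11-30 00:00 HST'
    # Tue Nov 30 10:00:00 UTC 2010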

11:30 bills: stackonly shut down so as not to interfere with detrend memory hogs.

11:38 heather: I started detrend as myself about half an hour ago (with the current ipp trunk) to run another iteration on 585. I used whatever the default hosts were in detrend from months ago, and this caused load chaos. I stopped detrend, and when the queue clears out I will choose different/fewer hosts.

13:13 roy: Turned destreak reverts on (63 faults). Czartool was lying about it already being on.

15:04 bills: turned destreak on and revert off. ran magictool -reverttree -label ThreePi.nightlyscience

15:28 eam: shut down all pantasks, updated the ipp user to ipp-20101206, restarted all pantasks

16:30 roy: all stdscience processing is complete

17:30 bills: Lots of unlogged activity the past few hours updating the production build to the new tag. Distribution shut off pending some distribution updates by bills.

17:49 bills: almost all chip updates are failing; update processing stopped.

18:57 bills: added label test_warpstack_forced to stdscience. As the label implies it is a test.

Tuesday : 2010.12.07

Heather is czar, but this is all contributed by bills.

05:27 restarted distribution pantasks. rcserver left turned off until bill tests changes to receive_file.pl

06:00 all distRuns for the stack-stack diffs made last night have some components that failed. From the file ownership and time stamps it appears that the script compressing the existing config dump files also operated on these new ones, creating files that are doubly compressed. They need two doses of gunzip, i.e. gunzip -c file.mdc | gunzip -c > outfile.mdc
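
For fixing these in bulk, a minimal sketch of the same command looped over files (the directory and glob are placeholders, not the real distRun layout; whether the cleaned output should then be re-gzipped once depends on what the distribution code expects on disk):

    # Each affected config dump needs two passes of gunzip to become readable.
    for f in /path/to/affected/*.mdc; do
        gunzip -c "$f" | gunzip -c > "${f}.plain"
    done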

06:30 started up the update pantasks running as bills, using his ipp-20101029 build. Turned off the MOPS.2 label to give full attention to the small number of requests that came in over the web interface.

07:30 many postage stamp requests completed, but now many are blocked because bills' build can't handle compressed config dump files.

08:30 Got fixes to ppImage. Rebuilt production build and restarted update as ipp user. Reverted chip faults from ps_ud_WEB and ps_ud_WEB.UP

09:59 1250 chips processed by update with only 1 fault! 49 MD03.refstack skycells left to stack. Started them in the stackonly pantasks on ipp042. Hosts ipp040-44 are in use.

13:15 Needed to whack a few moles with regard to inconsistencies between data_state and faults in the tables, but we have had no non-revertible faults in chip and destreak update for several hours. Bill giddy with happiness.

Noticed that publishing was failing 100% of the time. Turned out to be a bug in publish_file.pl: "if it isn't tested, it doesn't work".

15:58 (heather) Heather is running detrend as heather on 038, 039, 046-053, c00-c10 (since 10am). After investigating minidvodb_create faults with Bill, discovered that addstar had not been restarted since 11-23 (this is terrible; we have rebuilt the ipp since then). Restarting addstar.

16:16 (bills) Queued 480 STS diffRuns label STS.20101202

16:46 (bills) stdscience diff.revert turned off

21:25 (bills) came back from reception to find that the update pantasks died around 17:42. Ken's team has submitted some postage stamp request files that resulted in nearly 7000 jobs. Nearly all of these will require update processing. Update processing is still going well. I need to do some work on destreak revert and cleanup to handle gone files.

Wednesday : 2010.12.08

9:57 (heather) no data from last night. I'm using the same hosts as I did yesterday to process ~30 images for the static mask (in a stdscience-like pantasks running on c14). The label for these images is darktest.20101208

11:47 (bills) The STS stacks are too slow and faulting a lot. I have set the label to inactive for the time being while we debug. (It's still in the stdscience label list, but it's inactive, so roboczar will complain about it being stuck.)

14:17 (heather) minidvodb faults: merge faulted (this is the first time it has faulted on a non-test db) and is complaining about recipes (this started after the restart of addstar yesterday). Will investigate once the db is back up. Also, minidvodb_create faulted again. First, nothing bad happens: it succeeds on the second try. Second, I think it occurs when there is a minidvodb switch (I'm not sure why it does that).

14:46 (Serge) stopped all services and stopped the mysql servers to put a new configuration for duplication in place.

14:47 (Serge) restarted the mysql servers

14:51 (Serge) restarted all pantasks services except addstar (HAF), detrend (HAF), and replication (CZW).

15:10 (bills) After the restart, some postage stamp requests that had been taking a long time to parse finished quickly. Parsing a request file and inserting jobs is a database-intensive operation, primarily against the gpc1 database. It appears that restarting the database improved things significantly. Looking at ippdb01's ganglia page, we find that prior to the restart it was using quite a bit of swap space and the total number of processes was quite high. After the restart these numbers dropped significantly. Something to keep an eye on.
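
For the record, generic checks of the same symptoms directly on the db host, useful for comparing with ganglia before and after a restart (standard Linux tools, nothing IPP-specific):

    # Run on ippdb01: swap usage, process count, and ongoing swap activity.
    free -m                      # how much swap is in use
    ps -e --no-headers | wc -l   # total number of processes
    vmstat 5 3                   # si/so columns show active swapping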

16:39 (heather) running my own stdscience and detrend (still) to process more darktest_nonl.20101208 (same as before but with non-lin turned off), and running darkmask processing.

18:25 (heather) I turned merging off in addstar to debug why the latest merge is failing. Oh, it just now finished with a segfault. merging is off.

21:31 (bills) postage stamp server / update processing has gotten sluggish. I suspect that the number of recurring faults is getting in the way. I've turned off pstamp.revert, chip.revert, and destreak.revert in the pstamp, update, and update pantasks respectively...

21:36 (bills) increased poll limit from 200 to 300 in pstamp. This mostly allows more dependents to be checked and may free up some of the sluggishness. 21:44 still experimenting. In the pstamp pantasks turned off reverts (pstamp.revert.off). My thinking is that if there are lots of jobs faulting repeatedly, they might be filling up the queues, preventing real work from getting done.

22:15 doubled number of hosts for update. cowabunga!

Thursday : 2010.12.09

Bill is the czar for the next two days.

06:00 Lots of update faults last night. It turned out that I forgot to check in the recipe changes for the SAS_REFERENCE reduction class. This caused updates from the SAS rerun to fail. Fixed that. Then noticed that destreak reverts were failing. This was due to the changes in chip update processing to not rerun photometry. Fixed that. There are also a significant number of chip stage faults. Bill is no longer grinning.

Disabled postage stamp request retrieval from MOPS.2 until I get things sorted out.

08:00 no data last night due to high humidity. OSS magic streak detection seems to have completed. The faulted STS diffRuns are being debugged, or rather will be when I finish debugging the update failures.

08:46 found some chip faults where the cmf file was missing, so update tried to rerun photometry and psphot popped an assertion.

Many "failure for: chip_imfile.pl" with fault 1 Log file shows that the job ran again and succeeded. The jobs probably worked but had trouble updating the database. Also noticed job timeouts on query tasks. Just now the database dump to ipp0012 is running.

Stopped stdscience and distribution because they have nothing to do.

09:15 Found the source of the chip update failures. They are due to the recent changes to chip update processing: we now demand the existence of all expected output files. Older data did not have the pattern correction enabled, so that file does not exist and we fault. The bug is that I am using the current recipe to decide whether we should expect the pattern file; I should be parsing the config dump file, but that would add several seconds to chip processing. As a hack I will allow the non-existence of this file to not be an error when the run state is update.
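
Just to pin down the intended hack in code form (the real change is in the Perl chip-update code; this is a shell-flavored sketch with hypothetical names, not the actual implementation):

    # Tolerate a missing pattern-correction product only when the run state is 'update'.
    state="update"
    pattern_file="/path/to/chip/output.pattern.fits"   # hypothetical path

    if [[ ! -e "$pattern_file" ]]; then
        if [[ "$state" == "update" ]]; then
            echo "warning: missing $pattern_file tolerated for run state 'update'"
        else
            echo "error: missing $pattern_file" >&2
            exit 1
        fi
    fi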

11:34 That was exhausting. Fixed various data state problems caused by the bugs described above. All postage stamp requests are finished. Turned MOPS.2 back on and set all tasks to nominal states, that is, reverts on.

18:37 set stdscience and distribution to run

Friday : 2010.12.10

13:42 No data again last night. Quiet day. MPIA submitted many thousands of postage stamp requests. Just queued ps_ud_MOPS.2 and ps_ud_WEB.UP for chip cleanup.

Saturday : 2010.12.11

Sunday : 2010.12.12