PS1 IPP Czar Logs for the week 2010.12.13 - 2010.12.19


Monday : 2010.12.13

Lots of update processing triggered by the postage stamp server has been going on for the past several days. Except for some requests for chips with reduction STDSCIENCE_V0, things have been progressing smoothly.

  • 12:40 (bills) added label MD02.2010.rerun to stdscience (562 exposures). About half an hour later pantasks died; restarted it at 13:31.
  • 17:08 (heather) under my own stdscience, processed darktest%.20101213 to investigate edge NaNs. Queued darktest.20101213, consisting of 100 i-band images, for static masking (and the autogeneration of the static masks).

Tuesday : 2010.12.14

Update storm seems to have passed during the night.

Many faults from old chipRuns with reduction STDSCIENCE_V0 (chip_id ~= 20000). Many (if not all) of the config dump files for this reduction refer to files that no longer exist. I set the state to goto_purged, but I have not yet pulled the trigger and set the label to goto_cleaned.
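
For reference, a minimal sketch of what that state flip looks like if done directly against the gpc1 database. The table and column names (chipRun, state, label, reduction), host, and credentials are assumptions, not a transcript of what was actually run; in practice the change would normally go through the ipp command-line tools rather than raw SQL.

#!/usr/bin/env python
# Hypothetical sketch only: flag old STDSCIENCE_V0 chipRuns for purging.
# Table/column/host/credential names are assumptions about the gpc1 schema.
import MySQLdb

db = MySQLdb.connect(host='ippdb01', user='ipp', passwd='XXXXXX', db='gpc1')
cur = db.cursor()

# Step 1 (done): mark the runs so their products can be purged.
cur.execute("UPDATE chipRun SET state = 'goto_purged' "
            "WHERE reduction = 'STDSCIENCE_V0'")

# Step 2 (not yet pulled): relabel so the cleanup tasks pick them up.
# cur.execute("UPDATE chipRun SET label = 'goto_cleaned' "
#             "WHERE reduction = 'STDSCIENCE_V0'")

db.commit()
db.close()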

  • 08:00 Queued MD02 nightly stacks. Label MD02.2010.rerun. Data group is MD02.V2.$date to distinguish from the MD02 stacks run previously with data group MD02.$date
  • 10:40 (Serge) Many diffs have faults. Started revert for diff (through czartool). I don't see anything else.
  • 11:00 (Serge) I ran ~heather/sshToNodes.py ipp /usr/local/sbin/nfscheck as user ipp, and all nodes report that they are OK.
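
For context, a rough sketch of what an sshToNodes.py-style fan-out does: run one command on every node in a group over ssh and report which hosts answer cleanly. The host names below are placeholders and the script's real interface is an assumption; only the nfscheck path comes from the entry above.

#!/usr/bin/env python
# Sketch of an sshToNodes.py-style check (hypothetical host list).
import subprocess
import sys

NODES = ['ipp%03d' % n for n in range(1, 11)]   # assumed node names

def run_on_nodes(cmd):
    bad = []
    for host in NODES:
        proc = subprocess.Popen(['ssh', host, cmd],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        out, _ = proc.communicate()
        print('%s: %s' % (host, 'OK' if proc.returncode == 0 else 'FAILED'))
        if proc.returncode != 0:
            bad.append(host)
    return bad

if __name__ == '__main__':
    cmd = sys.argv[1] if len(sys.argv) > 1 else '/usr/local/sbin/nfscheck'
    sys.exit(1 if run_on_nodes(cmd) else 0)
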
  • 11:20 (Serge) Repeated entry in logs of failed jobs:
    I/O error code: 102
     -> pmFPAfileWrite (pmFPAfileIO.c:340): Known programming error
        Error: file->mode != PM_FPA_MODE_INTERNAL is not true.
     -> pmFPAfileIOChecks (pmFPAfileIO.c:90): I/O error
        failed WRITE in FPA_AFTER block for PSPHOT.BACKMDL.STDEV
     -> main (ppSub.c:89): I/O error
        Unable to close files.
     Unable to perform ppSub: 2 at
     /home/panstarrs/ipp/psconfig//ipp-20101206.lin64/bin/diff_skycell.pl line 400.
    

Gene speaks: "[...] hitting the same failure which [...] i mentioned earlier this morning. let's let them fail [and fix later]"

  • 13:14 stopping processing in order to update the build. Need to wait a few minutes to let the running stacks finish.
  • 13:41 rebuild complete; processing restarted. Set label STS.20101202 to inactive temporarily until the fix is confirmed. Reverted the diff faults.
  • 14:45 set label STS.20101202 to active. Added MD02.2010.rerun to survey.dist and added the label to the distribution pantasks.
  • 15:48 (Serge) Added ippMonitor as mysql user on ippdb:
CREATE USER 'ippMonitor'@'ipp004.ifa.hawaii.edu' IDENTIFIED BY 'ippMonitor';
GRANT REPLICATION CLIENT ON *.* TO 'ippMonitor'@'ipp004.ifa.hawaii.edu';
FLUSH PRIVILEGES;

Modified ippMonitor/raw/site.php so that the czar tool can check replication status on ippdb02 (SVN 30034).
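
The REPLICATION CLIENT grant above is what allows the monitor to issue SHOW SLAVE STATUS. Below is a minimal sketch of such a check; the host and credentials are taken from the notes above, but whether the same user exists on ippdb02, and the fields and output format, are assumptions. This is not the actual site.php change.

#!/usr/bin/env python
# Minimal sketch of a replication health check of the sort czartool needs.
# Credentials mirror the grant above; everything else is assumed.
import MySQLdb
import MySQLdb.cursors

db = MySQLdb.connect(host='ippdb02', user='ippMonitor', passwd='ippMonitor',
                     cursorclass=MySQLdb.cursors.DictCursor)
cur = db.cursor()
cur.execute('SHOW SLAVE STATUS')
row = cur.fetchone()

if row is None:
    print('ippdb02: not configured as a replication slave')
else:
    for key in ('Slave_IO_Running', 'Slave_SQL_Running',
                'Seconds_Behind_Master'):
        print('%s: %s' % (key, row.get(key)))
db.close()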

Added a link to the czar log wiki pages on the czartool page (SVN 30035).

  • 17:40 Stopped processing in order to rebuild psphot and psmodules. Restarted a few minutes later. Set diff.revert.off so that we can evaluate the diff failures.

Wednesday : 2010.12.15

Bill is czar today

  • 06:41 (Bill) Got lots of images last night. czartool says 631 exposures and all have been downloaded. The MD02 rerun nightly stacks are nearly done: only 5 left out of over 5549. One is failing due to a corrupt mask file from warp 136656; I regenerated it with tools/runwarpskyfile.
  • Set label STS.20101202 to inactive while nightlyscience is busy.
  • 09:56 (Serge) Stopped chip to speed up SweetSpot data processing.
  • 11:34 (Serge) SweetSpot data not published; ran in stdscience: survey.add.publish SweetSpot.nightlyscience 5
  • 11:42 (Serge) Stopped all pantasks_server
  • ~12:15 (Bill) updated production build with changes to fix the diff problems. Tweaked the survey publish task to urge along the sweetspot data.
  • 13:30 activated label STS.20101202 and reverted all outstanding diff faults. Queued stack-stack diff runs for MD02.2010.rerun.
  • 18:00 Switched production build to ipp-20101215

  • 19:09 Had a few hiccups. For some reason magic didn't get built, so distribution and destreak fell over. Did the rebuild while things were running and got some "config files not found" faults (don't do that!). Shut everything down and restarted after a few more missing files were "fixed". Registration is turned off because Chris thinks that the system is going to want to run burntool on the images.
  • 22:13 stdscience had been stopped since 20:00 or so; not sure why. Set it back to run. Registration is off while Chris debugs some problems.
  • 23:25 Set a file that was repeatedly faulting on summit copy to fault = 110 (HTTP GONE - 300). It was an exposure taken while doing some camera testing:
update pzDownloadImfile set fault = 110 where exp_name = 'c5546g0003o' and class_id = 'ota46';
  • Set all runs with label like 'ps_ud%' to goto_cleaned since there was no outstanding pstamp activity.
  • stdscience stopped again. Set it back to run. Weird.
  • 23:55 Chris reports that he has fixed gpc1 registration and that the first exposure (exp_id == 265105) has been burntooled and queued for chip processing during the night. Registration is running now.

Thursday : 2010.12.16

  • 14:45 stdscience keeps getting set to stop. This was a problem with new code that is now fixed.
  • 15:47 ThreePi seems to be proceeding slowly, probably because of the run/stop problem. I am setting the STS.20101202 label inactive to keep those diffs from using processing power.
  • 18:00 integrated some fixes to the ppSub assertion failures and rebuilt. Thanks, Gene. Turned STS back on and boosted its priority above 3pi. We need to get the STS folks some data they have been waiting on for quite a while.

Friday : 2010.12.17

  • 04:30 (EAM) : added 2 more sets of compute2 nodes (6 total)
  • 08:20 (EAM) : pantasks/stdscience crashed around 7am; restarted it

Saturday : 2010.12.18

  • morning (EAM + CZW) : registration had some bugs related to regtool and register_imfile.pl (fixed)
  • 14:20 (EAM) : I stopped and re-started pantasks/stdscience -- it seemed a bit sluggish.

Sunday : 2010.12.19