PS1 IPP Czar Logs for the week 2012.01.02 - 2012.01.08

Monday : 2012.01.02

Bill is czar today

  • 09:00 Got about 350 exposures last night. Nightly science has finished the 3Pi diffs and is working on the MD05 stacks. LAP is proceeding.
  • 09:15 Restarted the stdscience, pstamp, and update pantasks. The pcontrols were spinning.
  • 21:45 I've got a bad feeling about this.... ipp060 panic'd. Faults all over czartool. A really surprisingly high number.
  • 22:00 Stopped replication pantasks. Doing file stuff during periods like this seems like a bad plan.

Tuesday : 2012.01.03

  • 14:30 Mark: looking into and fixing some out-of-order cleaned warps in LAP that were causing it to stall. Many runs overlap, so the out-of-order cleaned warps were holding up all final stacks. Fixed with a dirty one-liner that writes a batch file for each LAP id, repeated until all necessary warps were set to update (Chris probably has a better command).
    laptool -listrun -seq_id 8 -state run -dbname gpc1 -simple
    laptool -exposures -simple -dbname gpc1 -lap_id 2073
    laptool -exposures -simple -dbname gpc1 -lap_id 2073 | cut -d " " -f26,27 | grep cleaned | awk '{print "warptool -dbname gpc1 -setskyfiletoupdate -set_label LAP.ThreePi.20110809 -warp_id "$1}' > fixwarpupdate_lap2073.bat
    ...
    
  • 15:00 Some chip updates appear to be having trouble too; looking into it. There was a conflict between the postage stamp update/clean and the LAP chips, for example chip_id=362588:
    2234 RINGS.V3 skycell.2341 z.00000 run 2012-01-02T00:33:06.000000 0 LAP.ThreePi.20110809 LAP.ThreePi 250043 362588 362592 F T T to_magic update 0 60 339462 full 0 1 321786 full 323487 update 9 75 -1 198849 cleaned 7 63
    
    select chip_id,state,label,data_state,fault from chipRun join chipProcessedImfile using (chip_id) where chip_id=362588;
    |  362588 | update | ps_ud_WEB.UP | cleaned    |     0 | 
    |  362588 | update | ps_ud_WEB.UP | full       |     0 | 
    
  • 16:00 Restarted update but left it stopped until the LAP chips needing update were fixed. Reset the chip_ids to goto_cleaned and then set them to update with -setimfiletoupdate. The chips updated okay, as did the remaining warps. 800+ final stacks running. update pantasks set back to run. Overloading stack pantasks with wave4 for a couple of hours; will remove before nightly science starts.
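
The 14:30 one-liner can be wrapped so it runs once per LAP id. This is only a sketch: the laptool and warptool flags are copied from the entry above, while the function names are made up here. It also assumes the columns are whitespace-separated with warp_id in field 26 and the warp state in field 27, and it matches the state exactly instead of using grep, so it is slightly stricter than the original.

```shell
#!/bin/sh
# Sketch only. Reads `laptool -exposures -simple` rows on stdin and prints a
# warptool command for every warp whose state is "cleaned".
# Assumption: warp_id is field 26 and the warp state is field 27, as implied
# by the cut/grep/awk one-liner in the log.
warp_update_cmds() {
    awk '$27 == "cleaned" {
        print "warptool -dbname gpc1 -setskyfiletoupdate -set_label LAP.ThreePi.20110809 -warp_id " $26
    }'
}

# Hypothetical driver: write one batch file per LAP id given as an argument.
write_lap_batches() {
    for lap_id in "$@"; do
        laptool -exposures -simple -dbname gpc1 -lap_id "$lap_id" \
            | warp_update_cmds > "fixwarpupdate_lap${lap_id}.bat"
    done
}
```

For example, `write_lap_batches 2073` would produce fixwarpupdate_lap2073.bat; the list of LAP ids still has to come from the `laptool -listrun` call.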

Wednesday : 2012.01.04

  • 07:30 Mark: LAP appears to be stuck again as before, with at least some warps cleaned that it expects to be updated.
  • 09:20 Fixing during Ken's talk. Adding extra wave4 to stack pantasks again (this seems to increase the number of type 4 faults, however).
  • 09:50 LAP is again running up against pstamp update/cleans:
    |  362027 | update | ps_ud_WEB.UP | cleaned    |     0 | 
    |  362027 | update | ps_ud_WEB.UP | full       |     0 | 
    
  • 10:30 trouble chips
    | chip_id | state | label                | data_state | fault | quality | class_id |
    |  322711 | update | LAP.ThreePi.20110809 | update     |     2 |       0 | XY26     | 
    --> extn in chip, mv to .bad and cp'd good instance
    
    |  322715 | update | LAP.ThreePi.20110809 | update     |     0 |      42 | XY26     | 
    --> also extn problem in chip, mv'd .bad and cp'd good instance. however, couldn't get quality back to 0, just wanted to change fault 
    chiptool -updateprocessedimfile -set_quality 0 -chip_id 322715 -class_id XY26 -dbname gpc1 -fault 0
    --> manually set to full
    
  • 11:00 Seem to have ended up with extra warps set to update that LAP doesn't need... waiting until LAP stalls again to clean up.
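
The pstamp/LAP conflicts on Tuesday and Wednesday both showed up as a single chip_id with imfiles in different data_state values (cleaned and full). A rough way to spot these in the query output is sketched below. It assumes the table-style rows shown above (chip_id, state, label, data_state, fault), and the function name is made up here.

```shell
#!/bin/sh
# Sketch only. Reads mysql table-style rows of
#   | chip_id | state | label | data_state | fault |
# on stdin and prints each chip_id that appears with more than one data_state.
mixed_state_chips() {
    awk -F'|' '
        # Keep only data rows: the chip_id column must be purely numeric.
        NF >= 6 && $2 ~ /^[ ]*[0-9]+[ ]*$/ {
            id = $2; ds = $5
            gsub(/ /, "", id); gsub(/ /, "", ds)
            if (!((id, ds) in seen)) { seen[id, ds] = 1; count[id]++ }
        }
        END { for (id in count) if (count[id] > 1) print id }
    ' | sort -n
}
```

Piping the output of the chipRun/chipProcessedImfile select into `mixed_state_chips` would list the chips that need the goto_cleaned and -setimfiletoupdate treatment.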

Thursday : 2012.01.05

  • 10:00 CZW: Launched SweetSpot WSdiffs manually. This seems to be fairly easy to detect automatically, so this should probably be automated within the next week or two. The command used was:
     difftool -definewarpstack -good_frac 0.2 -warp_label MSS.nightlyscience -stack_label MSS.nightlyscience -set_label MSS.nightlyscience -set_workdir neb://@HOST@.0/gpc1/MSS.nightlyscience/2012/01/05/ -available -set_reduction WARPSTACK -set_dist_group SweetSpot -rerun -dbname gpc1
    
  • 10:01 CZW: Kicked LAP. Three runs were waiting on four chipRuns to be updated, and although the run was marked for update, none of the individual chipImfiles were. I re-updated these chipRuns using my script (~watersc1/bin/run_update_for_chip_id.pl) and these now seem to be running (which should in turn get LAP processing to complete for these lapRuns).
  • 10:16 CZW: We had 116 warpRuns set to update that weren't needed. I've sent these all to cleanup.
  • 16:52 CZW: Restarted distribution pantasks server.
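
As a first step toward automating the 10:00 WSdiff launch, the command can be generated with the date filled into the workdir. This is a sketch that only prints the command; all flags are copied from the entry above. The function name is made up, and defaulting to the current UTC date is an assumption, since the log does not say which day boundary the workdir uses.

```shell
#!/bin/sh
# Sketch only. Prints (does not run) the difftool command from the 10:00 entry
# with the date part of the workdir filled in.
# Argument: date as YYYY/MM/DD; defaults to today's UTC date (an assumption).
mss_wsdiff_cmd() {
    day=${1:-$(date -u +%Y/%m/%d)}
    printf '%s\n' "difftool -definewarpstack -good_frac 0.2 -warp_label MSS.nightlyscience -stack_label MSS.nightlyscience -set_label MSS.nightlyscience -set_workdir neb://@HOST@.0/gpc1/MSS.nightlyscience/${day}/ -available -set_reduction WARPSTACK -set_dist_group SweetSpot -rerun -dbname gpc1"
}
```

Running `mss_wsdiff_cmd 2012/01/05` reproduces the command used that morning. Detecting when the launch is needed is not covered here.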

Friday : 2012.01.06

Saturday : 2012.01.07

Bill is minding things while watching NFL playoffs

  • 10:45 Bill restarted stdscience pantasks. Warps had fallen behind.
  • 12:20 warp still behind. Stopped chip processing for a bit.
  • 12:40 It looks like lots of skycells want to put files on ipp058, which is the source of the backlog. ipp058 was off in stdscience. Set it to on.
  • 12:57 That worked. chip back on.
  • 17:02 46 warpRuns were blocked because the underlying chips have been cleaned. Some database editing fixed that.
  • 19:06 Tonight's observations have started. Just for grins I've set the LAP priority to 400 to make it equal. I'll turn it down once this batch finishes.
  • 19:48 LAP priority reduced to nominal (200)
  • 20:40 processing is getting backed up a bit because hosts are waiting for ipp064 which is off. Turned it on in stdscience. Discovered that the /local/ipp directory has not been initialized since the rebuild of the raid. Did that.
  • 22:00 noticed that registration wasn't picking up the latest exposures. Decided to restart summit copy and registration

Sunday : 2012.01.08

Bill is keeping his eye on things again today.

  • 09:00 Fixed several stalled chips due to XY26 and psphot crash problems
  • 09:30 The ipp user is running a script on ipp062 that is rebuilding some M31 chip_bg files from ota XY26. It serially does: fix bad instance, rerun chip, rerun chip_bg, revert distComponent. Takes a couple of minutes per chip. There are about 1100 to do.
  • 16:25 chip completion rate has dropped significantly. Restarted stdscience pantasks
  • 18:00 restarted distribution pantasks (pcontrol load was spiked)