PS1 IPP Czar Logs for the week 2013.01.07 - 2013.01.13


Monday : 2013.01.07

  • 04:40 Gene tried power cycling ipp027 but got no response; further details pending
  • 10:00 Bill restarted the pstamp and update pantasks. The pcontrol for deepstack is using 30% of a CPU on ippc17; since the task durations are so long, this isn't too wasteful.
  • 13:17 Bill restarted the cleanup pantasks. Added the label goto_cleaned.rerun. Setting all chip and warp runs from LAP.2011% to state = goto_cleaned, label = goto_cleaned.rerun; this will cause cleanup to be rerun with the new code, which will delete the chip and warp CMFs.
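    A minimal sketch of the bulk update described above (the gpc1 table and column names here are assumptions, not the exact commands used):
      # mark the matching chip and warp runs for re-cleanup under the new label
      mysql gpc1 -e "UPDATE chipRun SET state='goto_cleaned', label='goto_cleaned.rerun' WHERE label LIKE 'LAP.2011%'"
      mysql gpc1 -e "UPDATE warpRun SET state='goto_cleaned', label='goto_cleaned.rerun' WHERE label LIKE 'LAP.2011%'"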

Tuesday : 2013.01.08

  • 03:20 MEH: looks like the registration pantasks has crashed, starting it again. Distribution also looks stuck, restarting it.
  • 07:45 MEH: MD03 exposure o6300g0570o is strange, looks like streaks -- did the observation start while still slewing? Nightly stack is in progress, will it be okay? So far it looks okay.
  • 15:00 Serge: Restarted a bunch of mysql servers on storage nodes. Made a backup of the mysql configuration files in the operational repository: file:///data/ippc18.0/home/operational_repository/current/databases
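    A rough sketch of the per-node steps (the init-script name, config file path, and use of svn for the operational repository are assumptions):
      # on each storage node: restart the mysql server
      sudo /etc/init.d/mysql restart
      # copy that node's config into the operational repository checkout and commit it
      cp /etc/my.cnf /data/ippc18.0/home/operational_repository/current/databases/my.cnf.$(hostname)
      svn commit -m "backup of storage node mysql configs" /data/ippc18.0/home/operational_repository/current/databases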
  • 15:33 Bill: fixed a bug in ipp_cleanup.pl that was causing cleanup to be run repeatedly on the same runs

Wednesday : 2013.01.09

Mark is czar (calendar loaded again through the end of the month)

  • 08:30 Serge: Stopped nebulous replication on ippdb02. Performing a backup to ippc63 (using a script that is still being tested).
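    A minimal sketch of the stop-replication-then-dump sequence (illustrative only -- the actual backup script is the one under test, and the destination path on ippc63 is an assumption):
      # on ippdb02: pause replication of the nebulous database
      mysql -e "STOP SLAVE"
      # dump nebulous and copy the dump over to ippc63
      mysqldump --single-transaction nebulous | gzip > /tmp/nebulous.sql.gz
      scp /tmp/nebulous.sql.gz ippc63:/data/ippc63.0/backups/
      # resume replication once the backup is verified
      mysql -e "START SLAVE"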

Thursday : 2013.01.10

Mark is czar

  • 10:25 Serge: stsci04.1 is full
    neb-host --volume stsci04.1 repair
    
    • Gene cleaned up some duplicate non-nebulous data from the ipp027-ipp030 rsync; setting the volume up again in neb-host
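      presumably with something like the following (the "up" state is extrapolated from the repair command above, so treat the exact syntax as an assumption):
      neb-host --volume stsci04.1 up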
  • 10:45 Serge: The neb-df figure in czartool has not been updated for about two days. I restarted nebdiskd on ippdb00.
  • 15:00 MEH: with Bill, setting LAP staticsky outside of the galactic plane to run in a separate pantasks (lap_staticsky_notgp) on ippc13 when nightly science isn't running -- i.e., after the SSdiffs finish by noon.
    • please do not load nodes for the stdscience or stack pantasks to run jobs without checking whether lap_staticsky_notgp is using them
       -- after nightly science is finished, can load 2x wave3 + 1x compute2 + 2x stare + 1x wave4
       -- 2x wave4 can always run (taking 2x wave4 from stack and 1x wave4 from stdscience; processing should be fine), for a total of 3x wave4 after nightly science
       -- should amount to 84 nodes during the day, 24 at night
      
  • 15:40 Serge: nebdiskd now relies on ssh+df instead of df-ing NFS-mounted partitions: when a host was down, nebdiskd got stuck on the still-mounted but unreachable partition (e.g. the "flat signal" observed in the czartool available-space figure since ipp027 died two days ago). Restarted nebdiskd. Changes committed to the tag (rev: 34906) and then merged into trunk (rev: 34907).
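    A minimal illustration of the change (not the nebdiskd code itself; the host and paths are examples): df against an NFS mount from a dead host blocks indefinitely, while ssh with a connect timeout fails fast so the host can be skipped.
      # old approach: blocks if the exporting host is down
      df -P /data/ipp027.0
      # new approach: ask the host itself for its disk usage, with a bounded wait
      ssh -o ConnectTimeout=10 ipp027 'df -P /export/ipp027.0' || echo "ipp027 unreachable, skipping"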
  • 15:55 MEH: pausing cleanup for a while so ippdb02 can catch up
  • 17:05 MEH: turning off wave3, compute2, and stare in lap_staticsky_notgp so the longer 1-1.5 hr jobs clear before nightly science
    • taking wave4 from the stack pantasks (it should still be able to keep up with nightly science) and keeping 1x wave4 running in lap_staticsky_notgp
  • 20:00 MEH: looks like poor weather; working on queuing up WS diffims for MOPS fields 160.3PI.00.E-4.K.z, 160.3PI.00.E-4.K.y
    • queued manually with difftool; without something specific like -comment it would pick up the visit 2 warps as well. difftool also took a good 3-5 minutes to work out which to queue. Used the commands:
      difftool -dbname gpc1 -definewarpstack -good_frac 0.2 -data_group ThreePi.20130110 -stack_label LAP.ThreePi.20120706 -set_dist_group NULL -set_label ThreePi.nightlyscience.20130110_mops_ws -set_data_group ThreePi.20130110.mops_ws -set_workdir neb://@HOST@.0/gpc1/ThreePi.nt.20130110_mops_ws -simple -comment "160.3PI.00.E-4.K.z%" -filter z.00000 -pretend
      
      difftool -dbname gpc1 -definewarpstack -good_frac 0.2 -data_group ThreePi.20130110 -stack_label LAP.ThreePi.20120706 -set_dist_group NULL -set_label ThreePi.nightlyscience.20130110_mops_ws -set_data_group ThreePi.20130110.mops_ws -set_workdir neb://@HOST@.0/gpc1/ThreePi.nt.20130110_mops_ws -simple -comment "160.3PI.00.E-4.K.y%" -filter y.00000 -pretend
      
    • yes, needs -set_reduction SWEETSPOT so DIFF_PSPHOT is set with DV3 and TRAIL... so -rerun with label ThreePi.nightlyscience.20130110_mops_ws_ss
    • publish to TEST-2? waiting on a response to the email back to Larry -- yes, TEST-2
      pubtool -dbname gpc1 -definerun -label ThreePi.nightlyscience.20130110_mops_ws_ss -set_label ThreePi.nightlyscience.20130110_mops_ws_ss -client_id 12 -simple -pretend
      

Friday : 2013.01.11

  • 08:30 MEH: there were more obs of 160.3PI.00.E+3.K.z, 160.3PI.00.E+3.K.y last night, so MOPS will likely want those processed as well, but maybe wait until they look at the first set -- asked to run, running now @10:00
    • also, since these are still tests, maybe don't want -good_frac 0.2; removed it for this round -- attempts ~10 additional skycells, ~half of which have quality issues.
      difftool -dbname gpc1 -definewarpstack  -data_group ThreePi.20130111 -stack_label LAP.ThreePi.20120706 -set_dist_group NULL -set_label ThreePi.nightlyscience.20130111_mops_ws_ss -set_data_group ThreePi.20130111.mops_ws_ss -set_workdir neb://@HOST@.0/gpc1/ThreePi.nt.20130111_mops_ws_ss -simple -comment "160.3PI.00.E+3.K.z%" -set_reduction SWEETSPOT -filter z.00000 -rerun -pretend
      
      difftool -dbname gpc1 -definewarpstack  -data_group ThreePi.20130111 -stack_label LAP.ThreePi.20120706 -set_dist_group NULL -set_label ThreePi.nightlyscience.20130111_mops_ws_ss -set_data_group ThreePi.20130111.mops_ws_ss -set_workdir neb://@HOST@.0/gpc1/ThreePi.nt.20130111_mops_ws_ss -simple -comment "160.3PI.00.E+3.K.y%" -set_reduction SWEETSPOT -filter y.00000 -rerun -pretend
      
      
      pubtool -dbname gpc1 -definerun -label ThreePi.nightlyscience.20130111_mops_ws_ss -set_label ThreePi.nightlyscience.20130111_mops_ws_ss -client_id 12 -simple -pretend
      
    • finished and published @11:15
  • 08:45 MEH: ippdb02 caught up, cleanup turned on again
  • 08:50 MEH: at noon, after the SSdiffs are done, can add extra nodes to lap_staticsky_notgp -- tweaked to run early, finished @09:30
    • set up 2x stare to run first and removed the 8x stare from stdscience to see how much the rate changes -- looks like it can reach 40/hr with 2x wave4 + 2x stare before noon
    • after the MOPS WS diffs are done, will add 1x compute2 + 2x wave3 + 1x more wave4 and turn stdscience off (skycal won't run, but those are fast and will catch up in the evening)
  • 12:00 MEH: add 1x compute2 + 2x wave3 + 1x more wave4 to lap_staticsky_notgp and turn them off in stdscience (when turning them back on, remember to also turn off the ignore_xxxx entries, as below); this leaves ~110 nodes for any non-nightly processing like skycal or MOPS diffims -- probably enough for the MD SSdiffs, which could then be done earlier.
    hosts off compute2
    hosts off compute2
    hosts off compute2
    hosts off compute2
    
    hosts off wave3
    hosts off wave3
    hosts off wave3
    hosts off wave3
    
    hosts off wave4
    hosts off wave4
    hosts off wave4
    hosts off wave4
    hosts off wave4
    
  • 17:00 MEH: reminder to remove nodes from lap_staticsky_notgp and re-enable in stdscience -- can take ~1-1.5 hrs for some to clear
    -- lap_staticsky_notgp
    hosts off compute2
    hosts off wave3
    hosts off wave3
    hosts off wave4
    
    -- stdscience
    hosts on compute2
    hosts on compute2
    hosts on compute2
    hosts on compute2
    
    hosts on wave3
    hosts on wave3
    hosts on wave3
    hosts on wave3
    
    hosts on wave4
    hosts on wave4
    hosts on wave4
    hosts on wave4
    hosts on wave4
    
    hosts off ignore_compute2
    hosts off ignore_compute2
    hosts off ignore_compute2
    hosts off ignore_compute2
    hosts off ignore_wave3
    hosts off ignore_wave3
    hosts off ignore_wave3
    hosts off ignore_wave3
    hosts off ignore_wave4
    hosts off ignore_wave4
    hosts off ignore_wave4
    hosts off ignore_wave4
    hosts off ignore_wave4
    

Saturday : 2013.01.12

  • 17:30 MEH: Larry asked if the excessively large postage stamp job under WEB.UP could be paused to get the MOPS stamp requests through. Removed the WEB.UP label for now.
  • 21:55 more MOPS requests again and still many WEB.UP stamps to do, so set the WEB.UP priority from 500 -> 494 so it runs below MOPS/MOPS.2

Sunday : 2013.01.13