PS1 IPP Czar Logs for the week 2013.11.18 - 2013.11.24

Monday : 2013.11.18

  • 13:00 Bill set stdscience to off in preparation for daily restart.
  • 13:10 looks like we only have two storage partitions with any space left.
  • 13:15 set distribution and pstamp to stop because I have a big query running on the ippc17 mysql database that might block operations
  • 13:44 stdscience restarted with STS.rp.2013 label out. So there is nothing to do.
    • a reminder that the LAP label is also out because of low disk space, so no LAP processing has happened for over a week.
  • 14:21 Bill stopped all pantasks to pick up a bug fix in psModules: assertion failure !isfinite(z) caused by zero axes.major in pmModelSetShape
    • 14:25 the fix is in.
  • 14:55 MEH: adding 1x compute3 into cleanup to help get through and make some space

Tuesday : 2013.11.19

Wednesday : 2013.11.20

mark is czar

  • 07:15 MEH: nightly finished
    • diffim fault 5 from new tag changes still -- diff_id= 495078 skycell.033 diff_id= 495197 skycell.072
    • summit fault on bias image -- interrupt on TDI plan, set exposure to drop
      pztool -updatepzexp -exp_name c6616g0001b -inst gpc1 -telescope ps1 -set_state drop -summit_id 671838 -dbname gpc1
      
      -- update ippdb01
      update summitExp set exp_type ='broken', imfiles=0, fault=0 where exp_name='c6616g0001b';
      
  • 10:00 MEH: deep stack processing is moving to a local deepstack pantasks with the normal 1x compute3 allocation for tag modifications... deepstack pantasks should remain stopped and compute3 in limited use unless rebalanced properly
  • 11:00 MEH: even though MD05 was observed last night, not going to be a regular field and diffims not going to be made (nor a new refstack made until a deepstack made)
    • it is there, might as well enable the last season refstack for diffims in near future
  • 19:05 MEH: ipp066 may have crashed -- yes, power cycled and back up (ipp066-20131120-crash.log)

Thursday : 2013.11.21

mark is czar

  • 15:30 MEH: restarting stdsci to enable all MD refstacks for any random field that pops up and the final return. Need to manually remove the following hosts from stdsci (and all pantasks) while mysql is in heavy use there:
    ipp011
    ipp049
    ipp050
    ipp052
    ipp061
    ipp063
    ipp065
    -- more may need to be added if have nightly data tonight and gets stuck
    
  • 16:30 LAP in update is stalling PSS requests again, with >2k chips+warps to do. Chris pointed the LAP queue ~ipp/lap/current.queue to off.queue (an empty file). Need to balance the disk space available; when finished, chips (but not warps) will be cleaned up for LAP.
    • Chris estimates ~6TB or so in warps will be used in the end
    • if no nightly data then may turn on to just be doing something, may be TDI for part of night as well, weather looking better?
  • 17:24 Bill: set sts chip runs in new state to label STS.rp.2013.hold. Added sts label to stdscience pantasks. Once the 58 exposures in camera and warp are distributed we'll recover space by cleaning up warps and chips.
  • 18:21 Bill: sts warps are all done. Chips have been distributed; set chipRuns to be cleaned. 59 warps left to distribute.
  • 18:23 added LAP.ThreePi.20130717 to stdscience
  • 18:50 MEH: Heather restarted mysql on many of the problem hosts, so adding back in. of course other nodes now looking troublesome and taking out
    ipp066
    ipp059
    ipp046
    
  • 20:30 someone is trying to kill ipp059, LAP needs chips there so chip processing is stalled
  • 22:50 MEH: guess pstamp gave up around 21:50, restarting

Friday : 2013.11.22

  • 13:30 MEH: manually balancing compute2 use while using in a local pantask for staticsky
  • 14:53 Bill: stopping stdscience for daily restart
  • 15:00 Bill: restarted stdscience, with chip.off for a while.
  • 18:15 MEH: getting dark, turning chip.on
  • 18:30 MEH: summary of compute2 manually balanced use -- ippc26-c29 should always be turned off whenever a pantasks is restarted, rest of compute2 can be used as normal
    • ippc26-c29 are part of a compute2_himem host subgroup in pantasks_hosts.input (may add to automatically turn off depending on how long processing goes)
  • 21:25 MEH: poor weather, chip.off for a bit
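
For the 18:30 note on compute2, a minimal sketch of expanding the shorthand ippc26-c29 into the explicit host names to turn off after a pantasks restart. The function name expand_host_range is hypothetical and not part of the IPP tools; the sketch only builds the list of names and does not show how the hosts are actually turned off in pantasks.

```python
import re


def expand_host_range(spec):
    """Expand a shorthand like 'ippc26-c29' into explicit host names.

    Hypothetical helper for illustration only: the prefix and zero padding
    are taken from the first host in the range.
    """
    match = re.fullmatch(r"([A-Za-z]+)(\d+)-[A-Za-z]*(\d+)", spec)
    if match is None:
        raise ValueError(f"not a host range: {spec!r}")
    prefix, first, last = match.group(1), match.group(2), match.group(3)
    width = len(first)  # keep zero padding, e.g. ipp046
    start, end = int(first), int(last)
    if end < start:
        raise ValueError(f"range runs backwards: {spec!r}")
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


# Hosts in the compute2_himem subgroup, per the 18:30 entry
himem_hosts = expand_host_range("ippc26-c29")
for host in himem_hosts:
    print(host)
```

The printed names can then be checked against the compute2_himem subgroup in pantasks_hosts.input before turning those hosts off by hand.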

Saturday : 2013.11.23

  • 15:20 MEH: doing regular restart of stdsci after large number of LAP updates finished -- compute2_himem, 1x c3, ipp066, 050, 059, 046 off
  • 22:20 MEH: Bill notes pstamp could use a restart, done along with distribution

Sunday : 2013.11.24

  • 23:45 MEH: registration stalled? ~30 exp behind -- registration needed a restart, catching up now.