PS1 IPP Czar Logs for the week 2013.02.11 - 2013.02.17


Monday : 2013.02.11

  • 11:20 Bill queued two M31 chip runs as a test.
  • 11:30 Bill: fixed burntool table for o5257g0058o.ota22.fits
  • 14:15 Bill: restarted stack after fixing typo in the input file
  • 14:20 CZW: stopping processing to make a database change to allow testing of the new background code.
  • 15:56 CZW: processing up again.

Tuesday : 2013.02.12

  • 14:45 CZW: Added czw.bkg_test.ipp label to stack pantasks to check differences in stacking between the current working tag and the current trunk.

Wednesday : 2013.02.13

Thursday : 2013.02.14

  • 09:30 Bill: The cluster seems to have become somewhat wedged in the past half hour. Setting stdscience to stop for a bit.
  • 09:38 The programs that are stuck are all trying to write files to /data/ipp063. Set neb-host ipp063 to repair and restarted rpc.statd twice.
  • 09:53 Several programs are also stuck writing to ipp007; set it to repair as well.
  • 10:12 rebooting ipp063. Restarting stdscience
  • 10:18 neb-host ipp063 up; ipp007 still in repair
  • 10:27 neb-host ipp007 up
  • 10:54 and now ipp041 is being a problem child. setting to repair
  • 11:12 ipp007 is hanging replicate requests. Setting it to repair until nightly science finishes. This unclogged the queue and let the nightly chips finish. Setting chip off.
  • 11:36 fixed a slew of broken instances
  • 13:04 Heather turned chip back on
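The repair/up cycle above (ipp063, ipp007, ipp041) follows the same pattern each time. A minimal dry-run sketch of that pattern — the `repair` and `up` states are exactly those used in the log, but the helper function is hypothetical and the `echo` prefix keeps it from acting on anything:

```shell
# Dry-run sketch: take a misbehaving Nebulous storage host out of
# write rotation, clear it, then return it to service.  Drop the
# leading "echo" on the neb-host lines to actually issue the commands.
cycle_host() {
    host="$1"
    echo neb-host "$host" repair    # stop new writes while it hangs
    echo "... reboot or clear hung writers on $host here ..."
    echo neb-host "$host" up        # return the host to full service
}

cycle_host ipp063
cycle_host ipp007
```

This only prints the intended command sequence, which is useful for pasting into a czar log entry before running the real thing.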

Friday : 2013.02.15

Mark is czar

  • 08:00 MEH: nightly science looks finished; stdscience needs its regular restart and will do it before SSdiffs get started
  • 08:20 restarting gmond on ipp018, 020, 057, 063 so the ganglia memory report is correct
  • 10:00 czar page hanging because of a replication problem on ? -- ippdb02 is out of disk space, db00 <70G -- stopped all processing except pstamp/update; deactivated the LAP label to finish the SSdiffims from last night in stdscience
  • 12:30 Serge cleaned up some space to last a couple months (see ipp-dev email), but disk upgrades very much needed. Processing restarted.
  • 14:00 MEH: restarted mysql on ipp010, 012, 026, 057, 058; found them down while repairing LAP chips
  • 14:15 MEH: catching up on the pileup of LAP warps while fixing warps
  • 14:30 MEH: working on clearing out the LAP runs languishing over the past week -- the bad case of many rejections and then no sources (likely related to the problem Chris is looking into) -- so just set them to quality 42, fault 5:
    stacktool -dbname gpc1 -updatesumskyfile  -fault 5 -set_quality 42  -stack_id 2003865
    stacktool -dbname gpc1 -updatesumskyfile  -fault 5 -set_quality 42  -stack_id 2007217
    stacktool -dbname gpc1 -updatesumskyfile  -fault 5 -set_quality 42  -stack_id 2007899
    stacktool -dbname gpc1 -updatesumskyfile  -fault 5 -set_quality 42  -stack_id 2010296
    stacktool -dbname gpc1 -updatesumskyfile  -fault 5 -set_quality 42  -stack_id 2015989
    stacktool -dbname gpc1 -updatesumskyfile  -fault 5 -set_quality 42  -stack_id 2017261
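The six stacktool invocations above differ only in stack_id; a small loop — a dry-run sketch, with the flags copied verbatim from the commands above and an `echo` prefix so nothing is actually changed — generates the same command set:

```shell
# Stack IDs of the languishing LAP runs from the log above.
STACK_IDS="2003865 2007217 2007899 2010296 2015989 2017261"

# Print the quality-42/fault-5 stacktool command for each stack;
# remove the leading "echo" to actually run them against gpc1.
for stack_id in $STACK_IDS; do
    echo stacktool -dbname gpc1 -updatesumskyfile \
         -fault 5 -set_quality 42 -stack_id "$stack_id"
done
```

Printing first and running second is a cheap safeguard when a mistyped flag would mark the wrong stacks bad in the database.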

Saturday : 2013.02.16

  • 07:55 MEH: looks like stdscience crashed @0404; restarting
    [2013-02-16 04:04:05] pantasks_server[29418]: segfault at af88f18 ip 0000000000407d98 sp 0000000041f0ef50 error 4 in pantasks_server[400000+13000]
  • 08:05 MEH: looks like ipp014 had a kernel panic @0330 (ipp014_crash-20130216). Power cycled and back up. Will again push through the pile of warps that will show up for a while and fix a few chips.
  • 09:40 MEH: restarting update to reset the 1900 faults
  • 11:30 MEH: looks like LAP update, pstamp/update, and cleanup got mixed up again on a few LAP runs for chip/warp early in the week. Resetting them to update and moving them into stack now.

Sunday : 2013.02.17

  • 08:00 MEH: stdscience needs its regular restart to keep the LAP rate up