PS1 IPP Czar Logs for the week 2012.02.13 - 2012.02.19

Monday : 2012.02.13

  • 10:30 Bill restarted distribution. The task was hung up. pantasks_server thought the maximum of five jobs were running, but pcontrol had no jobs in its queue.

Tuesday : 2012.02.14

  • 20:02 Bill restarted stdscience and distribution. Earlier today he also restarted pstamp and update.

Wednesday : 2012.02.15

  • 22:35 Bill set stdscience and distribution to stop in preparation for restart.
  • 22:45 restarted stdscience and distribution

Thursday : 2012.02.16

Bill is czar today

  • 09:53 stdscience is a bit bogged down due to backlog of jobs wanting ipp015. Setting LAP and STS labels to inactive to stop their chipRun processing until the situation improves
  • 10:05 ThreePi chips are done. Turning labels back to active and setting
  • 10:56 ThreePi warps are done. All diffSkyfiles are in the queue. Turning chip.on
  • 16:34 Load spiked on ipp056 and ipp058 and both systems became unresponsive. Power cycled them. There was something on the console that looked like a machine code stack trace, but it wasn't recognizable.
  • 16:45 stdscience set to stop in preparation for the daily restart
  • 18:00 restarted stdscience

Friday : 2012.02.17

Bill is czar today

  • 10:40 ipp041 went down. It is preventing all jobs in the queue from completing. Setting all pantasks to stop for the time being. Ran "neb-host ipp041 down"
  • 11:04 all pantasks servers shut down. Lingering jobs pkilled. Restarting stdscience with reverts turned off.
  • 11:22 all outstanding jobs failed. We seem to be having nebulous problems. Cannot connect to the nebulous database server. Serge is investigating.
  • 12:32 We restarted the web servers, but several of them lied to us and did not restart. That is fixed now. Processing is running with reverts off. Many jobs need data on ipp041, so reverting continuously seems pointless for now.
  • 12:35 ran pztool -clearcommonfaults to revert summit copy failures due to the nebulous outage. All science exposures are now downloaded and proceeding with burntool
  • 16:10 Heather found a bad detrend (non-existent) with ipp041 down. Chris repaired it.
  • 16:40 Mark: tweak_ssdiff to run at 1700 for the MD03,06 stacks
  • 16:49 Bill: STS.PP5 is done. Removed labels
  • 17:42 CZW: Noticed that despite numerous attempts at queuing diffs, two exposures in LAP were not successfully diffing. I dropped the exposures (153451 & 153452) and will look at the reason next week (I suspect it was caused by the assumed exposure pair not having the correct overlap).

Saturday : 2012.02.18

  • 09:15 Mark: kicking LAP a bit. 16 or so warps holding up stacking. Odd alloc and io faults cleared after a couple of reverts.
  • 09:40 finding LAP warps not getting set to full, setting with
    warptool -dbname gpc1 -tofullskyfile -warp_id 370999 -skycell_id skycell.0778.009
    warptool -dbname gpc1 -tofullskyfile -warp_id 371004 -skycell_id skycell.0777.009
    warptool -dbname gpc1 -tofullskyfile -warp_id 371029 -skycell_id skycell.0776.049
  • 10:00 16 LAP warps in update don't seem to belong to any run, setting goto_cleaned.
  • 14:00 LAP moving but 1 stack repeatedly faulting stalling lap_id 3090, setting qual 42
    stacktool -dbname gpc1 -updatesumskyfile -stack_id 749702 -set_quality 42
  • 14:30 LAP stalled by six ~60ks jobs of pswarp on ipp015... cleared..
  • 18:40 LAP needed chips stuck in goto_cleaned state while data_state full (and NOT the ones set goto_cleaned earlier). Looks like cleanup is off? Leaving it off and resetting needed warps back to full for LAP to finish.
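
The three warptool calls in the 09:40 entry differ only in warp_id and skycell_id, so the same fix could be driven from a list. The following is a minimal sketch, not the procedure actually used: it reuses only the warptool flags shown in that entry, and the DRY_RUN switch and the set_warps_full function name are hypothetical additions for illustration. By default it only prints the commands.

```shell
#!/bin/sh
# warp_id:skycell_id pairs copied from the 09:40 entry.
PAIRS="370999:skycell.0778.009
371004:skycell.0777.009
371029:skycell.0776.049"

# DRY_RUN is a hypothetical switch for this sketch, not a warptool option.
# 1 (default) prints the commands; 0 executes them.
DRY_RUN="${DRY_RUN:-1}"

set_warps_full() {
    for pair in $PAIRS; do
        warp_id="${pair%%:*}"      # text before the colon
        skycell_id="${pair#*:}"    # text after the colon
        if [ "$DRY_RUN" = "1" ]; then
            echo warptool -dbname gpc1 -tofullskyfile \
                -warp_id "$warp_id" -skycell_id "$skycell_id"
        else
            warptool -dbname gpc1 -tofullskyfile \
                -warp_id "$warp_id" -skycell_id "$skycell_id"
        fi
    done
}

set_warps_full
```

Reviewing the printed commands before setting DRY_RUN=0 keeps a mistyped warp_id from reaching the gpc1 database.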

Sunday : 2012.02.19

  • 16:12 Bill checked on the postage stamp server. Found that several jobs were not progressing. Suspected that they were waiting for data to be cleaned up (only to be immediately updated again). Yep, cleanup pantasks is stopped. Set the warpRuns back to update and the jobs finished. They are probably runs containing the recent supernova. Restarted cleanup pantasks.