PS1 IPP Czar Logs for the week 2012.03.05 - 2012.03.11

(Up to PS1 IPP Czar Logs)

Monday : 2012.03.05

  • 20:07 Bill rebuilt psModules and psphot earlier to integrate some changes to fix psphotStack faults. Restarted stdscience.

Tuesday : 2012.03.06

Bill is czar today

  • 10:15 psphotStack process on ipp060 has grown to over 90 GB. Since the node was also doing other processing, it got swapped out. Killed everything else off, removed the node from the stack and stdscience pantasks, and now it is proceeding with the extended-source analysis of the globular cluster M30. From the log:
   total of 146208 sources for 6 images
  • LAP seems to have run out of work to do. Perhaps it is waiting for the last few diffs to finish. They were bogged down because their files are on ipp060.
  • 10:35 restarted stdscience; it was confused about the state of some of the jobs.
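A runaway like the 90 GB psphotStack above can be spotted before the node starts swapping by scanning resident-set sizes. A minimal sketch; the process names, the threshold, and the `mem_hogs` helper are illustrative, not part of the IPP tooling:

```shell
#!/bin/sh
# Report processes whose resident set size exceeds a threshold (in KB).
# Reads rows in `ps -eo rss=,pid=,comm=` order on stdin.
mem_hogs() {
    threshold_kb=$1
    # $1=RSS(KB) $2=PID $3=COMMAND; print any row over the threshold.
    awk -v t="$threshold_kb" '$1 > t { printf "%s pid=%s rss=%dMB\n", $3, $2, $1/1024 }'
}

# Live use would pipe real ps output, e.g. flag anything over 90 GB:
#   ps -eo rss=,pid=,comm= | mem_hogs $((90 * 1024 * 1024))
```

Anything this prints is a candidate for the kill-and-revert treatment described in the entry above.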

Wednesday : 2012.03.07

Bill is czar today

  • 11:34 stopped processing in preparation for picking up a couple of bug fixes in psModules
  • 11:40 noticed that ipp014 went unresponsive a few hours ago. All sorts of jobs were stuck waiting for it. Gavin power cycled it.
  • 12:15 stopped all pantasks; issued pkill commands on all nodes for ppImage, psastro, pswarp, ppSub, and ppStack
  • 12:30 restarted processing
  • 14:20 summitcopy is stopped while Craig fixes the fileset list for a broken fileset
  • 14:48 summitcopy back on; the last dark is being copied
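The 12:15 cleanup sweep above can be scripted as a loop over nodes and program names. A sketch under assumptions: the node list here is a hypothetical subset of the cluster, and with `DRY_RUN=1` the function only prints what it would run instead of executing the remote pkill:

```shell
#!/bin/sh
# Sweep leftover pipeline processes off every processing node.
NODES="ipp008 ipp014 ipp060"                 # illustrative subset, not the real list
PROGS="ppImage psastro pswarp ppSub ppStack" # programs named in the 12:15 entry

sweep() {
    for node in $NODES; do
        for prog in $PROGS; do
            if [ "${DRY_RUN:-0}" = 1 ]; then
                echo "ssh $node pkill -f $prog"
            else
                ssh "$node" "pkill -f $prog"
            fi
        done
    done
}
```

Running the dry-run form first and eyeballing the command list is cheap insurance before killing jobs cluster-wide.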

Thursday : 2012.03.08

  • 07:00 Bill fixed a few XY26 instances. Killed a pswarp process that had been running for > 50000 seconds; after reverting, the warp completed.
  • 07:16 All pending LAP data is through warp; 803 stacks remain to process. Staticsky is proceeding, but slowly (2500 s average time).
  • 07:29 Set one set of wave 4 hosts to off in stdscience. Added one more set of wave 4 hosts to deepstack.

Roy is czar, but Bill has already fixed everything...thanks!

  • 11:20 Mark is grabbing ippc54-ippc63 compute3 from the stack pantasks to do some deepstack tests over the next ~12 hrs; swapping ippc62 for ippc53 as Bill has psphotStack running there.
  • 15:25 CZW Cycled power on ipp008 as it was unresponsive.
  • 16:10 Bill stopped processing to pick up a couple of bug fixes for psphotStack.
  • 16:19 Bill killed off some camera_exp processes that were hung due to ipp008 going down.
  • 16:20 Bill restarted stdscience.

Friday : 2012.03.09

Roy is czar.

  • 08:30 Roy: No data last night, czartool looks clean.
  • 11:05 Gene rebooted ipp054
  • 12:40 psphotstack ran amok on ipp061; barely managed to log in and kill it (see neb://any/gpc1/LAP.ThreePi.20110809/2012/02/01/RINGS.V3/skycell.1313.086/RINGS.V3.skycell.1313.086.stk.144806.log)
  • 17:10 ipp058 had a similar problem to ipp061: psphotstack ran amok into swap; had to reboot it.
  • 17:22 CZW I've switched the host allocation for deepstack to use the compute3 nodes instead of wave4 storage. Those are less of a nuisance when they crash.
  • 17:42 CZW I've stopped stack, stdscience, and detrend pantasks, and restarted the deepstack pantasks (but did not turn it to run) pending ipp058 being kicked.
  • 17:49 CZW ipp058 magically came back to the world of the living, I killed the offending psphotstack instance, and I've restarted processing. I have no clue if Mark's attempt at cycling the power on ipp058 will kick in, but at this point, it seems to be functioning properly again.

Saturday : 2012.03.10

  • 07:55 Bill repaired 7 bad XY26 raw files
  • 08:55 A stuck apply_burntool task running on ipp008 was blocking registration. Bill killed it, but registration was still stuck. Restarted pantasks; no help. Changed the affected chip's state to 'pending_burntool' with burntool_state = 0, and off it went.
  • afternoon Bill queued new staticsky runs. Unfortunately they happen to be in the galactic plane, and memory usage climbed very high.
  • 17:30 ippc59 died from memory exhaustion. It just came back online so maybe somebody rebooted it.
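The burntool reset used at 08:55 (and again the next morning) amounts to one state change in the database. The log does not show the gpc1 schema, so the table name (`rawImfile`) and column names below are assumptions; only the transition itself, to 'pending_burntool' with burntool_state = 0, comes from the entries above. A sketch that emits the hypothetical SQL rather than applying it:

```shell
#!/bin/sh
# Emit hypothetical SQL to reset a stuck chip so registration can retry
# burntool. Table/column names are guesses at the gpc1 schema; verify
# against the real database before running anything like this.
burntool_reset_sql() {
    exp_id=$1
    class_id=$2
    cat <<EOF
UPDATE rawImfile
   SET data_state = 'pending_burntool', burntool_state = 0
 WHERE exp_id = $exp_id AND class_id = '$class_id';
EOF
}

# e.g. the Sunday fix for exp_id 462948, chip XY03:
#   burntool_reset_sql 462948 XY03
```

Printing the statement first keeps a record of exactly what was changed, which is handy when the same fix recurs.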

Sunday : 2012.03.11

  • 07:30 Bill: Registration got stuck again, sometime around 8 pm last night. Fixed by changing exp_id 462948 XY03 from check_burntool -1 to pending_burntool 0
  • 08:00 deepstack is done with the pending work. Stopping the pantasks for now while last night's data catches up.
  • 10:51 pantasks is losing track of some jobs: I have found a couple of jobs that finished, but pantasks hadn't figured it out after 30 minutes. The pcontrol load is quite high. Stopping pantasks in preparation for a restart.
  • 14:41 added another set of wave4 hosts. There were a lot of jobs wanting to run there.
  • 15:20 Mark: tweakin' stdscience to get the MD SSdiffs out
  • 18:00 Bill queued more staticsky runs using post-new-tag stacks outside of the galactic plane. Restarted deepstack.