
Monday : 2012.11.19

  • reminder for czar: ipp020, ipp014 were taken out of processing Saturday after jobs stalled on mounts. They need more attention, likely a reboot during working hours (MEH)
  • 09:50: Serge: Added SC.TEST.MOPS.PS1_DV3 to stdscience and publishing for mops tests
  • 09:50 Serge: Queued some OSS WS diffs observed a few nights ago
    difftool -dbname gpc1 -definewarpstack -good_frac 0.2 -warp_label OSS.nightlyscience \
        -data_group OSS.20121116 -stack_label ecliptic.rp -set_label SC.TEST.MOPS.PS1_DV3 \
        -set_data_group SC.TEST.MOPS.PS1_DV3.OSS -set_workdir neb://@HOST@.0/sch/SC.TEST.MOPS.PS1_DV3.OSS \
        -available -set_reduction SWEETSPOT_WS -set_dist_group NODIST -rerun -simple

Tuesday : 2012.11.20

Bill is czar tomorrow

  • 14:49 Bill restarted cleanup pantasks
  • 15:50 Bill turned off chip cleanup in the cleanup pantasks. He is doing some testing of the new code in another pantasks. Will turn it back on at the end of the day.
  • 16:07 Bill stopped processing because ipp014 and ipp020 are stuck
  • 16:23 Bill rebooted ipp020 and ipp014; restarted stdscience
  • 20:47 Bill turned chip cleanup on, stopped testing

Wednesday : 2012.11.21

Bill is czar today

  • 05:55 Set a number of our problem nodes to nebulous up all together: 16, 18, 8, 9, 10, 11, 12
  • 08:00 Well that didn't work! Set all pantasks to stop in preparation for restart and host rebalancing for staticsky.
  • 09:45 restarted stdscience. poll limit set to 80 for now. I'm not sure if the cluster has fully recovered from all of the wave 1 hangs
  • 10:02 There are a number of engineering exposures that haven't registered yet. They all seem to be pending_burntool, but aren't getting run
  • 11:16 deepstack is running LAP.ThreePi.20120706 staticsky on the compute3 nodes
  • 11:48 Serge: Stopped replication from ippdb00 to ippc63. Started copying ippc63's mysql data to ippdb04 over ssh (we don't have any other way, unfortunately). I expect it to be completed on Friday.
  • 12:46 ipp014 panicked: BUG: unable to handle kernel paging request at ffffffff800073d4
  • 14:43 Gene noticed ipp014 was wedged and sent an email to Bill who was at lunch
  • 15:10 set pantasks to stop, power cycled ipp014
  • 15:44 shut down stdscience. pkilled all of the ipp user's glockfile processes. restarted stdscience. set other pantasks to run
  • 16:26 set LAP.ThreePi.20115% staticsky distribution runs to be cleaned
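The 15:44 glockfile cleanup is a kill-by-name pattern. A minimal sketch follows; the `pkill -u ipp glockfile` form is an assumption based on the log's wording, and the demonstration kills a renamed copy of `sleep` so nothing real is touched:

```shell
#!/bin/sh
# On the cluster the step was presumably along the lines of:
#   pkill -u ipp glockfile
# Safe demonstration of the same kill-by-name pattern: run a copy of
# `sleep` under a unique name, then pkill it by exact name (-x).
tmp=$(mktemp -d)
cp /bin/sleep "$tmp/fakelockproc"
"$tmp/fakelockproc" 300 &        # stand-in for a lingering glockfile
pid=$!
sleep 1                          # let the child exec before matching its name
pkill -x fakelockproc
wait "$pid" 2>/dev/null
if kill -0 "$pid" 2>/dev/null; then echo "still running"; else echo "cleaned up"; fi
rm -rf "$tmp"
```

On the cluster, `-u ipp` scopes the kill to the ipp user's processes; `-x` avoids substring matches against unrelated commands.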

Thursday : 2012.11.22

  • 09:53 Serge: rsync to ippdb04 complete. Restarted nebulous replication on ippc63.
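The replication lag figures quoted on Friday (the "seconds behind" numbers) come from MySQL's `SHOW SLAVE STATUS`. A minimal sketch of extracting the value; the live invocation is illustrative, and the parsing step runs on a captured sample line so it is reproducible:

```shell
#!/bin/sh
# Live query would be something like (host illustrative):
#   mysql -h ippdb04 -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master
# Extract the lag from a captured sample of that output:
sample="        Seconds_Behind_Master: 248000"
lag=$(printf '%s\n' "$sample" | awk -F': ' '/Seconds_Behind_Master/ {print $2}')
echo "lag=${lag}s"
```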

Friday : 2012.11.23

Bill is czar today

  • 08:00 one of the nodes has been stuck for 60,000 seconds; stopped all pantasks
  • 08:15 rebooted ipp014, which tried to panic last night with the same fault as on Wednesday
  • 08:25 Serge: Nebulous database is now being replicated to ippdb04 (~256,000 seconds behind).
  • 08:35 set pantasks to run with ipp014 set to repair
  • 10:10 set chip.off to allocate more processing power to the 2000 warpRuns pending
  • 10:20 Serge: ippdb04 is now visible in czartool. It is slowly catching up (~248,000 seconds behind)
  • 11:13 stopped polling MPIA for new postage stamp requests. They have > 20,000 outstanding
  • 12:00 changed the fault code of 1826177 and 1862204 from 2 to 5 because it is not a file problem; should be debugged
  • 12:14 set pantasks to stop in preparation to add some debugging code to psLib and also daily restart.
  • 12:50 killed lingering ipp user glockfile processes
  • 13:00 updated build after adding a couple of new assertion tests in psVectorStats to attempt to figure out why it is asking for a histogram with numBins <= 0
  • 15:36 set ipp016 to repair and all pantasks to stop. glockfiles are hanging for files on that node. sigh
  • 15:50 looks like a false alarm. staticsky has stopped making progress because I reverted a number of runs that are faulting again.
  • 17:30 Found the problem causing many of the staticsky runs to crash. Stopped everything and rebuilt psphot.
  • 17:40 turning chip back on.
  • 20:39 Now that the staticsky runs aren't crashing as often, several of the compute 3 nodes are in swap heck. 2x psphotStack in the plane is too much. Setting to stop. Will restart with one set enabled.
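Hangs like the 08:00 one above can be spotted by process elapsed time. A sketch using procps `ps -o etimes`, with the threshold taken from the log; the filter is demonstrated on captured sample output (pids and names illustrative) so it is reproducible:

```shell
#!/bin/sh
# Live check would be:
#   ps -eo etimes=,pid=,comm= | awk '$1 > 60000 {print $2, $3}'
# Demonstrate the threshold filter on captured sample output:
sample="70000 1234 glockfile
  120 5678 ppImage"
printf '%s\n' "$sample" | awk '$1 > 60000 {print $2, $3}'
```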

Saturday : 2012.11.24

  • 03:11 Bill killed psphotStack jobs that were killing each other with memory use and restarted with one set of compute3 nodes
  • 07:37 Serge: ippdb04 has caught up during the night
  • 22:15 MEH: looks like ipp011 has given up and brought everything to a halt: not responding and cannot ssh in (ipp011-crash-20121124)
    • host off in all pantasks, neb-host down, reboot; then neb-host up, processing back online

Sunday : 2012.11.25

Bill is up and can't help but look in ...