Monday : 2012.12.17

Mark is czar

  • 06:50 Bill added hosts to deepstack 1 x (wave3, wave4, stare, compute2)
  • 07:35 MEH: registration pantasks seems to be busy for no reason on ipp052, restarting. Not at 100% but still above 50% CPU; haven't noticed this before
  • 08:00 MEH: ippc18 home disk <30GB again, rsync'ing the pantasks logs to the Archive link (to ippc18.1) and compressing them through November
  • 09:18 Bill: outstanding staticsky runs have completed except for 27 with faults to be investigated. None of these are in the first psps region. Stopping deepstack. Will restart with 1 x compute3 to work on the galactic center
  • 12:05 Bill: added one set of wave3 and one set of wave4 to deep stack

Bill is taking over since Mark is leaving town.

  • 16:44 ipp010's swap space has been ramping up. Set to repair in nebulous and restarted rpc.statd. mysqld is pretty busy there as well.
    • MEH: not really ramping more than in the past day, suspect it depends on the cached mysql memory. Interesting that restarting rpc.statd twice didn't clear the swap used; maybe it wasn't orphaned rpc.statd processes?
    • MEH: don't forget to run /etc/init.d/gmond restart to clear the purple memory plot in ganglia
  • 16:55 decided to reboot ipp010 then set it back to up in nebulous
  • 17:45 or so set wave3 nodes in deepstack to off.
  • 19:40 MEH: home disk ippc18.0 back >150GB, a few more Archive directories bzip'ing in registration to free up space on ippc18.1
  • 20:00 MEH: looks like wave4 is still in active use in deepstack, turning off in stdscience.
  • 20:05 MEH: appears wave3 still in use in deepstack from the long runtimes? no, looks like they are still on in deepstack.. turning off in stdscience as well until Bill clears them.
  • 20:20 MEH: MD02,03.GR0 V2->V3 chip-warp processing wrap-up setup to run when system is idle.
  • 20:25 MEH: ipp010 isn't doing well with normal processing, lost mounts and stalling things.. putting back to repair
  • 20:35 ipp010 caught up, taking one instance out of processing and leaving in repair unless someone else wants to watch it
  • 20:49 Bill: set wave3 to off in deepstack since the nodes are being used elsewhere. Killed the swapping psphotStack jobs except for the one on ipp050, which is doing detection efficiency and so is *almost* done; trying to let it finish.
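
The free-space checks on the ippc18 home disk noted above (below 30GB at 08:00, back above 150GB at 19:40) could be scripted roughly as in this minimal sketch. The function names, the use of "." as a stand-in path, and the decimal-GB convention are all assumptions for illustration; none of this is part of the IPP tooling.

```python
import shutil

GB = 10**9  # assumption: decimal gigabytes; the log does not say which convention is meant


def is_low_on_space(free_bytes, threshold_gb=30):
    """Return True when free_bytes is below threshold_gb gigabytes."""
    return free_bytes < threshold_gb * GB


def check_path(path, threshold_gb=30):
    """Look up free space on the filesystem holding path and compare to the threshold."""
    free = shutil.disk_usage(path).free
    return is_low_on_space(free, threshold_gb)


if __name__ == "__main__":
    # "." is a hypothetical stand-in for the home disk mount point.
    print("low on space:", check_path("."))
```

The threshold test is kept separate from the filesystem lookup so it can be checked against known byte counts without touching a real disk.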

Tuesday : 2012.12.18

Bill is sort of the czar today

  • 05:20 MEH: succeeded(?) in glock stalling to ipp062 for a good 20 mins from other nodes with MD02 chip (some warp) stage and CZW SAS processing, not sure what pushed it over. Seems okay again.
  • 06:30 MEH: ipp010 should have mysql restarted after its reboot: sudo /etc/init.d/mysql start
  • 07:30 MEH: MD02,03.GR0 now also running stacks in stack pantasks when no other stacks to do
  • sometime copied a valid raw imfile on top of a broken instance of one of the MD03 raw images
  • 14:53 set deepstack pantasks to stop in preparation for creation of new tag
  • 16:20 Restarted all pantasks except deepstack with ipp-20121218. Letting the 26 deepstack jobs already running finish before restarting. Current average job time is 3 hours and rising
  • 16:45 Queued ps_ud_% label data for cleanup. In the cleanup pantasks, executed
    set.check.all.components
    
  • ... this will cause cleanup to go back and clean up all components, even those already in data_state = cleaned. This has the effect of deleting chip and warp stage cmf files as well as picking up any components that weren't available in previous cleanups.
  • 18:48 Performed some label and ptolemy.rc wizardry and started up a *second* deepstack pantasks. The old one has 3 jobs still running; the labels were changed for those.
  • 19:41 all done. All of the IPP is running the new tag.

Wednesday : 2012.12.19

  • 06:20 MEH: chip.revert off while looking into some repeat faults
  • 06:43 Bill: warp.revert.off while looking into pswarp problem: 858 successes versus 30737 faults since restart
  • 14:32 Bill: found a fix to the pswarp bug. It was probably something hidden by testing with -O0 builds. Modified ppStack to not perform the median background model construction, since the code isn't ready to handle input warps without background models
  • 21:16 Bill: all but 6 of the psphotStack jobs in deepstack pantasks have finished. The last few are proceeding but have been running for up to 10 hours. Don't want to waste this work. Modified ~ipp/deepstack/ptolemy.rc to set the port back to PANTASKS_SERVER_PORT 2030:2039. Started a new deepstack pantasks with this setting. Set c37 c45 c43 c38 c58 and c60 to off. Those are the nodes still running in the previous pantasks, which is using PANTASKS_SERVER_PORT 2032:2039. Connecting with pantasks client using the current ptolemy.rc will connect to the new one. Use -D PANTASKS_SERVER_PORT 2032 to connect to the old one. The sky_ids still running in the first pantasks_server have had their labels changed to LAP.ThreePi.20120706.hide. After they finish they should be set back to the nominal value. They are the only runs with this label. At that point the hosts can be turned on.
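
For scale, the pswarp counts in the 06:43 entry (858 successes versus 30737 faults since restart) work out to a success rate of under 3%. A minimal sketch of the arithmetic follows; the function name is hypothetical and the counts are taken directly from the log entry.

```python
def success_fraction(successes, faults):
    """Fraction of completed jobs that succeeded, out of successes plus faults."""
    total = successes + faults
    if total == 0:
        raise ValueError("no jobs recorded")
    return successes / total


if __name__ == "__main__":
    # Counts from the 06:43 entry: 858 successes, 30737 faults.
    frac = success_fraction(858, 30737)
    print(f"warp success rate since restart: {frac:.1%}")
```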

Thursday : 2012.12.20

  • 20:07 Bill, checking in from CA: shut down the second deepstack pantasks. The job finished. In the running deepstack, 94 skycells finished since last night with an average task time of 16,000 seconds.
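
To compare with the "3 hours and rising" average job time noted on Tuesday, the 16,000-second average can be converted to hours, and the total task time for the 94 skycells estimated. This is a rough sketch only: the function names are hypothetical, and the total assumes the quoted average applies to all 94 skycells.

```python
SECONDS_PER_HOUR = 3600


def seconds_to_hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / SECONDS_PER_HOUR


def total_task_seconds(n_tasks, avg_seconds):
    """Total task time, assuming every task took the average."""
    return n_tasks * avg_seconds


if __name__ == "__main__":
    avg_s = 16_000   # average task time from the 20:07 entry
    n_skycells = 94  # skycells finished since last night
    total_s = total_task_seconds(n_skycells, avg_s)
    print(f"average task time: {seconds_to_hours(avg_s):.2f} hours")
    print(f"total task time:   {seconds_to_hours(total_s):.1f} hours")
```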

Friday : 2012.12.21

  • 05:30 MEH: looks like wave4 nodes will be used for restacks, which will need more time to run. Label MD06.refstack will show up shortly, and the wave4 group will be taken from the normal stack pantasks in the next day or so, leaving it ~60 nodes, which was sufficient for nightly science.
  • 07:45 MEH: MD03.GR0.20121212 label inactive, seems to be sneaking in even at lower prio over nightly
    • turning on a few more summitcopy hosts to bring the number to 30

Saturday : 2012.12.22

  • 06:30 MEH: ipp012 having troubles, cannot ssh into, jobs stalling. setting to repair, removing from processing. many glocks stuck even though /data/ipp012.0 available, rebooting ipp012-crash-20121222
  • 10:00 MEH: chip.off, warps underpowered and holding back SS processing. stdscience could use its regular restart later
  • 12:40 MEH: restarting stdscience, nightly still not finished
  • 14:30 nightly science mostly finished, chip.on

Sunday : 2012.12.23

  • 07:30 MEH: chip.off again, 220 nightly 3PI warps remaining -- nightly warps finished, chip.on @10:00