PS1 IPP Czar Logs for the week 2014-03-17 - 2014-03-23

Monday : 2014-03-17

  • 09:30 Bill: storage.hosts.on, storage.hosts.ignore in staticsky pantasks
  • 11:15 at the suggestion of Serge and Gene, kicked a diff to have a quality code of 42

Tuesday : 2014-03-18

Mark is czar

  • 07:05 MEH: clearing WS diffim fault 5 (532198,532231,532259,532292 skycell.1089.062; 532381 skycell.2608.044; 533099 skycell.1084.087, skycell.1084.096)
  • 07:50 MEH: nightly finished, staticsky storage.hosts.on, storage.hosts.ignore
  • 08:20 MEH: after ~4 days, restarting regular pantasks for the week
  • 10:30 MEH: seeing load on ipp032 for MOPS and ippc30 for PSS, setting off in staticsky adds manually
    • ippc30 has MOPS stamps taking ~1min each rather than ~3/min normally, running staticsky on ippc30 probably isn't helping
  • 18:00 MEH: extra nodes for staticsky off in time for nightly, storage.hosts.off
  • 19:30 MEH: ipp053 looks to have gone unresponsive ~19:20.. seriously.. -- ipp053_log
    • nothing on console, power cycling since non-responsive for over 10 min, ganglia showing drop of cpu to only ~20% cpu_wait @1900 so something probably just crashed
    • failed to boot on first power cycle, finally kicked over on second power cycle; this is one of the systems with the long boot screen for RAM check
    • nothing in logs except USB drive faults/[sr0] Unhandled sense code after reboot -- leaving in repair and out of processing for tonight

Wednesday : 2014-03-19

Mark is czar

  • 07:50 MEH: fault 5 diffim to clear (533929,533896 skycell.2518.043), nightly mostly done but for ~100 WS diffs
  • 08:40 MEH: WS diffims finished, switching nodes to staticsky except for ipp032 for Serge/MOPS
  • 13:30 MEH: ipp053 down again after running only staticsky load while still in neb-host repair from last night. now not booting at all on power cycle.
    • Haydn will look at tomorrow, setting power off
  • 16:00 Bill is rebuilding psModules for tests in staticsky
  • 17:00 MEH: rebuilding ippconfig to remove compression on the diffims for MOPS
  • 18:00 MEH: if extra nodes still running in staticsky, will be turning off
  • 19:44 Bill: fullforce processing is done (modulo faults to be investigated). Setting staticsky poll limit back to 160, which is slightly larger than the number of currently enabled nodes.

Thursday : 2014-03-20

  • 09:05 Bill: storage.nodes.on, wait for all idle, storage.nodes.ignore.
    • rebuilt ippsky's psphot directory to fix a problem in psphotFullForce
  • 09:43 Bill: restarted postage stamp server pantasks.
    • the pstamp working directory is 96% full. This is because I turned pstamp cleanup off yesterday and so it hasn't been cleaning up after Johannes' big jobs
    • removed PSI and WEB labels from pstamp to give full attention to MOPS
  • 18:20 CZW: storage.nodes.off
  • 20:15 Bill: restarted summitcopy and registration pantasks because registration was behind. ok now
  • earlier Bill: pstamp working directory /data/ippc30.1 ran out of space.
    • I stopped pstamp processing for a while, by changing labels. Johannes' big jobs are no longer running. Asked him to stop submission for now.
    • Also removed the IFA label for now.
    • Increased the number of cleanup jobs allowed to run to 15. This makes the load go pretty high.
    • 20:47 Disk space is clearing out. (2% free, 164GB) In a couple of hours there should be enough space to allow today's MOPS jobs to run.
    • 23:01 22% free space. npending pstamp cleanup jobs set to 5
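
Note on the free-space readings above: a minimal sketch of how the percentages could be checked from a script, assuming only Python's standard library. The function names and the 5% default threshold are hypothetical and are not part of pstamp or any IPP tool; the path would be whatever volume is being watched (here the log mentions /data/ippc30.1).

```python
import shutil


def percent_free(total_bytes: int, free_bytes: int) -> float:
    """Return free space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def implied_total_gb(free_gb: float, free_percent: float) -> float:
    """Back out total capacity from a reading such as '2% free, 164GB'."""
    if free_percent <= 0:
        raise ValueError("free_percent must be positive")
    return free_gb * 100.0 / free_percent


def check_volume(path: str, min_free_percent: float = 5.0) -> bool:
    """True if the volume holding `path` has at least min_free_percent free.

    min_free_percent is a hypothetical threshold, not an IPP setting.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_free(usage.total, usage.free) >= min_free_percent
```

Taking the 20:47 reading at face value, implied_total_gb(164, 2) gives 8200.0, i.e. a volume of roughly 8.2TB. The logged 2% is rounded, so this is only an estimate.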

Friday : 2014-03-21

  • 11:20 CZW: restarted stdscience, as it was at 100k chip/warp/diff jobs.

Saturday : 2014-03-22

Sunday : 2014-03-23