PS1 IPP Czar Logs for the week 2013.11.11 - 2013.11.17

Monday : 2013.11.11

  • MEH: ipp052 has stalling jobs; looks like mysql could use a restart there -- taking it out of all pantasks

Tuesday : 2013.11.12

  • 13:41 Bill restarted distribution pantasks.

Wednesday : 2013.11.13

Thursday : 2013.11.14

Bill is czar today

  • 9:00 restarted pantasks. Queued sts distribution bundles for cleanup. Started processing of STS.rp.2013.201107%
  • 9:20 ipp065 is being overwhelmed by ipptopsps. Taking it out of the stdscience host list and temporarily setting it to down in nebulous to avoid failures due to essentially inaccessible detrend files.
  • 09:48 queued 28 mostly 3pi exposures of M31 for reprocessing.
  • 10:02 ipp050 crashed. Power cycled it.
  • 10:44 stdscience stopped. There are a vast number of M31 chip failures due to problems reading auxmask files.
  • 11:08 found the problem: 2 M31 labels removed.

Friday : 2013.11.15

Bill is czar today

  • Queued some more STS data.
  • 11:00 stopped stdscience and distribution for restart. There are several stdscience jobs that seem to be stuck.
  • 11:30 ipp049 is having problems. Ganglia claims that ipp055 is down but it is not.
  • 11:45 rebooted ipp049 (it's doing fsck)
  • 11:53 stdscience started. summitcopy and registration set to run
  • 12:05 earlier I was having trouble getting pantasks to start properly on ippc15, so we rebooted it. Upon reboot the 3.7 kernel was used, which apparently triggers the problem where pantasks jobs get stuck in the EXIT state. There is likely a library incompatibility (we build on ippc18 under 2.6 but run under 3.7). Rebooting back to 2.6. (A quick kernel-check sketch follows this list.)
  • 19:40 MEH: looks like ipp065 has been unhappy for a bit, stalling summit+reg+stdsci; lots of swap in use, so taking it out of all pantasks -- ipp011, 059, 061, 063 are suspected to develop similar problems soon, so also dropping them unless someone is going to watch overnight
    • caught up, and nightly processing is running now
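
A minimal, hypothetical sketch for the kernel-mismatch entry above: print the running kernel on the build and run hosts so a 2.6-vs-3.7 mismatch stands out. The host names come from the log entry; passwordless ssh is assumed, and this is not an IPP tool.

    #!/usr/bin/env python3
    # Hypothetical helper (not part of IPP): report the running kernel on each
    # host over ssh, so build-host vs run-host kernel mismatches are obvious.
    import subprocess

    HOSTS = ["ippc15", "ippc18"]  # run host and build host from the log entry

    for host in HOSTS:
        try:
            out = subprocess.run(["ssh", host, "uname", "-r"],
                                 capture_output=True, text=True, timeout=30)
            kernel = out.stdout.strip() or "<no response>"
        except subprocess.TimeoutExpired:
            kernel = "<ssh timed out>"
        print(f"{host:8s} {kernel}")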

Saturday : 2013.11.16

  • 18:40 MEH: ipp048,049 look overloaded in RAM from mysql; taking them out of processing as jobs on ipp048 are sluggish
  • 22:50 MEH: looks like stsci14 is down; set neb-host to down. Cannot reboot it.
  • 01:25 MEH: apache on ippc01-09 was hung up, so restarted it; finished clearing all summit+reg+stdsci+stk+dist jobs possible and restarted pantasks. Nightly is finally moving forward. (A generic frontend-probe sketch follows this list.)
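
A minimal, hypothetical sketch related to the apache restart above: probe each frontend (ippc01-09, from the entry) with a short HTTP request and flag hosts that hang or refuse connections. Port 80 and the timeout are assumptions; this is not an IPP tool.

    #!/usr/bin/env python3
    # Hypothetical probe (not part of IPP): check that apache answers a basic
    # HTTP request on each frontend and flag hosts that hang or refuse.
    import socket

    HOSTS = ["ippc%02d" % n for n in range(1, 10)]  # ippc01 .. ippc09
    TIMEOUT = 5.0  # seconds before a host is treated as hung

    def alive(host, port=80):
        """Return True if the host accepts a connection and answers a HEAD request."""
        try:
            with socket.create_connection((host, port), timeout=TIMEOUT) as sock:
                sock.sendall(b"HEAD / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
                return bool(sock.recv(128))  # any reply bytes count as alive
        except OSError:
            return False

    for host in HOSTS:
        print("%-8s %s" % (host, "ok" if alive(host) else "HUNG/DOWN"))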

Sunday : 2013.11.17

  • 10:00 MEH: Gavin rebooted stsci14; set neb-host back to up, and the remaining handful of stalled exposures finished.
    • diffim fault 5 from the new tag changes still occurring -- diff_id=493968, skycell.1620.066