Monday : 2013.09.30

  • Gavin put in a request to MHPCC this morning to reboot stsci13 (but not the RAID); it may take 24 hrs for them to get to it with the staffing changes and the changeover happening
  • 14:40 Gavin called MHPCC directly and Brad rebooted the machine -- things are back up, mounts are happier
  • 16:10 Bill restarted the pstamp and update pantasks servers

Tuesday : 2013.10.01

  • 12:10 Bill restarted all pantasks after rebuilding ppImage with some changed code. I botched the merge on reductionClasses.mdc so I removed the STS.test.20130930 label right after I added it.
    • 13:35 all fixed. STS.test.20130930 consists of 34 exposures testing the new STS auxiliary masks.

Wednesday : 2013.10.02

Bill is czar today

  • We were shut down during the afternoon for some motherboard replacements and reboots

Thursday : 2013.10.03

Bill is czar today

  • 10:00 A number of diff skycells faulted with what should be quality error 14006. Another had an assertion failure; set that one to quality = 4
    • here is the tedious human intervention to work around buggy ppSub
      • MEH: again, not buggy ppSub, but unhandled psLib+psphot faults with newish ipp-130712 ops tag
            difftool -updatediffskyfile -set_quality 14006 -skycell_id skycell.068 -diff_id 482509 -fault 0
            difftool -updatediffskyfile -set_quality 14006 -skycell_id skycell.026 -diff_id 482472 -fault 0
            difftool -updatediffskyfile -set_quality 14006 -skycell_id skycell.041 -diff_id 482472 -fault 0
            difftool -updatediffskyfile -set_quality 14006 -skycell_id skycell.041 -diff_id 482475 -fault 0
            difftool -updatediffskyfile -set_quality 14006 -skycell_id skycell.052 -diff_id 482472 -fault 0
            difftool -updatediffskyfile -set_quality 14006 -skycell_id skycell.024 -diff_id 482486 -fault 0
            difftool -updatediffskyfile -set_quality 14006 -skycell_id skycell.022 -diff_id 482487 -fault 0
            difftool -updatediffskyfile -set_quality 14006 -skycell_id skycell.086 -diff_id 482487 -fault 0
            difftool -updatediffskyfile -set_quality 14006 -skycell_id skycell.031 -diff_id 482461 -fault 0
            difftool -updatediffskyfile -set_quality 14006 -skycell_id skycell.084 -diff_id 482461 -fault 0
            difftool -updatediffskyfile -set_quality 14006 -skycell_id skycell.024 -diff_id 482465 -fault 0
            difftool -updatediffskyfile -set_quality 14006 -skycell_id skycell.052 -diff_id 482470 -fault 0
            difftool -updatediffskyfile -set_quality 4 -skycell_id skycell.1150.051 -diff_id 482436 -fault 0
  • 12:09 Restarted the pantasks_servers
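
The per-skycell difftool updates above all follow one pattern, so the list could be generated rather than typed by hand. The following is only a dry-run sketch: gen_quality_cmds is a hypothetical helper, not part of the IPP tools, and it only prints commands in the same form used above (add -dbname gpc1 if that is needed on your setup). Review the output before running any of it.

```shell
#!/bin/sh
# Dry-run sketch: print one difftool command per "diff_id skycell_id" pair.
# gen_quality_cmds is a hypothetical helper, not an IPP tool; it never runs
# difftool itself, it only prints the command lines for review.
gen_quality_cmds() {
    quality="$1"
    while read -r diff_id skycell_id; do
        # Skip blank lines in the input
        [ -n "$diff_id" ] || continue
        printf 'difftool -updatediffskyfile -set_quality %s -skycell_id %s -diff_id %s -fault 0\n' \
            "$quality" "$skycell_id" "$diff_id"
    done
}

# Example input: two of the diff_id / skycell pairs from the list above
gen_quality_cmds 14006 <<'EOF'
482509 skycell.068
482472 skycell.026
EOF
```

Once the printed commands look right, they can be pasted into a shell or piped to sh.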

Friday : 2013.10.04

  • SC 07:50 Same diff issues
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.062 -diff_id 482517 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.074 -diff_id 482517 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.084 -diff_id 482528 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.013 -diff_id 482533 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.021 -diff_id 482544 -fault 0
  • 10:45 Bill started reprocessing of STS label STS.rp.2013. There are 16,222 exposures marked as obs_mode STS
  • 12:23 Bill: ipp034 is down. Changed its nebulous state from repair to down
  • 13:30 CZW pantasks stopped to prevent problems as Haydn attempts to increase the RAM visible on ipp030.
  • 14:10 CZW pantasks back online.
  • 18:50 CZW restarted stdscience pantasks.

Saturday : 2013.10.05

  • 19:55 MEH: distribution restarted, some stalled ESS and STS now moving through
  • 22:40 MEH: something is unhappy.. reg backed up, things faulting.. -- ipp016 lockd gone insane -- neb-host repair and out of processing until settles down
  • 22:55 MEH: pstamp also been holding onto some QUB jobs all day.. going to restart
  • 23:05 MEH: stdsci also some long stalled jobs and >100k Njobs so while things stalling, do a restart

Sunday : 2013.10.06

  • 16:05 MEH: appears ipp016 has been down for >14hrs.. because it is in neb-host repair and not down, SSdiff for MD are stalled..
    • setting neb-host down to clear SSdiffs -- cleared and stalled diffim in stdsci cleared after reboot
    • trying to reboot -- nothing on console, reboot fine. putting back into neb-repair
    • pstamp QUB request stalled for 4 hrs also cleared..
    • HAF notes that ipp016 going down knocked out ipptopsps. Restarted Monday