PS1 IPP Czar Logs for the week 2013.06.10 - 2013.06.16

Monday : 2013.06.10

  • reminder: power outage to IfA-Manoa @ 0830

Tuesday : 2013.06.11

  • RCUH holiday
  • 13:15 MEH: Serge has pointed out that pstamp has been running slowly lately for MOPS requests; the pantasks looks like it has been running since 5/29. Doing a restart of pstamp and update; will see if the rate is back up tomorrow morning / next MOPS round (a quick uptime-check sketch is below Tuesday's entries).
  • 18:55 MEH: also restarted summitcopy and registration, since they have been running for a while as well. stdscience probably needs to be restarted no later than tomorrow.
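
A minimal sketch of how long-running pantasks servers could be flagged for restart, assuming the server process is named pantasks_server (an assumption; use the real process name) and that psutil is available on the pantasks host:

    import time
    import psutil

    MAX_AGE_DAYS = 7  # flag servers older than this (arbitrary threshold)

    now = time.time()
    for proc in psutil.process_iter(['pid', 'name', 'create_time', 'username']):
        if proc.info['name'] == 'pantasks_server':        # assumed process name
            age_days = (now - proc.info['create_time']) / 86400.0
            if age_days > MAX_AGE_DAYS:
                print(f"pid {proc.info['pid']} ({proc.info['username']}): "
                      f"up {age_days:.1f} days -- candidate for restart")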

Wednesday : 2013.06.12

Bill is czar today

  • 05:50 Quite a bit of data was taken last night. 18 object exposures are left to download, and 95 exposures are still at chip stage. Ganglia looks nominal except that network usage has a not-quite-uniform sawtooth pattern with a period of about 5 minutes.
  • 06:12 Chip stage throughput is only 30 exposures per hour (rough time-to-clear estimate below), and the stdscience pcontrol is spinning. Going to restart stdscience. Restarted at 06:22.
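
For scale, a back-of-the-envelope estimate of how long the chip backlog would take to clear at that rate (numbers from the 05:50 and 06:12 notes above; the rate is assumed constant, which it should not be after the restart):

    # rough time-to-clear estimate for the chip-stage backlog
    backlog_exposures = 95      # exposures still at chip stage (05:50 note)
    throughput_per_hour = 30    # observed chip-stage throughput (06:12 note)

    hours_to_clear = backlog_exposures / throughput_per_hour
    print(f"~{hours_to_clear:.1f} hours to clear at the current rate")  # ~3.2 hours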

Thursday : 2013.06.13

Bill is czar today

  • 05:20 Disk space is getting low. Set ps_ud% and STS.rp.20130508 data to goto_cleaned.
  • 14:00 MEH: starting the select MD09 reprocessing for MD09.refstack.2013; manually adding the deepstack compute3 nodes back into stdsci.
  • 19:30 MEH: stdsci chip/warp at ~60k jobs run and Nrun is underloaded; this seems sooner than typical for a regular restart. Doing the restart before nightly processing starts.
  • 19:45 MEH: ippc18 /home disk has 36G free.
  • 21:00 MEH: never fails: md09 reprocessing is almost done and the system goes belly-up. ipp057 is having trouble, load >150; set it neb-host down and am trying to take it out of processing.
    • 21:20 had to kill 5x ppImage and 1x pswarp to reduce the stdsci load on it -- unclear what happened; like the other wave4 nodes, mysql is at ~50% or more of RAM, and it looks like the node fell into 100% system CPU for something (a quick node-health check is sketched at the end of Thursday's entries).
  • 21:50 MEH: reconfiguring deepstack back to doing deep stacks only; taking 4 of the 6x compute3 nodes out of stdscience again.
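
A minimal node-health check along the lines of what was looked at by hand above, assuming psutil is available on the wave4 hosts; the alarm thresholds are arbitrary, not IPP policy:

    import psutil

    LOAD_ALARM = 100.0      # 1-minute load average that counts as "in trouble"
    MYSQL_RAM_ALARM = 0.5   # fraction of RAM held by mysqld that looks suspicious

    load1, _, _ = psutil.getloadavg()
    total_ram = psutil.virtual_memory().total

    mysql_rss = sum(p.info['memory_info'].rss
                    for p in psutil.process_iter(['name', 'memory_info'])
                    if p.info['name'] == 'mysqld' and p.info['memory_info'])

    if load1 > LOAD_ALARM:
        print(f"load {load1:.0f} -- consider pulling this host out of stdscience")
    if mysql_rss / total_ram > MYSQL_RAM_ALARM:
        print(f"mysqld holding {mysql_rss / total_ram:.0%} of RAM")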

Friday : 2013.06.14

  • 01:00 MEH: md09 reftests are faulting often in stack creation with I/O error code 32cd; ippc40, c44, and maybe c53 appear to be having the faults most often. Unclear why -- this didn't happen with md08 in April. Turning on revert in deepstack overnight; maybe most will get done eventually...
  • 11:10 MEH: suspect the md09 reftest faulting is related to the compute3 nodes having no local disk space left for tmp files (many ~1-10G files need to be made for the convolution process); ippc40 has only 35M free, which is very bad.
    • the compute3 ippcXX.0 mirrored RAID disks are being used for dvo backups, so the local tmp dir should be switched from ippcXX.0 to ippcXX.1, a larger non-mirrored RAID -- .0/ipp/tmp is now a symlink to .1/ipp/tmp (a rough relocation sketch is at the end of Friday's entries)
    • the stare nodes also have the .0/.1 disk setup, so moving ipp/tmp to .1 there as well, with .0/ipp/tmp a symlink to .1/ipp/tmp
    • stare03's .1 disk is not mounted, unclear why -- no change made on that system until it is clear the .1 disk is okay (and mounted)
  • 20:40 MEH: ipp055, 057, 058 driven to >100 load for some reason; ipp055 has mysqld at 75%, and they look like they have some ippdvo jobs running. Taking 2x instances out of stdscience for each of 055/057/058 to see if they stabilize.
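
A minimal sketch of the tmp-directory relocation described above, assuming the .0 and .1 disks are mounted at the hypothetical paths below (substitute the real mount points on each compute3/stare node):

    import os
    import shutil

    # Hypothetical mount-point paths -- replace with the real .0/.1 locations.
    OLD_TMP = '/export/ippcXX.0/ipp/tmp'
    NEW_TMP = '/export/ippcXX.1/ipp/tmp'
    NEW_DISK = '/export/ippcXX.1'

    # refuse to touch anything if the target disk is not actually mounted (cf. stare03)
    if not os.path.ismount(NEW_DISK):
        raise SystemExit("target .1 disk not mounted -- leave this node alone")

    os.makedirs(NEW_TMP, exist_ok=True)
    if not os.path.islink(OLD_TMP):
        # move any existing tmp contents over, then replace the old dir with a symlink
        for entry in os.listdir(OLD_TMP):
            shutil.move(os.path.join(OLD_TMP, entry), NEW_TMP)
        os.rmdir(OLD_TMP)
        os.symlink(NEW_TMP, OLD_TMP)

    print(shutil.disk_usage(NEW_TMP).free // 2**30, "GiB free for tmp files")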

Saturday : 2013.06.15

Sunday : 2013.06.16