PS1 IPP Czar Logs for the week 2014.04.21 - 2014.04.27

Monday : 2014.04.21

Bill is czar today

  • 06:20 : EAM : preparing to add ps2_tc3 to the ippdb03 mysql replication (dumped the ippdb01 database & copied to ippdb03):
    mysql> show master status;
    | File              | Position | Binlog_Do_DB | Binlog_Ignore_DB |
    | mysqld-bin.030242 | 44889610 |              |                  | 
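The binlog coordinates captured above are what the new replica needs to start replicating from. A minimal sketch of pulling them out of the `show master status` row and emitting the corresponding `CHANGE MASTER TO` statement (the coordinates are the ones logged above; the parsing itself is illustrative):

```python
# Extract binlog coordinates from the "show master status" row logged above
# and emit the CHANGE MASTER TO statement a new replica would need.
row = "| mysqld-bin.030242 | 44889610 |              |                  |"
fields = [f.strip() for f in row.strip().strip("|").split("|")]
binlog_file, position = fields[0], int(fields[1])
stmt = (f"CHANGE MASTER TO MASTER_LOG_FILE='{binlog_file}', "
        f"MASTER_LOG_POS={position};")
print(stmt)
```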
  • 11:48 Bill set pantasks to stop in preparation for periodic restart
    • added ipp055 to the hosts ignore list while the raid is rebuilding. Haydn put it in repair
  • 11:55 restarted stdscience, distribution, cleanup, pstamp, registration, stack, and summitcopy. Leaving stack in stopped state because I forgot to wait for jobs started by the previous instance to finish.
    • 12:12 stack set to run
  • 13:50 investigated corrupt raw file neb://ipp036.0/gpc1/20120817/o6156g0277o/o6156g0277o.ota55.fits. It turns out there were 5 instances, only one of which was corrupt. Fixed this.
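With 5 replicated instances of the same raw file, the corrupt copy is the one whose checksum disagrees with the majority. A hypothetical sketch of that check (the byte contents and `copyN` names are stand-ins; in practice each entry would be one nebulous instance of the file):

```python
import hashlib
from collections import Counter

def find_corrupt(instances):
    """Return names of instances whose checksum differs from the majority."""
    digests = {name: hashlib.md5(data).hexdigest()
               for name, data in instances.items()}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return [name for name, d in digests.items() if d != majority]

# Stand-in byte contents for 5 instances; copy3 plays the corrupt one.
instances = {f"copy{i}": b"good fits data" for i in range(5)}
instances["copy3"] = b"truncated"
print(find_corrupt(instances))  # -> ['copy3']
```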

Tuesday : 2014.04.22

Bill is czar today

  • 12:52 setting pantasks to stop in preparation for periodic restart
    • with the restart, ipp055 has been removed from the hosts ignore list since it has successfully finished its RAID rebuild

Wednesday : 2014.04.23

  • 07:35 MEH: clearing two fault 5 diffims for OSS; several still needed for WS
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.0853.066 -diff_id 542889  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.0938.000 -diff_id 542890  -fault 0
  • 15:30 CZW: I wanted to push stacks so LAP could finish in a reasonable amount of time. I've changed the host allocation:
    • m0 => continue to be out of processing for stacking tests
    • m1 => into staticsky to work on the galactic center
    • c0x3, c1bx2, c2x3 => taken from staticsky into stack
  • 18:15 MEH: remaining fault 5 WS diffims need clearing; the warps are only online for 5 days, so these cannot be left ignored
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.0852.062 -diff_id 542893  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.0853.032 -diff_id 542920  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.0854.039 -diff_id 542920  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.0853.090 -diff_id 542924  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.0853.091 -diff_id 542924  -fault 0
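The invocations above differ only in skycell_id and diff_id, so they can be generated from a list of pairs; a small sketch (flags copied verbatim from the commands above, only the loop is new):

```python
# Generate the repetitive difftool command lines above from (skycell, diff_id) pairs.
faulted = [
    ("skycell.0852.062", 542893),
    ("skycell.0853.032", 542920),
    ("skycell.0854.039", 542920),
    ("skycell.0853.090", 542924),
    ("skycell.0853.091", 542924),
]
template = ("difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 "
            "-skycell_id {sky} -diff_id {did} -fault 0")
commands = [template.format(sky=sky, did=did) for sky, did in faulted]
print("\n".join(commands))
```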

Thursday : 2014.04.24

  • 09:37 Bill: load is oscillating. Time to restart stdscience. Set pantasks to stop, except for stack and staticsky.
    • 09:44 pantasks restarted
  • 13:20 Bill set 2 x c2 to off in stack and added them to staticsky.
    • 15:30 we got good staticsky throughput for the past 2 hours because the skycells were at low galactic latitude (low enough that there were no xfits, but not too low). Now it has slowed down: projection cell 1031 has skycells with glat between 5.8 and 9 degrees. We are now detecting so many peaks that we run into the slow footprint cull problem (the whole image is reduced to one large footprint). So far memory use has not exploded for any of the running processes.
    • 15:50 now we are getting some stress. Each staticsky process in this region of the sky is growing to ~20 GB (and none have completed yet); two of those will push up against the 48 GB of memory on the s2 nodes. Setting 1 set to off in staticsky.
  • 21:30 MEH: night not looking promising, queuing 4 exposures for MOPS reprocessing -- OSS.20140324.redo140424
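The 15:50 decision above is simple arithmetic: two ~20 GB staticsky processes on a 48 GB s2 node leave only a few GB of headroom while the processes are still growing. A sketch of the budget (numbers from the log entries above):

```python
NODE_RAM_GB = 48    # s2 node memory (from the log)
PROC_GB = 20        # observed staticsky process growth (from the log)
SETS_PER_NODE = 2   # sets of staticsky processes per node before the change

used = SETS_PER_NODE * PROC_GB
# Only 8 GB of headroom while none of the processes have finished growing,
# hence dropping one set per node.
print(f"{used} GB of {NODE_RAM_GB} GB -> headroom {NODE_RAM_GB - used} GB")
```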

Friday : 2014.04.25

Mark is czar

  • 06:00 Bill: turned 1 x c2 back on in stack. I meant to do that last night but forgot.
  • 12:41 Bill: set the second set of c2 nodes back on. Removing ipps nodes from staticsky.

  • 16:30 MEH: stdsci underloaded, regular restart before nightly obs start

Saturday : 2014.04.26

  • 00:30 MEH: looks like ippc13 unresponsive for ~10min.. nothing on console, failing to come back up after a couple power cycle attempts. must be the weekend..
  • 13:00 MEH: stdsci polls going underloaded; about due for the necessary regular restart before it impacts the processing rate -- will do before 14:00 to avoid interference

Sunday : 2014.04.27

  • 09:30 or so Bill: set 1 set of c2 nodes to off in stack to avoid memory overload due to staticsky processes in bulge
  • 10:30 MEH: time for regular restart of stdsci
  • 12:24 Bill: set the other set of c2 nodes to off in stack to avoid memory overload due to staticsky processes in the bulge. Stacks will back up for the next couple of days.
  • 14:30 MEH: losing this much power in stack will leave the system mostly idle; manual reallocation of stdsci nodes to stack will be necessary until nightly starts (most of the power pushed into stack will likely be needed to fully pre-load stdsci processing in case of poor weather at night)
    • should be enough power for PSS updates, except that large poll numbers mean they will take a while to get queued in place of LAP.. need to drop the poll number
  • 19:20 MEH: will change nodes from stack to stdsci once nightly starts