PS1 IPP Czar Logs for the week 2015.05.18 - 2015.05.24

(Up to PS1 IPP Czar Logs)

Monday : 2015.05.18

  • 16:23 CZW: restarting ippsky/pv3ff{lt|rt} because they appear to have been running forever.

Tuesday : 2015.05.19

  • 05:45 EAM: stdscience is running somewhat slowly; I'm restarting it.
  • 12:10 MEH: an abundance of ps_ud_QUB label diffs in goto_cleaned state are not being cleaned again -- changing the label to goto_cleaned to free space (see the sketch after this list)
  • 20:00 EAM: restarting fforce
    • 20:45 EAM: we are getting some excessively large fforce jobs on pv3ffrt from the Galactic plane. I'm removing storage nodes from that pantasks until we are past.
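
  A minimal dry-run sketch of the relabeling mentioned in the 12:10 entry, assuming the gpc1 database keeps diffs in a diffRun table with label and data_state columns (the table, column, and state names here are assumptions about the schema, not confirmed); in practice the statement would be reviewed and applied with the usual IPP tooling:

      # Build (but do not execute) the statement that relabels the stuck
      # ps_ud_QUB diffs so the cleanup pantasks will pick them up.
      OLD_LABEL, NEW_LABEL = "ps_ud_QUB", "goto_cleaned"
      sql = (f"UPDATE diffRun SET label = '{NEW_LABEL}' "
             f"WHERE label = '{OLD_LABEL}' AND data_state = 'goto_cleaned'")
      print(sql)   # dry run only; apply via the normal IPP tooling against gpc1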

Wednesday : 2015.05.20

  • 05:30 EAM: rebooting stsci03 -- processing is somewhat behind, but not yet clear if that is due to stsci03. (the reboot helped -- things are now moving along)
  • 06:10 EAM: I'm restarting the pv3fflt,pv3ffrt pantasks. pv3ffrt is working on the plane -- it is slower than pv3fflt and it needs more memory per job. at the same time, pv3fflt is constantly running out of jobs because the queuing is barely able to keep up. I'm adjusting the loading to put all but 14 of the x nodes in pv3ffrt (since they have more memory), and leaving both storage and compute to pv3fflt (when stdscience does not need them) along with the 14 unused x nodes.
  • 11:00 EAM: I allowed too many jobs on the x2 and x3a nodes for pv3ffrt: they are thrashing from using too much memory. I am trying to kill off jobs and bring them down to 5 jobs per node instead of 9 (see the memory-sizing sketch after this list).
  • 13:50 CZW: Started p1_clear.pl shuffle job on stsci15 volume 0, directory 00. This will add some disk IO, but should not overburden the host.
  • 20:20 EAM: restarted stdscience and pv3ffrt.
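
  A back-of-the-envelope sketch of the per-node job sizing behind the 11:00 change; the RAM sizes and the ~8 GB per-job footprint are assumed figures for illustration only:

      # Largest job count that fits in a node's RAM with a small OS/cache reserve.
      def max_jobs(node_ram_gb, per_job_gb, reserve_gb=4):
          return max(1, int((node_ram_gb - reserve_gb) // per_job_gb))

      for node, ram_gb in [("x2", 48), ("x3a", 48)]:         # assumed RAM sizes
          need_9 = 9 * 8                                      # 9 jobs at ~8 GB each
          print(f"{node}: 9 jobs want ~{need_9} GB of {ram_gb} GB;"
                f" ~{max_jobs(ram_gb, 8)} jobs actually fit")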

Thursday : 2015.05.21

  • 09:00 MEH: more old distribution data to clean up again
  • 09:50 MEH: bzip2-compressing months of old ipp pantasks logs (see the sketch after this list).. the symlink to the export .1 archive is broken while on ippc19, but the bzip2 pass should free up a good fraction of the space
  • 10:30 MEH: doing regular restart of nightly pantasks
  • 12:30 MEH: fixing stalled diffims from 5/18 caused by "cannot build growth curve" errors and by warps having been cleaned up
  • 13:00 MEH: apache neb servers are low on disk space; best to cycle out the logs now rather than on Friday before a long weekend.. -- done c01,02,03,07,08,09 -- were the others going to be added into the mix once the neb db is switched to the new machine?
  • 14:20 MEH: looks like there was a glitch in cleanup yesterday that skipped cleaning the nightly products; triggering it manually now...
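
  A minimal sketch of the 09:50 log-compression pass: bzip2 any pantasks log older than about 30 days that is not already compressed, skipping symlinks (such as the broken export links noted above). The search directory is an assumption; the real logs live under the ipp account's pantasks areas:

      import bz2, shutil, time
      from pathlib import Path

      LOG_DIR = Path("/home/panstarrs/ipp")       # assumed location of the pantasks dirs
      CUTOFF = time.time() - 30 * 86400           # only touch logs older than ~30 days

      for log in LOG_DIR.rglob("pantasks*.log*"):
          if log.is_symlink() or log.suffix == ".bz2" or log.stat().st_mtime > CUTOFF:
              continue
          with open(log, "rb") as src, bz2.open(str(log) + ".bz2", "wb") as dst:
              shutil.copyfileobj(src, dst)        # write the compressed copy first
          log.unlink()                            # then drop the original to free space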

Friday : 2015.05.22

  • 07:20 MEH: seems NCU has >300 updates for stamps failing since 5/19, should probably be looked into..
    • oddly, while looking into them, they all got sent to cleanup..
  • 10:30 EAM: Chris has deferred the fforce galactic bulge regions for now; I'm restarting pv3ffrt and pv3fflt with the original loading (x0,x1,storage & x2,x3,compute) to work on the rest of the sky; at some point next week, we can target the galactic bulge to a set of high-memory machines (see the sketch after this list).
  • 15:00 MEH: stdsci didn't have an init.day succeed -- ~ipp/stdscience/pantasks.stderr.log is being spammed with "variable DB:2 not found"
    • did a restart of stdsci and the current date is now available; manually sent chip/warp/diff to cleanup
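
  A small sketch of the 10:30 loading split and of the planned bulge follow-up, i.e. picking out high-memory hosts for the Galactic-bulge fforce work; the node classes and RAM figures below are invented for illustration:

      # Original loading: left gets x0/x1 plus storage, right gets x2/x3 plus compute.
      pv3fflt = ["x0", "x1", "storage"]
      pv3ffrt = ["x2", "x3", "compute"]

      # Assumed RAM per node class (GB); real values would come from the host inventory.
      ram_gb = {"x0": 24, "x1": 24, "x2": 48, "x3": 48,
                "storage": 32, "compute": 32, "himem": 128}

      # Candidate set for the deferred Galactic-bulge fforce jobs: high-memory hosts only.
      bulge_hosts = [n for n, gb in ram_gb.items() if gb >= 96]
      print("bulge candidates:", bulge_hosts)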

Saturday : 2015.05.23

  • 17:05 MEH: restarting cleanup and pstamp
  • 23:00 EAM: restarting pv3ff[lr]t.

Sunday : 2015.05.24

  • 10:50 MEH: restarted cleanup; leaving stdsci until after init.day to watch that it works properly, unlike the other day when it failed
  • 13:50 MEH: stsci18 down, trying a power cycle (see the sketch after this list) -- back up
  • 14:10 MEH: stdsci is again throwing errors and not doing proper cleanup -- is something wrong with the changes made to it?
  • 18:25 MEH: stsci05 power cycle --
  • 22:45 EAM: full-force left and right ran out of work, so I've launched the queues for 18h (right) and 21h (left). I've also launched 15h for remote. I restarted the ff pantasks.
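
  A minimal sketch of the power-cycle step used on stsci18 and stsci05, assuming each node's BMC answers at "<host>-ipmi" and that IPMI credentials are exported in the environment (both are assumptions); it simply wraps the standard ipmitool chassis power command:

      import os, subprocess

      def power_cycle(host):
          bmc = f"{host}-ipmi"                    # assumed BMC naming convention
          subprocess.run(["ipmitool", "-I", "lanplus", "-H", bmc,
                          "-U", os.environ["IPMI_USER"],
                          "-P", os.environ["IPMI_PASS"],
                          "chassis", "power", "cycle"],
                         check=True)

      power_cycle("stsci18")                      # then watch for the host to come back up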