PS1 IPP Czar Logs for the week 2015.07.06 - 2015.07.12

Monday : 2015.07.06

  • 06:55 Bill: stdscience is in a pcontrol spin. Stopping to restart.
    • 07:00 stdscience pantasks restarted
  • 14:11 CZW: The logs indicated PSF generation failures, so I've set these to bad quality so the images flow on to MOPS.

warptool -updateskyfile -warp_id 1601763 -fault 0 -set_quality 42 -skycell_id skycell.2398.001
warptool -updateskyfile -warp_id 1601775 -fault 0 -set_quality 42 -skycell_id skycell.2217.014

  • 16:55 CZW: stopping IPP pantasks for a nightly restart, as I'm czar tomorrow.
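
The warptool commands in the 14:11 entry follow a fixed pattern, so they can be generated from a list of (warp_id, skycell_id) pairs. The following is a minimal sketch, not part of the IPP tools: the helper name `warptool_set_quality` and its parameters are hypothetical, and it only assembles the flags that already appear in this log (`-updateskyfile`, `-warp_id`, `-fault`, `-set_quality`, `-skycell_id`, and optionally `-dbname`). It prints the commands for review rather than running them.

```python
import shlex


def warptool_set_quality(warp_id, skycell_id, quality=42, dbname=None):
    """Build the argument list for one warptool call (hypothetical helper).

    Only flags seen in this log are used. The default quality of 42 is the
    value the log entries set; it is an assumption that this is always wanted.
    """
    args = ["warptool"]
    if dbname is not None:
        args += ["-dbname", dbname]
    args += [
        "-updateskyfile",
        "-warp_id", str(warp_id),
        "-fault", "0",
        "-set_quality", str(quality),
        "-skycell_id", skycell_id,
    ]
    return args


# The two warps from the 14:11 entry.
stalled = [
    (1601763, "skycell.2398.001"),
    (1601775, "skycell.2217.014"),
]

# Render each call as a shell-safe string, one command per line.
commands = [shlex.join(warptool_set_quality(w, s)) for w, s in stalled]
for command in commands:
    print(command)
```

Printing instead of executing keeps a human check in the loop before any skyfile is marked bad.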

Tuesday : 2015.07.07

  • 05:33 Bill: ipp017 is down. Set it to down in nebulous.
  • 05:54 Bill: dropped two warps with the "cannot build curve of growth" error: -warp_id 1602637 -skycell_id skycell.0864.074 and -warp_id 1602720 -skycell_id skycell.2211.011
  • 06:19 Bill: stdscience's queue is nearly full with jobs that have been running for > 5000 seconds. Some of the observations last night were apparently near M31 so it wouldn't be too shocking if psphot and camera were slow, but given that ipp017 is down they may just be stuck with nfs issues.
  • 17:42 CZW: restarting the ipp pantasks.

Wednesday : 2015.07.08

  • 12:40 CZW: Starting ~ippsky/staticsky pantasks to do staticsky updates. This will be using 2x loading on the x0,x1,x2,x3 nodes. Queue commands can be found in /data/ippc19.0/home/ippsky/queues/staticsky.updates. Note that -updaterun does not accept the -pretend option, so the commands will fail as written (although this shouldn't cause much, if any, problem if too many are active).

Thursday : 2015.07.09

  • 05:20 MEH: restarting pstamp
  • 13:40 MEH: clearing out diff and diff distribution jobs stalled all week (faults and then cleaned problem...)
  • 14:45 MEH: FYI -- many summitcopy, registration, stdscience faults last night around @2100 and @2200
     unhandled fault - database error: DBI connect('','nebulous',...) failed: Too many connections at /usr/lib64/perl5/site_perl/5.8.8/Nebulous/ line 157
  • 14:50 MEH: regular restart of nightly pantasks

Friday : 2015.07.10

  • 06:30 MEH: restarting pstamp
  • 14:00 CZW: I've added a third batch of x0-3 to ippsky/staticsky to speed up the updates. The memory usage on those nodes is not a problem so we should get more done.

Saturday : 2015.07.11

  • 03:35 MEH: clearing stalled
    • warp with "cannot build growth curve" error
      warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.1040.021 -warp_id 1603563   -fault 0
    • diff fault 5
      difftool -dbname gpc2 -updatediffskyfile -set_quality 42 -skycell_id skycell.1043.033 -diff_id 5607  -fault 0
  • 22:00 MEH: nightly processing completely stalled.. chiptool and warptool timeouts for finding pending jobs.. -- once again things are not running well on the weekend because loads are not being kept an eye on?
    • removing the label ps_ud_MOPS seems to help -- should clear the ~31k jobs it is trying to query for? -- only helps for chiptool, warptool still stalling
    • stdsci crashed -- restarting
    • killed off long running warptool query in gpc1 on ippdb05
    • stop other ippsky, replication pantasks, removed extra labels to try and help
    • seeing a few warps running, but still basically stalled

