PS1 IPP Czar Logs for the week 2014-01-27 - 2014-02-02

Monday : 2014-01-27

  • 8:00 CZW: shut down pantasks servers in advance of today's set of moves.
  • 10:45 CZW: restarted nebulous apache servers, as they became cranky at some point. This was causing nebulous interactions to freeze. Post-restart, nebulous jobs appear to be completing as usual.
  • 11:30 CZW: neb-host ipp024-ipp029
  • 15:11 Bill: changed host status for ipp006-19 (except 17) from up to repair since they are going to be moved soon. Changed neb-host status for ipp030-36 from down to repair since they have been moved.
    • 15:27 changed ipp017 from down to repair as well; it was moved this morning
  • 15:21 Bill restarted ~ippsky/staticsky, pstamp, and update pantasks
  • 15:40 CZW: cab4 has been checked and appears fine, so I have updated the cabinet location in the nebulous database.
  • 16:31 CZW: all cab3 hosts look fine with the exception of ipp023, which had problems last night anyway. I've set the rest of cab3 to repair, left ipp023 as "down", and am now bringing up the pantasks that haven't been restarted yet.
  • 19:41 Bill set staticsky to run. It was stopped to investigate a potential problem.
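
Several entries above name hosts by range ("ipp006-19 (except 17)", "ipp024-ipp029"). As an illustration only, a minimal Python sketch of expanding such a range into individual host names follows. The helper name expand_hosts and its exclude parameter are hypothetical and are not part of the IPP or nebulous tools; the sketch assumes the zero-padding width is taken from the first host in the range.

```python
import re

def expand_hosts(spec, exclude=()):
    """Expand a range such as 'ipp006-19' into zero-padded host names.

    Hypothetical helper for illustration; not part of the IPP tools.
    """
    # prefix, first number, optional repeated prefix fragment, last number
    m = re.fullmatch(r"([a-z]+)(\d+)-(?:[a-z]+)?(\d+)", spec)
    if m is None:
        raise ValueError(f"unrecognized host range: {spec!r}")
    prefix, first, last = m.group(1), m.group(2), m.group(3)
    width = len(first)          # assumption: padding follows the first host
    skip = set(exclude)         # host numbers to leave out of the result
    return [
        f"{prefix}{n:0{width}d}"
        for n in range(int(first), int(last) + 1)
        if n not in skip
    ]

hosts = expand_hosts("ipp006-19", exclude=[17])
print(len(hosts), hosts[0], hosts[-1])   # 13 ipp006 ipp019
```

The same pattern accepts the other spellings used in this log, such as "ipp024-ipp029" and "ippc20-c28".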

Tuesday : 2014-01-28

  • 17:30 CZW: Moves complete for the day; all hosts up. I've restarted pantasks for tonight, and set hosts to repair (I'm still finishing some of the disk scans).
  • 21:55 Bill: set ippsky/staticsky to run

Wednesday : 2014-01-29

  • 09:30 Bill restarted czarpoll which didn't seem to be updating, and roboczar just because (reminder: these are run in screen by ipp@ippc11 )
  • 11:55 EAM : ipp028 died. Various things were blocked on that partition, so I stopped all processing, then laboriously checked each machine for hanging df processes, killed jobs owned by ipp (but not ippsky), and did a force umount (retried until the df cleared). Restarted all services.
  • 16:00 Bill: after talking with Gene, removed one set of compute3 nodes from stdscience and enabled one more in ippsky. More red on the board now.
  • 16:43 Bill restarted stdscience including the ps_ud labels. The update pantasks is decommissioned for the time being.
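
The 11:55 entry above describes checking each machine for hanging df calls by hand. A minimal Python sketch of one way to automate that check on a single machine follows. It is not the procedure that was actually run: the function names command_finishes and mount_responds and the 5-second timeout are assumptions made for illustration. It uses Popen rather than subprocess.run so that the check itself does not block if the child is stuck in uninterruptible I/O.

```python
import subprocess

def command_finishes(argv, timeout_s):
    """Return True if the command exits within timeout_s seconds."""
    proc = subprocess.Popen(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Ask the kernel to kill the child, but do not wait for it:
        # a process blocked on a dead mount may never be reaped.
        proc.kill()
        return False
    return True

def mount_responds(path, timeout_s=5.0):
    """Return True if 'df -P path' completes in time (assumed threshold)."""
    return command_finishes(["df", "-P", path], timeout_s)
```

A path for which mount_responds returns False would be a candidate for the kill-and-force-umount steps described in the entry; the path to test depends on the local mount layout, which this log does not spell out.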

Thursday : 2014-01-30

  • 05:45 Bill : set staticsky to stop in preparation for server moves.
  • 07:57 Bill : set stack to stop in preparation ...
  • 13:20 Bill : turned chip, warp, and stack revert off because there are many faults caused by required files being located only on nodes that are down for moving.

Friday : 2014-01-31

  • 07:41 Serge : turned warp revert on
  • 08:37 Bill : turned chip revert on and stack.revert off since the stack faults need files on ipp028
  • 12:56 Bill : shut down the update pantasks. It isn't used anymore.
  • 14:20 MEH: Gavin reports the mezzanine is 94F; what systems can he shut down?
    • when this happened before, we used Development_Cluster_Uses
    • ipp003 can be shut down -- will email transients group
    • Gene notes ipp022 nothing critical as well
    • @1500 Bill Unruh notes on-site evaluation of HVAC in progress
    • need final decision on what remains up over weekend if AC not repaired
    • all fixed
  • 18:50 Bill : since it is alive, set ipp028 to repair and reverted the stacks and staticsky runs that were waiting for data there. There was a burst of activity which has since relaxed

Saturday : 2014-02-01

  • 12:00 MEH: with updates added to stdsci, even more regular restarts are probably going to be required to keep polls full.. doing stdsci restart
    • STS chip stalled on ippc38 for past 164ks... (chip_id 942886, XY63) -- cleared fine after revert
  • 12:45 MEH: a large number of hosts are out and no longer need to be for the rsync/moves.. slowly putting them back in
    • ippc20-c28
    • ipp017, 020, 021
    • ipp024, 025, 030
    • others normally out in ignore_, updating ippconfig/pantasks_hosts.input
  • 13:00 MEH: stack has 3x c3 loaded while 2x c3 used for staticsky and 5x in stdsci.. that is asking for trouble..
  • 14:50 MEH: also looks like ipp054-ipp058 was put into ignore_wave4 in ippconfig/pantasks_hosts.input w/o a note for the move. putting back into use.
  • 16:10 MEH: stack.poll of 64 is underloading (ie idle nodes). 100 is better

Sunday : 2014-02-02

  • 15:50 MEH: regular restart of stdsci before nightly starts
