PS1 IPP Czar Logs for the week 2014.07.28 - 2014.08.03

Monday : 2014.07.28

  • 07:05 Bill: Restarted distribution pantasks. (The rest of them could probably use a restart as well)
    • increased npending for the rcserver.makefileset.run task from 5 to 10. The task is about 200,000 behind the number of completed filesets. We should keep an eye on ippc17 to ensure that this doesn't overload the datastore server.
  • 08:45 MEH: w/ nightly processing finished, using the idle nodes for MOPS diffim reprocessing tests (couple hours to run) -- finished
  • 09:11 Bill: Preparing to restart pstamp pantasks, letting pstamp.request.finish jobs finish.
    • 09:50 pstamp pantasks server restarted

Tuesday : 2014.07.29

  • 05:30 Bill: changed label to allow another 50,000 skycal distRuns to be processed. There are also 50,000 filesets to be made.
  • 11:00 MEH: going to start some MD.pv2 staticsky runs on underutilized s3+c2 -- turning off s3 before nightly processing
  • 16:24 CZW: I've added ipp071 into nebulous, and after some time and a few spot tests, it seems to be accepting files correctly. I have therefore set it to 'up'.

Wednesday : 2014.07.30

Thursday : 2014.07.31

  • 08:38 Bill: started up skycal as ~ipptest to process the pole skycells. Starting with the CNP version.
    • 08:42 that was quick. Now adding label for rings.v3 version
    • also set another 100,000 lap skycal distRuns to be processed. Here is the current status
      mysql> select label, state, count(dist_id) as num_runs, min(dist_id) as first_dist_id, max(dist_id) as last_dist_id 
      from distRun where label like 'lap.threepi.20130717%' and stage ='skycal'  group by label,state;
      +---------------------------+-------+----------+---------------+--------------+
      | label                     | state | num_runs | first_dist_id | last_dist_id |
      +---------------------------+-------+----------+---------------+--------------+
      | LAP.ThreePi.20130717      | full  |   550002 |       3465080 |      4015081 | 
      | LAP.ThreePi.20130717      | new   |   100000 |       4015082 |      4115081 | 
      | LAP.ThreePi.20130717.wait | new   |   333697 |       4115082 |      4677520 | 
      +---------------------------+-------+----------+---------------+--------------+
      3 rows in set (4.55 sec)
      
      
  • 09:20 MEH: clearing another fault 5 diffim from the other night before the warps get cleaned up -- 579391,skycell.2054.023 -- too late, now have to manually update warps..
  • 09:30 MEH: doing regular restart of stdscience
  • 09:57 Bill: skycal processing for the pole has finished. Ready to run staticsky for 3 skycells in M31 and reruns for 16 other skycells that were missing one or more inputs.
  • 11:10 MEH: using 2x c2/compute3 group to get a specific MD.pv2 refstack set finished before midnight tonight + 1x s3 (should be fine with 2x for PV3 pole)
    • adding 1x c1a,c1b; 2x c0
  • 12:20 MEH: doing regular restart of pstamp
  • 13:45 MEH: lots of misc months-old RINGS* tmp files filling up several /local/ipp/tmp dirs.. scanning and clearing out..
  • 13:50 Haydn replacing power unit in ippdb03, will be down for a bit
  • 17:00 MEH: turning off MD.pv2 refstack use on s3,c1a to clear out before nightly begins
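
A quick sanity check on the 08:38 status table: the sketch below compares each row's num_runs with the size of its dist_id range (last - first + 1). The helper name id_span is hypothetical and the rows are copied from the table; nothing here queries the database.

```shell
#!/bin/sh
# id_span FIRST LAST: number of ids in the inclusive range FIRST..LAST
# (hypothetical helper, not part of the IPP tools)
id_span() {
    echo $(( $2 - $1 + 1 ))
}

# Columns: label state num_runs first_dist_id last_dist_id
# (values copied from the status table in the 08:38 entry)
while read -r label state num_runs first last; do
    span=$(id_span "$first" "$last")
    echo "$label $state runs=$num_runs span=$span other_ids=$(( span - num_runs ))"
done <<'EOF'
LAP.ThreePi.20130717 full 550002 3465080 4015081
LAP.ThreePi.20130717 new 100000 4015082 4115081
LAP.ThreePi.20130717.wait new 333697 4115082 4677520
EOF
```

The first two rows come out contiguous (other_ids=0). The .wait row spans 562439 ids for 333697 runs, so 228742 ids in that range do not match the query; presumably they belong to other labels or stages, though the log does not say.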

Friday : 2014.08.01

  • 06:50 MEH: four OSS diffims w/ fault 5 to clear
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.0952.012 -diff_id 579807  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.0867.013 -diff_id 579811  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.0955.006 -diff_id 579817  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.0784.027 -diff_id 579827  -fault 0
    
  • 12:00 Gavin restored ipp002 (so IPP svn+wiki are back), 2nd socket on ipp002's motherboard is bad so at reduced capability..
  • 14:00 MEH: preparing for full IPP cluster shutdown ~1500 over weekend -- system down ~1445
    • ippsky staticsky job killed so it faults normally, same for MD stacks
    • all pantasks shutdown, ippc11 czarPoll+roboczar stopped
    • all apache servers ippc01-09 stopped
    • ippdb03,02,01,00 mysql shutdown gracefully -- mysqladmin shutdown
    • Gavin added message to ippops1/proxy for datastore+pss -- also stopped to keep connection attempts from slowing down shutdown of servers
    • Haydn and Gavin powering off all nodes and cabs, checking so things don't power up on their own w/ power cycle
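
For the repeated fault 5 clears in the 06:50 entry, a loop over (diff_id, skycell_id) pairs saves retyping. This is only a sketch: clear_cmd is a hypothetical helper, the flags are copied verbatim from the four commands in that entry, and it prints the commands rather than running them so they can be reviewed first.

```shell
#!/bin/sh
# clear_cmd DIFF_ID SKYCELL_ID: print (do not run) the difftool call that
# clears the fault, using the same flags as the 06:50 entry.
# Hypothetical helper, not part of the IPP tools.
clear_cmd() {
    echo "difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id $2 -diff_id $1 -fault 0"
}

# diff_id and skycell_id pairs from the 06:50 entry
while read -r diff_id skycell_id; do
    clear_cmd "$diff_id" "$skycell_id"
done <<'EOF'
579807 skycell.0952.012
579811 skycell.0867.013
579817 skycell.0955.006
579827 skycell.0784.027
EOF
```

Once the printed commands look right, the output can be piped to sh to execute them.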

Saturday : 2014.08.02

Sunday : 2014.08.03