PS1 IPP Czar Logs for the week 2015.06.08 - 2015.06.14

Monday : 2015.06.08

  • 16:05 Bill: restarted pstamp pantasks. It was looking tired to me.
  • 17:50 MEH: will start using the ippxNNN nodes for MD night stacks until the nodes are needed for other things

Tuesday : 2015.06.09

  • 06:35 EAM : rebooting ipp066 which crashed with no console messages

Wednesday : 2015.06.10

  • 07:35 EAM : stdscience is way behind. I'm stopping and restarting it.
  • 11:44 Bill: increased priority of label ps_ud_WEB to 500 (higher than nightly science) so that two chips can get processed for a MOPS postage stamp request. Once they are in the stdscience queue I will set it back.
  • 15:00 CZW: Reduced shuffle from -j 24 to -j 12 to halve the load.
  • 16:30 CZW: Starting ipplanl/pv3shuffle pantasks which will execute neb-replicate commands for all the LANL stack products. These are targeted onto the stsciXY.Z nodes.
  • 17:00 HAF: restarting everything. Will pull out the wave1s that are on the funky PDU: ipp004, ipp005, ipp006, ipp007, ipp009, ipp010, ipp011, ipp015, and ipp066 are pulled out of pantasks for tonight.
  • 22:20 EAM : removed ippc21, ippc55, ippc62 from processing lists -- not mounting home dirs?
  • 22:20 EAM : I've killed off all hung jobs and unmounted the cab2 machines from all hosts. The cluster seems clear now. I'm going to restart the ipp pantaskses.

Thursday : 2015.06.11

  • 05:36 Bill: changed label priority for ps_ud_MOPS to 500, which allows the small number (< 50 per day) of chip update jobs to be processed during nightly science processing.
  • 07:42 Bill: added gpc2 database to summitcopy pantasks and reg.add.date for gpc2 to registration. That removed the gpc1 date from the registration date list. I guess the operation has changed.
  • 11:55 MEH: ipp066 has had ganglia reporting down for 18+ hrs; it just needs
    sudo /etc/init.d/gmond restart
    
  • 12:55 EAM: Haydn reports the cab2 machines are back up after he replaced the PDU. I've set them to neb repair.
  • 12:59 Bill: restarted registration pantasks
  • 13:36 Bill: dropped a repeatedly faulting warp: warptool -updateskyfile -set_quality 42 -fault 0 -warp_id 1589979 -skycell_id skycell.0857.004
  • 15:20 MEH: with nightly processing ~finished, turning MD processing back on
  • 15:45 MEH: gpc2 warp stuck -- cannot build growth curve (psf model is invalid everywhere)
    warptool -dbname gpc2 -updateskyfile -set_quality 42 -skycell_id skycell.1207.015 -warp_id 7109 -fault 0
    
  • 17:35 MEH: pstamp has been stopped for >2 hrs w/o reason? turning back to run for QUB stamps
  • 20:40 MEH: cleanup wasn't disabled for MOPS data backlog from 20150610 and it got cleaned up...
    • restarting stdsci so it is fresh for this large amount of reprocessing -- it also appears that ipp004, 005, 006, 007, 009, 015 were still in a down state after their power-up
    • LAP.PV3 warp_id 1159355 accidentally cleaned, will need to be updated w/ proper PV3 tag..
  • 20:55 MEH: also removing MPE label from pstamp since QUB still doesn't have their stamps yet... -- put back in once QUB stamps finished
  • 23:45 MEH: OSS update seems complete, turning off extra nodes in stdsci
  • 00:40 MEH: summitcopy regularly throwing errors for ipp066 -- turn off
    2015/06/12 00:40:16 | ipp066 | FATAL | Nebulous::Client::replicate - can not copy instance file:///data/ipp070.0/nebulous/b9/91/7090330756.gpc1:20150612:o7185g0074o:o7185g0074o.ota34.fits
    

Friday : 2015.06.12

  • 08:45 EAM: various issues this morning. processing seems to be sluggish for reasons that are not yet obvious. some things I needed to do:
    • registration problems for a couple of exposures. I needed to revert the exposures:
      regtool -dbname gpc1 -revertprocessedexp -exp_id 927096
      regtool -dbname gpc1 -revertprocessedexp -exp_id 927200
      
    • 7 diffs from Jun 11 were not running because their source warps had been cleaned. I set the warps to update. some useful mysql commands:
       select diff_id, diffInputSkyfile.skycell_id, diffSkyfile.fault, warpSkyfile.data_state, diffSkyfile.data_state, warp1, warp2 
         from diffInputSkyfile 
         left join diffSkyfile using (diff_id, skycell_id) 
         join diffRun using (diff_id) 
         join warpSkyfile on (warp1 = warp_id and warpSkyfile.skycell_id = diffInputSkyfile.skycell_id) 
         where label = 'ThreePi.nightlyscience' and state = 'new' and diffSkyfile.fault is null
      +---------+------------------+-------+------------+------------+---------+---------+
      | diff_id | skycell_id       | fault | data_state | data_state | warp1   | warp2   |
      +---------+------------------+-------+------------+------------+---------+---------+
      | 1160284 | skycell.1961.008 |  NULL | update     | NULL       | 1589194 | 1589744 | 
      | 1160286 | skycell.1879.060 |  NULL | update     | NULL       | 1589192 | 1589742 | 
      | 1160286 | skycell.1879.070 |  NULL | update     | NULL       | 1589192 | 1589742 | 
      | 1160286 | skycell.1879.090 |  NULL | update     | NULL       | 1589192 | 1589742 | 
      | 1160287 | skycell.1881.058 |  NULL | update     | NULL       | 1589200 | 1589827 | 
      | 1160291 | skycell.2038.051 |  NULL | update     | NULL       | 1589201 | 1589775 | 
      | 1160291 | skycell.2038.081 |  NULL | update     | NULL       | 1589201 | 1589775 | 
      | 1160292 | skycell.2039.090 |  NULL | update     | NULL       | 1589195 | 1589743 | 
      | 1160293 | skycell.1959.010 |  NULL | update     | NULL       | 1589187 | 1589825 | 
      | 1160293 | skycell.1959.040 |  NULL | update     | NULL       | 1589187 | 1589825 | 
      | 1160293 | skycell.1959.050 |  NULL | update     | NULL       | 1589187 | 1589825 | 
      | 1160293 | skycell.1959.070 |  NULL | update     | NULL       | 1589187 | 1589825 | 
      | 1160293 | skycell.1960.009 |  NULL | update     | NULL       | 1589187 | 1589825 | 
      | 1160293 | skycell.1960.039 |  NULL | update     | NULL       | 1589187 | 1589825 | 
      | 1160293 | skycell.1960.049 |  NULL | update     | NULL       | 1589187 | 1589825 | 
      | 1160294 | skycell.2038.051 |  NULL | update     | NULL       | 1589189 | 1589745 | 
      +---------+------------------+-------+------------+------------+---------+---------+
      
      The warps shown in data_state 'update' were previously in 'cleaned'. I dumped the query output to a file and used the following awk to set them to update:
      awk '(NR > 1){printf "warptool -setskyfiletoupdate -set_label ThreePi.nightlyscience -dbname gpc1 -warp_id %s -skycell_id %s\n", $6, $2}' | tcsh
      
  • FYI -- these were the .multi diffims, which is why the 6/11 diffs needed 6/10 -- is MOPS actually using these? QUB is just using the WS diffs.
  • 10:00 EAM: manually queued diffs for o7185g0185o-o7185g0205o:
    difftool -dbname gpc1 -definewarpwarp -warp_id 1590362 -template_warp_id 1590377 -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2015/06/12 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20150612 -set_reduction SWEETSPOT -simple -rerun
    
  • 13:10 EAM : stopping all pantaskses to update mysql@ippdb08 : increase innodb_buffer_pool_size.
  • 13:20 EAM : ippdb08 has been upgraded to a larger innodb_buffer_pool_size (128G). I've moved the nebulous logs out of the way and restarted apache on ippc01-ippc10. Nebulous seems to be working and I've restarted the ipp user pantaskses.
  • 15:30 EAM : ipp044 was having trouble setting file locks. I noticed it was missing the directory /var/lib/nfs/sm. I created it and the locks worked fine from then on.
  • 17:10 EAM : ipp066 was behaving badly last night. I've taken it out of processing (in ignore) and set it to repair. Stopping and restarting pantasks to pick up the change.
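
The awk step in the 08:45 entry can be sketched as a small dry-run helper. This is only a sketch under stated assumptions, not the exact command that was run: the function name gen_warp_updates is hypothetical, it is written for a POSIX shell rather than tcsh, and it assumes the query output was dumped in tab-separated form with a header row (in the boxed table layout shown above, whitespace-split fields would not line up with $6 and $2). It prints the warptool commands instead of executing them, so they can be reviewed first.

```shell
# Hypothetical helper: print (do not run) one warptool command per data row.
# Assumption: the input file is tab-separated with one header line, and the
# columns follow the query above:
#   diff_id, skycell_id, fault, data_state, data_state, warp1, warp2
gen_warp_updates() {
    # NR > 1 skips the header; $6 is warp1 and $2 is skycell_id.
    awk -F'\t' 'NR > 1 {
        printf "warptool -setskyfiletoupdate -set_label ThreePi.nightlyscience -dbname gpc1 -warp_id %s -skycell_id %s\n", $6, $2
    }' "$1"
}
```

Given a dump file (the name is whatever the output was saved as), the output of gen_warp_updates on that file can be inspected and then piped to a shell once it looks right.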

Saturday : 2015.06.13

  • 06:10 EAM : gpc2 was still not in the summitcopy database list. I added it, and also fixed the input script to add it so it will be there on restart. (It had been removed earlier when the gpc2 datastore was having trouble.)
  • 08:30 EAM : cleared repeatedly failing diffs:
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.0697.027 -diff_id 1160975
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.0697.027 -diff_id 1161004
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.0697.027 -diff_id 1161016
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.0697.027 -diff_id 1161061
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.2596.016 -diff_id 1161211
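
When several diffs need the same treatment, the repeated difftool lines above could be generated from a list of pairs. A minimal sketch, assuming a POSIX shell and input lines of the form "skycell_id diff_id"; the function name gen_diff_clears is hypothetical, and the difftool options are copied unchanged from the commands above. It only prints the commands, leaving it to the czar to review and run them.

```shell
# Hypothetical helper: read "skycell_id diff_id" pairs from stdin and print
# one difftool command per pair (nothing is executed here).
gen_diff_clears() {
    while read -r skycell diff_id; do
        # Skip blank lines.
        [ -n "$skycell" ] || continue
        printf 'difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id %s -diff_id %s\n' "$skycell" "$diff_id"
    done
}
```

For example, feeding it the two lines "skycell.0697.027 1160975" and "skycell.2596.016 1161211" reproduces the first and last commands in the 08:30 entry.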
    

Sunday : 2015.06.14