PS1 IPP Czar Logs for the week 2015.01.19 - 2015.01.25

Monday : 2015.01.19

  • 05:25 EAM: ipp062 crashed around 04:15, nothing on console. processing completely stalled. i'm rebooting it now.
  • 05:40 EAM: after the reboot of ipp062, processing back up and running.
  • 09:55 MEH: ipp084 is in WT state and may be dragging on processing; putting it into repair
  • 10:20 MEH: two warps stalling OSS finish
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1369542 -skycell_id skycell.1609.051
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1369710 -skycell_id skycell.1609.051
    
  • 14:45 EAM: stopping & restarting stdlocal
  • 15:15 EAM: rebooting ippx016 -- load very high for no good reason, ganglia shows there has been no I/O since midnight -- I suspect an ethernet crash (some generic checks are sketched at the end of this list).
  • 15:20 EAM: oops: I just shut down stdlocal accidentally. restarting it again.
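    (sketch, not from the log, re the 15:15 ippx016 entry: generic checks one could run from the console to confirm a wedged NIC before rebooting; the interface name eth0 is an assumption)
      dmesg | tail -n 50                      # driver or link-down errors around the time of the hang
      cat /sys/class/net/eth0/operstate       # reports "down" if the link has actually died
      cat /proc/net/dev                       # frozen rx/tx byte counters match the no-I/O-since-midnight symptom in ganglia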

Tuesday : 2015.01.20

  • 05:00 EAM: stdscience is running sluggishly; stopping it to restart it
  • 10:20 EAM: some warps with growth curve failures:
    	warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1371670 -skycell_id skycell.1697.023
    	warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1371761 -skycell_id skycell.1698.077
    	warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1371765 -skycell_id skycell.1700.047
    	warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1371802 -skycell_id skycell.1787.048
    	warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1371947 -skycell_id skycell.1701.070
    
  • 12:00 EAM: ipp041 crashed, nothing on console. rebooting.
  • 12:20 EAM: ipp056 has hung mounts, trying to clear (see the sketch at the end of this list)
  • 12:25 EAM: ipp056 all jammed up (kernel segfaults), rebooting.
  • 14:00 EAM: bad diff (growth curve, repeat failures):
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 642344 -skycell_id skycell.1699.083
    
  • 15:08 CZW: starting up ~ipplanl/pv3stacksummary/ pantasks to process/reprocess PV3 stack summaries. I'm going to run these on the s0-s3 nodes, with one instance of each. The jobs are rather low-impact, with IO being the largest resource. The server defaults to 10 running jobs, and I'll bump this up if the ipp6509-e 10G link stays clear.
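    (sketch, not from the log, re the 12:20 ipp056 hung-mounts entry: a typical way to find and clear stale NFS mounts; the mount point shown is hypothetical)
      grep nfs /proc/mounts                   # list the NFS mounts to spot the stale ones
      umount -f /data/ipp054.0                # force-unmount a hung mount (hypothetical mount point)
      umount -l /data/ipp054.0                # lazy unmount as a fallback if the force also hangs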

Wednesday : 2015.01.21

  • 06:50 EAM: ipp045 crashed, nothing on console. rebooting.
  • 09:25 EAM: stdscience was far behind due to ipp045. I've added ippsNN and some ippxNN nodes (esp the x0b and x1b set -- on the n5k switch)

Thursday : 2015.01.22

  • 14:10 EAM: all sorts of work today: we stopped all processing and unmounted everything by ~12:30 so Haydn and Gavin could reconfigure the network, adding an extra trunk link between the 6509e switch and ippcore. this was completed and I restarted stdlocal. unfortunately, my umount -a jobs also unmounted /export/ippxNN on all the ippx hosts, which meant widespread warp failures until I could remount those (remount sketch at the end of this list). I'm going to restart stdlocal once the remaining jobs have cleared.
  • 14:45 EAM: stdlocal is running, staticsky is now running as user ippsky, using m0, m1, x0b, x1b, x3 nodes (4x each).
  • 22:40 MEH: ipp077 getting clobbered by something, high cpu wait conditions. staticsky is an /any/ target, should it be stsci nodes only?
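    (sketch, not from the log, re the 14:10 umount -a mishap: one way to put the /export/ippxNN partitions back, assuming they are listed in each host's fstab; ippx_hosts.txt is a hypothetical host list)
      for h in $(cat ippx_hosts.txt); do
          ssh $h 'mount -a'                   # remount everything fstab defines, including /export/ippxNN
      done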

Friday : 2015.01.23

  • 07:20 MEH: nightly rate is ~10-20% lower than nominal, something dragging on it
  • 09:45 MEH: ipp085 (maybe ipp089) getting load spikes even though there is really no nightly left and cleanup had been turned off.. -- ipp082,085,093,095 particularly harassed Wed and Thurs nights.. (diagnosis sketch at the end of this list)
  • 10:30 MEH: 2x loading in cleanup to push through the backlog of chips on the lower/fuller new storage nodes; distribution stalled on a bundle from older cleanup, and expecting a possible double nightly cleanup set from when stdsci was off during init.day
    • yes, cleanup of the large diffs (and warps+chips) for 1/21 was missed yesterday.. had to manually put them to goto_cleaned
  • 12:30 MEH: running a small sample of SNIa test data in stdsci, chip through warp -- done
  • 23:30 MEH: fixed a bad OSS warp so it won't stall exposure diffs
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1385544 -skycell_id skycell.1782.036
    
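    (sketch, not from the log, re the 09:45 load spikes: generic checks one could run on ipp085/ipp089 to see what is generating the I/O wait; iostat needs the sysstat package)
      uptime                                  # confirm the load spike
      iostat -x 5 3                           # per-disk utilization and await over a few 5s samples
      top -b -n 1 | head -n 20                # which processes are busy or stuck in D state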

Saturday : 2015.01.24

  • 07:45 MEH: had to manually clear the exposure fault for o7046g0640o.. hopefully nothing is out of order..
    regtool -dbname gpc1 -revertprocessedexp -exp_id 861689
    
    • of course things are out of order.. manually queuing the proper diffims.. (but also have to wait for warps of o7046g0640o to finish..)
      difftool -dbname gpc1 -definewarpwarp -exp_id 861668  -template_exp_id 861689  -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2015/01/24 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20150124 -set_reduction SWEETSPOT -simple -rerun
      
      difftool -dbname gpc1 -definewarpwarp -exp_id 861704  -template_exp_id 861724  -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2015/01/24 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20150124 -set_reduction SWEETSPOT -simple -rerun
      
      
  • 08:55 MEH: nightly and other processing finishing with the last publish file from the messed-up diffim.. turning storage nodes on in stdlocal.. storage.hosts.on
  • 09:40 MEH: cleared an old required nightly WS data product in distribution (it was cleaned up before distribution finished) from 1/20...
  • 17:15 MEH: doing the necessary stdsci restart so the nightly processing rate keeps up

Sunday : 2015.01.25

  • 01:20 MEH: looks like ipp054 is wedging up nightly and summitcopy+registration... trying to clear
    • ipp054 kernel panic fun -- had to power cycle
    • put ipp054 neb-host down, but now it seems neb-host isn't responding?
    • summitcopy also timing out?
    • ippdb00 disk is full, fantastic! -- slow.log is 20G again.. binlogs also using 1.1G each..
  • 02:30 MEH: looks like mysql on ippdb00 is actually down..
    • rotating slow.log out of the way first (see the sketch at the end of this list)
    • try restart mysql --
      150125  2:45:36  InnoDB: Database was not shut down normally!
      InnoDB: Starting crash recovery.
      InnoDB: Reading tablespace information from the .ibd files...
      InnoDB: Restoring possible half-written data pages from the doublewrite
      InnoDB: buffer...
      InnoDB: Doing recovery: scanned up to log sequence number 3525 4237682176
      ...
      
    • if it comes back up, try purging some older binlogs to free up space as well --
  • 03:40 MEH: no recovery so far, last log entry ~1hr ago -- no idea how long this will take, leaving until morning
    InnoDB: Doing recovery: scanned up to log sequence number 3526 210101
    
  • 06:00 EAM: mysql is up. ippdb03 is working on mysqld-bin.004482, ippdb06 is working on mysqld-bin.004396 (checked as sketched at the end of this list). i'm going to purge up to mysqld-bin.004372 (Jan 17):
    PURGE BINARY LOGS TO 'mysqld-bin.004372';
    
  • 14:45 MEH: nightly finished, turning storage nodes on in stdlocal.. storage.hosts.on
  • 15:00 EAM: stopping stdlocal to re-start (over 100k jobs completed)
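    (sketch, not from the log, re the 02:30 slow.log rotation and the 06:00 binlog purge: the usual recipe for both steps; file paths are assumptions)
      # mysqld was down here, so a plain mv is enough; with mysqld running, follow the mv with FLUSH LOGS
      mv /var/lib/mysql/slow.log /var/lib/mysql/slow.log.20150125
      # on each replica (ippdb03, ippdb06), check which master binlog it is still reading,
      # then purge on ippdb00 only up to the older of the two
      mysql -e 'SHOW SLAVE STATUS\G' | grep Master_Log_File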