PS1 IPP Czar Logs for the week 2015.08.10 - 2015.08.16

Non-standard Processing

Paul Sydney is transferring data off the system at a high(?) rate. Where is this data going? We should probably watch network loads over the next month or so.



Monday : 2015.08.10

  • 05:57 Bill : set quality fault on -warp_id 1613166 -skycell_id skycell.1824.065 "cannot build curve of growth"
  • 09:05 MEH: >5 WS diffs have been stuck at fault 5 since Saturday night. These must be set to quality 42 before they can be cleaned; it takes less than five minutes just to check and clear them (a batch-clearing sketch follows this item):
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0635.036 -diff_id 1189950  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0635.036 -diff_id 1189968  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0635.036 -diff_id 1189999  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0635.036 -diff_id 1190015  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0711.011 -diff_id 1189953  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0711.011 -diff_id 1189963  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0711.011 -diff_id 1189996  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0711.011 -diff_id 1190016  -fault 0
    
    • These skycells failed because the template stacks are unusable: they have few pixels, and the pixels that are there have anomalous values. Prevented this from happening again with
      stacktool -fault 0 -updatesumskyfile -stack_id 2301016 -set_quality 66
      stacktool -fault 0 -updatesumskyfile -stack_id 3330240 -set_quality 66
      stacktool -updaterun -set_note 'this stack is unable to be a template for diffs' -stack_id 2301016
      stacktool -updaterun -set_note 'this stack is unable to be a template for diffs' -stack_id 3330240
      
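    • For batches like this, a minimal loop sketch (my own convenience wrapper, not an IPP-provided script; the difftool invocation and the diff_id/skycell_id pairs are exactly those listed above):
      # clear each stalled WS diff by setting quality 42 and clearing the fault
      for pair in \
          1189950:skycell.0635.036 1189968:skycell.0635.036 \
          1189999:skycell.0635.036 1190015:skycell.0635.036 \
          1189953:skycell.0711.011 1189963:skycell.0711.011 \
          1189996:skycell.0711.011 1190016:skycell.0711.011 ; do
        difftool -dbname gpc1 -updatediffskyfile -set_quality 42 \
          -skycell_id "${pair#*:}" -diff_id "${pair%%:*}" -fault 0
      done
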
  • and yet another stalled WS diff from Friday night, for which the warps have already been cleaned -- so it will now take more work to clear; no time until I'm czar later in the week.
    • Bill used the postage stamp server to update the needed warp 1612381 skycell.1140.060
    • Then the diff failed again due to a failure to read the stack from ippb06. Culled that instance of the stack and reverted; the diff then succeeded.

Tuesday : 2015.08.11

  • 06:20 EAM: ipp061's RAID is offline. I need to reboot the machine. I've stopped the local addstar and am removing ipp061 from processing.
  • 12:30 CZW: I've restarted all ipp pantasks servers after moving ipp004-ipp021 into the ignore list. Those hosts do not convert times correctly, and that was interfering with processing. I've added x3 nodes to stdscience to prevent slowdowns in processing.

Wednesday : 2015.08.12

  • 08:35 MEH: Gavin fixed the localtime issue on ipp004-021 (except ipp010, kept back for testing); returning the fixed nodes to processing today. (Mostly -- pantasks_hosts.input is really an unacceptable mess of commented-out hosts without any notes.)
    • the use of 3x x2 nodes seemed to run fine, unlike Gene's attempted allocation last week
  • 09:30 MEH: sending months-old gpc2 data that had been missed to cleanup
  • 16:30 MEH: ippc11 will be rebooted to replace a failing drive
  • 16:45 MEH: ipp074 has been in repair since last Saturday due to concern over the VFS: file-max limit; putting it back up -- if it needs to be rebooted, we should just reboot it. Keeping nodes in repair like this creates an imbalance of space and is part of the problem of ending up with only a few nodes (with lots of space) taking data, which decimates nightly processing.
  • 20:20 MEH: processing faults all over; the ipp062 disk is unavailable -- XFS failure -- luckily it was already in repair for nightly processing. Set neb-host down; all jobs need to be removed, addstar stopped, and at some point xfs_repair run. Hopefully not a lost-file case like ipp018 (PS1_IPP_Czarlog_20150105)
    • had to reboot from the console to unmount the disk -- /var/log/messages shows XFS doing the needed metadata recovery:
      Aug 12 22:46:31 ipp062 [   18.101291] XFS (sda4): Mounting Filesystem
      Aug 12 22:46:31 ipp062 [   18.372863] XFS (sda4): Starting recovery (logdev: internal)
      Aug 12 22:46:31 ipp062 [   51.890507] XFS (sda4): Ending recovery (logdev: internal)
      
    • need to turn off swap before the umount because swap is on that disk -- and it probably doesn't hurt to shut down mysql as well and to stop exporting the disk:
      sudo swapoff -a
      sudo mysqladmin shutdown
      sudo exportfs -uav
      sudo umount -v /export/ipp062.0
      
    • sudo xfs_repair -n /dev/sda4 indicated no problems to be fixed, so skipping a full repair run
    • looks like the same issue ipp061 had the day before, so putting it back to neb-host repair and into processing again (a consolidated sketch of the shutdown/check sequence follows)
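    • Consolidated sketch of the sequence used above (assumptions: the data filesystem is /dev/sda4 mounted at /export/ipp062.0, the mount has an /etc/fstab entry, and swap/mysql live on that disk as noted):
      sudo swapoff -a                    # swap is on the affected disk
      sudo mysqladmin shutdown           # stop the local mysql instance
      sudo exportfs -uav                 # stop NFS-exporting the disk
      sudo umount -v /export/ipp062.0    # unmount the data filesystem
      sudo xfs_repair -n /dev/sda4       # dry run: report problems without modifying anything
      # if the dry run reports nothing to fix, remount and re-export (assumes an fstab entry):
      sudo mount /export/ipp062.0
      sudo exportfs -av
      sudo swapon -a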

Thursday : 2015.08.13

  • 10:20 MEH: very old (pre-2012) MD stacks are starting to go to cleanup
  • 11:00 MEH: restarting pstamp
  • 23:30 MEH: Haydn reports ippc11 is having disk issues after the wrong disk was replaced -- unclear if there is data loss, but czarpoll aborted with input/output errors, so at least ippmonitor appears to be non-functional for the time being.

Friday : 2015.08.14

  • 07:35 Bill: Serge reports that processing is stalled. Chip processing is stuck waiting on detrend files from ipp063, which Gene reported was having problems earlier this morning. Set it to down in nebulous (see the neb-host sketch at the end of this list).
    • 08:11 chip processing has completed. All but 4 camRuns are finished as well; just reverted those.
  • 10:30 CZW: ipp063 back online after reboot.
  • 12:20 Bill: Serge reports that one pair has not finished. It has one skycell with a recurring fault.
    difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1192664 -skycell_id skycell.1326.090
    
  • 15:15 CZW: stopping and restarting ipp pantasks for daily/pre-weekend restart.
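  • (neb-host sketch referenced above) Marking a host down and back up in nebulous -- the exact syntax here is my assumption, inferred only from how neb-host states are written in these logs (hostname followed by target state):
    neb-host ipp063 down   # assumed syntax: mark the host down while it is having problems
    # after the reboot, return it to service:
    neb-host ipp063 up     # assumed syntax: mark the host up again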

Saturday : 2015.08.15

  • 22:20 MEH: looks like no chips are moving forward -- logs indicate ipp061 is not reachable, XFS errors again. Set neb-host down, took it out of processing, and let others know it will need a reboot at some point
    • nightly moving forward now
  • 23:00 MEH: nightly stalled again; something caused queries to hang and hiccuped registration. Had to manually run the following (or wait for it to maybe get reverted...):
    regtool -updateprocessedimfile -exp_id 958315 -class_id XY46 -set_state pending_burntool -dbname gpc1
    

Sunday : 2015.08.16

  • 09:05 MEH: the other jobs running on ipp061 finally seem to be clear, so rebooting it -- leaving it out of nightly processing and neb-host down