PS1 IPP Czar Logs for the week 2013.12.02 - 2013.12.08

(Up to PS1 IPP Czar Logs)

Monday : 2013.12.02

Bill is czar today

  • 10:47 Set all data with label like ps_ud% to be cleaned. Restarted cleanup pantasks.
  • 13:30 Dropped the 6 STD.nightlyscience chipRuns from 11-27 and 11-29 for exposures with filter = 'OPEN'. We don't process those because we have no flat detrends.
  • 13:35 set all pantasks to stop in preparation for full restart. (The only one doing anything currently is cleanup)
  • 16:15 set STS.rp.2013 distribution bundles to be cleaned
  • 19:05 MEH: a reminder that when restarting any pantasks, hosts may need to be manually turned off to account for DVO/PSPS/mysql use on those machines until mysql etc. can be restarted, or jobs/nightly processing is likely to get jammed -- currently it looks like ipp048--ipp053, 039, 057 may still be problem children.
    • ipp053 holding onto a registration job for >3ks and chip jobs for >2ks so far..
    • turning ipp048--ipp053 off

Tuesday : 2013.12.03

  • 08:35 Bill: Serge reports that ipp is stuck. It appears that ipp057 is down and some registration jobs are stuck trying to access files there. Power cycling it.
  • 16:10 MEH: in addition to manually taking nodes out of processing if DVO/PSPS is overloading them, they may also need to be put into neb repair (more of an issue now that there is open space on many of the similar data nodes)
  • 16:20 MEH: tweak_ssdiff to get the SSdiff out for QUB

Wednesday : 2013.12.04

  • 09:38 Bill: dropped chip_id 919320 and 919321 because they will fail to run. The filter for these STD exposures is OPEN and we don't have flats for that case.
  • 12:25 MEH: ipp065 needed to be taken out of stack as well.. tweak_ssdiff for whatever MD stacks are available now
  • 17:50 MEH: Heather finished restarting mysql on ipp048-ipp053, ipp065 so putting them back into pantasks
    • ipp039 may need to have an eye kept on it
  • 18:00 CZW: I've set stsci13.X to repair in nebulous. Something is broken with NFS, and because of this, the log is full of "lockd: cannot monitor ipp046" entries. This prevents any new files from being written, as we lock files to ensure they're written correctly. This was forcing nearly all jobs to fail, and suggests that stsci13 needs to be rebooted to regain its sanity.
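    • a quick way to check for this sort of lock wedging (rough sketch with standard tools; the syslog path and the test path under /data/stsci13.0 are assumptions):
      # look for lock-manager complaints in syslog
      grep 'lockd: cannot monitor' /var/log/messages | tail -5
      # try to take an advisory lock on the suspect volume; if this times out,
      # NFS locking is wedged and writes that need locks will keep failing
      touch /data/stsci13.0/czar.lock.test
      flock -w 10 /data/stsci13.0/czar.lock.test -c 'echo lock acquired' || echo 'lock failed / timed out'
      rm -f /data/stsci13.0/czar.lock.test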

Thursday : 2013.12.05

Mark is czar

  • 00:40 MEH: something was slowly reducing processing until ~midnight -- registration got ~20 exposures behind, camera stage taking ~2ks, rates ~20 exp/hr. Removed the LAP label and set lap.off @midnight and things slowly got back to normal; LAP label back in and lap.on, but with such a backlog of chips/warps will have to wait and see if it degrades again.
  • 04:50 MEH: again found LAP overusing processing, >80 3PI chips to do, registration behind -- LAP label out.
  • 08:00 MEH: nightly finished except for WS. cleared five fault 5 WS diffims (501812, skycell.1157.030; 501836, skycell.1156.007; 501847, skycell.1157.030; 501868, skycell.1156.007). cleared two distRun faults hanging around from 12/2. LAP chip.off for a bit to clear some warps for stacks, then chip.on.
    • unclear what the slowdown was last night, whether other jobs (dvo, ps2iq,?) were doing it or if mysql on ippdb01 just needs to be restarted again
  • 11:40 MEH: noticed 33G left on ippc18.0, will archive the small 10G or so of logs before the weekend..
  • 18:00 MEH: doing the regular restart of stdsci before nightly -- will be watching processing again to see what could be causing the stalling seen last night with LAP running
    • cleanup was probably overloading; 2x w4,c,c2,c3 is probably overkill. noticed jobs taking >1ks to finish earlier so turned poll down to 30, turned off all computes and 1x wave4
    • lap.off helped to drop the ippdb01 load; adjust.lap seems to set a better poll period, so running with that for a bit
    • mysql on ippdb01 likely just needs a restart as well
  • 20:30 MEH: lap.off until ~midnight (when ~half a chunk of LAP stacks should be finished) as ippdb01 is loaded again and there is plenty of nightly to do
  • 23:50 MEH: lap.on with an increased poll period; it has been loading new LAP jobs for the past hour and seems ok, so leaving it on for the rest of the night
    • registration still seems to get a few exposures behind, but better than 20+
    • ipp015 seems to have extra load, taking 2x out of stdsci

Friday : 2013.12.06

Mark is czar

  • 08:00 MEH: the change in LAP poll period seems to have helped keep nightly from backing up last night -- will make it the default in the stdsci startup
  • 10:30 MEH: after talking with Gavin, stsci13 just needs a reboot -- looks like SAS staticsky will still take most of the day and don't want to interrupt it, and don't want to reboot at the end of the day on Friday.. leave it for Monday
    • ippdb01 mysql could also likely use a reboot and will leave that for Monday (bad idea on a Friday)
  • 10:38 Bill: set skycal to off in stdscience and prevented it from being loaded at the next restart. The SAS processing needs to be run with the new code in order to get updates to the cmf format.
  • 18:00 MEH: stdsci will need its regular restart before nightly, will be done around this time
  • 18:30 MEH: cleanup of chip cmf jobs taking >2ks again, turning chip.cleanup.off..
  • 19:00 MEH: processing barely advancing, LAP label out and lap.off -- still very poor...
  • 19:30 MEH: restarting apache on ippc0x, faulted many stalling jobs.. cleanup is clear so restarting it. registration catching up, processing moving along better
  • 20:00 MEH: rawcheck stop --
  • 20:30 MEH: LAP label back in, lap.on
  • 21:30 MEH: rawcheck run
  • 02:40 MEH: registration climbing past 30 exposures behind -- LAP off until caught up

Saturday : 2013.12.07

  • 10:00 Bill: two SAS staticsky runs faulted in the extended source fits. No stack trace in the log. Reverted them.
  • 10:05 Bill: difftool -updatediffskyfile -diff_id 503401 -skycell_id skycell.2241.075 -fault 0 -set_quality 14006
  • 17:40 MEH: regular restart stdsci
  • 23:30 MEH: putting compute3 back into stack

Sunday : 2013.12.08

  • 12:20 MEH: appears ipp005 has had a kernel panic.. rebooting
  • 12:40 MEH: clearing files stalled for ~7hrs on ipp055,065
    • ipp050 also giving trouble, restarting nfs and looking for others -- ippc26
  • 13:10 MEH: now moving on to ipp001 gpc1 mysql dumps -- disk is full, using Gene's script
    # run as the ipp user
    cd /data/ipp001.0/ipp/mysql-dumps
    ~/mysql-dump/delete_log_spacing.sh mysql-gpc1-ippdb03 test
    ~/mysql-dump/delete_log_spacing.sh mysql-gpc1-ippdb03 commit
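    For reference, a very rough sketch of the kind of spacing-based cleanup such a script might do (the dump filename pattern, the keep-the-7-newest rule, and the keep-every-4th policy below are assumptions, not what delete_log_spacing.sh actually does):
      # illustrative sketch only -- NOT Gene's delete_log_spacing.sh
      # "test" prints what would be removed, "commit" actually deletes
      MODE=${1:-test}
      cd /data/ipp001.0/ipp/mysql-dumps
      i=0
      # newest first; always keep the 7 most recent dumps, then keep every 4th older one
      for f in $(ls -1t mysql-gpc1-ippdb03*.gz 2>/dev/null | tail -n +8); do
          i=$((i+1))
          [ $((i % 4)) -eq 0 ] && continue
          if [ "$MODE" = "commit" ]; then
              rm -v "$f"
          else
              echo "would remove $f"
          fi
      done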
    
  • 14:20 MEH: watching LAP to flush out any other mount troubles
    • ipp023 to repair -- will need to be rebooted like stsci13 on Monday
  • 16:40 MEH: appears ipp050.0 also borked and /data/ipp050.0 not mounting -- needs a reboot Monday
    • reading mount back, leaving in repair
  • 17:45 MEH: ipp015, 058 also to repair -- LAP mostly moving again, nightly darks also finally cleared and registered
  • 20:50 MEH: ipp011 having trouble with locking files, neb-repair and out of processing
  • 22:30 MEH: ipp063 trouble locking files, neb-repair and out of processing
  • 22:50 ipp057 same