PS1 IPP Czar Logs for the week 2015.07.20 - 2015.07.27

Non-standard Processing

To improve communication, HAF thinks we should list additional (non-standard) processing in the czar pages, so that we all know what is going on (there is no requirement for others to help run it). HAF will start:

HAF:

  • ipp054 - ipp081: addstar processing for full force / diff. If you notice problems and need to reboot / stop this, please do the following:
    • contact HAF via email or phone (all should have my #)
    • if you feel addstar needs to be stopped and can't reach her:
             ssh ippdvo2@ippc19
             cd addstar.ipp054 (replace with machine that is problematic)
             pantasks_client
             stop
      
    • then check status: wait for the current addstar run to finish and for minidvodb.premerge to finish (otherwise you risk corrupting the mini); a consolidated sketch of these steps appears after this list
  • stsci03 is currently running ipptopsps for SAS39 FW. This is fairly resilient, so there is no need to stop it if there is a problem. You can access it via:
             ssh ippdvo2@ippc19
             screen -r fw
    
    • you can ctrl-c to stop it if necessary (but ipptopsps really doesn't care; it gracefully dies off if gpc1/nebulous dies off, for example)
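
A consolidated, annotated sketch of the two procedures above, using only the hosts, directories, and session name already listed (replace ipp054 with whichever machine is actually problematic):

             # emergency stop of addstar on a problematic node (only if HAF cannot be reached)
             ssh ippdvo2@ippc19
             cd addstar.ipp054        # replace ipp054 with the problematic machine
             pantasks_client          # connect to the addstar pantasks server
             stop                     # then wait for the current addstar run and minidvodb.premerge to finish

             # check on the SAS39 FW ipptopsps run on stsci03
             ssh ippdvo2@ippc19
             screen -r fw             # detach again with C-a d; C-c only if it really must be stopped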


Monday : 2015.07.20

  • 13:45 EAM: I restarted the gpc1 / ippdb05 mysql (and moved the slow log out of the way). I'm now restarting ipp pantasks.
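
A minimal sketch of that kind of slow-log shuffle plus restart, assuming an init-script-managed mysqld and a slow log living in the data directory on ippdb05 (the paths and service name here are assumptions, not taken from this log):

             # on ippdb05: stop mysqld, move the oversized slow query log aside, restart
             /etc/init.d/mysql stop
             mv /var/lib/mysql/mysql-slow.log /var/lib/mysql/mysql-slow.log.2015.07.20   # hypothetical path
             /etc/init.d/mysql start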

Tuesday : 2015.07.21

  • 07:30 MEH: registration has been stalled, even on darks, for the past 5+ hrs. A simple regtool query does not complete on the command line even with little to nothing running in the mysql/ippdb05 processlist. The same regtool command works fine on gpc2, and chiptool on gpc1 works fine (a quick processlist check is sketched after today's entries).
  • 09:30 MEH: restarting czarpoll after crash sometime last night..
  • 15:35 MEH: restarting czarpoll again after crash when gpc1 was down today?
  • 15:53 CZW: Starting ippsky/pv3holes to attempt to finish off the remnant processing needed to get PV3 completely done.
  • 16:10 MEH: bumped pstamp up somewhat to get the QUB stamps from the stalled weekend diffs done
  • 16:40 CZW: pv3holes pantasks stopped and shut down, as I've completed the missing warp and diff updates for PV3. I'll probably bring it back up to complete full force once I have the task list for that organized.
  • 16:50 MEH: ipp034 has neb-host set to down, with nothing logged as to why since it was brought back up on 7/2 with 1 CPU -- data there is needed, so putting it to repair now
  • 20:30 MEH: restarted pstamp to reset for normal processing and >100k after all the QUB stamps
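
For the 07:30 registration stall, a quick processlist check might look like the following, assuming the mysql client on a cluster host can reach ippdb05 with the usual credentials (the host name is from the entry above; everything else is an assumption):

             # show everything currently running on the ippdb05 server (processlist is server-wide)
             mysql -h ippdb05 -e 'SHOW FULL PROCESSLIST'
             # flag queries that have been running for more than 5 minutes (Time is column 6)
             mysql -h ippdb05 -e 'SHOW FULL PROCESSLIST' | awk -F'\t' '$6+0 > 300'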

Wednesday : 2015.07.22

  • 12:20 MEH: few data nodes are available for nightly that aren't red or in repair:
    • ipp082 -- turns out the BBU has not been connected on the new RAID card since 7/2, but the write cache is on; leave in repair until fixed
    • ipp071,074,076,079,080 -- in repair for DVO; Gene's email says they can be put back up
    • ipp090 -- the note from 7/8 (when Chris was czar) just says it was getting slammed; put back up
  • 18:00 CZW: started an rsync-based shuffle off of stsci. This is running in a set of screen sessions on stsci00 as the ipp user. They seem to respond nicely to C-c if they need to be stopped in an emergency (see the sketch after this list).
  • 18:30 CZW: restarting ippsky/staticsky to run skycal operations. There are 17 outstanding staticsky runs, but they appear to be in the bulge and need special handling. The labels are LAP.PV3.20140730.skycal01.holes and LAP.PV3.20140730.skycal01.update, and the processors are 4x x2, x3, and 2x c2.
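
A minimal sketch of how an rsync shuffle like the 18:00 entry can be run inside detached screen sessions; the session name, source/target paths, and rsync options below are illustrative assumptions, not the actual shuffle configuration:

             # on stsci00 as ipp: run one shuffle leg in a named, detached screen session
             screen -dmS shuffle01 rsync -a --partial \
                 /export/stsci03.0/shuffle/ ipp074:/data/ipp074.0/shuffle/   # hypothetical paths
             # reattach to watch progress; C-a d detaches again, C-c stops it if needed
             screen -r shuffle01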

Thursday : 2015.07.23

  • 13:00 CZW: Noticed ipp073 was odd in nebulous. There's some issue with the filesystem, and although the host is up, the disk isn't. I've sent a message to Haydn to see if he can learn more.
  • 14:30 CZW: Reboot of ipp073 seems to have resolved the issue. Haydn confirms the RAID looks fine, so this should be just a one-off glitch.
  • 17:20 CZW: Restarting all the ipp pantasks so they'll be fresh for tonight.
  • 18:20 CZW: Restarting...I clicked the wrong date.

Friday : 2015.07.24

  • 06:31 Bill: chip processing is stalled because ipp061 has some required detrend files and it is apparently not responding to NFS. Set it to down and reverted the faults, and now progress is being made (a quick NFS responsiveness check is sketched after today's entries).
    • Note: according to ganglia, ipp074, 076, and 079 have a load of 500
  • 07:29 Bill: increased camera poll value to 90 from 60 to push through the backlog
  • 07:40 EAM : rebooting ipp061 to address XFS problems
  • 07:45 EAM : ipp074, ipp076, ipp079 are the output targets of the stsci rsync, which is clobbering them. I've set them all to repair.
  • 07:47 EAM : ipp061's RAID came back online after the reboot; I've put it in repair for now.
  • 12:45 CZW: set ipp096 to up to see if that helps alleviate the load issues with ipp072 and ipp078.
  • 18:20 CZW: restarting ipp pantasks so they're fresh.
  • 20:20 MEH: WS diffs from last night are still faulted. They either need to be reverted a few times (the primary stack, for some undocumented reason, is on ippb06) or just set to quality 42 so they can be distributed to QUB, as they are paying for timely access. It is more of a hassle if they get left behind when warp cleaning happens or fall behind the current nightly processing.
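
For stalls like the 06:31 one, a quick way to tell whether a data node is still serving NFS, assuming its volume is mounted at the usual /data/<host>.0 path (the mount point and the 10 s timeout are assumptions):

             # a stat that hangs indicates the NFS mount is unresponsive; give up after 10 s
             timeout 10 stat /data/ipp061.0 || echo "ipp061 NFS not responding"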

Saturday : 2015.07.25

  • 06:45 EAM : ipp090 was reported down by ganglia, but it was just overloaded. I've set it to repair.
  • 10:10 EAM : I restarted gmond for ipp090 so it is visible in ganglia. I've kicked registration a couple of times. Gavin has repaired ipp078, so I am setting it to repair.

Sunday : 2015.07.26

  • 07:45 EAM : nightly processing is running a bit slowly so I'm restarting the ipp user pantasks.