PS1 IPP Czar Logs for the week 2011.10.03 - 2011.10.09

Monday : 2011.10.03

  • 02:50 odd squelching behavior again on processing when plenty of exposures available. stdscience pantasks looks to go completely empty, then reloads. looks like started ~02:20 based on ganglia past hour load plot.
  • 03:10 Mark: restarted stdscience to see if helps. fully reloaded by ~03:50.
  • 04:10 looks to be happening again. no condor running so wasn't that. whatever it is also involves ippc11, looks like iqanalysis is running. processing back to constant levels ~04:40.
  • 04:50 back again, no iqanalysis on ippc11. ssh into other systems takes 2-5x longer.
  • 05:50 looks like processing moving back to regular levels? (by 06:05 yes) this behavior not clear in the ganglia 1d span, only hour span. seems to easily reduce the chip-warp rate to ~30/hr.
  • 12:40 Mark: off'd the chip.revert until Chris turns on the missing instance file script in order to stop filling up the pantasks log.
  • 23:00 ipp029 unresponsive, ran neb-host down ipp029
  • 23:47 CZW: cycled power on ipp029 as it was preventing LAP destreak from finishing correctly, holding up further processing.

Tuesday : 2011.10.04

  • 03:00 Mark: LAP, MD03/4 fully loading processing queue, most of the oscillating load behavior seems to be gone with the ippc18-ippc19 rsync turned off. Asking Gavin/Cindy to set it to run at 6am on Wednesday to watch the behavior again then.
  • 06:45 Roy: everything down from summit (not much, only 21 science exposures)
  • 11:40 Mark: was running 1 node for stacks last night on ippc11 and seemed okay. How many pantasks stack jobs does it take to crack c11? Looks like 4 or less.

Wednesday : 2011.10.05

  • 07:20 Mark: Gavin moved the ippc18->ippc19 backup rsync to start at 6am and the oscillating behavior is seen at that time now. Hard to tell how much of a hit on processing rate with the change from nightly science to LAP processing. By 09:40 a rate decrease of 40-50% is clear. Will ask Gavin to leave rsync at this time (~6-10am) until a solution can be found.
  • 13:50 Bill set data with label like 'ps_ud%' to be cleaned up. This will free up about 1000 exposures worth of chip data and a couple of hundred warps.
  • 14:04 Bill: The slow rsync problem may be happening because the log files have grown very large and thus take a long time to copy. I made a change to move the pantasks_logs directory into logs/YYYYMM. Restarted all pantasks except for stack. Stack is stopped and I'm letting the queue empty. (Mark's MD deep stacks may take an hour or so)
  • 14:49 Decided to let stack continue. Set pantasks state to run.
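
The log move described in the 14:04 entry could be sketched roughly as below. This is only an illustration, not the script Bill actually used: the function name `archive_log_dir`, the choice to recreate an empty log directory afterwards, and the paths in the usage note are all assumptions.

```python
import shutil
from datetime import date
from pathlib import Path


def archive_log_dir(log_dir: Path, archive_root: Path, today: date) -> Path:
    """Move log_dir under archive_root/YYYYMM and leave an empty log_dir behind.

    Hypothetical helper: the real rotation on the cluster may differ.
    """
    # Build the monthly archive directory, e.g. logs/201110.
    month_dir = archive_root / today.strftime("%Y%m")
    month_dir.mkdir(parents=True, exist_ok=True)

    destination = month_dir / log_dir.name
    if destination.exists():
        # Refuse to merge into an existing archive of the same name.
        raise FileExistsError(f"{destination} already exists")

    # Move the whole directory, then recreate an empty one so that
    # restarted pantasks servers have somewhere to write new logs.
    shutil.move(str(log_dir), str(destination))
    log_dir.mkdir()
    return destination
```

For example, `archive_log_dir(Path("pantasks_logs"), Path("logs"), date(2011, 10, 5))` would move the directory to `logs/201110/pantasks_logs` (both paths are placeholders). The pantasks servers would presumably need to be stopped or restarted around the move, as was done in the entry above.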

Thursday : 2011.10.06

  • 08:00 re-nice'ing rsync ippc18->ippc19 not helping much. Will look at a bandwidth limit along with Bill's log rotation of the large log files Friday morning.

Bill is czar today

  • 10:00 We have 2 repeat faults. 5 warp skycells failed due to a corrupt camera mask file. Rerunning cam_id 288973. The other is a stack failure due to a corrupt warp. Unfortunately the chipRun has been cleaned already. Set the chip run 308628 to update then will rerun the warp.
  • ~10:05 something was started that heavily loaded ippc18 (NFS?) and tanked processing for ~10min
  • 15:34 Restarted stdscience and distribution
  • 22:00 Mark: ipp029 unresponsive, set neb-host down but don't have access to reboot.

Friday : 2011.10.07

  • 07:40 rsync ippc18->ippc19 now quick (~20 mins) with Bill's rotation of the logs. A 10MB/s bandwidth limit was also added to help smooth out any possible large transfers in the future. Leaving the start time at 6am to watch if it becomes a problem again.
  • 10:45 Serge: I'm investigating why ippdb02 cannot keep up with the replication (~140000 seconds late).
  • 17:57 CZW: I've merged in the new ippTasks/ and restarted the summit copy pantasks. This update should allow stare data to be put on specific (very empty) hosts, and not clutter up the small buffers we have on the other hosts. I did not merge in the registration fix that prevents stare exposures from being burntooled, as that is a more extensive set of changes (and one set of Friday evening changes was enough).
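
The 10MB/s limit mentioned in the 07:40 entry corresponds to rsync's `--bwlimit` option, which by default takes a rate in units of 1024 bytes per second. A minimal sketch of how such a command line might be assembled follows. The helper name `build_rsync_command`, the `-a` flag, the optional `nice` wrapper, and the paths are assumptions for illustration; the actual backup job's options are not recorded in this log.

```python
import shlex


def build_rsync_command(source, destination, limit_mb_per_s=10, niceness=None):
    """Return an argument list for a bandwidth-limited rsync.

    Hypothetical helper: only --bwlimit is taken from the log entry.
    """
    # --bwlimit expects units of 1024 bytes per second, so 10 MB/s -> 10240.
    rate = int(limit_mb_per_s * 1024)
    cmd = ["rsync", "-a", f"--bwlimit={rate}", source, destination]
    if niceness is not None:
        # Optionally lower CPU priority as well (cf. the re-nice attempt on Thursday).
        cmd = ["nice", "-n", str(niceness)] + cmd
    return cmd


if __name__ == "__main__":
    # Placeholder paths; the real ippc18 -> ippc19 source and target are not given here.
    print(shlex.join(build_rsync_command("/path/to/src/", "ippc19:/path/to/dst/")))
```

The function only builds the command; it does not run anything, so it is safe to try on any machine before wiring it into a cron job.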

Saturday : 2011.10.08

  • 12:30 ipp029 unresponsive, set neb-host down. no access to check/reboot
  • 13:50 restarting stdscience while things seem stuck
  • 16:30 ipp029 unresponsive again after ~30min. adding to ignore_wave2 set and turning off in respective pantasks. but needs someone with access to reboot
  • 21:15 registration stalled. restarted registration pantasks and still stalled. looks like running again.
  • 22:00 registration stalled for a while after 10 more exposures, ran on
    o5843g0066o  XY45 -1 check_burntool neb://ipp028.0/gpc1/20111009/o5843g0066o/o5843g0066o.ota45.fits
    regtool -updateprocessedimfile -exp_id  403727 -class_id XY45 -set_state pending_burntool -dbname gpc1
  • 23:15 registration again
    o5843g0126o  XY45 -1 pending_burntool neb://ipp028.0/gpc1/20111009/o5843g0126o/o5843g0126o.ota45.fits
    regtool -updateprocessedimfile -exp_id 403786 -class_id XY45 -set_state pending_burntool -dbname gpc1
  • 23:20 CZW: ipp028 had a bad mount of ipp020. This caused burntool table replication to periodically freeze when it attempted to replicate there. force.umount ipp020 resolved this, and burntool appears to be running again.

Sunday : 2011.10.09

  • 06:25 Serge restarted distribution
  • 21:30 Mark: looks like summitcopy is down. restarting.
  • 22:00 ippc11 was active in distribution, did someone add it? stdscience sluggish? doing daily restart.