PS1 IPP Czar Logs for the week 2011-09-05 - 2011-09-11

Monday : 2011-09-05

  • 10:00 heather restarted registration - burntool restarted.
  • 10:52 that didn't work. Various incantations of regtool -revertstuff didn't work either (nothing to revert). Tried regpeek; it complained about XY06 o5809g0511o. Did this ( regtool -updateprocessedimfile -exp_id 387484 -class_id XY06 -set_state pending_burntool -dbname gpc1 ) and now burntool is moving. I hate burntool.
  • 11:11 some more that needed some help (discovered using regpeek and gpc1):
    • [heather@ippc18 ~ipp/registration]$ regtool -updateprocessedimfile -exp_id 387562 -class_id XY12 -set_state pending_burntool -dbname gpc1
    • [heather@ippc18 ~ipp/registration]$ regtool -updateprocessedimfile -exp_id 387605 -class_id XY65 -set_state pending_burntool -dbname gpc1
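
The three manual resets above all follow the same pattern, so a small helper can print the command lines for a list of stuck chips. This is only a sketch: the function name `burntool_reset_commands` is made up for illustration, and the regtool flags are copied verbatim from the commands logged above, not from any regtool documentation. It prints the commands and does not run them.

```python
import shlex


def burntool_reset_commands(chips, dbname="gpc1"):
    """Build one regtool command line per (exp_id, class_id) pair.

    Hypothetical helper: the flags mirror the commands logged above.
    """
    commands = []
    for exp_id, class_id in chips:
        args = [
            "regtool", "-updateprocessedimfile",
            "-exp_id", str(exp_id),
            "-class_id", class_id,
            "-set_state", "pending_burntool",
            "-dbname", dbname,
        ]
        # Quote each argument so the line can be pasted into a shell safely.
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


# The chips that needed help on 2011-09-05, as found with regpeek.
stuck = [(387484, "XY06"), (387562, "XY12"), (387605, "XY65")]
for line in burntool_reset_commands(stuck):
    print(line)
```

Review the printed lines before pasting them into a shell on the registration host.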

Tuesday : 2011-09-06

Bill is czar today

  • 08:00 users report strangeness on the data store. They are correct. Sometime around 1pm yesterday mysql stopped processing fileset registration requests. This has left a few thousand incomplete filesets. mysql log indicates that a table is damaged.
  • 08:30 stopped postage stamp pantasks and rcserver task in distribution.
  • 08:50 stopped publishing
  • 08:56 Gavin repaired the database.
  • 09:20 set publishing to run and restarted postage stamp server pantasks in order to have clear fault counts.
  • 10:14 restarted update pantasks
  • 10:36 one of the instances of neb://ipp036.0/gpc1/20110906/o5810g0282o/o5810g0282o.ota55.fits was corrupt. Saved the bad file to ~bills/o5810g0282o.ota55.fits.corrupt for investigation and copied the good one (as measured by successful funpack) on top of it.
  • 12:30 shut down and restarted distribution pantasks because pcontrol was using a full cpu and progress has been slow.
  • 14:10 stopped processing to prepare to reset the nfs mounts for ipp010 ipp017 and ipp020
  • 14:38 all pantasks set to run
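
The 10:36 entry picked the good copy by whether funpack succeeded on it. A minimal sketch of that check is below. Assumptions: the function name `first_good_copy` is hypothetical, the copies are given as ordinary on-disk paths (resolving a neb:// URI to its instances is not shown), and `funpack -S` (decompress to stdout) is used so that nothing is written next to the file.

```python
import subprocess


def first_good_copy(paths, check_cmd=("funpack", "-S")):
    """Return the first path for which the check command exits 0, else None.

    Hypothetical helper: 'funpack -S <file>' is assumed to decompress to
    stdout and to exit non-zero when the file is corrupt.
    """
    for path in paths:
        result = subprocess.run(
            [*check_cmd, path],
            stdout=subprocess.DEVNULL,  # discard the decompressed data
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            return path
    return None
```

Called with the on-disk paths of each instance, it returns the first copy that decompresses cleanly, which can then be copied over the bad one once the bad file has been saved for investigation.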

Wednesday : 2011-09-07

  • 00:30 Mark: stdscience server down ~23:45 (was experiencing network connection problems to ippc18 during that time, related?). Restarted and running again.
  • 00:45 haf: removed ippc10 from pantasks - it's having some raid complaints and it scares me.
  • 07:50 Bill restarted registration pantasks. Removed all dates except for 2011-09-07. Registration failed for exp_id = 388654 and class_id = 'xy21'. Set fault for that chip, then reverted. This started things moving again.

Thursday : 2011-09-08

  • 9:25 heather is czar. reverted fault 4 on warp and stack (see if this clears it up)
  • 9:40 heather restarted distribution (Bill noticed it was getting sluggish)
  • 15:41 CZW: Changed stdscience/input and distribution/input to include the refurbished wave 1 machines.
  • 16:35 CZW: Things that look like comments were not valid comments in the stdscience and distribution input files. Fixed, restarted these servers, and things look fine now.

Friday : 2011-09-09

  • 00:34 CZW: Restarted stdscience after pausing to try to let ippdb01 calm down. This was somewhat successful, as I was able to get LAP cleanup to run, as well as registering the majority of the exposures (check_burntool/pending_burntool bug again). Everything should be running as smoothly as possible right now, although the load on ippdb01 is still unreasonably high. Completely stopping stdscience still resulted in a load of 10-14 on ippdb01.
  • 13:55 Bill: ipp021 became unresponsive. Console message at:

Saturday : 2011-09-10

  • 12:50 Mark: on ipp020, killed 2 3PI warps (250907, 250943) that were stalling on the system because memory swap was overloaded by 2x streaksremove. Removed ipp020 from stdscience until streaksremove finishes.
  • 19:00 STS warp (251222) failing from corrupted camera mask files (ie, rerun with
    perl ~ipp/src/ipp-20110622/tools/ --redirect-output --cam_id 264154

Sunday : 2011-09-11

  • 00:30 Mark: ipp020/021 overloaded again, turned off 1 each in distribution.
  • 11:50 Bill: re-enabled update processing.
  • 11:56 Bill: distribution is slow and behind. Restarting pantasks.
  • 13:35 Bill: power cycled ipp021. It lost network connectivity. Setting to mode repair. Just a few more days left with wave 1 machines :)
  • 15:09 Bill: noticed that cleanup pantasks died. Started it up. Removed its host (ippc07) from stdscience host list to avoid memory overload by STS diffs.
  • 16:39 Bill: ppMops is exploding again. Apparently I forgot to check in the changes to ppMops to the trunk. Stopped everything and merged in the changes. Restarted.
  • 17:18 Bill: stopping the madness. ipp021 is not responding. All pantasks set to stop (except cleanup). Will check for broken nfs mounts and slowly restart things in 20 minutes
  • 18:05 Bill: apparently ipp021 being a zombie for several hours caused several clients to silently give up trying to access /export/ipp021.0. Tediously ran force.umount on the affected nodes. Set stdscience to run with LOADPOLL set to 32. Will increase it gradually. ipp021 set to repair
  • 18:50 Bill: I'm almost ready to say all clear. stdscience poll back up to 180.
  • 17:20 All Clear (except for one destreak corruption problem I will fix tomorrow)