Monday : 2012-10-29

Serge is czar

  • 06:05: Still copying data from the summit (about 60 images). Nothing to worry about.
  • 06:10: Cleaned old backups off gpc1; they were preventing the mysql slave from running and new backups from completing.
  • 08:50: Nightly processing complete
  • 15:45 MEH: stdscience needs regular restart. will start to feed in MD05 reprocessing

Tuesday : 2012-10-30

  • 00:20 MEH: made the mistake of looking at processing. Looks like ipp016 lost it around 22:30: cannot ssh into it, and the jobs (registration/stdscience) running there have stalled, but the disk seems to be available (though in repair). Will try to isolate it so processing can get going again tonight, then reboot. Ipp016-crash-20121029
    • restarted registration and stdscience
    • MD03 also removed from nightly diffs until refstack is made.
  • 11:00 MEH: the failing exposure report from OTIS turned out to be caused by whitespace added to the TXT page on the datastore when going from 9999->10000 during the night of 10/17. That made the downloaded file not a valid FITS file, hence the missing END card. Todd fixed it and cleared the Problem Notification email.
  • 11:05 Rita set neb-host ipp035 repair->up as part of the plan to recover/test disks in repair
  • 13:59 CZW: Noticed a large number of chips failing with perfectly fine input data. Tracked down the issue to ippc02 and /tmp/ being filled with the nebulous_server.log. Moved log to /export/ippc02.0/ipp/nebulous_server_20121030.log, and I'm now bzip2-ing it. I've restarted apache and reverted all LAP faults, which should resolve the issue.
  • 15:00 MEH: MD05.GR0 chip->warp+NS running alongside LAP
  • 16:05 MEH: stdscience regular restart
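
The missing END card in the 11:00 entry is the kind of damage a quick header scan can catch on a downloaded file. Below is a minimal sketch in Python, assuming only the standard FITS layout (80-byte cards packed into 2880-byte blocks, with the first card starting with SIMPLE). The function name `has_end_card` and the `max_blocks` limit are hypothetical; this is not the check that OTIS or the IPP actually performs.

```python
BLOCK = 2880  # FITS files are built from 2880-byte blocks
CARD = 80     # each header card occupies 80 bytes


def has_end_card(data, max_blocks=100):
    """Return True if data starts with SIMPLE and its primary header has an END card.

    max_blocks is a hypothetical safety limit on how many header blocks to scan.
    """
    if not data.startswith(b"SIMPLE  ="):
        return False  # any stray leading byte shifts every card and fails here
    for b in range(max_blocks):
        block = data[b * BLOCK:(b + 1) * BLOCK]
        if len(block) < BLOCK:
            return False  # ran out of complete blocks before finding END
        for i in range(0, BLOCK, CARD):
            # The END card is the keyword END followed only by spaces
            if block[i:i + CARD].rstrip(b" ") == b"END":
                return True
    return False
```

To try it on a file, read the bytes with `open(path, "rb").read()` and pass them in. A file with extra whitespace in front of the header fails the SIMPLE test, because every card is shifted off its 80-byte boundary.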

Wednesday : 2012-10-31

  • 09:35 (Serge): Publishing last night's data to IPP-MOPS-DEV datastore
  • 09:56 (Bill): Postage stamp server is busy. Added another set of hosts.pstamp to pantasks. Requests from MOPS, MOPS.2, and MPIA. MOPS is the high-priority queue, so I raised its priority to 497; the others are at 495. The latter two labels' requests require update processing.
  • 10:03 (Bill): Restarted pstamp pantasks. Turned off pstamp dependency checking in order to increase the number of jobs actually making stamps. Need to remember to turn it back on later: pstamp.dependent.on
  • 10:20 (Bill): Restarted update pantasks but set it to state stop. In pstamp, used hosts.update to add several more hosts.
  • 10:30 MEH: stdscience needing its regular restart. MD03.refstack runs added now as well.
  • 11:49 (Bill): Set update to run with poll limit 8, and pstamp dependency checking back on with poll limit 16
  • 12:40 MOPS pstamp jobs are done. restarted pstamp server without the hosts.update
  • 13:08 restarted cleanup pantasks because there were many errors. Set all warp, chip, and diffRuns in state error_cleaned back to goto_cleaned
  • 13:10 Haydn set neb-host ipp059 repair->up as part of the plan to recover/test disks in repair

Thursday : 2012-11-01

  • 14:25 Rita set neb-host ipp045 repair->up as part of the plan to recover/test disks in repair
  • 14:30 MEH: stdscience half loaded, time for regular restart.
  • 15:50 MEH: wiki user asteroidnerd=Alan Fitzsimmons (a.fitzsimmons@…)
  • 17:25 MEH: ippc07 has given up with a GPF (Ippc07-crash-20121101), so power cycled.
    • seems to have caused nebulous to hang up as well -- shutting all pantasks down
    • processlist seems to be slowly clearing out on ippdb00
  • 18:15 CZW: pantasks death orphaned a large number of ppImage and pswarp processes that were hanging (probably waiting on a nebulous request to return?). I manually killed these jobs (forcing the script to fault the job), which seems to have helped clear the nebulous database processlist.
  • 19:25 MEH: ippdb00 only has sleeping processes (~425) and nebulous is responsive again, starting things up slowly
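
The 18:15 cleanup can be approximated by listing ppImage and pswarp processes whose parent has gone away. Below is a rough sketch in Python, assuming orphans are reparented to PID 1 (typical on Linux, but not guaranteed) and that the input is the text printed by `ps -eo pid,ppid,comm`. The function name `find_orphans` is hypothetical, and this is not the procedure CZW actually used.

```python
def find_orphans(ps_output, names=("ppImage", "pswarp")):
    """Return PIDs of processes named in `names` whose parent PID is 1.

    ps_output is assumed to be the text of `ps -eo pid,ppid,comm`.
    """
    orphans = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        # Skip the header row and anything that is not "pid ppid comm"
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        pid, ppid, comm = fields
        if ppid == "1" and comm.strip() in names:
            orphans.append(int(pid))
    return orphans
```

The text can be captured with `subprocess.run(["ps", "-eo", "pid,ppid,comm"], capture_output=True, text=True).stdout`. The sketch only reports PIDs; whether to kill them is left to whoever is on duty.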

Friday : 2012-11-02

  • 09:35 (Serge): Queued warp-stack diffs for last night's ESS data for MOPS
  • 11:30 MEH: MD03.refstack chip->warp finished, feeding in MD05.GR0.20121030 at same prio as LAP
  • 14:24 Bill doubled the number of hosts in pstamp pantasks (hosts.pstamp) and reduced the queue depth for dependency checking: set.dependent.poll 8 (default is 64)
  • 16:45 MEH: stdscience needed regular restart. Taking compute3 back (manually added to stdscience) and turning deepstack back on for MD03.refstack.20121101

Saturday : 2012-11-03

  • 14:55 MEH: stdscience needs regular restart
  • 23:11 Bill: Postage stamp server/update is not making progress for some reason. QUB's updates didn't get queued. Restarted pstamp and update.

Sunday : 2012-11-04

  • 14:55 MEH: stdscience underloaded, regular restart in progress