PS1 IPP Czar Logs for the week 2011.02.28 - 2011.03.06


Monday : 2011.02.28

Serge is czar.

  • Bill stopped stdscience last night and fell asleep before restarting it. (We had a burst of faults that he wanted to investigate. Revert took care of them.) Restarted stdscience about 7:30 am.
  • 14:25 Restarted replication. It was stopped Sunday night around 20:37. Note that stdscience crashed around 8:30 pm as well. Its log shows many errors (becoming more frequent from 20:05 until the crash) like:
    500 Can't connect to ippdb00.ifa.hawaii.edu:80 (connect: timeout) at /home/panstarrs/ipp/psconfig//ipp-20110218.lin64/lib/Nebulous/Client.pm line 852
    
    I checked the various logs (apache, mysql, /var/log/messages) but didn't see anything interesting in them. (A quick way to tally these timeouts over time is sketched below.)
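    A quick way to see how these timeouts ramped up toward the crash is to count them per minute in the stdscience log. This is only a sketch: the default log path and the timestamp format in the regex are assumptions, so adjust them to the actual pantasks output.

        #!/usr/bin/env perl
        # Tally Nebulous connection-timeout errors per minute (sketch only).
        # The default log path and the timestamp format are assumptions.
        use strict;
        use warnings;

        my $log = shift @ARGV || 'pantasks.stdscience.log';   # hypothetical path
        my %per_minute;

        open my $fh, '<', $log or die "cannot open $log: $!";
        while (<$fh>) {
            next unless /Can't connect to ippdb00\.ifa\.hawaii\.edu:80/;
            # assume each line carries a timestamp like 2011-02-27 20:05:31
            $per_minute{$1}++ if /(\d{4}-\d{2}-\d{2} \d{2}:\d{2})/;
        }
        close $fh;

        printf "%s  %3d\n", $_, $per_minute{$_} for sort keys %per_minute;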

Tuesday : 2011.03.01

Bill is czar.

  • about 4 am queued another set of ThreePi.136 diffs. As advertised, the script takes a long time.
  • 7:30 running low on outstanding diffs. Ran the script again.
  • 7:50 many faults. Log shows ippc20 can't access files from ipp037. force.umount fixed that problem. Reverted faults.
  • 10:30 EAM : removed ippc00 - ippc04 from stdscience to use for static sky photometry
  • 11:21 heather set chip/warp/stack of MD07.trp and MD06.jtrp to goto_cleaned, and added MD08.jtrp to ipp/stdscience.
  • 11:21 heather started the diff queue script for ThreePi.136.
  • 12:30 EAM : ippc02 hung on over-full memory. I've rebooted it.
  • 16:23 heather ran the diff queue script again.
  • 21:36 several skycells from 168757 failed repeatedly due to corrupt camera mask files. Fixed with tools/runcamersexp.pl --cam_id 180584
  • 21:41 bill too has been running the diff queue script periodically. At this point it seems like it needs to be running all of the time to keep diffs going. ~220 exposures left to get through warp. They should be done by midnight. Warp has been falling behind recently. Perhaps it is all of the processing going on on ippc15... I'm setting that host to off in stdscience to see if it improves things. If that doesn't work I will restart distribution.
  • 21:52 Just noticed that ippc04 died 75 minutes ago. Nothing at all on the console. Cycled power.
  • 22:04 restarted distribution with 2 x normal default hosts
  • ... and then noticed that everything was falling over because ippdb00:/ was out of space. Fixed by moving /tmp/nebulous_server.log to /export/ippdb00.0/ and restarting apache. (A simple free-space check is sketched below.)
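    A trivial free-space check on ippdb00 would have flagged this before apache fell over. The sketch below is not anything the pipeline actually runs; the 90% threshold and the df parsing are assumptions.

        #!/usr/bin/env perl
        # Warn when / (where /tmp/nebulous_server.log lives) is nearly full.
        # Sketch only: the 90% threshold is an arbitrary assumption.
        use strict;
        use warnings;

        my $threshold = 90;
        my ($line) = grep { m{\s/\s*$} } `df -Pk /`;
        die "could not parse df output\n" unless defined $line;
        my ($used) = $line =~ /(\d+)%/;

        if (defined $used && $used >= $threshold) {
            warn "root filesystem is ${used}% full -- move nebulous_server.log off /\n";
        }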

Wednesday : 2011.03.02

Bill is czar.

  • 01:45 noticed that somebody queued a whole bunch of MD08.jtrp processing, which is delaying the completion of ThreePi.136. The label has been deep-sixed for the time being.
  • Shut everything down this afternoon, except for pstamp, for the nebulous database optimization. The PS request parser has been set to fault any job that requires update processing. MOPS is no longer submitting requests.

Thursday : 2011.03.03

Gene is czar.

Friday : 2011.03.04

Gene is czar, but Bill has been driving.

  • 09:00 Restarted cleanup and update.
  • 09:10 queued warps from ThreePi.20110126 and ThreePi.20110203, plus all data updated for postage stamp requests, for cleanup.
  • 09:25 restarted distribution, queued some STS.2010.raw for destreaking
  • 13:26 restarted stdscience
  • 13:55 Added ThreePi.136 to survey magic and destreak.
  • 14:00 as expected nightlyscience queued today's data for cleanup.
  • 14:49 all pantasks have been restarted except for addstar.
  • 18:04 heather has a stdsci instance on c14 for magic testing. Label = magictest.3Pi.20110302.

Saturday : 2011.03.05

  • 06:00 good morning. Well, not for Bill. Czartool shows lots of red, and MOPS's endless stream of postage stamp requests that require update processing is falling over. This may all be due to ipp005 becoming unresponsive. Upon further review it seems to have panicked but not fully died. It still shows up in ganglia.
    [753534.365854] Kernel panic - not syncing: Attempted to kill init!
  • ipp005 going down left a bunch of warp log.update files with storage objects but no instances. Changed warp_skycell.pl to delete the existing log.update file (a sketch of the idea follows this list).
  • 15:08 Decided to disable the STS.2010.raw label. It is slowing down the ThreePi.136 destreaking, which has the higher priority since, once it is done, we can run cleanup to free up a lot of space. Since they are working on different stages, the priority ordering doesn't completely order things the way that we want.
  • 15:28 doubled number of hosts in distribution
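    The warp_skycell.pl change itself is not reproduced here; the sketch below only illustrates the idea of clearing a stale log.update before reprocessing so a fresh copy gets registered. Treating the log.update as a plain local path is a simplification for illustration.

        #!/usr/bin/env perl
        # Illustration only, not the actual warp_skycell.pl change: remove a
        # stale log.update so the rerun writes a fresh one. Treats the file as
        # a plain local path rather than a nebulous-resolved instance.
        use strict;
        use warnings;

        my $log_update = shift @ARGV
            or die "usage: $0 <path to skycell log.update>\n";

        if (-e $log_update) {
            unlink $log_update
                or die "could not remove stale $log_update: $!\n";
            print "removed stale $log_update\n";
        }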

Sunday : 2011.03.06