PS1 IPP Czar Logs for the week 2011.04.18 - 2011.04.25


Monday : 2011-04-18

  • 10:36 Bill queued stacks for STS.refstack.20110418
  • 11:08 CZW: Noticed that burntool crashed last night, leaving an exposure in state check_burntool instead of processing properly. Reset the state to pending_burntool, and registration started up again and finished. The registration logfile (neb://ipp047.0/gpc1/20110418/o5669g0493o.326098/o5669g0493o.326098.reg.ota67.log) suggests a database error.
  • 11:16 Bill has queued 111 STS exposures for chip-warp processing.
  • 13:27 CZW: Accidentally ran pantasks with doomsday-debug stack information enabled. Trying to kill them all now.
  • 14:43 Bill set STS.2010.a label to inactive to give last night's warps more power to finish.
  • 16:30 Bill set STS.2010.a to active.
  • 20:48 Bill restarted the update pantasks because the large number of timeouts was making the status listing hard to read.

Tuesday : 2011-04-19

Serge is czar (but there were no observations last night)

Wednesday : 2011-04-20

Serge is czar

  • 13:55 Distribution seemed sluggish, so Bill restarted it.
  • 14:18 CZW: Concern about disk space prompted me to look at the replication pantasks to see what shuffle was doing. The replication pantasks was not doing anything, because it had failed to load nebulous.site.pro as part of the setup macro. It appears that this file was not transferred over between tag changes, so I've copied the version from /data/ippc18.0/home/ipp/psconfig/ipp-20110218.lin64/share/pantasks/modules into the working tag (see the sketch below). This has unstuck the shuffle task.
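
For reference, a minimal sketch of the copy described above. The destination is shown as $WORKING_TAG because the log does not record the working tag's path; only cp is assumed.

    # Copy the missing pantasks module from the old tag into the working tag.
    # $WORKING_TAG is a placeholder for the current psconfig tag directory.
    cp /data/ippc18.0/home/ipp/psconfig/ipp-20110218.lin64/share/pantasks/modules/nebulous.site.pro \
       $WORKING_TAG/share/pantasks/modules/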

Thursday : 2011-04-21

  • 02:06 CZW: Noticed registration was lagging. register_imfile complained about a read-only filesystem: neb://ipp005.0/gpc1/20110421/o5672g0156o/o5672g0156o.ota02.burn.log . Resetting the data_state for that imfile seems to have cleared subsequent exposures, but o5672g0156o seems to be stuck and unable to register correctly. I'll look at it tomorrow.
  • 02:19 CZW: had to do that again for o5672g0240o. Looks like ipp005 is the problem. My fix command was "regtool -updateprocessedimfile -exp_id 327030 -class_id XY01 -set_state pending_burntool", so if this keeps up and is an issue in the morning, the czar can kick things appropriately.
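
For the czar's reference, a sketch of the revert mentioned above. The exp_id and class_id must be looked up for whichever imfile is stuck; the values below are simply the ones quoted in the 02:19 entry.

    # Push a stuck imfile back to pending_burntool so registration retries it.
    # exp_id / class_id identify the stuck imfile (values from the 02:19 entry).
    regtool -updateprocessedimfile -exp_id 327030 -class_id XY01 -set_state pending_burntool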

Bill is acting czar today.

  • 06:20 warp was stuck with a book full of entries in state DONE. warp.reset seems to have cleared the problem. pcontrol is using a whole CPU, which is often a sign of trouble. I've turned camera and stack off. Once the stacks that are running finish, I will restart stdscience.
  • 06:35 stdscience pantasks restarted.
  • 06:54 The ipp005 console said "[761674.464312] Kernel panic - not syncing: Attempted to kill init!" so I power cycled it.
  • 07:05 Many diff failures. Ran difftool -revertdiffskyfile and then set diff.revert.off to investigate. ipp005 is back up.
  • 09:45 The diff failures were due to ipp005 being unavailable. It looks like we may have some MD stack files that are not replicated.
  • 11:40 Tested Gene's fix to a ppSub psphot problem. It worked. Stopped processing to rebuild with updated psModules and psastro.
  • 12:00 Set the pantasks back to run.
  • 12:16 All of last night's data is through magic streak detection.
  • 13:27 Started reprocessing of STS with new psastro. Label STS.2010.b. Queued STS.2010.a to be cleaned.
  • 16:30 STS camera runs are falling behind. Ran set.camera.poll 20 to allow more to run at a time (see the pantasks sketch after this list).
  • 18:30 Experimenting: chip.off
  • 18:45 Experiment over: chip.on
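
For reference, the pantasks console commands recorded during the day, collected in one place. The commands are only those quoted in the entries above (typed at the stdscience pantasks prompt); the annotations are ours.

    warp.reset           # clear the warp book stuck full of DONE entries (06:20)
    diff.revert.off      # stop auto-reverting diff faults while investigating (07:05)
    set.camera.poll 20   # let more camera runs be polled at a time (16:30)
    chip.off             # pause chip processing for the experiment (18:30)
    chip.on              # resume chip processing (18:45)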

Friday : 2011-04-22

  • 10:30 STS reference stacks finished. Queued a set of test exposures for processing. Set up diffs and magic for the template exposures.
  • 14:10 Many faults. Looking at stdscience/pantasks.stdout.log, ippc27 is having trouble with NFS access to ipp037. force.umount fixed it (sketched after this list).
  • 21:35 Set ps_ud% runs to goto_cleaned.
  • 21:44 restarted distribution pantasks. (pcontrol was pegged at 100% cpu)
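
A sketch of the NFS recovery used at 14:10, assuming the same site-local force.umount script quoted elsewhere in this log (/usr/local/sbin/force.umount); host names are the ones from the entry above.

    # On the client reporting stale NFS errors (ippc27), confirm the wedged mount
    # and then force-unmount the unresponsive server (ipp037).
    mount | grep ipp037
    sudo /usr/local/sbin/force.umount ipp037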

Saturday : 2011-04-23

  • 03:00 There is a bit of red on the board this morning. Fixed a couple of NFS errors with force.umount. Fixed a corrupt file from --warp_id 185212 --skycell_id skycell.042.
  • 03:22 Two repeat diff failures. Leaving diff.revert.off for now.
  • 06:50 More red. sudo /usr/local/sbin/force.umount ipp010 solved some of the problems. Ran difftool -revertdiffskyfile.
  • 07:00 The ipp010 problem caused summit copy to get stuck on one exposure. Restarted summit copy to fix the fault counts, which were difficult to read due to many timeouts.
  • 08:00 Set label STS.2010.b to inactive so that it doesn't interfere with nightlyscience processing.
  • 14:25 Set label STS.2010.b to active. Queued the rest of the STS data from 2010 June 18 - 21 for re-processing.
  • 16:42 Noticed alert messages about degraded RAID sets on ipp018 and ipp011. Set them to repair mode.
  • 21:43 Set STS.2010.b to inactive to ensure that it does not affect stdscience processing.

Sunday : 2011-04-24

  • 03:16 We are periodically getting many database timeouts from ippdb01/gpc1. The cluster network load drops very low during these periods. The CPU load listed by top for the mysqld process on ippdb01 is over 600% (see the sketch at the end of this list).
  • 07:45 Space is getting low. Queued various labels for cleanup. The biggest were ThreePi.rerun and CNP.refstack.20110317.
  • 08:00 ipp020 is low on space and has lost a disk. Set it to repair mode. It looks like ipp011 and ipp018 have finished rebuilding.
  • 08:25 czartool reports that the replication (08:24) and distribution (08:00) pantasks died. Restarted distribution. I'm not sure what the status is for replication, so I am not restarting it at this time.
  • 14:48 CZW: restarted replication.
  • 21:00 Set the priority of STS.2010.b higher than nightlyscience so that the exposures queued so far can finish processing and be posted to the distribution server soon. Then I can queue the label for cleanup.
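
A minimal sketch of the database check behind the 03:16 entry, assuming ssh access to ippdb01; it just reproduces the top observation noted in that entry.

    # Snapshot load on the gpc1 database host; the mysqld line was showing >600% CPU.
    ssh ippdb01 'top -b -n 1 | grep -E "^top|mysqld"'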