Monday : 2012.10.01

  • 10:05 MEH: restarting stdscience for regular restart, seems 1 nightlyscience is stuck.
  • 10:25 fixing stalled file in registration
    • ipp016 has lost some of its mounts now; removing it from processing. May need a reboot.
  • 20:30 (bills) ipp016 had several stdscience jobs in limbo. Killed them all. Stopped stdscience so that we can reset the pending books. I guess ipp016 was diagnosed as a door stop but not rebooted :(
    • MEH: no, may be a doorstop. was taken out of processing, so interesting how new jobs got on it...
  • 20:40 rebooted ipp016; set to down in nebulous
  • 20:45 stdscience restarted. ipp016 set to repair.

Tuesday : 2012.10.02

  • heather was ghost czar (whoever czarred for her, thank you!)

Wednesday : 2012.10.03

  • heather is czar
  • 10:10 (haf) set neb-host down for ipp013 and ipp046, and removed them from pantasks so that haydn can repair those machines

Thursday : 2012.10.04

Bill is czar today

  • 05:40 ipp055 and ipp059 have stuck jobs. stdscience is in severe pcontrol spin. Stopping for restart
  • 05:50 The jobs were in a very strange state. The perl scripts had done all of the work and had updated the database but the scripts were not exiting. Perhaps they got stuck by the "sync" that they do at the end. Killed them off.
  • 05:58 restarted stdscience
  • 06:15 There are LAP jobs that are faulting because they need data on ipp013 or ipp046. They were set to down in nebulous but appear OK. Set them to repair
  • summit copy is a bit behind because we are getting read timeouts from conductor.
  • lots of red on czartool. Many jobs are having replication errors. ipp066 seems to be having problems.
  • 07:45 bursts of faults from various nodes. Set the LAP label to inactive in order to get nightly science finished
  • 08:40 summit copy has 51 exposures to finish. Nothing out of the ordinary in summit copy pantasks. Setting LAP label back to active.
  • 09:15 a rawImfile got stuck in burntool (ipp066). Fixed it up and now burntool is proceeding
  • 09:23 24 incomplete downloads; 98 exposures copied but not registered. 10 of those are darks.
  • 09:44 Set ipp013 and ipp046 to up in nebulous
  • 10:00 somehow camRun 586369 got set to state full without leaving an smf file behind. The associated warp was very confused. Dropped warp_id 567243 and set the camRun back to new.
  • 10:48 Nightly science chip processing has finished. Warp is way behind because ipp046 and ipp013 were not available last night and they had data on them. Turning chip processing off for a while to let the warps catch up.
  • 11:06 Two stacks were running on ippc25. One using 30 GB of virtual space and the other 50 GB. On a 24 GB system this isn't going to fly.
  • 11:45 stack_id 1461976 set to drop. We've tried several times, but it eventually blows up virtual memory. Filed ticket 1461976. Reverted the other one, 1528228. It's running on stare03. I'll keep an eye on it. Update: it finished during lunch.
  • 13:42 cleanup pantasks set to run
  • 16:20 Turned chip processing back on
  • 21:40 set cleanup to stop for the night

Friday : 2012.10.05

  • 07:30 A quiet night for once! Turning cleanup back on.
  • 09:30 Set o5425g0167o.ota47.fits to be ignored. All instances have been lost.
  • 09:30 Set stdscience to stop in preparation for restart.
  • 09:47 stdscience restarted
  • 09:53 recompiled Ohana with a change to LoadPhotcodesFITS to increase the timeout from 10 seconds to 60. Hopefully this will reduce the number of camera stage failures.
  • 10:31 restarted all pantasks_servers (recompiling Ohana with pantasks running was not a good idea.)

Saturday : 2012.10.06

  • 07:10 Bill: Another smooth night. Restarted stdscience pantasks.
  • 07:29 Turning chip processing off to let the warps have the processors for a while. Feel free to turn it back on if I forget ;)
  • 13:26 chip.on
  • 18:20 ipp013 had a kernel OOPS and went into zombie mode.
  • 19:00 ipp013 rebooted. stdscience pantasks restarted

Sunday : 2012.10.07

  • 06:40 Bill: registration has a job that got stuck on ipp020. It has an nfs deadlock with ipp042. Trying to repair
  • 07:20 rebooted ipp020 and restarted the registration pantasks. Faulted, then reverted, the rawImfile that got stuck. This got burntool going.
  • 07:31 The database replication is having problems and so the czartool outputs are out of date.
  • Here is the current status of last night's exposures:
    • Summit copy and registration have finished. 494 non qtfocus object exposures
    • chip processing is complete
    • 13 camRuns running
    • 16 warps running
  • 08:15 stdscience restarted