PS1 IPP Czar Logs for the week 2015.04.27 - 2015.05.03


Monday : 2015.04.27

  • 8:12 Bill restarted postage stamp pantasks. Increased the poll limit for dependency checking to 160 to speed up the image updates.
  • 11:00 EAM: stopped ~ipp pantasks for Bill's pstamp upgrade
  • 13:10 EAM: restarted ~ipp pantasks post-Bill's pstamp upgrade
  • 17:25 EAM: Chris has generated separate left and right queues for pv3 full-force (left is running from 11h west, right is continuing from about 1h going east). I've moved the pv3 full-force processing to ~ippsky/pv3ffrt and ~ippsky/pv3fflt. I've also added a new script that checks the nightly queue, which lives in ~ipp/ ~ipp/stdscience; both fforce pantasks execute this script. The fforce pantasks turn off storage nodes if the outstanding nightly jobs exceed 200, and shut off compute nodes if they exceed 1000. Finally, I've updated the ippMonitor code to know about fullforce.
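
The throttling rule in the 17:25 entry can be sketched roughly as below. This is not the actual check script (which is not reproduced in this log); the function and constant names are hypothetical, and only the two thresholds (200 and 1000 outstanding nightly jobs) come from the entry itself.

```python
# Hypothetical sketch of the nightly-queue throttle described in the log.
# Only the thresholds are taken from the log entry; all names are illustrative.
STORAGE_OFF_THRESHOLD = 200    # storage nodes go off above this many nightly jobs
COMPUTE_OFF_THRESHOLD = 1000   # compute nodes go off above this many nightly jobs


def fforce_node_state(outstanding_nightly_jobs):
    """Return which node classes the fforce pantasks would keep switched on."""
    return {
        "storage": outstanding_nightly_jobs <= STORAGE_OFF_THRESHOLD,
        "compute": outstanding_nightly_jobs <= COMPUTE_OFF_THRESHOLD,
    }


if __name__ == "__main__":
    # Show the decision on either side of each threshold.
    for jobs in (50, 201, 1001):
        print(jobs, fforce_node_state(jobs))
```

Under this reading, the storage nodes are given back to nightly processing first, and the compute nodes are only released once the nightly backlog is much larger.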

Tuesday : 2015.04.28

  • 08:20 EAM: we are having stsci problems again. Last night, stsci06 crashed (it took Gavin's intervention to reboot, as the console was mis-mapped). This morning, stsci15 crashed around 3am; after I booted it, stsci17 crashed within 30 min (around 6:00).
  • 11:20 MEH: ipps08 non-responsive on console, power cycle --
    • ipps00 has also been behaving strangely and rebooting
  • 13:00 MEH: ipp063 disk warnings, put in neb-host repair
  • 17:45 CZW: restarting ippsky pv3ffrt/pv3fflt.

Wednesday : 2015.04.29

  • 10:46 MEH: ~ipp/raidstatus/ipp063 reports raid ok after disk fail yesterday. neb-host up again

Thursday : 2015.04.30

  • 13:30 CZW: Restart ffUHcray pantasks to ensure the rsync flood on the Cray was completely stopped.
  • 16:00 CZW: I've added ippx016 and ippx081 to the ffUHcray pantasks to help with the ff summary jobs. The input script will add them 4x by default, and I've manually added them another 4x. This 8x may end up the default if no one else runs on those nodes. Looking at the number of ffRuns in state 'new' with an entry in ffResult, we seem to have a backlog of about 11k right now. This may be suppressing the observed rate, as we're not limited by the normal steps but by the summary step.

Friday : 2015.05.01

  • 11:00 EAM: rebooting stsci15 which apparently just crashed
  • 15:00 CZW: restarting the ipp user pantasks servers so they are fresh for the weekend. Gene mentioned that he has new throttling logic in place that should balance stdscience with the two ippsky/pv3ff{rt|lt} servers, but that this isn't fully tested.
  • 17:00 CZW: restarting the ippsky/pv3ff{rt|lt} servers.
  • 17:22 CZW: I'm going to attempt to revert ff runs that have a fault that I suspect makes them likely to succeed on a second attempt (fault=1,2,3).

Saturday : 2015.05.02

  • 10:00 CZW: Stopped ipp, ipplanl, ippsky pantasks due to ippc18 and ippc19 raid issues. Mark and Heather have stopped the ippmd and ippdvo processes as well.

Sunday : 2015.05.03

  • 16:45 EAM : replaced links /home/panstarrs and /home/kiawe to point at /data/ippc19.0/home on all MRTC-B machines (ipp004 - ipp097 ex ipp022, ipp051, ipp096; ippc01 - ippc64; stsci00 - stsci19; stare00 - stare05; ipps00 - ipps15; ippx001 - ippx103 ex ippx071; ippdb00 - ippdb08 ex ippdb04 and ippdb07).
  • 16:55 EAM : started ipp pantasks, roboczar, czarpoll. Started ippsky pantasks (pv3ffrt & pv3fflt), but I'm leaving the storage and compute nodes off on those pantasks.

(HAF: what follows is cut and pasted from the czarlog entry I accidentally put under 0504; I was off by a day for some reason.)

(HAF - 21:00 and onwards)

We are starting this day off with a bang! ippc18 / ippc19 had multiple drive failures - no processing (or summit / reg) was running for last night's data, and it was restarted for tonight (by Gene?), but with a few problems:

  • summitcopy is downloading the data from yesterday still
  • registration wasn't burntooling, I had to manually add the date in:
    • 2015-05-03 gpc1 14
  • luckily the weather is really questionable, so it's currently stalled at 70 exposures for tonight.
  • stdsci needed it as well:
    • 2015-05-03 - I also added 2015-05-02 (just in case)
  • reg stuck, fixed {{{regtool -updateprocessedimfile -exp_id 908547 -class_id XY47 -set_state pending_burntool -dbname gpc1