PS1 IPP Czar Logs for the week 2014.02.10 - 2014.02.16

Monday : 2014-02-10

  • 09:40 Heather stopping loaders
  • 09:45 MEH: LAP label out of stdsci to try and catch up on last night's data. ThreePi.WS.nightlyscience label out as well
  • 09:50 MEH: still seeing loader machines having trouble; putting them in neb-host repair and taking them out of processing since they hold up summitcopy + registration -- ipp043, ipp048--062. Many can probably be brought back up one at a time during LAP once things settle down
  • 10:35 MEH: MOPS stamps are showing up; adding c3 to pstamp to help speed those along (should run fine alongside the other normal processing jobs there, no ghost jobs...)
  • 10:50 MEH: turning on hosts in summitcopy that are normally off to try to boost the downloads, since many wave4 hosts had to be turned off in the fallout from the loaders. Will want those turned back off afterwards unless monitored, as they are untested for reliability.
  • 11:15 CZW: A check of the summitcopy pantasks.stdout.log shows that all failures this morning are "Nebulous::Client::stat - unhandled fault - database error: no storage object found" for ota42. This points to an issue with ipp017, so I have set that host to repair in nebulous. (A sketch of this kind of check is at the end of today's entries.)
  • 11:30 MEH: ipp054 is in a bad state, probably needs to be rebooted -- so out of processing (wave4_ignore) and in neb-host repair
  • 16:45 MEH: turning staticsky off on the stare nodes so the psps loaders can start using them again; will take a couple of hours to clear
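    A minimal sketch of the 11:15 log check, tallying the unhandled-fault errors by OTA to see which one dominates. Not from the log itself: the path under ~ipp/summitcopy and the assumption that the OTA id appears in each logged failure line are mine.
      grep 'unhandled fault' ~ipp/summitcopy/pantasks.stdout.log \
          | grep -o 'ota[0-9][0-9]' | sort | uniq -c | sort -rn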

Tuesday : 2014-02-11

  • 14:00 Bill: removed the right ascension limit on staticsky so that all pending jobs will run. Once this set clears out we'll start queueing new ones, and when I do I'll order them by slices.
  • 15:00 CZW: restarted all pantasks for the daily restart. This should pull the new pantasks.pro code that properly removes CRASH-ed job pages.

Wednesday : 2014-02-12

  • 12:20 CZW: I just noticed that we have a very large number of chips/warps in the processing queue. There are now 274 lapRuns active, up significantly from the ~75 a few days ago. It appears that the cleanup task is timing out for the one run still in 'full' state, and when it times out, it doesn't correctly clear out the full run, leading to many more lapRuns being launched than expected. I've pointed ~ipp/lap/current.queue at the empty off.queue, which will prevent new runs from being launched, so we can taper back down to ~75 (see the sketch at the end of today's entries). I'll manually clear the full run and figure out how long the timeout limit should be.
  • 14:40 CZW: A number of pantasks servers are not running. I'm stopping the remaining ones for a fresh restart.
  • 15:20 CZW: Back up and running. The job rate was remarkably stable over the past day or so. The current hypothesis is that the slowdowns we've been seeing in the past after ~100k jobs may have been due to crash/timeout pages filling up the book -- a 1/1000 failure rate is enough to leave ~100 stale pages at that point. With the fix for that a few days ago, we may not see slowdowns anymore. This is something to keep an eye on.
  • 16:45 CZW: The last orphan ippsky jobs have finished, so I have restarted that pantasks as well.
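    The queue swap in the 12:20 entry is presumably a symlink repoint; a minimal sketch, assuming ~ipp/lap/current.queue is a symlink and off.queue sits alongside it (only those two names come from the entry):
      # point the LAP queue at the empty list so no new lapRuns are launched
      ln -sfn off.queue ~ipp/lap/current.queue
    Resuming is the same command pointed back at the normal queue file once the active runs taper down toward ~75.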

Thursday : 2014-02-13

  • 09:45 EAM : the nebulous apache servers have filled up their tmpdirs with giant log files. I'm moving them off to their /export tmp areas, but we should use log rotation to move these out of the way (see the logrotate sketch below)...
  • 10:20 EAM : I used a dd command to save only the last portion of the nebulous apache log files (example for ippc01 below), then removed the old file. I had to stop and restart the apache servers on each of ippc01 - ippc10 to get them to release their hold on /tmp/nebulous_server.log. Now everything is back up and running again.
    dd ibs=1024 if=/tmp/nebulous_server.log of=/export/ippc01.0/tmp/nebulous_server.20140213.log skip=20000000
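    (The dd above keeps everything after the first ~20 GB of the log: 20,000,000 input blocks of 1024 bytes.) Going forward, a logrotate stanza along these lines would keep /tmp from filling again. This is a sketch, not something installed; the size threshold and rotation count are guesses, and copytruncate is chosen so apache does not have to be restarted to release the file:
      /tmp/nebulous_server.log {
          size 500M
          rotate 4
          compress
          copytruncate
          missingok
          notifempty
      }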
    

Friday : 2014-02-14

  • 09:50 MEH: with ipp026 down for the count, many single-instance files will be missing from processing, and retrying them wastes cpu cycles. X.revert.off in stdsci for a while will help
  • 10:00 HAF: restarted stdsci at Bill's request (it was getting sluggish), re-applying x.revert.off
    • warp-only updates can run fast; bumping the poll 300->500
  • 15:18 HAF: restarted cleanup - it had died some time ago?
  • 16:10 CZW: ipp026 is back up. I've set it to repair in nebulous, and re-enabled reverts in stdscience.

Saturday : 2014-02-15

  • 00:50 MEH: weather not great and LAP will be working only on stacks for quite some time -- queueing test reprocessing of 3PI exposures that overlap MD04.V3 as MD04.test3pistack.20140217
  • 12:20 MEH: clearing fault 5 diffims (522999,523015 skycell.2606.092; 523037,523055 skycell.2518.043) by setting them to quality 14006

Sunday : 2014-02-16

  • 01:20 MEH: taking over ippc63 for compute3 memory limit test
  • 23:10 MEH: ippc63 back to its weak stdsci allocation; pstamp had large Njobs and Nfail, so restarted it and it is moving along again