PS1 IPP Czar Logs for the week 2012-05-14 -- 2012-05-20

Monday : 2012.05.14

Serge is czar

  • 00:05 Mark: wave1 hosts are not very happy when the system is in full use; they may be overloading. Manually removing 1 group from stdscience to see how that helps. ipp021 was also being killed by a ppSub using >90% RAM (neb://ipp021.0/gpc1/OSS.nt/2012/05/14/RINGS.V3/skycell.1560.018/RINGS.V3.skycell.1560.018.dif.238554); see the memory-check sketch below.
  • 06:30 Serge: Nightly processing almost complete (~50 3pi exposures). ipp002 is down. Gavin rebooted it around 7:30.
  • 09:00 Serge: Nightly processing complete.
  • 13:50 Serge: Ganglia hiccups pattern: restarting stdscience
  • 14:20 Serge: Killed 4 ppImage that were drowning ipp039
  • 15:19 Serge: ipp002 has a huge load because of apache(?)
  • 17:00 Bill: stopping pstamp, distribution, and publishing in preparation for rebuilding the ippRequestServer database
  • 18:00 Bill: testing the re-ingest was taking too long; I will change the database tomorrow. pantasks restarted.
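
  A minimal sketch of one way to spot a runaway job like the ppSub above (standard Linux ps options; the kill target is a placeholder, not an actual PID):
    ssh ipp021 'ps -eo pid,pmem,rss,etime,args --sort=-pmem | head -n 5'
    ssh ipp021 'kill <pid-of-the-offending-process>'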

Tuesday : 2012.05.15

  • 08:15 Bill: pstamp, distribution, and publishing set to stop. Waiting for the mysql slave on ippc19 to catch up; then we will rebuild the ippRequestServer database on ippc17 using the InnoDB storage engine.
  • 10:11 CZW: Restarting stdscience to hopefully gain some speed and track down jobs taking longer than they should.
  • 10:45 Bill: ippRequestServer has been restored with InnoDB tables. New version inserted into the replicated database on ippc19; slave restarted and now caught up. Thanks, Serge. See the sketch below.
  • 14:08 Bill: dropped 875 chipProcessedImfiles whose state was not full and whose quality was non-zero even though the corresponding rawImfile was in the ignored state. Warp updates get stuck if somebody asks for a skycell that thinks it needs a chip in that state. See the SQL sketch below.
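
  A rough sketch of the slave check and InnoDB conversion described above (the table name is a placeholder, and the actual rebuild was done via a restore rather than an in-place ALTER):
    # on ippc19: confirm the slave has caught up before touching the master copy
    mysql -e 'SHOW SLAVE STATUS\G' | grep -i seconds_behind_master
    # on ippc17: convert a table in ippRequestServer to InnoDB
    mysql ippRequestServer -e 'ALTER TABLE some_request_table ENGINE=InnoDB'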
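
  The 14:08 cleanup amounts to a delete keyed on the ignored rawImfiles. A hypothetical SQL sketch (the state/quality/exp_id/class_id column names are assumptions, not checked against the gpc1 schema), run as a count first and swapped to a DELETE only once the number looks right:
    mysql gpc1 -e "SELECT COUNT(*) FROM chipProcessedImfile c JOIN rawImfile r USING (exp_id, class_id) WHERE r.state = 'ignored' AND c.state != 'full' AND c.quality != 0"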

Wednesday : 2012-05-16

  • 00:57 CZW: restarted stack, which apparently committed suicide between 18:00 yesterday and now.
  • 11:00 CZW: restarted stdscience, which was using a full processor for pcontrol.
  • 11:27 CZW: restarting stdscience again, because scrolling up to find a "status" command occasionally finds a "shutdown" command. These are not the same commands.

Thursday : 2012-05-17

  • 07:20 EAM : burntool hung / failed on one exposure, fixed with: regtool -updateprocessedimfile -exp_id 488814 -class_id XY71 -set_state pending_burntool -dbname gpc1
  • 07:25 EAM : summitcopy had many timeouts and was looking messy, so I restarted it
  • 08:05 EAM : reverted two MD nightly stacks:
    stacktool -revertsumskyfile -fault 5 -label MD05.nightlyscience -dbname gpc1
    stacktool -revertsumskyfile -fault 5 -label MD06.nightlyscience -dbname gpc1
    
  • 11:45 EAM : big memory spike on ipp058 with resulting crash; rebooted it
  • 11:59 EAM : restarted all pantasks to clear out cruft

Friday : 2012-05-18

  • 08:30 - 08:45 Bill: Some poor user entered coordinates in MD09 for a warp postage stamp request into the web form. Since the coordinates also overlap SAS, 6684 jobs were queued, and since the web form has the highest priority this led to a massive backlog. Changed the labels for that request from WEB to WEB.BIG, which has lower priority. Also found a QUB stack/stack diff request which was failing because one of the files in one of the nightly stacks from several years ago was lost in the ipp020X disaster of 2010. It would eventually have given up after faulting 3 times, but I put it out of its misery.
  • Space is getting to be a problem: from the graph it looks like we have a couple of weeks before we are down to zero. So many nodes are full that we are getting quite a few job faults due to NFS errors; reverted a few by hand. There are a couple of days' worth of space used up by pstamp updates, which we can clean up once we catch up. See the space-check sketch below.
  • 09:00 Bill: doubled the horsepower working on updates.
  • 14:05 Mark: 5 warp updates from my STS tests allocated to ipp017 have driven it nuts (all wait_cpu and swap hell); the runs are using 10-20% RAM each for a total of ~75%. Killing them off 1 at a time to see if the remaining ones can finish OK.
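
  One way to get a quick per-node view of the space situation above (the host list and the df filter are placeholders; the ganglia graphs are the usual source):
    for h in ipp017 ipp021 ipp039 ipp058; do echo "== $h =="; ssh $h 'df -h | grep -v tmpfs'; done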

Saturday : 2012.05.19

Sunday : 2012.05.20