PS1 IPP Czar Logs for the week 2015.02.23 - 2015.03.01

(Up to PS1 IPP Czar Logs)

Monday : 2015.02.23

  • 07:30 EAM: I should have restarted stdscience last night -- it was running slowly during the night and is now somewhat behind. I'm restarting it now.

  • 10:30 MEH: restarting pstamp, >300k jobs
  • 10:50 MEH: ipps (host m0,1) and ippx037-044 (hosts x0/1b) off in stdlocal and staticsky, leave ippx (x0,1,2,3) as allocated in stdlocal and staticsky
    • leave these in staticsky until ~tuesday

Tuesday : 2015.02.24

  • 13:30 EAM: I've stopped ippsky/staticsky so I can rebuild psModules and psphot for ipp-20150115. I have updated the code to provide psf.e1,e2 values for the lensing parameters, and I'd like to re-run the full-force analysis with this change. I'm going to run full-force only under ippsky while working on the ppSub errors mentioned by Chris. I'll put additional x-node servers into stdlocal for now.
  • 13:55 EAM: ipp056 nfs is wedged. it seems to have gotten stuck on ipp078, but nothing seems to clear it. i'm attempting to reboot, and may need to power cycle.
  • 14:05 EAM: regular ipp services restarted in the normal fashion. stdlocal restarted, with 7* m0, m1, x0b, x1b for now. ippsky still running sas fforce. when this is done I'll turn on staticsky with just x0 and x1 nodes.
  • 16:00 EAM: stdlocal seems to want lots of jobs using stsci14. the 35 jobs limit is thus throttling the work there. rather than tweak this, I've taken the x3 nodes from stdlocal and put them in the full-force analysis for sas. when that is done, we can use them depending on the status of warps on stdlocal.

Wednesday : 2015.02.25

  • 01:00 MEH: long-running jobs again on ipps00-04(3x), ipps05-s14(2x), ippx037-044(2x), ippc29(1x). ~ippmd/deepstack pantasks is set to stop and ppStack jobs can be killed as necessary; poll is set to the number of nodes available
  • 01:45 MEH: watching deepstack progress, it looks like starting around midnight rolling faults from db00 increased and were responsible for the drop in processing rate at that time.
  • 06:25 EAM: i've stopped staticsky on ippsky in preparation for using x0 & x1 nodes in relastro today.
  • 06:40 EAM: mysql@ippdb06 crashed, restarted (20ksec behind master)
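A lag figure like the "20ksec behind master" above can be read from the replica's `SHOW SLAVE STATUS` output. A minimal sketch, with a captured status line inlined so it is self-contained (the 20113 value is invented for illustration; on ippdb06 you would pipe the real output of `mysql -e 'SHOW SLAVE STATUS\G'`):

```shell
# Parse replica lag from SHOW SLAVE STATUS output.
# 'status' stands in for:  mysql -e 'SHOW SLAVE STATUS\G'
status='Seconds_Behind_Master: 20113'

# Pull out the Seconds_Behind_Master value.
lag=$(printf '%s\n' "$status" | awk '/Seconds_Behind_Master/ {print $2}')
echo "replica lag: ${lag}s"    # -> replica lag: 20113s
```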
  • 19:18 HAF: all stuff restarted - registration moved to c07 because c06 is dead for tonight. stdlocal is stopped for now, we need to catch up on nightly. Haydn did computer stuff:
    ipp094 and ipp086 are both back online with new RAID batteries from eBay.
    ipp077 is also back online, but I didn't have the right cables to fix it.
    ippc06 is offline -- I can't get it to see/boot from its hard drives.
  • 19:25 HAF: crap crap crap crap... How do I remove c06 from the nebulous server list? it's causing problems for me... lots of faults...
  • 19:36 HAF: reply from Mark on how to fix neb for tonight -- comment out the ippc06 entry in the nebulous server list:
    #set nebservers = ($nebservers http://ippc06/nebulous);

Thursday : 2015.02.26

Friday : 2015.02.27

  • 00:55 MEH: something is harassing ipp093,095 (and to a lesser degree some of the other newer storage nodes) -- load ~100 due to large wait_cpu states
    • pausing deep stacks seemed to ease it some -- so kill -STOP for the night; not trying the same for stdlocal and staticsky since those are long jobs now and no kill -STOP script is ready -- turned stack.poll 800->650, which improved things some more; leaving it there for the night
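For the record, pausing a job for the night with kill -STOP and resuming it later looks like this. A minimal sketch using a background `sleep` as a stand-in for a ppStack job:

```shell
# Start a long-running stand-in job (sleep plays the role of ppStack).
sleep 300 &
pid=$!

kill -STOP "$pid"                         # freeze the process: no CPU, no I/O
echo "paused:  $(ps -o stat= -p "$pid" | cut -c1)"   # 'T' = stopped

kill -CONT "$pid"                         # resume where it left off
echo "resumed: $(ps -o stat= -p "$pid" | cut -c1)"   # 'S' = sleeping/runnable

kill "$pid"                               # clean up the stand-in job
```

The stopped process keeps its memory but stops competing for CPU and disk, which is why pausing the deep stacks eased the load without losing their work.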
  • 07:45 MEH: fault 5 OSS diffim
    difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 668277 -skycell_id skycell.2622.060  -dbname gpc1
  • 07:50 MEH: restarting pstamp for QUB stamps -- also clearing a cleaned/update conflict in warp again
  • 15:20 CZW: Attempted to clear the logjam in stdlocal by removing stsci14 from the controller list (all outstanding jobs want to put data there, and are being blocked out by pantasks). This seemed like it would work, as stsci14 isn't accepting data anyway, so it would not get overloaded. However, pantasks won't forget about stsci14, so the jobs are still stuck. They'll eventually clear, I guess.
  • 16:45 CZW: I've stopped stdlanl/remote.poll/exec. We need to make about 4k more stacks, but I'll relabel them to run locally when that happens.
  • 16:47 CZW: I'm restarting stdlocal with the x2 nodes removed so Gene can use them for relastro work over the weekend.
  • 17:00 CZW: ipplanl/pv3stacksummary pantasks is running again on ippc03. This will fix the bad summaries Nigel pointed out, and generate the majority of the ones outstanding.
  • 17:20 CZW: The restart of stdlocal was also supposed to trick pantasks into not blocking us due to jobs wanting stsci14 (which is in repair in nebulous and therefore shouldn't be a block at all). This did not work, so I've temporarily set the unwant parameter to 80 to get the last warp skycells to complete. I'll reset it to 35 when those are done, as at that point there should be enough other stacks available to keep things busy.
  • 18:55 MEH: doing the necessary restart of stdsci for nightly processing
  • 22:40 MEH: ippc53 unresponsive and no console prompt -- power cycle, 208 days since last /dev/md2 disk check..
    • it is also checking ippc53.0.. it will be unavailable for a while.. turning off in stdlocal
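Progress of a check like the /dev/md2 one can be watched via /proc/mdstat. A sketch with an mdstat excerpt inlined so it is self-contained (device names and numbers are invented for illustration; on the host itself you would read /proc/mdstat directly):

```shell
# Stand-in for:  cat /proc/mdstat
mdstat='md2 : active raid1 sda3[0] sdb3[1]
      972040192 blocks [2/2] [UU]
      [=====>...............]  check = 28.7% (279121536/972040192) finish=95.3min'

# Extract the check-progress percentage from the status line.
progress=$(printf '%s\n' "$mdstat" | sed -n 's/.*check = \([0-9.]*%\).*/\1/p')
echo "md2 check progress: $progress"    # -> md2 check progress: 28.7%
```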

Saturday : 2015.02.28

  • 08:10 EAM: I've been having a lot of trouble getting relastro to run smoothly -- it is overloading the memory on some of the x0,x1 nodes. I had been able to double up on some jobs, but with the new rules, I am running out of space. I have taken over a number of x2 nodes as well (ippx049 - ippx060) and put the smaller jobs there. I think this will avoid thrashing.
  • 15:50 EAM: still thrashing on the 48GB nodes, i've split them in half so I am now using up to ippx072

Sunday : 2015.03.01

  • 00:20 MEH: nightly running ok, pstamp could use a restart