PS1 IPP Czar Logs for the week 2013.07.29 - 2013.08.04

(Up to PS1 IPP Czar Logs)

Monday : 2013.07.29

mark is czar

  • 08:30 MEH: doing regular restart of stdsci (downturn in rate)
  • 08:50 MEH: update appears down, restarting. adding 1x compute3 (taking from stdsci)
    • might as well restart other main running ones like summitcopy, registration.
  • 09:15 MEH: and restarting pstamp
  • 10:10 MEH: in preparation for possible power issues with Flossie, stopping all processing and shutting down the compute2+3 nodes (and stare, which Gene will specifically take care of for dvo/ipptopsps issues). also ippc30 is still a PSS pseudo-datanode so leaving it up as well
  • 11:15 MEH: compute3 ippc63--ippc31 shutdown started
  • 11:45 MEH: compute2 ippc29--ippc20 shutdown started
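    • a shutdown loop along these lines does the job -- bash sketch only, assuming passwordless ssh and working sudo on the compute nodes:
      # halt the compute3 range (ippc31--ippc63); same idea for compute2
      for n in $(seq 31 63); do
          ssh ippc$n 'sudo shutdown -h now'
      done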

  • 12:15 MEH: compute2 ippc10,c14,c15,c16 started
  • 19:00 Gene powered down all systems

Tuesday : 2013.07.30

mark is czar

  • 07:30 MEH: Gavin proceeding with the long process of restarting all machines
  • 09:50 MEH: Gene has shifted the /data/ipp031.0 and /data/ipp032.0 locations to stsci03.0 and stsci00.0 respectively, with lower-level symlinks under nebulous across all stsci machines
  • 10:00 MEH: most machines up, doing a secondary check that cpu, memory etc are not MIA as we have occasionally found in the past (also that the date/time is correct) -- spot-check sketch below
    • ipp061 re-found its lost 8G RAM from the end of March
    • ipp060 lost 8G RAM.. doing a shutdown and power-cycle restart -- 8G back
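    • the spot check amounts to something like this bash sketch (host list is just a sample; assumes passwordless ssh):
      # compare cores, visible RAM and clock against what each host should have
      for h in ipp060 ipp061 ippc20 ippc31; do
          echo "== $h =="
          ssh $h 'nproc; free -g | grep Mem; date'
      done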
  • 10:10 Gavin reports
    ippc02 - down due to audible alarm
            (MHPCC are escorting guests and requested we disable it)
    
    ippc53 - checking on it, console is blank but I can see outlet
            is providing 263W.
    
    ipp046 - not powering up after power outlet cycle.
            (suspect BIOS settings not being saved.
            time for button battery replacement..)
    
    • Haydn checking on ippdb04 at ATRC -- up and running now, though ganglia still thinks it is down
  • 10:30 MEH: ippdb00,02 mysql not running so need to start -- see Processing
    • apache already running on ippc01-09 (taking ippc02 out of list in ~ipp/.tcshrc)
    • nebdiskd started by ipp@ippdb00
    • neb-host ipp046 down
    • ippc02,c53,046 commented out in ~ippconfig/pantasks_hosts.input
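    • for reference, the bookkeeping above amounts to something like this (bash sketch; the '#' comment convention in pantasks_hosts.input is an assumption):
      neb-host ipp046 down                  # keep nebulous from targeting the down host
      vi ~ippconfig/pantasks_hosts.input    # comment out the ippc02, ippc53, ipp046 lines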
  • 11:00 MEH: restarted czar screens and czarpoll on ippc11
  • 11:10 MEH: still scanning nfs mounts across all systems
  • 11:20 MEH: ipp028,033,036 commonly missing; after a couple of nfs restarts and a minute they are exporting okay -- sudo /etc/init.d/nfs restart (export check sketch below)
    -- first restart reported errors:
    exportfs: Warning: /export/ipp028.0 does not support NFS export.
    exportfs: Warning: /export does not support NFS export.  
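    • quick way to confirm the exports are back after the nfs restart (bash sketch):
      for h in ipp028 ipp033 ipp036; do
          echo "== $h =="; showmount -e $h || echo "$h still not exporting"
      done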
    
  • 11:40 MEH: neb-host repair for ipp033, 041 (red and full) so that if cleanup frees up space, there is still room for dvo work (as with others in that wave)
  • 12:00 MEH: ipp046 back up, so back to neb-host repair. ippc02 also back up so back into the .tcshrc nebservers. -- both added back into ~ippconfig/pantasks_hosts.input
  • 12:05 MEH: stdsci pantasks started, will start PSS set next
  • 12:40 MEH: all pantasks started and running again
  • 12:50 MEH: taking ippc02 out of processing until the alarm is tracked down, but leaving it as a nebserver. ippc53 is rebuilding its software raid so leaving it out as well.
  • 13:32 Bill: ran the commented out survey.add.relexp and survey.add.relstack commands in stdscience. I can find nothing wrong with the setup. Changed input file to uncomment them for next time stdscience is restarted.
  • 15:12 Bill: queued r band M31 exposures from first half 2010-09 to be processed
  • 15:20 queued m31.rp distRuns that MPG has downloaded to be cleaned.

Wednesday : 2013.07.31

  • 13:25 MEH: looks like LAP stacks won't be made for quite some time, reallocating compute3 power back to stdsci.. stdsci also in desperate need of its regular restart.. doing now
    • also adding back in ippc53, taken out yesterday for the software raid issue

Thursday : 2013.08.01

  • 09:30 EAM : ipphosts was overloading the stsci machines : set.host.by.skycell was resulting in hosts of 'stsci07.1', which does not exist, so the rules for runaway overload of a machine were not respected. I've fixed set.host.by.skycell to strip off the .1, leaving just the real host name.
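    • the gist of the fix, as a shell illustration only (the real change lives in the set.host.by.skycell code):
      echo stsci07.1 | sed 's/\.[0-9]*$//'    # -> stsci07, the real host name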
  • 09:31 EAM : ipp025 is hung up on nfs, rebooting (everything currently stopped)
  • 10:20 EAM : the ppSkycell jobs (skycell_jpeg.pl) were overloading the stsci nodes (or contributing to the load). I've deactivated them for now.
  • 13:00 MEH: in the meantime, stealing 1x compute3 from unused stack pantasks for test stacks/diffims in local pantasks
  • 15:00 MEH: to make last night's MD stacks, need to add the date back: ns.add.date 2013-08-01
  • 15:40 MEH: stsci still getting hit hard even with diluting warps by having chips run (70/10 chip/warp %..), stopping cleanup to see if that eases things a bit
    • same goes for distribution
  • 16:20 MEH: put stsci00 into repair, 2/3 disks are red. has load >200 for past 30 min..
  • 16:40 MEH: stopping cleanup+dist is not helping; putting stdsci to a low poll and will raise it until things break..
  • 17:00 MEH: stsci00 load finally <100 after ~hour, distribution back on and ok.
    • setting stdsci poll 100, after a bit jumps to maintained loads of ~100
  • 17:45 MEH: enough of the whack-a-mole... try shutting down mysql on stsci00 -- load dropping and back to normal levels with full processing and neb-host up again.
  • 18:15 MEH: finally all but one edge skycell are finished for the MD stacks.. tweak_ssdiff to get the diffims queued before nightly starts.
  • 18:30 MEH: noticed with chip.off, the system is not fully loaded even though there are plenty of diffs and warps to process.. something is keeping them from fully loading, poll rate?
    • increasing set.poll from 300 to 500 gets ~100 running, not sure why
  • 20:10 MEH: stsci00 back up over 100 but not reaching 200 this time
    • shutting down mysql on the remaining stsci00-09 machines for now (loop sketch below)
    • leaving mysql running on stsci04 for now, it seems not to be bothered by it until early morning (so will shut it down before midnight)
    • stsci00 still more loaded than the rest
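    • the shutdowns amount to a loop like this (bash sketch; assumes passwordless ssh/sudo and the usual init script name):
      for h in stsci01 stsci02 stsci03 stsci05 stsci06 stsci07 stsci08 stsci09; do
          ssh $h 'sudo /etc/init.d/mysql stop'    # stsci00 already done, stsci04 left running for now
      done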
  • 20:55 MEH: slowly catching up, taking ThreePi.WS.nightlyscience label out of stdsci for a while (until ESS finished)
  • 21:45 MEH: seeing squelched warp loading, and backlog of M31.rp and 3PI before ESS. restarting stdscience to reshuffle and remove non-nightly science labels..
  • 22:00 MEH: adding 4x compute3 (taking back from stack). see N warp running jobs reach 100, and also seeing most neb targets as stsci00,07. problem with the mask that assigns skycell location?
  • 22:20 MEH: stsci04 disk monitoring python triggered, spiking load 30-50 but recovering so far (as normal)
  • 23:00 MEH: try adding in ThreePi.WS.nightlyscience, M31.rp.2013.bgsub labels -- stsci06,09 load spike >150
  • 23:45 MEH: wonder if the bg.warp.run is reducing normal warp processing, restarting stdsci again to flush from queues
    • seems mostly stable -- leaving all non-critical nightly science labels out, ThreePi.WS.nightlyscience as well to avoid getting a backlog of last night's diffs -- those will need to be finished in the morning

Friday : 2013.08.02

Bill is czar today

  • 08:10 MEH: nightly science finished, putting back in the non-nightly labels I took out last night
    ThreePi.WS.nightlyscience
    M31.rp.2013.bgsub
    M31.rp.2013
    LAP.ThreePi.20130717
    
  • 09:07 warp rate has plummeted again. Trying bg.off
  • 09:29 bumping poll back up to 200
  • 10:11 M31 warp and warp_bg jobs are taking 5x as long to run when the path base is on stsci04. Last night's nightly science does not show this effect. Neither did the warp_bg runs for the 201008 exposures. Gavin had a look and there doesn't seem to be anything wrong with the hardware. Assuming that this is a side effect of being so far behind, I'm setting stsci04 to repair. This will distribute the writes to other nodes and may help with the backlog. Turning warp_bg back on and lowering poll to 100.
  • 10:25 chip.off
  • 10:48 changed priority of M31.rp.2013.bgsub label to 200 to match LAP. This will give the LAP warp updates priority over the M31 warps which should reduce the total I/O demands. The warp_bg runs will continue to run however so M31 processing will make some progress.
  • 11:21 stsci00 and 01 now have the largest number of pending jobs. neb-host stsci04 up.
  • 11:58 stsci01 set to repair
  • 12:22 pending background warps have completed. stsci04 is backing up now, so set it back to repair as well.
  • 12:40 summary of current status
    • M31.rp.2013.bgsub has same priority as LAP so LAP runs are running due to lower warp_ids
    • M31.rp.2013 warp_bg processing has caught up, so no high-I/O-load M31 processing is running right now
    • jobs are backed up on stsci01 and stsci04 which are set to repair mode. None of the other nodes have large backlogs. The backlog is shrinking slowly.
    • chip is off since we have 1251 pending LAP warps
    • The ThreePi.WS.nightlyscience diffs are making slow progress, 213 runs pending
  • 12:50 stsci00 set to repair.
  • 13:13 ns.stacks.run has been timing out. Bumped timeout from 480 to 1000 seconds. The job ran and now we have OSS.nightlystacks running. Checked module with larger timeout into the tag and trunk
  • 15:00 MEH: noticed stsci00-09 running kernel 2.6.34 but they were rebooted to the newer kernel 3.7.6 in April for a similar data-rate oddity problem (IIRC). stopping all processing; Gene talking with Gavin about rebooting those systems into the new kernel.
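    • quick check of which kernel each node is actually booted into (bash sketch, passwordless ssh assumed):
      for h in stsci0{0..9}; do
          echo -n "$h: "; ssh $h uname -r
      done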
  • 15:30 restarted stdscience, set other pantasks to run
  • 16:10 rate since restart is about 40 runs per hour for both warps and diffs. This is better than before but not terrific. Still have a backlog of about 90 components that want to put data on stsci00
  • 17:00 MEH: looks like it is closer to normal now, rate diffs and warps both >50 (2/3 of compute3 power is allocated in stack now for LAP)
    ./czarplot.pl -rt -p '6 hour'
    
  • stsci08 is the rate limiter; now set to repair. Other stsci nodes set to up.
  • 17:30 diffs have finished. dropped some faulted skycells (-set_quality -42 -set_state full) and some runs from July that were still in the new state but whose input warps have been cleaned. stsci08 has caught up now. All stsci nodes are in nebulous state up now.
  • Note to self. REMEMBER to set chip to on before observations begin
  • 18:15 CZW: chip.on in stdscience, as nodes are idle in pantasks. I'd rather just let it fight it out for LAP chips/warps. It looks like the majority of runs need chips to be done before they can stack anyway.

Saturday : 2013.08.03

  • 16:04 Bill: ran "survey.add.warp.bg M31.rp.2013 M31.rp.2013.bgsub M31.BGP" in stdscience to queue the remaining 100 or so bg warps from 2010-10. Won't queue more M31 until tomorrow when I will have a chance to keep an eye on it.
  • 16:07 Bill: It looks like stdscience could use a restart (pcontrol > 100% CPU usage, timeouts from pantasks status commands). Set to stop
  • 16:14 stdscience restarted

Sunday : 2013.08.04

  • ~11:30 Bill: queued M31 r band exposures from first half of Nov 2010
  • ~12:10 recovered lost rawImfile neb:///gpc1/20101110/o5510g0141o/o5510g0141o.ota26.fits. As usual there was a copy on ippb02 from a failed replication
  • 14:13 /data/ippc30.1 has run out of disk space. I'm a bit behind cleaning up old requests and several > 1TB requests were submitted in the past couple of days. Changed labels on offending requests and started cleaning out some space.
  • 15:05 restarted stdscience and distribution