PS1 IPP Czar Logs for the week 2015.02.02 - 2015.02.08

Monday : 2015.02.02

  • 08:50 EAM: stopped and restarted stdlocal, stdscience is basically done, so i've added in the storage hosts as well
  • 17:54 HAF: stopping and restarting stdsci + friends for tonight.

Tuesday : 2015.02.03

  • 06:25 EAM: stopping and restarting stdlocal
  • 06:55 EAM: stsci13 hung on reads from lmsensors. I tried to use reboot, but it just hung. I power cycled and it is coming up
  • 10:55 MEH: reminder -- because we are on an aggressive cleanup cycle of 2 days, any faults that need manual clearing have to be cleared ~daily or they won't clear w/o also having to manually put the chips+warps back through updates -- for example, faulted WS from 2/2 will have its warps cleaned up today at 2pm, so these need to get fixed now
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.0663.010 -diff_id 654186
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.0663.010 -diff_id 654213
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.0663.010 -diff_id 654235
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.0663.010 -diff_id 654286
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.0663.010 -diff_id 655221
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.0663.010 -diff_id 655250
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.0663.010 -diff_id 655267
  • 12:40 MEH: another modification to the data host targeting for nightly processing, adding nodes w/ poor BBU to accept raw OTA since this worked with systems <ipp066 (ipp-20141024.lin64/share/pantasks/modules/ipphosts.mhpcc.config). will be used when summitcopy+registration+stdsci are restarted
  • 16:00 CZW: stdlocal stopped to avoid possible race conditions as I batch run some LAP scripts.
  • 16:30 CZW stdlocal back up, stdlanl stopped to do the same LAP push, and also clear out runs that have stacks that finished today.

Wednesday : 2015.02.04

  • 07:35 EAM: we still have 5200 stdlocal stacks to do. since we are only doing stacks, I've (a) turned off half of the xnode connections (down to 5 from 9) and (b) bumped up the stack poll (to 750 from 500). I've also turned on the storage hosts.
  • 09:40 EAM: staticsky finished most of the skycells in the queued batch. There were a handful which were still running after >50k seconds. I've killed those jobs -- we can try to re-run them or debug them in the future. I am going to queue the galactic plane block for 0h now. The impact on the machines will likely be quite different, so the distribution of machines may need to be modified for the plane.
      0   ippx001    BUSY  118106.27  0 --threads @MAX_THREADS@ --sky_id 764564 --outroot neb://any/gpc1/LAP.PV3.20140730/sky01.20141223//RINGS.V3/skycell.1935.063/ --redirect-output --camera GPC1 --dbname gpc1 --verbose 
      1   ippx026    BUSY  58307.66  0 --threads @MAX_THREADS@ --sky_id 765990 --outroot neb://any/gpc1/LAP.PV3.20140730/sky01.20141223//RINGS.V3/skycell.2159.015/ --redirect-output --camera GPC1 --dbname gpc1 --verbose 
      2   ippx045    BUSY  58120.00  0 --threads @MAX_THREADS@ --sky_id 765999 --outroot neb://any/gpc1/LAP.PV3.20140730/sky01.20141223//RINGS.V3/skycell.2159.024/ --redirect-output --camera GPC1 --dbname gpc1 --verbose 
      3    ipps08    BUSY  58036.93  0 --threads @MAX_THREADS@ --sky_id 766000 --outroot neb://any/gpc1/LAP.PV3.20140730/sky01.20141223//RINGS.V3/skycell.2159.025/ --redirect-output --camera GPC1 --dbname gpc1 --verbose 
      4    ipps14    BUSY  57880.35  0 --threads @MAX_THREADS@ --sky_id 766008 --outroot neb://any/gpc1/LAP.PV3.20140730/sky01.20141223//RINGS.V3/skycell.2159.033/ --redirect-output --camera GPC1 --dbname gpc1 --verbose 
      5   ippx026    BUSY  57877.06  0 --threads @MAX_THREADS@ --sky_id 766009 --outroot neb://any/gpc1/LAP.PV3.20140730/sky01.20141223//RINGS.V3/skycell.2159.034/ --redirect-output --camera GPC1 --dbname gpc1 --verbose 
      6   ippx011    BUSY  57743.95  0 --threads @MAX_THREADS@ --sky_id 766010 --outroot neb://any/gpc1/LAP.PV3.20140730/sky01.20141223//RINGS.V3/skycell.2159.035/ --redirect-output --camera GPC1 --dbname gpc1 --verbose 
      7   ippx013    BUSY  57672.66  0 --threads @MAX_THREADS@ --sky_id 766018 --outroot neb://any/gpc1/LAP.PV3.20140730/sky01.20141223//RINGS.V3/skycell.2159.043/ --redirect-output --camera GPC1 --dbname gpc1 --verbose 
      8    ipps00    BUSY  57650.65  0 --threads @MAX_THREADS@ --sky_id 766019 --outroot neb://any/gpc1/LAP.PV3.20140730/sky01.20141223//RINGS.V3/skycell.2159.044/ --redirect-output --camera GPC1 --dbname gpc1 --verbose 
  • 12:55 MEH: Haydn needs to power down ipp077, ipp083, ippx044 to do some work but multiple jobs are using them
    • ipp077 neb repair, but likely long running jobs will still be writing to for a bit
    • ipp083 like ipp077 neb repair, but also active in stdlocal -- set to off and clearing
    • ippx044 active in ippsky -- set to off and slowly clearing
    • ippsky is writing to any disk, so long running jobs may be trying to write to ipp077,083 when they go down
    • stdlocal and ippsky may also need to read from these so there may be extra faults
  • 15:15 MEH: Haydn finished work --
    • ipp077 neb repair for a bit, had been out of stdsci and stdlocal processing by default until nfs restart was done. looked fine yesterday and reboot also should have fixed issue so back into stdsci and stdlocal
    • ipp083 neb repair for a bit, new BBU is charging. back into stdsci and stdlocal processing
    • ippx044 disk replaced and rebuilding, back into ippsky processing at half power -- console access fixed and rebooted into newer 3.7.6 kernel -- not clear whether we want to switch yet or do so in a controlled way, all at once, so back off in ippsky
  • 15:20 MEH: summit reports no power and MECO not able to fix until winds drop. off sky tonight
  • 17:00 EAM: deactivated (in input file) the macro which turns off the storage nodes at night. added storage nodes, re-adjusted loading (now that stack is done) so x2 and x3 are back up to full loading (8x)
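
The long-running staticsky jobs in the 09:40 entry were picked out by their elapsed time (the 4th column of the job listing). A minimal sketch of that filter follows, assuming the column layout shown in that listing (index, host, state, seconds, then the job arguments). The function name long_jobs and the abbreviated sample rows are illustrative only, not part of pantasks.

```shell
# Sketch: list BUSY jobs whose elapsed seconds (4th column) exceed a threshold.
# Assumes the layout of the listing in the 09:40 entry; long_jobs is a hypothetical helper.
long_jobs() {
    threshold="$1"
    awk -v t="$threshold" '$3 == "BUSY" && $4 + 0 > t {
        sky = "?"
        # pick up the value following --sky_id, if present
        for (i = 5; i < NF; i++) if ($i == "--sky_id") sky = $(i + 1)
        print $2, $4, sky
    }'
}

# Two rows abbreviated from the listing, for illustration.
sample='0 ippx001 BUSY 118106.27 0 --threads @MAX_THREADS@ --sky_id 764564
8 ipps00 BUSY 57650.65 0 --threads @MAX_THREADS@ --sky_id 766019'

echo "$sample" | long_jobs 100000
```

With the 50k second cut mentioned in the entry (long_jobs 50000), both sample rows would be listed; the output gives host, elapsed seconds and sky_id so the jobs can be found again for a re-run or for debugging.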

Thursday : 2015.02.05

  • 07:50 EAM: stdlocal is getting sluggish; restarting it now. staticsky ran out of things to do so I queued 20h-22h (galactic plane). after that finishes, I'll queue the galactic pole regions for 02-04h, and then 20h-24h.
  • 09:50 EAM: stdlocal is out of things to do (or trying to queue them up). I've stopped stdlocal and restarted it with only c0,c1a,c1b and storage nodes. I'll let it run along lightly to ensure some stacks get finished and lap advances are done as needed. Meanwhile, I've put all of the xnodes and c2 nodes on staticsky so the cluster is busy.
  • 10:10 EAM: I put too many jobs on x2 and x3 -- memory is a bit overloaded. i've taken 2x back out of each. if those machines start thrashing, we may need to kill some of the jobs and revert, but we should give them a chance to complete first.
  • 10:30 MEH: MECO won't be able to restore power to summit today, no observations tonight
  • 10:55 MEH: doing clean restart of all nightly pantasks
  • 11:15 MEH: note with Bill out for a bit, the czar will probably need to keep more of an eye on updates and pstamp
    • cleaning up some of the massive remaining ps_ud_MPIA updates and faults
  • 11:30 CZW: stopping stdlocal/stdlanl to do the lap monitor push. I'll set them back to run when the push is finished.
    • 12:08 CZW: all clear.
  • 12:15 MEH: ipp086, like ipp094, does not have a functioning BBU, so both are in the more limited WT cacheless case -- they can handle nightly raw and random processing but not, it seems, the massive LAP random from stsci14 in repair and all the untargeted staticsky. will need to be in repair until that is fixed
  • 17:30 EAM: most staticsky jobs are done in the current queue, but a number are thrashing. i'm going to kill them, stop and restart the pantasks, and re-assign hosts back to just the high-mem xnodes, and give the others back to stdlocal.

Friday : 2015.02.06

  • 14:10 CZW: Stopping stdlanl/stdlocal for a reboot before the weekend.
  • 14:40 CZW: Both back up and running. I was able to kick the LAP queues for both as well.
  • 15:00 CZW: Added storage nodes to stdlocal.
  • 15:27 HAF: stsci00 in repair mode in order to do SAS37 (again)
  • 16:00 MEH: with likely nightly data tonight, setting nightly storage nodes 067-081 neb-host up
  • 19:40 EAM: stdlocal is having a hard time keeping all processors spinning -- too many jobs want specific stsci nodes which are maxed out. I'm taking x3 away from stdlocal and giving it to staticsky.

Saturday : 2015.02.07

  • 19:10 EAM: restarting stdlocal. It looks like the summit is clear so I will not leave storage hosts on for the night in stdlocal.

Sunday : 2015.02.08

  • 08:40 EAM: two hanging warps:
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -fault 0 -skycell_id skycell.0513.051 -warp_id  1440383
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -fault 0 -skycell_id skycell.0589.017 -warp_id  1440748
  • 11:25 EAM: stopping & restarting stdlocal