
Monday : 2012-09-17

Bill is czar today.

  • 09:00 LAP processing is stuck due to a diff cleanup issue, which Chris will repair.
  • 09:56 stopping processing in order to restart all pantasks, turning ipp037 on as a compute node.
  • 10:30 LAP is moving along nicely now
  • 10:38 repaired a number of missing files (found with deneb-locate). Two were actually lost for good: o5364g0119o.ota65.fits and o5361g0129o.ota65.fits
  • 13:11 ipp020 has lost communication with the virtual /data/ipp037.0 (stsci00). Shut it down, then power cycled it after the usual hang at "unmount nfs file systems". Because I'm lazy today I'm removing the host from the processing lists.
  • 15:07 setting lap.off for a bit so we can debug the confused state of czartool's chip data without the counts changing out from under us.
  • 15:40 ipp028 has gone out to lunch. Lots of nfs errors in dmesg. Something about a general protection fault as well, but that did not show up in the console log. Power cycled it.
  • 15:55 turned lap.on but left lap.cleanup.off for a little while longer
  • 17:00 ipp028 is not visible in nebulous but it seems to be up. Setting it to repair state for now.
  • 17:14 OK, I'm convinced that czartool is just getting confused by the goto_cleaned rate. Turned lap.cleanup.on (toggle sequence sketched below).
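  • Note: the lap.* toggles above are pantasks commands; a minimal sketch of today's sequence as it might be typed at the LAP pantasks client prompt (the client invocation and exact prompt are assumed; the toggle names are from this log):

        lap.off            # 15:07 pause LAP queuing so czartool's chip counts stop changing underneath us
        lap.cleanup.off    # keep cleanup paused (exact time of this toggle not logged)
        lap.on             # 15:55 resume LAP queuing, cleanup still off
        lap.cleanup.on     # 17:14 re-enable cleanup after concluding czartool was only confused by the goto_cleaned rate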

Tuesday : 2012-09-18

Bill is czar again today

  • 06:40 Processing rate dropped through the night to ~60 per hour from 100 per hour. Restarted stdscience. ipp020 is in the host lists but in repair mode for nebulous until the sun is higher in the sky.
  • 06:47 reduced nhosts in stdscience from 6 to 3 on ipp020 (24GB) and from 4 to 2 on ipp028 (16GB). Set them to up in nebulous.
  • 09:33 at Mark's suggestion added compute3 hosts to stdscience since deepstack isn't using them
  • 10:50 MEH: stopping all pantasks to rebuild ippTools, restarting stdscience to make the MD label change to MD09.refstack. Manually setting ipp020, ipp028 back to the lower levels as above and adding back in the extra compute3.
    • tweaking ssdiffs to try running before noon (to test whether the difftool+refstack label change is working); new MD09.refstack.20120831+staticsky run to distribution.
    • 7 "new" SSdiffs for skycell.082, since the previous refstack had a quality issue with it. Seems to work; need to watch new nightlyscience for over-queueing in the future.

Wednesday : 2012-09-19

Heather is czar

  • 11:36am restarting stdsci
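  • Note: "restarting stdsci" here and elsewhere in these logs follows the usual pantasks pattern; a minimal sketch, assuming the standard pantasks client commands (stop / status / run) and eliding the site-specific server relaunch:

        stop      # stop handing out new jobs
        status    # repeat until the running job count drains
        # ...kill and relaunch the stdscience pantasks server, reload host lists (not shown)...
        run       # resume processing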

Thursday : 2012-09-20

Mark is Czar

  • 08:40 MEH: probably won't help, but doing a stdscience restart. A few chips+warps have been running for >40ks.
    • also throwing in an extra compute3 to stdscience since deepstack is not running
  • 09:25 EAM : taking ipp037 out of processing -- as part of ongoing testing, I am going to set it to be 'on' in nebulous, and off in processing. But first, I need to ensure the rsyncs are still up-to-date.
  • 10:10 MEH: still no LAP stacks to do. Until stacks are queued, going to micromanage stdscience and try pressing more compute nodes into service.
    • (10:25) didn't take too long to get stacks. Will test again later if there happen to be no stacks, and look at whether there is any throughput increase or just more faults.
  • 11:40 MEH: noticing compute2 often going into swap hell. pswarp at 20% of 24GB on ippc21, for example, so removing a 1x compute2 set (~10 nodes) from stdscience.
    • even with 1x compute3 added to balance out the other moved nodes and keep things smoothed out, still only ~50/hr in stdscience in GP
  • 12:40 MEH: stopping and shutting down all pantasks/processing
    • need to reboot ipp046 and ipp050 due to an NFS fault and not being able to access their locally exported disks -- ipp046 not coming back up...
    • Gene/Gavin are working with ipp037 -- will set neb-host up for its disks but keep it out of processing. Can turn on cleanup for the r/w active test.
  • 17:00 MEH: restarting all pantasks
    • stdscience: -3x ipp020, -3x ipp028, -1x compute2(10), -1x wave1(14), +1x compute3(32) -- stare already permanently removed in setup 8x stare00-04 (40)
      • stare test at some point: 4x in stack, -1 or 2x in stack, +2 or 3 in stdscience again? 1/2 since stack really can use them, stack can hit ~10GB sometimes.
    • cleanup pantasks on ippc07 had stalled pcontrol/pclients from 9/5-9/6 -- killed pcontrol and it seems okay and running now
  • 17:30 MEH: ippc45, ippc51, ippc61 seem to only have 24GB of RAM now? Explains why they have been swapping so much earlier today.
    • manually remove 3x from stdscience
  • 18:30 balance seems to be okay without a lot of swapping; occasionally see stacks >50% on 24GB systems (looks like a couple hit the stare nodes @19:30)
  • 20:20 looks like ipp061 has crashed.. power cycled okay -- http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/Ipp061-crash-20120920
  • 20:40 also looks like registration is stalled on an entry for 5ks.. ipp010 has a mount problem to ipp062.0, and others... Will need to reboot at some point; taking it out of processing for now.
    • no nightly science data, so stopping registration while we work out all the mount mess..
  • 22:10 ipp010 and ipp020 have remote mount problems. Don't want to reboot unless we have to; remote systems can still access them, so isolate them -- take them out of all processing, fault all jobs running on them, and set neb-host repair so no new data is put on them (see sketch below).
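  • Note: the isolation above, as a minimal sketch (the neb-host argument order is assumed; pulling the hosts from the pantasks host lists and faulting their jobs happens in the pantasks clients and is not shown):

        neb-host ipp010 repair    # nebulous stops placing new copies on ipp010; existing data stays readable
        neb-host ipp020 repair    # likewise for ipp020
        # then remove ipp010/ipp020 from all processing host lists and fault any jobs still running on them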

Friday : 2012-09-21

Mark is Czar

  • 09:00 MEH: taking ippc45,ippc51,ippc61 out of processing so they can be rebooted to try to recover the missing RAM
    • Rita rebooted and monitored, full 48GB now back online -- back into processing @11:00
  • 10:00 chip/camera/warp.revert.off for fixing faulted runs -- revert.on @15:00
  • 11:xx MEH: processing off while testing/rebooting ipp010, ipp020 to repair mounts -- mounts fixed, put back into nebulous and processing.
  • 12:00 Serge changed swappiness on ipp011, but that is not the cause of the large swap use. Bill/Gene notice it comes from rpc.statd, and it seems to be worse on ipp011,012,018 than on others. Will restart ipp011 in 20 mins.
    • ipp011 back up, processing back on -- forgot to start stack then stdscience one at a time.. beating on the stsci nodes for warps..
  • 13:00 Chris triggered cleanup for any nightly science from the past 180 days that was missing it
  • 14:10 after all these reboots, Serge is manually restarting the mysql on those hosts so dneb-locate.py can use them again (ipp010,011,020,037,061)
  • 15:00 MEH: fixing stalled LAP runs to clear some of the older ones (goto_cleaned put some in an odd state) ~150 chips
  • 18:10 MEH: will stop processing in ~30 min to shuffle load a bit again. stare and compute/2 nodes get hammered with stacks at times.
    • memory usage summary
      • stack: ~10-20%/24 typical, can peak 30-50% --> 2.4-5GB, peak 7-15GB+ (dangerous even on wave4)
      • ppImage: ~10%/24, peak ~20%
      • psastro: ~10-20%/24, peak 35%
      • pswarp: ~10%/24, peak ~20%
    • Manual host plan after restart
      • stare: -2 stk +2 stdsci (if no stks don't want idle, though unlikely now with large LAP list and current rates)
      • c: -1 stk
      • c2: -1 stk (instead of stdsci), still overloading so made later in night -2
      • w4: -1 stk (maybe +1 stdsci)
      • c3(deep): +1 stdsci (as before, but if stk becomes underpowered, switch to there but stks can clear during nightly science time)
      • w1: -1 stdsci
      • -3x ipp028: had problems in the past -- ipp020 is probably fixed so put back the 3x from before; not sure about ipp028, so keep it out
  • 19:50 MEH: ipp044 has nfs problem -- rebooting.. all processing stalled http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/Ipp044-nfs-20120921
    • took a while to reboot (3-5 min), raid spin-up?
    • restarted registration and something is borked.. going again after running regtool -revertprocessedimfile -dbname gpc1 (see the sketch below)
  • 20:30 compute2 overloaded, still by stack. Added another -1 to the list above for a total of -2
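  • Note: the registration fix at 19:50, as a minimal sketch (the regtool call is verbatim from this log; the surrounding registration pantasks stop/run steps are assumed):

        # stop the registration pantasks and let its jobs drain (assumed)
        regtool -revertprocessedimfile -dbname gpc1    # revert stuck imfile entries so registration can retry them
        # set the registration pantasks back to run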

Saturday : 2012-09-22

  • 01:30 current allocations are slightly underpowered, but still at ~50/hr stdsci, ~200/hr stk without much swapping for LAP w/o nightly science. During the little nightly science, reached ~100/hr (~60 LAP, ~40 nightly)
  • 11:33 Bill: Things look like they are going along relatively smoothly, except ippc25 has a ppStack process that is using 74.2 GB of virtual memory! stack_id 1461976, z band skycell.0861.004 in Sagittarius. I put it out of its misery. The i band stack has zillions of stars. The stacks of nearby M22 (skycell.0861.003) don't look very good.
  • ippc25 also has a psastro working on writing an smf with a million stars in it (exp_id 193786). Last year's LAP processing had 5 million! Strangely, this exposure was apparently never processed by nightly science, nor were any of the other 3PI exposures from that night.
  • 13:00 MEH: noticing a dip in available disk space around 9am today, ~15TB. Did someone take a node out?
  • 15:00 MEH: before the daily restart of stdscience, trying to push stdscience throughput: +1 c3, +2 stare (42) -- these would likely be more useful in stk (at 85 nodes right now), but stare RAM cannot handle many large stack peaks
  • 18:00 MEH: stdsci seems to be pushing ~60/hr now, adding another wave3 in (oddly wave1 normally has 6x while wave3 only 4x, even though wave3 has more RAM) -- adds another 14 to stdsci for a total of 555. Manually adding these allowed 1 job to run on ipp037, but I killed the ppImage running there.
  • 18:45 stdsci has trouble keeping 200 runs loaded even with poll=600, restarting. Manual setup this round:
    • stdsci: (add slowly to avoid startup spike with set.poll)
      • -1 w1
      • -3x ipp028
      • +2 stare
      • +1 c3
      • +2 stare
      • +1 w3 (make sure ipp037,041,045 don't run any jobs)
      • 555 running nodes
    • stk: losing ground, try adding another compute3 in; all of these are already set up
      • -2 stare
      • -1 c
      • -2 c2
      • (+1 c3)
      • 118 running nodes

Sunday : 2012-09-23

  • 15:00 MEH: stdsci struggling to keep over 200 when poll set to 600. Daily restart of stdsci with the same manual setup from yesterday (still haven't tested with a full nightly science run yet). Will also clean out some 30-50ks chip/cam runs.
    • appears ipp020,063 have some remote mounts that are borked.. looks to have also stalled registration of the darks from this morning.. Don't want to reboot on a weekend, so trying to take them out of processing, clear their jobs, and restart registration...
    • 16:15 seems to be back up and running, 9/24 darks in progress..