PS1 IPP Czar Logs for the week 2013.10.28 - 2013.11.03

Monday : 2013.10.28

  • 12:00 heather: restarted stdsci - it had sine waves on ganglia
  • 19:58 Bill: is running a pantasks_server on ipp064 out of ~bills/sas.30 running sas.30 staticsky. Once one set of compute3 nodes goes to off state in deepsky I will add one set of compute3 to the pantasks.
  • 21:00 EAM : rebooted ipp041 as it has been having some NFS oddness
  • 21:04 Bill: added compute3 to my sas.30 pantasks. Set ippc32, 33, 34, 35, 38, 40, 42, 45, 49, 58, and 62 to off since they are still working on stacks
  • 23:30 MEH: doesn't look like there will be much nightly, fixing LAP cam+warps that have languished a week to finish the stacks..

Tuesday : 2013.10.29

  • 09:45 Bill: queued SAS.20131030 (EXTENDED_SOURCE_FITS_POISSON = FALSE)
  • 11:20 MEH: regular restart of stdsci
  • 12:29 Bill: set compute3 to off in stdscience to use for sas.30
    • 12:59 switched sas.30 to 2 x wave4 + 1 x compute2 to avoid memory problems with deepstacks
    • put compute3 back into stdscience
    • stack is stopped as I am using the hosts
  • 16:55 MEH: ipp061 nfs/mounts have been stuck for a while.. can't even access own export disk. restarted nfs; working again and backlogged jobs clearing
  • 18:07 Bill: SAS.30 processing is done. Restarted stack pantasks
  • 18:30 MEH: went through a large number of warps, doing a regular restart of stdsci before nightly

Wednesday : 2013.10.30

Bill is czar today

  • 04:50 restarted pstamp and update pantasks
  • 11:45 MEH: stdsci needs regular restart, also turning MD03,04 diffs with new refstacks -- tweak_ssdiff to get the MD03 marginal from last night
  • 14:50 MEH: need to get through backlog of chip cleanup, other stages off
  • 18:30 MEH: again >100k in warps, restarting stdsci before nightly starts
  • 18:36 Bill: restarted registration and summit copy pantasks (their pcontrols were spinning)
  • reverted faulted chip after copying valid
    /data/ippb03.1/nebulous/42/5a/2169045416.gpc1:20120421:o6038g0140o:o6038g0140o.ota64.fits
    over corrupt /data/ipp043.0/nebulous/42/5a/2021195867.gpc1:20120421:o6038g0140o:o6038g0140o.ota64.fits
  • 20:20 MEH: looks like ipp061 is unhappy..
  • 20:30 Bill: killed hung summitcopy and registration processes on ipp061 and took it out of the host lists. Sent email to heather about overload by 96GB ippdvo java process
  • 20:51 java killed but ipp061 is still in a bad way. There are some stack processes there that are probably stuck forever. Tried restarting nfs but it still can't see some nodes. Machine probably needs a power cycle
  • 01:00 MEH: cleared md09 stack fault 5, stack_id 2881582

Thursday : 2013.10.31

Bill is czar today

  • 05:55 ipp038 has crashed. Nothing on the console. Attempted power cycle, but no response.
  • 06:25 set ipp038 to down in nebulous to prevent processes from looking for detrend files with instances there
  • 06:42 many processes are stuck and have been since 05:02 about the time ipp038 died. Setting stdscience to stop to make it clear which processes to kill
  • 07:24 old stdscience processes killed. stdscience restarted with poll limit 42
  • 07:30 Looks ok. poll limit 200
  • 07:43 several processes may be stuck. will wait a while longer
  • 08:17 All pantasks set to stop.
  • 08:34 letting things settle cleared out all stdscience processes except those on ipp061. Taking that out of processing (forgot this morning). Set all pantasks to run except for the stackers. Going to wait for a while to make sure things are stable before setting those to run again.
  • going out for a bit to get something to eat.
  • 09:26 restarted stack pantasks. Deepstack processes are proceeding slowly. stdscience processes are taking cycles, which is fine, but using memory, which is causing the ppStacks (some approaching 60GB ram) to swap out. Turning compute3 down in stdscience.
    • MEH: this has been fine in the past -- particularly since updates as the stdsci processing

Friday : 2013.11.01

Bill is czar today

  • 08:52 restarted stdscience pantasks. Left compute3 on.
  • mid morning: stopped processing so that Haydn could reboot stdsci14.
  • 1:45 ipp038 is back up. Set to repair in nebulous
  • 15:00 removed wave4 nodes from stdscience. They are being used to rerun sas.30 staticsky in a pantasks running from ~bills/sas.30
  • 15:28 things are starting to hang up. Setting stdscience and the stackers to stop
  • 15:33 Gavin is going to reboot ipp060-62 with the newer kernel
  • 16:45 everything is running again. We're still getting a significant number of faults. Now they seem to be failures to update the database due to deadlock or dropped connection.
  • 19:32 since rebooting ipp060-62 progress has been smoother. Still, roughly half of deepstack jobs have been failing. Currently 355 good, 323 fail.
    • large failure is expected. this is not just deep but other refstacks etc having to be manually killed for various reasons as well (ie counter in pantasks not really meaningful)
  • 22:18 reverted fault 2 stacks for label MD04.alldeeptest.20131010
    • thanks, but manually managing this and not necessary to revert
  • 22:41 sometimes it's better not to look. The postage stamp server is clogged because Niall has a request that wants the same data that LAP is trying to process. Unfortunately for Niall, LAP has a lot of data in the queue and his jobs aren't running. This has caused the server to stop because he has the dependency queue filled up due to time of request, but is deadlocked by LAP. Lowered the priority of the PSI.BIG label to be lower than WEB.BIG so (hopefully) another user can get their jobs requested. (HINT: small jobs have a lower chance of getting blocked)
    • yes that seems to be working
  • 22:52 ... except the MD labels having higher priority than LAP will block Niall's postage stamp jobs for approximately forever. will revisit in the morning.
    • not forever.. but a week or so left as defined priority processing by Ken -- maybe flushing the LAP queue is needed until then, or a suggested optimization I had off and on was MD at night and LAP chip-warp during the day to generate stacks that then make use of the mostly idle stack pantasks at night while STS wasn't being loaded.
  • 23:02 staticsky is nearly finished for sas.20131102. Added wave4 nodes to update since there is no data from the summit tonight.
  • 23:17 stdscience restarted with nominal host assignment

Saturday : 2013.11.02

  • 07:31 The updates required for pstamp req 312206 finished. Reset lap label back to 200. MD labels have priority
  • 10:10 setting stdscience to stop to prepare for daily restart.
    • processes are slow to finish. I think that there are problems with ipp041 again. Setting to down in nebulous so that processes won't try and use detrend files there.
  • 11:16 got stuck in topcat making plots. stdscience pantasks restarted
  • 22:00 MEH: looks like reg has stuck file on ipp065 for >1hr.. 28 behind. summitcopy, stdsci also -- looks like mount/nfs trouble?
    • 22:50 so going to restart nfs and take ipp065 out of processing -- multiple rpc.statd restarts required..
  • 23:55 MEH: MD01.pv2 has ~76 faults due to only good copies being on ipp041..

Sunday : 2013.11.03

  • 09:36 Bill: restarting stdscience. Set ipp041 to repair. It will probably take a little while for it to hang things up again. I'm going out so won't be watching.
  • 16:00 MEH: last bulk MD update started -- MD10.pv2.20131103, other field chips being sent to cleanup soon