PS1 IPP Czar Logs for the week 2014.09.01 - 2014.09.07


Monday : 2014.09.01

  • 14:30 EAM : stsci12 was down per Mark's message -- unresponsive on console. I power-cycled it and it came back OK.
  • 22:00 EAM : ipp036 was down with no message on the console; power-cycled it (power-cycle sketch below).
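
For reference, a minimal power-cycle sketch, assuming the nodes expose IPMI BMCs; the BMC hostname, user, and password below are hypothetical placeholders, not the actual ops values:

    # check whether the BMC still answers, then power-cycle the wedged node
    ipmitool -I lanplus -H ipp036-ipmi -U ADMIN -P changeme chassis status
    ipmitool -I lanplus -H ipp036-ipmi -U ADMIN -P changeme chassis power cycle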

Tuesday : 2014.09.02

  • 10:15 MEH: nightly rate low and net i/o <1G even after Gene turned off local PV3 processing, so adding 2x c2 into stdsci (at least to recover the systems that had been turned off)
  • 13:10 MEH: nightly not likely to finish before cleanup triggers @2pm; turning off the cleanup pantasks until nightly finishes so as not to bog down the 5 data nodes further
  • 14:10 ipp040 being taken down for PS swap -- Gene set pantasks to stop and did a control reap
    • MEH: neb-host down for ipp040
    • MEH: oddly, unwant was also 5; returned back to 20 when started up again
    • MEH: w/ ipp040 down, finding missing flats on ippb02 (see the check sketched after this list):
      /data/ippb02.1/nebulous/04/c5/1024845079.gpc1:flatcorr.20100124:GPC1.FLATTEST.305:GPC1.FLATTEST.305.XY15.co.fits
      
    • MEH: oddly, hosts normally kept off are back on in stdsci (ipp032, 060, 064), as well as the more recently disabled ones to watch for crashes (047, 034, 035, 036), including the stsciXX nodes -- so maybe a better cmd to use than the one below after a control reap all? -- or need to do a normal restart of stsci instead?
      control run all
      
  • 15:15 MEH: stdsci behaving badly, books full of junk; restarting.
  • 17:15 MEH: summitcopy all PENDING -- had a reap been done? not logged -- so just restarting it to start the nightly downloads
  • 18:30 MEH: suspect a possible problem/imbalance issue w/ ipp071 -- putting it into repair seems to improve processing, so may leave it out for some of the nightly and see
  • 19:10 MEH: nightly finally finished; diffims seemed to have been jumbled and many were published towards the end. putting in the WS labels until tonight's nightly starts, and turning cleanup on to watch how things go with ipp071 in repair.
    • generally higher loads on ipp067-ipp070 when ipp071 is in repair, ipp069 more so, as it has just slightly less than ipp070, so more is put there? in any case it seems able to handle it better. maybe 2-3TB should be shuffled to ipp071 to see if that helps?
      • have ~2.4TB of MD stack tests on ipp060 that should be moved, so will try that
      • or maybe just leave in repair and start rsync of ipp047?
      • or maybe 0.5-1TB off the stsci nodes for addstar space?
  • 20:00 MEH: will try turning PV3.ipp processing back on until nightly starts to see how that goes
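
A minimal sketch of the missing-flat check noted under the 14:10 item above, assuming plain shell access on a host with the data volumes mounted under /data; the loop and pattern are illustrative, not the actual ops procedure:

    # is the flagged replica still present on ippb02, and is it readable?
    ls -l /data/ippb02.1/nebulous/04/c5/1024845079.gpc1:flatcorr.20100124:GPC1.FLATTEST.305:GPC1.FLATTEST.305.XY15.co.fits

    # hunt for any other copy of that flat across the mounted volumes (slow; illustrative only)
    for vol in /data/*.?; do
        find "$vol/nebulous" -name '*GPC1.FLATTEST.305.XY15.co.fits' 2>/dev/null
    done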

Wednesday : 2014.09.03

  • 07:10 MEH: ipp047 didn't last very long, down ~6am
  • 08:50 MEH: restarted NFS on ipp071 and set neb-host up to see if that flushed the problem -- the ipplanl/stdlocal pantasks are well past needing a restart and underloaded, so not a good test (too irregular) -- so will just run w/ ipp071 in repair tonight if there is data
  • 10:10 MEH: pstamp has been running for a while, restarting it
  • 13:55 MEH: possible update problems expected if files are on disks w/ no free space (e.g., stsci11.1) -- see the disk-space check sketched after this list
  • 17:45 MEH: ipp071 slowing processing seems to be due to NFS connections being lost/recovered on many systems, most extreme on the ippc0x systems
  • 19:20 MEH: forgot to add x.WS.nightlyscience back in today, will need to do it in the morning before cleanup -- the sample before nightly shows the NFS restart on ipp071 did not help
  • 20:15 MEH: ipp071 in repair and processing rate well improved
  • 21:45 MEH: ipplanl/stdlocal processing stalling again, set it to stop to monitor the nightly rate without it
    • 20 exposures behind in downloading, ~10/hr loss?
  • 00:30 only ~22 exposures behind; most processing cleared during colddunkins
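
A minimal sketch of the disk-space check behind the 13:55 note, assuming the data volumes are mounted under /data on the host running it; the 100 GB threshold is an arbitrary example:

    # list any data volume with less than ~100 GB free; files landing on these
    # (e.g. stsci11.1) are the ones likely to hit update problems
    df -P /data/*.? 2>/dev/null | awk 'NR>1 && $4 < 100*1024*1024 {print $6, $4 " KB free"}'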

Thursday : 2014.09.04

  • 06:50 MEH: nightly will finish ~7am now, for ~570 exposures. ipplanl/stdlocal set back to run
  • 08:25 MEH: putting WS.nightly back in -- will be trying different ipp071 (068-070) repair/up alterations (see the neb-host sketch after this list)
  • 12:55 MEH: ipp047 down for ~1ks -- already neb-host down and by default out of processing
  • 14:20 MEH: rebuilt the ops tag ipp-20130712 for the WS trange -- restarted stdsci, taking ipp034, 035, 036 partially out of processing as they have been trouble in the past week
  • 19:10 MEH: ipp071 still up for a bit to monitor the rate again, then likely back to repair
  • 22:50 MEH: very little data, closed for high humidity -- setting tweak_wsdiff to run now and looking at rates before the humidity drops again
  • 23:50 MEH: ipp071 to repair in case of nightly data
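
A minimal sketch of the repair/up alternation from the 08:25 note, assuming neb-host is invoked as 'neb-host <host> <state>' as in the entries above; the exact argument order and the semantics of 'repair' (readable but not targeted for new files) are assumptions, so check the tool's help before relying on this:

    # take ipp071 out of the target pool for new files but keep existing data readable
    neb-host ipp071 repair

    # later, return it to full service
    neb-host ipp071 up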

Friday : 2014.09.05

  • 23:50 MEH: seeing what appears to be a similar issue w/ ipp069 now while ipp071 is in repair -- slower processing, very high load on ipp069, and many "server ipp069 not responding" NFS messages on the apache ippc0x nodes (grep sketch below)
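
A minimal sketch of how those complaints could be counted, assuming the ippc0x nodes log NFS client errors to /var/log/messages and allow ssh from the czar host; the ippc01-ippc04 host list is just a guess at the ippc0x pattern:

    # count "server ipp069 not responding" syslog lines on each apache node
    for h in ippc01 ippc02 ippc03 ippc04; do
        echo -n "$h: "
        ssh "$h" "grep -c 'server ipp069 not responding' /var/log/messages"
    done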

Saturday : 2014.09.06

Sunday : 2014.09.07