PS1 IPP Czar Logs for the week 2015.02.09 - 2015.02.15

Monday : 2015.02.09

  • 08:35 EAM: stdscience and stdlocal both running slow, restarting both.
  • 19:30 EAM: summit is reporting clouds, wind, and other reasons for no data tonight. I've turned on the storage nodes. I've also added a macro to activate / de-activate the auto-off:
    pantasks: off <-- storage hosts will NOT automatically be turned off at night
    pantasks: on  <-- storage hosts WILL automatically be turned off at night

Tuesday : 2015.02.10

  • 08:00 EAM: restarted stdlocal
  • 16:00 EAM: stopped staticsky, rebuilt psModules with the sign fix for lensing parameters, started sas pantasks (1x x0,x1,x2,x3), restarted staticsky with 3x x0,x1,x3
  • 19:50 EAM: restarting stdlocal

Wednesday : 2015.02.11

  • 11:30 CZW: LANL DST day. I've stopped stdlanl for a restart.
  • 12:30 MEH: QUB has not been getting any of the OSS.WS in (g),r,i filters because it was not fully set up. doing so now; distribution may be busy for a while after a regular restart of the pantasks
  • 14:00 EAM: restarting staticsky (accidentally stopped when sas rerun finished). I'm also taking ippx001-ippx004 out of rotation to use for dvo tests of relastro
  • 20:10 EAM: stdscience is getting close to the limit; stopping everything under ~ipp to restart everything
  • 22:30 MEH: ipp077 down for >1hr? probably good to fix now rather than in the morning -- disk available but unable to log in, console shows kernel panic
    • turn off in processing.. set neb repair for night
    • no console prompt showing up.. needs a power cycle
    • console shows many errors stamped in the afternoon, but with the wrong date, so actual time unknown -- is this the OS disk again?
      <Feb/12 03:28 pm>[631239.579016] journal commit I/O error
    • ipp077-20150211-crash.log

Thursday : 2015.02.12

  • 14:45 : restarted stdlocal
  • 18:33 HAF: restart of stdsci/cleanup/pstamp/registration/summitcopy for start of night processing

Friday : 2015.02.13

  • 10:50 MEH: restarting pstamp, temporarily adding s4+s5 as a boost for a large number of QUB updates
  • 13:30 EAM: stopping & restarting stdlocal (>120k chip jobs)
  • 15:00 MEH: ipp010 looks to have crashed ~20 min ago
    • neb-host down
    • console only has <Feb/13 02:37 pm>ipp010 login: [3544735.219248] Disabling IRQ #78 -- no prompt, so power cycle
    • had mount issues after power cycle, restarted nfs a couple times and back up

Saturday : 2015.02.14

  • 07:30 EAM: restarted stdlocal
  • 23:50 MEH: ps1 open and taking data, turning off storage nodes in stdlocal -- auto-off set to on

Sunday : 2015.02.15

  • 08:50 MEH: manually reverted distribution fault so it doesn't get really stuck when diffs get cleaned tomorrow
    disttool -revertrun -dbname gpc1 -fault 2 -label OSS.WS.nightlyscience
  • 20:20 MEH: looks like ipp044 has been down for >1hr and has blocked all nightly processing that needs detrends
    • neb-host down -- so we can do nightly processing
    • will try to power cycle; if it doesn't come back or crashes again, will turn power off
    • reboots w/ keyboard error and then onto LiveCD? -- leaving power off
    • now begins the long cleanup mess..
    • will try to clean up some of the stalled chips in stdlocal as well, but really can't spend too much time on a Sunday night -- ~24 seemed to be stalled on XY65 but detrends should have been fine on stsci... last log entry on "burntool state vs burntoolStateGood : -14 vs 14". leaving as is because I need to work on other things