PS1 IPP Czar Logs for the week 2014.06.23 - 2014.06.29

(Up to PS1 IPP Czar Logs)

Monday : 2014.06.23

  • 20:30(EDT) MEH: restarting pantasks; it has been running a while and some stamps are needed quickly

Tuesday : 2014.06.24

  • 10:00(EDT) MEH: compute3/c2 nodes mostly idle, so using them for staticsky for a couple of hours. Also kicking registration rather than waiting for it to revert and un-stall itself.

Wednesday : 2014.06.25

Thursday : 2014.06.26

  • 07:55 (EDT) MEH: registration 170 exposures behind, kicked and moving again
    crash for: register_imfile.pl --exp_id 760419 --tmp_class_id ota47 --tmp_exp_name o6834g0217o --uri neb://ipp030.0/gpc1/20140626/o6834g0217o/o6834g0217o.ota47.fits --logfile neb://ipp030.0/gpc1/20140626/o6834g0217o.760419/o6834g0217o.760419.reg.ota47.log --bytes 23063040 --md5sum 5d3046529d950808f4a0b6d8a1f15b66 --sunset 03:30:00 --sunrise 17:30:00 --summit_dateobs 2014-06-26T08:41:43.000000 --dbname gpc1 --verbose
    job exit status: CRASH
    job host: ipp030
    job dtime: 8.739945
    job exit date: Wed Jun 25 23:07:27 2014
    hostname: ipp030
    
    -- had to manually run (a general sketch of this recovery follows the entry):
    regtool -updateprocessedimfile -exp_id 760419 -class_id XY47 -set_state pending_burntool -dbname gpc1
    
    
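    For reference, a minimal sketch of the manual recovery used above; the regtool flags are exactly the ones from the log entry, and the shell-variable wrapper is only for illustration:
    
    # hedged sketch: reset a crashed registration imfile and let registration retry
    # (exp_id/class_id below are the ones from the o6834g0217o crash; substitute as needed)
    exp_id=760419
    class_id=XY47
    regtool -updateprocessedimfile -exp_id $exp_id -class_id $class_id \
        -set_state pending_burntool -dbname gpc1
    # then re-kick the registration pantasks so it picks the imfile up again
    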
  • 09:10(EDT) MEH: still far behind in processing; compute3 nodes are idle, so throwing them into processing for a few hours
  • 11:25(EDT) MEH: stdsci pantasks struggling to stay loaded, needing regular restarts; restarting the others as well
  • 12:00(EDT) MEH: once nightly finished, using idle compute3 nodes for small test MD.PV3 processing

Friday : 2014.06.27

  • 02:40(EDT) MEH: ipp030 having issues with becoming unresponsive/high load; turning it off in processing
  • 18:10(PDT) MEH: Haydn looking into the ipp030 crash around 0710(HST) this morning. -- setting neb-host down and removing it from pantasks_hosts.input (rough sketch at the end of this day's entry)
    • some stdscience jobs probably need to be cleared, and might as well restart stdsci
      -- both have files on ipp030, so they won't finish until it is back up
      system failure for: diff_skycell.pl --threads @MAX_THREADS@ --diff_id 570810 --skycell_id skycell.0615.037 --diff_skyfile_id 33349104 --outroot neb://stsci17.1/gpc1/ThreePi.nt/2014/06/27/RINGS.V3/skycell.0615.037/RINGS.V3.skycell.0615.037.dif.570810 --redirect-output --run-state new --inverse --dbname gpc1 --verbose
      
      system failure for: diff_skycell.pl --threads @MAX_THREADS@ --diff_id 570810 --skycell_id skycell.0615.037 --diff_skyfile_id 33349104 --outroot neb://stsci17.1/gpc1/ThreePi.nt/2014/06/27/RINGS.V3/skycell.0615.037/RINGS.V3.skycell.0615.037.dif.570810 --redirect-output --run-state new --inverse --dbname gpc1 --verbose
      
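      A rough sketch of the "take ipp030 out of processing" steps above; the neb-host invocation is a placeholder (exact syntax not shown in this log) and the host-list edit assumes the list lives in pantasks_hosts.input as named above:
      
      # hedged sketch: pull ipp030 out of processing (syntax partly assumed)
      # 1. mark the nebulous host down so nothing new is written to it
      #    (placeholder form; check the actual neb-host usage):
      #      neb-host ipp030 down
      # 2. comment ipp030 out of the pantasks host list so no more jobs land there:
      sed -i '/ipp030/s/^/# /' pantasks_hosts.input
      # 3. restart the stdscience pantasks; the stuck diff_skycell jobs above can only
      #    finish once ipp030 (and its files) are back up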

Saturday : 2014.06.28

Sunday : 2014.06.29

  • 00:50(PDT) MEH: looks like ipp020 has crashed.. set neb-host down, but getting an error trying to log into the ipp020 console, so can't help fix this
  • 05:00 HAF: summitcopy jammed up - investigation shows nebulous is hanging; restarting the apaches on c01-c10 seems to have fixed nebulous (rough sketch at the end of this day's entries). Keeping an eye on summitcopy (it seems to be going now?)
  • 9:00 HAF: registration jammed up (of course, after I had unstuck a few and gone to bed). I restarted registration and fixed the confused ones. Ugh. Still watching it..
  • 10:30 HAF: some stdsci processing is jammed up from ipp020 + neb being stupid - I'm neb-rm --moving the offending files out of the way.
  • 11:00 HAF: cam stage had a persistent faulter - chased it down to a chip that was inaccessible (on ipp020) -- reran the chip to recreate the files, which cleared up the camera stage fault.
  • 11:00 HAF: FYI, we are *still* downloading chips from last night, due to everything being broken. The czar is tired...
  • 18:15 HAF: the czar has continued unsticking anything that landed on ipp020. Ugh. The majority of things are now unstuck, with the exception of a handful of diffs. Heather is tired. It is getting dark. I'm done for the day.
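    
    For reference, a rough sketch of the apache restart from the 05:00 entry; the c01-c10 host names are from the log, while the ssh access and the apache init-script path are assumptions about the cluster setup:
    
    # hedged sketch: bounce the apaches backing nebulous on c01-c10
    # (assumes the nodes answer to these names and apache is managed by the standard
    #  init script; adjust to however the cluster actually runs apache)
    for h in c01 c02 c03 c04 c05 c06 c07 c08 c09 c10; do
        ssh $h 'sudo /etc/init.d/apache2 restart'
    done
    # then check that summitcopy and registration start moving again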