PS1 IPP Czar Logs for the week 2013.12.09 - 2013.12.15


Monday : 2013.12.09

  • 17:50 EAM : light processing today (Chris cleared an errored-out stack). I restarted the gpc1 database and rebooted ipp007 to clear NFS problems, with no trouble. I tried to do the same for stsci13, but it hung during the reboot. Haydn will power-cycle it tomorrow since we do not have console power access. stsci06 also needs to be rebooted, but I am going to wait until stsci13 is back up. I restarted processing with stsci13 and stsci06 in neb repair.

Tuesday : 2013.12.10

  • 00:30 MEH: email from Serge that the system is not running -- processing status: only 2 MD01 chips done, and those stuck faulted
    240 summit exposures since 2013-12-10
    82 incomplete downloads
    102 exposures copied but not registered
    • as logged in the czarlog over the weekend, ipp057, 063, 050 were out of processing due to lock problems; they went back into processing and lock problems ensued.
    • looks like ipp031, 036 may be problematic as well; out they go.
    • ipp031, 032 may be behaving oddly; missing from the nebulous disk-use plot.
    • czarpoll and roboczar were stopped on ippc11 -- restarted
    • ipp015, 011 out as before as well; add ipp054 to the list of hosts out of processing
  • 01:50 registration caught up, thanks to the bad weather; processing moving w/o problems so far
  • Bill 11:28 : Need to rerun SAS.20131205 skycal; pantasks is running out of ~bills/sas.32. Should get finished pretty quickly
  • 11:57 : stsci06 needs to be rebooted: not locking files. I'm stopping processing to reboot a number of machines: ipp031, ipp036, ipp050, ipp057, ipp063, ipp011, ipp015, stsci06
    • ipp036 was running neb-replicate, but since file locking was failing, it probably was not getting far. It was still on --min 0
    • ipp050 reports a failed battery on its RAID
    • done with ipp031 : no problems, glock works again
    • done with ipp036 : no problems, glock works again (12:31)
    • done with ipp050 : no problems, glock works again
    • done with ipp057 : no problems, glock works again (12:45)
    • done with ipp063 : no problems, glock works again (12:51)
    • done with ipp011 : required a power-cycle, glock works again
    • stsci06 hung on reboot (just like stsci13 yesterday). power cycle still does not work for me, so I sent a message to Gavin (13:17)
      • stsci06 rebooted by Gavin (14:07)
    • done with ipp015 : no problems, glock works again (14:20)
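The "glock works again" checks above were presumably done with the IPP glockfile utility, whose invocation is not recorded in this log. As a generic stand-in, file locking on a (possibly NFS-mounted) directory can be sanity-checked with util-linux flock; TESTDIR below is a hypothetical mount point, not an IPP path:

```shell
#!/bin/sh
# Generic file-locking sanity check using util-linux flock as a
# stand-in for glockfile (whose exact invocation is not shown here).
# TESTDIR is hypothetical; point it at the mount being checked.
TESTDIR=${TESTDIR:-/tmp}
LOCKFILE="$TESTDIR/lock-test.$$"

# Try to take a nonblocking exclusive lock and run a command under it.
if flock -n "$LOCKFILE" -c 'echo lock acquired'; then
    echo "locking OK on $TESTDIR"
else
    echo "locking FAILED on $TESTDIR"
fi
rm -f "$LOCKFILE"
```

On a host with a wedged lock daemon, the flock call hangs or fails instead of printing "lock acquired", which matches the symptom that prompted these reboots.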
  • 14:45 : EAM : launched rsync jobs to return stsci copies of ipp031 & ipp032 data to those machines, removing them from processing.

Wednesday : 2013.12.11

  • 08:30 email from Serge, no data processed last night
    Current Time 2013-12-11 18:52:14
    o6637g0671d  2013-12-11 17:06:42 01:45:32 ago
    673 summit exposures since 2013-12-11
    625 exposures copied but not registered
    first exposure:  o6637g0001d 685258
    last exposure:   o6637g0671d 685930
  • 09:50 Rita rebooting ipp018 to swap a drive -- ipp018 manually out of processing and neb-host down
    • leave it out of processing and in neb repair so as not to detract from the RAID rebuild
  • 12:00 MEH: unsuccessful in tracing through a way to drop the bad exposures to un-stick registration/burntool
    -- exposure c6637g0003b (685306) -- set rawImfile and newExp to drop
    regtool -dbname gpc1 -updateprocessedimfile -set_state drop -exp_id 685306 -class_id 
    pztool -dbname gpc1 -updatenewexp -exp_id 685306 -set_state drop
    -- exposure o6637g0048d (685307) -- actually a BIAS and not a DARK -- set rawImfile and newExp to drop
    regtool -dbname gpc1 -updateprocessedimfile -set_state drop -exp_id 685307 -class_id 
    pztool -dbname gpc1 -updatenewexp -exp_id 685307 -set_state drop
    -- o6637g0049d (685305) -- also set newExp and rawImfile to drop, but it is probably okay
    pztool -dbname gpc1 -updatenewexp -exp_id 685305 -set_state drop
    -- oddly, o6637g0047d -- a DARK -- has been burntool'd and has burntool state -14; maybe it also needed to be dropped?
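The drops above repeat the same pztool command per exposure; a dry-run loop (printing rather than executing, since pztool is cluster-local) reduces copy-paste errors for the three exp_ids:

```shell
#!/bin/sh
# Dry run: print the pztool drop commands for the bad exposures noted
# above (exp_ids 685306, 685307, 685305). Remove the "echo" to run them
# on a node with the IPP tools installed; the matching regtool
# -updateprocessedimfile calls also need the per-chip class_id filled in.
for exp_id in 685306 685307 685305; do
    echo pztool -dbname gpc1 -updatenewexp -exp_id "$exp_id" -set_state drop
done
```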
  • 12:09 CZW: A bad exposure confused regtool. Added a define.sunset macro in the registration pantasks to set dateobs_begin after this bad exposure, so regtool is no longer confused.
  • 12:30 MEH will take over as czar since Heather on travel
  • 13:10 MEH: WSdiffs for 3PI are under a more moderate time crunch; removing the ThreePi.WS.nightlyscience label until nightly is finished, and will run tweak_ssdiff then as well.
    • LAP.ThreePi.20130717 label also out, now that stacks are caught up
  • 17:30 MEH: ssdiff started so it will be finished by the time the stdsci restart has to happen; base nightly should be caught up in ~1 hr, before the new night starts. WSdiff will be added back after the restart
  • 18:30 MEH: stdsci restarted --
    • LAP and WSdiff out for a while longer
    • ipp018, 031, 032 manually out of processing
    • slowly neb-host up the systems that were put down for lock problems but have now been rebooted -- stsci13, stsci06, ipp011, 063, 057, 058, 015 OK -- ipp023 still has a lock problem --
    • unclear whether ipp007, 055, 061, 065 should be put back up from repair
    • ipp018 waiting for raid rebuild to finish
  • 20:45 MEH: caught up with last night's nightly except WSdiff; adding the label back in. Tonight's nightly proceeding
  • 21:30 MEH: MOPS stamps arriving; adding 1x c3 to pstamp to help push them through, since it is behind

Thursday : 2013.12.12

  • 09:40 EAM : nightly processing is mostly done except for WS nightly science. Looking through the logs and such, I'm finding a couple problems:
    • the query for WS diffs is taking way too long for the label CNP. Here is the basic query:

difftool -definewarpstack -good_frac 0.2 -warp_label CNP.nightlyscience -stack_label CNP.refstack.20110318 -set_dist_group CNP -set_label CNP.nightlyscience -set_workdir FOO -set_reduction WARPSTACK -available -dbname gpc1

I suspect the list of things to match is too large (CNP has many exposures).

  • we had a large number of glockfile problems from various machines to ipp023. I will reboot it this morning.
  • 10:15 EAM : a bunch of hosts have been out of nebulous for a while because of things like locks and low space. We have started to clear out some space generally, so I am putting as many of these as possible back into the system. Distributing the write load is probably helpful at this point.
  • 10:30 EAM : rebooted ipp023 to clear lock problems.

Friday : 2013-12-13

  • CZW 16:56 I set ipp041 and ipp040 to repair in nebulous. ipp041 was (and still is even in repair) throwing "nfs: server ipp041 not responding, still trying"/"nfs: server ipp041 OK" errors all across the cluster. After ipp041 was in repair, ipp040 had a load spike, so I set it in repair as well.
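The log names the neb-host tool and the states up/down/repair, but not its exact syntax; assuming an argument order of host then state (an assumption, not confirmed by this log), setting the two hosts to repair could be sketched as a dry run:

```shell
#!/bin/sh
# Dry run: print the presumed neb-host commands for the two hosts set
# to repair above. The "host then state" argument order is an assumption;
# remove the "echo" only after checking neb-host usage on the cluster.
for host in ipp041 ipp040; do
    echo neb-host "$host" repair
done
```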

Saturday : YYYY.MM.DD

Sunday : YYYY.MM.DD