
Monday : 2012-11-05

  • Heather is czar (09:24). The IPP missed her - hundreds of exposures are stuck in registration... first attempt to unstick:
    src/ipptrunk-20120913/tools/regpeek.pl
    regtool -updateprocessedimfile -exp_id 544697 -class_id XY27 -set_state pending_burntool -dbname gpc1   
    

arrgh, sticking again:

regtool -updateprocessedimfile -exp_id 544700 -class_id XY27 -set_state pending_burntool -dbname gpc1
regtool -updateprocessedimfile -exp_id 544704 -class_id XY27 -set_state pending_burntool -dbname gpc1
regtool -updateprocessedimfile -exp_id 544732 -class_id XY27 -set_state pending_burntool -dbname gpc1
regtool -updateprocessedimfile -exp_id 544746 -class_id XY27 -set_state pending_burntool -dbname gpc1
regtool -updateprocessedimfile -exp_id 544751 -class_id XY27 -set_state pending_burntool -dbname gpc1
regtool -updateprocessedimfile -exp_id 544771 -class_id XY27 -set_state pending_burntool -dbname gpc1
regtool -updateprocessedimfile -exp_id 544869 -class_id XY27 -set_state pending_burntool -dbname gpc1
regtool -updateprocessedimfile -exp_id 544873 -class_id XY27 -set_state pending_burntool -dbname gpc1
regtool -updateprocessedimfile -exp_id 544924 -class_id XY27 -set_state pending_burntool -dbname gpc1
regtool -updateprocessedimfile -exp_id 544927 -class_id XY27 -set_state pending_burntool -dbname gpc1
regtool -updateprocessedimfile -exp_id 544936 -class_id XY27 -set_state pending_burntool -dbname gpc1
regtool -updateprocessedimfile -exp_id 544949 -class_id XY27 -set_state pending_burntool -dbname gpc1
regtool -updateprocessedimfile -exp_id 544951 -class_id XY27 -set_state pending_burntool -dbname gpc1
regtool -updateprocessedimfile -exp_id 544970 -class_id XY27 -set_state pending_burntool -dbname gpc1
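
Those unstick commands differ only in the exp_id, so a loop over the stuck IDs is quicker next time - a minimal shell sketch, assuming the stuck exp_ids are collected one per line in a file stuck_exp_ids.txt (hypothetical file name):

# reset each stuck imfile back to pending_burntool
while read exp; do
    regtool -updateprocessedimfile -exp_id $exp -class_id XY27 -set_state pending_burntool -dbname gpc1
done < stuck_exp_ids.txt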

  • 09:35 (Serge): Publishing nightlyscience data from 2012-11-02 and 2012-11-03 to IPP-MOPS_DEV (format version 3) for their tests.
  • 10:45 (heather): Gene rebooted ipp065; NFS was cranky - same machine as the burntool failures...
  • 11:30 (heather): Gene says ipp012 is also NFS-cranky - we have stopped everything and are waiting for the queue to clear out; then Gene will reboot ipp012 and Heather will rebuild ippconfig (requested by Chris)
  • 21:15 (MEH): rebooting ipp033 for Heather (see Ipp033-crash-20121105)
    • looks like the summitcopy/registration pantasks got into an odd state; going to restart them

Tuesday : 2012-11-06

Serge is czar. Heather is the Election czar.

  • 06:30 (Serge): Still 89 exposures to download from the summit, but things seem to be running more smoothly than yesterday.
  • 08:30 (Serge): All exposures successfully downloaded. A few 3PI exposures still have to be processed.
  • 09:45 (Serge): ns.add.date 2012-11-05
  • 10:15 EAM : remove ipp023 - ipp026 from processing for final rsyncs
  • 12:15 (Serge): Nightly processing for both previous nights complete

Wednesday : 2012-11-07

Bill is czar today

  • 10:00 Summit copy has fallen behind by about 140 exposures; it appears that some jobs are getting stuck waiting for glockfile to complete for files on ipp009. I've set ipp009 to repair in nebulous.
  • 10:26 restarted summit copy to clear the counts. We are getting a very high number of timeouts from the -pendingfoo modes; it seems to affect all the tools.
  • 10:50 shut down and power cycled ipp009 to clear up broken locks
  • 10:54 restarted stdscience, stack, and distribution pantasks. Set update to stop.
  • 11:10 ipp009 came up with no network. Gavin is investigating.
  • 11:35 Gavin swapped the ethernet interface on ipp009 and it is back online. Restarted czarpoll and roboczar.
  • 11:53 restarted update pantasks but removed ps_ud_MOPS.2 so that MOPS doesn't compete with itself.
  • 12:56 Now ipp012 is being a pest. Many jobs are stuck waiting for glockfile to finish, and there are 64 rpc.statd processes on the machine. Setting it to repair in nebulous
  • removed LAP and MD05.GR0.20121030 labels from stdscience. After nightly processing is done we need to shut everything down and check on the status of things.
  • 13:30 Added LAP and MD05 labels back in to stdscience.
  • 13:45 After taking ipp012 out of nebulous, no jobs remained stuck waiting for it; they eventually cleared out. There are still 64 rpc.statd processes (a quick check is sketched at the end of today's entries), so I'm going to reboot it.
  • 14:06 ipp012 is back up. Set to up in nebulous but off in stdscience for now.
  • 14:41 ipp012 is blocking processing again; its rpc.statd is running at 100% CPU. Set to repair in nebulous.
  • 14:45 Set ThreePi camera and diff stage dist runs from 2010-2011 to be cleaned
  • 15:30 MEH: restarting stdscience to set up the new MD03.refstack; used tweak_ssdiff, and WS+SSdiff is running
  • 18:34 summit copy is already behind. Added another set of hosts.
  • 18:45 MEH: set summitcopy to stop; it was overloading the summit. Not sure which hosts were added - looked like maybe wave3+4? - so set those hosts off and turned summitcopy back on. The summit can only "handle" ~30 active downloads, IIRC.
  • 19:30 HAF: emails from the summit regarding summitcopy - restarted summitcopy before checking the logs... all is well now.
  • 23:34 HAF: emails from Bill - we are falling behind on summitcopy. Various discussions on this, but no clear answers. We are allowed to have 30 connections but only have 24 at the moment. I turned on a few hosts (4) to see if that makes it better (?). It's currently at 29 incomplete exposures (and rising).
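  • (note) a quick way to reproduce the "stuck on glockfile" and rpc.statd observations above - a minimal sketch using standard Linux tools, assuming passwordless ssh to the node:
    # list any processes still waiting on glockfile, then count and inspect rpc.statd
    ssh ipp012 'ps auxww | grep [g]lockfile'
    ssh ipp012 'pgrep -c rpc.statd'
    ssh ipp012 'ps -C rpc.statd -o pid,pcpu,etime --no-headers | sort -k2 -rn | head'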

Thursday : 2012-11-08

Bill is czar again today

  • 07:00 Just after 00:26:57 HST ippc17 crashed. There was nothing on the console and nothing in /var/log/messages. Power cycled it. It took a while to come up because it forced fsck checks, since they hadn't been done in over 200 days (see the note at the end of today's entries). Perhaps the crash was triggered by the massive distribution cleanup that I have going.
  • 08:02 restarted publishing pantasks. Doubled the number of hosts working on it since it is behind because of the ippc17 crash (no data store)
  • 08:33 set neb-host ipp009 to up. Hang on...
  • 09:40 after consulting with Gene, set stsci00.X back to the up state. Also set ipp040, ipp041, and ipp018 to up
  • 09:49 almost immediately started getting "cannot lock file on ipp018" errors, so set it back to down.
  • 09:54 turned wave3 hosts on in update to help with the postage stamp backlog
  • 11:08 recovered missing raw file neb://ipp015.0/gpc1/20100603/o5350g0053o/o5350g0053o.ota74.fits found on ippb02
  • 13:45 MEH: turning off cleanup until this is resolved: MD03.20121031 has cleaned warps but not cleaned chips, which are needed in order to do the missing WS diffims... 20121101 was also missing; updated, and the WS diffims finished for both. Oddly, 20121103 and 20121107 seem to all be under the same data_group as 20121102
    • 14:55 cleanup back on
  • 14:20 Serge: Stopped publishing
  • 14:50 Serge: Restarted publishing
  • 15:19 Stopped stdscience in preparation for daily restart. Restarted at 15:29
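  • (note) on the slow ippc17 reboot: ext3/ext4 filesystems force an fsck once the mount-count or time-since-last-check limit is exceeded; the limits can be inspected (and staggered across machines) with tune2fs. A sketch only, assuming the root filesystem is on /dev/sda1 (hypothetical device name):
    # show mount count, last check time, and check interval for the filesystem
    tune2fs -l /dev/sda1 | grep -Ei 'mount count|last checked|check interval'
    # optionally stretch the time-based interval to 6 months (run as root)
    tune2fs -i 6m /dev/sda1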

Friday : 2012-11-09

Mark is czar

  • 07:00 MEH: nightly data processed except maybe a few publishing jobs.
  • 08:45 or rather, all publishing jobs. Restarted publishing; jobs are still stuck. 11 finished in >5 minutes - it seems to be slowly moving.. what has changed?
  • 08:50 Serge: Queued a bunch of warp-stack OSS diffs with the new recipe for MOPS (label: OSS.J.TEST)
    difftool -dbname gpc1 -definewarpstack -good_frac 0.2 -warp_label OSS.nightlyscience -data_group OSS.20121107 -stack_label ecliptic.rp -set_label OSS.J.TEST -set_data_group OSS.J.TEST.20121109 -set_workdir neb://@HOST@.0/sch/OSS.J.TEST.20121109 -available -set_reduction SWEETSPOT_WS -set_dist_group NODIST -rerun -simple
    
  • 11:25 MEH: when checking the remapped mounts, found some systems stalling on automount for more than just the remapped ipp023,024,025,026 system disks.
    • ipp023,024,025,026 taken out of wave2 processing set.
    • many had a long delay (~3 minutes) in automounting other data disks: ipp006, 007, 010, 011, 012, 013, 014, 015, 016, 017, 018, 020, 028, 038, 040, 033, 034, 035, 039, 042, 043, 044, 045, 046, 047, 048, ippc01,02,06,c22 (a timing check is sketched at the end of today's entries)
  • 12:10 MEH: finally looks like all mounts are go.. restarting all pantasks
    • MD SSdiffs were missed in the morning; used ~ipp/stdscience/tweak_ssdiff to run them now
  • 12:40 MEH: everything is stopped while Gene fixes some of the symlinks -- back online now
  • 13:20 MEH: stopping processing again to build Bill's fix to the MJD time -- back online
  • 15:30 MEH: chip.off while Gene is fixing some permissions on the swapped data disks (ipp024 in particular); this will also help catch up on the huge backlog of LAP warps
  • 18:00 MEH: chip.on for nightly science; many dirs are still left to chown, so there will be misc faults from LAP and other reprocessing for now
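  • (note) to spot which automounts are slow, timing a listing of each data disk works - a minimal sketch, assuming the disks automount under /data/<host>.0 (path convention assumed here):
    # anything that takes minutes instead of milliseconds stands out
    for h in ipp006 ipp007 ipp010 ipp011 ipp012 ipp023 ipp024 ipp025 ipp026; do
        echo -n "$h: "; /usr/bin/time -f '%es' ls /data/$h.0 > /dev/null
    done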

Saturday : 2012-11-10

  • 09:00 MEH: chip.off again to push through LAP warps
  • 13:05 doing regular restart of stdscience
  • 23:00 LAP and off/on nightly science processing suddenly took a 50% rate hit for some reason, down from 100 to 50..

Sunday : 2012-11-11

  • 10:00 chip.off again to push more warps through
  • 17:20 will do regular restart of stdscience and have chip.on for nightly science.