Monday : 2013.09.23

mark is czar

  • 07:30 MEH: nightly done, regular stdsci restart
  • 09:30 also doing restart of summitcopy, registration, distribution, publishing, stack for the week
  • 20:44 MEH: looks like 2 exposures of STD obs_mode in OPEN filter.. (o6559g0084o, o6559g0085o) -- chip stage has no FLAT detrend and will set to drop in a bit

Tuesday : 2013.09.24

mark is czar

  • 07:4 MEH: nightly finished, regular restart of stdsci
  • 10:00 MEH: cleaning up some LAP stuff..
    -- fault 5
     stacktool -dbname gpc1 -updaterun -set_state drop -state new -label LAP.ThreePi.20130717 -fault 5
    -- fault 4 after several revert etc
     stacktool -dbname gpc1 -updaterun -set_state drop -state new -label LAP.ThreePi.20130717 -fault 4
    -- fault 2 -- broken warps.. some required chip updates
    perl ~ipp/src/ipp-20130712/tools/ --warp_id 742361 --skycell_id skycell.0740.096 --redirect-output --dbname gpc1
    chiptool -setimfiletoupdate -chip_id 791999 -set_label LAP.ThreePi.20130717 -dbname gpc1
    /home/panstarrs/ipp/src/ipp-20130712/tools/ --warp_id 748664 --skycell_id skycell.1442.095 --redirect-output
    chiptool -setimfiletoupdate -chip_id 795325 -set_label LAP.ThreePi.20130717 -dbname gpc1
    /home/panstarrs/ipp/src/ipp-20130712/tools/ --warp_id 751839 --skycell_id skycell.2247.076 --redirect-output
    chiptool -setimfiletoupdate -chip_id 783099 -set_label LAP.ThreePi.20130717 -dbname gpc1
    /home/panstarrs/ipp/src/ipp-20130712/tools/ --warp_id 739674 --skycell_id skycell.0739.019 --redirect-output
    --> chip_ids are >> the currently updating id, may want to use a different label to get them updated -- then will manually clean chips
  • 12:05 MEH: ps2iq overloading ippc18 with outputs, all processing and such stalled. we need a policy of not running intensive programs involving ippc18 (on ippc18 and/or to the homedir). if found, will be killed
  • 12:50 MEH: cleanup counters wacky, doing a restart
  • 17:40 MEH: to catch up warps and push stacks before restarting stdsci and nightly data
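
The two stacktool drops in the 10:00 entry differ only in the fault code. A minimal sketch of generating those command lines for a list of fault codes is below. It only prints the commands and runs nothing; the helper name drop_commands is hypothetical, and the flags are exactly those shown in the log entry above.

```python
import shlex


def drop_commands(label, fault_codes, dbname="gpc1"):
    """Build one stacktool command line per fault code (nothing is executed)."""
    commands = []
    for fault in fault_codes:
        # Flags copied from the log entry; only the fault code varies.
        argv = [
            "stacktool", "-dbname", dbname, "-updaterun",
            "-set_state", "drop", "-state", "new",
            "-label", label, "-fault", str(fault),
        ]
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands


if __name__ == "__main__":
    # Print the commands for review before pasting them into a shell.
    for line in drop_commands("LAP.ThreePi.20130717", [5, 4]):
        print(line)
```

Printing rather than executing keeps the review step manual, which seems appropriate for a state change like drop.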

Wednesday : 2013.09.25

Thursday : 2013.09.26

  • 07:38 Bill: Lots of timeouts for postage stamp tasks. Looks like a database problem. Also quite a few in the update pantasks. Restarting them both
  • 14:22 Bill: stdscience could use a restart. Set to stop

Friday : 2013.09.27

  • heather 9am: sine waves on ganglia, restarting stdsci
  • heather 12pm: something happened overnight to ipptopsps? Any idea? someone reboot a machine or something?

Saturday : 2013.09.28

  • 07:05: Serge reports that registration is backed up.
    • Bill looked at the registration pantasks and saw that the status page was a mess. Many many job timeouts. Restarted pantasks and burntool started running.
    • At about the same time Heather repaired the fault.
  • 07:18 Bill: stdscience needs to be restarted. Set to stop
  • 07:34: restart complete. Didn't wait for some LAP camera runs to finish. Removed lap label for a bit to avoid running those jobs twice.
  • 07:55: all back to normal
  • 10:06: Bill dropped a warp skyfile which repeatedly pops assertion failure with warptool -updateskyfile -skycell_id skycell.2008.031 -warp_id 842911 -set_quality 4 -fault 0
  • 14:00 MEH: stsci13 appears down around 1200.. forgot should've taken out of neb when noticed so can at least do nightly science tonight if no one around to reboot with the special raid issues
    • probably could restart stack, pstamp to clean stalled jobs -- will need to clean up hanging jobs and mounts.. MOPS will need PSS in morning so needs to be done if stsci13 not rebooted before then
  • 14:30 MEH: stacks cleared, but since many faults from warps only on stsci13...
  • 15:50 MEH: clearing enough update+pstamp so that MOPS can get their stamps for TODAY, some QUB ones still hanging around
  • 16:30 MEH: dumped and restarted cleanup.. so little space, need any we can recover at this point.. there will be stalled jobs, won't kill them and don't want to do -9 right now..
  • 17:10 MEH: in case no stsci13 before tonight, still trying to get things running. registration stuck on dark imfile -- cleaned and restarted registration and summitcopy
  • 17:15 MEH: will leave stdsci last as it will be a major PITA to clean out
  • 16:10 MEH: neb having all kinds of fun timeouts/faults in summitcopy, 900+ connections in nebdb.. time to start clearing all out before nightly starts..
  • 20:00 MEH: good thing bad weather, things still screwy.. LAP label to be removed from stdsci, clearing/killing and
    • won't/don't really want to kill unless have to, not sure what state in and issues will cause
  • 20:40 MEH: registration rawExp somehow borked for o6564g0013d.660484 and why not finishing 13d, loaded exp_id but NULL/UNKNOWN values for everything else.. cleared with regtool -dbname gpc1 -revertprocessedexp -exp_id 660484
  • 20:50 MEH: nightly darks finished downloading and registering, stack running, pstamp finishing up QUB, cleanup running, stdsci running w/o LAP label
    • also looks like a lot of somethings cleared accessing nebulous as well, now down to order 100 connections

Sunday : 2013.09.29

  • 11:00 MEH: looks like some MD09 chip/warps stalled in stdsci since midnight -- look to have completed but pantasks still hadn't gotten the message.. -- all on ipp056, cleared
  • 11:10 MEH: LAP stacks were kept running last night and mostly caught up with whatever warps available
    • now nightly done, LAP warps could be tried for whatever chips/cam files available
    • camera should probably remain off until jobs can be checked to avoid any strange update cases
  • 13:20 MEH: ipp056 having some mount issues, taking out of processing
    • LAP chips running through queue whatever possible -- camera still off, looking at sample and see some have new SMF some have just one old SMF...
  • 14:20 MEH: LAP label out again, restoring stdsci back for nightly science tonight
    • for the record, the systems and cam_id in limbo --
      ------ ippc19 ippc22 ------ ipp 22415 1 0 Sep28 ? 00:00:11 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o5296g0048o.155940 --cam_id 507044 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o5296g0048o.155940/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
      ------ ippc19 ippc23 ------ ipp 16288 1 0 Sep28 ? 00:00:11 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o5272g0225o.148686 --cam_id 505071 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o5272g0225o.148686/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
      ------ ippc19 ippc25 ------ ipp 28791 1 0 Sep28 ? 00:00:08 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o6014g0191o.469525 --cam_id 507001 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o6014g0191o.469525/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
      ------ ippc19 ippc26 ------ ipp 13779 1 0 Sep28 ? 00:00:07 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o6061g0029o.487029 --cam_id 507827 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o6061g0029o.487029/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose 
      ipp 16911 1 0 Sep28 ? 00:00:07 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o5272g0226o.148687 --cam_id 505081 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o5272g0226o.148687/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
      ------ ippc19 ippc32 ------ ipp 1836 1 0 Sep28 ? 00:00:12 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o5649g0218o.318035 --cam_id 507049 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o5649g0218o.318035/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
      ------ ippc19 ippc34 ------ ipp 29621 1 0 Sep28 ? 00:00:10 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o5649g0207o.317968 --cam_id 505084 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o5649g0207o.317968/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
      ------ ippc19 ippc35 ------ ipp 5363 1 0 Sep28 ? 00:00:15 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o6000g0382o.464660 --cam_id 505123 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o6000g0382o.464660/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose 
      ipp 10691 1 0 Sep28 ? 00:00:09 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o5649g0224o.317984 --cam_id 505086 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o5649g0224o.317984/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
      ------ ippc19 ippc40 ------ ipp 23459 1 0 Sep28 ? 00:00:11 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o5645g0375o.316082 --cam_id 507048 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o5645g0375o.316082/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose 
      ipp 30756 1 0 Sep28 ? 00:00:07 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o5267g0305o.146948 --cam_id 506994 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o5267g0305o.146948/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
      ------ ippc19 ippc45 ------ ipp 24388 1 0 Sep28 ? 00:00:10 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o5608g0274o.302452 --cam_id 506998 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o5608g0274o.302452/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
      ------ ippc19 ippc47 ------ ipp 27450 1 0 Sep28 ? 00:00:12 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o5649g0235o.317996 --cam_id 507050 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o5649g0235o.317996/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
      ------ ippc19 ippc48 ------ ipp 20297 1 0 Sep28 ? 00:00:15 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o6000g0364o.464638 --cam_id 505120 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o6000g0364o.464638/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
      ------ ippc19 ippc50 ------ ipp 12079 1 0 Sep28 ? 00:00:10 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o5253g0478o.141759 --cam_id 507099 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o5253g0478o.141759/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
      ------ ippc19 ippc53 ------ ipp 8138 1 0 Sep28 ? 00:00:15 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o5608g0255o.302433 --cam_id 506997 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o5608g0255o.302433/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
      ------ ippc19 ippc56 ------ ipp 5723 1 0 Sep28 ? 00:00:10 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o5710g0052o.341497 --cam_id 507918 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o5710g0052o.341497/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
      ------ ippc19 ippc60 ------ ipp 16825 1 0 Sep28 ? 00:00:11 perl /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ --exp_tag o5645g0358o.316065 --cam_id 507046 --camera GPC1 --outroot neb://any/gpc1/LAP.ThreePi.20120706/2012/08/04/o5645g0358o.316065/ --redirect-output --run-state update --reduction LAP_SCIENCE --dbname gpc1 --verbose
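
For checking these jobs later, the exp_tag and cam_id values can be pulled out of the listing above mechanically. The following is a rough sketch, assuming the listing is saved as plain text in the same layout: an optional "------ ... ------" header followed by the process line, with headerless lines belonging to the previous header. The function name parse_limbo is hypothetical, and the two sample lines are abbreviated copies of entries above (the command path is replaced by "perl ...").

```python
import re

# Header such as "------ ippc19 ippc26 ------"; kept verbatim, not interpreted.
HEADER_RE = re.compile(r"-{6} (.+?) -{6}")
# The two identifiers of interest on each process line.
JOB_RE = re.compile(r"--exp_tag (\S+) --cam_id (\d+)")


def parse_limbo(lines):
    """Return (header, exp_tag, cam_id) tuples.

    A line without its own header inherits the most recent one.
    """
    header = None
    jobs = []
    for line in lines:
        match = HEADER_RE.search(line)
        if match:
            header = match.group(1)
        for exp_tag, cam_id in JOB_RE.findall(line):
            jobs.append((header, exp_tag, int(cam_id)))
    return jobs


# Abbreviated sample lines taken from the listing above.
SAMPLE = [
    "------ ippc19 ippc26 ------ ipp 13779 1 0 Sep28 ? 00:00:07 perl ... "
    "--exp_tag o6061g0029o.487029 --cam_id 507827 --camera GPC1",
    "ipp 16911 1 0 Sep28 ? 00:00:07 perl ... "
    "--exp_tag o5272g0226o.148687 --cam_id 505081 --camera GPC1",
]

if __name__ == "__main__":
    for header, exp_tag, cam_id in parse_limbo(SAMPLE):
        print(header, exp_tag, cam_id)
```

The resulting cam_id list is what would need to be checked before the camera stage is turned back on.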