PS1 IPP Czar Logs for the week 2014.09.08 - 2014.09.14

(Up to PS1 IPP Czar Logs)

Monday : 2014.09.08

Tuesday : 2014.09.09

  • Ken noted problems: downloads were slow (not sure why), but they were proceeding. Heather turned off addstars to decrease load on ippdb01 (from 15 down to 11-12ish)
  • 07:40 MEH: local stdlanl stopped + c2 hosts off -- these nodes can be used to boost nightly processing catchup (this worked last Thursday; ipp067,068,069,070 were able to handle the load, and have handled local stdlanl being left on during nightly processing as well). Also loading in the ippsXX nodes
  • 08:00 MEH: stdsci also needs a regular restart; polls and active jobs remain low.. -- wow, so much better afterwards
  • 08:10 MEH: also ran tweak_wsdiff so it does not queue jobs until this afternoon
  • 11:00 HAF: turned addstars back on
  • 11:20 MEH: nightly looks to finish ~noon; since things were taken apart, will put hosts back and set local stdlanl back to run when it finishes. Leaving ippsXX nodes in stdsci for the WS diffs unless instructed differently.
  • 12:20 MEH: nightly finished, publishing nearly so -- ps_ud_WEB.BIG,ps_ud_WEB.UP,OSS.WS.nightlyscience labels back into stdsci, c2 back to local stdlanl and set to run, WS running with 3x ippsXX machines in stdsci
    • was running stdsci with 430 nodes over the regular ~315 and saw a ~30% rate increase (would be less if downloads were still going). local stdlanl is ~160 nodes nominal.
    • by ~12:30 ippdb01 was being overloaded again; is local stdlanl part of the problem?
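As a rough sanity check on the scaling numbers quoted above (a sketch; 315 and 430 are the approximate node counts from this log):

```python
# Rough check of the stdsci scaling numbers quoted above.
regular_nodes = 315   # approximate regular stdsci allocation
boosted_nodes = 430   # stdsci with the extra c2/ippsXX hosts added

added_capacity = boosted_nodes / regular_nodes - 1
print(f"added capacity: {added_capacity:.0%}")

# ~37% more nodes yielded only a ~30% rate increase, i.e. somewhat
# sub-linear, consistent with the note that ongoing downloads eat
# into the gain.
```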
  • 16:20 MEH: cannot rebuild the nightly ops tag w/ the SHUTOUTC mod until the WS diffs are finished, otherwise there will be issues w/ already-made warps -- will have to rebuild the nightly ops tag tomorrow

Wednesday : 2014.09.10

  • 03:30 Bill: summit copy is 238 exposures behind.
    • In gpc1 database lots of jobs in the process list from summit copy and addstar.
    • checkexp script takes >~30 seconds to run.
    • pcontrol was spinning so restarted summit copy pantasks.
    • removed the old dates from registration pantasks
  • 04:14 stdscience is proceeding slowly due to many load tasks timing out
  • 07:20 Bill set pstamp to stop
  • 07:40 HAF: stopped addstars until we can figure out the database problems
  • 07:45 EAM: stopped and restarted mysql, will restart stdscience in 2 min
  • 07:47 EAM: processing is back on
  • 11:15 EAM : removed WS labels for now, rebooting ipp035 which crashed.
  • 12:12 HAF: registration stuck (I keep getting txt messages about that); fixed with regpeek and:
    • regtool -updateprocessedimfile -exp_id 791809 -class_id XY23 -set_state pending_burntool -dbname gpc1
    • regtool -updateprocessedimfile -exp_id 791832 -class_id XY23 -set_state pending_burntool -dbname gpc1
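The two resets above follow the same pattern; a minimal sketch of generating them for a list of stuck imfiles (the exp_id/class_id pairs are the ones from this incident; commands are printed, not executed):

```python
# Build the regtool invocations that reset stuck imfiles back to
# pending_burntool, exactly as done above.
stuck = [(791809, "XY23"), (791832, "XY23")]  # (exp_id, class_id) pairs

for exp_id, class_id in stuck:
    cmd = (f"regtool -updateprocessedimfile -exp_id {exp_id} "
           f"-class_id {class_id} -set_state pending_burntool -dbname gpc1")
    print(cmd)
```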
  • 14:15 Bill: pstamp pantasks restarted
  • 14:40 MEH: some fault 5 diffims cleared so MOPS can get the detections for those exposures

Thursday : 2014.09.11

  • 05:51 Bill: removed ps_ud% labels from stdscience pantasks with macro del.update.labels
    • 08:03 added them back in
  • 08:10 MEH: ran tweak_wsdiff so WS finishes earlier and the ops tag can be rebuilt -- oops, messed up the tweak; restarted stdscience and then tweaked again
  • 09:50 MEH: MOPS missing the v3-v4 diffim for OSSR.R23S4.0.Q.i ps1_16_1727 (o6911g0423o,o6911g0440o); it appears a v2-v3 diffim happened instead.. manually queued with:
    difftool -dbname gpc1 -definewarpwarp -warp_id 1023866  -template_warp_id 1023880 -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2014/09/11 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20140911 -set_reduction SWEETSPOT -simple -rerun -pretend
  • 13:40 MEH: rebuilding ipp ops tag ipp-20130712 to add FPA.SHUTOUTC keyword for MOPS
  • 14:10 MEH: not quite ready before init day, so waited and then restarted stdscience -- everything running again
  • 17:00 MEH: summitcopy experiencing many crashes -- only ~50 OTAs would download; restarted and it is okay. Doing the same for registration as well.

Friday : 2014.09.12

  • 10:00 HAF: shut down stdscience, pstamp, stack, registration, summitcopy in prep for Haydn's work. neb-host down ipp019, 018, 014, 016, 008, 037, 013, and 012 (aka the cab1 machines); CZW shut down PV3-related processing
  • 10:45 Haydn is at ATRCB, shutting down cab1 now. He plans to call Heather and shut down jaws after the cab1 work.
  • 13:00 HAF: Haydn is done with cab1 + jaws. Heather restarted stdscience, registration, summitcopy, stack, pstamp, and set neb status to up for the cab1 machines.
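The cab1 neb-host state changes bracket the hardware work; a sketch of generating the down/up command lines for the hosts listed above ("neb-host down &lt;host&gt;" matches the usage in this log, while the "up" form mirroring it is an assumption; commands are printed, not executed):

```python
# Generate neb-host state changes for the cab1 machines named above.
# "neb-host down <host>" matches the usage in this log; the "up"
# counterpart is assumed to take the same form.
cab1 = ["ipp019", "ipp018", "ipp014", "ipp016",
        "ipp008", "ipp037", "ipp013", "ipp012"]

def neb_host_cmds(state):
    """Return one neb-host command string per cab1 host."""
    return [f"neb-host {state} {host}" for host in cab1]

for cmd in neb_host_cmds("down"):   # before the cab1 work starts
    print(cmd)
for cmd in neb_host_cmds("up"):     # after cab1 + jaws are done
    print(cmd)
```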
  • 15:00 HAF: restarted 5 addstars (stsci01 - stsci05); load on ippdb01 is 3.5. I'll add more tomorrow if all goes well.
  • 18:30 MEH: cleanup not started/running -- looks like Gene took it out of the auto startup list in August so it was missed -- starting it normally now
  • 23:40 MEH: looks like ipp034 has crashed; not responding and cannot connect from the console either, so not a gmond problem -- nothing on the console since the last crash ~9/10, moderate load and in repair -- must be rebooted ASAP or it will hold on to files in summitcopy+registration and stall nightly.
    • rebooted okay, taking out of processing..
    • if it follows the historical pattern, expect to see ipp035,036,037 crash in the near future..

Saturday : 2014.09.13

Sunday : 2014.09.14

  • 09:20 MEH: taking out the WS label until caught up
  • 20:00 MEH: gmond segfault on ipp069, restarted
  • 20:10 MEH: stdsci in need of its regular restart; polling very underloaded, and it is easier to get the cmds through before pantasks becomes too unresponsive...