PS1 IPP Czar Logs for the week 2013.08.26 - 2013.09.01


Monday : 2013.08.26

mark is czar

  • 08:30 MEH: nightly is finished, going to do the regular restart of stdsci and try to fix some LAP chips again
  • 09:04 Bill restarted update and pstamp pantasks (their pcontrols were spinning)
  • 09:13 Bill recovered two missing raw files from virtual bit bucket ippb02: o5536g0128o.ota44.fits and o5536g0071o.ota44.fits
    • the remaining 4 m31 faults have bad burntool state due to failures rerunning burntool. These will have to wait until ipp047's disk is back up
  • 09:25 MEH: turning compute3 back off in pstamp now that MOPS is finished. adding goto_cleaned.rerundiff label back to cleanup
  • 10:30 MEH: more MOPS stamps, 1x compute3 off in stack and back on in pstamp (and 1x in cleanup)
  • 10:40 MEH: diffim faults (from 8/24) -- setting ( also still set while ipp047 down, will set back on for nightly ~8-9pm)
    -- two 3PI WWdiff before fix to use good_frac (ie these warps had <0.1 good_frac) -- good_frac still not in ops build..
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -fault 0 -diff_id 470166 -skycell_id skycell.0705.092
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -fault 0 -diff_id 470287 -skycell_id skycell.0795.078
    -- one 3PI WSdiff -- only 4 stamps for the diffim although the images are well covered -- log says FWHM for warp is ~2.8 and the warp has many sources with odd masking at the cores (8448 CR+blend); suspect undersized PSF determination on the warp, need to retest -- the 4 stamps are on good sources, so this looks like an issue in chip->warp, and other skycells seem to have similar issues, so dropping the diffim
            470401  skycell.1517.042  ThreePi.WS.nightlyscience
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -fault 0 -diff_id 470401 -skycell_id skycell.1517.042
    -- two MD09 WSdiff -- both warp and refstack have nan for PSF, so something is wrong here -- warps look odd (blotches the size of sources) with high background ~22k, and other skycells seem swamped with likely false detections as well, so the exposure is poor anyway and dropping these
            470147  skycell.026  MD09.nightlyscience
            470159  skycell.076  MD09.nightlyscience
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -fault 0 -diff_id 470147 -skycell_id skycell.026
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -fault 0 -diff_id 470159 -skycell_id skycell.076
  • 19:00 MEH: as mentioned at meeting, adjusted compute3 allocation
    • suggestion was just +1x into stdsci (for total of 3x) while keeping stack the same (total of 3x)
    • +3x (for total of 5) in stdsci, -1x (for total 2) in stack seems to be stable (and cleanup also using 1x right now) -- stack rate is ~500 and load is okay even when doing M31 reprocessing at same time -- nightly rate same ~50-60 while LAP updates being done as well
    • made modification to the ippconfig/pantasks_hosts.input and will load this way by default now -- cleanup still also using 1x compute3, manually added
  • 22:30 MEH: mentioned at meeting, ipp034 is @half RAM (12/24G) and so made hosts_wave2_weak group in pantasks_hosts.input. basically loads as normal wave2 but half as many in stdsci
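
The five difftool calls in the 10:40 entry differ only in diff_id and skycell_id, so they could be generated from a short list. The following is a minimal dry-run sketch, not an existing IPP tool: the function name print_set_quality_cmds is hypothetical, while the flags and the quality value 14006 are copied from the commands in the log. It only prints the commands and never runs difftool itself.

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): print one difftool command per
# "diff_id skycell_id" pair read from stdin. Nothing is executed here.
# Flags and the quality value 14006 are taken from the log entries.
print_set_quality_cmds() {
    dbname=${1:-gpc1}
    quality=${2:-14006}
    while read -r diff_id skycell_id; do
        # skip blank lines and comment lines
        case "$diff_id" in
            ''|'#'*) continue ;;
        esac
        printf 'difftool -dbname %s -updatediffskyfile -set_quality %s -fault 0 -diff_id %s -skycell_id %s\n' \
            "$dbname" "$quality" "$diff_id" "$skycell_id"
    done
}

# The five diffims dropped on 2013.08.26
print_set_quality_cmds <<'EOF'
470166 skycell.0705.092
470287 skycell.0795.078
470401 skycell.1517.042
470147 skycell.026
470159 skycell.076
EOF
```

Printing first and piping to sh only after review keeps a typo in the list from touching the gpc1 database.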

Tuesday : 2013.08.27

  • 08:50 MEH: Serge pointed out faulting diffim, another low good_frac warp -- the needs to be updated/rebuilt with the fix in the ops tag
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -fault 0 -diff_id 471203 -skycell_id skycell.1582.032
  • 09:10 MEH: while sitting in DRAVG, might as well do the regular restart of stdsci
  • 09:20 MEH: and LAP prio to push warps into stacks after nightly
  • 09:25 Bill: queued last of obs_mode = 'm31' exposures. Still need to find and queue the overlapping 3PI exposures.
  • 09:33 Bill: rebuilt corrupt chip_bg file with tools/ --chip_bg_id 12076 --class_id XY55 --redirect-output
  • earlier Bill: queued m31.rp.2013.bgsub data to be cleaned. Did not change the label to goto_cleaned to avoid accidentally rerunning exposures in the future. Added label to the cleanup pantasks. Leaving the background preserved data for now.
  • 09:50 MEH: testing overload with revised stdsci+stack with LAP -- can hit stsci machines a bit hard initially but recovers okay. preparing to fix ipp047 LAP chip faults
    • as with nightly data, the setup is slightly overloaded and produces extra faults (expected from NFS issues) -- taking one out of stdsci should be sufficient if there is a concern (or possibly even removing the manually added one in cleanup may be enough relief)
  • 10:10 EAM : ipp064 had a kernel panic, rebooting
  • 10:45 Bill: the last of the M31 data has been queued. 549 exposures outstanding
  • 11:10 MEH: finished scanning for replacement for missing LAP chips, will flip LAP prio back to 200 once finished and warps pushed to stacks
    -- another set of chips that only exist on ipp047
    | neb://ipp047.0/gpc1/20111120/o5885g0116o/o5885g0116o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111125/o5890g0407o/o5890g0407o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0203o/o5891g0203o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0117o/o5891g0117o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0127o/o5891g0127o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0222o/o5891g0222o.ota67.fits | 
  • 12:23 Bill: This one's only copy is on ipp047 too neb://ipp047.0/gpc1/20111122/o5887g0088o/o5887g0088o.ota67.fits
  • 19:32 Bill: setting to avoid getting in the way of nightlyscience.
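
The ipp047-only chip lists in the 11:10 and 12:23 entries can be broken into fields to see which nights are affected. This is a rough sketch that assumes the URI layout seen in the log, neb://<volume>/gpc1/<date>/<exposure>/<exposure>.<ota>.fits; the function name summarize_neb_uris is hypothetical and it only parses text, it does not query nebulous.

```shell
#!/bin/sh
# Print "volume date exposure ota" for each line on stdin that contains
# a neb:// URI in the layout seen in the log (an assumption of this sketch).
summarize_neb_uris() {
    sed -n 's#.*neb://\([^/]*\)/gpc1/\([0-9]*\)/\([^/]*\)/[^/.]*\.\(ota[0-9]*\)\.fits.*#\1 \2 \3 \4#p'
}

# Count the 2013.08.27 chips per observation date
summarize_neb_uris <<'EOF' | awk '{ n[$2]++ } END { for (d in n) print d, n[d] }' | sort
| neb://ipp047.0/gpc1/20111120/o5885g0116o/o5885g0116o.ota67.fits |
| neb://ipp047.0/gpc1/20111125/o5890g0407o/o5890g0407o.ota67.fits |
| neb://ipp047.0/gpc1/20111126/o5891g0203o/o5891g0203o.ota67.fits |
| neb://ipp047.0/gpc1/20111126/o5891g0117o/o5891g0117o.ota67.fits |
| neb://ipp047.0/gpc1/20111126/o5891g0127o/o5891g0127o.ota67.fits |
| neb://ipp047.0/gpc1/20111126/o5891g0222o/o5891g0222o.ota67.fits |
| neb://ipp047.0/gpc1/20111122/o5887g0088o/o5887g0088o.ota67.fits |
EOF
```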

Wednesday : 2013.08.28

  • 05:48 Bill: nightly science is nearly done. Setting bg.on
  • 08:47 Bill: restarted the distribution, summitcopy, and publishing pantasks. Stopping stdscience for restart
  • 09:12 Bill: restarted stdscience, stack, registration, pstamp, and update pantasks.
    • stack is idle so added one set of compute3 hosts to pstamp and one each of compute2 and compute3 to update

Thursday : 2013.08.29

Friday : 2013.08.30

  • 08:30 MEH: two lone faulted 3PI.nightly chips stalled from in stdsci, manually reverting in case off for a reason -- for LAP chips on ipp047, could the chip.revert task be given a trange so nightly is not held until someone looks at it in the morning?
  • 11:45 CZW: restarted stdscience.
  • 17:00 MEH: manually removing 1x compute3 from stack for local stack tests, will be returned when finished

Saturday : 2013.08.31

  • 00:30 MEH: wweeeee looks like ipp033 is down since a little after midnight.. looking at console, nothing. back up
    • reverted OSS chips that would have been stalled from faults until morning.. so set up chip.revert to be valid 00:00-01:00 and 06:00-07:00, by then LAP chips tend to just be all faults anyways
  • 13:15 MEH: all LAP faulted with broken chips, good time for regular restart of stdsci
  • 16:20 MEH: more chips on ipp047 only
    | neb://ipp047.0/gpc1/20111126/o5891g0226o/o5891g0226o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0228o/o5891g0228o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111125/o5890g0508o/o5890g0508o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111125/o5890g0496o/o5890g0496o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0260o/o5891g0260o.ota67.fits | 
    | neb://ipp047.0/gpc1/20111126/o5891g0257o/o5891g0257o.ota67.fits | 
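
The chip.revert windows from the 00:30 entry (00:00-01:00 and 06:00-07:00) amount to a simple time-of-day check. The sketch below only illustrates that logic in plain shell. It is not pantasks trange syntax, the function name in_revert_window is hypothetical, and treating the end of each window as exclusive is an assumption.

```shell
#!/bin/sh
# Succeed (exit status 0) when the HH:MM argument falls inside
# 00:00-01:00 or 06:00-07:00. End times are treated as exclusive,
# which is an assumption of this sketch.
in_revert_window() {
    hour=${1%%:*}
    min=${1##*:}
    # drop one leading zero so values like 08 and 09 are not read as octal
    hour=${hour#0}; hour=${hour:-0}
    min=${min#0};   min=${min:-0}
    now=$(( $hour * 60 + $min ))
    { [ "$now" -ge 0 ] && [ "$now" -lt 60 ]; } ||
    { [ "$now" -ge 360 ] && [ "$now" -lt 420 ]; }
}

for t in 00:30 01:00 06:50 13:15; do
    if in_revert_window "$t"; then
        echo "$t chip.revert on"
    else
        echo "$t chip.revert off"
    fi
done
```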

Sunday : 2013.09.01

  • 06:50 Bill: pstamp server pantasks has become sluggish. Restarted it.
  • 10:50 MEH: LAP stalled, good time for regular restart of stdsci