PS1 IPP Czar Logs for the week 2013.08.12 - 2013.08.18

Monday : 2013.08.12

  • 08:47 Bill: update pantasks has stuck jobs (88,000 seconds) probably due to the down node. Set pantasks to stop in preparation for clearing out the jobs
  • 09:10 Bill restarted update and pstamp pantasks
  • 09:20 Bill dropped a skycell that was repeatedly faulting: difftool -updatediffskyfile -set_quality 42 -fault 0 -diff_id 463628 -skycell_id skycell.1226.089
  • 14:00 CZW: added do.more.summary macro to the stack pantasks, which allows the stack.summary tasks to run continually. I've limited the number of simultaneous run tasks to 10 to try and keep the load from getting too high (this task needs to read 100 stack skycells, so it's somewhat memory intensive).
  • 20:00 Bill: earlier queued 105 M31 exposures to be processed. Now have second thoughts. Compromise: removed M31.rp.2013 label from stdscience. Left M31.rp.2013.bgsub in. This will do background subtracted chip - warp processing, but will defer background uncorrected chip, chip_bg, and warp_bg processing until the label is added back in.

Tuesday : 2013.08.13

  • 01:55 MEH: stsci19 has been down for ~3 hrs, stalling all processing. Setting neb-host down to see if things can be triaged until morning
    • not looking good. may be more than just stsci19, often seeing timeout when checking mounts to ipp052.0-ipp066.0 -- leaving until morning
  • 07:55 Bill: turning revert tasks off in stdscience; faulted tasks just fault again
  • 08:00 Bill: summit copy has 132 incomplete downloads, 66 copied but not registered
    • killed a couple of stuck registration processes (> 20000 sec)
    • ran pztool -clearcommonfaults
    • making some progress now
    • regtool -updateprocessedimfile -class_id XY53 -exp_id 642199 -set_state pending_burntool -burntool_state 0
  • 08:25 ipp058 has 14 "sync" processes all hung which is causing several scripts to not exit.
  • 08:35 restarted registration pantasks. hosts ipp058 and ipp065 set to off because they hang in sync
  • 08:39 registration is running many more processes after restart. burntool is making progress again. XY14 and XY27 are behind
  • 08:49 Bill restarted pstamp pantasks
  • 09:32 Bill ran pztool -clearcommonfaults to revert faulted imfile from o6517g0198o ota24 (files were copied previously; fault was "can't lock file"). May require further intervention.
  • 09:38 EAM : a number of machines (ipp014, ipp015, ipp016, ipp019, ipp020, probably others) have ipp052 mounts making things slow. I rebooted ipp013, but it will be a pain to reboot all of these. Haydn is checking into ipp052, so we'll leave it for now.
  • 13:51 Bill: added label M31.rp.2013 to stdscience
  • 15:25 MEH: tweak_ssdiff to get the remaining missed nightly SSdiffs processed
    • SSdiffs running -- reset back to default time

Wednesday : 2013.08.14

  • 06:35 Bill: revert tasks are off in stdscience. Issued them by hand to clear a few handfuls of faults.
  • 10:15 Bill started pstamp pantasks which was going sluggishly. Added a set of compute3 nodes to help speed things along
  • 10:27 Bill: ippc02 is out of space in /, so it is failing nebulous requests. Deleted all files in /tmp
  • 11:40 Bill removed compute3 hosts from pstamp and added them to update, which is the current bottleneck for pstamp. MPIA has some requests that have been open for 4 days.
  • 15:05 Bill set ipp052 to repair in nebulous. Not sure if there is any reason not to set it to up.
  • 19:50 Bill restarted summit copy, registration, distribution, and publishing pantasks. Their pcontrols were spinning

Thursday : 2013.08.15

  • 10:30 Bill: restarted registration pantasks. A stuck process on ipp025 has frozen burntool processing at exposure o6519g0369o, so there are 151 exposures left to process.
  • 10:40 Bill: rebooted ipp025
  • 11:00 Bill: There is a lap warp that stdscience thinks is running on ipp025. Once nightly science data finishes processing and the warp book is empty we should do warp.reset or probably just restart stdscience.
  • 12:59 Bill: dropped 2 diff skyfiles that repeatedly fault due to failure to measure fwhm
    difftool -updatediffskyfile -set_quality 3006 -diff_id 465627 -skycell_id skycell.1219.025 -fault 0
    difftool -updatediffskyfile -set_quality 3006 -diff_id 465688 -skycell_id skycell.1311.047 -fault 0
  • 13:15 MEH: stdsci unwant parameter was 7, set to 30 to help get more nodes to do warps
  • 14:50 MEH: nightly MD stacks finished, tweak_ssdiff to get SSdiffims done -- reset back to default @15:20
  • 17:10 MEH: MD02.ref select exposure processing into the reprocessing pool
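
Note on the 12:59 entry above: when several diff skyfiles need to be dropped at once, the command lines can be generated rather than typed by hand. The following is only a sketch. The helper name drop_diff_skyfile_cmd is hypothetical and not part of the IPP tools; it only assembles strings using the difftool flags that appear in this log, and it does not run anything. The default quality value 3006 is simply the one used in that entry.

```python
# Sketch only: build (but do not execute) difftool command lines for
# dropping diff skyfiles that repeatedly fault. The helper name is
# hypothetical; the flags are copied from the 12:59 log entry.

def drop_diff_skyfile_cmd(diff_id, skycell_id, quality=3006):
    """Return the difftool command string for one diff skyfile."""
    return (
        "difftool -updatediffskyfile "
        f"-set_quality {quality} "
        f"-diff_id {diff_id} "
        f"-skycell_id {skycell_id} "
        "-fault 0"
    )

# The two skyfiles dropped in the 12:59 entry.
faulting = [
    (465627, "skycell.1219.025"),
    (465688, "skycell.1311.047"),
]

commands = [drop_diff_skyfile_cmd(d, s) for d, s in faulting]

for cmd in commands:
    print(cmd)
```

The printed lines can be reviewed before being pasted into a shell, which keeps the operator in the loop for each dropped skycell.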

Friday : 2013.08.16

  • 11:00 MEH: nightly mostly done, time for regular restart stdsci
  • 18:00 MEH: pausing processing to update MD/DEEP stack ippconfig and ippScripts
  • 18:50 MEH: with MD02ref chip->warp to do, not as many LAP stacks so reallocating 3x compute3 in stack back to 4x compute3 in stdsci

Saturday : 2013.08.17

  • 10:30 MEH: deepstack running, compute3 being manually managed

Sunday : 2013.08.18

  • 17:30 MEH: stdsci struggling to keep fully loaded, time for regular restart before nightly