PS1 IPP Czar Logs for the week 2014.11.24 - 2014.11.30


Monday : 2014.11.24

  • 04:05 : ipp034 was down, rebooting
  • 05:02 Bill: Restarted registration pantasks; lots of timeouts, and the errors in the counts are hard to follow
    • burntool process 23811 on ipp004 is stuck trying to process o6985g0360o.ota14; don't know what to do
    • 05:15 restarted stdscience pantasks, which has the effect of turning the update labels back on. They had been off, and there are several pstamp jobs waiting on a few files
    • 05:45 stopped registration
      • killed stuck burntool process
      • regtool -dbname gpc1 -updateprocessedimfile -exp_id 823590 -class_id XY14 -set_ignored -set_state full -burntool_state -14
      • set pantasks to run and now burntool is moving along
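Bill's unstick sequence above boils down to a repeatable recipe: stop registration, kill the stuck burntool process, mark the imfile ignored/full, then resume. A minimal sketch that builds the regtool command from an exposure/chip pair (the helper name is mine; the flags are exactly those logged above — the command is only formatted here, not executed):

```python
# Hypothetical helper: build the regtool command used to mark a stuck
# burntool imfile as ignored with state forced to full.
def burntool_ignore_cmd(exp_id, class_id, dbname="gpc1"):
    return (f"regtool -dbname {dbname} -updateprocessedimfile "
            f"-exp_id {exp_id} -class_id {class_id} "
            f"-set_ignored -set_state full -burntool_state -14")

# The exposure/chip from this morning's fix:
print(burntool_ignore_cmd(823590, "XY14"))
```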

Tuesday : 2014.11.25

  • 06:10 MEH: ipp036 unresponsive, taking it out of processing and power cycling (nothing on console) --
    • all but one job in summitcopy cleared.. restarting summitcopy to clear it
  • 06:15 MEH: looks like registration has a long-running burntool; may be the one Bill fixed earlier??
      0    ipp004    RESP  87573.44  0 --camera GPC1 --exp_id 823627 --class_id XY14 --this_uri neb://ipp058.0/gpc1/20141124/o6985g0397d/o6985g0397d.ota14.fits --continue 10 --previous_uri neb://ipp058.0/gpc1/20141124/o6985g0396d/o6985g0396d.ota14.fits --dbname gpc1 --verbose 
  • 16:07 CZW: restarted stdlanl, as it was cranky (and hadn't been restarted in a long time).
  • 23:41 HAF: I hate burntool. xy14 is stuck; per Bill's instructions, ran regtool -dbname gpc1 -updateprocessedimfile -exp_id 824673 -class_id XY14 -set_ignored -set_state full -burntool_state -14 to see if I can unstick it..
  • 1:46 HAF: burntool stuck again, but while investigating it unstuck itself... ok.
  • 1:54 HAF: regtool -updateprocessedimfile -exp_id 824781 -class_id XY64 -set_state pending_burntool -dbname gpc1
  • 1:58 HAF: regtool -updateprocessedimfile -exp_id 824794 -class_id XY64 -set_state pending_burntool -dbname gpc1
  • 2:04 HAF: regtool -updateprocessedimfile -exp_id 824807 -class_id XY64 -set_state pending_burntool -dbname gpc1
  • 2:04 HAF: regtool -updateprocessedimfile -exp_id 824795 -class_id XY64 -set_state pending_burntool -dbname gpc1
  • 2:22 HAF: regtool -revertprocessedimfile -dbname gpc1 -exp_id 824846
  • 2:22 HAF: regtool -revertprocessedimfile -dbname gpc1 -exp_id 824855
  • 2:22 HAF: regtool -revertprocessedimfile -dbname gpc1 -exp_id 824866
  • 2:29 HAF: regtool -updateprocessedimfile -exp_id 824848 -class_id XY12 -set_state pending_burntool -dbname gpc1
  • regtool -updateprocessedimfile -exp_id 824858 -class_id XY12 -set_state pending_burntool -dbname gpc1
  • 2:30 HAF: czar is tired, czar is going to bed.... I hope things keep going.... I don't see the random neb stuff anymore...
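The resets and reverts above differ only in exp_id/class_id; a small sketch that generates such a batch (helper names are mine; flags as logged — commands are built as strings, not executed):

```python
def burntool_reset_cmd(exp_id, class_id, dbname="gpc1"):
    """Build the 'set back to pending_burntool' regtool command used repeatedly above."""
    return (f"regtool -updateprocessedimfile -exp_id {exp_id} "
            f"-class_id {class_id} -set_state pending_burntool -dbname {dbname}")

def revert_cmd(exp_id, dbname="gpc1"):
    """Build the revertprocessedimfile command used for faulted exposures."""
    return f"regtool -revertprocessedimfile -dbname {dbname} -exp_id {exp_id}"

# Tonight's XY64 resets and the three reverts, as one batch:
for exp_id in (824781, 824794, 824807, 824795):
    print(burntool_reset_cmd(exp_id, "XY64"))
for exp_id in (824846, 824855, 824866):
    print(revert_cmd(exp_id))
```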

Wednesday : 2014.11.26

  • 05:20 EAM : I cleared out another problem burntool.
    regtool -updateprocessedimfile -burntool_state -14 -exp_id 824673 -class_id XY14 -set_ignored -set_state full -dbname gpc1
  • 12:10 HAF: reports from Serge that this warp is cursed, so I set it aside: update warpRun join warpSkyfile using (warp_id) set state = 'full', fault = 0, quality = 42 where warp_id = 1226449 and skycell_id = 'skycell.1783.001';
  • 21:50 MEH: looks like ipp035 down/unresponsive since ~2310, nothing on console but under load and in repair, taking out of processing and attempting power cycle -- check forced on /dev/sda3..
    • back up; summitcopy and stdsci jobs faulted okay, registration's did not.. will restart the pantasks to clear..
    • 38 exposures behind in processing; now moving again
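The warp set-aside earlier today is MySQL's multi-table UPDATE. A self-contained sqlite3 sketch of the same operation (tables reduced to just the columns the statement touches; sqlite has no UPDATE ... JOIN, so a subquery stands in):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Reduced stand-ins for the gpc1 tables: only the columns the UPDATE touches.
cur.execute("CREATE TABLE warpRun (warp_id INTEGER, state TEXT)")
cur.execute("""CREATE TABLE warpSkyfile
               (warp_id INTEGER, skycell_id TEXT, fault INTEGER, quality INTEGER)""")
cur.execute("INSERT INTO warpRun VALUES (1226449, 'new')")
cur.execute("INSERT INTO warpSkyfile VALUES (1226449, 'skycell.1783.001', 2, 0)")

# Equivalent of the MySQL UPDATE ... JOIN: mark the skycell set aside
# (quality = 42, fault cleared) and force the run state to 'full'.
cur.execute("""UPDATE warpSkyfile SET fault = 0, quality = 42
               WHERE warp_id = 1226449 AND skycell_id = 'skycell.1783.001'""")
cur.execute("""UPDATE warpRun SET state = 'full'
               WHERE warp_id IN (SELECT warp_id FROM warpSkyfile
                                 WHERE warp_id = 1226449
                                   AND skycell_id = 'skycell.1783.001')""")
con.commit()
print(cur.execute("SELECT state FROM warpRun").fetchone()[0])        # full
print(cur.execute("SELECT quality FROM warpSkyfile").fetchone()[0])  # 42
```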

Thursday : 2014.11.27

  • 1:12 HAF: regtool -dbname gpc1 -updateprocessedimfile -exp_id 825562 -class_id XY14 -set_ignored -set_state full -burntool_state -14
  • 1:17 HAF: regtool -updateprocessedimfile -exp_id 825567 -class_id XY14 -set_state pending_burntool -dbname gpc1
  • 1:18 HAF: regtool -revertprocessedimfile -dbname gpc1 -exp_id 825616 (note that we are now behind 83 exposures to download, 58 to register)
  • 1:43 HAF: regtool -dbname gpc1 -updateprocessedimfile -exp_id 825577 -class_id XY14 -set_ignored -set_state full -burntool_state -14 (87 to download, 72 to register)
  • 1:48 HAF: regtool -dbname gpc1 -updateprocessedimfile -exp_id 825578 -class_id XY14 -set_ignored -set_state full -burntool_state -14 (89/74)
  • 1:53 HAF: regtool -dbname gpc1 -updateprocessedimfile -exp_id 825579 -class_id XY14 -set_ignored -set_state full -burntool_state -14 (90/76)
  • regtool -dbname gpc1 -updateprocessedimfile -exp_id 825580 -class_id XY14 -set_ignored -set_state full -burntool_state -14
  • regtool -dbname gpc1 -updateprocessedimfile -exp_id 825581 -class_id XY14 -set_ignored -set_state full -burntool_state -14
  • and I restarted registration
  • ran a few of xy14 by hand
  • 2:48 HAF: regtool -updateprocessedimfile -exp_id 825593 -class_id XY14 -set_state pending_burntool -dbname gpc1 (and several more like that too)
  • I think the problem is related to 0331 and how it died - it just went into a continuous loop trying to re-burntool it. Is that expected?
  • 3am: heather has a way to fix this (it doesn't always work)
    • find the ones that are confused: select exp_id,exp_name, data_state from rawImfile where tmp_class_id = 'ota14' and exp_id > 825500 order by exp_name;
    • 0331 / 0347 get confused repeatedly (back to pending_burntool)
    • heather sets those back to ignored a few times
    • heather runs the stuck one by hand (0361 in this case) -- this unstuck it for a bit longer...
    • why does burntool hate me?
    • when things get stuck again, repeat.
  • 3:15 HAF: regtool -updateprocessedimfile -exp_id 825634 -class_id XY14 -set_state pending_burntool -dbname gpc1
  • 3:18 HAF: regtool -updateprocessedimfile -exp_id 825649 -class_id XY14 -set_state pending_burntool -dbname gpc1
  • 3:23 HAF: regtool -updateprocessedimfile -exp_id 825671 -class_id XY14 -set_state pending_burntool -dbname gpc1
  • 3:35 HAF: regtool -updateprocessedimfile -exp_id 825656 -class_id XY14 -set_state pending_burntool -dbname gpc1
  • 3:35 HAF: regtool -updateprocessedimfile -exp_id 825677 -class_id XY14 -set_state pending_burntool -dbname gpc1
  • 3:35 HAF: regtool -revertprocessedimfile -dbname gpc1 -exp_id 825742
  • 3:38 HAF: regtool -dbname gpc1 -updateprocessedimfile -exp_id 825677 -class_id XY14 -set_ignored -set_state full -burntool_state -14
  • 3:40 HAF: regtool -revertprocessedimfile -dbname gpc1 -exp_id 825745
  • 3:46 HAF: regtool -updateprocessedimfile -exp_id 825724 -class_id XY14 -set_state pending_burntool -dbname gpc1
  • 4:00 HAF: regtool -updateprocessedimfile -exp_id 825756 -class_id XY13 -set_state pending_burntool -dbname gpc1
  • 4:18 HAF regtool -revertprocessedimfile -dbname gpc1 -exp_id 825781
  • 4:18 HAF: regtool -dbname gpc1 -updateprocessedimfile -exp_id 825756 -class_id XY14 -set_ignored -set_state full -burntool_state -14
  • 4:21 HAF regtool -updateprocessedimfile -exp_id 825785 -class_id XY12 -set_state pending_burntool -dbname gpc1
  • 4:27 HAF regtool -updateprocessedimfile -exp_id 825787 -class_id XY41 -set_state pending_burntool -dbname gpc1
  • 4:30 HAF regtool -updateprocessedimfile -exp_id 825787 -class_id XY03 -set_state pending_burntool -dbname gpc1
  • 4:50 HAF : no problems for 20 minutes, and we are caught up on registration. hopefully it stays that way
  • 10:50 HAF: registration fell behind again after I went to bed, but has since recovered (mysteriously?)
  • 10:50 reverted some warps for OSS a few times (Richard Wainscoat noticed they were stuck); they seem to have resolved (?)
  • 11:08 HAF: nope. this cursed one has been set aside: update warpRun join warpSkyfile using (warp_id) set state = 'full', fault = 0, quality = 42 where warp_id = 1229191 and skycell_id = 'skycell.1946.002';
  • 14:00 HAF: final registration problem needed a rerun of neb://ipp044.0/gpc1/20141127/o6988g0578o/o6988g0578o.ota65.fits -- not sure what happened or why it got lost -- the log file just looks like it ended prematurely, with no errors or anything. Rerunning fixed it.
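Heather's "find the ones that are confused" query can be exercised against a reduced stand-in for rawImfile (columns trimmed to those the query uses; the sample rows and their exp_id values are invented for illustration — only 0331/0347/0361 appear in the log):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE rawImfile
               (exp_id INTEGER, exp_name TEXT, data_state TEXT, tmp_class_id TEXT)""")
# Invented sample rows: the repeatedly-confused 0331/0347 plus the stuck 0361.
rows = [
    (825567, "o6988g0331o", "pending_burntool", "ota14"),
    (825583, "o6988g0347o", "pending_burntool", "ota14"),
    (825597, "o6988g0361o", "full",             "ota14"),
]
con.executemany("INSERT INTO rawImfile VALUES (?, ?, ?, ?)", rows)

# The czar-log query (identical syntax in sqlite and MySQL here):
confused = con.execute(
    "SELECT exp_id, exp_name, data_state FROM rawImfile "
    "WHERE tmp_class_id = 'ota14' AND exp_id > 825500 ORDER BY exp_name"
).fetchall()
for row in confused:
    print(row)
```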

Friday : 2014.11.28

  • 04:50 EAM: ipp036 crashed, no errors on console, rebooting
  • 04:55 EAM: ipp036 is back up. summitcopy is far behind -- we are still downloading exposures from 03:27, so about 1.5h behind. The average download time is currently 60s per chip and we are using 51 parallel connections, which works out to ~71 seconds per exposure. Since the start of observations we have taken 616 object images in 11.75 hours, which is ~69 seconds per exposure. This is a problem!
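A quick check of the rate arithmetic above (60 science OTAs per GPC1 exposure is assumed; the other numbers are from the entry):

```python
# Summitcopy download capacity vs. camera acquisition rate.
chips_per_exp = 60     # GPC1 science OTAs per exposure (assumed)
sec_per_chip  = 60.0   # measured average download time per chip
connections   = 51     # parallel download connections

download_s_per_exp = chips_per_exp * sec_per_chip / connections
print(f"download: {download_s_per_exp:.0f} s/exposure")

exposures = 616
hours     = 11.75
acquire_s_per_exp = hours * 3600 / exposures
print(f"acquisition: {acquire_s_per_exp:.0f} s/exposure")
```

Downloading (~71 s/exposure) is slower than acquisition (~69 s/exposure), so the backlog can only grow while observing continues.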

Saturday : 2014.11.29

  • 05:00 EAM: i've launched a big relastro run using ippx002 - ippx043.
  • 22:40 MEH: stdsci pantasks in need of a restart, polling is underloaded and ~50% underused..
  • 23:30 MEH: as usual, touched something and a machine goes down -- ipp035 unresponsive ~2325 -- nothing on console, power cycle
    • something is driving ipp035/036 unresponsive more than the other nodes -- they weren't really under load until stdsci was restarted; should probably just remove them from stdsci and leave them in summitcopy+registration.. -- both ipp035+036 now out of stdsci
    • all stalled jobs cleared this time w/o needing to restart pantasks etc -- registration was stalled and needed
      regtool -updateprocessedimfile -exp_id 827836 -class_id XY54 -set_state pending_burntool -dbname gpc1

Sunday : 2014.11.30

  • 07:45 MEH: odd warp issue, and a diff fault due to "cannot build growth curve"
    • warp -- missing chip file, not sure how this can happen, but clearing it to finish nightly
      warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1234917 -skycell_id skycell.1425.044
    • and then 5 other skycells needed chip XY54 -- recovered the chip with the command below, and the warps ran okay
      perl --chip_id 1294435 --class_id XY54 --redirect-output
    • diff
      difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 610788 -skycell_id skycell.1434.035 -dbname gpc1
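Both set-asides follow the same quality = 42 pattern as the cursed warps earlier in the week; a small sketch that builds the two commands (helper names are mine; flags as logged — strings only, nothing is executed):

```python
def clear_warp_skyfile(warp_id, skycell_id, dbname="gpc1"):
    """Build the warptool command that sets a faulted warp skycell aside (quality 42)."""
    return (f"warptool -dbname {dbname} -updateskyfile -fault 0 -set_quality 42 "
            f"-warp_id {warp_id} -skycell_id {skycell_id}")

def clear_diff_skyfile(diff_id, skycell_id, dbname="gpc1"):
    """Build the difftool equivalent for a faulted diff skycell."""
    return (f"difftool -updatediffskyfile -fault 0 -set_quality 42 "
            f"-diff_id {diff_id} -skycell_id {skycell_id} -dbname {dbname}")

# This morning's two clears:
print(clear_warp_skyfile(1234917, "skycell.1425.044"))
print(clear_diff_skyfile(610788, "skycell.1434.035"))
```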