PS1 IPP Czar Logs for the week 2015.03.30 - 2015.04.05

Monday : 2015.03.30

  • HAF: warptool -revertwarped -fault 4 -label OSS.nightlyscience -dbname gpc1
  • HAF: still broken:
    cannot build growth curve (psf model is invalid everywhere)
    Backtrace depth: 11
    Backtrace 0: p_psAssert
    Backtrace 1: pmGrowthCurveGenerate
    Backtrace 2: psphotMakeGrowthCurve
    Backtrace 3: psphotChoosePSFReadout
    Backtrace 4: psphotChoosePSF
    Backtrace 5: psphotReadoutFindPSF
    Backtrace 6: (unknown)
    Backtrace 7: (unknown)
    Backtrace 8: (unknown)
    Backtrace 9: __libc_start_main
    Backtrace 10: (unknown)
    Unable to perform pswarp: 4 at /home/panstarrs/ipp/psconfig/ipp-20141024.lin64/bin/ line 563.
    Running [/home/panstarrs/ipp/psconfig/ipp-20141024.lin64/bin/warptool -addwarped -tess_id RINGS.V3 -path_base neb://ipp081.0/gpc1/OSS.nt/2015/03/30//o7111g0405o.894133/o7111g0405o.894133.wrp.1520051.skycell.0849.005 -hostname ippc56 -dtime_script 19.999971985817 -warp_id 1520051 -skycell_id skycell.0849.005 -fault 4 -dbname gpc1]...
  • HAF manually setting quality for those 2 to 42:
    warptool -revertwarped -fault 4 -label OSS.nightlyscience -dbname gpc1
    warptool -addwarped -tess_id RINGS.V3 -path_base neb://ipp081.0/gpc1/OSS.nt/2015/03/30//o7111g0405o.894133/o7111g0405o.894133.wrp.1520051.skycell.0849.005 -hostname ippc56 -dtime_script 19.999971985817 -warp_id 1520051 -skycell_id skycell.0849.005 -quality 42 -fault 0 -dbname gpc1
    warptool -addwarped -tess_id RINGS.V3 -path_base neb://ipp081.0/gpc1/OSS.nt/2015/03/30//o7111g0539o.894269/o7111g0539o.894269.wrp.1520184.skycell.1027.036 -hostname ipp088 -dtime_script 17.9999828338623 -warp_id 1520184 -skycell_id skycell.1027.036 -fault 0 -quality 42 -dbname gpc1
  • 16:05 Bill: restarted postage stamp server pantasks. Modified to look for staticsky cmf for stacks if a skycal result is not given by the parser.
  • 20:00 EAM: stopping and restarting stdscience (80k+ jobs)

Tuesday : 2015.03.31

  • 04:00 Bill: mysqld on ippdb05 crashed. Log file saved to ~ipp; restarted it.
    • 04:40 The burntool process was quite messed up by the DB crash. Reset the states of about 20 rawfiles.
    • 04:45 restarted summitcopy pantasks so that the error counts are reset
  • 05:00 Bill: restarted czarpoll and roboczar on ippc11. They exited when the database went down.
  • 16:16 CZW: ipplanl/pv3stacksummary is running repair stack summary jobs. This pantasks uses storage nodes, but the jobs are very lightweight, and did not bother things last time they ran.

Wednesday : 2015.04.01

  • 07:00 HAF: registration jammed, did this:
    regtool -updateprocessedimfile -exp_id 895632 -class_id XY16 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 895653 -class_id XY16 -set_state pending_burntool -dbname gpc1
  • 09:13 Bill: restarted pstamp pantasks

Thursday : 2015.04.02

  • 05:30 HAF: registration stuck, fixing:
    regtool -updateprocessedimfile -exp_id 896283 -class_id XY36 -set_state pending_burntool -dbname gpc1
  • 06:15 EAM : stdscience is running slow; stopping to restart.
  • 06:30 EAM : restarted stdscience

Friday : 2015.04.03

  • 19:35 EAM : ipp017 crashed, no message on console. Attempted to reboot, with no luck. Per policy, I am powering that machine down now.
  • 22:05 MEH: pstamp could use a restart
  • 22:10 MEH: no nightly has processed... looks like ipp017, while down, is still set to neb-host up.. probably the block..
    • also has jobs running for >9ks -- great, cleaning up a mess now..
    • >160 exposures behind
  • 22:40 MEH: 4 diffim fault 5, must clear before warps get cleaned up..
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.2122.067 -diff_id 897737 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0685.025 -diff_id 898385 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0685.025 -diff_id 898401 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0685.025 -diff_id 898469 -fault 0
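
    The four difftool calls above share one pattern, so they could be generated from a list of diff_id / skycell_id pairs. A minimal sketch, assuming a hypothetical helper named emit_difftool_cmds (not part of the IPP tools) that only prints the commands as a dry run and uses no flags beyond those already shown in this entry:

```shell
# Hypothetical helper (not an IPP tool): print one difftool command per
# "diff_id skycell_id" pair read from stdin. Nothing is executed here;
# review the printed commands before running any of them.
emit_difftool_cmds() {
    local diff_id skycell_id
    while read -r diff_id skycell_id; do
        # Skip blank lines in the input.
        [ -z "$diff_id" ] && continue
        echo "difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id $skycell_id -diff_id $diff_id -fault 0"
    done
}

# The four faulted diff skyfiles from the 22:40 entry.
emit_difftool_cmds <<'EOF'
897737 skycell.2122.067
898385 skycell.0685.025
898401 skycell.0685.025
898469 skycell.0685.025
EOF
```

    Printing instead of executing keeps the quality/fault change reviewable; the output can be piped to a shell once it has been checked.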

Saturday : 2015.04.04

  • 05:20 Bill: ipp017 is down. Ganglia says it has been down for nearly 12 hours. Console does not work for me. Several camRuns are stuck trying to talk to the reference catalog there.
    • 05:24 Set ipp017 to down in nebulous and commented out the CATDIR.017 entry in site.config
    • stopping stdscience. I'm going to wait for the queue to clear and then try to kill the stuck jobs
  • 05:44 killed stuck camera jobs. Started new stdscience pantasks
  • 06:25 EAM : ippdb00 was close to full. I've purged binlogs to mysqld-bin.005115, giving us about 98G of space.
  • 14:35 MEH: restarting distribution to try and clear the fixed old diffims stalled.. except ones now stalled by ipp017 -- and making sure ipp017 is off in other pantasks..
  • 19:55 EAM: stopping & restarting pv3diff & pv3diffleft.

Sunday : 2015.04.05

  • 05:45 EAM : ipp054 crashed (kernel panic), rebooting. (success)
  • 16:00 EAM : many nodes have hung nfs mounts due to ipp017. I've tried to clear them, but ipp088 is blocked and has a load of > 100. I'm rebooting it to clear out its problems.
  • 20:25 EAM : I've cleared out all of the hung nfs mounts due to ipp017. There were various jobs stuck since Apr 03 which I needed to kill. (Note: I did not touch the ipps nodes or hi-mem x nodes).
  • 00:45 MEH: ipp088 is neb-host down still... seems to be up so putting into repair at least so not outright killing MD processing
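
    For the hung nfs mounts in the 16:00 and 20:25 entries, one way to spot them on a node is to stat each NFS mount point under a timeout. This is only a sketch of a common technique, not the procedure that was actually used here; check_mount is a hypothetical helper, and it assumes a Linux node with coreutils timeout and a readable /proc/mounts:

```shell
# Hypothetical helper: report whether a mount point answers a stat call
# within a few seconds. A hung NFS mount typically blocks stat indefinitely,
# so hitting the timeout (delivered as SIGKILL) is what flags it.
check_mount() {
    # $1 = mount point, $2 = timeout in seconds (defaults to 5)
    if timeout -s KILL "${2:-5}" stat "$1" >/dev/null 2>&1; then
        echo "ok $1"
    else
        # stat failed or timed out: the path is hung or does not exist.
        echo "HUNG-OR-MISSING $1"
    fi
}

# Check every NFS mount listed in /proc/mounts (Linux only).
# Fields in /proc/mounts: device, mount point, fs type, options, ...
if [ -r /proc/mounts ]; then
    awk '$3 ~ /^nfs/ {print $2}' /proc/mounts | while read -r mnt; do
        check_mount "$mnt"
    done
fi
```

    The check only reports; clearing a flagged mount (and killing the jobs stuck on it) remains a manual decision.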