PS1 IPP Czar Logs for the week 2014-02-17 - 2014-02-23

(Up to PS1 IPP Czar Logs)

Monday : 2014-02-17

  • 11:30 MEH: ipp056 is dumping regular RAID controller error messages, setting to repair until more info -- Haydn reports the disk is on its way to failing; it has been swapped out and the raid is rebuilding
    Feb 17 11:18:38 ipp056 MR_MONITOR[15606]: <MRMON113> Controller ID: 0  Unexpected sense: PD = --:--:10 - Unrecovered read error, CDB =  0x28  0x00  0x5b  0xc7  0xe8  0x8f  0x00  0x00  0x04  0x00 , Sense =  0xf0  0x00  0x03  0x5b  0xc7  0xe8  0x91  0x0a  0x00  0x00  0x00  0x00  0x11  0x00  0x00  0x00  0x00  0x00 
    Feb 17 11:18:39 ipp056 MR_MONITOR[15606]: <MRMON057> Controller ID: 0  Consistency Check corrected medium error: (VD 0 Location 0x5bc7e834, PD --:--:10 Location 0x5bc7e834)
    Feb 17 11:18:39 ipp056 MR_MONITOR[15606]: <MRMON110> Controller ID: 0  Corrected medium error during recovery: PD --:--:10 Location 0x5bc7e834
  • 12:20 MEH: also looks like the ippdb02 mysql backup to ippc63.1 encountered an odd state and aborted at 09:31.. will need to fix so mysql can be started again on ippdb02..
    • script gets the week number, and of course 08 and 09 are problems since bash arithmetic interprets numbers with a leading zero as octal (08 and 09 are invalid octal)..
    • modified script and crontab, will restart backup process at 13:31 and mysql will be restarted normally when finished..
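The week-number bug above is easy to reproduce; a minimal sketch (the variable name is illustrative, not from the actual backup script):

```shell
# bash arithmetic treats numbers with a leading zero as octal, so the
# zero-padded week numbers 08 and 09 from date are invalid octal digits
week=08                       # e.g. week=$(date +%V)
# echo $(( week + 1 ))        # fails: "value too great for base"
echo $(( 10#$week + 1 ))      # force base 10: prints 9
```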

Tuesday : 2014-02-18

  • 08:00 MEH: ippdb02 backup to ippc63 still running, should finish later today
  • 11:33 Bill: reverted staticsky runs with fault 2 (33 runs) and 3 (11 runs) since the information in the logs indicates that the failures were due to cluster problems.
  • 14:00 MEH: odd pstamp/update faults w/ data_group M31.rp.2013.20110102, for a bit
  • 14:11 CZW: repointed the LAP queue file to the active one. This had been off since the timeout issue last week (still unresolved). Sudden thought while typing this out: the point where it ran into problems was skycell.2211/i, which is the 3664th entry in the old queue file. Since queuing needs to check the previous entries to ensure they haven't been done yet, this can probably stretch out to 10 minutes just in checking old entries. I've cut the currently active queue down to a three-hour slice, to see if that keeps the number of things to check in check.
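A back-of-envelope count of that re-checking cost (the quadratic model is an assumption based on the description above):

```shell
# if queuing entry k means re-checking the k-1 earlier entries, an
# n-entry queue costs n*(n-1)/2 checks in total to work through
n=3664                        # skycell.2211/i was the 3664th entry
echo $(( n * (n - 1) / 2 ))   # prints 6710616, ~6.7M checks
```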
  • 18:00 MEH: ipp064 volume degraded, already in neb-host repair for dvo work
  • 20:50 MEH: ipp023 is unresponsive.. looks like it crashed a little after 20:20, neb-host down.. -- nothing on console; like ipp026, another weak node with a known history of crashing..
    • console only shows that it apparently also crashed and rebooted on Sat 2/15, adding detail to ipp023-crash-20140215
  • 22:00 looks like ipp023 going down has pissed off ippc01 as well -- display says a general protection fault occurred, will need to power cycle it as well..
  • 23:35 MEH: ippc01 power cycled and back up, mess of stalled jobs and mounts cleared. restarting pantasks slowly -- ippc01-crash-20140218
  • 00:20 MEH: there is a mix of undocumented manual settings/changes in ~ipp/ippconfig/pantasks_hosts.input that aren't tracked for normal loading.... have to untangle these before stdsci+stk+staticsky can be started..
  • 00:30 MEH: summitcopy 500 read timeouts ongoing..

Wednesday : 2014-02-19

  • 09:30 MEH: manually raising some node usage for stack
  • 10:00 MEH: clearing out some LAP burn.tbl faults for files found missing on ippb02 (since ipp023 is down). there will be a mess of cam+warp+stack faults while ipp023 is down
  • 10:57 Bill: reverted all faulted staticsky runs. Looks like many of the fault 4s were due to system crashes or memory problems. Rerunning them all to get a better handle on which are due to bugs in the code.
  • 12:20 MEH: Haydn replaced the ipp023 mobo, back up. neb-host repair for the time being. all x.revert.on again
  • 15:50 MEH: ipp037,041,045 seem to be running 3x staticsky, but have only 8 cores.. and are data nodes, manually reducing to 2x..
  • 16:15 Bill: set staticsky to stop in preparation for rebuilding psphotStack with a couple of bug fixes
    • psphotStack rebuilt, setting to run. Reverted the fault 4s caused by the bug. Some will work now; others will get fault 3.
  • 16:46 Bill: manually edited some staticskyRuns that have inputs with <1% good_frac.
    • sky_id 459346 deleted input stack_id 2480292
    • sky_id 459737 deleted input stack_id 2561398
    • sky_id 462496 deleted input stack_id 3000094
    • sky_id 460690 set to drop; only has 1 input, with good_frac 0.000507
    • sky_id 461050 set to drop; all inputs too sparse
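A quick re-check of the <1% good_frac cut can be sketched with awk; the two data rows reuse values quoted in this log (the stack_id for 460690 was not logged, hence the placeholder):

```shell
# columns: sky_id stack_id good_frac; print sky_ids below the 1% cut
awk '$3 < 0.01 { print $1 }' <<'EOF'
459022 2388154 0.02
460690 unknown 0.000507
EOF
# prints: 460690
```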
  • 18:40 MEH: sky looking clear and NEO night, so manually rebalancing system to push more stdsci (if other not-czars are turning things on/off, it really needs to get logged here)..
  • 19:15 MEH: ipp040,043,050,052,053 taking excessive time in registration again, >>100s, neb-host repair and off in registration for now -- too many other problems, no time to isolate this issue in processing or disk -- likely related to previously reported odd behavior with reg+stdsci over the past couple weeks.
  • 20:10 MEH: turning nodes on in summitcopy (wave1, ipp037,041,045,047) -- total of 46 or so running
    • ipp019 has problem with network, cannot manually wget from conductor there -- same problem last night but just for ipp019 now
      request failed: 500 Can't connect to (connect: Network is unreachable) at /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/dsget line 150.
  • 22:50 MEH: ipp025 spiked @2248 and now unresponsive.. neb-host down.. -- ipp025-crash-20140219
    • 23:15 power cycle and back up, neb-host repair and mostly out of processing
  • 00:45 MEH: as long as nothing else crashes, stdsci rate ~50-100/hr should catch up

Thursday : 2014-02-20

Bill is czar today

  • 07:00 a few chips left to go and one camRun that faults repeatedly. Apparently a file on ipp025 is corrupt.
    • regenerated corrupt file with perl ~bills/ipp/tools/ --chip_id 948704 --class_id XY40
  • 07:08 removed LAP label from stdscience
  • 07:21 for some reason the remaining chip jobs are not getting queued. There are items in the book in run state but no controller jobs. Setting stdscience to stop to prepare for restarting.
    • 07:24 ran chip.reset to clear out the book and set stdscience to run. Now the 26 remaining chip jobs are running
  • 10:00
    • restarted pstamp pantasks with an additional hosts.pstamp
    • Ran a second hosts.publish in the publishing pantasks. Getting lots of faults there. Each run requires accessing many cmfs; with the cluster so busy it's likely that at least one will be on an unresponsive node.
    • Set staticsky to stop
    • removed ippc28 from stdscience to give the long-running psphotStack job a chance to swap back in.
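The publish-fault failure mode noted above (one bad node among many cmfs) compounds quickly; a sketch with made-up numbers, assuming independent per-file odds:

```shell
# probability a run faults if it touches n cmf files and each file has
# probability p of sitting on an unresponsive node: 1 - (1-p)^n
awk 'BEGIN { p = 0.001; n = 200; printf "%.2f\n", 1 - (1 - p)^n }'
# with these illustrative numbers: 0.18
```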
  • 10:25 restarted stdscience with the nominal host configuration (minus c28 and c63) and the LAP label
  • 10:46 set staticsky to run
  • 12:50 "fixed" 2 staticskyRuns from the bottom of the survey that were failing with fault 3
    • dropped failing input from sky_id 459022 (stack_id 2388154 only has 2% good_frac) with: delete from staticskyInput where sky_id = 459022 and stack_id = 2388154;
    • dropped failing run with: staticskytool -updaterun -set_state drop -sky_id 459190 -set_note 'all inputs have < 2% good_frac dropping'
    • dropped sky_id 477799 too
  • 14:55 MEH: again, too many other problems to deal with; unable to trace the long-running registration jobs on ipp040,043,050,052,053 that have been reported over the past couple weeks. still manually off in registration, but put back to neb-host up and will see how things go.
  • 17:00 MEH: using ippc63 for MD stack tests, making sure it is manually off from normal stdsci processing (compute3_weak)
  • 21:18 The czar has decided to make night life easier. lap label removed from stdscience. staticsky set to stop. Once the queue clears the staticsky nodes will be added to stdscience. It is already running some of the groups that were removed from stack last night.

Friday : 2014-02-21

Bill is czar today

  • 07:30 restarted stdscience with default set of hosts except set ippc63 to off. Started up staticsky.
    • MEH: will be regularly using ippc63 to try and catch up on some tests, so its usage is fully commented out in ~ipp/ippconfig/pantasks_hosts.input so it doesn't have to be worried about
  • 09:41 Since stdscience is running the last of the STS data, stack has nothing to do, so added the stack hosts to stdscience.
  • 13:30 CZW: I set some old detrends to state "ignore". This shouldn't cause any problems for processing (they shouldn't be used anyway), I just wanted to make sure that there was a czar log entry.
  • 13:45 CZW: replication pantasks is running again. This will add a slight io increase across the cluster as the script ssh's and md5sums.
  • 16:50 restarted summitcopy, registration, stack, and cleanup pantasks
  • 17:10 MEH: someone/thing is trying to kill ipp016, 020... looks like massive RAM used by ppImage.. trying to log in..
    -- ipp016 killed @30%, 50%
    ipp        921   710 53 16:20 ?        00:28:37 /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ppImage -file neb://ipp043.0/gpc1/20090701/o5013g0132o/o5013g0132o.ota24.fits neb://ipp016.0/gpc1/STS/STS.rp.2013/2009/07/01/o5013g0132o.83697/ -recipe PPIMAGE CHIP_AUXMASK -dumpconfig neb://ipp016.0/gpc1/STS/STS.rp.2013/2009/07/01/o5013g0132o.83697/ -recipe PSPHOT CHIP -recipe PPSTATS CHIPSTATS -stats neb://ipp016.0/gpc1/STS/STS.rp.2013/2009/07/01/o5013g0132o.83697/ -R PPIMAGE.CHIP FITS.TYPE NONE -burntool neb://ipp043.0/gpc1/20090701/o5013g0132o/o5013g0132o.ota24.burn.tbl -threads 4 -image_id 55849015 -source_id 33 -tracedest neb://ipp016.0/gpc1/STS/STS.rp.2013/2009/07/01/o5013g0132o.83697/ -dbname gpc1
    ipp      31476 31372 59 16:40 ?        00:19:07 /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ppImage -file neb://ipp043.0/gpc1/20090701/o5013g0141o/o5013g0141o.ota24.fits neb://ipp016.0/gpc1/STS/STS.rp.2013/2009/07/01/o5013g0141o.83707/ -recipe PPIMAGE CHIP_AUXMASK -dumpconfig neb://ipp016.0/gpc1/STS/STS.rp.2013/2009/07/01/o5013g0141o.83707/ -recipe PSPHOT CHIP -recipe PPSTATS CHIPSTATS -stats neb://ipp016.0/gpc1/STS/STS.rp.2013/2009/07/01/o5013g0141o.83707/ -R PPIMAGE.CHIP FITS.TYPE NONE -burntool neb://ipp043.0/gpc1/20090701/o5013g0141o/o5013g0141o.ota24.burn.tbl -threads 4 -image_id 55849615 -source_id 33 -tracedest neb://ipp016.0/gpc1/STS/STS.rp.2013/2009/07/01/o5013g0141o.83707/ -dbname gpc1
    -- ipp020 -- wasn't able to log in
  • Bill killed the ppImage on ipp020 and it started to respond. Removing the sts label for the weekend.
  • 08:49 more HTTP 500s from conductor. This time from an assortment of nodes. All jobs succeeded after revert.
  • 22:25 earlier raised set.ra.max from 120 to 180 in staticsky. This will process runs at low declination for a while, but once it gets to 10 degrees declination or so it will start processing in the order that the lap runs were done.

Saturday : 2014-02-22

  • 05:50 bill: summitcopy and registration are doing fine; stopping staticsky and adding some compute3 nodes to stdscience to help it move along
  • 08:08 bill: OSS is finished and 3pi is through camera; turning staticsky back on and removing the compute3 nodes that I added from stdscience
  • 12:28 bill: restarted stdscience.
  • 19:01 bill: set one set of compute2 and compute3 hosts to off in staticsky. Three psphotStacks, a ppStack, and a ppImage might be a bit aggressive.
  • 20:04 bill: staticsky set to stop

Sunday : 2014-02-23