
Monday : 2012.12.10

  • 08:40 Bill dropped a chip that had no detections and had popped an assertion failure: chiptool -dropprocessedimfile -set_quality 99 -chip_id 725138 -class_id XY76
  • 08:42 set sts.rerun.20121129 back to active; 565 warps pending. They are slow because our pantasks host assignment is designed for maximum use of CPU cores, but STS fields are memory hogs (see the sizing sketch at the end of today's entries).
  • 09:50 MEH: set ipp011 neb-host to up to see if it can be crashed again
  • 12:00 MEH: rebooting ipp026 after its weekend trouble; adding it back into processing and setting neb-host up
  • 13:00 MEH: set ipp011 neb-host to repair; it is being a glock tarpit (holding jobs +20 mins). Things are slowly clearing now.
  • 16:30 MEH: clearing some remaining stalled STS jobs on ipp066. Rebooting ipp066 and restarting stack, stdscience, and distribution.
  • 18:35 MEH: looks like ipp029 is down... nothing on the console... apparently it didn't like having nothing to do..
    • of course, it's not rebooting..
    • ipp029 neb-host down, taking it out of processing (should be already)
    • nightly science now proceeding @19:20
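
A side note on the 08:42 STS entry above: pantasks fills every core on a host, but an STS warp needs far more memory than a typical nightly job, so a fully loaded host ends up swapping. A minimal sizing sketch of that trade-off; all core counts, memory sizes, and per-job footprints below are assumptions for illustration only:

    # Hypothetical slot sizing: how many concurrent jobs a host can really take
    # when the jobs are memory-bound rather than CPU-bound.  All numbers are
    # assumed, not measured.

    HOSTS = {
        # host:   (cores, memory in GB) -- placeholder specs
        "ipp066": (8, 16),
        "ipp067": (8, 16),
    }

    MEM_PER_JOB_GB = {
        "nightly_warp": 1.5,  # assumed typical warp footprint
        "sts_warp": 6.0,      # assumed STS footprint (dense fields)
    }

    def max_slots(cores, mem_gb, job_kind):
        """Usable slots = min(core-limited slots, memory-limited slots)."""
        mem_limited = int(mem_gb // MEM_PER_JOB_GB[job_kind])
        return min(cores, max(mem_limited, 1))

    for host, (cores, mem_gb) in sorted(HOSTS.items()):
        print("%s  nightly: %d slots  sts: %d slots" % (
            host,
            max_slots(cores, mem_gb, "nightly_warp"),
            max_slots(cores, mem_gb, "sts_warp")))

With the assumed 16 GB hosts, the STS slot count drops from 8 to 2, which is the "memory hogs" behaviour described above.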

Tuesday : 2012.12.11

  • 09:13 Bill stopped distribution. There are some recurring faults that need to be debugged. May be related to ipp029 outage.
  • 09:15 bill set skycal to off in preparation for adding a column to the skycalResult table
  • 09:35 Serge: Restarted mysql servers on all storage nodes except ipp024, ipp025, ipp026, and ipp029 (down). Still need to play the ~schastel/dev/FilesMonitoring/NodeRelated/maintenance/refresh_status_table_creation_mhpcc.sql script on them (see the sketch at the end of today's entries).
  • 10:25 Serge: When ipp029 is back, install the following crontab (as schastel user): ~schastel/dev/FilesMonitoring/NodeRelated/deployment/scripts/crontab.ipp029
  • 10:45 Bill: stopped processing for rebuild for schema change
  • 11:15 MEH: with processing stopped, did a double restart of rpc.statd on ipp020 (it was using a large amount of swap space)
  • 11:45 Bill: Set pantasks to run. Restarted stdscience pantasks without compute3; started deepstack with 2 x compute3.
  • 20:50 MEH: apparently the stack pantasks isn't running... MD nightly is waiting to run. Can't see any crash messages; was it not restarted earlier today? Not sure why roboczar didn't report it.
    • Bill notes a likely silent crash from the tag being rebuilt earlier, which is why roboczar didn't report it (and we haven't seen the full-group shutdown warnings in a while)
    • MEH had checked the wrong list -- roboczar wasn't watching stack or publishing -- added them to czartool/czarconfig.xml, so it now watches stdscience distribution summitcopy registration pstamp stack publishing
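
For the 09:35 mysql restarts above, the remaining step is to play the table-creation script on each node that still needs it. A minimal sketch of one way to loop that, assuming passwordless ssh to the nodes and a local mysql client that picks up credentials from its usual option file; the node list here is a placeholder, not the real set:

    # Hypothetical helper: apply the refresh_status table-creation SQL on each
    # storage node.  The node list, ssh access, and mysql client defaults are
    # assumptions; substitute the real set of nodes.
    import subprocess

    SQL = ("~schastel/dev/FilesMonitoring/NodeRelated/maintenance/"
           "refresh_status_table_creation_mhpcc.sql")

    NODES = ["ipp030", "ipp031", "ipp032"]   # placeholder list

    for node in NODES:
        # pipe the script into the node's local mysql server over ssh
        ret = subprocess.call(["ssh", node, "mysql < %s" % SQL])
        print("%s %s" % (node, "ok" if ret == 0 else "FAILED (%d)" % ret))

Once ipp029 is back, the 10:25 crontab can go in the same way: run crontab ~schastel/dev/FilesMonitoring/NodeRelated/deployment/scripts/crontab.ipp029 as the schastel user on that host.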

Wednesday : 2012.12.12

Bill is the czar today

  • 07:40 added a third set of compute3 hosts to deepstack
  • 08:10 queued 73000 LAP skycal runs for the region RA < 227, dec < 10 (see the sketch at the end of today's entries)
  • 16:51 set ipp029 to repair mode (should have done this much earlier)
  • 19:15 stdscience is sluggish. Setting to stop in preparation for restart
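
The 08:10 queueing above is essentially a coordinate cut over the LAP skycells: keep everything with RA < 227 and dec < 10. A toy version of that selection; the skycell records below are invented for the example, and the real queueing goes through the IPP tools:

    # Toy illustration of the RA/dec cut behind the 08:10 skycal queueing.
    # The skycell records are invented; only the cut mirrors the log entry.
    skycells = [
        {"name": "skycell.1234.045", "ra": 221.3, "dec": 4.2},
        {"name": "skycell.1301.012", "ra": 229.8, "dec": 8.9},
        {"name": "skycell.1188.077", "ra": 215.0, "dec": 12.4},
    ]

    selected = [s for s in skycells if s["ra"] < 227.0 and s["dec"] < 10.0]
    print("%d of %d skycells fall in the RA < 227, dec < 10 region"
          % (len(selected), len(skycells)))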

Thursday : 2012.12.13

Bill is czar today

  • 05:43 restarted stdscience with some tweaks to skycal.pro to start jobs more often if there is work to do
  • 10:55 noticed that ipp020 has become a job staller. Killed running jobs there and set host to off in pstamp and stdscience
  • 11:00 restarted ipp020 rpc.statd
  • 11:08 dropped the 620 skycalRuns in the Kepler region that had excess matched detections. Set the 124 associated staticskyRuns to be re-run (set the result fault, set skycalRun.state = 'new', and revert; see the sketch at the end of today's entries). These will now run with the new psphotStack.
  • 12:27 added more hosts to deepstack pantasks: 2 x stare, 1 x compute2, 1 x wave4. Will need to remove these before nightfall
  • 13:14 restarted rpc.statd on ipp019
  • 16:57 set the extra hosts added to deepstack at 12:27 to off
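
The 11:08 revert above follows the usual fault-and-revert pattern: flag the bad runs with a fault, push the state back to 'new', and let the revert task requeue them. A rough sketch of that as a direct database update; apart from skycalRun.state and the gpc1 database name, the column names, fault code, ids, and connection settings are assumptions, and in practice the IPP command-line tools do this rather than raw SQL:

    # Rough sketch of a fault-and-revert reset: flag the run as faulted and
    # set its state back to 'new' so the revert task requeues it.
    import os
    import MySQLdb

    # Connection settings are placeholders, not the real gpc1 configuration.
    db = MySQLdb.connect(db="gpc1",
                         read_default_file=os.path.expanduser("~/.my.cnf"))
    cur = db.cursor()

    run_ids = [101, 102, 103]   # placeholder ids for the affected runs

    cur.executemany(
        "UPDATE skycalRun SET fault = 2, state = 'new' WHERE skycal_id = %s",
        [(rid,) for rid in run_ids])
    db.commit()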

Friday : 2012.12.14

Mark is czar

  • 06:20 Bill: set stare, compute2, and wave4 hosts to on in deepstack
  • 08:00 MEH: ipp020 has hung mounts and is hanging on to an ISP dark image registration and a cleanup run from yesterday. Restarting registration; will restart ipp020 in an hour or so.
    • after stopping nfs/rpc for a bit, starting them again, and restarting rpc, all mounts are back and nfs no longer reports the export cache as tainted @09:20; no reboot needed since the hanging jobs cleared.
    • all mounts check out okay while system mostly idle
  • 10:25 ipp020 mount trouble may be back -- and now cleared, maybe after the mysql restart there for the hanging python refresh_table.py -- will need to keep an eye on it
  • 11:13 bill cleaned up the staticsky mess that he made and queued new runs for the affected skycells
  • 11:15 MEH: mysql is running amok on most of the data nodes; python refresh_table.py has been running for the past 3 days, suspect it is stalled (no DB activity on the machines), and will need to start restarting mysql on all machines..
    • ipp028 mysql using 300% CPU and 11 refresh_table.py processes
    • ipp039 mysql using 300% CPU and 4 refresh_table.py processes
  • 13:06 MEH: finished resetting mysql on all systems with hanging python; ipp034 started the script again @12:52 and is active. Will need to watch over the weekend to see if more pileups/duplicates happen
    • @14:30 seeing more systems running refresh_table.py process now, ipp034 still active and updating DB
  • 16:50 MEH: Rita reports weather/power stability is better at ATRC, so the ippb machines can go back into nebulous. Forget which were up/repair; need to make sure those details get into the czarlogs. Seem to remember ippb03 wants to be in repair, ippb02 is overfull so repair, ippb00,01 up. The replication pantasks isn't active so this isn't critical..
  • 17:00 MEH: seeing many data hosts running refresh_table.py now with mysql using 100% of a CPU. If this happens often, it will need to run only outside nightly science hours, with the nodes relieved of jobs to accommodate it (no chip->warp or stack processing going on right now, so okay)
  • 18:45 MEH: nightly science started, need to turn off deepstack adds (stare, wave4) for stdscience
  • 20:40 MEH: ipp034 now has TWO refresh_table.py processes running @19:52, mysql using 200% CPU (a duplicate-instance guard is sketched at the end of today's entries)
  • 21:30 Serge: Fixed deadlock in files monitoring. Script is restarted once a day now (and not once an hour... oops). Delay between two iterations over nebulous directories set to 10 minutes instead of 10 seconds.
  • 21:55 MEH: ipp034 is now running just the second refresh_table.py that was started
  • 22:05 MEH: summitcopy having issues with o6276g0125o-gpc1-ps1-chip-ota51 since ~19:08 (duplicate entry and not enough fields); not blocking registration/processing so far:
    ...
    line 629: not enough fields:  610           217.665950           217.745549           217.745629           217.762466    1310679279.092246    1310679279.094205  415.07  173.72    1.61    1.36    1.91         7882.08                14.92    3.28 883 414 123  at /home/panstarrs/ipp/psconfig/ipp-20121026.lin64/lib/DataStore/FileSet.pm line 226
    ...
    line 1243: not enough fields:  608           217.780222   -5    0  FFFFFFFFFFFFFFEF  at /home/panstarrs/ipp/psconfig/ipp-20121026.lin64/lib/DataStore/FileSet.pm line 226
    
    
     -> p_psDBRunQueryPrepared (psDB.c:956): unknown psLib error
         Failed to execute prepared statement.  Error: Duplicate entry 'o6276g0125o-gpc1-ps1-chip-ota51' for key 1
     -> go (pzgetimfiles.c:244): unknown psLib error
         database error
    
    failure for: pzgetimfiles -uri http://conductor.ifa.hawaii.edu/ds/gpc1/o6276g0125o/index.txt -filesetid o6276g0125o -inst gpc1 -telescope ps1 -dbname gpc1 -timeout 650
    job exit status: 1
    job host: localhost
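
A note on the refresh_table.py pileups seen through the day: each new start landed on top of a still-running previous instance, and every extra copy kept mysql busy. Serge's actual fix (21:30 above) was to slow the restart cadence; a complementary guard would be for the script to refuse to start while a previous run still holds a lock. A minimal sketch of that idea, with the lock path and entry point invented for illustration:

    # Hypothetical single-instance guard for a long-running monitoring script
    # such as refresh_table.py: exit immediately if a previous run still holds
    # the lock instead of piling a second mysql-heavy process onto the node.
    import fcntl
    import sys

    LOCKFILE = "/tmp/refresh_table.lock"   # invented path

    def acquire_or_exit(path=LOCKFILE):
        fh = open(path, "w")
        try:
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except IOError:
            sys.exit("previous refresh_table instance still running; exiting")
        return fh   # keep the handle open for the life of the process

    if __name__ == "__main__":
        lock = acquire_or_exit()
        # ... iterate over the nebulous directories and update the tables ...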
    

Saturday : 2012.12.15

  • 07:14 Bill: reported the bad fileset listing to the camera alias
  • ramping up horsepower on deepstack: 2 x stare, 1 x wave4 on
    • MEH: looks like compute2 as well -- nightly SSdiffs are about to start (there are going to be a lot of them), so probably want stare, wave4, compute2 off in stdscience; setting them so now until nightly science tonight
  • 18:18 Bill: set deepstack to stop in preparation for restart. Restarted stdscience.
  • 18:41 deepstack restarted with 3 x compute3 only

Sunday : 2012.12.16

Bill is watching things today

  • 05:45 no data last night (too windy). Enabling 2 x wave4, 2 x stare, 1 x compute2 in deepstack; the same set is set to off in stdscience
  • in the AM, processed 20 M31 exposures as a test for MPG
  • 11:08 added some more hosts to deepstack: compute3, compute2, and stare
  • 11:46 removed 1 of the compute2 and stare instances. Too close to the memory limit, and I have to go out for a bit.
  • 14:00 MEH: refresh_table.py occasionally found running twice for a while on a machine (e.g. ipp028), but mysql is limited to 100% CPU with one process locking the table, and the older jobs are clearing okay. Appears stable again with Serge tweaking it from the other side of the world
  • 18:57 restarted deepstack with just the 3 x compute3 hosts