PS1 IPP Czar Logs for the week 2013.07.15 - 2013.07.21


Monday : 2013.07.15

  • 10:25 Bill: MOPS reports slow postage stamp response. I restarted the pstamp pantasks; its pcontrol was spinning. Added 2x compute3 and compute2.
  • 11:05 Bill: MOPS is done. Updates are now the bottleneck. Restarted the update pantasks and added two sets of compute3 nodes. Removed them from pstamp.
  • 12:40 MEH: if 2x compute3 moved to update, need to turn off in stdsci or may overload -- loading up MD10.refstack.20130715 chip->warp now

Tuesday : 2013.07.16

  • 08:49 Bill: The postage stamp server is behaving strangely today. MOPS has submitted a number of requests, but each one is taking a very long time to parse into jobs. The result is that we have a number of requests parsing (limited to 10 to avoid database overload) and no jobs to run. Once the requests parse, they finish quickly. I'm not sure what to do. Perhaps the mysql on ippc17 is tired.
  • 08:55 Bill: Had a look at ippc30 (where the postage stamp working directories live); top showed many apache processes and little free memory. Ran my program eatmem and the amount of cached memory dropped as expected. After the program finished we had 15G free, the apache processes went away in top, and nfsd showed up. Restarted apache.
  • 09:04 Bill: Sample req_id 287529: submitted 08:03, parsing begins 08:54, parsing finishes 09:02, 200 jobs complete 09:03, request finishes seconds later.
  • 09:07 All caught up now. Strange.
  • 10:40 new batch of MOPS pstamp requests. Same poor parse performance. Restarted pstamp and update pantasks.
  • It appears that the speed of inserts into the pstampJob table is now a bottleneck (a quick spot check is sketched at the end of today's entries).
  • 15:45 Bill restarted mysql on ippc17 and the pstamp pantasks
  • 17:45 MEH: taking 4x compute3 from stdsci for local stdsci and deepstack runs
  • 23:40 MEH: stdsci in dire need of regular restart..
    • ipp057 very unhappy.. load>100, still not able to log in.. problem started ~23:20 -- finally got a login and killed ppImage jobs. Removing ipp057 from summitcopy+registration, but it was neb-host repair that helped reduce the load..
    • often having wave4 issues; the DVO work seems to be overloading them, so taking 2x out of stdsci -- Gene notes that the rsync had been running since the afternoon, just writing to ippcXX; maybe the new 3.7.6 kernel will help (need to check status with Gavin), or maybe just keeping the 2x wave4 out of stdsci is enough w/o having to put them into neb-repair
    • returning 4x compute3 to stdsci until morning
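
A rough spot check for the pstampJob insert-rate suspicion above, using only standard MySQL status counters (nothing here is specific to the pstamp schema; run against the pstamp database host, ippc17 per this week's notes):

    -- sample the server-wide INSERT counter twice and difference it;
    -- a low rate while many parsed requests are waiting points at the
    -- pstampJob inserts rather than at job execution itself
    SHOW GLOBAL STATUS LIKE 'Com_insert';
    SHOW GLOBAL STATUS LIKE 'Threads_running';
    SELECT SLEEP(60);
    SHOW GLOBAL STATUS LIKE 'Com_insert';  -- (second sample - first sample) / 60 = inserts per second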

Wednesday : 2013.07.17

mark is czar

  • 07:10 MEH: 57 images stuck in registration.. can run burntool manually and it picks them up, but regtool -pendingburntoolimfile is not finding them
  • 08:00 MEH: chip.revert.off, imfile having a burntool state problem -- related to the registration problem, or did someone try a fix earlier this morning and not put it in the czarlog?
    • lots of info dumped to the registration log around 01:20, something happened then..
  • 08:30 MEH: somehow o6490g0448o had burntool_state -1 and data_state full... re-ran ipp_apply_burntool_single.pl and it seems to be registering now (the kind of query behind the table below is sketched at the end of today's entries)
    +--------+-------------+------------+-------+----------------+----------+
    | exp_id | exp_name    | data_state | fault | burntool_state | class_id |
    +--------+-------------+------------+-------+----------------+----------+
    | 634289 | o6490g0448o | full       |     0 |             -1 | XY14     |
    | 634290 | o6490g0447o | full       |     0 |            -14 | XY14     |
    +--------+-------------+------------+-------+----------------+----------+

  • 12:00 MEH: Serge wanted more power in pstamp, taking 2x compute3 from stdsci until tonight's observations start
  • 13:10 MEH: rebuilding ops ippconfig for MD_REF/DEEP_1DG additions to recipes/reductionClasses.mdc
  • 15:10 MEH: rebuilding ops ippTasks, ippScripts for a registration bug fix
  • 15:12 Bill: The trigger of the burntool problem was that another pantasks (~ps2iq/registration) had register.add.date commands with the gpc1 database listed. This, combined with some incomplete error handling in the burntool script, caused the database to get set to the bad state. Integrated changes to ipp_apply_burntool_single.pl to improve the error handling and removed the offending lines from ps2iq's input file.
  • 19:50 MEH: looks like registration is having trouble.. Nfail=7698
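
For reference, the 08:30 table above is the kind of output a query along these lines would produce; the column names match the output, but the table names (rawExp, rawImfile) and the join are assumptions about the gpc1 schema, not taken from the log:

    -- look up the burntool/data state of the two exposures flagged at 08:30
    SELECT e.exp_id, e.exp_name, i.data_state, i.fault, i.burntool_state, i.class_id
      FROM rawExp e
      JOIN rawImfile i ON i.exp_id = e.exp_id
     WHERE e.exp_name IN ('o6490g0448o', 'o6490g0447o')
       AND i.class_id = 'XY14';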

Thursday : 2013.07.18

  • 03:36 Bill: added --camera GPC1 to register_imfile.pl's call to ipp_apply_burntool_single.pl; the missing option was the cause of the failures. Needed to edit the db to change burntool_state from check_burntool to pending_burntool for the 11 rawImfiles that were left in a bad state by this problem (a sketch of the edit is at the end of today's entries).
  • 04:46 Bill: burntool has caught up just in time. We are about 20 minutes late starting processing of the MSS exposures
  • 05:27 Bill: noticed that ns.stacks.run was timing out. Bumped timeout to 1000
  • 05:45 Bill: OSS stacks from 2013-07-17 just got queued.
  • 11:03 Bill: queued one last M31 test with the 20 test exposures labels M31.test.20130718.bgsub and M31.test.20130718
  • 14:00 Bill: using the ipp049 mysql to recover the detectionEfficiency table, which I deleted yesterday; operations is using it for some reason. Stopped this because I realized there will not be enough disk space.
  • 16:40 Bill: restarted stdscience and distribution pantasks. M31 reprocessing has begun. There are two labels M31.rp.2013.bgsub (chip - warp) and M31.rp.2013 (bg preserved chip, chip_bg, warp_bg)
  • 21:36 Bill: lowered the m31 label priority to 200 to match LAP. Since LAP updates have smaller chip_ids, they will get priority. M31.rp.2013 data queued for the r filter, 2009-08-31 - 2010-01-31: 413 chipRuns (x2) pending.
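
A sketch of the 03:36 db edit; the rawImfile table name follows from "rawImfiles" above and the state names are quoted from the log, but the 08:30 table on Wednesday shows numeric burntool_state values, so the exact column encoding (and selecting rows by state alone rather than listing the 11 imfiles explicitly) is an assumption:

    -- confirm the count first: should match the 11 rawImfiles mentioned above
    SELECT COUNT(*) FROM rawImfile WHERE burntool_state = 'check_burntool';
    -- reset them so burntool / registration will pick them up again
    UPDATE rawImfile
       SET burntool_state = 'pending_burntool'
     WHERE burntool_state = 'check_burntool';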

Friday : 2013.07.19

mark is czar

  • 07:50 MEH: stdsci has reached 100k, time for the regular restart. compute3 back in until needed in deepstack again
  • 08:20 MEH: lots of M31 chip faults, looks like a burntool problem? and the usual missing chips
  • 09:20 MEH: chip.revert.off... appears ipp053 has lost OTA76 copies?
  • 09:40 MEH: fixed the top 7 out of apparently 130.. quite a few; why now, and did something break on ipp053 that may have troubled other data?
  • 10:29 Bill: Removed M31.rp labels from stdscience. My query picked up a number of non-science images with the M31 obs_mode (qtfocus, magic tests). These apparently were not burntooled with version 14, but they are uninteresting. Many of the missing XY76 images are from science exposures, though (a selection that excludes the non-science frames is sketched at the end of today's entries).
  • 11:10 MEH: looks like ippc08 is down.. nothing on the console, power cycle -- second power cycle and it booted, nice long RAM check, and back up
  • 12:30 Bill: installed a new version of chip_imfile.pl which moves the neb-repair of broken raw images so that it runs even if burntool is going to be applied by ppImage. Added m31 labels back in, but removed them again to check on some more faults.
  • 12:51 Bill: dropped several runs whose raw files have a bad burntool state, but not before they generated a lot of faults. Chip job fault count is now 10170.
  • 12:55 MEH: ipp057 is getting abused... taking it out of stdsci+stack.. -- mysql at 85% RAM, need to restart mysql, also on ipp058 -- both machines badly overloaded.. but this may have messed up dvopsps. As datanodes, they cannot be abused or all processing suffers.
  • 14:55 MEH: ippc08 down again.. trying power cycle.
    • taking out of processing
    • restarting stdsci to clear some very long running jobs as well
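
Regarding the 10:29 note: a hedged sketch of a selection that keeps only the M31 science exposures; the obs_mode value is from the log, while the rawExp table name and the exp_type column/value are assumptions about the gpc1 schema:

    -- select M31 science exposures only; qtfocus / magic-test frames carry
    -- the M31 obs_mode but are not object exposures (column names assumed)
    SELECT exp_id, exp_name
      FROM rawExp
     WHERE obs_mode = 'M31'
       AND exp_type = 'OBJECT';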

Saturday : 2013.07.20

  • 10:30 MEH: restarting stdsci to include MD10.refstack now that regular night SSdiffs finished
  • 12:55 Bill: dropped 61 warpRuns for exposures that turned out not to be M31 science exposures

Sunday : 2013.07.21

  • 13:30 MEH: stdsci well past need for normal restart, doing before nightly science