PS1 IPP Czar Logs for the week 2013.08.05 - 2013.08.11

(Up to PS1 IPP Czar Logs)

Monday : 2013.08.05

Bill is czar today.

  • 09:00 queued M31 r-band exposures from 2010-11-15 through 2012-12-31
  • 09:30 Rita is shutting down ipp031 and ipp032
  • 12:40 MEH: 2x compute3 off in stack and 1x compute3 deepstack for MD01 refstack - will turn off and on when finished/as needed
  • 13:06 stopping processing for schema update (ippRelease and relExp tables only)
  • 13:38 rebuilt tag with schema changes (psbuild -start glueforge). Restarting pantasks except for deepstack, whose running processes survived the rebuild. Since Ohana was not rebuilt, the pantasks processes should be fine.

Tuesday : 2013.08.06

  • 09:02 Bill queued M31 r-band exposures from 2011
  • 13:50 MEH: if compute3 is off in stack, then they are in use for deepstack. will try to squeeze in runs while M31 is running (i.e., no LAP stacks)
  • 15:07 Bill rebuilt tag to add relExp.mcal column to schema. Restarted stdscience.
  • 15:50 Bill is running a script on ippc30 that will run a couple hundred thousand releasetool -updaterelexp commands. It is unlikely that this will affect anything.
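A minimal dry-run sketch of what such a bulk-update script might look like (the id list, file names, and -exp_id argument here are assumptions for illustration, not Bill's actual script):

```shell
# Hypothetical sketch: generate one releasetool -updaterelexp command per
# exposure id listed in a file. Commands are echoed as a dry run rather
# than executed.
printf '%s\n' 1001 1002 1003 > /tmp/relexp_ids.txt   # stand-in id list
while read -r exp_id; do
    echo "releasetool -updaterelexp -exp_id $exp_id"
done < /tmp/relexp_ids.txt > /tmp/relexp_cmds.txt
cat /tmp/relexp_cmds.txt
```

Generating the command list first (instead of executing each command directly) makes it easy to checkpoint and resume if the database load needs throttling.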

Wednesday : 2013.08.07

  • 09:04 Bill: nightly science is bogged down with warps waiting to put data on stsci07 and stsci17. Perhaps we should turn off those tasks while processing nightly science.
  • 09:25 CZW: Since the load on the stsci nodes is basically zero, I increased the "unwant" parameter in stdscience: "controller parameters unwanted_host_jobs = 10". This was 7, and my hope is that we can get through the backlog for the stsci node jobs by increasing the number of remote jobs that will write to those disks. This assumes that the stsci nodes continue to have low loads.
  • 11:58 Bill set bg.on
  • 16:24 CZW: Stopped processing to pick up some changes in ippScripts: - Set verbose = 0 for regtool -checkburntool, as this dumps a lot of useless information into the log files; - Add b1jpeg to the list of things cleaned, as these are binned at an unreasonably large resolution. Restarted stdscience to counter the gradual slowdown that happens when it runs many jobs.

Thursday : 2013.08.08

Mark is czar.

  • 09:00 Gene stopping/restarting stdscience and adding more verbosity and running delay times to publishing
  • 10:00 Bill briefly stopped stdscience and pstamp to pick up changes to releasetool and the postage stamp server code. This may have caused a few job errors due to the sql files getting reinstalled while jobs are in progress.
  • warp- and chip-stage postage stamps for calibrated 3PI.PV1 exposures will now have information about the photometric calibration added to the FITS headers.
  • 10:40 MEH: chip.on again in stdsci; everything is backlogged at warp, so there won't be any new chips besides the few remaining
  • 11:40 CZW: set "controller parameters unwant = 20" in stdscience, to see if that clears the warp bottleneck. This seems to have launched enough jobs to fill all processing cores in stdscience, without changing the load on the stsci nodes significantly.
  • 12:35 MEH: after Chris bumped unwant to 25, warp+diff used the full 445 running jobs. oddly, once nightly warps finished, diff was limited to 300 running jobs. when warp+diff were running, the breakdown was ~ barf... stsci0-9 overloaded.. bg also seems to be queuing lots of jobs along with skycell_jpeg.. warp/diff will have to wait until nightly data tonight.
    • turning things up slowly still also causes stsci to overload, unwant may need to be turned down.
  • 13:40 MEH: still overloading the system. note warp.summary is running skycell_jpeg, which Chris says should not be running (it gets turned on with warp.on/off). He will disable it in the task so it never runs again.
  • 15:19 Bill: rebuilt psModules with a bug fix to pmPSFtryFitPSF an hour or so ago. This fixed all but one of the outstanding diff faults. Turned the remaining one from a fault to a quality error with difftool -updatediffskyfile -fault 0 -set_quality 14006 -diff_id 461121 -skycell_id skycell.2631.092
  • 16:00 MEH: with the massive backlog of warps (and only warps), going to add extra compute3 to stdsci and raise the unwant to 30+ to see if can push through faster (at least the M31 reprocessing)
    • 9x compute3 in stdsci is working fine, unwant=35 (30 was similar but seemed to have trouble keeping fully loaded?) and have 542 warp+bg.warp jobs regularly running. the warp rate in czarplot is ~60/hr average (not peaks)
  • 17:50 MEH: up to 10x compute3 in stdsci, little help on M31 warp+bg; LAP warps started now, rate passing 250/hr and the system having trouble staying loaded. reached ~450 exp/hr. will reconfigure for nightly obs around 19:00.
  • 20:00 MEH: registration jobs taking 100s to run.. and LAP stack processing stalling/unstable..
    • ipp027 unhappy with mounts, taking out of processing and put into repair for now
    • restarting NFS a couple of times cleared it --
  • 21:30 MEH: nightly+LAP running; compute3 doesn't appear to be fully utilized, so adding +1x into stdsci for faster, lower-RAM jobs
  • 23:30 MEH: registration appears to be stuck on exposure o6513g0214o with pending_burntool.. 20 exposures behind and climbing.. probably from o6513g0213o stuck in summitcopy...
  • 00:00 MEH: ipp028 is borked.. will need a reboot but will wait until morning if possible -- taking ipp028 out of processing where possible (WS distribution, cleanup, LAP chip jobs stuck) and putting it in repair. restarted summitcopy; registration moving forward again (72 exposures behind..)
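The hourly rates quoted through the day (~60/hr, ~250/hr, ~450 exp/hr) come from czarplot. As an illustration only (this is not the actual czarplot code), such a rate can be estimated from a file of epoch-second job-completion timestamps:

```shell
# Hypothetical sketch: seven completions spaced 10 minutes apart over one
# hour, i.e. six completed intervals -> 6 jobs/hr.
printf '%s\n' 0 600 1200 1800 2400 3000 3600 > /tmp/warp_times.txt
n=$(wc -l < /tmp/warp_times.txt)
first=$(head -n 1 /tmp/warp_times.txt)
last=$(tail -n 1 /tmp/warp_times.txt)
# jobs per hour = completed intervals / elapsed hours
rate=$(( (n - 1) * 3600 / (last - first) ))
echo "rate: $rate jobs/hr"   # prints "rate: 6 jobs/hr"
```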

Friday : 2013.08.09

Mark is czar.

  • 07:30 MEH: just a few remaining 3PI diffs; will need to address the ipp027,028 issues
  • 10:00 MEH: while Rita is shutting down ipp032 for a disk upgrade, going to reboot ipp027 (mounts lost again)
  • 10:20 MEH: ipp027,028 were running the older 3.6.28 kernel but defaulted into 3.7.6 at reboot -- is this expected? wave1+2 are now set by default to boot into the 3.7.6 kernel.
    • ipp027 has actually been out of processing for a while because it was problematic; if the new kernel is fine, will want to try adding it back into processing
    • not going to reboot all wave1+2 into new kernel, will just do so as they have issues and need a reboot
  • 10:30 MEH: regular restart of stdsci
    • to get some warps->stacks going -- have to manually set unwant again, going up to 45 now with 2x compute3 added -- running ok but still hard to keep all nodes queued (300-350 out of ~380); poll at 800 but often ~500 (does the poll rate need an increase?)
      controller parameters unwant = 45
    • also raising camera.poll to 40
  • 12:20 MEH: stacks moving through, new chips showing up so chip.on. warp rate held at ~450/hr and stack ~400/hr and camera up to 200/hr before stacks triggered.
  • 14:40 Bill: Serge reported that the postage stamp server is sluggish. pcontrol was using a lot of CPU time, so I restarted the pantasks. Also reduced the poll limit for dependency checking from 64 to 32 with the command "set.dependent.poll 32"
  • 15:00 MEH: Serge asked for more pstamp nodes for MOPS -- adding 3x compute3, taking -1x compute3 from stdsci and stack
  • 16:20 MEH: Bill made bugfix change to ippTools/share/camtool_pendingexp.sql, stopping processing
    • several things show as modified (M) in svn for the tag in ippTasks and ippScripts; only rebuilding ippTools
    • will do regular restart of stdsci for processing
  • 17:10 MEH: activating MD01.refstack.20130731 and starting the backlog of diffims..

Saturday : 2013.08.10

  • 09:30 MEH: time for regular restart of stdsci

Sunday : 2013.08.11

  • 09:30 MEH: ipp052 down for ~1.5hrs, nothing on console. power cycling.. -- not booting.. leaving power off.
  • 10:30 MEH: regular restart of stdsci and taking ipp052 out of all processing