
Monday : 2012.08.27

  • 09:05 EAM: processing was running a bit slowly, with no long-running jobs outstanding. I'm doing a quick test to see if simply restarting the controller improves the throughput. I stopped all jobs (pantasks: stop) and waited until the controller was empty, then exited the controller (pantasks: control exit true) and restarted it by reloading all of the hosts (pantasks: hosts.stdscience). I am waiting for the hosts to connect before restarting the processing (the command sequence is recapped after this day's entries).
  • 12:45 (Serge): Stopped all processing. All pantasks servers shut down.
  • 13:05 (Serge): Stopped mysql server on ippdb01. Rsyncing ippdb01:/var/lib/mysql to ippdb03:/var/lib/ippdb01
  • 13:10 (Serge): Stopped mysql server on ippdb03
  • 15:00 (Serge): Restarted replication on ippdb03 and CzarPoll on ipp009
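
For reference, here is the controller restart from the 09:05 entry written out as a command sequence at the stdscience pantasks prompt. The command names are the ones quoted in that entry (stop, control exit true, hosts.stdscience); the waits and their ordering follow the description above, so treat this as a recap of what was done rather than a general recipe.

        stop                  # stop handing out new jobs
        # ... wait until the controller is empty (no jobs left running) ...
        control exit true     # exit the controller
        hosts.stdscience      # reload all of the hosts, restarting the controller
        # ... wait for the hosts to reconnect, then resume processing ...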

Tuesday : 2012.08.28

  • 08:45 (Bill): turned off chip revert so that we can see if any chips are failing due to missing files
  • 09:15 (Serge): Stopped replication on ippdb03. Dumping to /export/ippdb03.0/backups/dump_ippdb01.20120828.sql.bz2
                    Master_Host: ippdb01.ifa.hawaii.edu
                    Master_User: repl_gpc1
                Master_Log_File: mysqld-bin.023064
            Read_Master_Log_Pos: 44362377
                Replicate_Do_DB: czardb,gpc1,ippadmin,ipptopsps,megacam,ssp,uip
    
  • 11:25 (Serge): Replication restarted on ippdb03. Copying and then ingesting the dump into ipp001.
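
A sketch of the dump and re-seed steps behind the 09:15 and 11:25 entries, using standard MySQL replication commands. The host names, databases, dump path, and master coordinates are the ones recorded above; the exact invocations (options, credentials) are assumptions, and starting ipp001 from Read_Master_Log_Pos assumes the slave SQL thread on ippdb03 had caught up before the dump was taken.

        # on ippdb03: pause replication and dump the replicated databases
        mysql -e "STOP SLAVE;"
        mysql -e "SHOW SLAVE STATUS\G"   # record Master_Log_File / Read_Master_Log_Pos (values above)
        mysqldump --databases czardb gpc1 ippadmin ipptopsps megacam ssp uip \
            | bzip2 > /export/ippdb03.0/backups/dump_ippdb01.20120828.sql.bz2
        mysql -e "START SLAVE;"          # 11:25: replication restarted on ippdb03

        # on ipp001: load the dump, then point replication at ippdb01 from the recorded position
        bzcat dump_ippdb01.20120828.sql.bz2 | mysql
        mysql -e "CHANGE MASTER TO
                    MASTER_HOST='ippdb01.ifa.hawaii.edu',
                    MASTER_USER='repl_gpc1',
                    MASTER_PASSWORD='...',
                    MASTER_LOG_FILE='mysqld-bin.023064',
                    MASTER_LOG_POS=44362377;
                  START SLAVE;"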

Wednesday : 2012.08.29

Bill is acting czar today.

  • 07:20 regenerated several burntool tables. See tools/fixburntool. Use with care.
  • 07:30 stopping stdscience in preparation for restart. (throughput had dropped to 50 per hour)
  • 08:31 stopped warp revert to debug a repeating assertion failure
  • 09:52 stopping processing to pick up bug fix in psLib
  • 10:03 processing set to run; warp revert turned back on
  • 13:35 (Serge): Replication restarted on ipp001

Thursday : 2012.08.30

Mark is Czar

  • 06:20 small amount of nightly science almost through; then will do the sub-daily restart of stdscience to bring the LAP rate back up. Based on yesterday's rate plot, the restart probably needs to be done every 9-10 hrs (120/hr to reach ~100k in pantasks) to keep the rate >100/hr, so ~5pm at the latest.
  • 07:30 LAP thin on chips to run, set.poll 600 to keep pantasks mostly busy even if just warps.
  • 14:03 (Bill) rebuilt ppStack with chisq rejection fix
  • 14:50 MEH: Chris noticed the stsci nodes running pretty hard. 99% of the jobs are warps/stacks, so dropping the poll in stdscience from 600 to 400 to see if we are overdriving the systems/NFS (so that if there are no chips, not all of stdscience's 550 nodes will be processing warps).
  • 15:45 restarted stdscience; only warps to do and it barely kept 200/400 loaded. Will re-monitor the warp rate from here. Serge needs ippc63 removed from processing while the nebulous replication there catches up.
  • 16:40 increasing the poll above ~450-500 doesn't seem to help the warp rate when only warps are running, so setting it to 500 again (poll commands are recapped after this day's entries).
  • 17:00 added compute3 from deepstack (since it is inactive) into the stack pantasks to help push through the +8k LAP stacks.
  • 23:50 stdscience rate dropping <100, preparing for restart.
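
The poll changes from today's entries as they would be issued at the stdscience pantasks prompt. set.poll is the command named in the 07:30 entry; writing the later 400/500 adjustments as set.poll <N> is an assumption.

        set.poll 600    # 07:30 - keep pantasks busy while LAP is thin on chips
        set.poll 400    # 14:50 - back off in case warps/stacks are overdriving the stsci nodes / NFS
        set.poll 500    # 16:40 - raising the poll above ~450-500 did not improve the warp rate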

Friday : 2012.08.31

Mark is czar

  • 06:40 healthy amount of nightly science data still downloading, ~130 exposures left to go. fixing 5 LAP chips.
  • 08:10 fixing 4 more LAP chips. It seems someone else may have crossed paths with one already in progress after ~08:40... dangerous.
  • 08:55 (Bill) fixed 3 more LAP chips
  • 09:25 MEH: nightly science finally finished downloading, remaining 3PI finishing up processing.
  • 10:10 stdscience finished nightly processing and pantasks is having trouble keeping LAP loaded; restarting. The difftool command seems to stall and prevent stdscience from shutting down, and has to be killed. Chris thinks this has to do with keeping warps online now and having to scan through all of them, so it should be rewritten.
  • 12:00 taking compute3 back out of stack to re-activate deepstack for MD09 refstack redo.
  • 15:00 like yesterday, using lap_science.pl --monitor_mode to help keep stack+diffims loaded
  • 15:10 deepstack running MD09 revised refstack version, should run over weekend.
  • 17:40 spot-checking MD01 diffims from last night; an odd stripe feature the size of a cell block is showing up (input stack_id=1283588, input warp_id=530995, skycell.036). Chris tracked it down to ota14, cell 31; 3/8 chips show it, and in past observations it has often been seen in y-band only. The cell may just need to be masked, at least in y-band.
  • 17:50 stdscience struggling to stay loaded (warp.skycell.run @ 70.5k, others <50k); restarting

Saturday : 2012.09.01

  • 09:45 MEH: restarting stdscience
  • 10:05 fixing a couple of LAP camera faults and several LAP chip faults
  • 11:45 fixing more LAP chips - done
  • 15:30 LAP warps are way behind again; turning chip processing off (chip.off) to try to push hard on warps.
  • 18:50 stdscience restart before nightly science begins

Sunday : 2012.09.02

  • 19:59 Bill noticed that a postage stamp job was stuck in the queue for more than 24000 seconds. The script was running on ipp016 but the log file was empty and nothing was happening. I killed it but got tired of waiting for pantasks to give up on it. Restarted the pstamp pantasks and the job completed promptly.
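
A rough sketch of the kind of check behind this entry: find the stuck postage stamp script on the node, confirm its log is empty, and kill it before restarting the pstamp pantasks. The process pattern and log path below are placeholders, not the actual pstamp script or log names.

        # on ipp016: locate the stuck postage stamp script (grep pattern is a placeholder)
        ssh ipp016 'ps -ef | grep [p]stamp'
        # confirm the job's log file is empty / not growing (path is a placeholder)
        ssh ipp016 'ls -l /path/to/pstamp/job.log'
        # kill the stuck process by its pid, then restart the pstamp pantasks server
        ssh ipp016 'kill <pid>'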