PS1 IPP Czar Logs for the week 2013.03.11 - 2013.03.17

Monday : 2013.03.11

  • 09:40 Bill noticed that two distRuns from the other night had not completed. The problem was a missing output file (subkernel) from the input diffRun: the instance existed in nebulous, but the file was not present on stsci07. Reran the component using tools/rerundiffskyfile.pl, which worked (see the sketch after this list).
  • 13:06 Bill set all pantasks to stop in order to rebuild ippTools with a change to pstampRequest
  • 13:35 Bill: Restarted stdscience and set other pantasks to run
  • 13:50 Bill: regenerated two lost burntool tables and recovered o5518g0719o.ota26.fits
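
The missing-subkernel problem at 09:40 is a case where a file has an instance recorded in nebulous but no longer exists on the data node. A minimal sketch of that kind of existence check is below; the instance list, key, and path are hypothetical placeholders, and the actual repair was done with tools/rerundiffskyfile.pl.

    # Hypothetical check: confirm whether files that nebulous believes exist
    # are actually present on disk, to decide which diff components to rerun.
    # The (key, path) pairs are placeholders; the real lookup goes through
    # the nebulous database / client tools.
    import os

    instances = [
        ("neb://.../diff.sub.kernel", "/data/stsci07.0/nebulous/.../diff.sub.kernel"),
    ]

    for key, path in instances:
        if os.path.exists(path):
            print("ok:      %s" % key)
        else:
            print("MISSING: %s -> rerun the component that writes it" % key)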

Tuesday : 2013.03.12

  • 15:11 Bill: I am running a new task and script to process lapGroups, which are collections of lapRuns that have "finished" and include each of the 5 filters. This queues staticsky runs for all filter combinations (sketched below).
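
A rough sketch of the lapGroup selection described above: gather "finished" lapRuns, group them (here by a placeholder sky-region key), and keep only groups that have all 5 PS1 filters (grizy). The record layout and field names are illustrative, not the actual database interface, and "all filter combinations" is read here as enumerating filter subsets.

    # Sketch: form lapGroups from finished lapRuns and queue staticsky work
    # for complete groups. Records and field names are placeholders.
    from collections import defaultdict
    from itertools import combinations

    FILTERS = {"g", "r", "i", "z", "y"}   # the 5 PS1 filters

    # placeholder lapRun records: (sky_region, filter, state)
    lap_runs = [
        ("skycell.1234", "g", "finished"),
        ("skycell.1234", "r", "finished"),
        # ... more records ...
    ]

    groups = defaultdict(set)
    for region, filt, state in lap_runs:
        if state == "finished":
            groups[region].add(filt)

    for region, filts in sorted(groups.items()):
        if filts == FILTERS:              # lapGroup has all 5 filters
            for n in range(1, len(FILTERS) + 1):
                for combo in combinations(sorted(FILTERS), n):
                    print("queue staticsky:", region, "+".join(combo))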

Wednesday : 2013.03.13

  • 08:45 Bill: cleanup seems to be running very slowly. Its pantasks_server and pcontrol are CPU hogs. Set it to stop in preparation for restarting it.
  • 08:55 Bill: set staticsky to off in the stack pantasks. It is currently working on skycells in the galactic plane and is running into memory problems on the compute2 nodes. After it settles out, I will start up the deepstack pantasks for a couple of days.
  • 10:50 CZW: stopped cleanup at Serge's request. This should allow ippdb02 to catch up within 72000 seconds (see the replication-lag check sketched after this list).
  • 11:15 CZW: restarted stack and stdscience to allow Bill to fork staticsky processing back to the deepstack pantasks. The restart was needed to ensure the compute3 resources are allocated appropriately.
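
On the 10:50 entry: assuming ippdb02 is a MySQL replica of the main gpc1 database, the 72000-second figure corresponds to its replication lag, which can be read from SHOW SLAVE STATUS. A minimal check is sketched below; host, user, and password are placeholders.

    # Hypothetical replication-lag check for ippdb02 (assumed MySQL replica).
    import mysql.connector

    conn = mysql.connector.connect(host="ippdb02", user="czar", password="...")
    cur = conn.cursor(dictionary=True)
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
    lag = status["Seconds_Behind_Master"] if status else None
    print("replication lag: %s seconds" % lag)   # ~72000 s at the time of this entry
    conn.close()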

Thursday : 2013.03.14

Bill is czar today

  • 19:07 Bill: stdscience pcontrol is spinning. Setting stdscience to stop in preparation for restart.
  • 19:47 restarted stdscience, registration, and summitcopy

Friday : 2013.03.15

Bill is czar today

  • 09:49 ipp020 has been down for a couple of hours. Power cycled it. Beginning of crash dump was
    <Mar/15 08:35 am>[3690682.353980] general protection fault: 0000 [#1] SMP
    <Mar/15 08:35 am>[3690682.354269] last sysfs file: /sys/class/i2c-adapter/i2c-0/name
    
  • 12:20 Since staticsky has moved out of the galactic plane, doubled the number of compute3 hosts working on staticsky in deepstack
  • 14:27 We are getting very low on space. There are about 40,000 chipRuns in state goto_cleaned. Many of these are sparsely populated because they were updated by postage stamp requests. I have changed the label for all of the runs to goto_cleaned.wait, except for those with data_group like 'LAP.ThreePi.20130706%' or data_group like 'ThreePi.201303%' (recent nightly data). This reduces the number of runs to be cleaned to 19761, but should yield more bytes cleaned per run (see the sketch after this list).
  • 18:30 MEH: looks like ipp020 is having trouble, chip updates taking >1hr to finish. taking out of processing
  • 19:30 MEH: fixing 3 bad LAP burn.tbl
  • 20:50 MEH: stdscience is reaching the point for its regular restart and is running underloaded. Restarting before any nightly science.
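
The 14:27 relabelling amounts to a conditional update of chipRun labels in the gpc1 database: everything labelled goto_cleaned gets goto_cleaned.wait unless its data_group matches the LAP or recent-nightly patterns. The sketch below is illustrative only; the table and column names, the connection details, and the use of raw SQL are assumptions (the change was presumably made with the ipp tools), and the reverse swap on Saturday simply exchanges the two label values.

    # Illustrative sketch of the relabel; schema names are assumptions.
    import mysql.connector

    SQL = """
        UPDATE chipRun
           SET label = 'goto_cleaned.wait'
         WHERE label = 'goto_cleaned'
           AND data_group NOT LIKE 'LAP.ThreePi.20130706%'
           AND data_group NOT LIKE 'ThreePi.201303%'
    """

    conn = mysql.connector.connect(host="ippdb01", user="czar", password="...",
                                   database="gpc1")
    cur = conn.cursor()
    cur.execute(SQL)
    print("relabelled %d chipRuns" % cur.rowcount)
    conn.commit()
    conn.close()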

Saturday : 2013.03.16

  • 18:20 MEH: restarting cleanup and removing the extra wave4 hosts before nightly science; the LAP backlog has caught up.
    • now that the LAP/update cleanup has caught up, there are >4k stacks to run; maybe compute3 should be reallocated back to stack
  • 18:50 MEH: doing the normal restart of stdscience before nightly science starts, to keep the rate up for LAP if there is no nightly science
    • removing ipp020 again since it was behaving badly yesterday
  • 20:15 MEH: ippdb02 has caught up from the cleanup load; going to swap the goto_cleaned.wait label set on Friday on the chipRuns back to goto_cleaned

Sunday : 2013.03.17

  • 14:50 MEH: many, many stacks to run. Looks like staticsky finished the GC? Turning it off in deepstack and manually re-allocating hosts to stack and stdscience for a while
  • 18:40 MEH: starting stdscience regular restart again