PS1 IPP Czar Logs for the week 2013.02.04 - 2013.02.11

Monday : 2013.02.04

  • 11:30 Bill: removed 137 skycal runs from the 20110806 label that had their labels changed to 20120706 by mistake. None of these ran.
  • 11:45 Bill: turned on skycal. Shut down deepstack. We're done with staticsky for the LAP stacks finished last year. Will queue some new ones later.

Tuesday : 2013.02.05

  • 13:59: CZW: Moved ipp006 and ipp007 from "wave1" to "wave1_dvo" in the pantasks.hosts file. I also re-enabled the compute3 machines for stdscience. Restarting stdscience for these changes to take effect.

Wednesday : 2013.02.06

  • 16:30 Serge: Queued 4 exposures for stdscience reprocessing (label: MopsSynthetics)

Thursday : 2013.02.07

  • 09:25 Bill queued 5-filter staticsky runs for skycells with 4h < RA < 6h. Adding staticsky to the stack pantasks.
  • 10:45 Bill reverted 4 stacks that got fault 2 due to nfs issues.
  • 10:50 Fixed some broken files
    recovered lost instance of o5507g0448o.ota25.fits using
    reran burntool for o5464g0516o.ota26
    recovered lost instance of o5463g0418o.ota23.fits
    reran burntool for o5170g0154o.ota32.fits 
  • 10:50 Serge: restarted stdscience
  • 14:27 Bill: set stack and stdscience pantasks to stop in preparation for load rebalancing
  • 14:30 Bill: set 2 x compute3 to off in stdscience and added them to stack. Set 2 x compute2 to off in stack. Not adding them to stdscience for now since they are still overloaded with existing processing.
  • 14:38 Bill fixed burntool table for o5456g0481o.ota26.fits
  • 14:45 Bill is running some M31 jobs on compute3 and wave4. It should only take a few minutes.
  • 16:05 Bill's jobs are done. Setting 2 x compute3 back to on in stdscience.
  • 16:10 restarted distribution. The job counts were all mangled by timeouts.

Friday : 2013.02.08

  • 09:45 Bill stopped stdscience because of massive job failures. Looks like an nfs or nebulous problem.
  • 09:51 Bill: The problem is that stsci00.2 is 100% full, yet nebulous was still trying to put files there. Set it to repair. Leaving stdscience stopped for a bit.
  • 10:05 Bill restarted stdscience to allow last night's warps to have the cpus.
  • 10:33 Bill: added a few lines to to fix a missing prepare_output(). Warps are done, Turned chip back on. Set stack back to run, but turned staticsky off to allow the nightly stacks to finish

Bill is claiming to be czar for yesterday and today

  • 11:08 staticsky back on
  • 12:24 staticsky run 391667 is failing because one of its input files does not exist. That happened because stackRun 1927024 ran twice on December 7. I am rerunning the stack using tools/runstackskycell. Cleanup set to stop to ensure that the input files don't get cleaned up.
  • 12:33 restarted cleanup. If the warps haven't been cleaned since Dec 7 then they probably won't go away in the next 15 minutes.
  • 13:00 there are about 15000 chip runs in state error_cleaned. All of the ones that I checked were caused by missing or unavailable mdc files. Next week I'll revise the code to just skip the component and stop setting error_cleaned in this case. There were also a couple of assertion failures from censorObjects. These need to be handled more gracefully. The chip_ids for those: 57345, 115119, and 118619.
  • 13:30 We're getting occasional fault 2 errors from stack. Since stack.revert only reverts -fault 2 errors, turned stack.revert.on.

Saturday : 2013.02.09

  • 09:00 Bill restarted summit copy and registration after killing two jobs that had been stuck for hours on ipp057. The affected files were darks so they did not affect last night's processing.

Sunday : 2013.02.10