PS1 IPP Czar Logs for the week 2011.01.17 - 2011.01.23


Monday : 2011.01.17

Tuesday : 2011.01.18

  • (bills 12:30) Two STS dist runs were stuck.
    • One was stuck due to a corrupted destreaked diffSkyfile. Set quality to 42, let the run complete, and set it back.
    • The other distRun was faulted at the advance stage. disttool -revertrun fixed that. I thought we had a revert task for that but apparently not.
  • Set STS.20101202 data (August) to be cleaned.
  • Queued STS.refstack.20101105 to be distributed by:
    1. changing dist_group to STS
    2. enabling the disabled distribution interest with: disttool -updateinterest -set_state enabled -stage stack -dest_name sts -filter i\%
    3. adding label to distribution survey list
    4. adding label to distribution pantasks.

Wednesday : 2011.01.19

  • (bills) 08:03 queued 2010-09-01 - 2010-09-02 STS data for processing. Added label to survey tasks for WSdiff, magic, destreak, and dist. (The STS science team has requested to drop the forced photometry reduction for now).
  • 08:42 ran --diff_id 102590 --skycell_id skycell.1771.069 --redirect-output to fix corrupted file.
  • 08:42 reverted diff_id 102558 skycell_id skycell.1260.089. It faulted again with an assertion failure.
  • 09:23 At Gene's request, changed distribution hosts to remove ipp009, ipp012, and ipp014 from the lists. Restarted distribution pantasks.

Thursday : 2011.01.20

Friday : 2011.01.21

Czar: Serge

  • Serge/07:10: Everything seems to be normal. Not all science exposures are registered yet, though.
  • Bill 07:40 Getting lots of destreak revert errors. It turns out that there was a bug in the new version of the script that we started using yesterday. Fixed the problem.
  • Serge/09:58: MD05 data seem to be stopped at warp stage (stuck for 1.5h at least).
    • From Bill: There are no MD05 reference stacks.
  • Bill has queued chipRuns to make new reference stacks for all filters. The label is MD05.refstack.20110121
  • Bill/15:08 processing is getting stuck at magic. The book magicToProcess was full of runs in state DONE. Ran magic.reset to attempt to free things up. Didn't help. Serge restarted distribution pantasks.
  • Bill: Earlier today I queued the following data groups for cleanup: STS.20100901, STS.20100902, STS.20100903. The last one (STS.20100904) is still running.

Saturday : 2011.01.22

Sunday : 2011.01.23

  • Bill/04:20 Lots of red on czartool and ganglia. ipp051 in swap heck. Disabled md05.refstack label to give some relief.
  • Bill/04:30 ipp033 and ippc09 down in ganglia. Nothing on console at all. No response. Power cycled.
  • 04:53 progress graph on czartool shows that things have been rather stalled for several hours. Since the reboots there have been some slight upticks. There may be some stale NFS mounts for ipp033.
  • 04:56 Killed all of the ppStack processes.
  • 17:00 - 20:20 What a mess! Many nodes not able to mount ipp033. Many nebulous log files with storage objects but no instances. Bill is slow, but eventually got it cleaned up, for the most part. No more 100-input stacks until we solve the memory problem.