PS1 IPP Czar Logs for the week 2012.04.02 - 2012.04.08


Monday : 2012.04.02

Mark is czar

  • 07:00 full night of 600 exposures, data still downloading (slower than normal?)
  • 09:20 data finished downloading ~30min ago, 99% processed.
  • 18:05 Bill added one complement of wave3 and one of compute to deepstack (staticsky)
  • 18:51 Bill power cycled ipp015 because it died with a hardware problem: "This is not a software problem!"

Tuesday : 2012.04.03

Serge is czar

  • 09:00 No observation last night. No LAP queued
  • 15:00 Heather has some stuff for the cluster (jtrp), low priority.
  • 15:27 ippb02 and ippb03 are now up from the nebulous point of view
  • 21:36 Bill: Restarted registration pantasks
  • 23:55 Mark: stdscience seems to be under-running, going to try a restart. oops, lost cut/paste of the jtrp label, was it MD03.jtrp.v2?

Wednesday : 2012.04.04

  • 06:30 Bill: ipp015 crashed again. Set it to off in nebulous and power cycled it. The nebulous apache servers seem to be running amok. Set pantasks to stop to let things calm down.
  • 06:50 ipp015 did not come up after 3 power cycles. Leaving it powered on.
  • 07:00 restarted apache servers on ippc01 - ippc10; set pantasks to run
  • 11:35 Serge: rebooted ipp021 (after 203 days) since it has a weird behavior (can't 'ls /export/ipp021.0/nebulous/1c/54' for instance). dmesg claims that process 7543/nfsd is "Tainted".
  • 12:04 Shut down the ipp to switch to ipp-20120404. Started most things up but ran into a config file problem that causes everything to fail. Also one of the new noisemap detrend files is only located on ipp015 so chip processing is stalled. Setting pantasks states to stop until that gets fixed.
  • 13:52 Operations back to normal except that stdscience reverts are off
  • 15:03 Bill deleted ipp015 from the stdscience hosts lists. Since the files on ipp015 are not available from there, there is no need to have it in the lists to throttle the load.
  • 15:27 CZW: Initiated large detrend tests on ippc50 and ippc51 to scan for valid date ranges on noisemap and dark model. These hosts will have a large load as these tests run, but I do not anticipate any problems.
  • 16:00 heather noticed throughout the day that the jtrp labels were added back into stdscience, and that it was reverted once the noisemap/ipp015 problem was resolved. Whoever it was, thanks! :)

Thursday : 2012.04.05

Roy is czar

  • 08:00: no data last night, no LAP. All quiet....
  • 08:30: Haydn to shut down ippc04, so:
    • Bill removed ippc04 from the list of nebservers in IPP's .tcshrc
    • Bill restarted deepstack manually, as it is not in the list managed by ./check_system.sh
    • Actually deepstack was set to stop. It will take a while for the running jobs to finish
    • Roy shut down and restarted all other pantasks_servers like this:
./check_system.sh stop
./check_system.sh shutdown
./check_system.sh start.server
./check_system.sh setup
./check_system.sh run
  • 08:57: Bill rebuilt warps stuck on ipp015 for -warp_id 393755 --skycell_id skycell.1621.031 and skycell.1621.032. These were the cause of the single faulted diff on czartool.
  • 09:25 deepstack restarted with wave3 hosts added. 7 czw.SAStest runs reverted
  • 11:30 ipp015 is alive again, so:
neb-host ipp015 up
  • 11:45: ipp015 still DOWN in pantasks, so restarting everything again, as above. ipp015 magically reappeared in pantasks_client.
  • 14:15 Serge: Restarted shuffling (target.on)
  • 18:45 CZW: I've enabled the detrend pantasks to process detrend data over the long weekend.
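
Note on the 08:30 and 11:45 restarts: the five check_system.sh steps were typed by hand both times. A minimal sketch of a wrapper that runs them in the same order and stops at the first failing step is given here. The function name restart_pantasks and the CHECK_SYSTEM variable are hypothetical (not part of the IPP tools), and the sketch assumes check_system.sh returns a non-zero exit status on failure, which this log does not confirm.

```shell
#!/bin/sh
# Hypothetical helper, not part of the IPP tools.
# CHECK_SYSTEM is an assumed override so the script path can be changed.
: "${CHECK_SYSTEM:=./check_system.sh}"

restart_pantasks() {
    # Same order as the manual sequence in the 08:30 entry.
    for step in stop shutdown start.server setup run; do
        echo "running: $CHECK_SYSTEM $step"
        # Assumption: a non-zero exit status means the step failed.
        if ! "$CHECK_SYSTEM" "$step"; then
            echo "step '$step' failed, aborting restart" >&2
            return 1
        fi
    done
}
```

Sourcing the file and calling restart_pantasks from the directory that holds check_system.sh would run the sequence once. deepstack is not in the list managed by check_system.sh, so it would still need to be handled by hand.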

Friday : 2012.04.06

  • 11:30 Mark: setting up deepstackonly pantasks to temporarily run on ippc20, using compute2 initially until we are sure compute3 is not too overloaded with ppMerge+psphotStack.

Saturday : 2012.04.07

Sunday : 2012.04.08