PS1 IPP Czar Logs for the week 2014-01-06 - 2014-01-12

Monday : 2014-01-06

Bill is czar today

  • 10:15 stdscience has finished. Restarted postage stamp pantasks. Added 2 x compute3 and 2 x compute2
  • 10:38 restarted cleanup pantasks, adding the label goto_cleaned_redo, which reruns chip cleanup processing (need to implement label priorities...)
  • 13:12 restarted pantasks except for cleanup, which is busy, and replication, which is currently not in the list managed by check_system.sh. It is working, though.
  • 13:45 LAP label added into stdscience
  • 13:55 big camera backlog for LAP due to the troubles the other night. Increased the pending count to 40 from the nominal value of 25 ( set.camera.poll 40 ); see the sketch after this list.
  • 16:00 set camera poll back to 25
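
  For reference, the camera poll change above was made at the stdscience pantasks console. A minimal sketch of the two commands as used that day (the command name and values are taken from the log; anything else here is assumed):

      # in the stdscience pantasks console
      set.camera.poll 40    # raise the pending camera_exp limit from the nominal 25 to drain the LAP backlog
      set.camera.poll 25    # restore the nominal limit once the backlog clears (done at 16:00)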

Tuesday : 2014-01-07

Bill is czar today

  • 14:02 stopping processing in preparation for daily restart
    • 15:12 it took a long time for the running camera_exp processes to finish. It seems like the cluster may be bogged down as a side effect of running Chris' file check scripts.
    • Restarted pantasks except for cleanup. Leaving the LAP label enabled, but it may not make much progress.
  • 16:51 Chris made some changes and the sluggishness is much reduced, perhaps gone. cleanup is still off.

Wednesday : 2014-01-08

  • 11:04 Bill: The camera stage seems to be the bottleneck in LAP processing. Increased the camera poll limit from 25 to 50.

Thursday : 2014-01-09

  • CZW: 13:10 Just now noticed that czartool has been dead since the MHPCC communications interruption. Restarted it, and now I'm going to work on clearing out LAP jobs that are stuck.

Friday : 2014-01-10

  • 10:45 CZW: I noticed things were clogged earlier. I restarted apache on a number of the nebulous server hosts, which has allowed nebulous requests to complete again (neb-touch had been timing out, and a number of summit copy jobs had similar issues). This doesn't seem to have solved everything, and the load on stsci02 (250-260?) is very suspicious. I've asked Gavin to reboot it (although since it is accessible, we can do command-line reboots; I did not realize this before now).
  • 11:45 CZW: That didn't seem to work as well as I'd hoped, as the load on stsci02 immediately shot back up to 258. I'm force-unmounting (umount -f) the stsci02 mounts on all hosts that are accessing it; see the sketch after this list.
  • 14:00 CZW: stsci02 is still running with a load of 260, and the force unmounting has done nothing to fix things. /var/log/messages contains entries like "Jan 10 13:45:01 stsci02 [10083.305233] Controller in crit error", which Google suggests are RAID issues. I'm beginning to suspect this is causing all disk accesses to hang, and that is what's causing the problems. I have set stsci02 to repair in nebulous, which should prevent new data from being written there. Since the remaining nightly data was downloaded after I did this, we should be able to complete it without much difficulty.
  • 14:54 CZW: Haydn reports that the RAID on stsci02 is having problems, so I've set it to down in nebulous.
  • 15:44 CZW: I've set ipp031 to down as well, as it still has a crossmount/symlink arrangement with data on stsci02.
  • 16:00 CZW: Gavin re-rebooted stsci02, and it now properly detects the disks. I've set stsci02 and ipp031 back to repair.
  • 17:30 CZW: I think we've finished processing everything that's going to process right now. The diffs that have faulted appear to be dependent on warps that were processed before stsci02 went down the first time.
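
  The "umount -f" sweep at 11:45 was run host by host. A minimal shell sketch of that kind of sweep, assuming ssh access to the processing nodes; the host list and the stsci02 mount point below are placeholders, not the actual values used:

      #!/bin/bash
      # Force-unmount the hung stsci02 export on each processing host so that
      # stalled NFS I/O against stsci02 stops wedging client-side jobs.
      HOSTS="ipp001 ipp002 ipp003"     # placeholder host list
      MNT=/data/stsci02.0              # placeholder mount point for the stsci02 export
      for h in $HOSTS; do
          ssh "$h" "sudo umount -f $MNT" </dev/null \
              || echo "umount failed on $h (mount busy or already gone)"
      done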

Saturday : 2014-01-11

Sunday : 2014-01-12