Monday : 2012.07.09

  • 09:00 Serge: gpc1 replication broken on ipp001. Dropped the database and am recreating it (see the rebuild sketch after this list).
  • 10:55 Bill: restarted distribution so that the dropped stages are added back in
  • 11:02 Bill: updated recipes/psphot.config to remove the limit on the number of peaks (previously 50,000)
  • 11:07 CZW: Stopped stdscience to prepare for a restart.
  • 11:25 CZW: Running again.
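
For reference, a generic outline of rebuilding a broken MySQL replication slave. This is a minimal sketch assuming a standard master/slave setup; only the gpc1 schema and the ipp001 slave come from the entry above, and these are not necessarily the exact commands that were run.

    # on the master: dump gpc1 with the binlog coordinates recorded in the dump
    mysqldump --single-transaction --master-data=2 gpc1 > gpc1.sql
    # on the slave (ipp001): drop and recreate the schema, then reload the dump
    mysql -e 'DROP DATABASE gpc1; CREATE DATABASE gpc1;'
    mysql gpc1 < gpc1.sql
    # point the slave at the binlog file/position noted in the dump and restart replication
    mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_LOG_FILE='<binlog file>', MASTER_LOG_POS=<pos>; START SLAVE;"
    mysql -e 'SHOW SLAVE STATUS\G'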

Tuesday : 2012.07.10

  • 11:00 Serge: Slave on ipp001 is back
  • 11:40 Mark: restarted stdscience; only 40-100 jobs were running at first, now back up to >450. Doing a large volume of MD07.mehtest updates for a few hours
  • 16:18 CZW: I've launched a second "stdscience"-like pantasks in ~ipp/ecliptic. I'm using this to process the w-band warps for the ecliptic plane to allow warp-stack diffs for MOPS. It is using compute3 and stsci nodes, but I may have oversubscribed these nodes somewhat. The data is being processed under the label "ecliptic.rp".

Wednesday : 2012.07.11

  • 15:58 Serge: set stsci06 to repair in nebulous (EAM: it never actually crashed; it was failing to respond to ganglia because of NFS hangups)
  • 17:05 EAM: after serious hang-ups, it looks like the cluster is now unwedged. It seems that the NFS servers on some of the stsci nodes were failing to respond to other stsci nodes and vice versa. I killed off active jobs as much as possible on those machines and then forced the machines to umount the hung partitions; this eventually worked. It is not clear what caused the initial problem, though.
  • 18:10 Mark: restarted all pantasks, turned detrend off, set neb-host stsci06 up.
  • 19:34 CZW: NFS issues returned. I was able to get hosts to calm down somewhat by running, with sudo: mount | grep stsci | awk '{print("umount -f",$3)}' | tcsh (spelled out in the sketch after this list). This points to a continuing issue between the stsci nodes and NFS.
  • 20:20 haf: NFS again. Set stdsci and addstars to stop for the moment... will try czw's trick
  • 20:29 haf: some combination of czw's trick, stopping stdsci, rain dances and animal sacrifices fixed it (for now).
  • 20:29 haf: eam killed all the jobs on stsci05 and force-unmounted stsci08.1 - he's restarting everything except deepstacks (he thinks ppStack on the stsci nodes causes some of the problems) - haf thinks it was eam, not her animal sacrifices, that fixed it...
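
The unwedging recipe above, spelled out as a rough sketch. The umount pipeline and the nebulous states come from the entries above; the host-level neb-host syntax is assumed to mirror the --volume form used later in this log, so treat the exact invocations as illustrative rather than a transcript.

    # mark the unresponsive host as repair in nebulous so no new data is written to it
    neb-host stsci06 repair
    # generate a force-umount command for each hung stsci NFS mount and run them via tcsh
    # (needs root, hence the sudo)
    mount | grep stsci | awk '{print("umount -f",$3)}' | sudo tcsh
    # once the node responds again, return it to service
    neb-host stsci06 up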

Thursday : 2012.07.12

  • 09:20 Mark: starting deepstack w/o stsci nodes
  • 15:00 Serge: ippb03.1 is 99% full (ippb03.2 is at 97%)
    neb-host --volume ippb03.1 repair
    

Friday : 2012.07.13

  • 10:41 Serge: Set ippb03.2 to repair (99% full)
    neb-host --volume ippb03.2 repair
    
  • 15:25 Mark: stopped stdscience to rebuild ippconfig, adding STATICSKY_DEEPCAL settings to psastro.config. Running again.
  • 17:40 Gene+Chris merged things into the tag, Gene rebuilt. Restarting all pantasks servers again.

Saturday : 2012.07.14

Sunday : 2012.07.15