PS1 IPP Czar Logs for the week 2015.05.11 - 2015.05.17


Monday : 2015.05.11

  • Haydn is looking into ipp029 not booting -- the motherboard has had to be replaced twice. Back to neb-host repair and leaving it out of nightly processing tonight
  • 16:00 MEH: Gene finished the rsync of ipp091 and ipp092; neb-host set to up again
  • 19:40 MEH: stsci13 seems to have been unresponsive for the past hour -- power cycling it
  • 20:30 EAM: restarting full-force
  • 21:30 MEH: doing some cleanup that was missed last week amid all the system up/down issues

Tuesday : 2015.05.12

  • 10:25 MEH: ipp029 still down; it needs to be set to neb-host down
  • 15:00 CZW: ipp/replication pantasks is running the warp fpa.res.fits removal. Currently just CNP, but I'll follow with the MD and 3PI data as the current files complete.
  • 16:15 CZW: ipp029 back up, back to neb-host repair.
  • 16:45 CZW: ganglia reports that the load on the cluster has dropped, which seems to be a symptom of the ff pending list shrinking. We've been running a bit faster than before, so this may be part of the reason (we're processing faster than the define step can queue new work). I'm going to take this chance to restart the local ff pantasks.
  • 19:55 EAM: I'm stopping pv3ffrt and pv3fflt for a while so I can run a dvomerge of the Tycho data into the pv2 dvo. This will also give the ff define jobs a chance to make some headway.

Wednesday : 2015.05.13

  • 04:39 Bill: gpc1 chip processing is about 170 exposures behind, with a throughput of about 25 exposures. I suspect that part of the problem is that the stdscience pantasks has been running for several days. I'm going to restart it.
  • 06:47 Bill: cleared a gpc2 diff fault with difftool -updatediffskyfile -dbname gpc2 -diff_id 1389 -skycell_id skycell.2067.089 -set_quality 42 -fault 0
  • 08:00 EAM: restarted pv3ffrt & pv3fflt
  • 13:50 EAM: rebooting stsci05
  • 14:30 CZW: set ippb0[4-5] to repair in nebulous. They're up and available, so they should be set that way.
  • 17:00 CZW: noticed stsci01 is down. cycling power.

Thursday : 2015.05.14

  • 06:40 EAM: rebooting stsci14.
  • 06:50 Bill: cleared recurring warp fault with warptool -updateskyfile -warp_id 1555968 -skycell_id skycell.1122.043 -set_quality 42 -fault 0

Friday : 2015.05.15

  • 01:45 Bill: I noticed that I had left some code commented out in czartool_labels.php. Fixed that, and the corrected page showed that replication to ipp001 had failed. The problem was that ipp001 was out of space. Ran the following command to prune old database dumps. Need to add a cron job to do this regularly (a possible entry is sketched after this list).
    ipp@ipp001:/data/ipp001.0/ipp/mysql-dumps>~/mysql-dump/delete_log_spacing.sh mysql-gpc1 commit
    
  • 07:00 Bill: replicating all of the czardb schema changes took quite a while, which isn't surprising since they took a long time to execute on the master. Since replication was restarted, the seconds-behind has increased to 77366. Now that the schema changes have completed, the slave is starting to catch up (a quick way to monitor this is sketched after this list).
  • 09:24 Bill: warptool -updateskyfile -set_quality 42 -fault 0 -warp_id 1556762 -skycell_id skycell.0859.079
  • 17:06 HAF: restarting pantasks for the night (Mark's suggestion, thanks Mark!)
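
  A crontab entry along these lines, for the ipp user on ipp001, would cover the regular dump pruning noted in the 01:45 entry (the schedule and log path here are illustrative, not an installed job):
    # prune old gpc1 database dumps once a week
    30 5 * * 0 $HOME/mysql-dump/delete_log_spacing.sh mysql-gpc1 commit >> $HOME/delete_log_spacing.log 2>&1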
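
  To watch the slave catch up after the schema changes, the usual check is the slave status on ipp001 (assuming client credentials are already set up, e.g. in ~/.my.cnf):
    mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'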

Saturday : 2015.05.16

Sunday : 2015.05.17

  • 00:50 MEH: registration held up again -- manually cleared, and 11 exposures processing now (stalled for a few hours?)
    regtool -updateprocessedimfile -exp_id 914956 -class_id XY13 -set_state pending_burntool -dbname gpc1
    
  • 00:52 MEH: some WS diffs have been running for 14 ks -- the logs look like they're waiting on a DB transaction at the end. Suspect this is the same thing that hiccuped registration, except that seemed to clear a little while after stsci came back up (a query for spotting long-running transactions is sketched after this list).
  • 00:54 MEH: stsci16 down, trying power cycle -- back up
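
  To spot the long-running DB transactions suspected above, a query against MySQL's processlist on the gpc1 database host works; the 600-second threshold here is arbitrary:
    mysql -e "SELECT id, user, db, time, state, LEFT(info,80) AS query FROM information_schema.processlist WHERE command != 'Sleep' AND time > 600 ORDER BY time DESC"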