PS1 IPP Czar Logs for the week 2016.03.28 - 2016.04.03


Monday : 2016.03.28

  • 16:10 MEH: running SNIaF updates in stdscience until nightly starts; if it is too much, may switch to stdscience_ws using a smaller set of only c2 nodes -- switched: running on the stsci nodes pushed some data nodes into high load/wait states; running much better on stdscience_ws:c2
  • 16:20 MEH: setting default gpc1/ippdb05 mysql connect_timeout=300 now
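    • For reference, a minimal sketch of applying such a change (assumes the standard MySQL client can reach ippdb05 and that the account has SUPER privilege; the actual defaults file used on ippdb05 is not recorded here):
        # apply on the running server (does not survive a mysqld restart)
        mysql -h ippdb05 -e "SET GLOBAL connect_timeout = 300;"
        # to make it the persistent default, also set it in my.cnf under [mysqld]:
        #   connect_timeout = 300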

Tuesday : 2016.03.29

  • 10:00 MEH: SNIaF updates running again in stdscience_ws pantasks
  • 15:20 CZW: New register.pro installed. This version has shorter (10 minute) revert times. I'll restart ipp pantasks later today.
  • 15:57 CZW: Starting ipplanl/pv3update pantasks to redo STSCI updates that probably were eaten by the postage stamp server. There are ~1100 jobs to run between the chip and warp stage, so this is probably on the order of an hour or so of processing.
  • 17:05 CZW: ipp pantasks restarted for the night. ippb03.2 set to repair in nebulous to allow the final pv3 STSCI update to finish. This volume does not have a degraded RAID, so it should be fine like this.
  • 17:15 CZW: ipplanl/pv3update pantasks stopped and shut down.

Wednesday : 2016.03.30

  • 16:30 CZW: neb-cull process on ippb03.0 finally debugged and running. No apparent overloading of ippb03 from this. This is running as ipptest/replication on 200k chunks.
  • 17:00 CZW: stop/restart of ipp pantasks.
  • 22:00 HAF: nebulous hates me... all sorts of nebulous errors (continuing), investigating them now...
  • 22:37 CZW: o7478g0215o wasn't completing because the pztool -copydone command was hung waiting for the gpc1 database (based on looking at the distribution of running times in the summitcopy pantasks). Killed this command, the job failed, and was properly retried.
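    • A hedged sketch of one way to confirm and clear a connection hung on the gpc1 database (the actual diagnosis here was made from the running-time distribution in the summitcopy pantasks; the host, connection id, and pid below are placeholders):
        # look for long-running or sleeping connections on the gpc1 server
        mysql -h ippdb05 gpc1 -e "SHOW FULL PROCESSLIST;"
        # kill the offending connection on the server side (<id> taken from the processlist)
        mysql -h ippdb05 -e "KILL <id>;"
        # or kill the hung client process on the processing host so pantasks can fault and retry the job
        kill <pid_of_hung_pztool>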

Thursday : 2016.03.31

  • 00:46 MEH: ipp008,012,013,037 down for >30 minutes... seems like the bank of machines that sometimes power cycles on its own, but usually it is just ipp013 that stalls. Retrying a power cycle on all and will see if they recover -- otherwise all are in repair, and processing should be recoverable by setting neb-host down and killing all the outstanding tasks.
    • non-responsive on the initial power cycle; left off for ~20 min and tried again, still no response from any of the machines -- they will likely need to be physically checked
    • so cleaning up stalled processes was required to get things running again (a hedged sketch of that kind of cleanup is below)
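    • A minimal sketch of that kind of recovery, assuming the hosts are simply unreachable (the neb-host call is left as a comment because its exact arguments are not recorded in this log):
        # check which of the affected machines respond at all
        for h in ipp008 ipp012 ipp013 ipp037; do
            ping -c 1 -W 2 "$h" >/dev/null && echo "$h up" || echo "$h down"
        done
        # mark the unreachable hosts down in nebulous so processing stops waiting on them,
        # e.g. something along the lines of: neb-host <host> down
        # then find and kill the outstanding tasks still hung on those hosts so the pantasks queues can drain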
  • 06:30 HAF: kicked a handful of nebulous errors
  • 12:30-17:00 HAF: Haydn did a lot of work on machines; I restarted pantasks afterward. Summary of Haydn's work:
    • rebooted ipp085/ipp086
    • replaced BBUs on ipp103, ipp097, ipp086
    • replaced the power supply in ipp103
    • replaced 4 failing hard drives
    • fixed ipp037, ipp008, ipp012, and ipp013 -- they were all on the same circuit and experienced a power outage last night
  • 18:00 CZW: Ran "nightly_science.pl --clean_old --date 2016-04-01 --dbname gpc1 --camera GPC1" as the cleanup run didn't execute today.
  • 22:05 HAF: put ipp012/ipp013/ipp008/ipp037 in repair

Friday : 2016.04.01

  • 16:00 HAF: daily restart of pantasks
  • 23:30 MEH: summitcopy running behind, possibly because some of the extra nodes are off, busy with other processing, or on full disks? will try adding a few more nodes and see how it goes

Saturday : 2016.04.02

Sunday : 2016.04.03

  • 14:18 MEH: more SNIaF updates running in the ippqub stdscience_ws, since the c2 nodes cause less trouble with other processing and QUB stamps are done until the DB dump finishes