PS1 IPP Czar Logs for the week 2015.03.02 - 2015.03.09

Monday : 2015.03.02

  • 08:50 MEH: ipp001 seems to be unresponsive -- Gavin tried manually power cycling it with no response, so there is likely a hardware fault somewhere.
    • QUB and Durham use this machine for a replica of the gpc1 DB -- will need to set one up on ipp002 or ipp003 -- ipp003 is already used for TSS by QUB so should have open ports.
    • ipp001 is also the external access node?
    • Gavin+Haydn got it back up w/ 1 CPU -- all critical data needs to be moved off ASAP, as for ipp002
    • ipp001.0 is 99% full; old mysql dumps also need to be cleared
  • 12:45 MEH: restart pstamp
  • 17:00 MEH: regular restart of summitcopy+registration+stdsci for nightly processing
    • summitcopy having repeated 503 timeout faults for a set of files
    • Craig found t20 behaving strangely; after a reboot the files downloaded
  • 17:30 MEH: ippdb00 down to 2.5G, not good going into the night -- Gene is clearing the binlogs to date, which should already be replicated to the backups ippdb06,02,c63
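
A minimal sketch of the kind of free-space check that would flag volumes like ipp001.0 (99% full) and ippdb00 (2.5G left) before nightly processing starts. The thresholds, the function names `space_alerts` and `check_path`, and the path list are assumptions for illustration only; none of this is part of the IPP tooling.

```python
import shutil

GIB = 1024 ** 3

def space_alerts(total_bytes, free_bytes, min_free_gib=5.0, max_used_pct=95.0):
    """Return warning strings for one volume (empty list if it looks healthy).

    The default thresholds are illustrative assumptions, not IPP policy.
    """
    alerts = []
    used_pct = 100.0 * (total_bytes - free_bytes) / total_bytes
    free_gib = free_bytes / GIB
    if free_gib < min_free_gib:
        alerts.append(f"only {free_gib:.1f} GiB free")
    if used_pct > max_used_pct:
        alerts.append(f"{used_pct:.0f}% full")
    return alerts

def check_path(path, **thresholds):
    """Apply space_alerts to the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return space_alerts(usage.total, usage.free, **thresholds)

if __name__ == "__main__":
    # Hypothetical list of mount points; replace with the real data volumes.
    for path in ["/"]:
        for alert in check_path(path):
            print(f"{path}: {alert}")
```

Keeping the threshold logic in `space_alerts` separate from the `shutil.disk_usage` call means the same rule could be applied to numbers gathered from remote hosts by other means.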

Tuesday : 2015.03.03

  • 14:30 CZW: Slowly starting up ipplanl/stdlocal to fix some problems before declaring PV3 chip-through-stack processing complete.
    • 15:30 CZW: This will involve chip updates, single-skycell warp processing, and then stacking. I'm using the LAP.PV3.20140730.local label for this.

Wednesday : 2015.03.04

Thursday : 2015.03.05

  • 11:20 EAM: restarted diff with storage machines NOT included. The concept is to let the machines deal with the load if they can and to have pantasks / pcontrol ignore targeting -- i.e., the unwanted jobs parameter will not limit the total number for any given machine. This immediately resulted in harassment for ipp093 and ipp095, which I set to repair in nebulous. For now, things are running ok.

Friday : 2015.03.06

Saturday : 2015.03.07

Sunday : 2015.03.08