PS1 IPP Czar Logs for the week 2016.02.15 - 2016.02.21


Monday : 2016.02.15

  • 13:00 EAM : I am going to launch relphot using most of the ippx nodes and the ipps nodes, and probably some of the ipp080 - ipp104 nodes. I'm restarting the stdscience pantasks now to ensure nightly jobs are completed before launching relphot.
  • 13:30 EAM : I have started relphot for the full PV3 master database. This will likely take 24-48 hours.

Tuesday : 2016.02.16

  • 10:30 CZW: ipp/replication pantasks running ippb03.0 OTA replication.

Wednesday : 2016.02.17

Thursday : 2016.02.18

  • 18:00 EAM : dvo operations are using most of the memory on the ipp0XX nodes. I'm stopping the pantasks instances and will remove the storage nodes from the set of hosts used for standard processing.
  • 21:20 EAM : summitcopy was stuck for a while, and I had some trouble tracking down the reason. It turned out that two exposures had failed in the summitcopy advance stage. Running pztool -toadvance -dbname gpc1 -limit 10 showed the problem exposures. In the summitcopy pantasks, show.books followed by show.book pzPendingAdvance revealed the two exposures still on the books as if they were running -- they must have died in a way that pantasks lost track of. Stopping summitcopy and calling reset.advance cleared that book and set things right again.
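    • For reference, a recap of the recovery sequence (the steps are taken from the note above; the ordering and comments are mine, and the pantasks commands are entered at the summitcopy pantasks prompt):
      pztool -toadvance -dbname gpc1 -limit 10    # list exposures still pending advance; showed the two stuck ones
      show.books                                  # in the summitcopy pantasks: list the active books
      show.book pzPendingAdvance                  # the two exposures still appeared here as if running
      stop                                        # stop summitcopy processing
      reset.advance                               # clear the stuck book so advance can proceed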

Friday : 2016.02.19

  • 09:40 MEH: clearing an unrecoverable fault 4 (warp curve of growth) and a fault 5 (diff):
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.0534.058 -warp_id 1688791 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1550.056 -diff_id 1341246  -fault 0
    
  • 11:30 MEH: will do the regular restart of the nightly pantasks once the WSdiffs finish -- including registration, which seems to have crashed on its own?
    • registration was in an odd state because someone had started the outdated publishing pantasks, which uses the same machine and port. Both pantasks had to be killed manually; the restart can now be done normally.
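    • (not from the log; a generic sketch for tracking down this kind of conflict next time -- PORT is a placeholder, since the actual port number is not recorded here)
      ps auxww | grep -i pantasks    # list pantasks processes running on the host
      lsof -i :PORT                  # show which process is bound to the contested port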
  • 15:20 CZW: Launching md5sum calculations across the cluster so we can confirm with STSCI that all of the updated files they received are correct. This will mostly involve the ipp0[6-9]X computers with a few other nodes. I'm going to let it run in waves, so not all of these computers are busy at the same time. It's a single thread per host, so the load should only increase by about 1.
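    • (illustrative sketch only, not the actual job: a single-threaded md5sum pass on one host, with the host name, file list /tmp/stsci_files.txt, and output path all hypothetical)
      ssh ipp061 'xargs md5sum < /tmp/stsci_files.txt > /tmp/stsci_md5.ipp061.txt'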
  • 16:04 MEH: ipp017 powered off after its most recent crash; it will stay off until the RAID card swap is ready
  • 16:12 MEH: /data/ippc19.0/home is down to 12GB again... ~ipp has ~45 of 108GB in log files... old logs need to be run through bzip2 and archived on a more regular basis (by other czars)
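    • (a minimal cleanup sketch, assuming the logs live under ~ipp and that anything older than 30 days is safe to compress in place -- both are assumptions, not policy)
      find ~ipp -name '*.log' -mtime +30 -exec bzip2 {} +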
  • 20:43 MEH: looks like a summit fault -- QUB data, so it should be left alone -- cleared the same way as o7414g0102o (search for it in the czarlog)
    o7438g0100o   2016-02-20 06:15:50   110
    
  • 23:41 MEH: ipp058 (and to some extent ipp060) is seeing large load spikes for some reason (maybe too many randomly targeted jobs landing there?) -- setting both to neb-host repair (from up)
    • the issue moved to ipp057

Saturday : 2016.02.20

  • all kinds of mess with nodes filling up -- will add notes when there is time...
  • 16:32 MEH: regular restarting of nightly pantasks

Sunday : 2016.02.21

  • some combination of ipp015, 024, 028, 030 being set neb-host up causes summitcopy to run a bit slower, like before?
  • 15:35 MEH: manually remove ipp087 from the targeted-data config (~ipp/psconfig/ipp-20141024.lin64/share/pantasks/modules/ipphosts.mhpcc.config) before the regular pantasks restart, to see if it can handle data without a BBU until it is fixed -- like ipp103, it should be okay as long as there are enough nodes with available space... so cleaned the 20160220 diffims in advance and am leaving the warps for the QUB updates