PS1 IPP Czar Logs for the week 2015.11.30 - 2015.12.06

Continuing HAF's suggestion to improve communication: the czar pages now carry a list of additional (non-standard) processing so that we all know what is running.

Daily Czaring:

  • MOPS has repeatedly requested that NO changes be made to the ~ipp ops tag without being verified against their test set, so a modified ops tag is currently running diffs (WS labels only) as ippqub under ~ippqub/src/stdscience_ws on ippc06. IF ANY OTHER CHANGES are made for the ~ipp pantasks, the same changes likely need to be made for ~ippqub. If there are problems (Njobs>100k, power loss on ippc06, etc.), it needs to be restarted like a normal nightly processing pantasks:
        ./start_server.sh stdscience_ws
    
    • even with the modified ops tag, there are still various files MIA in nebulous and fault problems that require the daily czar to check and clear them, as has been discussed before
    • ps_ud_QUB has also been moved to the ippqub:stdscience_ws pantasks to support updates possibly broken by missing cmf files; chip and warp updates will be done in this pantasks as well


Monday : 2015.11.30

  • 04:53 Bill: ganglia says ipp015 has been down for over 1900 seconds. This was causing chip processing faults due to unavailable detrend files. Set ipp015 to down in nebulous.
  • 06:25 EAM : rebooted ipp015 and set to 'repair' in nebulous
  • 08:00 Bill : raised priority of ps_ud_MOPS so that the small number of updates their postage stamp requests need get serviced during nightly processing.
    • MEH: just a note on priority -- the QUB.nightly will get raised above when a trigger for processing is made
  • 16:55 CZW: Starting up ipplanl/pv3update pantasks as ipplanl. Server host is ippc11, and the label is PV3.stsci.update. This will process all of the updates needed for the stsci holes. I've assigned 4x ippx* nodes, and will adjust tomorrow if this is too much/too little/etc.
    • I'm having trouble with my update commands, so this will most likely remain stopped until tomorrow.
  • 17:00 CZW: Restarting ipp pantasks.

Tuesday : 2015.12.01

  • 00:26 MEH: if changes to ~ipp are made, the same changes need to be made for ~ippqub. ippc02 also needs to be commented out as a nebulous host if it is down. Doing that and restarting the ippqub:stdscience_ws pantasks.
  • 14:36 CZW: Just noticed ipp046 is down. Can't ssh, ganglia shows it as down. Cycling power.
  • 17:00 CZW: There was a disk space recovery spike. I moved a set of chip-stage cleanup log files out of the way, as they had "no instance found" errors (likely due to the instance being on the stsci nodes). This cleared the issues with chip cleanup, and allowed a backlog to be cleaned.

Wednesday : 2015.12.02

  • 09:20 Bill: restarted postage stamp server pantasks after integrating a change to the postage stamp parser to allow it to find the correct camRun for a given release.
  • 10:55 EAM: cleared various failures caused by bad data quality issues:
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.1601.095 -diff_id 1282721
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.1073.058 -diff_id 1283177
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.1684.032 -diff_id 1281798
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.1684.032 -diff_id 1281818
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.1684.032 -diff_id 1281878
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.1684.032 -diff_id 1281899
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.1424.051 -diff_id 1281958
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.1424.051 -diff_id 1281976
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.1519.032 -diff_id 1282047
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.2547.098 -diff_id 1282300
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.1601.095 -diff_id 1282733
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.1041.072 -diff_id 1282545
    
  • 16:13 HAF: restart of pantasks for nightly processing
  • 21:14 HAF: registration stalled out, but when I logged in it was moving again. I did nothing! I will keep an eye on it.
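
The repeated difftool calls in the 10:55 entry above differ only in skycell_id and diff_id, so a list like that can be generated from pairs instead of typed by hand. The following is a minimal sketch, not an existing IPP script: the function name emit_bad_quality_cmds and the "skycell_id diff_id" input layout are assumptions made for illustration, while the difftool flags are copied verbatim from the logged commands. It only prints the commands so they can be reviewed before anything touches the database.

```shell
#!/bin/sh
# Values taken from the logged commands above.
DBNAME=gpc1
QUALITY=42

# Hypothetical helper: read "skycell_id diff_id" pairs from stdin and
# print one difftool command per pair. Nothing is executed here.
emit_bad_quality_cmds() {
    while read -r skycell diff_id; do
        # Skip blank lines and lines starting with '#'.
        case "$skycell" in ''|'#'*) continue ;; esac
        printf 'difftool -dbname %s -updatediffskyfile -set_quality %s -fault 0 -skycell_id %s -diff_id %s\n' \
            "$DBNAME" "$QUALITY" "$skycell" "$diff_id"
    done
}

# Example input: the first two pairs from the 10:55 entry.
emit_bad_quality_cmds <<'EOF'
skycell.1601.095 1282721
skycell.1073.058 1283177
EOF
```

Once the printed list looks right, the output could be redirected to a file and run with sh, or pasted into the czar log as-is.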

Thursday : 2015.12.03

  • 06:45 HAF: registration stuck again; logged in and all was fine.

Friday : 2015.12.04

  • 10:50 EAM : some bad quality warps and diffs:
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -fault 0 -skycell_id skycell.2100.035 -warp_id 1647472
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1283491 -skycell_id skycell.1129.095
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1283914 -skycell_id skycell.0985.057
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1283932 -skycell_id skycell.0985.057
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1283965 -skycell_id skycell.0988.035
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1283966 -skycell_id skycell.0985.057
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1283981 -skycell_id skycell.0988.035
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1283982 -skycell_id skycell.0901.095
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1284016 -skycell_id skycell.0901.095
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1284025 -skycell_id skycell.0901.095
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1284026 -skycell_id skycell.0988.035
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1284030 -skycell_id skycell.1432.047
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1284035 -skycell_id skycell.1432.063
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1284071 -skycell_id skycell.1432.063
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1284133 -skycell_id skycell.1251.066
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1284247 -skycell_id skycell.2177.015
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1284262 -skycell_id skycell.2177.015
    difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1284325 -skycell_id skycell.2177.015
    

  • 19:30 EAM: stopping pantasks to restart

Saturday : 2015.12.05

  • 04:05 Bill: One chip for exposure 108 was stuck in data_state check_burntool. This has caused registration to be > 500 exposures behind. Fixed that and now the queue is filling up.
  • 07:40 Bill: two more chips got stuck: o7361g0684o XY75 and o7361g0685o XY45. Set data_state to pending_burntool.
  • 12:40 EAM : set a diff to bad quality:
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -fault 0 -skycell_id skycell.1042.005 -diff_id 1285391
    

Sunday : 2015.12.06

  • 13:50 EAM: ipp046 is down (per Haydn, tried to reboot remotely with no success), so I've put it to nebulous down.
  • 20:50 EAM : pantasks have been going too long. I'm stopping to restart.
  • 21:20 EAM : various ippc machines running ippqub jobs have hung nfs mounts -- I am clearing them out.