PS1 IPP Czar Logs for the week 2015.12.07 - 2015.12.13

Continuing HAF's suggestion to improve communication: adding a list to the czar pages of additional (non-standard) processing so that we all know what's going on.

Daily Czaring:

  • MOPS has repeatedly requested that NO changes be made to the ~ipp ops tag without being verified against their test set -- so currently a modified ops tag is running diffs (WS labels only) as ippqub under ~ippqub/src/stdscience_ws on ippc06. If there are problems (Njobs>100k, power loss on ippc06, etc.), it will need to be restarted like a normal nightly processing pantasks (and IF ANY OTHER CHANGES are made for the ~ipp pantasks, the same likely needs to be made for ~ippqub)
        ./ stdscience_ws
    • even with the modified ops tag, there are still various files MIA in nebulous and fault problems that require the daily czar to check and clear them, as has been discussed before
    • ps_ud_QUB has also been moved to the ippqub:stdscience_ws pantasks to support updates possibly broken by missing cmf files; chip and warp updates will be done in this pantasks as well
  • HAF is addstaring on ipp061 - ipp065, ipp070 - ipp081.


Monday : 2015.12.07

  • 19:55 CZW: restarted summitcopy and ippqub/stdscience_ws as they had crashed, and the email thing was complaining.

Tuesday : 2015.12.08

  • 02:20 MEH: nightly processing got stuck, oddly some nodes just stalling jobs in reg+stdsci for 2+ hrs -- ipp065, 055, 005, 069, 089 (only ipp055 common to both)
  • 03:10 MEH: registration stuck on o7364g0449o and needed manual revert; ipp087 also having load spikes and high cpu wait, so put it in neb-host repair for a bit (BBU seems fine)
  • 06:50 Bill: Registration is 137 exposures behind. Burntool status shows a faulted imfile. Reverted it and data is flowing again
    % regtool -revertprocessedimfile -exp_id 1001097
          Updated 1 rawImfile
  • 08:49 MEH: looks like stalled exposure o7364g0288o in chip and stalled exposure o7364g0289o in warp -- both should be cleared now
    • o7364g0288o was missing chipImfile, reran to finish the chipImfile
    • o7364g0289o was missing skycell in warpSkyfile causing revertoverlap to barf
       -> p_psDBRunQuery (psDB.c:812): Database error generated by the server
           Failed to execute SQL query.  Error: Cannot delete or update a parent row: a foreign key constraint fails (`gpc1/warpSkyfile`, CONSTRAINT `warpSkyfile_ibfk_1` FOREIGN KEY (`warp_id`, `skycell_id`, `tess_id`) REFERENCES `warpSkyCellMap` (`warp_id`, `skycell_id`, `tess_id`))
       -> revertoverlapMode (warptool.c:744): unknown psLib error
           database error
      • manually clear faults in db
        update warpSkyCellMap set fault=0 where warp_id=1649894 and fault>0;
  • 09:28 MEH: restarting nightly pantasks that were having problems last night
  • 10:12 MEH: clearing fault 5 diff (ref stack FWHM NaN)
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1702.060 -diff_id 1288693  -fault 0
  • 12:23 MEH: starting full SAS reprocessing w/ PV3 tags under label SAS.20151208 using nightly nodes at ~half power -- chip-warp and stack will use ~ipplanl/stdsas pantasks on ippc63 -- will suspend for nightly at ~1930 depending on weather
  • 13:06 MEH: ipp046 is back up, need back neb-host repair
  • 20:07 MEH: ipp087 cpu wait dominating, so back to neb-host repair
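
Note on the 08:49 warpSkyCellMap fault clear above: a minimal sketch, in Python, of listing the faulted rows for a warp_id before zeroing them, so there is a record of what was cleared. This is not the procedure actually used (that was the single update statement shown). The sqlite3 in-memory table is only a stand-in for gpc1; its schema is a guess limited to the columns named in the log, and the sample rows and the helper name clear_warp_faults are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the gpc1 warpSkyCellMap table (hypothetical minimal
# schema: only the columns that appear in the log entry above).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warpSkyCellMap ("
    "warp_id INTEGER, skycell_id TEXT, tess_id TEXT, fault INTEGER)"
)
conn.executemany(
    "INSERT INTO warpSkyCellMap VALUES (?, ?, ?, ?)",
    [
        (1649894, "skycell.a", "tess.x", 2),  # made-up faulted row
        (1649894, "skycell.b", "tess.x", 0),  # made-up clean row
        (1111111, "skycell.c", "tess.x", 2),  # other warp, must stay faulted
    ],
)


def clear_warp_faults(conn, warp_id):
    """List faulted rows for one warp_id, then set their fault to 0.

    Returns the (skycell_id, fault) rows that were faulted before the update.
    """
    faulted = conn.execute(
        "SELECT skycell_id, fault FROM warpSkyCellMap "
        "WHERE warp_id = ? AND fault > 0",
        (warp_id,),
    ).fetchall()
    # Same condition as the manual update in the log, with bound parameters.
    conn.execute(
        "UPDATE warpSkyCellMap SET fault = 0 WHERE warp_id = ? AND fault > 0",
        (warp_id,),
    )
    conn.commit()
    return faulted


cleared = clear_warp_faults(conn, 1649894)
print(cleared)
```

Binding warp_id as a parameter keeps the update limited to the one warp, as in the manual statement.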

Wednesday : 2015.12.09

  • 08:54 MEH: clearing fault 5 diff (ref FWHM NAN)
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1234.046 -diff_id 1288757 -fault 0
  • 09:30 MEH: SAS.20151208 stacks running on misc nightly nodes most of day
  • 13:00 MEH: pausing pstamp and ippqub:stdscience to add some ippqub tag changes -- might as well do full restart of nightly pantasks anyways
  • 14:37 MEH: SAS.20151208 WSdiff running under ~ipptest/diffsas on misc nightly nodes + ippsXX
  • 17:45 MEH: SAS.20151208 SS running under ~ippsky/ss_sc_ffsas on ippsX, 1x on ippxX and c2+s4+s5 until nightly starts
  • 20:07 MEH: seeing hanging summit and reg jobs again -- seem to be at db interaction, slowly clearing manually
  • 23:20 MEH: looks like PS1 closed for wind -- starting a MOPS-requested reprocessing of the R08S1 chunk observed on night 7364 (12/8 UT, last night) as OSS.20151208redo
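
Note on the fault 5 diff clears (Tuesday 10:12 and Wednesday 08:54): both used the same difftool invocation with only the skycell_id and diff_id changing. A small sketch of generating that command line for a list of faulted diffs follows. It only builds and prints the strings and uses only the flags that appear in the two log entries; the function name fault5_clear_command and its defaults are made up for illustration, and whether quality 42 is appropriate for any other fault is not something this log establishes.

```python
import shlex


def fault5_clear_command(skycell_id, diff_id, dbname="gpc1", quality=42):
    """Build the difftool command line used in the log to clear a fault 5 diff."""
    args = [
        "difftool",
        "-dbname", dbname,
        "-updatediffskyfile",
        "-set_quality", str(quality),
        "-skycell_id", skycell_id,
        "-diff_id", str(diff_id),
        "-fault", "0",
    ]
    # Quote each argument so the printed line is safe to paste into a shell.
    return " ".join(shlex.quote(a) for a in args)


# The two (skycell_id, diff_id) pairs cleared this week.
faulted = [
    ("skycell.1702.060", 1288693),
    ("skycell.1234.046", 1288757),
]

commands = [fault5_clear_command(s, d) for s, d in faulted]
for cmd in commands:
    print(cmd)
```

The printed lines can be reviewed before being run by hand, which keeps the czar in the loop for each diff.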

Thursday : 2015.12.10

  • 02:30 MEH: with no nightly processing, throwing all nodes into ~ippsky/ss_sc_ffsas for fullforce on SAS
  • 08:30 MEH: SAS 41 seems finished, all nodes returned
  • 13:30 CZW: Making changes to apache on the ippc2X nodes to allow more hosts for STSCI transfers.

Friday : 2015.12.11

  • 17:22 CZW: Restarting ipp pantasks so they'll be fresh.

Saturday : 2015.12.12

Sunday : 2015.12.13

  • 05:20 EAM: cleared a stuck registration:
    regtool -updateprocessedimfile -exp_id 1003331 -class_id XY12 -set_state pending_burntool -dbname gpc1