PS1 IPP Czar Logs for the week 2015.11.23 - 2015.11.29

Continuing HAF's suggestion to improve communication: the czar pages now carry a list of additional (non-standard) processing, so that we all know what's going on.

Daily Czaring:

  • MOPS has repeatedly requested that NO changes be made to the ~ipp ops tag without being verified against their test set. So a modified ops tag is currently running diffs (WS labels only) as ippqub (was ippmops) under ~ippqub/src/stdscience_ws on ippc06. If there are problems (Njobs>100k, power loss on ippc06, etc., or any other issues like those seen with the ~ipp pantasks), it will need to be restarted like a normal nightly processing pantasks
    ./ stdscience_ws
    • even with the modified ops tag, various files are still MIA in nebulous, so the daily czar needs to check and clear them, as has been discussed before
  • ps_ud_QUB has also been moved to the ippqub:stdscience_ws pantasks to support updates possibly broken by missing cmf files; chip and warp updates will also be done in that pantasks
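The restart rule of thumb above (Njobs>100k) can be sketched as a small check. This is only a sketch: how Njobs is read from the pantasks is not shown in this log, so the count is passed in by hand, and the names NJOBS_LIMIT and needs_restart are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: decide whether a pantasks is due for a restart
# based on the "Njobs>100k" rule of thumb from the notes above.
# Assumption: the caller supplies the current job count by hand.
NJOBS_LIMIT=100000

needs_restart() {
    # Succeeds (exit 0) only when the job count is strictly above the limit.
    [ "$1" -gt "$NJOBS_LIMIT" ]
}

# Example with a made-up job count, for illustration only.
if needs_restart 120000; then
    echo "restart stdscience_ws"
else
    echo "ok"
fi
```

The comparison is strict, so exactly 100000 jobs does not trigger a restart; adjust NJOBS_LIMIT if a different cutoff is preferred.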

Monday : 2015.11.23

Tuesday : 2015.11.24

  • 08:00 Bill: Updated to work around problem with chipProcessedImfile.uri being incorrect for some PV3 data.
    • 08:13 Restarted postage stamp server pantasks.
  • 11:00 EAM: Restarted all pantasks to clear errors from before Bill's changes
  • 11:32 Bill: stopping pstamp and stdscience pantasks to pick up some changes to pstamp dependency checking and warptool_scmap.sql
    • 11:37 set to run. Change was to fault dependents for warps where the camRun is in state cleaned: the smf is gone, so we need to say PSTAMP_GONE
    • 11:43 increased dependency checking poll limit from 64 to 1024 to get stdscience working on the backlog of updates.
    • I need to tweak exec time versus queue depth for this operation. This poll limit would ruin MOPS throughput for stamps if there are lots of jobs that want updates. But on the other hand, MOPS doesn't often use our server anymore so...
      • MEH: MOPS actually does use the postage stamp server when doing their high prio follow-up for past data and chip level for comets(?) -- requests have been sent to ipp-dev for this when other things seem to get in the way.
  • 12:55 MEH: testing restore of missing MD.PV3 files from stsci by running the faulted WSdiffs under ~ippmd/stdscience using the ippqub stdscience_ws c2 nodes since not using for nightly processing -- done

Wednesday : 2015.11.25

  • 08:00 MEH: clearing two fault 5 WSdiff (setting qual 42 since doing it manually..), both have ref FWHM nan again
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1499.050 -diff_id 1278624  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1587.095 -diff_id 1278669  -fault 0
  • 08:30 MEH: clearing old stalled exposure in distribution from 20151118 fault 3 config error by simple revert of fault 3...
  • 08:46 MEH: clearing old stalled WSdiff exposure from 20151118 manually, also fault 5 from ref FWHM nan
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1503.017 -diff_id 1277656 -fault 0
  • 17:24 MEH: with all the stamp updates, stdscience and pstamp should be restarted before nightly begins
  • 17:30 MEH: start large neb file scans to finish before nightly starts
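The manual clears at 08:00 and 08:46 all use the same difftool call, with only the skycell and diff_id changing, so the commands can be generated from a list of pairs. A minimal dry-run sketch, assuming quality 42 and fault 0 are wanted for every entry; emit_clear_cmd is a hypothetical helper name, and the commands are only printed for review, not executed.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the difftool command used in this
# log to clear a faulted diff skyfile. All flags are copied verbatim from
# the entries above; emit_clear_cmd itself is a hypothetical helper.
emit_clear_cmd() {
    skycell_id=$1
    diff_id=$2
    printf 'difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id %s -diff_id %s -fault 0\n' \
        "$skycell_id" "$diff_id"
}

# One "<skycell_id> <diff_id>" pair per line, taken from Wednesday's entries.
while read -r skycell_id diff_id; do
    emit_clear_cmd "$skycell_id" "$diff_id"
done <<'EOF'
skycell.1499.050 1278624
skycell.1587.095 1278669
skycell.1503.017 1277656
EOF
```

Printing first keeps a copy-pasteable record for the czar log; the output can be reviewed and then pasted into a shell once the pairs look right.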

Thursday : 2015.11.26

  • 13:00 MEH: ipp cluster unresponsive since a little after 09:30. Sifan contacted people on-site and things seem okay there, so likely a network issue. Going on-site to see if the problem can be fixed.
  • 14:16 MEH: Sifan et al. fixed the connection; seems to be related to an ippcore crash
  • 14:25 EAM: the ippcore switch went offline ~10:45 this morning. Sifan went down to mrtc-b and was able to get it back online by power cycling it (under guidance from Gavin and Curt). The ippcore is back up now and working fine; it seems the cluster was never actually down, just the switch.
  • 14:52 MEH: restarting stdscience and pstamp for normal nightly science since they have been busy doing updates and Njobs >100k

Friday : 2015.11.27

  • 08:42 MEH: clearing fault5 diffs (stack ref fwhm NAN):
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1501.055 -diff_id 1278967  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1501.055 -diff_id 1278986  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1502.068 -diff_id 1278992  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1501.055 -diff_id 1279020  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1039.037 -diff_id 1278686  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1039.050 -diff_id 1278689  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1190.015 -diff_id 1279490  -fault 0
  • 09:22 MEH: Richard asking about two missing exposures for a diff:
    Exposure	Status	Comment
    o7353g0620o 	FAIL (Diff stage) 	OSSR.R10N4.11.Q.i ps1_27_4574 visit 3
    o7353g0621o 	FAIL (Diff stage) 	OSSR.R10N4.11.Q.i ps1_27_4538 visit 4
    • OSSR.R10N4.11.Q.i ps1_27_4574 visit 3 has no matching visit 4
    • OSSR.R10N4.11.Q.i ps1_27_4538 visit 4 had visit 1 done twice, so a v1-v1 diff was made and the full set for this field needs to be redone (used the first visit 1 since coords for the second seemed off; zp, sky, and FWHM looked the same for both)
  • 16:50 MEH: seeing odd backup of update jobs and registration of pre-nightly darks for ipp049, ipp050, and maybe a couple of others -- looks like a heavy(?) rsync -- turning them off in stdscience; ipp049 is better, ipp050 is not
    • then ipp087 as well, so taking it out of processing so nightly jobs don't get stalled
  • 17:12 MEH: restarting stdscience and pstamp again since they have been busy..

Saturday : 2015.11.28

  • 10:06 MEH: clearing more fault 5 diffims where ref fwhm is nan
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1233.091 -diff_id 1279746  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1247.077 -diff_id 1280080  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1247.077 -diff_id 1280099  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1247.077 -diff_id 1280149  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1247.077 -diff_id 1280156  -fault 0
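Several of the ref-FWHM-nan clears this week hit the same skycell more than once. A quick way to see that is to count skycell IDs in the pasted commands. Sketch only: count_by_skycell is a hypothetical helper, the input is just Saturday's commands from above, and grep -o is assumed to be available (it is in GNU and BSD grep).

```shell
#!/bin/sh
# Hypothetical helper: read difftool command lines on stdin and print
# "<skycell_id> <count>" for each distinct skycell, sorted by skycell ID.
count_by_skycell() {
    grep -o 'skycell\.[0-9][0-9]*\.[0-9][0-9]*' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Saturday's commands, copied from the entries above.
count_by_skycell <<'EOF'
difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1233.091 -diff_id 1279746  -fault 0
difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1247.077 -diff_id 1280080  -fault 0
difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1247.077 -diff_id 1280099  -fault 0
difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1247.077 -diff_id 1280149  -fault 0
difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1247.077 -diff_id 1280156  -fault 0
EOF
# prints:
# skycell.1233.091 1
# skycell.1247.077 4
```

A skycell that keeps reappearing may point at the reference stack for that skycell rather than at the individual diffs, which would be worth checking before clearing more by hand.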

Sunday : 2015.11.29

  • 10:56 MEH: clearing two fault 5 -> VectorFitPolynomial1DOrd (psMinimizePolyFit.c:633): unknown psLib error
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1963.004 -diff_id 1280278 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1966.064  -diff_id 1280309 -fault 0
    • ref FWHM is NAN
      difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0916.052 -diff_id 1280348 -fault 0
  • 12:20 MEH: long list of missing MOPS diffs are not recoverable w/o raising the cam FWHM limit from <3" to <4" (61/68 could be recovered)
  • 17:47 MEH: doing necessary regular restart of nightly pantasks
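For the 12:20 note, the recoverable fraction works out as below. A small arithmetic sketch using only the 61 and 68 quoted above; recovery_summary is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: summarize how many missing diffs would be recovered.
# Arguments: <recovered> <total>
recovery_summary() {
    awk -v r="$1" -v t="$2" 'BEGIN {
        printf "%d of %d recoverable (%.1f%%), %d still missing\n", r, t, 100 * r / t, t - r
    }'
}

# Numbers from the 12:20 entry: 61 of 68 diffs recoverable with the
# cam FWHM limit raised from <3" to <4".
recovery_summary 61 68
# prints: 61 of 68 recoverable (89.7%), 7 still missing
```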