PS1 IPP Czar Logs for the week 2014.11.03 -- 2014.11.09

Monday : 2014.11.03

  • 16:00 HAF : removed ipp008 from the hosts input file in ippconfig (Gene has declared it too flaky to be used)
  • 20:30 EAM : added ipp008 to the ignore list (to avoid overloading the machine); restarting all servers
  • 20:40 EAM : restarting stdlocal as well. I've also added a 'storage.hosts.half.on,storage.hosts.half.off' pair so we can keep the storage hosts somewhat loaded.

Tuesday : 2014.11.04

  • 11:48 Bill : started pantasks from ~ippsky/staticsky using the high-memory nodes to run full-force tests on part of the SAS area. Turned them off in stdlocal.
  • 21:55 EAM : the processing on ippsky/staticsky on himem nodes appears to be done, so I'm putting them back in stdlocal for the night.
  • 10:00 MEH: MD needs nodes for processing, so I will be using them (to be clear, himem = the ippsXX nodes), turning them off and on in stdlocal -- it took a long time for the stacks to clear... finished for a bit, so they are now back in lanl stdlocal

Wednesday : 2014.11.05

  • 08:40 MEH: ippsXX nodes off in lanl stdlocal -- now on again
  • 13:40 MEH: ippsXX nodes off in lanl stdlocal -- now on again in stdlocal
  • 15:03 Bill: added label sas.fftest.20141104 to the distribution pantasks
  • 15:26 EAM : stopping stdlocal stacks to restart the pantasks
  • 16:20 MEH: set stdlocal to stop as it wasn't loading anything anyway (needs a restart) -- also taking the ippsXX nodes for MD again now
  • 17:15 MEH: ippsXX nodes back on in stdlocal
  • 20:15 MEH: nightly processing hanging up.. many timeouts.. stopping stdlocal, removing update labels
    • stdsci chips picked up fairly quickly; are the update labels causing a problem for stdsci?
    • ippdb01 gpc1 probably needs to be restarted in morning
    • stsci00 also came under heavy load around 20:00 and may be related to the problem as well
    • an odd/unannounced script (do_check.pl) was running on ippc18 during nightly processing -- does it need to be?
    • still seeing many timeouts in summitcopy, so stdlocal is off again -- summitcopy+registration timeouts are much less frequent now..
  • 21:05 MEH: stdlocal running again -- things seem better.. so was something else unknown running?

Thursday : 2014.11.06

  • 05:56 Bill: restarted the pstamp pantasks
  • 07:45 Bill: stdscience was idle, so I added the update labels back in. The chip poll task seems to be taking forever, likely because so many chipRuns are in state update. If the situation doesn't improve I will set them to goto_cleaned (a quick way to check the counts is sketched after this list).
    • set them to be cleaned. The ones that are needed will be set back to update by the dependency checker.
  • 12:10 MEH: looks like pstamp has been down since 09:41; not sure if it crashed or if someone is doing something with it.
    • 12:29 Bill: must have crashed. restarted it.
  • 20:05 EAM : Ken says PS1 is down for the night (rotator problems), so I am going to turn on all storage nodes for processing on stdlocal (after stopping and restarting it first).
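
A quick way to gauge how many chipRuns are sitting in state update (the situation in the 07:45 entry above) is to count them by state in the gpc1 database on ippdb01. The sketch below is illustrative only: the chipRun table and state column names are assumed from the wording of the note rather than a verified schema, and the user/connection details are placeholders.

    # Sketch only: count chipRuns by state in gpc1 to see how many are stuck in 'update'.
    # Table/column names are assumed from the czar note; connection details are placeholders.
    import pymysql

    conn = pymysql.connect(host="ippdb01", user="readonly", database="gpc1")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT state, COUNT(*) FROM chipRun GROUP BY state")
            for state, count in cur.fetchall():
                print(f"{state:>16}  {count}")
    finally:
        conn.close()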

Friday : 2014.11.07

  • 11:20 MEH: ippsXX nodes off in lanl stdlocal -- in use by the ~ippmd/stdscience pantasks -- if that is stopped, put a note in the czarlog or the nodes will likely get set to run again

Saturday : 2014.11.08

  • 09:55 EAM : the system has generally been sluggish: nighttime processing was way behind this morning, there were lots of timeouts on the pantaskses, and the ippMonitor page was loading very slowly. I stopped all processing in the morning (~06:30) and restarted the ippdb01 mysql server (I also rebooted ipp036, which had crashed). This helped a little, but things still seemed slow, so I've just stopped all non-nightly science processing. I have also added 4xc2 hosts in ipp/stdscience to increase the throughput for nightly.
  • 10:05 EAM : requeued a failed diff with:
    ippdb01: difftool -dbname gpc1 -definewarpwarp -exp_id 813873 -template_exp_id 813886 -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2014/11/08 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20141108 -set_reduction SWEETSPOT -simple -rerun
    604168 new neb://@HOST@.0/gpc1/OSS.nt/2014/11/08 OSS.nightlyscience OSS.20141108 SweetSpot SWEETSPOT  2014-11-08T20:05:29.278386 RINGS.V3 T T 0  0.000000 nan nan nan nan 1 
    604169 new neb://@HOST@.0/gpc1/OSS.nt/2014/11/08 OSS.nightlyscience OSS.20141108 SweetSpot SWEETSPOT  2014-11-08T20:05:29.278386 RINGS.V3 T T 0  0.000000 nan nan nan nan 1 
    
  • 11:55 EAM : 2 diffs were failing on the stamp rejection, in the vector fit. I thought we fixed this? Here is an example error (a sketch of the underlying fit failure is at the end of this day's entries):
        Data error code: 36b6
    Error in subtraction:
     -> VectorFitPolynomial1DOrd (psMinimizePolyFit.c:633): unknown psLib error
         Could not solve linear equations.
     -> psVectorFitPolynomial1D (psMinimizePolyFit.c:758): unknown psLib error
         Could not fit polynomial.  Returning NULL.
     -> psVectorClipFitPolynomial1D (psMinimizePolyFit.c:934): unknown psLib error
         Could not fit polynomial.  Returning false.
     -> pmSubtractionRejectStamps (pmSubtraction.c:1014): invalid data
         Unable to measure statistics for deviations.
     -> pmSubtractionMatch (pmSubtractionMatch.c:848): invalid data
         Unable to reject stamps.
     -> ppSubMatchPSFs (ppSubMatchPSFs.c:466): Problem in data values
         Unable to match images.
     -> ppSubLoop (ppSubLoop.c:113): Problem in data values
         Unable to match PSFs.
    found 4 leaks at ppSub
    Number of leaks to display: 500
    # func at (file:line)  ID: X  Ref: X
    pmSubtractionRejectStamps at (pmSubtraction.c:978)  ID: 2557013  Ref: 1
    vectorAlloc at (psVector.c:78)  ID: 2557014  Ref: 1
    pmSubtractionRejectStamps at (pmSubtraction.c:979)  ID: 2557015  Ref: 1
    vectorAlloc at (psVector.c:78)  ID: 2557016  Ref: 1
    Unable to perform ppSub: 5 at /home/panstarrs/ipp/psconfig/ipp-20141024.lin64/bin/diff_skycell.pl line 507.
    Running [/home/panstarrs/ipp/psconfig/ipp-20141024.lin64/bin/difftool -diff_id 604081 -skycell_id skycell.0793.089 -fault 5 -adddiffskyfile -dtime_script 31.9999873638153 -hostname ippc13 -path_base neb://stsci14.0/gpc1/OSS.nt/2014/11/08/RINGS.V3/skycell.0793.089/RINGS.V3.skycell.0793.089.dif.604081 -dbname gpc1]...
    

    I set these to bad quality:

    stsci19: difftool -updatediffskyfile -fault 0 -set_quality 36b6 -diff_id 604081 -skycell_id skycell.0793.089 -dbname gpc1
    stsci19: difftool -updatediffskyfile -fault 0 -set_quality 36b6 -diff_id 604237 -skycell_id skycell.0434.099 -dbname gpc1
  • 16:20 MEH: turning ippsXX on in lanl stdlocal
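
A note on the error above: "Could not solve linear equations" is the generic failure of a 1-D least-squares polynomial fit whose normal equations are singular, which can happen when stamp clipping/rejection leaves fewer distinct points than the fit order needs. The snippet below is a minimal numpy illustration of that failure mode only; it is not the psLib/ppSub code, and the helper function and data are made up.

    # Illustrative only: the generic cause of "Could not solve linear equations"
    # in a 1-D polynomial fit.  If all surviving points sit at the same position,
    # the design matrix is rank-1 and the normal equations are singular.
    import numpy as np

    def fit_poly_1d(x, y, order):
        """Least-squares polynomial fit via the normal equations (hypothetical helper)."""
        A = np.vander(x, order + 1)                 # Vandermonde design matrix
        return np.linalg.solve(A.T @ A, A.T @ y)    # raises LinAlgError if singular

    x = np.array([1.0, 1.0, 1.0, 1.0])   # degenerate: every surviving stamp at the same x
    y = np.array([0.5, 0.6, 1.1, 1.0])
    try:
        fit_poly_1d(x, y, order=3)        # cubic fit: 4 unknowns, 1 distinct position
    except np.linalg.LinAlgError as err:
        print("Could not solve linear equations:", err)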

Sunday : 2014.11.09

  • 08:10 EAM : nightly processing was running somewhat slowly. I've stopped ipplanl/stdlocal because I suspect the stacktool command is overloading the database.