PS1 IPP Czar Logs for the week 2014.05.12 - 2014.05.19

Monday : 2014.05.12

mark is czar

  • 07:30 MEH: 3x c2 in staticsky too much again, -1x c2
  • 08:00 MEH: looks like LAP is stalled from too many faults now again -- warps
    --- missing smf 
     neb://any/gpc1/ThreePi.nt/2010/03/13/o5268g0101o.147355/o5268g0101o.147355.cm.60313.smf
    2014/05/12 08:12:52 | ippc18 | FATAL | Nebulous::Client::find_instances - unhandled fault - database error: no instances available for key: neb:///gpc1/ThreePi.nt/2010/03/13/o5268g0101o.147355/o5268g0101o.147355.cm.60313.smf at /usr/lib64/perl5/site_perl/5.8.8/Nebulous/Server.pm line 1838, <DATA> line 12.
    unhandled fault - database error: no instances available for key: neb:///gpc1/ThreePi.nt/2010/03/13/o5268g0101o.147355/o5268g0101o.147355.cm.60313.smf at /usr/lib64/perl5/site_perl/5.8.8/Nebulous/Server.pm line 1838, <DATA> line 12.
    [ippc18:~] mhuber% whichnode neb://any/gpc1/ThreePi.nt/2010/03/13/o5268g0101o.147355/o5268g0101o.147355.cm.60313.smf
    duplicate variable name: LOGFORMAT, removed
     at /data/ippc18.0/home/mhuber/IPP//ipp-20140506.lin64/lib/PS/IPP/Config.pm line 96
    ipp020.0X not available
    
    --- missing cm masks is a common one from Saturday
    neb://any/gpc1/ThreePi.nt/2010/05/22/o5338g0138o.170714/o5338g0138o.170714.cm.80535.XY74.mk.fits
    nebulous key: neb://any/gpc1/ThreePi.nt/2010/05/22/o5338g0138o.170714/o5338g0138o.170714.cm.80535.XY74.mk.fits not found at /data/ippc18.0/home/mhuber/IPP//ipp-20140506.lin64/bin/neb-stat line 47.
      
    neb://any/gpc1/ThreePi.nt/2010/05/22/o5338g0156o.170732/o5338g0156o.170732.cm.80553.XY40.mk.fits -- >12 pixel FWHM so qual and warp cannot be updated w/ newer IPP
    neb://any/gpc1/ThreePi.nt/2010/06/13/o5360g0134o.180687/o5360g0134o.180687.cm.86770.XY64.mk.fits -- >12 pixel FWHM so qual and warp cannot be updated w/ newer IPP
    neb://any/gpc1/ThreePi.nt/2010/07/23/o5400g0123o.195530/o5400g0123o.195530.cm.98586.XY74.mk.fits -- >12 pixel FWHM so qual and warp cannot be updated w/ newer IPP
    neb://any/gpc1/ThreePi.nt/2011/03/15//o5635g0486o.310739/o5635g0486o.310739.cm.184246.XY25.mk.fits -- >12 pixel FWHM so qual and warp cannot be updated w/ newer IPP
    neb://any/gpc1/ThreePi.nt/2011/03/18//o5638g0304o.312937/o5638g0304o.312937.cm.185878.XY65.mk.fits -- >12 pixel FWHM so qual and warp cannot be updated w/ newer IPP
    
    --> probably need to set these exposures to drop -- doing @11:00
    
    laptool -updateexp -set_data_state drop -dbname gpc1 -lap_id 24202 -exp_id 170732
    etc..
    
    
    -- memory err --  	o6638g0084o 	686017 	987119 	956504 	933140 	936092 	skycell.2600.049 
    Memory Block ID: 8916146 @ 0x7fb0c1ca0010
    	Previous Block: (nil) Next Block: 0xe782010
    	Free function: (nil)
    	Size: 82559792 Reference count: 1 Persistent: No
    	Posts: deadbeef deadbeef 0
    	Allocated in psFitsScaleForDisk at (psFitsScale.c:971)
    		by Thread ID 140397226923760
    	Memory block is corrupted; buffer overflow detected.
    	Caught in p_psMemCheckCorruption at (psMemory.c:1319).
    	Caught in imageFree at (psImage.c:75).
    p_psMemDecrRefCounter (psMemory.c:1150) Unsafe to Continue
    Aborting.  Error stack:
    
    --> Bill has run a similar case manually before to clear it -- no time to deal with it
    
    
    
  • 08:20 MEH: camera
    -- cannot find expected output file
    neb://any/gpc1/ThreePi.nt/2010/02/19/o5246g0081o.137078/o5246g0081o.137078.cm.48403.XY66.mk.fits -- astrom problem in orig log, so can't update
    neb://any/gpc1/ThreePi.nt/2010/03/13/o5268g0085o.147339/o5268g0085o.147339.cm.57623.XY01.mk.fits -- astrom problem in orig log, so can't update
    neb://any/gpc1/ThreePi.nt/2010/03/17/o5272g0068o.148529/o5272g0068o.148529.cm.60368.XY01.mk.fits -- astrom problem in orig log, so can't update
    neb://any/gpc1/ThreePi.nt/2010/03/20/o5275g0396o.149871/o5275g0396o.149871.cm.61184.XY01.mk.fits -- astrom problem in orig log, so can't update
    neb://any/gpc1/ThreePi.nt/2010/03/20/o5275g0408o.149883/o5275g0408o.149883.cm.61196.XY01.mk.fits -- astrom problem in orig log, so can't update
    neb://any/gpc1/ThreePi.nt/2010/05/31/o5347g0277o.175474/o5347g0277o.175474.cm.83450.XY01.mk.fits -- astrom problem in orig log, so can't update
    neb://any/gpc1/ThreePi.nt/2010/05/31/o5347g0291o.175488/o5347g0291o.175488.cm.83463.XY03.mk.fits -- astrom problem in orig log, so can't update
    neb://any/gpc1/ThreePi.nt/2010/06/19/o5366g0183o.183985/o5366g0183o.183985.cm.90037.XY01.mk.fits -- astrom problem in orig log, so can't update
    
    neb://any/gpc1/ThreePi.nt/2010/03/17/o5272g0080o.148541/o5272g0080o.148541.cm.60380.XY01.mk.fits -- unclear why, update issue
    neb://any/gpc1/ThreePi.nt/2010/06/27/o5374g0280o.187535/o5374g0280o.187535.cm.92462.XY01.mk.fits -- unclear, update issue
    
    
  • 13:40 MEH: stacks being made, add more power from stdsci
    s3: -5 stdsci, +3 stk 
    
  • 18:30 MEH: reconfig for nightly science -- regular restart of stdsci and the s3 off in stack, might as well restart summitcopy and registration as well
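
Note on the 08:00 drop step: the remaining commands (the "etc..") follow the same pattern as the one laptool call shown. A minimal sketch, assuming the exp_id is the number that follows the exposure name in the nebulous key (as it is for 170732) and that the lap_id for each exposure is looked up separately. exp_id_from_key and drop_commands are hypothetical helper names, not part of IPP. The sketch only builds command strings for review and does not run laptool.

```python
import re

# Exposure directory component of a gpc1 nebulous key, e.g.
# ".../o5338g0156o.170732/..." gives exp_name "o5338g0156o" and exp_id 170732.
# Assumption: this naming holds for every key being dropped.
EXP_RE = re.compile(r"/(o\d{4}g\d{4}o)\.(\d+)/")


def exp_id_from_key(key):
    """Return the integer exp_id embedded in a nebulous key (hypothetical helper)."""
    match = EXP_RE.search(key)
    if match is None:
        raise ValueError("no exposure component found in key: " + key)
    return int(match.group(2))


def drop_commands(keys, lap_id, dbname="gpc1"):
    """Build one laptool drop command per distinct exposure (hypothetical helper).

    lap_id is supplied by the caller; it is NOT derived from the key.
    """
    seen = set()
    commands = []
    for key in keys:
        exp_id = exp_id_from_key(key)
        if exp_id in seen:
            continue  # several masks can belong to the same exposure
        seen.add(exp_id)
        commands.append(
            "laptool -updateexp -set_data_state drop "
            f"-dbname {dbname} -lap_id {lap_id} -exp_id {exp_id}"
        )
    return commands


if __name__ == "__main__":
    key = ("neb://any/gpc1/ThreePi.nt/2010/05/22/o5338g0156o.170732/"
           "o5338g0156o.170732.cm.80553.XY40.mk.fits")
    for cmd in drop_commands([key], lap_id=24202):
        print(cmd)
```

Printing the commands rather than executing them leaves room to check each lap_id/exp_id pair by hand before anything is set to drop.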

Tuesday : 2014.05.13

  • 12:27 removed LAP label from survey.relstack task. It is failing because there are duplicate stacks again.
  • 12:30 CZW stopped lap tasks in stdscience. Something has gone crazy in the past couple of days.
  • 15:35 CZW I restarted stdscience and forgot to note it here. PV2 processing is now running under the LAP.ThreePi.20130717.pole label due to intractable issues with the pole in the previous label. I'll merge the two labels together once the processing has finished.
  • 16:45 CZW I've disabled lap.cleanup. I realized that if this wasn't done, lap would clear all the chips it's producing, and those are needed to make the CNP.V3 warps.

Wednesday : 2014.05.14

  • 10:30 CZW: ipp060 seems to have crashed, with apparently no useful information on the console:
    <Jan/31 12:30 pm>?
    <Jan/31 12:30 pm>
    <Jan/31 12:30 pm>This is ipp060.ifa.hawaii.edu (Linux x86_64 3.7.6) 12:30:09
    
    • power cycled, seems to be coming back up without much trouble.
    • the ethernet cable was the problem.
  • 12:30 CZW: Restart stdscience.

Thursday : 2014.05.15

  • 1&:34 CZW: restarting stdscience.

Friday : 2014.05.16

  • 07:15 EAM : nightly science diffs are far behind, but chips are done, so I'm turning chip off for now.
  • 14:30 CZW: restarted stdscience.

Saturday : 2014.05.17

  • 07:15 EAM : turned chip off again while nightly science is dominated by warps and diffs

Sunday : 2014.05.18

  • 08:10 EAM : 2 machines died yesterday (ipp036 & ipp042) and we did not notice until this morning. I have tried to power cycle them, but they are not coming up from the console (maybe they have an invalid BIOS setting). I turned them off in nebulous, so processing is moving along ok. I have removed the LAP pole label (LAP.ThreePi.20130717.pole) for now so that nightly science can make progress. I've also bumped up the default stdscience allocation on the storage nodes.