PS1 IPP Czar Logs for the week 2014.12.22 - 2014.12.28


Monday : 2014.12.22

  • 07:40 MEH: warp fault -- cannot build growth curve (psf model is invalid everywhere)
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1297048 -skycell_id skycell.2460.067
    
  • 13:05 MEH: ipp006-031 set to neb-host repair, to limit the space available and help limit data across the 10G link, w/ a note reflecting such (as discussed at the morning meeting) -- see the neb-host sketch below. ipp033-053 are already in repair for similar reasons.
    • all except ipp008,012,013,014,016,018,109,020,021,037, which are actually on ippcore
  • 18:00 CZW: restarting stdlocal, as it's old and not spawning jobs efficiently -- be sure to re-adjust the poll so that chip processing doesn't overload the 10G link
  • 23:40 MEH: doesn't look like the rebalance was done? poll >400 for chips and the 10G overload is harassing other processing (good thing there is no nightly); rate is lower than before with fewer stacks. The 8x nodes (not sure when that changed) look to be trying 6 stacks at a time -- probably not a good default to have. Setting back to 6x and reducing the chip poll to where the 10G link isn't as overloaded.
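  For the warp fault above (and the similar difftool fixes on Wednesday and Thursday), a minimal batch sketch for clearing a set of faulted skycells; the warptool flags are the ones used in the entry above, but the input list of warp_id/skycell_id pairs is hypothetical:
    # hypothetical file of "warp_id skycell_id" pairs collected from the fault report
    while read warp_id skycell_id; do
        warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 \
            -warp_id "$warp_id" -skycell_id "$skycell_id"
    done < faulted_warp_skycells.txt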
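  The neb-host repair change is per host; a minimal sketch for a range of hosts, assuming the usual "neb-host <host> <state>" argument order (check on the cluster before running), with the ippcore hosts noted above skipped by hand:
    # ipp006 .. ipp031; skip the hosts listed above as being on ippcore
    for n in $(seq -w 6 31); do
        neb-host ipp0$n repair
    done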

Tuesday : 2014.12.23

  • 09:05 MEH: ipp015 unresponsive, nothing on console, minimal load -- power cycle
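  Power cycles of unresponsive nodes like ipp015 are done out-of-band; a minimal IPMI sketch, where the BMC hostname and credentials are hypothetical placeholders:
    # check power state first, then cycle
    ipmitool -I lanplus -H ipp015-ipmi -U ADMIN -P XXXX chassis power status
    ipmitool -I lanplus -H ipp015-ipmi -U ADMIN -P XXXX chassis power cycle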

Wednesday : 2014.12.24

  • 01:40 MEH: regtool on ipp062 stalling registration for the past 1.5 hr, 20 exposures behind -- killed the job and registration is moving again
  • 02:10 MEH: ippc28 crash -- GPF (general protection fault) and something odd w/ swap on /dev/sdb? -- trying to power cycle, but taking it out of stdsci
    <Dec/24 01:42 am>[1507866.922239] end_request: I/O error, dev sdb, sector 198065
    <Dec/24 01:42 am>[1507866.928211] Read-error on swap-device (8:16:198073)
    
    • power cycled but not fully booting, stalling on * Starting local ...
  • 06:15 MEH: fixed fault 5 diff
    difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 625690 -skycell_id skycell.1243.051 -dbname gpc1
    
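  For the ippc28 swap errors above, a minimal sketch for confirming /dev/sdb is the swap device and pulling it out of service once the node is back up (device name taken from the kernel messages above; whether the disk itself is failing would need the SMART output):
    swapon -s                  # confirm /dev/sdb (or a partition on it) is the active swap
    smartctl -H -a /dev/sdb    # check SMART health and error counters
    swapoff /dev/sdb           # stop swapping to the suspect device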

Thursday : 2014.12.25

  • 06:00 MEH: nightly registration behind, stdlocal poll 200->150
  • 06:25 EAM: ipp086 is getting overloaded by NFS and falling behind; nightly science is waiting for jobs that want ipp086. Putting it in repair for now. Update: putting stdlocal to control run reap while nightly science catches up a bit.
    • ipp086 load is due to the RAID cache being in the WT (write-through) case; something is odd w/ the battery stats
  • 07:35 EAM : stdlocal back to run
  • 08:10 MEH: diffim fault 5 clear
    difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 625928 -skycell_id skycell.2048.094 -dbname gpc1
    
  • 21:30 MEH: looks like stdlocal load has stalled nightly processing to the auto-shutoff point in stdlocal chip-warp -- setting stdlocal poll to 150 so if nightly catches up, both can maybe run more easily together
  • 10:05 MEH: looks like stdsci polls also not staying filled -- needed its regular restart
    • ipp084, like ipp086, is also in the WT case and having load issues -- may need to trigger a battery relearn cycle again (see the BBU sketch after this list)
    • w/ stdlocal auto-stopped, network over 10G link is underused ~5-6 Gb/s with just nightly and md processing
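  For the ipp084/ipp086 WT (write-through) issue, a minimal sketch for checking the RAID battery and cache policy and triggering a relearn, assuming LSI controllers managed with MegaCli (binary name and path may differ per host):
    MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL    # battery state, charge, learn-cycle flags
    MegaCli64 -LDGetProp -Cache -LALL -aALL     # confirm the logical drive has dropped to WriteThrough
    MegaCli64 -AdpBbuCmd -BbuLearn -aALL        # manually start a battery relearn cycle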

Friday : 2014.12.26

Saturday : 2014.12.27

  • 20:45 MEH: looks like ippc01 has no disk space left; nothing has been processing well for a while. Stopping apache to reset the log (sketch below) -- looks like most of the day was lost to faulted processing.
    • also summitcopy+registration+stdsci could use a restart
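  For the ippc01 full-disk case, a minimal sketch for finding what filled the disk and resetting the apache log in place (the log path here is a guess and should be checked on the host):
    df -h /                              # confirm which filesystem is full
    du -shx /var/log/* | sort -h | tail  # find the biggest offenders (path hypothetical)
    : > /var/log/apache2/access_log      # truncate in place; apache keeps its open filehandle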

Sunday : 2014.12.28

  • 08:05 EAM : restarting stdlocal (~300k jobs)
  • 21:35 HAF: registration jammed up -- 3 jobs were hung on exp 156 and 159; I reran them on stsci00 and that cleared it up (not sure why they stalled). Then there were a few of the usual jammed things from regpeek.pl:
    ipp_apply_burntool_single.pl --camera GPC1 --exp_id 845004 --class_id XY20 --this_uri neb://ipp095.0/gpc1/20141229/o7020g0157o/o7020g0157o.ota20.fits --continue 10 --previous_uri neb://ipp095.0/gpc1/20141229/o7020g0156o/o7020g0156o.ota20.fits --dbname gpc1 --verbose
    ipp_apply_burntool_single.pl --camera GPC1 --exp_id 845004 --class_id XY67 --this_uri neb://ipp083.0/gpc1/20141229/o7020g0157o/o7020g0157o.ota67.fits --continue 10 --previous_uri neb://ipp083.0/gpc1/20141229/o7020g0156o/o7020g0156o.ota67.fits --dbname gpc1 --verbose
    register_imfile.pl --exp_id 845005 --tmp_class_id ota04 --tmp_exp_name o7020g0159o --uri neb://ipp093.0/gpc1/20141229/o7020g0159o/o7020g0159o.ota04.fits --logfile neb://ipp093.0/gpc1/20141229/o7020g0159o.845005/o7020g0159o.845005.reg.ota04.log --bytes 24531840 --md5sum aa35430a1eefab88cbfb645020259a65 --sunset 03:30:00 --sunrise 17:30:00 --summit_dateobs 2014-12-29T06:09:01.000000 --dbname gpc1 --verbose
    regtool -updateprocessedimfile -exp_id 845003 -class_id XY67 -set_state pending_burntool -dbname gpc1
     regtool -updateprocessedimfile -exp_id 845005 -class_id XY04 -set_state pending_burntool -dbname gpc1
    
    • generally caused by unbalanced stdlocal overload at night -- now that it is in auto-shutoff due to the nightly backup, processing should be smoother.
    • 23:24 HAF: registration stuck again, I reran this:
      register_imfile.pl --exp_id 845130 --tmp_class_id ota13 --tmp_exp_name o7020g0284o --uri neb://ipp086.0/gpc1/20141229/o7020g0284o/o7020g0284o.ota13.fits --logfile neb://ipp086.0/gpc1/20141229/o7020g0284o.845130/o7020g0284o.845130.reg.ota13.log --bytes 22518720 --md5sum b257a6df9226292a6de7b93881373ce7 --sunset 03:30:00 --sunrise 17:30:00 --summit_dateobs 2014-12-29T08:16:14.000000 --dbname gpc1 --verbose 
      
  • 23:34 HAF: yay, summitcopy/registration caught up
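  When registration jams like this, the stuck burntool/register jobs above were found and re-run by hand; a minimal sketch of that pattern, assuming the hung commands show up in the process list on the processing host (hostname hypothetical; the regtool line reuses the flags from the entries above with the exp_id/class_id of the first hung burntool job):
    # look for long-lived burntool/register processes on the suspect host
    ssh ipp095 'ps -eo pid,etime,args | egrep "ipp_apply_burntool_single|register_imfile" | grep -v grep'
    # after killing a hung job, re-run its command line by hand (as listed above), then
    # reset the imfile so registration picks it up again
    regtool -updateprocessedimfile -exp_id 845004 -class_id XY20 -set_state pending_burntool -dbname gpc1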