Monday : 2013.10.07

  • 09:20 Bill restarted pstamp pantasks
  • 17:30 CZW: ipp051 and ipp063 had to be power cycled, as they had memory issues due to galactic plane.
  • 22:50 MEH: surprise, registration stuck.. 70 exposures behind -- restarting registration seems to have cleared
  • 00:10 MEH: then 2 OSS chips stalling on ipp028 for ~3ks.. restarted stdsci, STS and LAP label out
    • stsci10-19 also down and not off.. something was odd..
  • 01:20 MEH: OSS warps MIA.. restarting stdsci again found them. also five OSS had fault 5
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1501.041 -diff_id 483695 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1413.083 -diff_id 483713 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1413.095 -diff_id 483713 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1322.020 -diff_id 483726 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1322.090 -diff_id 483726 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1414.007 -diff_id 483727 -fault 0
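
The fault 5 fixes above all follow one pattern, differing only in skycell and diff_id. A minimal sketch of generating them from a list of pairs is below. The helper names and argument order are hypothetical; the difftool flags are copied from the entries above. It only prints the commands for review and does not run difftool.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) the difftool command used in this log to clear
# a fault on a diff skyfile. Helper names and argument order are
# hypothetical; the flags are copied verbatim from the log entries.
clear_diff_fault_cmd() {
    local skycell_id="$1" diff_id="$2" quality="${3:-14006}" dbname="${4:-gpc1}"
    printf 'difftool -dbname %s -updatediffskyfile -set_quality %s -skycell_id %s -diff_id %s -fault 0\n' \
        "$dbname" "$quality" "$skycell_id" "$diff_id"
}

# Read "skycell_id diff_id" pairs from stdin, one per line, and print one
# command per pair. Blank or incomplete lines are skipped.
clear_diff_faults_from_pairs() {
    local skycell_id diff_id
    while read -r skycell_id diff_id; do
        [ -n "$skycell_id" ] && [ -n "$diff_id" ] || continue
        clear_diff_fault_cmd "$skycell_id" "$diff_id"
    done
}
```

Usage would be something like `printf 'skycell.1501.041 483695\n' | clear_diff_faults_from_pairs`, then pasting the printed lines once they look right.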

Tuesday : 2013.10.08

mark is czar

  • 07:00 MEH: nightly mostly done, STS back in. 3PI fault 5 diffim
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1324.041 -diff_id 483889 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1332.035 -diff_id 483964 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1233.080 -diff_id  483855 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1234.079 -diff_id  483859 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1235.098 -diff_id  483863 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1331.020 -diff_id  483978 -fault 0
  • 09:20 MEH: ipp034 back up, put back into neb-host repair. ipp0016 put into repair over weekend because problematic, put back up now after reboot on Sunday..
  • 09:25 adding LAP label back in now --
  • 09:40 MEH: continuing general preventative maintenance -- restarting long running pantasks (summit, cleanup, distribution, publishing, stack; did registration last night) -- been month+ since last log archive to clear up space on ippc18.0
    • cleanup doesn't have lossycomp on by default, need to lossycomp.on
    • ippc18 disk is overused, cannot rsync the logs from ippc18.0 to ippc18.1 w/o over-driving load and killing processing.. suspect too many accounts are writing too many logs to home disk. have to set rsync to <2000KPS
  • 09:50 MEH: ipp049 had long running (>2hr) pswarp and ppImage, killed and reverted but looks like another pair ppImage stalling --
  • 10:00 MEH: working in nodes that have been recently fixed or out of processing for disk work etc
    • ipp031,032 -- no indication they cannot just be used as compute type nodes while disks are still down (should they be down still or put into repair?)
    • ipp034 -- was in wave2_weak as missing half RAM, fixed so put back into full use
    • ipp041 -- mobo had lost 1 CPU so was just datanode, fixed so put back into full use
    • ipp047 -- mobo fried, replaced and was in wave3_weak for testing. seems okay so back into full use -- dvo/ipptopsps seems to be overloading it (mysql ram issue??) so back to wave3_weak
    • ippc12 -- broken for while, fixed and put back into full compute use
  • 11:15 MEH: still no LAP stacks, putting 2x compute3 from stack into pstamp for MOPS (18.5k jobs)
    • possibly a lot of data for MOPS on ipp017 requested at once? driving load >200.. settling down now. need to check if also was from access by UMD for high throughput tests at that time
  • 12:00 MEH: addressing low disk space on ippc0x machines due to /tmp/nebulous_server.log
  • 12:35 MEH: MOPS stamps finished, restarting pstamp and putting the 2x compute3 back to stack
  • 13:00 MEH: LAP has worked through the stalled updates and many stacks being loaded. switching STS reprocessing label back to its original 202 priority (STS prio is above LAP)
  • 13:50 MEH: ipp047 has been on/off horribly overloaded in RAM and load.. probably also needs a powercycle -- finally got a login and killed off ppImage, Heather also stopping jobs and it seems to be back -- have to monitor the RAM use by mysql
    • ipp049 is in a terrible RAM use state from mysql (>90%) and cannot do much.. all machines with mysql need to be monitored and mysql restarted when it is using significant RAM...
  • 14:00 MEH: looks like ippdb01 has died.. -- nothing on console, powercycle seems to be rebooting now (will take a terribly long time for RAM check)
  • 16:00 MEH: MD10z stack fault 5 due to all input warps having bad PSF. just setting to quality 42, probably should be 13006 (PPSTACK_ERR_DATA)
    stacktool -dbname gpc1 -updatesumskyfile -stack_id 2809682 -fault 5 -set_quality 42
    stacktool -updaterun -dbname gpc1 -stack_id 2809682  -set_state full
  • 16:10 MEH: ippmonitor hasn't been updating because czarpoll crashed when ippdb01 went down. restarted on ippc11
  • 17:40 MEH: ipp032 looks like it went down.. -- ganglia doesn't indicate any overload so likely is down.. -- not rebooting on powercycle..
  • 18:30 MEH: stdsci restart for nightly science, dropping STS label for the night and will let LAP run as is easier on system with updates
  • 23:00 MEH: more diffim fault 5 from marginal data and psLib error
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.056 -diff_id 484115  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.083 -diff_id 484115  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.053 -diff_id 484119  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.036 -diff_id 484126  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.072 -diff_id 484139  -fault 0
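
The 09:40 entry above notes that the log archive rsync has to be throttled to under 2000 KB/s to avoid overloading ippc18. A minimal sketch of building such a command is below. `--bwlimit` is a standard rsync option; the function name, the 1500 default and any paths passed in are hypothetical and not taken from the actual archive procedure. The command is printed, not run.

```shell
#!/usr/bin/env bash
# Sketch: build a bandwidth-limited rsync command line for archiving logs.
# --bwlimit is a standard rsync option (units of 1024 bytes per second by
# default). The function name and the 1500 default are hypothetical.
throttled_rsync_cmd() {
    local src="$1" dest="$2" limit_kbps="${3:-1500}"
    # Reject non-numeric limits so a typo cannot silently drop the throttle.
    case "$limit_kbps" in
        ''|*[!0-9]*) echo "limit must be a whole number of KB/s" >&2; return 1 ;;
    esac
    # rsync treats a limit of 0 as "no limit", so refuse that as well.
    if [ "$limit_kbps" -le 0 ]; then
        echo "limit must be greater than 0" >&2
        return 1
    fi
    printf 'rsync -a --bwlimit=%s %s %s\n' "$limit_kbps" "$src" "$dest"
}
```

Paths containing spaces would need quoting before the printed line is pasted into a shell.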

Wednesday : 2013.10.09

mark is czar

  • 07:00 MEH: registration stalled again, regpeek showed it was stuck with the following (XY62 a couple of times last night, then a few more times this morning, to get the data registered)
    o6574g0216o  XY62 0 check_burntool neb://ipp041.0/gpc1/20131009/o6574g0216o/o6574g0216o.ota62.fits	#??? regtool -updateprocessedimfile -exp_id 664070 -class_id XY62 -set_state pending_burntool -dbname gpc1
    • 72 remaining exposures now clearing
    • MD01 warp also fault 3 and clearing, making stacks
  • 09:10 MEH: nightly through except two exposures wedged in warp stage, will deal with when actually get into office... -- after another restart of stdsci finally cleared..
    • LAP warps only to build up stack pool before switching back to STS -- LAP update order appears scrambled, maybe after new RA chunk added, many are short 10-20 warps to finish and make stacks and need chips so this may take hour or so to start getting enough through
  • 11:10 not getting to point for stacks.. will re-add LAP label during nightly processing, so STS running again
  • 11:30 continuing to archive old pantasks logs
  • 14:20 MEH: no LAP stacks, throwing 2x compute3 into PSS for MOPS
  • 18:30 MEH: doing stdsci restart before nightly science, then will drop STS label for LAP processing until stack triggered (likely in morning)
  • 23:30 MEH: registration more often getting stuck and having to manually do
    regtool -updateprocessedimfile -exp_id 664212 -class_id XY62 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 664263 -class_id XY05 -set_state pending_burntool -dbname gpc1
    o6575g0217o  XY62 0 check_burntool neb://ipp041.0/gpc1/20131010/o6575g0217o/o6575g0217o.ota62.fits	#??? regtool -updateprocessedimfile -exp_id 664358 -class_id XY62 -set_state pending_burntool -dbname gpc1
    o6575g0218o  XY62 0 check_burntool neb://ipp041.0/gpc1/20131010/o6575g0218o/o6575g0218o.ota62.fits	#??? regtool -updateprocessedimfile -exp_id 664360 -class_id XY62 -set_state pending_burntool -dbname gpc1
    o6575g0219o  XY62 0 check_burntool neb://ipp041.0/gpc1/20131010/o6575g0219o/o6575g0219o.ota62.fits	#??? regtool -updateprocessedimfile -exp_id 664361 -class_id XY62 -set_state pending_burntool -dbname gpc1
    o6575g0269o  XY62 0 check_burntool neb://ipp041.0/gpc1/20131010/o6575g0269o/o6575g0269o.ota62.fits	#??? regtool -updateprocessedimfile -exp_id 664411 -class_id XY62 -set_state pending_burntool -dbname gpc1

Thursday : 2013.10.10

  • 07:00 MEH: 181 exposures backed up in registration, had to manually clear
    o6575g0333o  XY62 0 check_burntool neb://ipp041.0/gpc1/20131010/o6575g0333o/o6575g0333o.ota62.fits	#??? regtool -updateprocessedimfile -exp_id 664475 -class_id XY62 -set_state pending_burntool -dbname gpc1
    o6575g0363o  XY62 0 check_burntool neb://ipp041.0/gpc1/20131010/o6575g0363o/o6575g0363o.ota62.fits	#??? regtool -updateprocessedimfile -exp_id 664504 -class_id XY62 -set_state pending_burntool -dbname gpc1
    o6575g0448o  XY62 0 check_burntool neb://ipp041.0/gpc1/20131010/o6575g0448o/o6575g0448o.ota62.fits	#??? regtool -updateprocessedimfile -exp_id 664590 -class_id XY62 -set_state pending_burntool -dbname gpc1
    • example reg log may indicate mysql/db problem
      Unable to perform 1 at /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ line 525
              main::my_die_for_update('Unable to perform 1', 664475, '\'XY62\'', 2) called at /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/ line 356
      Running: /home/panstarrs/ipp/psconfig/ipp-20130712.lin64/bin/regtool -updateprocessedimfile -exp_id 664475 -class_id 'XY62' -fault 2 -set_state pending_burntool  -hostname ipp041 -dbname gpc1
       -> psDBAlloc (psDB.c:166): Database error originated in the client library
           Failed to connect to database.  Error: Lost connection to MySQL server at 'reading authorization packet', system error: 0
       -> regtoolConfig (regtoolConfig.c:471): (null)
           Can't configure database
       -> main (regtool.c:71): (null)
           failed to configure
  • 08:10 MEH: nightly finally registered, and since up babysitting this, diffim fault 5 on OSS fix
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1149.043 -diff_id 484649 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1149.044 -diff_id 484649 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1148.056 -diff_id 484655 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.1147.071 -diff_id 484666 -fault 0
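
The regpeek lines pasted in the registration entries above carry a suggested regtool command after a `#???` marker. A minimal sketch of pulling those commands out of saved regpeek output is below, assuming every stuck line uses that same marker as shown in this log. The function name is hypothetical, and the extracted commands should still be reviewed before running.

```shell
#!/usr/bin/env bash
# Sketch: pull the suggested regtool command out of regpeek-style lines.
# Assumes each stuck line carries the command after a literal "#??? "
# marker, as in the entries above; lines without the marker are skipped.
# The function name is hypothetical.
suggested_regtool_cmds() {
    # In a basic regular expression "?" is an ordinary character, so
    # "#??? " matches the marker literally.
    sed -n 's/^.*#??? \(regtool .*\)$/\1/p'
}
```

For example, `suggested_regtool_cmds < regpeek.out` (with `regpeek.out` a hypothetical file holding the pasted lines) prints one regtool command per stuck imfile.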

  • 09:10 MEH: and STS label back in since plenty of LAP stack to run.
  • 23:30 MEH: pausing processing to rebuild ippconfig. deepstack running deep stacks now, 2x compute3 from stack as usual. will keep an eye off and on (very long running jobs). if causes a problem then feel free to stop and kill ppStack jobs (just leave a note in czarlog).

Friday : 2013.10.11

  • 12:00 CZW: I just noticed ipp058 was down. The console shows a lot of information about crashed programs and system calls, so this looks like some variety of kernel panic rather than a hardware failure. Cycling power.

Saturday : 2013.10.12

  • 07:00 HAF: ipp006's mysql fell over? ipptopsps not happy
  • 20:55 MEH: stdsci in dire need of restart and camera revert (still fault) to hopefully start making more stacks. so stuck (server is busy...33), can't even get to stop
    • camera faults look to be from missing files, e.g.
           Unable to access file neb://ipp041.0/gpc1/LAP.ThreePi.20120706/2012/08/13/o5950g0388o.442215/ nebclient.c:535 nebFind() - no instances found
  • 21:10 MEH: diff fault 5
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.076 -diff_id 484815  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.077 -diff_id 484815  -fault 0
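
Several entries this week (ipp047, ipp049, ipp006) involve mysql eating RAM or falling over, and Tuesday's note says machines running mysql need to be monitored. A minimal sketch of such a check is below. It sums the %MEM of `mysqld` processes from `ps`-style input and compares it to a threshold. The function names and the default of 90 are hypothetical (90 only echoes the ">90%" noted for ipp049), and it assumes the server process is named `mysqld`.

```shell
#!/usr/bin/env bash
# Sketch: sum the %MEM of mysqld processes and compare to a threshold.
# Function names and the default threshold of 90 are hypothetical.

# Reads lines of "<pmem> <command>" on stdin, prints the mysqld total.
mysqld_mem_percent() {
    awk '$2 == "mysqld" { total += $1 } END { printf "%.1f\n", total }'
}

# Exits 0 when the mysqld total read from stdin is >= the threshold in $1.
mysqld_over_threshold() {
    local threshold="${1:-90}"
    mysqld_mem_percent | awk -v t="$threshold" '{ exit !($1 >= t) }'
}
```

On a node this could be fed with `ps -e -o pmem= -o comm= | mysqld_over_threshold 90 && echo "mysqld RAM high on $(hostname)"`. Whether and when to restart mysql is left to the czar.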

