PS1 IPP Czar Logs for the week 2014.12.15 - 2014.12.21

(Up to PS1 IPP Czar Logs)

Monday : 2014.12.15

  • 08:00 MEH: cleanup must be turned back on; the small number of nodes available for data is killing processing again
  • 11:00 CZW: I've updated the database to prevent the exposures from the dead pixelserver from clogging up summitcopy:
    UPDATE pzDownloadExp SET state = 'drop' WHERE summit_id = 832987;
    UPDATE pzDownloadExp SET state = 'drop' WHERE summit_id = 832988;
    UPDATE pzDownloadExp SET state = 'drop' WHERE summit_id = 832989;
    UPDATE pzDownloadExp SET state = 'drop' WHERE summit_id = 832990;
    UPDATE pzDownloadExp SET state = 'drop' WHERE summit_id = 832991;
    UPDATE pzDownloadExp SET state = 'drop' WHERE summit_id = 832992;
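
    • for reference, the six drops above could equivalently be issued as one statement against the same gpc1 pzDownloadExp table (a sketch of the same operation, not what was actually run):
      UPDATE pzDownloadExp SET state = 'drop'
       WHERE summit_id IN (832987, 832988, 832989, 832990, 832991, 832992);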
    
  • 14:00 CZW: Restarting stdlocal. I'm going to see what happens with the "default" set of hosts.
    • 14:07 CZW: default set is underpowered. Turning on xnodes with xnode.hosts.on macro.
  • 14:50 MEH: summitcopy stop -- camera group reporting ramping overload -- anyone else doing summitcopy type stuff?
  • 14:45 CZW: I noticed more failing exposures in pzDownloadExp table, and dropped the camera controlled exposures in state 'run':
    UPDATE pzDownloadExp SET state = 'drop' WHERE summit_id = 834453;
    UPDATE pzDownloadExp SET state = 'drop' WHERE summit_id = 834454;
    UPDATE pzDownloadExp SET state = 'drop' WHERE summit_id = 834455;
    UPDATE pzDownloadExp SET state = 'drop' WHERE summit_id = 834456;
    UPDATE pzDownloadExp SET state = 'drop' WHERE summit_id = 834457;
    
  • 14:55 MEH: Haydn working on ipp077, will be up/down and shouldn't be used until he gives the all-clear
    • back online -- setting to repair only so as not to overload it until we catch up
  • 15:30 MEH: diff cleanup on only, doing double loading and poll 120 -- seems okay, will help move the larger diffs from the past few days out of the way
  • 16:00 CZW: With ipp077 back online, the LANL remote stack prep commands are no longer failing. I've reverted the LAP.PV3.20140730.bigmem stacks locally, to see if adding stacks impacts the processing rate.
  • 16:25 CZW: Restarting the ipp user pantasks. I will add 4x c2 to stdscience and ensure that cleanup is prioritizing the larger diffs.
  • 16:55 CZW: ipp071 had a load spike, probably related to stdlocal having the stack poll set too high. I've set it to repair, and I'll move it back to up when the stack npend number drops.
  • 18:20 CZW: After hearing back from Gene, I've added the stsciXX nodes to 'up' in nebulous to see if that helps with our processing throughput.
  • 18:25 CZW: Load jumped on stsci machines running Gene's rsyncs. I'm hesitant to keep them up for nightly processing with this added load. Back to repair, and I'll try again tomorrow when it's easier to monitor.
  • 18:30 MEH: ippx001 new kernel behavior is essentially no different from the old kernel version (though overloading the xNNN nodes w/ stacks seems to work well) -- need more info on what the network to these machines is doing
  • 18:35 MEH: will manually add new storage nodes to summitcopy as done for past nights -- however, before adding, seeing a large number of fault events in summitcopy, something wrong? stsci overload? stdlocal heavily loaded still?
    • OTAs hitting timeouts seem to download fine (spot-check below) -- stdlocal stop.. -- it needs to be rebalanced for nightly processing anyway
    • looks like a select set of OTAs for every exposure
      http://conductor.ifa.hawaii.edu/ds/gpc1/o7007g0001d/o7007g0001d04.fits
      http://conductor.ifa.hawaii.edu/ds/gpc1/o7007g0001d/o7007g0001d32.fits
      http://conductor.ifa.hawaii.edu/ds/gpc1/o7007g0001d/o7007g0001d71.fits
      http://conductor.ifa.hawaii.edu/ds/gpc1/o7007g0001d/o7007g0001d14.fits
      http://conductor.ifa.hawaii.edu/ds/gpc1/o7007g0001d/o7007g0001d45.fits
      http://conductor.ifa.hawaii.edu/ds/gpc1/o7007g0001d/o7007g0001d41.fits
      http://conductor.ifa.hawaii.edu/ds/gpc1/o7007g0001d/o7007g0001d36.fits
      http://conductor.ifa.hawaii.edu/ds/gpc1/o7007g0001d/o7007g0001d60.fits
      
    • not always the same set -- o7007g0011d has a slightly different set and only 7 that time out --
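    • a quick way to spot-check one of the repeatedly timing-out OTA files by hand (any of the URLs above works):
      # headers only; give up after 60s to mimic a timeout check
      curl -m 60 -sI http://conductor.ifa.hawaii.edu/ds/gpc1/o7007g0001d/o7007g0001d04.fits
      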
  • 20:10 MEH: restarting ~ippmd/stdscience/ptolemy.rc at normal half power for nightly
    • not adding new storage nodes back to summitcopy since camera group allowing more throughput, avoid overloading tonight
  • 22:10 MEH: don't see stdlocal running or fully adjusted for nightly running -- turning down to 2x:c2, 3x:x, or too many jobs will be trying to run
  • 00:05 MEH: massive faults and registration delay from a db00 transaction issue around
    141215 23:44:46InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
    
    • nothing specifically different running?? what else was running around this time?
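    • for next time, the standard MySQL checks to see what is holding transactions / undo slots open on db00 at that moment:
      -- active transactions, undo usage, lock waits:
      SHOW ENGINE INNODB STATUS\G
      -- everything else running against the server:
      SHOW FULL PROCESSLIST;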

Tuesday : 2014.12.16

  • 07:30 MEH: ipp083 in bad state, causing the severe drop in processing rate ~0400 -- put to repair, stdlocal, ippmd, cleanup stop
    /bin/tcsh: Too many open files in system
    Connection to ipp083 closed.
    
    • seen before, sometimes clears on its own, sometimes needs a reboot -- can log in, and yes, something went strange --
      [843651.114681] VFS: file-max limit 19809058 reached
      
      Dec 16 07:35:00 ipp083 [843346.823876] VFS: file-max limit 19809058 reached
      Dec 16 17:35:00 ipp083 portmap[15243]: warning: cannot open /etc/hosts.allow: Too many open files in system
      Dec 16 17:35:00 ipp083 portmap[15243]: error: bad option name: "^C"
      Dec 16 07:35:00 ipp083 [843346.830817] VFS: file-max limit 19809058 reached
      
      Dec 16 07:40:33 ipp083 nrpe[16177]: Daemon shutdown
      Dec 16 07:40:33 ipp083 nrpe[16179]: Network server accept failure (23: Too many open files in system)
      Dec 16 07:40:33 ipp083 nrpe[16179]: Cannot remove pidfile '/var/run/nrpe/nrpe.pid' - check your privileges.
      
    • odder than seen before, but we probably want to fix whatever is hitting the huge file-max limit -- processing is moving again -- cleanup back to run
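    • for next time, a quick way to see how full the kernel file table is and which processes are eating it (plain /proc checks, nothing ipp-specific; run as root to see every process):
      # allocated handles, free handles, and the file-max limit
      cat /proc/sys/fs/file-nr
      # rough per-process open-fd counts, worst offenders first
      for p in /proc/[0-9]*; do
        echo "$(ls $p/fd 2>/dev/null | wc -l) $(cat $p/cmdline 2>/dev/null | tr '\0' ' ')"
      done | sort -rn | head
      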
  • 08:10 MEH: and another exposure to be manually reverted in registration
    regtool -dbname gpc1 -revertprocessedexp -exp_id 839383
    
  • 09:10 MEH: should be okay to put ipp083 neb-host up again -- nightly ~finished so stdlocal back run also -- nope, back to repair, this was the cause of the problem at 4am and then things broke from there..
    Dec 16 04:22:15 ipp083 MR_MONITOR[16550]: <MRMON195> Controller ID:  0   BBU disabled; changing WB logical drives to WT, Forced WB VDs are not affected
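
    • the WB/WT cache state and battery can be checked directly on the node, assuming the LSI MegaCli utility is installed there (binary name/path varies, e.g. MegaCli64; the flags below are the standard documented ones and may need adjusting):
      # battery backup unit status (charge level, relearn state)
      MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
      # current cache policy per logical drive (WriteBack vs WriteThrough)
      MegaCli64 -LDGetProp -Cache -LAll -aALL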
    
  • 09:15 MEH: Haydn fixed ipp077 disk yesterday, was in repair fine and processing caught up -- try neb-host up for a while
  • 09:20 MEH: ipp071 had raid overload, battery looks to be charged again so try neb-host up
  • 09:25 MEH: another bad fault 5 OSS WW diffim
    difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 622241 -skycell_id skycell.0720.020 -dbname gpc1
    
  • 09:45 MEH: need to test the IDLE node case in stdlocal, stack.off for a bit -- need to do other things.. stack.on but will not be modifying the daytime allocation other than bumping up to 5x:x
  • 11:10 MEH: ippx100 is also being deleted from normal use to be monitored w/ ippmd processing
  • 11:30 CZW: stsci storage experiment on again.
  • 13:15 CZW: storage nodes on in stdlocal to see if that improves the rate. Nebulous connections are down and stably so, so that shouldn't be a problem.
    • once stsci is available, it looks like x001 and x100 behave more like the ipps nodes -- after talking to Gavin this could easily make sense; he is making a diagram of what is plugged in where.
  • 15:00 CZW: chip and warp cleanup off to get the diffs through again. We're not really in a space crunch with the stsci nodes open, but getting space cleared on the ipp07-8X storage nodes will help keep the number of available target nodes high if/when we need to take the stsci nodes out for DVO operations.
  • 18:20 CZW: Leaving stsci up for the night. Will turn on warp/chip cleanup and reduce stdlocal power when I get home.
  • 19:35 MEH: starting new storage node (s4+5) processing test during nightly processing -- long running jobs from ~ippmd/deepstack/ptolemy.rc

Wednesday : 2014.12.17

  • 09:35 MEH: monitoring some high load/wait systems ipp085,075
  • 10:30 MEH: pause to reduce all but deepstack processing for 10 minutes -- back to run -- deepstacks also suffering from the network topology
  • 11:55 MEH: ipp083 BBU reports okay still but it seems to still be in the WT cache state and acts as an anchor to processing -- leaving in repair
  • 12:45 MEH: ipps03 unresponsive -- not too heavily loaded w/ deep stacks -- will deal with it
  • 14:15 MEH: chip+warp cleanup off to focus on diffs until finished --
  • 20:20 MEH: nightly processing would do a lot better if the polling was fully loaded.. doing the necessary regular restart -- also needs 4x:c2
  • 21:00 MEH: cleanup diff finished, chip+warp back on
  • 22:45 MEH: nightly slowly catching up. pstamp also could use a restart --

Thursday : 2014-12-18

  • 04:50 CZW: registration behind. Many failing reg.imfile jobs. Logfile points to full /tmp/nebulous_server.log. Stopped apache on ippc09 and ippc07, removed log file, restarted apache. Jobs appear to be catching up, so this looks to have resolved the major blockage. The other servers should probably be restarted in the next few days, but none had disk usages above ~77%, so they were not causing the same issues that c07/09 were.
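    • roughly what the recovery looked like, for the next czar (the apache init command is an assumption -- use whatever normally manages the nebulous apache instances on these hosts):
      # on each affected nebulous web host (ippc07 and ippc09 in this case)
      /etc/init.d/apache2 stop          # assumed init script path
      rm /tmp/nebulous_server.log       # the full log that was blocking reg.imfile jobs
      /etc/init.d/apache2 start
      df -h /tmp                        # check usage; the other servers were at or below ~77% and left running
      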
  • 11:55 MEH: attempting to ease some processing and network issues, will modify the ippTasks/ipphosts.mhpcc.config
    • nightly and MD will target the new storage nodes (and older storage nodes with space) for OTA-to-skycell, since MD stacking is done on those machines and nightly isn't going away, unlike the stsci nodes at the end of PV3
    • Gene will target stdlocal OTA-to-skycell to the stsci nodes only
  • 12:05 MEH: ippsvn wiki disk issue -- ipp002.0 disk full.. oops we need to clean up stuff there before uploading large images to the wiki
  • 14:30 MEH: only a local manual change to ippTasks/ipphosts.mhpcc.config for nightly until we're sure it isn't causing any problems tonight -- restarting summit+reg+stdsci
  • 14:50 CZW: restart stdlanl, as 500+ pending fake jobs indicate it has decided to stop working at maximum.
  • 15:00 CZW: xnode.hosts.on in stdlocal to get the processing rate up.
  • 15:00 MEH: cleanup has diffs to do, chip+warp .off until later -- not many so back to normal cleanup
  • 15:38 Bill: pstamp working directory on /data/ippc30.1 has run out of space. Changed preserve time for requests from 14 days to 10 (see site.config). To accelerate the process, ran the following command as ~ipp:
    pstamp_queue_cleanup.pl --preserve-days 10 --verbose
    
  • 15:47 Bill: restarted pstamp pantasks
  • 16:55 EAM: Haydn is changing batteries on ipp086, 87, 88, and maybe 84, and a disk on ipp077. I stopped processing, but a number of stdlocal stacks were going to take a long time. I have stopped them (kill -STOP) on the following machines:
    21147 stack_skycell.pl   ippx054
    21142 stack_skycell.pl   ippc56
    21152 stack_skycell.pl   ippc27
    21151 stack_skycell.pl   ippx058
    21154 stack_skycell.pl   ippx084
    21155 stack_skycell.pl   ippx074
    21143 stack_skycell.pl   ippc28
    21149 stack_skycell.pl   ippx019
    21144 stack_skycell.pl   ippc27
    21146 stack_skycell.pl   ippc06
    21148 stack_skycell.pl   ippc09
    21141 stack_skycell.pl   ippx050
    21150 stack_skycell.pl   ippc50
    21145 stack_skycell.pl   ippx046
    21153 stack_skycell.pl   ippx093
    21140 stack_skycell.pl   ippx082
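
    • to resume these once the battery/disk work is done, something like the following sketch (stopped_stacks.txt is a hypothetical file holding the pid/command/host lines above; ssh as the processing user is assumed):
      # send SIGCONT to each paused stack_skycell.pl on its host
      while read -r pid cmd host; do
        ssh "$host" "kill -CONT $pid"
      done < stopped_stacks.txt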
    
  • 19:45 MEH: ipp077 disk replaced so neb-host back up, ipp082 battery replaced and charging but still in WT state so repair
  • 19:50 MEH: nightly is struggling.. stdlocal has 6x:c2; if >2-3 stacks run they will greatly stall nightly processing, which also uses those nodes, so reducing to 2x -- some are idle anyway..

Friday : 2014.12.19

  • 7:23 haf: bunch of faulted things, clearing them manually:
    regtool -dbname gpc1 -revertprocessedimfile -exp_id_begin 840400
  • 7:31 haf: why are we so behind both on registration and summitcopy? It's light outside! We should be done!
    • overloading stdlocal overfills a network link shared by some of the storage and compute nodes used by nightly processing? to help speed things, try stopping stdlocal
    • some newer data nodes were not in repair while in the WT state and can get wedged -- ipp086 was the only one left, so fixed that
  • 08:15 MEH: ipp082 is back to WB, so neb-host up -- ipp086 still working on BBU charging
  • 11:45 MEH: new storage group s4+5 have done summitcopy, stdsci, deep stack -- trying them added to cleanup as well -- still quite a backlog of nightly from early December
    • w/ 2x:s3+2x:s4+5, starting to see some extra work on the new storage disks -- all OSS chips from earlier in Dec, ~8k still? maybe doing ~400/hr w/ this config
  • 16:15 CZW: Just so it's noted: LANL is down until next week, as the computer people started patches late on a Friday, didn't get them done, and now are keeping the cluster we use down until they can finish. I've turned off all tasks in the stdlanl pantasks to avoid unnecessary work, with only the lap tasks remaining (lap.initial is run from here, and is needed for new runs to be queued and processed in both stdlocal and stdlanl).
  • 16:23 HAF: Haydn is doing repair work on ipp083/84/86 - I set those to down in neb-host, and I also controller host off'd them on ippmd/stdscience and ipp/cleanup
    • MEH: work done, neb-host repair until battery charged and WB enabled
  • 20:15 MEH: nightly was slipping rapidly, all other processing off for a bit
    • add s5 to summitcopy since targeting those disks anyways
    • a large problem w/ overloading stdlocal is interruption of summitcopy and registration: if those back up, the trigger (chip+cam+warp+diff) to turn off stdlocal is never reached -- if we can isolate them, then we can load more into stdlocal in case weather interrupts nightly
  • 21:50 MEH: isolating summitcopy+registration seems more successful; in both, manually do:
    • hosts off s0+s1, add s5 if needed, manually controller host on 008,012,013,014,016,018,019,020,021,037 (a mix of s1 group) -- numbers are similar to base level and may be able to add ipps or ippc20-28(?)
    • suspect registration is still having a problem because ipp067-071 are still targeted and sit across the overloaded 10G link
  • 00:30 MEH: looks like stdlocal has triggered chip-warp off automatically because >50 chips in stdsci -- summitcopy pretty much caught up but chip processing isn't keeping up and dtime is slower -- turning down ippmd some to see if it recovers, since the new datanodes w/o the larger r/wsize mounts are being overused (more space so more heavily targeted)

Saturday : 2014.12.20

  • 11:20 MEH: reallocation for summitcopy+registration seemed to help; will make it autoloading and restart them+stsci today w/ the data targeting modifications as well
  • 13:30 MEH: Haydn and Sifan brought new storage nodes ipp090,093,095 online
    • Chris put into nebulous
  • 19:05 MEH: stdlocal overloading the 10G link is pushing stdsci to a large pileup in chip, halfway to the point of full auto-shutdown of stdlocal. overloading doesn't gain anything; found last night that chip poll ~50 kept the link just about full -- will set it to that again to see how it goes. when stacks and warps are available, they will run at larger poll values
    • summitcopy seems to be keeping up w/ observations
    • ipp083,084,086 aren't looking good for batteries again -- relearn cycles continue -- trying ipp086 up for a while; it is un-targeted and the load may be light enough not to overload it
  • 20:50 MEH: looks like registration crashed again on its own ~20:26.. restarting
    [1226461.857447] pantasks_server[13903]: segfault at b09008 ip 0000000000408a4e sp 00000000414e7f20 error 4 in pantasks_server[400000+16000]
    
  • 22:40 MEH: Richard reported not being able to make stamps... looks like the pstamp working directory is full again; don't know how to clear it. restarted and took out the large MOPS.2 and MPE labels for a while, as it looks like some automated cleanup is running, just slowly

Sunday : 2014.12.21

  • 08:55 MEH: some oddly long (>3ks) running nightly chips -- all other processing stopped until cleared -- had to manually clear ~6, a mix of chiptool -addprocessedimfile, warptool -addwarped, difftool -adddiffskyfile -- just a DB connection issue over the overloaded 10G link? all happened from ~0730 (stdlocal and ippmd also had some), so maybe it isn't just the 10G link overload..
    • maybe this case needs a timeout, 6ks is excessive
  • 09:00 MEH: ipp084,086 battery relearn was successful, WB mode back on so set neb-host up. ipp083 failed but is untargeted, so hopefully writes are minimal; have neb-host up for now (may need to go into repair at night to not stall things)
  • 09:10 MEH: some stalled nightly warp+diffims -- cannot build growth curve (psf model is invalid everywhere)
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1292968 -skycell_id skycell.0990.059
    
    difftool -updatediffskyfile -fault 0 -set_quality 42 -diff_id 625044 -skycell_id skycell.0992.055 -dbname gpc1
    
  • 15:20 MEH: have another neb target adjustment for stdsci, will restart nightly pantasks with it before observations start
    • of course, get things restarted and ipp083 relearn+charge finishes and is ready to be normally used as well.. reconfig and restart once again..
  • 20:00 MEH: nightly processing was all faults; had to turn down stdlocal poll (chips) to 50, otherwise it will back up and stdlocal will just auto-shutoff
  • 21:50 MEH: ipp059 unresponsive ~2130.. nothing on console, was under normal load -- powercycle
    • again the Xtool stalling events ~2117, also causing a large backup in summitcopy and registration, probably stdsci as well but haven't had a chance to check yet..