Processing Throughput Tiger Team Notes

Proposals

  • Overtasking gpc1 on ippdb01: load on ippdb01 regularly above 20, starting early August.
    • tested turning off updates over the Friday/weekend of 2011.09.09-2011.09.10; didn't seem to improve rates.
    • set up representative test queries and time delays for different processing loads
    • ccdtemp queries running rampant? Chris looking into whether they are necessary and removing them.
      • The FPA.TEMP value was being drawn from the database instead of the image headers. This seems to have been unnecessary for a long time, and switching back to a header-based value does not appear to have any effect on the images or photometry for a small sample. Removing these calls to the database has cut the load from 15-20 to ~3, largely removing database load from the list of possible bottlenecks.
    • MySQL supports SQL_CACHE and SQL_NO_CACHE hints on SELECT statements. If we are needlessly filling memory with cached queries that will never be repeated, we can flag them SQL_NO_CACHE and see if this helps. Similarly, hinting that load-like statements (select * from blahRun where state = 'new';) should be cached may help as well.
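A minimal sketch of the two hints, using the placeholder table blahRun from the example above (the mysql invocation target is hypothetical; whether the query cache is actually a factor remains to be tested):

```shell
# Sketch only: one-shot polling queries skip the cache, repeated load-like
# queries ask for it explicitly. Table name blahRun is the notes' placeholder.
poll_query="SELECT SQL_NO_CACHE COUNT(*) FROM blahRun WHERE state = 'new'"
load_query="SELECT SQL_CACHE * FROM blahRun WHERE state = 'new'"
# would be issued as e.g.: mysql -h ippdb01 -e "$poll_query"
echo "$poll_query"
echo "$load_query"
```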

  • Polling times/rates: are the polling rates causing extra load and delay with DB queries unnecessarily?
    • Bill testing by changing times in update

  • Each pantasks input file has a different definition of wave1, wave2, compute, etc. These should be unified into a single list that is included by each input file, which will make managing untrustworthy hosts easier. Beyond this, we should re-evaluate the distribution of hosts among the various servers to ensure we are properly allocating processing power. Each unit of load appears to correspond to one inclusion in one pantasks instance, which suggests we are manually underutilizing the upgraded servers.
    • Chris implemented 2011.09.12@1500 - see ~ipp/ippconfig/pantasks_hosts.input; evaluation of improvements/issues continues

  • Stage delays/timings: are there areas that can be sped up?
    • is the pswarp stage a significant delay in processing? Ken asks whether going to lanczos2 would be acceptable.
    • does serially stepping through unused labels in pantasks (as in stdscience) cause hiccups (sharp drop when finished, slow rise as it reloads) in overall load? Mark dropped MD03,04,05,06,07,08, PI, STD.nightlyscience from label and survey.WSdiff,SSdiff,magic,destreak,dist, as they aren't in regular use (9/19)
  • IPP meeting 9/19: Ken asks about ways of tracking server load alongside processing.
    • possibly load ganglia details into czardb to pair processing poll times with ganglia reported states/status. Mark looking into the rrd format conversion/interface to feed to Roy.
    • ganglia can also pull in other measures for monitoring, such as parsed apache logs and mysql stats, as well as more sophisticated event overlays and graph aggregation. Mark is also looking into this further and set up a simple addition to the ippdb01 ganglia page as a learning example using gmetric (from http://codeinthehole.com/archives/8-Monitoring-MySQL-with-Ganglia-and-gmetric.html). There is a gmetric repository of examples for additional ideas.
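A sketch in the spirit of the linked article (the metric name and the choice of Threads_connected are illustrative, not what was actually deployed; the script echoes the command when gmetric/mysql are not installed):

```shell
# Push one MySQL status counter into ganglia via gmetric, e.g. from cron.
# Falls back to a dry-run echo so the sketch is self-contained.
val=$(mysql -N -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'" 2>/dev/null | awk '{print $2}' || true)
cmd=(gmetric --name mysql_threads_connected --value "${val:-0}" --type uint32 --units threads)
if command -v gmetric >/dev/null 2>&1; then
    "${cmd[@]}"                     # real submission when gmetric is installed
else
    echo "would run: ${cmd[*]}"    # dry run otherwise
fi
```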
    • raid throughputs: Gene has noticed the R/W to the raid doesn't seem much faster than to a single disk. Mark looking into.
      • multiple tests (dd, bonnie, iozone) indicate ~200-400MB/s for writes within RAM caching; large files drop to ~2x single SATA disk speed with a single process, ~80MB/s. Multiple processes degrade the rate roughly linearly for large files; smaller files fare better. Some conflicting rates remain to be worked out.
      • need to look more at latencies, since file sizes are << RAM, order 30MB or less.
      • adding iostat reports to ganglia would be useful to track the IO during full processing with OTA/skycells processed locally and to verify tests.
      • more details on the raid config for active machines will help as well
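A minimal write probe along the lines of the dd tests above (paths and sizes are illustrative; bonnie/iozone give the fuller picture):

```shell
# Run with the target directory on the raid under test; /tmp default is only
# so the sketch is self-contained.
dir=${1:-/tmp}
target=$dir/dd_probe.$$
log=$dir/dd_probe.log
# oflag=direct bypasses the page cache so the rate reflects the array, not
# RAM; some filesystems reject O_DIRECT, hence the buffered fallback.
dd if=/dev/zero of="$target" bs=1M count=64 oflag=direct 2>"$log" \
    || dd if=/dev/zero of="$target" bs=1M count=64 2>"$log"
tail -n1 "$log"    # dd's summary line reports the achieved rate
rm -f "$target"
```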
  • ippc18 - odd 5-10min oscillation behavior in cluster load reducing throughput ~40-50% from 2am-6am; the cause was the rsync ippc18->ippc19 mirror. Bill rotated logs so the rsync only takes a fraction of the time. The rsync was also moved to start at 6am, so that if it happens again it isn't buried in the wee morning hours, and was niced with a bandwidth limit of 10MB/s added. Troublesome that an rsync could degrade processing; this may be a symptom of a more fundamental problem/failure.
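The resulting crontab entry would look something like this (paths are hypothetical; rsync's --bwlimit is in KB/s, so 10 MB/s is 10240):

```
# Hypothetical crontab fragment for the mirror: 06:00 start, niced, capped.
0 6 * * * nice -n 19 rsync -a --bwlimit=10240 /export/ippc18/ ippc19:/export/ippc18/
```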

Action items from 2011.08.29

  • Chris
    • tidy / finish disk usage stuff (regular job to be run on all storage nodes)
      • Coded, and cronjobs created. Undiagnosed bug in cronjobs is preventing completion.
    • LAP / magic bug : magic does not queue multiple runs for the same label & exp_id
      • Workaround done, will be fixed in next LAP code.
    • LAP / smf destreak bug : warp-stack runs destreak the same exp_id in multiple cam_ids; these should be merged somehow (either smf level or streak level)
      • Will be fixed in next LAP code by ensuring only one processing of an exposure exists, and delaying warp-stack diffs until they are complete.
    • apache targets via ipp/.cshrc
      • Done.
    • ROC (at low priority)
  • Bill
    • streaksremove / ppMops table memory fix
    • log file rename
    • smf in pstamp / web interface
    • event monitoring
  • Mark
    • most large-count bad diffs had telescope tracking glitch / optical aberration
      • examine diff stats from moments - no clear indicator, finished preparing sample set to look at 9/6
      • other ways to automatically reject these?
    • MD02 starting - still trying to finish up, izy ready as of 9/6

Notes from 2011.08.25

  • Bottlenecks
    • Need to catch files with storage object but no instance.
      • Bill says that this "should" only be an issue with log files. The suggestion there is to move the old logfile to logfile.timestamp
      • That way repeated failures can be diagnosed (editor's note: somewhat, but not in ippMonitor...)
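The suggested rename could be as small as this (helper name and timestamp format are my own):

```shell
# Hypothetical helper: move an old log aside under a timestamped name so each
# failed attempt keeps its own log for diagnosis.
rotate_log() {
    local log=$1
    [ -e "$log" ] && mv "$log" "$log.$(date +%Y%m%dT%H%M%S)"
}
```

Usage would be e.g. `rotate_log /path/to/chip.log` before a retry, leaving the previous attempt's log behind as chip.log.20110825T120000 while the new attempt writes a fresh chip.log.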
    • Long running jobs
      • Need a way to have pantasks stop and fault jobs that are stuck due to
        • nfs hangs
        • algorithm problems (should not be re-run)
      • Talked about the "Event monitoring" hooks that Bill was working on.
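One way to fault hung jobs, sketched with the coreutils timeout command (the wrapper name and deadline are hypothetical; jobs killed for nfs hangs could be retried, while algorithm failures should not be):

```shell
# Hypothetical wrapper: kill and flag a job that exceeds its deadline instead
# of letting it hold a pantasks slot forever (e.g. during an nfs hang).
run_with_deadline() {
    local limit=$1; shift
    timeout "$limit" "$@"
    local rc=$?
    if [ "$rc" -eq 124 ]; then      # coreutils timeout exits 124 on expiry
        echo "FAULT: '$*' exceeded ${limit}s" >&2
    fi
    return $rc
}
```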
    • When a nebulous apache server does not respond, it causes problems throughout the system because the requests don't get serviced
      • Suggestion to investigate a more robust load balancing solution: Apache, hardware solutions, etc.
      • Short term, perhaps target apache requests to a server based on the client node; that way errors would be localized to a subset of machines
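The short-term idea could be as simple as hashing the client hostname into a fixed server list (the list and function are hypothetical), so each client always hits the same server and a failure is confined to that server's subset of clients:

```shell
# Hypothetical deterministic apache server choice per client host.
servers=(ippc01 ippc02 ippc03 ippc04)   # placeholder server list
pick_server() {
    # cksum gives a stable numeric hash of the hostname
    local h
    h=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo "${servers[h % ${#servers[@]}]}"
}
pick_server "$(hostname)"
```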
    • ipp014
      • New RAID card on order. Turned off as processing node and for nebulous output (repair mode)
    • memory usage
      • Mark is investigating ppMops explosion (related to diff source explosion)
      • Bill is investigating higher than expected memory usage by streaksremove
  • Disk space
    • We are still using more space than predicted. Chris has scripts running to evaluate.
  • What about long-term storage on the data store? Do we need to keep the clean camera and diff runs forever?
    • Bill created a web interface to get the smfs for Dave Monet to use from Manoa. Can this be hardened and put on the pstamp server?